[Feature] Add support for hybrid search for pinecone vector database (#1259)

2024-02-15 13:20:14 -08:00
parent 0766a44ccf
commit 38b4e06963
18 changed files with 470 additions and 326 deletions
--- a/docs/components/data-sources/google-drive.mdx
+++ b/docs/components/data-sources/google-drive.mdx
@@ -25,4 +25,4 @@ app = App()

 url = "https://drive.google.com/drive/u/0/folders/xxx-xxx"
 app.add(url, data_type="google_drive")
-```
+```
--- a/docs/components/data-sources/overview.mdx
+++ b/docs/components/data-sources/overview.mdx
@@ -5,34 +5,35 @@ title: Overview
 Embedchain comes with built-in support for various data sources. We handle the complexity of loading unstructured data from these data sources, allowing you to easily customize your app through a user-friendly interface.

 <CardGroup cols={4}>
-  <Card title="📰 PDF file" href="/components/data-sources/pdf-file"></Card>
-  <Card title="📊 CSV file" href="/components/data-sources/csv"></Card>
-  <Card title="📃 JSON file" href="/components/data-sources/json"></Card>
-  <Card title="📝 Text" href="/components/data-sources/text"></Card>
-  <Card title="📁 Directory/ Folder" href="/components/data-sources/directory"></Card>
-  <Card title="🌐 HTML Web page" href="/components/data-sources/web-page"></Card>
-  <Card title="📽️ Youtube Channel" href="/components/data-sources/youtube-channel"></Card>
-  <Card title="📺 Youtube Video" href="/components/data-sources/youtube-video"></Card>
-  <Card title="📚 Docs website" href="/components/data-sources/docs-site"></Card>
-  <Card title="📝 MDX file" href="/components/data-sources/mdx"></Card>
-  <Card title="📄 DOCX file" href="/components/data-sources/docx"></Card>
-  <Card title="📓 Notion" href="/components/data-sources/notion"></Card>
-  <Card title="🗺️ Sitemap" href="/components/data-sources/sitemap"></Card>
-  <Card title="🧾 XML file" href="/components/data-sources/xml"></Card>
-  <Card title="❓💬 Q&A pair" href="/components/data-sources/qna"></Card>
-  <Card title="🙌 OpenAPI" href="/components/data-sources/openapi"></Card>
-  <Card title="📬 Gmail" href="/components/data-sources/gmail"></Card>
-  <Card title="📝 Github" href="/components/data-sources/github"></Card>
-  <Card title="🐘 Postgres" href="/components/data-sources/postgres"></Card>
-  <Card title="🐬 MySQL" href="/components/data-sources/mysql"></Card>
-  <Card title="🤖 Slack" href="/components/data-sources/slack"></Card>
-  <Card title="💬 Discord" href="/components/data-sources/discord"></Card>
-  <Card title="🗨️ Discourse" href="/components/data-sources/discourse"></Card>
-  <Card title="📝 Substack" href="/components/data-sources/substack"></Card>
-  <Card title="🐝 Beehiiv" href="/components/data-sources/beehiiv"></Card>
-  <Card title="💾 Dropbox" href="/components/data-sources/dropbox"></Card>
-  <Card title="🖼️ Image" href="/components/data-sources/image"></Card>
-  <Card title="⚙️ Custom" href="/components/data-sources/custom"></Card>
+  <Card title="PDF file" href="/components/data-sources/pdf-file"></Card>
+  <Card title="CSV file" href="/components/data-sources/csv"></Card>
+  <Card title="JSON file" href="/components/data-sources/json"></Card>
+  <Card title="Text" href="/components/data-sources/text"></Card>
+  <Card title="Directory" href="/components/data-sources/directory"></Card>
+  <Card title="Web page" href="/components/data-sources/web-page"></Card>
+  <Card title="Youtube Channel" href="/components/data-sources/youtube-channel"></Card>
+  <Card title="Youtube Video" href="/components/data-sources/youtube-video"></Card>
+  <Card title="Docs website" href="/components/data-sources/docs-site"></Card>
+  <Card title="MDX file" href="/components/data-sources/mdx"></Card>
+  <Card title="DOCX file" href="/components/data-sources/docx"></Card>
+  <Card title="Notion" href="/components/data-sources/notion"></Card>
+  <Card title="Sitemap" href="/components/data-sources/sitemap"></Card>
+  <Card title="XML file" href="/components/data-sources/xml"></Card>
+  <Card title="Q&A pair" href="/components/data-sources/qna"></Card>
+  <Card title="OpenAPI" href="/components/data-sources/openapi"></Card>
+  <Card title="Gmail" href="/components/data-sources/gmail"></Card>
+  <Card title="Google Drive" href="/components/data-sources/google-drive"></Card>
+  <Card title="GitHub" href="/components/data-sources/github"></Card>
+  <Card title="Postgres" href="/components/data-sources/postgres"></Card>
+  <Card title="MySQL" href="/components/data-sources/mysql"></Card>
+  <Card title="Slack" href="/components/data-sources/slack"></Card>
+  <Card title="Discord" href="/components/data-sources/discord"></Card>
+  <Card title="Discourse" href="/components/data-sources/discourse"></Card>
+  <Card title="Substack" href="/components/data-sources/substack"></Card>
+  <Card title="Beehiiv" href="/components/data-sources/beehiiv"></Card>
+  <Card title="Dropbox" href="/components/data-sources/dropbox"></Card>
+  <Card title="Image" href="/components/data-sources/image"></Card>
+  <Card title="Custom" href="/components/data-sources/custom"></Card>
 </CardGroup>

 <br/ >
--- a/docs/components/vector-databases.mdx
+++ b/docs/components/vector-databases.mdx
@@ -17,242 +17,4 @@ Utilizing a vector database alongside Embedchain is a seamless process. All you
  <Card title="Weaviate" href="#weaviate"></Card>
 </CardGroup>

-## ChromaDB
-
-<CodeGroup>
-
-```python main.py
-from embedchain import App
-
-# load chroma configuration from yaml file
-app = App.from_config(config_path="config1.yaml")
-```
-
-```yaml config1.yaml
-vectordb:
-  provider: chroma
-  config:
-    collection_name: 'my-collection'
-    dir: db
-    allow_reset: true
-```
-
-```yaml config2.yaml
-vectordb:
-  provider: chroma
-  config:
-    collection_name: 'my-collection'
-    host: localhost
-    port: 5200
-    allow_reset: true
-```
-
-</CodeGroup>
-
-
-## Elasticsearch
-
-Install related dependencies using the following command:
-
-```bash
-pip install --upgrade 'embedchain[elasticsearch]'
-```
-
-<Note>
-You can configure the Elasticsearch connection by providing either `es_url` or `cloud_id`. If you are using the Elasticsearch Service on Elastic Cloud, you can find the `cloud_id` on the [Elastic Cloud dashboard](https://cloud.elastic.co/deployments).
-</Note>
-
-You can authorize the connection to Elasticsearch by providing either `basic_auth`, `api_key`, or `bearer_auth`.
-
-<CodeGroup>
-
-```python main.py
-from embedchain import App
-
-# load elasticsearch configuration from yaml file
-app = App.from_config(config_path="config.yaml")
-```
-
-```yaml config.yaml
-vectordb:
-  provider: elasticsearch
-  config:
-    collection_name: 'es-index'
-    cloud_id: 'deployment-name:xxxx'
-    basic_auth:
-      - elastic
-      - <your_password>
-    verify_certs: false
-```
-</CodeGroup>
-
-## OpenSearch
-
-Install related dependencies using the following command:
-
-```bash
-pip install --upgrade 'embedchain[opensearch]'
-```
-
-<CodeGroup>
-
-```python main.py
-from embedchain import App
-
-# load opensearch configuration from yaml file
-app = App.from_config(config_path="config.yaml")
-```
-
-```yaml config.yaml
-vectordb:
-  provider: opensearch
-  config:
-    collection_name: 'my-app'
-    opensearch_url: 'https://localhost:9200'
-    http_auth:
-      - admin
-      - admin
-    vector_dimension: 1536
-    use_ssl: false
-    verify_certs: false
-```
-
-</CodeGroup>
-
-## Zilliz
-
-Install related dependencies using the following command:
-
-```bash
-pip install --upgrade 'embedchain[milvus]'
-```
-
-Set the Zilliz environment variables `ZILLIZ_CLOUD_URI` and `ZILLIZ_CLOUD_TOKEN` which you can find it on their [cloud platform](https://cloud.zilliz.com/).
-
-<CodeGroup>
-
-```python main.py
-import os
-from embedchain import App
-
-os.environ['ZILLIZ_CLOUD_URI'] = 'https://xxx.zillizcloud.com'
-os.environ['ZILLIZ_CLOUD_TOKEN'] = 'xxx'
-
-# load zilliz configuration from yaml file
-app = App.from_config(config_path="config.yaml")
-```
-
-```yaml config.yaml
-vectordb:
-  provider: zilliz
-  config:
-    collection_name: 'zilliz_app'
-    uri: https://xxxx.api.gcp-region.zillizcloud.com
-    token: xxx
-    vector_dim: 1536
-    metric_type: L2
-```
-
-</CodeGroup>
-
-## LanceDB
-
-_Coming soon_
-
-## Pinecone
-
-Install pinecone related dependencies using the following command:
-
-```bash
-pip install --upgrade 'embedchain[pinecone]'
-```
-
-In order to use Pinecone as vector database, set the environment variable `PINECONE_API_KEY` which you can find on [Pinecone dashboard](https://app.pinecone.io/).
-
-<CodeGroup>
-
-```python main.py
-from embedchain import App
-
-# load pinecone configuration from yaml file
-app = App.from_config(config_path="pod_config.yaml")
-# or
-app = App.from_config(config_path="serverless_config.yaml")
-```
-
-```yaml pod_config.yaml
-vectordb:
-  provider: pinecone
-  config:
-    metric: cosine
-    vector_dimension: 1536
-    index_name: my-pinecone-index
-    pod_config:
-      environment: gcp-starter
-      metadata_config:
-        indexed:
-          - "url"
-          - "hash"
-```
-
-```yaml serverless_config.yaml
-vectordb:
-  provider: pinecone
-  config:
-    metric: cosine
-    vector_dimension: 1536
-    index_name: my-pinecone-index
-    serverless_config:
-      cloud: aws
-      region: us-west-2
-```
-
-</CodeGroup>
-
-<br />
-<Note>
-You can find more information about Pinecone configuration [here](https://docs.pinecone.io/docs/manage-indexes#create-a-pod-based-index).
-You can also optionally provide `index_name` as a config param in yaml file to specify the index name. If not provided, the index name will be `{collection_name}-{vector_dimension}`.
-</Note>
-
-## Qdrant
-
-In order to use Qdrant as a vector database, set the environment variables `QDRANT_URL` and `QDRANT_API_KEY` which you can find on [Qdrant Dashboard](https://cloud.qdrant.io/).
-
-<CodeGroup>
-```python main.py
-from embedchain import App
-
-# load qdrant configuration from yaml file
-app = App.from_config(config_path="config.yaml")
-```
-
-```yaml config.yaml
-vectordb:
-  provider: qdrant
-  config:
-    collection_name: my_qdrant_index
-```
-</CodeGroup>
-
-## Weaviate
-
-In order to use Weaviate as a vector database, set the environment variables `WEAVIATE_ENDPOINT` and `WEAVIATE_API_KEY` which you can find on [Weaviate dashboard](https://console.weaviate.cloud/dashboard).
-
-<CodeGroup>
-```python main.py
-from embedchain import App
-
-# load weaviate configuration from yaml file
-app = App.from_config(config_path="config.yaml")
-```
-
-```yaml config.yaml
-vectordb:
-  provider: weaviate
-  config:
-    collection_name: my_weaviate_index
-```
-</CodeGroup>
-
 <Snippet file="missing-vector-db-tip.mdx" />
--- a/docs/components/vector-databases/chromadb.mdx
+++ b/docs/components/vector-databases/chromadb.mdx
@@ -0,0 +1,35 @@
+---
+title: ChromaDB
+---
+
+<CodeGroup>
+
+```python main.py
+from embedchain import App
+
+# load chroma configuration from yaml file
+app = App.from_config(config_path="config1.yaml")
+```
+
+```yaml config1.yaml
+vectordb:
+  provider: chroma
+  config:
+    collection_name: 'my-collection'
+    dir: db
+    allow_reset: true
+```
+
+```yaml config2.yaml
+vectordb:
+  provider: chroma
+  config:
+    collection_name: 'my-collection'
+    host: localhost
+    port: 5200
+    allow_reset: true
+```
+
+</CodeGroup>
+
+<Snippet file="missing-vector-db-tip.mdx" />
--- a/docs/components/vector-databases/elasticsearch.mdx
+++ b/docs/components/vector-databases/elasticsearch.mdx
@@ -0,0 +1,39 @@
+---
+title: Elasticsearch
+---
+
+Install related dependencies using the following command:
+
+```bash
+pip install --upgrade 'embedchain[elasticsearch]'
+```
+
+<Note>
+You can configure the Elasticsearch connection by providing either `es_url` or `cloud_id`. If you are using the Elasticsearch Service on Elastic Cloud, you can find the `cloud_id` on the [Elastic Cloud dashboard](https://cloud.elastic.co/deployments).
+</Note>
+
+You can authorize the connection to Elasticsearch by providing either `basic_auth`, `api_key`, or `bearer_auth`.
+
+<CodeGroup>
+
+```python main.py
+from embedchain import App
+
+# load elasticsearch configuration from yaml file
+app = App.from_config(config_path="config.yaml")
+```
+
+```yaml config.yaml
+vectordb:
+  provider: elasticsearch
+  config:
+    collection_name: 'es-index'
+    cloud_id: 'deployment-name:xxxx'
+    basic_auth:
+      - elastic
+      - <your_password>
+    verify_certs: false
+```
+</CodeGroup>
+
+<Snippet file="missing-vector-db-tip.mdx" />
--- a/docs/components/vector-databases/opensearch.mdx
+++ b/docs/components/vector-databases/opensearch.mdx
@@ -0,0 +1,36 @@
+---
+title: OpenSearch
+---
+
+Install related dependencies using the following command:
+
+```bash
+pip install --upgrade 'embedchain[opensearch]'
+```
+
+<CodeGroup>
+
+```python main.py
+from embedchain import App
+
+# load opensearch configuration from yaml file
+app = App.from_config(config_path="config.yaml")
+```
+
+```yaml config.yaml
+vectordb:
+  provider: opensearch
+  config:
+    collection_name: 'my-app'
+    opensearch_url: 'https://localhost:9200'
+    http_auth:
+      - admin
+      - admin
+    vector_dimension: 1536
+    use_ssl: false
+    verify_certs: false
+```
+
+</CodeGroup>
+
+<Snippet file="missing-vector-db-tip.mdx" />
--- a/docs/components/vector-databases/pinecone.mdx
+++ b/docs/components/vector-databases/pinecone.mdx
@@ -0,0 +1,106 @@
+---
+title: Pinecone
+---
+
+## Overview
+
+Install the Pinecone-related dependencies using the following command:
+
+```bash
+pip install --upgrade 'embedchain[pinecone]'
+```
+
+To use Pinecone as a vector database, set the environment variable `PINECONE_API_KEY`, which you can find on the [Pinecone dashboard](https://app.pinecone.io/).
+
+<CodeGroup>
+
+```python main.py
+from embedchain import App
+
+# Load pinecone configuration from yaml file
+app = App.from_config(config_path="pod_config.yaml")
+# Or
+app = App.from_config(config_path="serverless_config.yaml")
+```
+
+```yaml pod_config.yaml
+vectordb:
+  provider: pinecone
+  config:
+    metric: cosine
+    vector_dimension: 1536
+    index_name: my-pinecone-index
+    pod_config:
+      environment: gcp-starter
+      metadata_config:
+        indexed:
+          - "url"
+          - "hash"
+```
+
+```yaml serverless_config.yaml
+vectordb:
+  provider: pinecone
+  config:
+    metric: cosine
+    vector_dimension: 1536
+    index_name: my-pinecone-index
+    serverless_config:
+      cloud: aws
+      region: us-west-2
+```
+
+</CodeGroup>
+
+<br />
+<Note>
+You can find more information about Pinecone configuration [here](https://docs.pinecone.io/docs/manage-indexes#create-a-pod-based-index).
+You can optionally set `index_name` in the YAML config to specify the index name. If it is not provided, the index name defaults to `{collection_name}-{vector_dimension}`.
+</Note>
+
+## Usage
+
+### Hybrid search
+
+The following example runs hybrid search through Embedchain with Pinecone as the vector database.
+
+```python
+import os
+
+from embedchain import App
+
+config = {
+    'app': {
+        "config": {
+            "id": "ec-docs-hybrid-search"
+        }
+    },
+    'vectordb': {
+        'provider': 'pinecone',
+        'config': {
+            'metric': 'dotproduct',
+            'vector_dimension': 1536,
+            'index_name': 'my-index',
+            'serverless_config': {
+                'cloud': 'aws',
+                'region': 'us-west-2'
+            },
+            'hybrid_search': True,  # Remember to set this for hybrid search
+        }
+    }
+}
+
+# Initialize app
+app = App.from_config(config=config)
+
+# Add documents
+app.add("/path/to/file.pdf", data_type="pdf_file", namespace="my-namespace")
+
+# Query
+app.query("<YOUR QUESTION HERE>", namespace="my-namespace")
+```
+
+Under the hood, Embedchain fetches the relevant chunks from the documents you added by running a hybrid search on the Pinecone index.
+If you have questions about how Pinecone hybrid search works, please refer to their [official documentation here](https://docs.pinecone.io/docs/hybrid-search).
+
+<Snippet file="missing-vector-db-tip.mdx" />
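The hybrid search section of `pinecone.mdx` above says that Embedchain retrieves chunks by running a hybrid search on the Pinecone index, and its example config sets `metric` to `dotproduct`. As a rough illustration of the general idea only (this is not Embedchain's or Pinecone's actual implementation), hybrid search is commonly described as a dot-product score over a dense embedding plus a sparse keyword vector, with a weight balancing the two. The helper names `weight_query` and `hybrid_score`, the `alpha` parameter, and the toy vectors below are all assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: sparse vectors map a token index to a weight.
SparseVector = Dict[int, float]


def weight_query(
    dense: List[float], sparse: SparseVector, alpha: float
) -> Tuple[List[float], SparseVector]:
    """Scale the query so that alpha=1 is purely dense and alpha=0 purely sparse."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    weighted_dense = [alpha * value for value in dense]
    weighted_sparse = {idx: (1.0 - alpha) * w for idx, w in sparse.items()}
    return weighted_dense, weighted_sparse


def hybrid_score(
    query_dense: List[float],
    query_sparse: SparseVector,
    doc_dense: List[float],
    doc_sparse: SparseVector,
    alpha: float,
) -> float:
    """Dot product over the dense part plus dot product over the sparse part."""
    q_dense, q_sparse = weight_query(query_dense, query_sparse, alpha)
    dense_part = sum(q * d for q, d in zip(q_dense, doc_dense))
    # Only token indices present in both sparse vectors contribute.
    sparse_part = sum(w * doc_sparse.get(idx, 0.0) for idx, w in q_sparse.items())
    return dense_part + sparse_part


if __name__ == "__main__":
    # Toy data (assumed): doc_a matches the query keyword, doc_b matches the embedding.
    query_dense, query_sparse = [1.0, 0.0], {7: 2.0}
    doc_a = ([0.5, 0.5], {7: 1.0})
    doc_b = ([1.0, 0.0], {})
    for alpha in (0.5, 1.0):
        score_a = hybrid_score(query_dense, query_sparse, *doc_a, alpha)
        score_b = hybrid_score(query_dense, query_sparse, *doc_b, alpha)
        print(f"alpha={alpha}: doc_a={score_a}, doc_b={score_b}")
```

With these toy vectors, the keyword-matching document ranks first at `alpha=0.5`, while the embedding-matching document ranks first at `alpha=1.0`. This is the kind of trade-off a hybrid index lets you tune.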
--- a/docs/components/vector-databases/qdrant.mdx
+++ b/docs/components/vector-databases/qdrant.mdx
@@ -0,0 +1,23 @@
+---
+title: Qdrant
+---
+
+To use Qdrant as a vector database, set the environment variables `QDRANT_URL` and `QDRANT_API_KEY`, which you can find on the [Qdrant dashboard](https://cloud.qdrant.io/).
+
+<CodeGroup>
+```python main.py
+from embedchain import App
+
+# load qdrant configuration from yaml file
+app = App.from_config(config_path="config.yaml")
+```
+
+```yaml config.yaml
+vectordb:
+  provider: qdrant
+  config:
+    collection_name: my_qdrant_index
+```
+</CodeGroup>
+
+<Snippet file="missing-vector-db-tip.mdx" />
--- a/docs/components/vector-databases/weaviate.mdx
+++ b/docs/components/vector-databases/weaviate.mdx
@@ -0,0 +1,24 @@
+---
+title: Weaviate
+---
+
+
+To use Weaviate as a vector database, set the environment variables `WEAVIATE_ENDPOINT` and `WEAVIATE_API_KEY`, which you can find on the [Weaviate dashboard](https://console.weaviate.cloud/dashboard).
+
+<CodeGroup>
+```python main.py
+from embedchain import App
+
+# load weaviate configuration from yaml file
+app = App.from_config(config_path="config.yaml")
+```
+
+```yaml config.yaml
+vectordb:
+  provider: weaviate
+  config:
+    collection_name: my_weaviate_index
+```
+</CodeGroup>
+
+<Snippet file="missing-vector-db-tip.mdx" />
--- a/docs/components/vector-databases/zilliz.mdx
+++ b/docs/components/vector-databases/zilliz.mdx
@@ -0,0 +1,39 @@
+---
+title: Zilliz
+---
+
+Install related dependencies using the following command:
+
+```bash
+pip install --upgrade 'embedchain[milvus]'
+```
+
+Set the Zilliz environment variables `ZILLIZ_CLOUD_URI` and `ZILLIZ_CLOUD_TOKEN`, which you can find on their [cloud platform](https://cloud.zilliz.com/).
+
+<CodeGroup>
+
+```python main.py
+import os
+from embedchain import App
+
+os.environ['ZILLIZ_CLOUD_URI'] = 'https://xxx.zillizcloud.com'
+os.environ['ZILLIZ_CLOUD_TOKEN'] = 'xxx'
+
+# load zilliz configuration from yaml file
+app = App.from_config(config_path="config.yaml")
+```
+
+```yaml config.yaml
+vectordb:
+  provider: zilliz
+  config:
+    collection_name: 'zilliz_app'
+    uri: https://xxxx.api.gcp-region.zillizcloud.com
+    token: xxx
+    vector_dim: 1536
+    metric_type: L2
+```
+
+</CodeGroup>
+
+<Snippet file="missing-vector-db-tip.mdx" />