[Feature] Add support for hybrid search for pinecone vector database (#1259)

Deshraj Yadav
2024-02-15 13:20:14 -08:00
committed by GitHub
parent 0766a44ccf
commit 38b4e06963
18 changed files with 470 additions and 326 deletions


@@ -25,4 +25,4 @@ app = App()
url = "https://drive.google.com/drive/u/0/folders/xxx-xxx"
app.add(url, data_type="google_drive")
```
```


@@ -5,34 +5,35 @@ title: Overview
Embedchain comes with built-in support for various data sources. We handle the complexity of loading unstructured data from these data sources, allowing you to easily customize your app through a user-friendly interface.
<CardGroup cols={4}>
<Card title="📰 PDF file" href="/components/data-sources/pdf-file"></Card>
<Card title="📊 CSV file" href="/components/data-sources/csv"></Card>
<Card title="📃 JSON file" href="/components/data-sources/json"></Card>
<Card title="📝 Text" href="/components/data-sources/text"></Card>
<Card title="📁 Directory/ Folder" href="/components/data-sources/directory"></Card>
<Card title="🌐 HTML Web page" href="/components/data-sources/web-page"></Card>
<Card title="📽️ Youtube Channel" href="/components/data-sources/youtube-channel"></Card>
<Card title="📺 Youtube Video" href="/components/data-sources/youtube-video"></Card>
<Card title="📚 Docs website" href="/components/data-sources/docs-site"></Card>
<Card title="📝 MDX file" href="/components/data-sources/mdx"></Card>
<Card title="📄 DOCX file" href="/components/data-sources/docx"></Card>
<Card title="📓 Notion" href="/components/data-sources/notion"></Card>
<Card title="🗺️ Sitemap" href="/components/data-sources/sitemap"></Card>
<Card title="🧾 XML file" href="/components/data-sources/xml"></Card>
<Card title="❓💬 Q&A pair" href="/components/data-sources/qna"></Card>
<Card title="🙌 OpenAPI" href="/components/data-sources/openapi"></Card>
<Card title="📬 Gmail" href="/components/data-sources/gmail"></Card>
<Card title="📝 Github" href="/components/data-sources/github"></Card>
<Card title="🐘 Postgres" href="/components/data-sources/postgres"></Card>
<Card title="🐬 MySQL" href="/components/data-sources/mysql"></Card>
<Card title="🤖 Slack" href="/components/data-sources/slack"></Card>
<Card title="💬 Discord" href="/components/data-sources/discord"></Card>
<Card title="🗨️ Discourse" href="/components/data-sources/discourse"></Card>
<Card title="📝 Substack" href="/components/data-sources/substack"></Card>
<Card title="🐝 Beehiiv" href="/components/data-sources/beehiiv"></Card>
<Card title="💾 Dropbox" href="/components/data-sources/dropbox"></Card>
<Card title="🖼️ Image" href="/components/data-sources/image"></Card>
<Card title="⚙️ Custom" href="/components/data-sources/custom"></Card>
<Card title="PDF file" href="/components/data-sources/pdf-file"></Card>
<Card title="CSV file" href="/components/data-sources/csv"></Card>
<Card title="JSON file" href="/components/data-sources/json"></Card>
<Card title="Text" href="/components/data-sources/text"></Card>
<Card title="Directory" href="/components/data-sources/directory"></Card>
<Card title="Web page" href="/components/data-sources/web-page"></Card>
<Card title="Youtube Channel" href="/components/data-sources/youtube-channel"></Card>
<Card title="Youtube Video" href="/components/data-sources/youtube-video"></Card>
<Card title="Docs website" href="/components/data-sources/docs-site"></Card>
<Card title="MDX file" href="/components/data-sources/mdx"></Card>
<Card title="DOCX file" href="/components/data-sources/docx"></Card>
<Card title="Notion" href="/components/data-sources/notion"></Card>
<Card title="Sitemap" href="/components/data-sources/sitemap"></Card>
<Card title="XML file" href="/components/data-sources/xml"></Card>
<Card title="Q&A pair" href="/components/data-sources/qna"></Card>
<Card title="OpenAPI" href="/components/data-sources/openapi"></Card>
<Card title="Gmail" href="/components/data-sources/gmail"></Card>
<Card title="Google Drive" href="/components/data-sources/google-drive"></Card>
<Card title="GitHub" href="/components/data-sources/github"></Card>
<Card title="Postgres" href="/components/data-sources/postgres"></Card>
<Card title="MySQL" href="/components/data-sources/mysql"></Card>
<Card title="Slack" href="/components/data-sources/slack"></Card>
<Card title="Discord" href="/components/data-sources/discord"></Card>
<Card title="Discourse" href="/components/data-sources/discourse"></Card>
<Card title="Substack" href="/components/data-sources/substack"></Card>
<Card title="Beehiiv" href="/components/data-sources/beehiiv"></Card>
<Card title="Dropbox" href="/components/data-sources/dropbox"></Card>
<Card title="Image" href="/components/data-sources/image"></Card>
<Card title="Custom" href="/components/data-sources/custom"></Card>
</CardGroup>
<br/ >


@@ -17,242 +17,4 @@ Utilizing a vector database alongside Embedchain is a seamless process. All you
<Card title="Weaviate" href="#weaviate"></Card>
</CardGroup>
## ChromaDB
<CodeGroup>
```python main.py
from embedchain import App
# load chroma configuration from yaml file
app = App.from_config(config_path="config1.yaml")
```
```yaml config1.yaml
vectordb:
  provider: chroma
  config:
    collection_name: 'my-collection'
    dir: db
    allow_reset: true
```
```yaml config2.yaml
vectordb:
  provider: chroma
  config:
    collection_name: 'my-collection'
    host: localhost
    port: 5200
    allow_reset: true
```
</CodeGroup>
## Elasticsearch
Install related dependencies using the following command:
```bash
pip install --upgrade 'embedchain[elasticsearch]'
```
<Note>
You can configure the Elasticsearch connection by providing either `es_url` or `cloud_id`. If you are using the Elasticsearch Service on Elastic Cloud, you can find the `cloud_id` on the [Elastic Cloud dashboard](https://cloud.elastic.co/deployments).
</Note>
You can authorize the connection to Elasticsearch by providing either `basic_auth`, `api_key`, or `bearer_auth`.
<CodeGroup>
```python main.py
from embedchain import App
# load elasticsearch configuration from yaml file
app = App.from_config(config_path="config.yaml")
```
```yaml config.yaml
vectordb:
  provider: elasticsearch
  config:
    collection_name: 'es-index'
    cloud_id: 'deployment-name:xxxx'
    basic_auth:
      - elastic
      - <your_password>
    verify_certs: false
```
</CodeGroup>
## OpenSearch
Install related dependencies using the following command:
```bash
pip install --upgrade 'embedchain[opensearch]'
```
<CodeGroup>
```python main.py
from embedchain import App
# load opensearch configuration from yaml file
app = App.from_config(config_path="config.yaml")
```
```yaml config.yaml
vectordb:
  provider: opensearch
  config:
    collection_name: 'my-app'
    opensearch_url: 'https://localhost:9200'
    http_auth:
      - admin
      - admin
    vector_dimension: 1536
    use_ssl: false
    verify_certs: false
```
</CodeGroup>
## Zilliz
Install related dependencies using the following command:
```bash
pip install --upgrade 'embedchain[milvus]'
```
Set the Zilliz environment variables `ZILLIZ_CLOUD_URI` and `ZILLIZ_CLOUD_TOKEN`, which you can find on their [cloud platform](https://cloud.zilliz.com/).
<CodeGroup>
```python main.py
import os
from embedchain import App
os.environ['ZILLIZ_CLOUD_URI'] = 'https://xxx.zillizcloud.com'
os.environ['ZILLIZ_CLOUD_TOKEN'] = 'xxx'
# load zilliz configuration from yaml file
app = App.from_config(config_path="config.yaml")
```
```yaml config.yaml
vectordb:
  provider: zilliz
  config:
    collection_name: 'zilliz_app'
    uri: https://xxxx.api.gcp-region.zillizcloud.com
    token: xxx
    vector_dim: 1536
    metric_type: L2
```
</CodeGroup>
## LanceDB
_Coming soon_
## Pinecone
Install Pinecone-related dependencies using the following command:
```bash
pip install --upgrade 'embedchain[pinecone]'
```
To use Pinecone as a vector database, set the environment variable `PINECONE_API_KEY`, which you can find on the [Pinecone dashboard](https://app.pinecone.io/).
<CodeGroup>
```python main.py
from embedchain import App
# load pinecone configuration from yaml file
app = App.from_config(config_path="pod_config.yaml")
# or
app = App.from_config(config_path="serverless_config.yaml")
```
```yaml pod_config.yaml
vectordb:
  provider: pinecone
  config:
    metric: cosine
    vector_dimension: 1536
    index_name: my-pinecone-index
    pod_config:
      environment: gcp-starter
      metadata_config:
        indexed:
          - "url"
          - "hash"
```
```yaml serverless_config.yaml
vectordb:
  provider: pinecone
  config:
    metric: cosine
    vector_dimension: 1536
    index_name: my-pinecone-index
    serverless_config:
      cloud: aws
      region: us-west-2
```
</CodeGroup>
<br />
<Note>
You can find more information about Pinecone configuration [here](https://docs.pinecone.io/docs/manage-indexes#create-a-pod-based-index).
You can optionally provide `index_name` as a config parameter in the YAML file to specify the index name. If it is not provided, the index name defaults to `{collection_name}-{vector_dimension}`.
</Note>
## Qdrant
To use Qdrant as a vector database, set the environment variables `QDRANT_URL` and `QDRANT_API_KEY`, which you can find on the [Qdrant dashboard](https://cloud.qdrant.io/).
<CodeGroup>
```python main.py
from embedchain import App
# load qdrant configuration from yaml file
app = App.from_config(config_path="config.yaml")
```
```yaml config.yaml
vectordb:
  provider: qdrant
  config:
    collection_name: my_qdrant_index
```
</CodeGroup>
## Weaviate
To use Weaviate as a vector database, set the environment variables `WEAVIATE_ENDPOINT` and `WEAVIATE_API_KEY`, which you can find on the [Weaviate dashboard](https://console.weaviate.cloud/dashboard).
<CodeGroup>
```python main.py
from embedchain import App
# load weaviate configuration from yaml file
app = App.from_config(config_path="config.yaml")
```
```yaml config.yaml
vectordb:
  provider: weaviate
  config:
    collection_name: my_weaviate_index
```
</CodeGroup>
<Snippet file="missing-vector-db-tip.mdx" />


@@ -0,0 +1,35 @@
---
title: ChromaDB
---
<CodeGroup>
```python main.py
from embedchain import App
# load chroma configuration from yaml file
app = App.from_config(config_path="config1.yaml")
```
```yaml config1.yaml
vectordb:
  provider: chroma
  config:
    collection_name: 'my-collection'
    dir: db
    allow_reset: true
```
```yaml config2.yaml
vectordb:
  provider: chroma
  config:
    collection_name: 'my-collection'
    host: localhost
    port: 5200
    allow_reset: true
```
</CodeGroup>
<Snippet file="missing-vector-db-tip.mdx" />


@@ -0,0 +1,39 @@
---
title: Elasticsearch
---
Install related dependencies using the following command:
```bash
pip install --upgrade 'embedchain[elasticsearch]'
```
<Note>
You can configure the Elasticsearch connection by providing either `es_url` or `cloud_id`. If you are using the Elasticsearch Service on Elastic Cloud, you can find the `cloud_id` on the [Elastic Cloud dashboard](https://cloud.elastic.co/deployments).
</Note>
You can authorize the connection to Elasticsearch by providing either `basic_auth`, `api_key`, or `bearer_auth`.
<CodeGroup>
```python main.py
from embedchain import App
# load elasticsearch configuration from yaml file
app = App.from_config(config_path="config.yaml")
```
```yaml config.yaml
vectordb:
  provider: elasticsearch
  config:
    collection_name: 'es-index'
    cloud_id: 'deployment-name:xxxx'
    basic_auth:
      - elastic
      - <your_password>
    verify_certs: false
```
</CodeGroup>
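The connection and authorization options described above can be checked before the app is built. The following is a minimal sketch and not part of Embedchain: `check_es_config` is a hypothetical helper, and it assumes that exactly one of `es_url`/`cloud_id` and exactly one authorization key should be present, which the text above suggests but does not state as a hard rule.

```python
# Hypothetical pre-flight check; Embedchain itself may validate differently.
CONNECTION_KEYS = ("es_url", "cloud_id")
AUTH_KEYS = ("basic_auth", "api_key", "bearer_auth")


def check_es_config(config: dict) -> list:
    """Return a list of human-readable problems found in an Elasticsearch config."""
    problems = []
    # Keys that are present and non-empty in each group.
    connection = [key for key in CONNECTION_KEYS if config.get(key)]
    auth = [key for key in AUTH_KEYS if config.get(key)]
    if len(connection) != 1:
        problems.append("provide exactly one of: " + ", ".join(CONNECTION_KEYS))
    if len(auth) != 1:
        problems.append("provide exactly one of: " + ", ".join(AUTH_KEYS))
    return problems


# Same values as the `config` section of config.yaml above.
example = {
    "collection_name": "es-index",
    "cloud_id": "deployment-name:xxxx",
    "basic_auth": ["elastic", "<your_password>"],
    "verify_certs": False,
}
print(check_es_config(example))  # prints []
```

An empty list means the config passes this particular check; it says nothing about whether the credentials themselves are valid.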
<Snippet file="missing-vector-db-tip.mdx" />


@@ -0,0 +1,36 @@
---
title: OpenSearch
---
Install related dependencies using the following command:
```bash
pip install --upgrade 'embedchain[opensearch]'
```
<CodeGroup>
```python main.py
from embedchain import App
# load opensearch configuration from yaml file
app = App.from_config(config_path="config.yaml")
```
```yaml config.yaml
vectordb:
  provider: opensearch
  config:
    collection_name: 'my-app'
    opensearch_url: 'https://localhost:9200'
    http_auth:
      - admin
      - admin
    vector_dimension: 1536
    use_ssl: false
    verify_certs: false
```
</CodeGroup>
<Snippet file="missing-vector-db-tip.mdx" />


@@ -0,0 +1,106 @@
---
title: Pinecone
---
## Overview
Install Pinecone-related dependencies using the following command:
```bash
pip install --upgrade 'embedchain[pinecone]'
```
To use Pinecone as a vector database, set the environment variable `PINECONE_API_KEY`, which you can find on the [Pinecone dashboard](https://app.pinecone.io/).
<CodeGroup>
```python main.py
from embedchain import App
# Load pinecone configuration from yaml file
app = App.from_config(config_path="pod_config.yaml")
# Or
app = App.from_config(config_path="serverless_config.yaml")
```
```yaml pod_config.yaml
vectordb:
  provider: pinecone
  config:
    metric: cosine
    vector_dimension: 1536
    index_name: my-pinecone-index
    pod_config:
      environment: gcp-starter
      metadata_config:
        indexed:
          - "url"
          - "hash"
```
```yaml serverless_config.yaml
vectordb:
  provider: pinecone
  config:
    metric: cosine
    vector_dimension: 1536
    index_name: my-pinecone-index
    serverless_config:
      cloud: aws
      region: us-west-2
```
</CodeGroup>
<br />
<Note>
You can find more information about Pinecone configuration [here](https://docs.pinecone.io/docs/manage-indexes#create-a-pod-based-index).
You can optionally provide `index_name` as a config parameter in the YAML file to specify the index name. If it is not provided, the index name defaults to `{collection_name}-{vector_dimension}`.
</Note>
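The naming rule in the note can be sketched in a few lines of Python. This is an illustration of the documented fallback only, not Embedchain's own code: `resolve_index_name` is a hypothetical helper, and it assumes the config dict carries `collection_name` and `vector_dimension` whenever `index_name` is absent.

```python
def resolve_index_name(config: dict) -> str:
    """Return the explicit index_name if set, else the documented default."""
    explicit = config.get("index_name")
    if explicit:
        return explicit
    # Documented fallback: {collection_name}-{vector_dimension}
    return f"{config['collection_name']}-{config['vector_dimension']}"


# Explicit name wins.
print(resolve_index_name({"index_name": "my-pinecone-index", "vector_dimension": 1536}))
# prints my-pinecone-index

# No index_name: fall back to the derived name.
print(resolve_index_name({"collection_name": "my-collection", "vector_dimension": 1536}))
# prints my-collection-1536
```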
## Usage
### Hybrid search
Here is an example of how you can do hybrid search using Pinecone as a vector database through Embedchain.
```python
import os
from embedchain import App
config = {
    'app': {
        "config": {
            "id": "ec-docs-hybrid-search"
        }
    },
    'vectordb': {
        'provider': 'pinecone',
        'config': {
            'metric': 'dotproduct',
            'vector_dimension': 1536,
            'index_name': 'my-index',
            'serverless_config': {
                'cloud': 'aws',
                'region': 'us-west-2'
            },
            'hybrid_search': True,  # Remember to set this for hybrid search
        }
    }
}
# Initialize app
app = App.from_config(config=config)
# Add documents
app.add("/path/to/file.pdf", data_type="pdf_file", namespace="my-namespace")
# Query
app.query("<YOUR QUESTION HERE>", namespace="my-namespace")
```
Under the hood, Embedchain fetches the relevant chunks from the documents you added by running a hybrid search on the Pinecone index.
If you have questions about how Pinecone hybrid search works, please refer to their [official documentation](https://docs.pinecone.io/docs/hybrid-search).
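For intuition about what a hybrid query combines, the sketch below weights a dense vector against a sparse vector with a single parameter `alpha`, a convex-combination scheme of the kind described in Pinecone's hybrid search documentation. It is an illustration only and not Embedchain's internal code; the function name `weight_hybrid_query` and the sample vectors are made up for this example.

```python
def weight_hybrid_query(dense, sparse, alpha):
    """Scale a dense/sparse query pair by alpha and (1 - alpha).

    alpha = 1.0 keeps only the dense (semantic) signal,
    alpha = 0.0 keeps only the sparse (keyword) signal.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    weighted_dense = [value * alpha for value in dense]
    weighted_sparse = {
        "indices": list(sparse["indices"]),
        "values": [value * (1.0 - alpha) for value in sparse["values"]],
    }
    return weighted_dense, weighted_sparse


# Made-up sample vectors, chosen so the arithmetic is exact.
dense_query = [0.5, 1.0]
sparse_query = {"indices": [3, 17], "values": [2.0, 4.0]}

dense_w, sparse_w = weight_hybrid_query(dense_query, sparse_query, alpha=0.25)
print(dense_w)             # [0.125, 0.25]
print(sparse_w["values"])  # [1.5, 3.0]
```

With a dot-product metric, as in the config above, scaling the query this way makes the final score a weighted sum of the dense and sparse match scores.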
<Snippet file="missing-vector-db-tip.mdx" />


@@ -0,0 +1,23 @@
---
title: Qdrant
---
To use Qdrant as a vector database, set the environment variables `QDRANT_URL` and `QDRANT_API_KEY`, which you can find on the [Qdrant dashboard](https://cloud.qdrant.io/).
<CodeGroup>
```python main.py
from embedchain import App
# load qdrant configuration from yaml file
app = App.from_config(config_path="config.yaml")
```
```yaml config.yaml
vectordb:
  provider: qdrant
  config:
    collection_name: my_qdrant_index
```
</CodeGroup>
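Because the Qdrant setup depends on environment variables rather than values in the YAML file, it can help to confirm they are set before loading the config. The check below is a small sketch using only the standard library; `missing_env_vars` is a hypothetical helper and not an Embedchain function.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Check the two variables named above before calling App.from_config(...).
missing = missing_env_vars(["QDRANT_URL", "QDRANT_API_KEY"])
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```

The same helper works for the other providers by passing their variable names instead.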
<Snippet file="missing-vector-db-tip.mdx" />


@@ -0,0 +1,24 @@
---
title: Weaviate
---
To use Weaviate as a vector database, set the environment variables `WEAVIATE_ENDPOINT` and `WEAVIATE_API_KEY`, which you can find on the [Weaviate dashboard](https://console.weaviate.cloud/dashboard).
<CodeGroup>
```python main.py
from embedchain import App
# load weaviate configuration from yaml file
app = App.from_config(config_path="config.yaml")
```
```yaml config.yaml
vectordb:
  provider: weaviate
  config:
    collection_name: my_weaviate_index
```
</CodeGroup>
<Snippet file="missing-vector-db-tip.mdx" />


@@ -0,0 +1,39 @@
---
title: Zilliz
---
Install related dependencies using the following command:
```bash
pip install --upgrade 'embedchain[milvus]'
```
Set the Zilliz environment variables `ZILLIZ_CLOUD_URI` and `ZILLIZ_CLOUD_TOKEN`, which you can find on their [cloud platform](https://cloud.zilliz.com/).
<CodeGroup>
```python main.py
import os
from embedchain import App
os.environ['ZILLIZ_CLOUD_URI'] = 'https://xxx.zillizcloud.com'
os.environ['ZILLIZ_CLOUD_TOKEN'] = 'xxx'
# load zilliz configuration from yaml file
app = App.from_config(config_path="config.yaml")
```
```yaml config.yaml
vectordb:
  provider: zilliz
  config:
    collection_name: 'zilliz_app'
    uri: https://xxxx.api.gcp-region.zillizcloud.com
    token: xxx
    vector_dim: 1536
    metric_type: L2
```
</CodeGroup>
<Snippet file="missing-vector-db-tip.mdx" />