[Docs] Revamp documentation (#1010)
16
docs/components/data-sources/beehiiv.mdx
Normal file
@@ -0,0 +1,16 @@
|
||||
---
|
||||
title: "🐝 Beehiiv"
|
||||
---
|
||||
|
||||
To add any Beehiiv data sources to your app, just add the base url as the source and set the data_type to `beehiiv`.
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
|
||||
# source: just add the base url and set the data_type to 'beehiiv'
|
||||
app.add('https://aibreakfast.beehiiv.com', data_type='beehiiv')
|
||||
app.query("How much is OpenAI paying developers?")
|
||||
# Answer: OpenAI is aggressively recruiting Google's top AI researchers with offers ranging between $5 to $10 million annually, primarily in stock options.
|
||||
```
|
||||
19
docs/components/data-sources/csv.mdx
Normal file
@@ -0,0 +1,19 @@
|
||||
---
|
||||
title: '📊 CSV'
|
||||
---
|
||||
|
||||
To add any CSV file, set the data_type to `csv`. `csv` accepts remote URLs and conventional file paths. Column headers are included with each row, so if you have an `age` column, the value `18` will be added as `age: 18`. Eg:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
app.add('https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv', data_type="csv")
|
||||
# Or add using the local file path
|
||||
# app.add('/path/to/file.csv', data_type="csv")
|
||||
|
||||
app.query("Summarize the air travel data")
|
||||
# Answer: The air travel data shows the number of flights for the months of July in the years 1958, 1959, and 1960. In July 1958, there were 491 flights, in July 1959 there were 548 flights, and in July 1960 there were 622 flights.
|
||||
```
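To make the header behavior concrete, here is a minimal standard-library sketch of how a row could be rendered together with its column names. `rows_with_headers` is a hypothetical helper written for illustration, not Embedchain's actual loader, and the exact separator Embedchain uses may differ.

```python
import csv
import io


def rows_with_headers(csv_text: str) -> list[str]:
    """Render each CSV row as 'column: value' pairs joined by commas."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        ", ".join(f"{column}: {value}" for column, value in row.items())
        for row in reader
    ]


sample = "name,age\nAlice,18\nBob,32\n"
for line in rows_with_headers(sample):
    print(line)
```

Each printed line carries its own column names, which is why a value such as `18` stays interpretable after the file has been split into chunks.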
|
||||
|
||||
Note: CSV files above a certain size can cause an error. This limit is set by the LLM. Consider splitting large CSV files into smaller ones.
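One way to split a large CSV file is sketched below. It assumes the first row is a header and repeats it in every part; `split_csv` and the `part_000.csv` naming are hypothetical choices for this example, not part of Embedchain.

```python
import csv
from pathlib import Path


def _write_part(out: Path, index: int, header: list[str], rows: list[list[str]]) -> Path:
    """Write one part file containing the header followed by the given rows."""
    part = out / f"part_{index:03d}.csv"
    with open(part, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    return part


def split_csv(path: str, rows_per_file: int, out_dir: str) -> list[Path]:
    """Split a CSV into smaller files of at most rows_per_file data rows each."""
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:  # empty input file, nothing to split
            return parts
        batch: list[list[str]] = []
        for row in reader:
            batch.append(row)
            if len(batch) == rows_per_file:
                parts.append(_write_part(out, len(parts), header, batch))
                batch = []
        if batch:  # leftover rows that did not fill a whole part
            parts.append(_write_part(out, len(parts), header, batch))
    return parts
```

Each returned path could then be added on its own, for example with `app.add(str(part), data_type="csv")`.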
|
||||
41
docs/components/data-sources/custom.mdx
Normal file
@@ -0,0 +1,41 @@
|
||||
---
|
||||
title: '⚙️ Custom'
|
||||
---
|
||||
|
||||
When we say "custom", we mean that you can customize the loader and chunker to your needs. This is done by passing a custom loader and chunker to the `add` method.
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
from your_loader import YourLoader  # placeholder: your own BaseLoader subclass
from your_chunker import YourChunker  # placeholder: your own BaseChunker subclass

app = App()
loader = YourLoader()
chunker = YourChunker()
|
||||
|
||||
app.add("source", data_type="custom", loader=loader, chunker=chunker)
|
||||
```
|
||||
|
||||
<Note>
|
||||
The custom loader and chunker must be a class that inherits from the [`BaseLoader`](https://github.com/embedchain/embedchain/blob/main/embedchain/loaders/base_loader.py) and [`BaseChunker`](https://github.com/embedchain/embedchain/blob/main/embedchain/chunkers/base_chunker.py) classes respectively.
|
||||
</Note>
|
||||
|
||||
<Note>
|
||||
If the `data_type` is not a valid data type, the `add` method will fall back to the `custom` data type and expect a custom loader and chunker to be passed by the user.
|
||||
</Note>
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
from embedchain.loaders.github import GithubLoader
|
||||
|
||||
app = App()
|
||||
|
||||
loader = GithubLoader(config={"token": "ghp_xxx"})
|
||||
|
||||
app.add("repo:embedchain/embedchain type:repo", data_type="github", loader=loader)
|
||||
|
||||
app.query("What is Embedchain?")
|
||||
# Answer: Embedchain is a Data Platform for Large Language Models (LLMs). It allows users to seamlessly load, index, retrieve, and sync unstructured data in order to build dynamic, LLM-powered applications. There is also a JavaScript implementation called embedchain-js available on GitHub.
|
||||
```
|
||||
64
docs/components/data-sources/data-type-handling.mdx
Normal file
@@ -0,0 +1,64 @@
|
||||
---
|
||||
title: 'Data type handling'
|
||||
---
|
||||
|
||||
## Automatic data type detection
|
||||
|
||||
The add method automatically tries to detect the data_type, based on your input for the source argument. So `app.add('https://www.youtube.com/watch?v=dQw4w9WgXcQ')` is enough to embed a YouTube video.
|
||||
|
||||
This detection is implemented for all formats. It is based on factors such as whether it's a URL, a local file, the source data type, etc.
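The kind of heuristic described above can be pictured with a toy detector. This is a hedged sketch only: `guess_data_type` and its rules are invented for illustration and are not Embedchain's real detection logic, and the returned names are assumed to match Embedchain's `data_type` keywords.

```python
import json
import os
from urllib.parse import urlparse


def guess_data_type(source: str) -> str:
    """Toy heuristic: look at the source string and guess a data type."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        path = parsed.path.lower()
        if "youtube.com" in host or "youtu.be" in host:
            return "youtube_video"
        if path.endswith(".pdf"):
            return "pdf_file"
        if path.endswith("sitemap.xml"):
            return "sitemap"
        if path.endswith(".csv"):
            return "csv"
        return "web_page"
    if os.path.isfile(source):
        extension = os.path.splitext(source)[1].lower()
        known = {".pdf": "pdf_file", ".csv": "csv", ".docx": "docx", ".mdx": "mdx", ".json": "json"}
        return known.get(extension, "text")
    try:
        # Only objects and arrays count as JSON documents here.
        if isinstance(json.loads(source), (dict, list)):
            return "json"
    except ValueError:
        pass
    # Anything else, including a path that does not exist, is treated as raw text.
    return "text"


print(guess_data_type("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
print(guess_data_type("/no/such/file.pdf"))
```

The second call shows the pitfall this page warns about: a mistyped file path silently ends up as plain text.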
|
||||
|
||||
### Debugging automatic detection
|
||||
|
||||
Set `log_level: DEBUG` in the config yaml to check whether the data type was detected correctly. Otherwise, you will not know when, for instance, an invalid file path is interpreted as raw text instead.
|
||||
|
||||
### Forcing a data type
|
||||
|
||||
To avoid any issues with the data type detection, you can **force** a data_type by passing it as an argument to the `add` method.
|
||||
The examples below show you the keyword to force the respective `data_type`.
|
||||
|
||||
Forcing can also be used for edge cases, such as interpreting a sitemap as a web_page, for reading its raw text instead of following links.
|
||||
|
||||
## Remote data types
|
||||
|
||||
<Tip>
|
||||
**Use local files in remote data types**
|
||||
|
||||
Some data_types are meant for remote content and only work with URLs.
|
||||
You can pass local files by formatting the path using the `file:` [URI scheme](https://en.wikipedia.org/wiki/File_URI_scheme), e.g. `file:///info.pdf`.
|
||||
</Tip>
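A local path can be turned into a `file:` URI with the standard library, as in the sketch below. `to_file_uri` is a hypothetical helper name chosen for this example.

```python
from pathlib import Path, PurePosixPath


def to_file_uri(path: str) -> str:
    """Turn a local path into a file: URI (the path is made absolute first)."""
    return Path(path).resolve().as_uri()


# An absolute POSIX path maps directly onto the URI form used in the tip.
print(PurePosixPath("/info.pdf").as_uri())

# A relative path is resolved against the current working directory.
print(to_file_uri("docs/info.pdf"))
```

The resulting string could then be passed as the source, for example `app.add(to_file_uri("info.pdf"), data_type="pdf_file")`.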
|
||||
|
||||
## Reusing a vector database
|
||||
|
||||
The default behavior is to create a persistent vector db in the directory **./db**. You can split your application into two Python scripts: one to create a local vector db and the other to reuse this local persistent vector db. This is useful when you want to index hundreds of documents and separately implement a chat interface.
|
||||
|
||||
Create a local index:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
naval_chat_bot = App()
|
||||
naval_chat_bot.add("https://www.youtube.com/watch?v=3qHkcs3kG44")
|
||||
naval_chat_bot.add("https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")
|
||||
```
|
||||
|
||||
You can reuse the local index with the same code, but without adding new documents:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
naval_chat_bot = App()
|
||||
print(naval_chat_bot.query("What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"))
|
||||
```
|
||||
|
||||
## Resetting an app and vector database
|
||||
|
||||
You can reset the app by simply calling the `reset` method. This will delete the vector database and all other app related files.
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
app.add("https://www.youtube.com/watch?v=3qHkcs3kG44")
|
||||
app.reset()
|
||||
```
|
||||
28
docs/components/data-sources/discord.mdx
Normal file
@@ -0,0 +1,28 @@
|
||||
---
|
||||
title: "💬 Discord"
|
||||
---
|
||||
|
||||
To add any Discord channel messages to your app, just add the `channel_id` as the source and set the `data_type` to `discord`.
|
||||
|
||||
<Note>
|
||||
This loader requires a Discord bot token with read messages access.
|
||||
To obtain the token, follow the instructions provided in this tutorial:
|
||||
<a href="https://www.writebots.com/discord-bot-token/">How to Get a Discord Bot Token?</a>.
|
||||
</Note>
|
||||
|
||||
```python
|
||||
import os
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
# add your discord "BOT" token
|
||||
os.environ["DISCORD_TOKEN"] = "xxx"
|
||||
|
||||
app = App()
|
||||
|
||||
app.add("1177296711023075338", data_type="discord")
|
||||
|
||||
response = app.query("What is Joe saying about Elon Musk?")
|
||||
|
||||
print(response)
|
||||
# Answer: Joe is saying "Elon Musk is a genius".
|
||||
```
|
||||
44
docs/components/data-sources/discourse.mdx
Normal file
@@ -0,0 +1,44 @@
|
||||
---
|
||||
title: '🗨️ Discourse'
|
||||
---
|
||||
|
||||
You can now easily load data from your community built with [Discourse](https://discourse.org/).
|
||||
|
||||
## Example
|
||||
|
||||
1. Set up the Discourse loader with your community URL.
|
||||
```Python
|
||||
from embedchain.loaders.discourse import DiscourseLoader
|
||||
|
||||
dicourse_loader = DiscourseLoader(config={"domain": "https://community.openai.com"})
|
||||
```
|
||||
|
||||
2. Once you have set up the loader, you can create an app and load data using the above Discourse loader.
|
||||
```Python
|
||||
import os
|
||||
from embedchain.pipeline import Pipeline as App
|
||||
|
||||
os.environ["OPENAI_API_KEY"] = "sk-xxx"
|
||||
|
||||
app = App()
|
||||
|
||||
app.add("openai after:2023-10-1", data_type="discourse", loader=dicourse_loader)
|
||||
|
||||
question = "Where can I find the OpenAI API status page?"
|
||||
app.query(question)
|
||||
# Answer: You can find the OpenAI API status page at https://status.openai.com/.
|
||||
```
|
||||
|
||||
NOTE: The `add` function of the app will accept any executable search query to load data. Refer to the [Discourse API Docs](https://docs.discourse.org/#tag/Search) to learn more about search queries.
|
||||
|
||||
3. We automatically create a chunker to chunk your Discourse data. However, if you wish to provide your own chunker class, here is how you can do that:
|
||||
```Python
|
||||
|
||||
from embedchain.chunkers.discourse import DiscourseChunker
|
||||
from embedchain.config.add_config import ChunkerConfig
|
||||
|
||||
discourse_chunker_config = ChunkerConfig(chunk_size=1000, chunk_overlap=0, length_function=len)
|
||||
discourse_chunker = DiscourseChunker(config=discourse_chunker_config)
|
||||
|
||||
app.add("openai", data_type='discourse', loader=dicourse_loader, chunker=discourse_chunker)
|
||||
```
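To give a feel for what `chunk_size` and `chunk_overlap` control, here is a simplified character-based sketch. It assumes `length_function=len`, so sizes are counted in characters. `chunk_text` is a hypothetical function and does not reproduce the splitter Embedchain actually uses.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0))
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

A larger overlap repeats more context between neighboring chunks at the cost of storing more text.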
|
||||
14
docs/components/data-sources/docs-site.mdx
Normal file
@@ -0,0 +1,14 @@
|
||||
---
|
||||
title: '📚 Code documentation'
|
||||
---
|
||||
|
||||
To add any code documentation website as a loader, use the data_type as `docs_site`. Eg:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
app.add("https://docs.embedchain.ai/", data_type="docs_site")
|
||||
app.query("What is Embedchain?")
|
||||
# Answer: Embedchain is a platform that utilizes various components, including paid/proprietary ones, to provide what is believed to be the best configuration available. It uses LLM (Language Model) providers such as OpenAI, Anthropic, Vertex_AI, GPT4ALL, Azure_OpenAI, LLAMA2, JINA, and COHERE. Embedchain allows users to import and utilize these LLM providers for their applications.
|
||||
```
|
||||
18
docs/components/data-sources/docx.mdx
Normal file
@@ -0,0 +1,18 @@
|
||||
---
|
||||
title: '📄 Docx file'
|
||||
---
|
||||
|
||||
### Docx file
|
||||
|
||||
To add any doc/docx file, use the data_type as `docx`. `docx` allows remote urls and conventional file paths. Eg:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
app.add('https://example.com/content/intro.docx', data_type="docx")
|
||||
# Or add file using the local file path on your system
|
||||
# app.add('content/intro.docx', data_type="docx")
|
||||
|
||||
app.query("Summarize the docx data?")
|
||||
```
|
||||
50
docs/components/data-sources/github.mdx
Normal file
@@ -0,0 +1,50 @@
|
||||
---
|
||||
title: 📝 Github
|
||||
---
|
||||
|
||||
1. Set up the GitHub loader by configuring it with your GitHub personal access token (PAT). Check out [this](https://docs.github.com/en/enterprise-server@3.6/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens#creating-a-personal-access-token) link to learn how to create a PAT.
|
||||
```Python
|
||||
from embedchain.loaders.github import GithubLoader
|
||||
|
||||
loader = GithubLoader(
|
||||
config={
|
||||
"token":"ghp_xxxx"
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
2. Once you have set up the loader, you can create an app and load data using the above GitHub loader.
|
||||
```Python
|
||||
import os
|
||||
from embedchain.pipeline import Pipeline as App
|
||||
|
||||
os.environ["OPENAI_API_KEY"] = "sk-xxxx"
|
||||
|
||||
app = App()
|
||||
|
||||
app.add("repo:embedchain/embedchain type:repo", data_type="github", loader=loader)
|
||||
|
||||
response = app.query("What is Embedchain?")
|
||||
# Answer: Embedchain is a Data Platform for Large Language Models (LLMs). It allows users to seamlessly load, index, retrieve, and sync unstructured data in order to build dynamic, LLM-powered applications. There is also a JavaScript implementation called embedchain-js available on GitHub.
|
||||
```
|
||||
The `add` function of the app will accept any valid GitHub query with qualifiers. It only supports loading GitHub code, repositories, issues, and pull requests.
|
||||
<Note>
|
||||
You must provide qualifiers `type:` and `repo:` in the query. The `type:` qualifier can be a combination of `code`, `repo`, `pr`, `issue`. The `repo:` qualifier must be a valid github repository name.
|
||||
</Note>
|
||||
|
||||
<Card title="Valid queries" icon="lightbulb" iconType="duotone" color="#ca8b04">
|
||||
- `repo:embedchain/embedchain type:repo` - to load the repository
|
||||
- `repo:embedchain/embedchain type:issue,pr` - to load the issues and pull-requests of the repository
|
||||
- `repo:embedchain/embedchain type:issue state:closed` - to load the closed issues of the repository
|
||||
</Card>
|
||||
|
||||
3. We automatically create a chunker to chunk your GitHub data. However, if you wish to provide your own chunker class, here is how you can do that:
|
||||
```Python
|
||||
from embedchain.chunkers.common_chunker import CommonChunker
|
||||
from embedchain.config.add_config import ChunkerConfig
|
||||
|
||||
github_chunker_config = ChunkerConfig(chunk_size=2000, chunk_overlap=0, length_function=len)
|
||||
github_chunker = CommonChunker(config=github_chunker_config)
|
||||
|
||||
app.add("repo:embedchain/embedchain type:repo", data_type="github", loader=loader, chunker=github_chunker)
|
||||
```
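Since every query needs both a `repo:` and a `type:` qualifier, a small helper can assemble and validate the string before it is passed to `add`. This is a sketch under that assumption; `build_github_query` is a hypothetical helper, not part of Embedchain.

```python
ALLOWED_TYPES = {"code", "repo", "pr", "issue"}


def build_github_query(repo: str, types: list[str], extra: str = "") -> str:
    """Build a query such as 'repo:owner/name type:issue,pr' with optional extra qualifiers."""
    owner, _, name = repo.partition("/")
    if not owner or not name or "/" in name:
        raise ValueError("repo must look like 'owner/name'")
    unknown = set(types) - ALLOWED_TYPES
    if not types or unknown:
        raise ValueError(f"types must be a non-empty subset of {sorted(ALLOWED_TYPES)}")
    query = f"repo:{repo} type:{','.join(types)}"
    return f"{query} {extra}".strip()


print(build_github_query("embedchain/embedchain", ["issue", "pr"]))
print(build_github_query("embedchain/embedchain", ["issue"], extra="state:closed"))
```

The returned string can be used directly as the source, for example `app.add(query, data_type="github", loader=loader)`.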
|
||||
34
docs/components/data-sources/gmail.mdx
Normal file
@@ -0,0 +1,34 @@
|
||||
---
|
||||
title: '📬 Gmail'
|
||||
---
|
||||
|
||||
To use GmailLoader, you must install the extra dependencies with `pip install --upgrade "embedchain[gmail]"`.
|
||||
|
||||
The `source` must be a valid Gmail search query. Refer to `https://support.google.com/mail/answer/7190?hl=en` to build a query.
|
||||
|
||||
To load Gmail messages, you MUST use the data_type as `gmail`. Otherwise the source will be detected as simple `text`.
|
||||
|
||||
To use this, you need to save `credentials.json` in the directory from which you will run the loader. Follow these steps to get the credentials:
|
||||
|
||||
1. Go to the [Google Cloud Console](https://console.cloud.google.com/apis/credentials).
|
||||
2. Create a project if you don't have one already.
|
||||
3. Create an `OAuth Consent Screen` in the project. You may need to select the `external` option.
|
||||
4. Make sure the consent screen is published.
|
||||
5. Enable the [Gmail API](https://console.cloud.google.com/apis/api/gmail.googleapis.com)
|
||||
6. Create credentials from the `Credentials` tab.
|
||||
7. Select the type `OAuth Client ID`.
|
||||
8. Choose the application type `Web application`. As a name you can choose `embedchain` or any other name as per your use case.
|
||||
9. Add an authorized redirect URI for `http://localhost:8080/`.
|
||||
10. Leave everything else at its default and finish the creation.
|
||||
11. When you are done, a modal opens where you can download the details in `json` format.
|
||||
12. Put the `.json` file in your current directory and rename it to `credentials.json`
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
|
||||
gmail_filter = "to: me label:inbox"
|
||||
app.add(gmail_filter, data_type="gmail")
|
||||
app.query("Summarize my email conversations")
|
||||
```
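Because the loader expects `credentials.json` in the working directory, it can help to check for the file before adding any Gmail source. The sketch below only verifies that the file exists and parses as JSON; `load_credentials` is a hypothetical helper and performs no OAuth validation.

```python
import json
from pathlib import Path


def load_credentials(directory: str = ".") -> dict:
    """Return the parsed credentials.json from the given directory, or raise if it is missing."""
    path = Path(directory) / "credentials.json"
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected {path}; download it from the Google Cloud Console first."
        )
    with path.open() as handle:
        return json.load(handle)


# Example: call load_credentials() once at startup and fail early with a clear message.
```

Failing early this way avoids a less obvious error later, when the loader first tries to authenticate.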
|
||||
44
docs/components/data-sources/json.mdx
Normal file
@@ -0,0 +1,44 @@
|
||||
---
|
||||
title: '📃 JSON'
|
||||
---
|
||||
|
||||
To add any JSON file, set the data_type to `json`. Keys are included with each value, so for example a JSON document like `{"age": 18}` will be added as `age: 18`.
|
||||
|
||||
Here are the supported sources for loading `json`:
|
||||
|
||||
```
|
||||
1. URL - valid url to json file that ends with ".json" extension.
|
||||
2. Local file - valid path to local json file that ends with ".json" extension.
|
||||
3. String - valid json string (e.g. - app.add('{"foo": "bar"}'))
|
||||
```
|
||||
|
||||
<Tip>
|
||||
If you would like to add other data structures (e.g. list, dict etc.), convert it to a valid json first using `json.dumps()` function.
|
||||
</Tip>
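As a short illustration of the tip above, a Python list of dictionaries can be serialized with `json.dumps()` before it is added. The `records` data is made up for this example, and the `app.add` call is left as a comment because it requires a configured Embedchain app.

```python
import json

# Hypothetical in-memory data that is not yet a JSON string.
records = [
    {"name": "Alice", "age": 18},
    {"name": "Bob", "age": 32},
]

# Serialize the Python list into a valid JSON string.
json_string = json.dumps(records)
print(json_string)

# With Embedchain installed and configured, the string could then be added:
# app.add(json_string, data_type="json")
```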
|
||||
|
||||
## Example
|
||||
|
||||
<CodeGroup>
|
||||
|
||||
```python python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
|
||||
# Add json file
|
||||
app.add("temp.json")
|
||||
|
||||
app.query("What is the net worth of Elon Musk as of October 2023?")
|
||||
# As of October 2023, Elon Musk's net worth is $255.2 billion.
|
||||
```
|
||||
|
||||
|
||||
```json temp.json
|
||||
{
|
||||
"question": "What is your net worth, Elon Musk?",
|
||||
"answer": "As of October 2023, Elon Musk's net worth is $255.2 billion, making him one of the wealthiest individuals in the world."
|
||||
}
|
||||
```
|
||||
</CodeGroup>
|
||||
|
||||
|
||||
14
docs/components/data-sources/mdx.mdx
Normal file
@@ -0,0 +1,14 @@
|
||||
---
|
||||
title: '📝 Mdx file'
|
||||
---
|
||||
|
||||
To add any `.mdx` file to your app, set the `data_type` argument of the `.add()` method to `mdx`. Note that this only supports mdx files present on your machine, so the source should be a file path. Eg:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
app.add('path/to/file.mdx', data_type='mdx')
|
||||
|
||||
app.query("What are the docs about?")
|
||||
```
|
||||
47
docs/components/data-sources/mysql.mdx
Normal file
@@ -0,0 +1,47 @@
|
||||
---
|
||||
title: '🐬 MySQL'
|
||||
---
|
||||
|
||||
1. Set up the MySQL loader by configuring the SQL db.
|
||||
```Python
|
||||
from embedchain.loaders.mysql import MySQLLoader
|
||||
|
||||
config = {
|
||||
"host": "host",
|
||||
"port": "port",
|
||||
"database": "database",
|
||||
"user": "username",
|
||||
"password": "password",
|
||||
}
|
||||
|
||||
mysql_loader = MySQLLoader(config=config)
|
||||
```
|
||||
|
||||
For more details on how to set up a valid config, check the MySQL [documentation](https://dev.mysql.com/doc/connector-python/en/connector-python-connectargs.html).
|
||||
|
||||
2. Once you have set up the loader, you can create an app and load data using the above MySQL loader.
|
||||
```Python
|
||||
from embedchain.pipeline import Pipeline as App
|
||||
|
||||
app = App()
|
||||
|
||||
app.add("SELECT * FROM table_name;", data_type='mysql', loader=mysql_loader)
|
||||
# Adds `(1, 'What is your net worth, Elon Musk?', "As of October 2023, Elon Musk's net worth is $255.2 billion.")`
|
||||
|
||||
response = app.query("What is Elon Musk's net worth?")
|
||||
# Answer: As of October 2023, Elon Musk's net worth is $255.2 billion.
|
||||
```
|
||||
|
||||
NOTE: The `add` function of the app will accept any executable query to load data. DO NOT pass `CREATE` or `INSERT` queries to the `add` function.
|
||||
|
||||
3. We automatically create a chunker to chunk your SQL data. However, if you wish to provide your own chunker class, here is how you can do that:
|
||||
```Python
|
||||
|
||||
from embedchain.chunkers.mysql import MySQLChunker
|
||||
from embedchain.config.add_config import ChunkerConfig
|
||||
|
||||
mysql_chunker_config = ChunkerConfig(chunk_size=1000, chunk_overlap=0, length_function=len)
|
||||
mysql_chunker = MySQLChunker(config=mysql_chunker_config)
|
||||
|
||||
app.add("SELECT * FROM table_name;", data_type='mysql', loader=mysql_loader, chunker=mysql_chunker)
|
||||
```
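The note above about not passing `CREATE` or `INSERT` queries can be enforced with a simple guard before calling `add`. This is a deliberately naive sketch: `is_read_only_query` is a hypothetical helper that only inspects the first keyword and is not a SQL parser.

```python
def is_read_only_query(query: str) -> bool:
    """Return True when the statement's first keyword is SELECT."""
    words = query.strip().split(None, 1)
    return bool(words) and words[0].upper() == "SELECT"


for statement in ("SELECT * FROM table_name;", "INSERT INTO table_name VALUES (1);"):
    print(statement, "->", is_read_only_query(statement))
```

A caller could raise an error or skip the statement when the check returns `False`, and pass it to `app.add(..., data_type='mysql', loader=mysql_loader)` otherwise.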
|
||||
20
docs/components/data-sources/notion.mdx
Normal file
@@ -0,0 +1,20 @@
|
||||
---
|
||||
title: '📓 Notion'
|
||||
---
|
||||
|
||||
To use Notion, you must install the extra dependencies with `pip install --upgrade "embedchain[community]"`.
|
||||
|
||||
To load a Notion page, set the data_type to `notion`. Since a Notion source is hard to detect automatically, it is advised to specify the `data_type` when adding a Notion document.
|
||||
The next argument must **end** with the `notion page id`. The id is a 32-character string. Eg:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
|
||||
app.add("cfbc134ca6464fc980d0391613959196", data_type="notion")
|
||||
app.add("my-page-cfbc134ca6464fc980d0391613959196", data_type="notion")
|
||||
app.add("https://www.notion.so/my-page-cfbc134ca6464fc980d0391613959196", data_type="notion")
|
||||
|
||||
app.query("Summarize the notion doc")
|
||||
```
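All three sources in the example end with the same page id. The sketch below extracts it with a regular expression, assuming the 32 characters are hexadecimal digits (as in the example id). `extract_notion_page_id` is a hypothetical helper, not an Embedchain function.

```python
import re

# Assumption: the page id is 32 hexadecimal characters at the very end of the source.
_PAGE_ID = re.compile(r"[0-9a-f]{32}$", re.IGNORECASE)


def extract_notion_page_id(source: str) -> str:
    """Return the trailing 32-character page id, or raise ValueError if there is none."""
    match = _PAGE_ID.search(source.strip())
    if match is None:
        raise ValueError(f"source does not end with a 32-character page id: {source!r}")
    return match.group(0).lower()


for source in (
    "cfbc134ca6464fc980d0391613959196",
    "my-page-cfbc134ca6464fc980d0391613959196",
    "https://www.notion.so/my-page-cfbc134ca6464fc980d0391613959196",
):
    print(extract_notion_page_id(source))
```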
|
||||
22
docs/components/data-sources/openapi.mdx
Normal file
@@ -0,0 +1,22 @@
|
||||
---
|
||||
title: 🙌 OpenAPI
|
||||
---
|
||||
|
||||
To add any OpenAPI spec yaml file (currently a json file will be detected as the JSON data type), set the data_type to `openapi`. `openapi` allows remote urls and conventional file paths.
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
|
||||
app.add("https://github.com/openai/openai-openapi/blob/master/openapi.yaml", data_type="openapi")
|
||||
# Or add using the local file path
|
||||
# app.add("configs/openai_openapi.yaml", data_type="openapi")
|
||||
|
||||
app.query("What can OpenAI API endpoint do? Can you list the things it can learn from?")
|
||||
# Answer: The OpenAI API endpoint allows users to interact with OpenAI's models and perform various tasks such as generating text, answering questions, summarizing documents, translating languages, and more. The specific capabilities and tasks that the API can learn from may vary depending on the models and features provided by OpenAI. For more detailed information, it is recommended to refer to the OpenAI API documentation at https://platform.openai.com/docs/api-reference.
|
||||
```
|
||||
|
||||
<Note>
|
||||
The yaml file added to the App must have the required OpenAPI fields, otherwise adding the OpenAPI spec will fail. Please refer to the [OpenAPI Spec Doc](https://spec.openapis.org/oas/v3.1.0).
|
||||
</Note>
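As a rough pre-check, the fields that the OpenAPI 3.1.0 specification marks as required can be verified on an already parsed spec. Which fields Embedchain itself checks is not stated here, so treat this as an assumption-laden sketch; `missing_openapi_fields` is a hypothetical helper that takes the spec as a plain dictionary.

```python
def missing_openapi_fields(spec: dict) -> list[str]:
    """List required OpenAPI 3.1.0 fields that are absent from a parsed spec."""
    missing: list[str] = []
    if "openapi" not in spec:
        missing.append("openapi")
    info = spec.get("info")
    if not isinstance(info, dict):
        missing.append("info")
    else:
        for key in ("title", "version"):
            if key not in info:
                missing.append(f"info.{key}")
    # The spec requires at least one of these three top-level sections.
    if not any(key in spec for key in ("paths", "components", "webhooks")):
        missing.append("paths | components | webhooks")
    return missing


spec = {"openapi": "3.1.0", "info": {"title": "Demo API"}, "paths": {}}
print(missing_openapi_fields(spec))
```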
|
||||
36
docs/components/data-sources/overview.mdx
Normal file
@@ -0,0 +1,36 @@
|
||||
---
|
||||
title: Overview
|
||||
---
|
||||
|
||||
Embedchain comes with built-in support for various data sources. We handle the complexity of loading unstructured data from these data sources, allowing you to easily customize your app through a user-friendly interface.
|
||||
|
||||
<CardGroup cols={4}>
|
||||
<Card title="📰 PDF file" href="/components/data-sources/pdf-file"></Card>
|
||||
<Card title="📊 CSV file" href="/components/data-sources/csv"></Card>
|
||||
<Card title="📃 JSON file" href="/components/data-sources/json"></Card>
|
||||
<Card title="📺 Youtube" href="/components/data-sources/youtube-video"></Card>
|
||||
<Card title="📝 Text" href="/components/data-sources/text"></Card>
|
||||
<Card title="📚 Documentation website" href="/components/data-sources/docs-site"></Card>
|
||||
<Card title="📄 DOCX file" href="/components/data-sources/docx"></Card>
|
||||
<Card title="📝 MDX file" href="/components/data-sources/mdx"></Card>
|
||||
<Card title="📓 Notion" href="/components/data-sources/notion"></Card>
|
||||
<Card title="❓💬 Q&A pair" href="/components/data-sources/qna"></Card>
|
||||
<Card title="🗺️ Sitemap" href="/components/data-sources/sitemap"></Card>
|
||||
<Card title="🌐 Web page" href="/components/data-sources/web-page"></Card>
|
||||
<Card title="🧾 XML file" href="/components/data-sources/xml"></Card>
|
||||
<Card title="🙌 OpenAPI" href="/components/data-sources/openapi"></Card>
|
||||
<Card title="📬 Gmail" href="/components/data-sources/gmail"></Card>
|
||||
<Card title="🐘 Postgres" href="/components/data-sources/postgres"></Card>
|
||||
<Card title="🐬 MySQL" href="/components/data-sources/mysql"></Card>
|
||||
<Card title="🤖 Slack" href="/components/data-sources/slack"></Card>
|
||||
<Card title="🗨️ Discourse" href="/components/data-sources/discourse"></Card>
|
||||
<Card title="💬 Discord" href="/components/data-sources/discord"></Card>
|
||||
<Card title="📝 Github" href="/components/data-sources/github"></Card>
|
||||
<Card title="⚙️ Custom" href="/components/data-sources/custom"></Card>
|
||||
<Card title="📝 Substack" href="/components/data-sources/substack"></Card>
|
||||
<Card title="🐝 Beehiiv" href="/components/data-sources/beehiiv"></Card>
|
||||
</CardGroup>
|
||||
|
||||
<br />
|
||||
|
||||
<Snippet file="missing-data-source-tip.mdx" />
|
||||
17
docs/components/data-sources/pdf-file.mdx
Normal file
@@ -0,0 +1,17 @@
|
||||
---
|
||||
title: '📰 PDF file'
|
||||
---
|
||||
|
||||
To add any pdf file, use the data_type as `pdf_file`. Eg:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
|
||||
app.add('https://arxiv.org/pdf/1706.03762.pdf', data_type='pdf_file')
|
||||
app.query("What is the paper 'attention is all you need' about?")
|
||||
# Answer: The paper "Attention Is All You Need" proposes a new network architecture called the Transformer, which is based solely on attention mechanisms. It suggests moving away from complex recurrent or convolutional neural networks and instead using attention mechanisms to connect the encoder and decoder in sequence transduction models.
|
||||
```
|
||||
|
||||
Note that we do not support password protected pdfs.
|
||||
64
docs/components/data-sources/postgres.mdx
Normal file
@@ -0,0 +1,64 @@
|
||||
---
|
||||
title: '🐘 Postgres'
|
||||
---
|
||||
|
||||
1. Set up the Postgres loader by configuring the Postgres db.
|
||||
```Python
|
||||
from embedchain.loaders.postgres import PostgresLoader
|
||||
|
||||
config = {
|
||||
"host": "host_address",
|
||||
"port": "port_number",
|
||||
"dbname": "database_name",
|
||||
"user": "username",
|
||||
"password": "password",
|
||||
}
|
||||
|
||||
"""
|
||||
config = {
|
||||
"url": "your_postgres_url"
|
||||
}
|
||||
"""
|
||||
|
||||
postgres_loader = PostgresLoader(config=config)
|
||||
|
||||
```
|
||||
|
||||
You can set up the loader either by passing the PostgreSQL url or by providing the config data.
|
||||
For more details on how to set up a valid url and config, check the Postgres [documentation](https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING:~:text=34.1.1.%C2%A0Connection%20Strings-,%23,-Several%20libpq%20functions).
|
||||
|
||||
NOTE: if you provide the `url` field in config, all other fields will be ignored.
|
||||
|
||||
2. Once you have set up the loader, you can create an app and load data using the above Postgres loader.
|
||||
```Python
|
||||
import os
|
||||
from embedchain.pipeline import Pipeline as App
|
||||
|
||||
os.environ["OPENAI_API_KEY"] = "sk-xxx"
|
||||
|
||||
app = App()
|
||||
|
||||
question = "What is Elon Musk's net worth?"
|
||||
response = app.query(question)
|
||||
# Answer: As of September 2021, Elon Musk's net worth is estimated to be around $250 billion, making him one of the wealthiest individuals in the world. However, please note that net worth can fluctuate over time due to various factors such as stock market changes and business ventures.
|
||||
|
||||
app.add("SELECT * FROM table_name;", data_type='postgres', loader=postgres_loader)
|
||||
# Adds `(1, 'What is your net worth, Elon Musk?', "As of October 2023, Elon Musk's net worth is $255.2 billion.")`
|
||||
|
||||
response = app.query(question)
|
||||
# Answer: As of October 2023, Elon Musk's net worth is $255.2 billion.
|
||||
```
|
||||
|
||||
NOTE: The `add` function of the app will accept any executable query to load data. DO NOT pass `CREATE` or `INSERT` queries to the `add` function, as they do not add any data.
|
||||
|
||||
3. We automatically create a chunker to chunk your Postgres data. However, if you wish to provide your own chunker class, here is how you can do that:
|
||||
```Python
|
||||
|
||||
from embedchain.chunkers.postgres import PostgresChunker
|
||||
from embedchain.config.add_config import ChunkerConfig
|
||||
|
||||
postgres_chunker_config = ChunkerConfig(chunk_size=1000, chunk_overlap=0, length_function=len)
|
||||
postgres_chunker = PostgresChunker(config=postgres_chunker_config)
|
||||
|
||||
app.add("SELECT * FROM table_name;", data_type='postgres', loader=postgres_loader, chunker=postgres_chunker)
|
||||
```
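The two configuration styles from step 1 are related: the individual fields can be combined into a single connection URL in the standard `postgresql://` URI form. The sketch below assumes the same keys as the config dictionary in step 1; `postgres_url` is a hypothetical helper, not part of Embedchain.

```python
from urllib.parse import quote


def postgres_url(config: dict) -> str:
    """Build a postgresql:// connection URL from host, port, dbname, user and password."""
    # Percent-encode credentials so characters like '@' or '/' do not break the URL.
    user = quote(str(config["user"]), safe="")
    password = quote(str(config["password"]), safe="")
    return f"postgresql://{user}:{password}@{config['host']}:{config['port']}/{config['dbname']}"


config = {
    "host": "localhost",
    "port": 5432,
    "dbname": "shop",
    "user": "alice",
    "password": "p@ss/word",
}
print(postgres_url(config))
```

The result could be passed as `{"url": postgres_url(config)}`, keeping in mind that all other fields are ignored once `url` is present.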
|
||||
13
docs/components/data-sources/qna.mdx
Normal file
@@ -0,0 +1,13 @@
|
||||
---
|
||||
title: '❓💬 Question and answer pair'
|
||||
---
|
||||
|
||||
QnA pair is a local data type. To supply your own QnA pair, use the data_type as `qna_pair` and enter a tuple. Eg:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
|
||||
app.add(("Question", "Answer"), data_type="qna_pair")
|
||||
```
|
||||
13
docs/components/data-sources/sitemap.mdx
Normal file
@@ -0,0 +1,13 @@
|
||||
---
|
||||
title: '🗺️ Sitemap'
|
||||
---
|
||||
|
||||
Add all web pages from an XML sitemap. Non-text files are filtered out. Set the data_type to `sitemap`. Eg:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
|
||||
app.add('https://example.com/sitemap.xml', data_type='sitemap')
|
||||
```
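To illustrate what reading a sitemap and filtering non-text files can look like, here is a standard-library sketch. The extension list and the `page_urls` helper are assumptions made for this example and are not Embedchain's actual filtering rules.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
# Hypothetical list of extensions treated as non-text content.
NON_TEXT_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".mp4", ".zip")


def page_urls(sitemap_xml: str) -> list[str]:
    """Return the <loc> URLs of a sitemap, skipping links to non-text files."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if not url.lower().endswith(NON_TEXT_EXTENSIONS)]


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/logo.png</loc></url>"
    "</urlset>"
)
print(page_urls(sample))
```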
|
||||
54
docs/components/data-sources/slack.mdx
Normal file
@@ -0,0 +1,54 @@
|
||||
---
|
||||
title: '🤖 Slack'
|
||||
---
|
||||
|
||||
## Pre-requisite
|
||||
- Download required packages by running `pip install --upgrade "embedchain[slack]"`.
|
||||
- Configure your Slack user token as the environment variable `SLACK_USER_TOKEN`.
|
||||
- Find your user token on your [Slack Account](https://api.slack.com/authentication/token-types)
|
||||
- Make sure your slack user token includes [search](https://api.slack.com/scopes/search:read) scope.
|
||||
|
||||
## Example
|
||||
1. Set up the Slack loader by configuring the Slack WebClient.
|
||||
```Python
|
||||
import os
from embedchain.loaders.slack import SlackLoader
|
||||
|
||||
os.environ["SLACK_USER_TOKEN"] = "xoxp-*"
|
||||
|
||||
loader = SlackLoader()
|
||||
|
||||
"""
|
||||
config = {
|
||||
'base_url': slack_app_url,
|
||||
'headers': web_headers,
|
||||
'team_id': slack_team_id,
|
||||
}
|
||||
|
||||
loader = SlackLoader(config)
|
||||
"""
|
||||
```
|
||||
|
||||
NOTE: you can also pass the `config` with `base_url`, `headers`, `team_id` to setup your SlackLoader.
|
||||
|
||||
2. Once you have set up the loader, you can create an app and load data using the above Slack loader.
|
||||
```Python
|
||||
import os
|
||||
from embedchain.pipeline import Pipeline as App
|
||||
|
||||
app = App()
|
||||
|
||||
app.add("in:random", data_type="slack", loader=loader)
|
||||
question = "Which bots are available in the slack workspace's random channel?"
|
||||
app.query(question)
|
||||
# Answer: The available bot in the slack workspace's random channel is the Embedchain bot.
|
||||
```
|
||||
|
||||
3. We automatically create a chunker to chunk your Slack data. If you wish to provide your own chunker class instead, here is how you can do that:
|
||||
```Python
|
||||
from embedchain.chunkers.slack import SlackChunker
|
||||
from embedchain.config.add_config import ChunkerConfig
|
||||
|
||||
slack_chunker_config = ChunkerConfig(chunk_size=1000, chunk_overlap=0, length_function=len)
|
||||
slack_chunker = SlackChunker(config=slack_chunker_config)
|
||||
|
||||
app.add("in:random", data_type="slack", loader=loader, chunker=slack_chunker)
|
||||
```
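
As a rough intuition for what `chunk_size` and `chunk_overlap` control in the config above, the sketch below splits a string into fixed-size character windows. This is a simplified illustration, not the algorithm `SlackChunker` actually uses (real chunkers typically split on separators first); `chunk_text` is a hypothetical helper written for this example.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the final window is not a pure repeat of the overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


message = "a" * 2500
chunks = chunk_text(message, chunk_size=1000, chunk_overlap=0)
print([len(chunk) for chunk in chunks])

overlapping = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(overlapping)
```

With `chunk_overlap=0` the chunks are disjoint and concatenate back to the original text; a non-zero overlap repeats the tail of each chunk at the start of the next, which can help preserve context across chunk boundaries.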
|
||||
16
docs/components/data-sources/substack.mdx
Normal file
@@ -0,0 +1,16 @@
|
||||
---
|
||||
title: "📝 Substack"
|
||||
---
|
||||
|
||||
To add a Substack data source to your app, pass the publication's root URL as the source and set the data_type to `substack`.
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
|
||||
# source: for any substack just add the root URL
|
||||
app.add('https://www.lennysnewsletter.com', data_type='substack')
|
||||
app.query("Who is Brian Chesky?")
|
||||
# Answer: Brian Chesky is the co-founder and CEO of Airbnb.
|
||||
```
|
||||
17
docs/components/data-sources/text.mdx
Normal file
@@ -0,0 +1,17 @@
|
||||
---
|
||||
title: '📝 Text'
|
||||
---
|
||||
|
||||
### Text
|
||||
|
||||
Text is a local data type. To supply your own text, set the data_type to `text` and pass a string. The text is not processed, which makes this data type very versatile. For example:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
|
||||
app.add('Seek wealth, not money or status. Wealth is having assets that earn while you sleep. Money is how we transfer time and wealth. Status is your place in the social hierarchy.', data_type='text')
|
||||
```
|
||||
|
||||
Note: The `text` data type is not used in the other examples because in most cases you would supply a whole paragraph or file, which would not fit there.
|
||||
13
docs/components/data-sources/web-page.mdx
Normal file
@@ -0,0 +1,13 @@
|
||||
---
|
||||
title: '🌐 Web page'
|
||||
---
|
||||
|
||||
To add any web page, set the data_type to `web_page`. For example:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
|
||||
app.add('a_valid_web_page_url', data_type='web_page')
|
||||
```
|
||||
17
docs/components/data-sources/xml.mdx
Normal file
@@ -0,0 +1,17 @@
|
||||
---
|
||||
title: '🧾 XML file'
|
||||
---
|
||||
|
||||
### XML file
|
||||
|
||||
To add any XML file, set the data_type to `xml`. For example:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
|
||||
app.add('content/data.xml', data_type='xml')
|
||||
```
|
||||
|
||||
Note: Only the text content of the xml file will be added to the app. The tags will be ignored.
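
To illustrate what "only the text content" means, the following standard-library sketch pulls the text out of a small XML document while discarding tags and attributes. It is not embedchain's XML loader; the sample document and the `xml_text_content` helper are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document used only for illustration.
XML_DOC = """<catalog>
  <book id="1"><title>Dune</title><author>Frank Herbert</author></book>
  <book id="2"><title>Neuromancer</title><author>William Gibson</author></book>
</catalog>"""


def xml_text_content(xml_string: str) -> str:
    """Join all text nodes of an XML document, ignoring tags and attributes."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text node in document order, including whitespace.
    pieces = (piece.strip() for piece in root.itertext())
    return " ".join(piece for piece in pieces if piece)


text = xml_text_content(XML_DOC)
print(text)
```

The tag names (`book`, `title`, `author`) and the `id` attributes do not appear in `text`; only the values between the tags do.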
|
||||
13
docs/components/data-sources/youtube-video.mdx
Normal file
@@ -0,0 +1,13 @@
|
||||
---
|
||||
title: '📺 YouTube'
|
||||
---
|
||||
|
||||
|
||||
To add any YouTube video to your app, pass the video URL as the first argument to the `.add()` method and set the data_type to `youtube_video`. For example:
|
||||
|
||||
```python
|
||||
from embedchain import Pipeline as App
|
||||
|
||||
app = App()
|
||||
app.add('a_valid_youtube_url_here', data_type='youtube_video')
|
||||
```
|
||||
0
docs/components/retrieval-methods.mdx
Normal file