[Docs] Update docs for evaluation (#1160)

2024-01-11 22:53:16 +05:30
parent 68ec6615b1
commit 785929c502
2 changed files with 135 additions and 69 deletions
--- a/docs/components/evaluation.mdx
+++ b/docs/components/evaluation.mdx
@@ -4,44 +4,67 @@ title: 🔬 Evaluation

 ## Overview

-We provide out-of-the-box evaluation methods for your datasets. You can use them to evaluate your models and compare them with other models.
+We provide out-of-the-box evaluation metrics for your RAG application. You can use them to evaluate your RAG applications and compare against different settings of your production RAG application.

-Currently, we provide the following evaluation methods:
+Currently, we provide support for following evaluation metrics:

 <CardGroup cols={3}>
    <Card title="Context Relevancy" href="#context_relevancy"></Card>
    <Card title="Answer Relevancy" href="#answer_relevancy"></Card>
    <Card title="Groundedness" href="#groundedness"></Card>
-    <Card title="Custom" href="#custom"></Card>
+    <Card title="Custom Metric" href="#custom_metric"></Card>
 </CardGroup>

-More evaluation metrics are coming soon! 🏗️
+## Quickstart

-## Usage
+Here is a basic example of running evaluation:

-We have found that the best way to evaluate datasets is with the help of OpenAI's `gpt-4` model. Hence, we require you to set `OPENAI_API_KEY` as an environment variable. If you don't want to set it, you can pass it in the config argument of the respective evaluation class, as shown in the examples later below.
+```python example.py
+from embedchain import App

-<Accordion title="We will assume the following dataset for the examples below">
-<CodeGroup>
-```python main.py
+app = App()
+
+# Add data sources
+app.add("https://www.forbes.com/profile/elon-musk")
+
+# Run evaluation
+app.evaluate(["What is the net worth of Elon Musk?", "How many companies Elon Musk owns?"])
+# {'answer_relevancy': 0.9987286412340826, 'groundedness': 1.0, 'context_relevancy': 0.3571428571428571}
+```
+
+Under the hood, Embedchain does the following:
+
+1. Runs semantic search in the vector database and fetches context
+2. LLM call with question, context to fetch the answer
+3. Run evaluation on following metrics: `context relevancy`, `groundedness`, and `answer relevancy` and return result
+
+## Advanced Usage
+
+We use OpenAI's `gpt-4` model as default LLM model for automatic evaluation. Hence, we require you to set `OPENAI_API_KEY` as an environment variable.
+
+### Step-1: Create dataset
+
+In order to evaluate your RAG application, you have to setup a dataset. A data point in the dataset consists of `questions`, `contexts`, `answer`. Here is an example of how to create a dataset for evaluation:
+
+```python
 from embedchain.utils.eval import EvalData

 data = [
    {
        "question": "What is the net worth of Elon Musk?",
        "contexts": [
-            """Elon Musk PROFILEElon MuskCEO, ...""",
-            """a Twitter poll on whether the journalists' ...""",
-            """2016 and run by Jared Birchall.[335]...""",
+            "Elon Musk PROFILEElon MuskCEO, ...",
+            "a Twitter poll on whether the journalists' ...",
+            "2016 and run by Jared Birchall.[335]...",
        ],
        "answer": "As of the information provided, Elon Musk's net worth is $241.6 billion.",
    },
    {
        "question": "which companies does Elon Musk own?",
        "contexts": [
-            """of December 2023[update], ...""",
-            """ThielCofounderView ProfileTeslaHolds ...""",
-            """Elon Musk PROFILEElon MuskCEO, ...""",
+            "of December 2023[update], ...",
+            "ThielCofounderView ProfileTeslaHolds ...",
+            "Elon Musk PROFILEElon MuskCEO, ...",
        ],
        "answer": "Elon Musk owns several companies, including Tesla, SpaceX, Neuralink, and The Boring Company.",
    },
@@ -50,31 +73,66 @@ data = [
 dataset = []

 for d in data:
-    dataset.append(EvalData(question=d["question"], contexts=d["contexts"], answer=d["answer"]))
+    eval_data = EvalData(question=d["question"], contexts=d["contexts"], answer=d["answer"])
+    dataset.append(eval_data)
 ```
-</CodeGroup>
-</Accordion>

-## Context Relevancy <a id="context_relevancy"></a>
+### Step-2: Run evaluation

-Context relevancy is a metric to determine how relevant the context is to the question. We use OpenAI's `gpt-4` model to determine the relevancy of the context.
-We achieve this by prompting the model with the question and the context and asking it to return relevant sentences from the context. We then use the following formula to determine the score:
+Once you have created your dataset, you can run evaluation on the dataset by picking the metric you want to run evaluation on.

-context_relevance_score = (# of relevant sentences in context) $$\div$$ (total # of sentences in context)
+For example, you can run evaluation on context relevancy metric using the following code:
+
+```python
+from embedchain.eval.metrics import ContextRelevance
+metric = ContextRelevance()
+score = metric.evaluate(dataset)
+print(score)
+```
+
+You can choose a different metric or write your own to run evaluation on. You can check the following links:
+
+- [Context Relevancy](#context_relevancy)
+- [Answer relenvancy](#answer_relevancy)
+- [Groundedness](#groundedness)
+- [Build your own metric](#custom_metric)
+
+## Metrics
+
+### Context Relevancy <a id="context_relevancy"></a>
+
+Context relevancy is a metric to determine "how relevant the context is to the question". We use OpenAI's `gpt-4` model to determine the relevancy of the context. We achieve this by prompting the model with the question and the context and asking it to return relevant sentences from the context. We then use the following formula to determine the score:
+
+```
+context_relevance_score = num_relevant_sentences_in_context / num_of_sentences_in_context
+```
+
+#### Examples

 You can run the context relevancy evaluation with the following simple code:

 ```python
 from embedchain.eval.metrics import ContextRelevance
+
 metric = ContextRelevance()
-score = metric.evaluate(dataset)    # dataset from above
+score = metric.evaluate(dataset)  # 'dataset' is definted in the create dataset section
 print(score)
 # 0.27975528364849833
 ```
+In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the `ContextRelevanceConfig` class.

-In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the `ContextRelevanceConfig` class. 
+Here is a more advanced example of how to pass a custom evaluation config for evaluating on context relevance metric:

-### ContextRelevanceConfig
+```python
+from embedchain.config.eval.base import ContextRelevanceConfig
+from embedchain.eval.metrics import ContextRelevance
+
+eval_config = ContextRelevanceConfig(model="gpt-4", api_key="sk-xxx", language="en")
+metric = ContextRelevance(config=eval_config)
+metric.evaluate(dataset)
+```
+
+#### `ContextRelevanceConfig`

 <ParamField path="model" type="str" optional>
    The model to use for the evaluation. Defaults to `gpt-4`. We only support openai's models for now.
@@ -89,33 +147,45 @@ In the above example, we used sensible defaults for the evaluation. However, you
    The prompt to extract the relevant sentences from the context. Defaults to `CONTEXT_RELEVANCY_PROMPT`, which can be found at `embedchain.config.eval.base` path.
 </ParamField>

-```python
-openai_api_key = "sk-xxx"
-metric = ContextRelevance(config=ContextRelevanceConfig(model='gpt-4', api_key=openai_api_key, language="en"))
-print(metric.evaluate(dataset))
+
+### Answer Relevancy <a id="answer_relevancy"></a>
+
+Answer relevancy is a metric to determine how relevant the answer is to the question. We prompt the model with the answer and asking it to generate questions from the answer. We then use the cosine similarity between the generated questions and the original question to determine the score.
+
+```
+answer_relevancy_score = mean(cosine_similarity(generated_questions, original_question))
 ```

-
-## Answer Relevancy <a id="answer_relevancy"></a>
-
-Answer relevancy is a metric to determine how relevant the answer is to the question. We use OpenAI's `gpt-4` model to determine the relevancy of the answer.
-We achieve this by prompting the model with the answer and asking it to generate questions from the answer. We then use the cosine similarity between the generated questions and the original question to determine the score.
-
-answer_relevancy_score = mean(cosine_similarity(generated_questions, original_question))
+#### Examples

 You can run the answer relevancy evaluation with the following simple code:

 ```python
 from embedchain.eval.metrics import AnswerRelevance
+
 metric = AnswerRelevance()
-score = metric.evaluate(dataset)    # dataset from above
+score = metric.evaluate(dataset)
 print(score)
 # 0.9505334177461916
 ```

-In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the `AnswerRelevanceConfig` class.
+In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the `AnswerRelevanceConfig` class. Here is a more advanced example where you can provide your own evaluation config:

-### AnswerRelevanceConfig
+```python
+from embedchain.config.eval.base import AnswerRelevanceConfig
+from embedchain.eval.metrics import AnswerRelevance
+
+eval_config = AnswerRelevanceConfig(
+    model='gpt-4',
+    embedder="text-embedding-ada-002",
+    api_key="sk-xxx",
+    num_gen_questions=2
+)
+metric = AnswerRelevance(config=eval_config)
+score = metric.evaluate(dataset)
+```
+
+#### `AnswerRelevanceConfig`

 <ParamField path="model" type="str" optional>
    The model to use for the evaluation. Defaults to `gpt-4`. We only support openai's models for now.
@@ -133,21 +203,13 @@ In the above example, we used sensible defaults for the evaluation. However, you
    The prompt to extract the `num_gen_questions` number of questions from the provided answer. Defaults to `ANSWER_RELEVANCY_PROMPT`, which can be found at `embedchain.config.eval.base` path.
 </ParamField>

-```python
-openai_api_key = "sk-xxx"
-metric = AnswerRelevance(config=AnswerRelevanceConfig(model='gpt-4',
-                                                      embedder="text-embedding-ada-002",
-                                                      api_key=openai_api_key,
-                                                      num_gen_questions=2))
-print(metric.evaluate(dataset))
-```
-
 ## Groundedness <a id="groundedness"></a>

-Groundedness is a metric to determine how grounded the answer is to the context. We use OpenAI's `gpt-4` model to determine the groundedness of the answer.
-We achieve this by prompting the model with the answer and asking it to generate claims from the answer. We then again prompt the model with the context and the generated claims to determine the verdict on the claims. We then use the following formula to determine the score:
+Groundedness is a metric to determine how grounded the answer is to the context. We use OpenAI's `gpt-4` model to determine the groundedness of the answer. We achieve this by prompting the model with the answer and asking it to generate claims from the answer. We then again prompt the model with the context and the generated claims to determine the verdict on the claims. We then use the following formula to determine the score:

-groundedness_score = (sum of all verdicts) $$\div$$ (total # of claims)
+```
+groundedness_score = (sum of all verdicts) / (total # of claims)
+```

 You can run the groundedness evaluation with the following simple code:

@@ -159,9 +221,19 @@ print(score)
 # 1.0
 ```

-In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the `GroundednessConfig` class.
+In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the `GroundednessConfig` class. Here is a more advanced example where you can configure the evaluation config:

-### GroundednessConfig
+```python
+from embedchain.config.eval.base import GroundednessConfig
+from embedchain.eval.metrics import Groundedness
+
+eval_config = GroundednessConfig(model='gpt-4', api_key="sk-xxx")
+metric = Groundedness(config=eval_config)
+score = metric.evaluate(dataset)
+```
+
+
+#### `GroundednessConfig`

 <ParamField path="model" type="str" optional>
    The model to use for the evaluation. Defaults to `gpt-4`. We only support openai's models for now.
@@ -176,14 +248,7 @@ In the above example, we used sensible defaults for the evaluation. However, you
    The prompt to get verdicts on the claims from the answer from the given context. Defaults to `GROUNDEDNESS_CLAIMS_INFERENCE_PROMPT`, which can be found at `embedchain.config.eval.base` path.
 </ParamField>

-```python
-openai_api_key = "sk-xxx"
-metric = Groundedness(config=GroundednessConfig(model='gpt-4',
-                                                api_key=openai_api_key))
-print(metric.evaluate(dataset))
-```
-
-## Custom <a id="custom"></a>
+## Custom <a id="custom_metric"></a>

 You can also create your own evaluation metric by extending the `BaseMetric` class. You can find the source code for the existing metrics at `embedchain.eval.metrics` path.

@@ -192,14 +257,15 @@ You must provide the `name` of your custom metric in the `__init__` method of yo
 </Note>

 ```python
-from embedchain.eval.metrics import BaseMetric
-from embedchain.utils.eval import EvalData
-from embedchain.config.base_config import BaseConfig
 from typing import Optional

-class CustomMetric(BaseMetric):
+from embedchain.config.base_config import BaseConfig
+from embedchain.eval.metrics import BaseMetric
+from embedchain.utils.eval import EvalData
+
+class MyCustomMetric(BaseMetric):
    def __init__(self, config: Optional[BaseConfig] = None):
-        super().__init__(name="custom_metric")
+        super().__init__(name="my_custom_metric")

    def evaluate(self, dataset: list[EvalData]):
        score = 0.0
--- a/embedchain/eval/metrics/groundedness.py
+++ b/embedchain/eval/metrics/groundedness.py
@@ -15,7 +15,7 @@ from embedchain.utils.eval import EvalData, EvalMetric

 class Groundedness(BaseMetric):
    """
-    Metric for groundedness (aka faithfulness) of answer from the given contexts.
+    Metric for groundedness of answer from the given contexts.
    """

    def __init__(self, config: Optional[GroundednessConfig] = None):
@@ -70,7 +70,7 @@ class Groundedness(BaseMetric):

    def _compute_score(self, data: EvalData) -> float:
        """
-        Compute the groundedness score (aka faithfulness) for a single data point.
+        Compute the groundedness score for a single data point.
        """
        answer_claims_prompt = self._generate_answer_claim_prompt(data)
        claim_statements = self._get_claim_statements(answer_claims_prompt)
@@ -90,7 +90,7 @@ class Groundedness(BaseMetric):
            for future in tqdm(
                concurrent.futures.as_completed(future_to_data),
                total=len(future_to_data),
-                desc="Evaluating groundedness (aka faithfulness)",
+                desc="Evaluating Groundedness",
            ):
                data = future_to_data[future]
                try: