Docs Update (#2591)

This commit is contained in:
Prateek Chhikara
2025-04-29 08:15:25 -07:00
committed by GitHub
parent 6d13e83001
commit 393a4fd5a6
111 changed files with 2296 additions and 99 deletions

31
evaluation/Makefile Normal file
View File

@@ -0,0 +1,31 @@
# Run the experiments
run-mem0-add:
python run_experiments.py --technique_type mem0 --method add
run-mem0-search:
python run_experiments.py --technique_type mem0 --method search --output_folder results/ --top_k 30
run-mem0-plus-add:
python run_experiments.py --technique_type mem0 --method add --is_graph
run-mem0-plus-search:
python run_experiments.py --technique_type mem0 --method search --is_graph --output_folder results/ --top_k 30
run-rag:
python run_experiments.py --technique_type rag --chunk_size 500 --num_chunks 1 --output_folder results/
run-full-context:
python run_experiments.py --technique_type rag --chunk_size -1 --num_chunks 1 --output_folder results/
run-langmem:
python run_experiments.py --technique_type langmem --output_folder results/
run-zep-add:
python run_experiments.py --technique_type zep --method add --output_folder results/
run-zep-search:
python run_experiments.py --technique_type zep --method search --output_folder results/
run-openai:
python run_experiments.py --technique_type openai --output_folder results/

View File

@@ -0,0 +1,192 @@
# Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
[![arXiv](https://img.shields.io/badge/arXiv-Paper-b31b1b.svg)](https://arxiv.org/abs/XXXX.XXXXX)
[![Website](https://img.shields.io/badge/Website-Project-blue)](https://mem0.ai/research)
This repository contains the code and dataset for our paper: **Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory**.
## 📋 Overview
This project evaluates Mem0 and compares it with different memory and retrieval techniques for AI systems:
1. **Established LOCOMO Benchmarks**: We evaluate against five established approaches from the literature: LoCoMo, ReadAgent, MemoryBank, MemGPT, and A-Mem.
2. **Open-Source Memory Solutions**: We test promising open-source memory architectures including LangMem, which provides flexible memory management capabilities.
3. **RAG Systems**: We implement Retrieval-Augmented Generation with various configurations, testing different chunk sizes and retrieval counts to optimize performance.
4. **Full-Context Processing**: We examine the effectiveness of passing the entire conversation history within the context window of the LLM as a baseline approach.
5. **Proprietary Memory Systems**: We evaluate OpenAI's built-in memory feature available in their ChatGPT interface to compare against commercial solutions.
6. **Third-Party Memory Providers**: We incorporate Zep, a specialized memory management platform designed for AI agents, to assess the performance of dedicated memory infrastructure.
We test these techniques on the LOCOMO dataset, which contains conversational data with various question types to evaluate memory recall and understanding.
## 🔍 Dataset
The dataset is located in the `dataset/` directory:
- `locomo10.json`: Original dataset
- `locomo10_rag.json`: Dataset formatted for RAG experiments
## 📁 Project Structure
```
.
├── src/                   # Source code for different memory techniques
│   ├── memzero/           # Implementation of the Mem0 technique
│   ├── openai/            # Implementation of the OpenAI memory
│   ├── zep/               # Implementation of the Zep memory
│   ├── rag.py             # Implementation of the RAG technique
│   └── langmem.py         # Implementation of the LangMem technique
├── metrics/               # Code for evaluation metrics
├── results/               # Results of experiments
├── dataset/               # Dataset files
├── evals.py               # Evaluation script
├── run_experiments.py     # Script to run experiments
├── generate_scores.py     # Script to generate scores from results
└── prompts.py             # Prompts used for the models
```
## 🚀 Getting Started
### Prerequisites
Create a `.env` file with your API keys and configurations. The following keys are required:
```
# OpenAI API key for GPT models and embeddings
OPENAI_API_KEY="your-openai-api-key"
# Mem0 API keys (for Mem0 and Mem0+ techniques)
MEM0_API_KEY="your-mem0-api-key"
MEM0_PROJECT_ID="your-mem0-project-id"
MEM0_ORGANIZATION_ID="your-mem0-organization-id"
# Model configuration
MODEL="gpt-4o-mini" # or your preferred model
EMBEDDING_MODEL="text-embedding-3-small" # or your preferred embedding model
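# Zep API key (for the Zep technique)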
ZEP_API_KEY="api-key-from-zep"
```
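The scripts load these settings with `python-dotenv` and read them via `os.getenv`. A minimal sketch of how they are consumed (simplified from the evaluation code):
```python
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads the .env file from the working directory
client = OpenAI()  # picks up OPENAI_API_KEY from the environment
model = os.getenv("MODEL")  # e.g. "gpt-4o-mini"
embedding_model = os.getenv("EMBEDDING_MODEL")
```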
### Running Experiments
You can run experiments using the provided Makefile commands:
#### Memory Techniques
```bash
# Run Mem0 experiments
make run-mem0-add # Add memories using Mem0
make run-mem0-search # Search memories using Mem0
# Run Mem0+ experiments (with graph-based search)
make run-mem0-plus-add # Add memories using Mem0+
make run-mem0-plus-search # Search memories using Mem0+
# Run RAG experiments
make run-rag # Run RAG with chunk size 500
make run-full-context # Run RAG with full context
# Run LangMem experiments
make run-langmem # Run LangMem
# Run Zep experiments
make run-zep-add # Add memories using Zep
make run-zep-search # Search memories using Zep
# Run OpenAI experiments
make run-openai # Run OpenAI experiments
```
Alternatively, you can run experiments directly with custom parameters:
```bash
python run_experiments.py --technique_type [mem0|rag|langmem|zep|openai] [additional parameters]
```
#### Command-line Parameters:
| Parameter | Description | Default |
|-----------|-------------|---------|
| `--technique_type` | Memory technique to use (mem0, rag, langmem, zep, openai) | mem0 |
| `--method` | Method to use (add, search) | add |
| `--chunk_size` | Chunk size for processing | 1000 |
| `--top_k` | Number of top memories to retrieve | 30 |
| `--filter_memories` | Whether to filter memories | False |
| `--is_graph` | Whether to use graph-based search | False |
| `--num_chunks` | Number of chunks to process for RAG | 1 |
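For example (illustrative commands; flag names as defined in `run_experiments.py`):
```bash
# RAG with 500-token chunks, retrieving the top 2 chunks per question
python run_experiments.py --technique_type rag --chunk_size 500 --num_chunks 2 --output_folder results/

# Mem0 search with graph memories enabled
python run_experiments.py --technique_type mem0 --method search --is_graph --top_k 30 --output_folder results/
```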
### 📊 Evaluation
To evaluate results, run:
```bash
python evals.py --input_file [path_to_results] --output_file [output_path]
```
This script:
1. Processes each question-answer pair
2. Calculates BLEU and F1 scores automatically
3. Uses an LLM judge to evaluate answer correctness
4. Saves the combined results to the output file
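For example, to score the output of a default Mem0 search run (the input file name below matches the default produced by `run_experiments.py`; substitute the path to your own results file):
```bash
python evals.py --input_file results/mem0_results_top_30_filter_False_graph_False.json --output_file evaluation_metrics.json
```
Keeping the default output name `evaluation_metrics.json` is convenient because `generate_scores.py` reads that file from the current directory.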
### 📈 Generating Scores
Generate final scores with:
```bash
python generate_scores.py
```
This script:
1. Loads the evaluation metrics data
2. Calculates mean scores for each category (BLEU, F1, LLM)
3. Reports the number of questions per category
4. Calculates overall mean scores across all categories
Example output:
```
Mean Scores Per Category:
bleu_score f1_score llm_score count
category
1 0.xxxx 0.xxxx 0.xxxx xx
2 0.xxxx 0.xxxx 0.xxxx xx
3 0.xxxx 0.xxxx 0.xxxx xx
Overall Mean Scores:
bleu_score 0.xxxx
f1_score 0.xxxx
llm_score 0.xxxx
```
## 📏 Evaluation Metrics
We use several metrics to evaluate the performance of different memory techniques:
1. **BLEU Score**: Measures the similarity between the model's response and the ground truth
2. **F1 Score**: Measures the token-level harmonic mean of precision and recall between the response and the ground truth (see the sketch after this list)
3. **LLM Score**: A binary score (0 or 1) determined by an LLM judge evaluating the correctness of responses
4. **Token Consumption**: Number of tokens required to generate the final answer.
5. **Latency**: Time taken to search memories and to generate the response.
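For reference, here is a minimal sketch of the token-level F1 computed in `metrics/utils.py` (the helper name `token_f1` is illustrative; the full implementation also strips punctuation and handles empty inputs):
```python
def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and the ground truth (simplified sketch)."""
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    common = pred_tokens & ref_tokens
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```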
## 📚 Citation
If you use this code or dataset in your research, please cite our paper:
```bibtex
@article{mem0,
title={Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory},
author={---},
journal={arXiv preprint},
year={2025}
}
```
## 📄 License
[MIT License](LICENSE)
## 👥 Contributors
- [Prateek Chhikara](https://github.com/prateekchhikara)
- [Dev Khant](https://github.com/Dev-Khant)
- [Saket Aryan](https://github.com/whysosaket)
- [Taranjeet Singh](https://github.com/taranjeet)
- [Deshraj Yadav](https://github.com/deshraj)

81
evaluation/evals.py Normal file
View File

@@ -0,0 +1,81 @@
import json
import argparse
from metrics.utils import calculate_metrics, calculate_bleu_scores
from metrics.llm_judge import evaluate_llm_judge
from collections import defaultdict
from tqdm import tqdm
import concurrent.futures
import threading
def process_item(item_data):
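"""Score one conversation's question-answer pairs with BLEU-1, token F1, and the LLM judge (category 5 is skipped)."""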
k, v = item_data
local_results = defaultdict(list)
for item in v:
gt_answer = str(item['answer'])
pred_answer = str(item['response'])
category = str(item['category'])
question = str(item['question'])
# Skip category 5
if category == '5':
continue
metrics = calculate_metrics(pred_answer, gt_answer)
bleu_scores = calculate_bleu_scores(pred_answer, gt_answer)
llm_score = evaluate_llm_judge(question, gt_answer, pred_answer)
local_results[k].append({
"question": question,
"answer": gt_answer,
"response": pred_answer,
"category": category,
"bleu_score": bleu_scores["bleu1"],
"f1_score": metrics["f1"],
"llm_score": llm_score
})
return local_results
def main():
parser = argparse.ArgumentParser(description='Evaluate RAG results')
parser.add_argument('--input_file', type=str,
default="results/rag_results_500_k1.json",
help='Path to the input dataset file')
parser.add_argument('--output_file', type=str,
default="evaluation_metrics.json",
help='Path to save the evaluation results')
parser.add_argument('--max_workers', type=int, default=10,
help='Maximum number of worker threads')
args = parser.parse_args()
with open(args.input_file, 'r') as f:
data = json.load(f)
results = defaultdict(list)
results_lock = threading.Lock()
# Use ThreadPoolExecutor with specified workers
with concurrent.futures.ThreadPoolExecutor(max_workers=args.max_workers) as executor:
futures = [executor.submit(process_item, item_data)
for item_data in data.items()]
for future in tqdm(concurrent.futures.as_completed(futures),
total=len(futures)):
local_results = future.result()
with results_lock:
for k, items in local_results.items():
results[k].extend(items)
# Save results to JSON file
with open(args.output_file, 'w') as f:
json.dump(results, f, indent=4)
print(f"Results saved to {args.output_file}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,41 @@
import pandas as pd
import json
# Load the evaluation metrics data
with open('evaluation_metrics.json', 'r') as f:
data = json.load(f)
# Flatten the data into a list of question items
all_items = []
for key in data:
all_items.extend(data[key])
# Convert to DataFrame
df = pd.DataFrame(all_items)
# Convert category to numeric type
df['category'] = pd.to_numeric(df['category'])
# Calculate mean scores by category
result = df.groupby('category').agg({
'bleu_score': 'mean',
'f1_score': 'mean',
'llm_score': 'mean'
}).round(4)
# Add count of questions per category
result['count'] = df.groupby('category').size()
# Print the results
print("Mean Scores Per Category:")
print(result)
# Calculate overall means
overall_means = df.agg({
'bleu_score': 'mean',
'f1_score': 'mean',
'llm_score': 'mean'
}).round(4)
print("\nOverall Mean Scores:")
print(overall_means)

View File

@@ -0,0 +1,127 @@
from openai import OpenAI
import json
from collections import defaultdict
import numpy as np
import argparse
client = OpenAI()
ACCURACY_PROMPT = """
Your task is to label an answer to a question as CORRECT or WRONG. You will be given the following data:
(1) a question (posed by one user to another user),
(2) a gold (ground truth) answer,
(3) a generated answer
which you will score as CORRECT/WRONG.
The point of the question is to ask about something one user should know about the other user based on their prior conversations.
The gold answer will usually be a concise and short answer that includes the referenced topic, for example:
Question: Do you remember what I got the last time I went to Hawaii?
Gold answer: A shell necklace
The generated answer might be much longer, but you should be generous with your grading - as long as it touches on the same topic as the gold answer, it should be counted as CORRECT.
For time related questions, the gold answer will be a specific date, month, year, etc. The generated answer might be much longer or use relative time references (like "last Tuesday" or "next month"), but you should be generous with your grading - as long as it refers to the same date or time period as the gold answer, it should be counted as CORRECT. Even if the format differs (e.g., "May 7th" vs "7 May"), consider it CORRECT if it's the same date.
Now it's time for the real question:
Question: {question}
Gold answer: {gold_answer}
Generated answer: {generated_answer}
First, provide a short (one sentence) explanation of your reasoning, then finish with CORRECT or WRONG.
Do NOT include both CORRECT and WRONG in your response, or it will break the evaluation script.
Just return the label CORRECT or WRONG in a json format with the key as "label".
"""
def evaluate_llm_judge(question, gold_answer, generated_answer):
"""Evaluate the generated answer against the gold answer using an LLM judge."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": ACCURACY_PROMPT.format(
question=question,
gold_answer=gold_answer,
generated_answer=generated_answer
)
}],
response_format={"type": "json_object"},
temperature=0.0
)
label = json.loads(response.choices[0].message.content)['label']
return 1 if label == "CORRECT" else 0
def main():
"""Main function to evaluate RAG results using LLM judge."""
parser = argparse.ArgumentParser(
description='Evaluate RAG results using LLM judge'
)
parser.add_argument(
'--input_file',
type=str,
default="results/default_run_v4_k30_new_graph.json",
help='Path to the input dataset file'
)
args = parser.parse_args()
dataset_path = args.input_file
output_path = f"results/llm_judge_{dataset_path.split('/')[-1]}"
with open(dataset_path, "r") as f:
data = json.load(f)
LLM_JUDGE = defaultdict(list)
RESULTS = defaultdict(list)
index = 0
for k, v in data.items():
for x in v:
question = x['question']
gold_answer = x['answer']
generated_answer = x['response']
category = x['category']
# Skip category 5
if int(category) == 5:
continue
# Evaluate the answer
label = evaluate_llm_judge(question, gold_answer, generated_answer)
LLM_JUDGE[category].append(label)
# Store the results
RESULTS[index].append({
"question": question,
"gt_answer": gold_answer,
"response": generated_answer,
"category": category,
"llm_label": label
})
# Save intermediate results
with open(output_path, "w") as f:
json.dump(RESULTS, f, indent=4)
# Print current accuracy for all categories
print("All categories accuracy:")
for cat, results in LLM_JUDGE.items():
if results: # Only print if there are results for this category
print(f" Category {cat}: {np.mean(results):.4f} "
f"({sum(results)}/{len(results)})")
print("------------------------------------------")
index += 1
# Save final results
with open(output_path, "w") as f:
json.dump(RESULTS, f, indent=4)
# Print final summary
print("PATH: ", dataset_path)
print("------------------------------------------")
for k, v in LLM_JUDGE.items():
print(k, np.mean(v))
if __name__ == "__main__":
main()

224
evaluation/metrics/utils.py Normal file
View File

@@ -0,0 +1,224 @@
"""
Borrowed from https://github.com/WujiangXu/AgenticMemory/blob/main/utils.py
@article{xu2025mem,
title={A-mem: Agentic memory for llm agents},
author={Xu, Wujiang and Liang, Zujie and Mei, Kai and Gao, Hang and Tan, Juntao
and Zhang, Yongfeng},
journal={arXiv preprint arXiv:2502.12110},
year={2025}
}
"""
import re
import string
import numpy as np
from typing import List, Dict, Union
import statistics
from collections import defaultdict
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score
import nltk
from nltk.translate.meteor_score import meteor_score
from sentence_transformers import SentenceTransformer
import logging
from dataclasses import dataclass
from pathlib import Path
from openai import OpenAI
# from load_dataset import load_locomo_dataset, QA, Turn, Session, Conversation
from sentence_transformers.util import pytorch_cos_sim
# Download required NLTK data
try:
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
except Exception as e:
print(f"Error downloading NLTK data: {e}")
# Initialize SentenceTransformer model (this will be reused)
try:
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
except Exception as e:
print(f"Warning: Could not load SentenceTransformer model: {e}")
sentence_model = None
def simple_tokenize(text):
"""Simple tokenization function."""
# Convert to string if not already
text = str(text)
return text.lower().replace('.', ' ').replace(',', ' ').replace('!', ' ').replace('?', ' ').split()
def calculate_rouge_scores(prediction: str, reference: str) -> Dict[str, float]:
"""Calculate ROUGE scores for prediction against reference."""
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, prediction)
return {
'rouge1_f': scores['rouge1'].fmeasure,
'rouge2_f': scores['rouge2'].fmeasure,
'rougeL_f': scores['rougeL'].fmeasure
}
def calculate_bleu_scores(prediction: str, reference: str) -> Dict[str, float]:
"""Calculate BLEU scores with different n-gram settings."""
pred_tokens = nltk.word_tokenize(prediction.lower())
ref_tokens = [nltk.word_tokenize(reference.lower())]
weights_list = [(1, 0, 0, 0), (0.5, 0.5, 0, 0), (0.33, 0.33, 0.33, 0), (0.25, 0.25, 0.25, 0.25)]
smooth = SmoothingFunction().method1
scores = {}
for n, weights in enumerate(weights_list, start=1):
try:
score = sentence_bleu(ref_tokens, pred_tokens, weights=weights, smoothing_function=smooth)
except Exception as e:
print(f"Error calculating BLEU score: {e}")
score = 0.0
scores[f'bleu{n}'] = score
return scores
def calculate_bert_scores(prediction: str, reference: str) -> Dict[str, float]:
"""Calculate BERTScore for semantic similarity."""
try:
P, R, F1 = bert_score([prediction], [reference], lang='en', verbose=False)
return {
'bert_precision': P.item(),
'bert_recall': R.item(),
'bert_f1': F1.item()
}
except Exception as e:
print(f"Error calculating BERTScore: {e}")
return {
'bert_precision': 0.0,
'bert_recall': 0.0,
'bert_f1': 0.0
}
def calculate_meteor_score(prediction: str, reference: str) -> float:
"""Calculate METEOR score for the prediction."""
try:
return meteor_score([reference.split()], prediction.split())
except Exception as e:
print(f"Error calculating METEOR score: {e}")
return 0.0
def calculate_sentence_similarity(prediction: str, reference: str) -> float:
"""Calculate sentence embedding similarity using SentenceBERT."""
if sentence_model is None:
return 0.0
try:
# Encode sentences
embedding1 = sentence_model.encode([prediction], convert_to_tensor=True)
embedding2 = sentence_model.encode([reference], convert_to_tensor=True)
# Calculate cosine similarity
similarity = pytorch_cos_sim(embedding1, embedding2).item()
return float(similarity)
except Exception as e:
print(f"Error calculating sentence similarity: {e}")
return 0.0
def calculate_metrics(prediction: str, reference: str) -> Dict[str, float]:
"""Calculate comprehensive evaluation metrics for a prediction."""
# Handle empty or None values
if not prediction or not reference:
return {
"exact_match": 0,
"f1": 0.0,
"rouge1_f": 0.0,
"rouge2_f": 0.0,
"rougeL_f": 0.0,
"bleu1": 0.0,
"bleu2": 0.0,
"bleu3": 0.0,
"bleu4": 0.0,
"bert_f1": 0.0,
"meteor": 0.0,
"sbert_similarity": 0.0
}
# Convert to strings if they're not already
prediction = str(prediction).strip()
reference = str(reference).strip()
# Calculate exact match
exact_match = int(prediction.lower() == reference.lower())
# Calculate token-based F1 score
pred_tokens = set(simple_tokenize(prediction))
ref_tokens = set(simple_tokenize(reference))
common_tokens = pred_tokens & ref_tokens
if not pred_tokens or not ref_tokens:
f1 = 0.0
else:
precision = len(common_tokens) / len(pred_tokens)
recall = len(common_tokens) / len(ref_tokens)
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
# Calculate all scores
rouge_scores = 0 #calculate_rouge_scores(prediction, reference)
bleu_scores = calculate_bleu_scores(prediction, reference)
bert_scores = 0 # calculate_bert_scores(prediction, reference)
meteor = 0 # calculate_meteor_score(prediction, reference)
sbert_similarity = 0 # calculate_sentence_similarity(prediction, reference)
# Combine all metrics
metrics = {
"exact_match": exact_match,
"f1": f1,
# **rouge_scores,
**bleu_scores,
# **bert_scores,
# "meteor": meteor,
# "sbert_similarity": sbert_similarity
}
return metrics
def aggregate_metrics(all_metrics: List[Dict[str, float]], all_categories: List[int]) -> Dict[str, Dict[str, Union[float, Dict[str, float]]]]:
"""Calculate aggregate statistics for all metrics, split by category."""
if not all_metrics:
return {}
# Initialize aggregates for overall and per-category metrics
aggregates = defaultdict(list)
category_aggregates = defaultdict(lambda: defaultdict(list))
# Collect all values for each metric, both overall and per category
for metrics, category in zip(all_metrics, all_categories):
for metric_name, value in metrics.items():
aggregates[metric_name].append(value)
category_aggregates[category][metric_name].append(value)
# Calculate statistics for overall metrics
results = {
"overall": {}
}
for metric_name, values in aggregates.items():
results["overall"][metric_name] = {
'mean': statistics.mean(values),
'std': statistics.stdev(values) if len(values) > 1 else 0.0,
'median': statistics.median(values),
'min': min(values),
'max': max(values),
'count': len(values)
}
# Calculate statistics for each category
for category in sorted(category_aggregates.keys()):
results[f"category_{category}"] = {}
for metric_name, values in category_aggregates[category].items():
if values: # Only calculate if we have values for this category
results[f"category_{category}"][metric_name] = {
'mean': statistics.mean(values),
'std': statistics.stdev(values) if len(values) > 1 else 0.0,
'median': statistics.median(values),
'min': min(values),
'max': max(values),
'count': len(values)
}
return results

147
evaluation/prompts.py Normal file
View File

@@ -0,0 +1,147 @@
ANSWER_PROMPT_GRAPH = """
You are an intelligent memory assistant tasked with retrieving accurate information from
conversation memories.
# CONTEXT:
You have access to memories from two speakers in a conversation. These memories contain
timestamped information that may be relevant to answering the question. You also have
access to knowledge graph relations for each user, showing connections between entities,
concepts, and events relevant to that user.
# INSTRUCTIONS:
1. Carefully analyze all provided memories from both speakers
2. Pay special attention to the timestamps to determine the answer
3. If the question asks about a specific event or fact, look for direct evidence in the
memories
4. If the memories contain contradictory information, prioritize the most recent memory
5. If there is a question about time references (like "last year", "two months ago",
etc.), calculate the actual date based on the memory timestamp. For example, if a
memory from 4 May 2022 mentions "went to India last year," then the trip occurred
in 2021.
6. Always convert relative time references to specific dates, months, or years. For
example, convert "last year" to "2022" or "two months ago" to "March 2023" based
on the memory timestamp. Ignore the reference while answering the question.
7. Focus only on the content of the memories from both speakers. Do not confuse
character names mentioned in memories with the actual users who created those
memories.
8. The answer should be less than 5-6 words.
9. Use the knowledge graph relations to understand the user's knowledge network and
identify important relationships between entities in the user's world.
# APPROACH (Think step by step):
1. First, examine all memories that contain information related to the question
2. Examine the timestamps and content of these memories carefully
3. Look for explicit mentions of dates, times, locations, or events that answer the
question
4. If the answer requires calculation (e.g., converting relative time references),
show your work
5. Analyze the knowledge graph relations to understand the user's knowledge context
6. Formulate a precise, concise answer based solely on the evidence in the memories
7. Double-check that your answer directly addresses the question asked
8. Ensure your final answer is specific and avoids vague time references
Memories for user {{speaker_1_user_id}}:
{{speaker_1_memories}}
Relations for user {{speaker_1_user_id}}:
{{speaker_1_graph_memories}}
Memories for user {{speaker_2_user_id}}:
{{speaker_2_memories}}
Relations for user {{speaker_2_user_id}}:
{{speaker_2_graph_memories}}
Question: {{question}}
Answer:
"""
ANSWER_PROMPT = """
You are an intelligent memory assistant tasked with retrieving accurate information from conversation memories.
# CONTEXT:
You have access to memories from two speakers in a conversation. These memories contain
timestamped information that may be relevant to answering the question.
# INSTRUCTIONS:
1. Carefully analyze all provided memories from both speakers
2. Pay special attention to the timestamps to determine the answer
3. If the question asks about a specific event or fact, look for direct evidence in the memories
4. If the memories contain contradictory information, prioritize the most recent memory
5. If there is a question about time references (like "last year", "two months ago", etc.),
calculate the actual date based on the memory timestamp. For example, if a memory from
4 May 2022 mentions "went to India last year," then the trip occurred in 2021.
6. Always convert relative time references to specific dates, months, or years. For example,
convert "last year" to "2022" or "two months ago" to "March 2023" based on the memory
timestamp. Ignore the reference while answering the question.
7. Focus only on the content of the memories from both speakers. Do not confuse character
names mentioned in memories with the actual users who created those memories.
8. The answer should be less than 5-6 words.
# APPROACH (Think step by step):
1. First, examine all memories that contain information related to the question
2. Examine the timestamps and content of these memories carefully
3. Look for explicit mentions of dates, times, locations, or events that answer the question
4. If the answer requires calculation (e.g., converting relative time references), show your work
5. Formulate a precise, concise answer based solely on the evidence in the memories
6. Double-check that your answer directly addresses the question asked
7. Ensure your final answer is specific and avoids vague time references
Memories for user {{speaker_1_user_id}}:
{{speaker_1_memories}}
Memories for user {{speaker_2_user_id}}:
{{speaker_2_memories}}
Question: {{question}}
Answer:
"""
ANSWER_PROMPT_ZEP = """
You are an intelligent memory assistant tasked with retrieving accurate information from conversation memories.
# CONTEXT:
You have access to memories from a conversation. These memories contain
timestamped information that may be relevant to answering the question.
# INSTRUCTIONS:
1. Carefully analyze all provided memories
2. Pay special attention to the timestamps to determine the answer
3. If the question asks about a specific event or fact, look for direct evidence in the memories
4. If the memories contain contradictory information, prioritize the most recent memory
5. If there is a question about time references (like "last year", "two months ago", etc.),
calculate the actual date based on the memory timestamp. For example, if a memory from
4 May 2022 mentions "went to India last year," then the trip occurred in 2021.
6. Always convert relative time references to specific dates, months, or years. For example,
convert "last year" to "2022" or "two months ago" to "March 2023" based on the memory
timestamp. Ignore the reference while answering the question.
7. Focus only on the content of the memories. Do not confuse character
names mentioned in memories with the actual users who created those memories.
8. The answer should be less than 5-6 words.
# APPROACH (Think step by step):
1. First, examine all memories that contain information related to the question
2. Examine the timestamps and content of these memories carefully
3. Look for explicit mentions of dates, times, locations, or events that answer the question
4. If the answer requires calculation (e.g., converting relative time references), show your work
5. Formulate a precise, concise answer based solely on the evidence in the memories
6. Double-check that your answer directly addresses the question asked
7. Ensure your final answer is specific and avoids vague time references
Memories:
{{memories}}
Question: {{question}}
Answer:
"""

View File

@@ -0,0 +1,102 @@
import os
import json
from src.memzero.add import MemoryADD
from src.memzero.search import MemorySearch
from src.utils import TECHNIQUES, METHODS
import argparse
from src.rag import RAGManager
from src.langmem import LangMemManager
from src.zep.search import ZepSearch
from src.zep.add import ZepAdd
from src.openai.predict import OpenAIPredict
class Experiment:
def __init__(self, technique_type, chunk_size):
self.technique_type = technique_type
self.chunk_size = chunk_size
def run(self):
print(f"Running experiment with technique: {self.technique_type}, chunk size: {self.chunk_size}")
def main():
parser = argparse.ArgumentParser(description='Run memory experiments')
parser.add_argument('--technique_type', choices=TECHNIQUES, default='mem0',
help='Memory technique to use')
parser.add_argument('--method', choices=METHODS, default='add',
help='Method to use')
parser.add_argument('--chunk_size', type=int, default=1000,
help='Chunk size for processing')
parser.add_argument('--output_folder', type=str, default='results/',
help='Output path for results')
parser.add_argument('--top_k', type=int, default=30,
help='Number of top memories to retrieve')
parser.add_argument('--filter_memories', action='store_true', default=False,
help='Whether to filter memories')
parser.add_argument('--is_graph', action='store_true', default=False,
help='Whether to use graph-based search')
parser.add_argument('--num_chunks', type=int, default=1,
help='Number of chunks to process')
args = parser.parse_args()
# Dispatch to the selected memory technique
print(f"Running experiments with technique: {args.technique_type}, chunk size: {args.chunk_size}")
if args.technique_type == "mem0":
if args.method == "add":
memory_manager = MemoryADD(
data_path='dataset/locomo10.json',
is_graph=args.is_graph
)
memory_manager.process_all_conversations()
elif args.method == "search":
output_file_path = os.path.join(
args.output_folder,
f"mem0_results_top_{args.top_k}_filter_{args.filter_memories}_graph_{args.is_graph}.json"
)
memory_searcher = MemorySearch(
output_file_path,
args.top_k,
args.filter_memories,
args.is_graph
)
memory_searcher.process_data_file('dataset/locomo10.json')
elif args.technique_type == "rag":
output_file_path = os.path.join(
args.output_folder,
f"rag_results_{args.chunk_size}_k{args.num_chunks}.json"
)
rag_manager = RAGManager(
data_path="dataset/locomo10_rag.json",
chunk_size=args.chunk_size,
k=args.num_chunks
)
rag_manager.process_all_conversations(output_file_path)
elif args.technique_type == "langmem":
output_file_path = os.path.join(args.output_folder, "langmem_results.json")
langmem_manager = LangMemManager(dataset_path="dataset/locomo10_rag.json")
langmem_manager.process_all_conversations(output_file_path)
elif args.technique_type == "zep":
if args.method == "add":
zep_manager = ZepAdd(data_path="dataset/locomo10.json")
zep_manager.process_all_conversations("1")
elif args.method == "search":
output_file_path = os.path.join(args.output_folder, "zep_search_results.json")
zep_manager = ZepSearch()
zep_manager.process_data_file(
"dataset/locomo10.json",
"1",
output_file_path
)
elif args.technique_type == "openai":
output_file_path = os.path.join(args.output_folder, "openai_results.json")
openai_manager = OpenAIPredict()
openai_manager.process_data_file("dataset/locomo10.json", output_file_path)
else:
raise ValueError(f"Invalid technique type: {args.technique_type}")
if __name__ == "__main__":
main()

193
evaluation/src/langmem.py Normal file
View File

@@ -0,0 +1,193 @@
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent
from langgraph.store.memory import InMemoryStore
from langgraph.utils.config import get_store
from langmem import (
create_manage_memory_tool,
create_search_memory_tool
)
import time
import multiprocessing as mp
import json
from functools import partial
import os
from tqdm import tqdm
from openai import OpenAI
from collections import defaultdict
from dotenv import load_dotenv
from prompts import ANSWER_PROMPT
load_dotenv()
client = OpenAI()
from jinja2 import Template
ANSWER_PROMPT_TEMPLATE = Template(ANSWER_PROMPT)
def get_answer(question, speaker_1_user_id, speaker_1_memories, speaker_2_user_id, speaker_2_memories):
prompt = ANSWER_PROMPT_TEMPLATE.render(
question=question,
speaker_1_user_id=speaker_1_user_id,
speaker_1_memories=speaker_1_memories,
speaker_2_user_id=speaker_2_user_id,
speaker_2_memories=speaker_2_memories
)
t1 = time.time()
response = client.chat.completions.create(
model=os.getenv("MODEL"),
messages=[{"role": "system", "content": prompt}],
temperature=0.0
)
t2 = time.time()
return response.choices[0].message.content, t2 - t1
def prompt(state):
"""Prepare the messages for the LLM."""
store = get_store()
memories = store.search(
("memories",),
query=state["messages"][-1].content,
)
system_msg = f"""You are a helpful assistant.
## Memories
<memories>
{memories}
</memories>
"""
return [{"role": "system", "content": system_msg}, *state["messages"]]
class LangMem:
def __init__(self):
self.store = InMemoryStore(
index={
"dims": 1536,
"embed": f"openai:{os.getenv('EMBEDDING_MODEL')}",
}
)
self.checkpointer = MemorySaver() # Checkpoint graph state
self.agent = create_react_agent(
f"openai:{os.getenv('MODEL')}",
prompt=prompt,
tools=[
create_manage_memory_tool(namespace=("memories",)),
create_search_memory_tool(namespace=("memories",)),
],
store=self.store,
checkpointer=self.checkpointer,
)
def add_memory(self, message, config):
return self.agent.invoke(
{"messages": [{"role": "user", "content": message}]},
config=config
)
def search_memory(self, query, config):
try:
t1 = time.time()
response = self.agent.invoke(
{"messages": [{"role": "user", "content": query}]},
config=config
)
t2 = time.time()
return response["messages"][-1].content, t2 - t1
except Exception as e:
print(f"Error in search_memory: {e}")
# the agent call may fail before t2 is assigned, so compute elapsed time from t1
return "", time.time() - t1
class LangMemManager:
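"""LangMem baseline: replay each conversation into one agent per speaker, then answer questions from the agents' retrieved memories."""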
def __init__(self, dataset_path):
self.dataset_path = dataset_path
with open(self.dataset_path, 'r') as f:
self.data = json.load(f)
def process_all_conversations(self, output_file_path):
OUTPUT = defaultdict(list)
# Process conversations in parallel with multiple workers
def process_conversation(key_value_pair):
key, value = key_value_pair
result = defaultdict(list)
chat_history = value["conversation"]
questions = value["question"]
agent1 = LangMem()
agent2 = LangMem()
config = {"configurable": {"thread_id": f"thread-{key}"}}
speakers = set()
# Identify speakers
for conv in chat_history:
speakers.add(conv['speaker'])
if len(speakers) != 2:
raise ValueError(f"Expected 2 speakers, got {len(speakers)}")
speaker1 = list(speakers)[0]
speaker2 = list(speakers)[1]
# Add memories for each message
for conv in tqdm(chat_history, desc=f"Processing messages {key}", leave=False):
message = f"{conv['timestamp']} | {conv['speaker']}: {conv['text']}"
if conv['speaker'] == speaker1:
agent1.add_memory(message, config)
elif conv['speaker'] == speaker2:
agent2.add_memory(message, config)
else:
raise ValueError(f"Expected speaker1 or speaker2, got {conv['speaker']}")
# Process questions
for q in tqdm(questions, desc=f"Processing questions {key}", leave=False):
category = q['category']
if int(category) == 5:
continue
answer = q['answer']
question = q['question']
response1, speaker1_memory_time = agent1.search_memory(question, config)
response2, speaker2_memory_time = agent2.search_memory(question, config)
generated_answer, response_time = get_answer(
question, speaker1, response1, speaker2, response2
)
result[key].append({
"question": question,
"answer": answer,
"response1": response1,
"response2": response2,
"category": category,
"speaker1_memory_time": speaker1_memory_time,
"speaker2_memory_time": speaker2_memory_time,
"response_time": response_time,
'response': generated_answer
})
return result
# Use multiprocessing to process conversations in parallel
with mp.Pool(processes=10) as pool:
results = list(tqdm(
pool.imap(process_conversation, list(self.data.items())),
total=len(self.data),
desc="Processing conversations"
))
# Combine results from all workers
for result in results:
for key, items in result.items():
OUTPUT[key].extend(items)
# Save final results
with open(output_file_path, 'w') as f:
json.dump(OUTPUT, f, indent=4)

View File

@@ -0,0 +1,141 @@
from mem0 import MemoryClient
import json
import time
import os
import threading
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor
from dotenv import load_dotenv
load_dotenv()
# Update custom instructions
custom_instructions = """
Generate personal memories that follow these guidelines:
1. Each memory should be self-contained with complete context, including:
- The person's name, do not use "user" while creating memories
- Personal details (career aspirations, hobbies, life circumstances)
- Emotional states and reactions
- Ongoing journeys or future plans
- Specific dates when events occurred
2. Include meaningful personal narratives focusing on:
- Identity and self-acceptance journeys
- Family planning and parenting
- Creative outlets and hobbies
- Mental health and self-care activities
- Career aspirations and education goals
- Important life events and milestones
3. Make each memory rich with specific details rather than general statements
- Include timeframes (exact dates when possible)
- Name specific activities (e.g., "charity race for mental health" rather than just "exercise")
- Include emotional context and personal growth elements
4. Extract memories only from user messages, not incorporating assistant responses
5. Format each memory as a paragraph with a clear narrative structure that captures the person's experience, challenges, and aspirations
"""
class MemoryADD:
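"""Ingest LOCOMO conversations into Mem0, creating one user per speaker per conversation and adding messages in batches."""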
def __init__(self, data_path=None, batch_size=2, is_graph=False):
self.mem0_client = MemoryClient(
api_key=os.getenv("MEM0_API_KEY"),
org_id=os.getenv("MEM0_ORGANIZATION_ID"),
project_id=os.getenv("MEM0_PROJECT_ID")
)
self.mem0_client.update_project(custom_instructions=custom_instructions)
self.batch_size = batch_size
self.data_path = data_path
self.data = None
self.is_graph = is_graph
if data_path:
self.load_data()
def load_data(self):
with open(self.data_path, 'r') as f:
self.data = json.load(f)
return self.data
def add_memory(self, user_id, message, metadata, retries=3):
for attempt in range(retries):
try:
_ = self.mem0_client.add(message, user_id=user_id, version="v2",
metadata=metadata, enable_graph=self.is_graph)
return
except Exception as e:
if attempt < retries - 1:
time.sleep(1) # Wait before retrying
continue
else:
raise e
def add_memories_for_speaker(self, speaker, messages, timestamp, desc):
for i in tqdm(range(0, len(messages), self.batch_size), desc=desc):
batch_messages = messages[i:i+self.batch_size]
self.add_memory(speaker, batch_messages, metadata={"timestamp": timestamp})
def process_conversation(self, item, idx):
conversation = item['conversation']
speaker_a = conversation['speaker_a']
speaker_b = conversation['speaker_b']
speaker_a_user_id = f"{speaker_a}_{idx}"
speaker_b_user_id = f"{speaker_b}_{idx}"
# delete all memories for the two users
self.mem0_client.delete_all(user_id=speaker_a_user_id)
self.mem0_client.delete_all(user_id=speaker_b_user_id)
for key in conversation.keys():
if key in ['speaker_a', 'speaker_b'] or "date" in key or "timestamp" in key:
continue
date_time_key = key + "_date_time"
timestamp = conversation[date_time_key]
chats = conversation[key]
messages = []
messages_reverse = []
for chat in chats:
if chat['speaker'] == speaker_a:
messages.append({"role": "user", "content": f"{speaker_a}: {chat['text']}"})
messages_reverse.append({"role": "assistant", "content": f"{speaker_a}: {chat['text']}"})
elif chat['speaker'] == speaker_b:
messages.append({"role": "assistant", "content": f"{speaker_b}: {chat['text']}"})
messages_reverse.append({"role": "user", "content": f"{speaker_b}: {chat['text']}"})
else:
raise ValueError(f"Unknown speaker: {chat['speaker']}")
# add memories for the two users on different threads
thread_a = threading.Thread(
target=self.add_memories_for_speaker,
args=(speaker_a_user_id, messages, timestamp, "Adding Memories for Speaker A")
)
thread_b = threading.Thread(
target=self.add_memories_for_speaker,
args=(speaker_b_user_id, messages_reverse, timestamp, "Adding Memories for Speaker B")
)
thread_a.start()
thread_b.start()
thread_a.join()
thread_b.join()
print("Messages added successfully")
def process_all_conversations(self, max_workers=10):
if not self.data:
raise ValueError("No data loaded. Please set data_path and call load_data() first.")
with ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = [
executor.submit(self.process_conversation, item, idx)
for idx, item in enumerate(self.data)
]
for future in futures:
future.result()

View File

@@ -0,0 +1,189 @@
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
from mem0 import MemoryClient
import json
import time
from jinja2 import Template
from openai import OpenAI
from prompts import ANSWER_PROMPT_GRAPH, ANSWER_PROMPT
import os
from dotenv import load_dotenv
load_dotenv()
class MemorySearch:
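"""Answer LOCOMO questions by searching each speaker's Mem0 memories and prompting an LLM with the retrieved context."""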
def __init__(self, output_path='results.json', top_k=10, filter_memories=False, is_graph=False):
self.mem0_client = MemoryClient(
api_key=os.getenv("MEM0_API_KEY"),
org_id=os.getenv("MEM0_ORGANIZATION_ID"),
project_id=os.getenv("MEM0_PROJECT_ID")
)
self.top_k = top_k
self.openai_client = OpenAI()
self.results = defaultdict(list)
self.output_path = output_path
self.filter_memories = filter_memories
self.is_graph = is_graph
if self.is_graph:
self.ANSWER_PROMPT = ANSWER_PROMPT_GRAPH
else:
self.ANSWER_PROMPT = ANSWER_PROMPT
def search_memory(self, user_id, query, max_retries=3, retry_delay=1):
start_time = time.time()
retries = 0
while retries < max_retries:
try:
if self.is_graph:
print("Searching with graph")
memories = self.mem0_client.search(query, user_id=user_id, top_k=self.top_k,
filter_memories=self.filter_memories, enable_graph=True, output_format='v1.1')
else:
memories = self.mem0_client.search(query, user_id=user_id, top_k=self.top_k,
filter_memories=self.filter_memories)
break
except Exception as e:
print("Retrying...")
retries += 1
if retries >= max_retries:
raise e
time.sleep(retry_delay)
end_time = time.time()
if not self.is_graph:
semantic_memories = [{'memory': memory['memory'],
'timestamp': memory['metadata']['timestamp'],
'score': round(memory['score'], 2)}
for memory in memories]
graph_memories = None
else:
semantic_memories = [{'memory': memory['memory'],
'timestamp': memory['metadata']['timestamp'],
'score': round(memory['score'], 2)} for memory in memories['results']]
graph_memories = [{"source": relation['source'], "relationship": relation['relationship'], "target": relation['target']} for relation in memories['relations']]
return semantic_memories, graph_memories, end_time - start_time
def answer_question(self, speaker_1_user_id, speaker_2_user_id, question, answer, category):
speaker_1_memories, speaker_1_graph_memories, speaker_1_memory_time = self.search_memory(speaker_1_user_id, question)
speaker_2_memories, speaker_2_graph_memories, speaker_2_memory_time = self.search_memory(speaker_2_user_id, question)
search_1_memory = [f"{item['timestamp']}: {item['memory']}"
for item in speaker_1_memories]
search_2_memory = [f"{item['timestamp']}: {item['memory']}"
for item in speaker_2_memories]
template = Template(self.ANSWER_PROMPT)
answer_prompt = template.render(
speaker_1_user_id=speaker_1_user_id.split('_')[0],
speaker_2_user_id=speaker_2_user_id.split('_')[0],
speaker_1_memories=json.dumps(search_1_memory, indent=4),
speaker_2_memories=json.dumps(search_2_memory, indent=4),
speaker_1_graph_memories=json.dumps(speaker_1_graph_memories, indent=4),
speaker_2_graph_memories=json.dumps(speaker_2_graph_memories, indent=4),
question=question
)
t1 = time.time()
response = self.openai_client.chat.completions.create(
model=os.getenv("MODEL"),
messages=[
{"role": "system", "content": answer_prompt}
],
temperature=0.0
)
t2 = time.time()
response_time = t2 - t1
return response.choices[0].message.content, speaker_1_memories, speaker_2_memories, speaker_1_memory_time, speaker_2_memory_time, speaker_1_graph_memories, speaker_2_graph_memories, response_time
def process_question(self, val, speaker_a_user_id, speaker_b_user_id):
question = val.get('question', '')
answer = val.get('answer', '')
category = val.get('category', -1)
evidence = val.get('evidence', [])
adversarial_answer = val.get('adversarial_answer', '')
response, speaker_1_memories, speaker_2_memories, speaker_1_memory_time, speaker_2_memory_time, speaker_1_graph_memories, speaker_2_graph_memories, response_time = self.answer_question(
speaker_a_user_id,
speaker_b_user_id,
question,
answer,
category
)
result = {
"question": question,
"answer": answer,
"category": category,
"evidence": evidence,
"response": response,
"adversarial_answer": adversarial_answer,
"speaker_1_memories": speaker_1_memories,
"speaker_2_memories": speaker_2_memories,
'num_speaker_1_memories': len(speaker_1_memories),
'num_speaker_2_memories': len(speaker_2_memories),
'speaker_1_memory_time': speaker_1_memory_time,
'speaker_2_memory_time': speaker_2_memory_time,
"speaker_1_graph_memories": speaker_1_graph_memories,
"speaker_2_graph_memories": speaker_2_graph_memories,
"response_time": response_time
}
# Save results after each question is processed
with open(self.output_path, 'w') as f:
json.dump(self.results, f, indent=4)
return result
def process_data_file(self, file_path):
with open(file_path, 'r') as f:
data = json.load(f)
for idx, item in tqdm(enumerate(data), total=len(data), desc="Processing conversations"):
qa = item['qa']
conversation = item['conversation']
speaker_a = conversation['speaker_a']
speaker_b = conversation['speaker_b']
speaker_a_user_id = f"{speaker_a}_{idx}"
speaker_b_user_id = f"{speaker_b}_{idx}"
for question_item in tqdm(qa, total=len(qa), desc=f"Processing questions for conversation {idx}", leave=False):
result = self.process_question(
question_item,
speaker_a_user_id,
speaker_b_user_id
)
self.results[idx].append(result)
# Save results after each question is processed
with open(self.output_path, 'w') as f:
json.dump(self.results, f, indent=4)
# Final save at the end
with open(self.output_path, 'w') as f:
json.dump(self.results, f, indent=4)
def process_questions_parallel(self, qa_list, speaker_a_user_id, speaker_b_user_id, max_workers=1):
def process_single_question(val):
result = self.process_question(val, speaker_a_user_id, speaker_b_user_id)
# Save results after each question is processed
with open(self.output_path, 'w') as f:
json.dump(self.results, f, indent=4)
return result
with ThreadPoolExecutor(max_workers=max_workers) as executor:
results = list(tqdm(
executor.map(process_single_question, qa_list),
total=len(qa_list),
desc="Answering Questions"
))
# Final save at the end
with open(self.output_path, 'w') as f:
json.dump(self.results, f, indent=4)
return results

View File

@@ -0,0 +1,143 @@
from openai import OpenAI
import os
import json
from jinja2 import Template
from tqdm import tqdm
import time
from collections import defaultdict
from dotenv import load_dotenv
import argparse
load_dotenv()
ANSWER_PROMPT = """
You are an intelligent memory assistant tasked with retrieving accurate information from conversation memories.
# CONTEXT:
You have access to memories from a conversation. These memories contain
timestamped information that may be relevant to answering the question.
# INSTRUCTIONS:
1. Carefully analyze all provided memories
2. Pay special attention to the timestamps to determine the answer
3. If the question asks about a specific event or fact, look for direct evidence in the memories
4. If the memories contain contradictory information, prioritize the most recent memory
5. If there is a question about time references (like "last year", "two months ago", etc.),
calculate the actual date based on the memory timestamp. For example, if a memory from
4 May 2022 mentions "went to India last year," then the trip occurred in 2021.
6. Always convert relative time references to specific dates, months, or years. For example,
convert "last year" to "2022" or "two months ago" to "March 2023" based on the memory
timestamp. Ignore the reference while answering the question.
7. Focus only on the content of the memories. Do not confuse character
names mentioned in memories with the actual users who created those memories.
8. The answer should be less than 5-6 words.
# APPROACH (Think step by step):
1. First, examine all memories that contain information related to the question
2. Examine the timestamps and content of these memories carefully
3. Look for explicit mentions of dates, times, locations, or events that answer the question
4. If the answer requires calculation (e.g., converting relative time references), show your work
5. Formulate a precise, concise answer based solely on the evidence in the memories
6. Double-check that your answer directly addresses the question asked
7. Ensure your final answer is specific and avoids vague time references
Memories:
{{memories}}
Question: {{question}}
Answer:
"""
class OpenAIPredict:
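"""Answer LOCOMO questions using memory text stored in memories/{idx}.txt as context for the LLM."""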
def __init__(self, model="gpt-4o-mini"):
self.model = model
self.openai_client = OpenAI()
self.results = defaultdict(list)
def search_memory(self, idx):
with open(f'memories/{idx}.txt', 'r') as file:
memories = file.read()
return memories, 0
def process_question(self, val, idx):
question = val.get('question', '')
answer = val.get('answer', '')
category = val.get('category', -1)
evidence = val.get('evidence', [])
adversarial_answer = val.get('adversarial_answer', '')
response, search_memory_time, response_time, context = self.answer_question(
idx,
question
)
result = {
"question": question,
"answer": answer,
"category": category,
"evidence": evidence,
"response": response,
"adversarial_answer": adversarial_answer,
"search_memory_time": search_memory_time,
"response_time": response_time,
"context": context
}
return result
def answer_question(self, idx, question):
memories, search_memory_time = self.search_memory(idx)
template = Template(ANSWER_PROMPT)
answer_prompt = template.render(
memories=memories,
question=question
)
t1 = time.time()
response = self.openai_client.chat.completions.create(
model=os.getenv("MODEL"),
messages=[
{"role": "system", "content": answer_prompt}
],
temperature=0.0
)
t2 = time.time()
response_time = t2 - t1
return response.choices[0].message.content, search_memory_time, response_time, memories
def process_data_file(self, file_path, output_file_path):
with open(file_path, 'r') as f:
data = json.load(f)
for idx, item in tqdm(enumerate(data), total=len(data), desc="Processing conversations"):
qa = item['qa']
for question_item in tqdm(qa, total=len(qa), desc=f"Processing questions for conversation {idx}", leave=False):
result = self.process_question(
question_item,
idx
)
self.results[idx].append(result)
# Save results after each question is processed
with open(output_file_path, 'w') as f:
json.dump(self.results, f, indent=4)
# Final save at the end
with open(output_file_path, 'w') as f:
json.dump(self.results, f, indent=4)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--output_file_path", type=str, required=True)
args = parser.parse_args()
openai_predict = OpenAIPredict()
openai_predict.process_data_file("../../dataset/locomo10.json", args.output_file_path)

197
evaluation/src/rag.py Normal file
View File

@@ -0,0 +1,197 @@
from openai import OpenAI
import json
import numpy as np
from tqdm import tqdm
from jinja2 import Template
import tiktoken
import time
from collections import defaultdict
import os
from dotenv import load_dotenv
load_dotenv()
PROMPT = """
# Question:
{{QUESTION}}
# Context:
{{CONTEXT}}
# Short answer:
"""
class RAGManager:
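"""RAG baseline: chunk each conversation, embed the chunks, retrieve the top-k most similar chunks per question, and generate a short answer."""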
def __init__(self, data_path="dataset/locomo10_rag.json", chunk_size=500, k=1):
self.model = os.getenv("MODEL")
self.client = OpenAI()
self.data_path = data_path
self.chunk_size = chunk_size
self.k = k
def generate_response(self, question, context):
template = Template(PROMPT)
prompt = template.render(
CONTEXT=context,
QUESTION=question
)
max_retries = 3
retries = 0
while retries <= max_retries:
try:
t1 = time.time()
response = self.client.chat.completions.create(
model=self.model,
messages=[
{"role": "system",
"content": "You are a helpful assistant that can answer "
"questions based on the provided context."
"If the question involves timing, use the conversation date for reference."
"Provide the shortest possible answer."
"Use words directly from the conversation when possible."
"Avoid using subjects in your answer."},
{"role": "user", "content": prompt}
],
temperature=0
)
t2 = time.time()
return response.choices[0].message.content.strip(), t2-t1
except Exception as e:
retries += 1
if retries > max_retries:
raise e
time.sleep(1) # Wait before retrying
def clean_chat_history(self, chat_history):
cleaned_chat_history = ""
for c in chat_history:
cleaned_chat_history += (f"{c['timestamp']} | {c['speaker']}: "
f"{c['text']}\n")
return cleaned_chat_history
def calculate_embedding(self, document):
response = self.client.embeddings.create(
model=os.getenv("EMBEDDING_MODEL"),
input=document
)
return response.data[0].embedding
def calculate_similarity(self, embedding1, embedding2):
return np.dot(embedding1, embedding2) / (
np.linalg.norm(embedding1) * np.linalg.norm(embedding2))
def search(self, query, chunks, embeddings, k=1):
"""
Search for the top-k most similar chunks to the query.
Args:
query: The query string
chunks: List of text chunks
embeddings: List of embeddings for each chunk
k: Number of top chunks to return (default: 1)
Returns:
combined_chunks: The combined text of the top-k chunks
search_time: Time taken for the search
"""
t1 = time.time()
query_embedding = self.calculate_embedding(query)
similarities = [
self.calculate_similarity(query_embedding, embedding)
for embedding in embeddings
]
# Get indices of top-k most similar chunks
if k == 1:
# Original behavior - just get the most similar chunk
top_indices = [np.argmax(similarities)]
else:
# Get indices of top-k chunks
top_indices = np.argsort(similarities)[-k:][::-1]
# Combine the top-k chunks
combined_chunks = "\n<->\n".join([chunks[i] for i in top_indices])
t2 = time.time()
return combined_chunks, t2-t1
def create_chunks(self, chat_history, chunk_size=500):
"""
Create chunks using tiktoken for more accurate token counting
"""
# Get the encoding for the model
encoding = tiktoken.encoding_for_model(os.getenv("EMBEDDING_MODEL"))
documents = self.clean_chat_history(chat_history)
if chunk_size == -1:
return [documents], []
chunks = []
# Encode the document
tokens = encoding.encode(documents)
# Split into chunks based on token count
for i in range(0, len(tokens), chunk_size):
chunk_tokens = tokens[i:i+chunk_size]
chunk = encoding.decode(chunk_tokens)
chunks.append(chunk)
embeddings = []
for chunk in chunks:
embedding = self.calculate_embedding(chunk)
embeddings.append(embedding)
return chunks, embeddings
def process_all_conversations(self, output_file_path):
with open(self.data_path, "r") as f:
data = json.load(f)
FINAL_RESULTS = defaultdict(list)
for key, value in tqdm(data.items(), desc="Processing conversations"):
chat_history = value["conversation"]
questions = value["question"]
chunks, embeddings = self.create_chunks(
chat_history, self.chunk_size
)
for item in tqdm(
questions, desc="Answering questions", leave=False
):
question = item["question"]
answer = item.get("answer", "")
category = item["category"]
if self.chunk_size == -1:
context = chunks[0]
search_time = 0
else:
context, search_time = self.search(
question, chunks, embeddings, k=self.k
)
response, response_time = self.generate_response(
question, context
)
FINAL_RESULTS[key].append({
"question": question,
"answer": answer,
"category": category,
"context": context,
"response": response,
"search_time": search_time,
"response_time": response_time,
})
# Save final results
with open(output_file_path, "w+") as f:
json.dump(FINAL_RESULTS, f, indent=4)

12
evaluation/src/utils.py Normal file
View File

@@ -0,0 +1,12 @@
TECHNIQUES = [
"mem0",
"rag",
"langmem",
"zep",
"openai"
]
METHODS = [
"add",
"search"
]

73
evaluation/src/zep/add.py Normal file
View File

@@ -0,0 +1,73 @@
import argparse
import json
import os
from dotenv import load_dotenv
from tqdm import tqdm
from zep_cloud import Message
from zep_cloud.client import Zep
load_dotenv()
class ZepAdd:
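"""Ingest LOCOMO conversations into Zep, creating one user and one session per conversation."""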
def __init__(self, data_path=None):
self.zep_client = Zep(api_key=os.getenv("ZEP_API_KEY"))
self.data_path = data_path
self.data = None
if data_path:
self.load_data()
def load_data(self):
with open(self.data_path, 'r') as f:
self.data = json.load(f)
return self.data
def process_conversation(self, run_id, item, idx):
conversation = item['conversation']
user_id = f"run_id_{run_id}_experiment_user_{idx}"
session_id = f"run_id_{run_id}_experiment_session_{idx}"
# # delete all memories for the two users
# self.zep_client.user.delete(user_id=user_id)
# self.zep_client.memory.delete(session_id=session_id)
self.zep_client.user.add(user_id=user_id)
self.zep_client.memory.add_session(
user_id=user_id,
session_id=session_id,
)
print("Starting to add memories... for user", user_id)
for key in tqdm(conversation.keys(), desc=f"Processing user {user_id}"):
if key in ['speaker_a', 'speaker_b'] or "date" in key:
continue
date_time_key = key + "_date_time"
timestamp = conversation[date_time_key]
chats = conversation[key]
for chat in tqdm(chats, desc=f"Adding chats for {key}", leave=False):
self.zep_client.memory.add(
session_id=session_id,
messages=[Message(
role=chat['speaker'],
role_type="user",
content=f"{timestamp}: {chat['text']}",
)]
)
def process_all_conversations(self, run_id):
if not self.data:
raise ValueError("No data loaded. Please set data_path and call load_data() first.")
for idx, item in tqdm(enumerate(self.data)):
if idx == 0:
self.process_conversation(run_id, item, idx)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--run_id", type=str, required=True)
args = parser.parse_args()
zep_add = ZepAdd(data_path="../../dataset/locomo10.json")
zep_add.process_all_conversations(args.run_id)

View File

@@ -0,0 +1,148 @@
import argparse
from collections import defaultdict
from dotenv import load_dotenv
from jinja2 import Template
from openai import OpenAI
from tqdm import tqdm
from zep_cloud import EntityEdge, EntityNode
from zep_cloud.client import Zep
import json
import os
import pandas as pd
import time
from prompts import ANSWER_PROMPT_ZEP
load_dotenv()
TEMPLATE = """
FACTS and ENTITIES represent relevant context to the current conversation.
# These are the most relevant facts and their valid date ranges
# format: FACT (Date range: from - to)
{facts}
# These are the most relevant entities
# ENTITY_NAME: entity summary
{entities}
"""
class ZepSearch:
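"""Answer LOCOMO questions by retrieving facts and entities from Zep graph search and prompting an LLM with that context."""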
def __init__(self):
self.zep_client = Zep(api_key=os.getenv("ZEP_API_KEY"))
self.results = defaultdict(list)
self.openai_client = OpenAI()
def format_edge_date_range(self, edge: EntityEdge) -> str:
# return f"{datetime(edge.valid_at).strftime('%Y-%m-%d %H:%M:%S') if edge.valid_at else 'date unknown'} - {(edge.invalid_at.strftime('%Y-%m-%d %H:%M:%S') if edge.invalid_at else 'present')}"
return f"{edge.valid_at if edge.valid_at else 'date unknown'} - {(edge.invalid_at if edge.invalid_at else 'present')}"
def compose_search_context(self, edges: list[EntityEdge], nodes: list[EntityNode]) -> str:
facts = [f' - {edge.fact} ({self.format_edge_date_range(edge)})' for edge in edges]
entities = [f' - {node.name}: {node.summary}' for node in nodes]
return TEMPLATE.format(facts='\n'.join(facts), entities='\n'.join(entities))
def search_memory(self, run_id, idx, query, max_retries=3, retry_delay=1):
start_time = time.time()
retries = 0
while retries < max_retries:
try:
user_id = f"run_id_{run_id}_experiment_user_{idx}"
session_id = f"run_id_{run_id}_experiment_session_{idx}"
edges_results = (self.zep_client.graph.search(user_id=user_id, reranker='cross_encoder', query=query, scope='edges', limit=20)).edges
node_results = (self.zep_client.graph.search(user_id=user_id, reranker='rrf', query=query, scope='nodes', limit=20)).nodes
context = self.compose_search_context(edges_results, node_results)
break
except Exception as e:
print("Retrying...")
retries += 1
if retries >= max_retries:
raise e
time.sleep(retry_delay)
end_time = time.time()
return context, end_time - start_time
def process_question(self, run_id, val, idx):
question = val.get('question', '')
answer = val.get('answer', '')
category = val.get('category', -1)
evidence = val.get('evidence', [])
adversarial_answer = val.get('adversarial_answer', '')
response, search_memory_time, response_time, context = self.answer_question(
run_id,
idx,
question
)
result = {
"question": question,
"answer": answer,
"category": category,
"evidence": evidence,
"response": response,
"adversarial_answer": adversarial_answer,
"search_memory_time": search_memory_time,
"response_time": response_time,
"context": context
}
return result
def answer_question(self, run_id, idx, question):
context, search_memory_time = self.search_memory(run_id, idx, question)
template = Template(ANSWER_PROMPT_ZEP)
answer_prompt = template.render(
memories=context,
question=question
)
t1 = time.time()
response = self.openai_client.chat.completions.create(
model=os.getenv("MODEL"),
messages=[
{"role": "system", "content": answer_prompt}
],
temperature=0.0
)
t2 = time.time()
response_time = t2 - t1
return response.choices[0].message.content, search_memory_time, response_time, context
def process_data_file(self, file_path, run_id, output_file_path):
with open(file_path, 'r') as f:
data = json.load(f)
for idx, item in tqdm(enumerate(data), total=len(data), desc="Processing conversations"):
qa = item['qa']
for question_item in tqdm(qa, total=len(qa), desc=f"Processing questions for conversation {idx}", leave=False):
result = self.process_question(
run_id,
question_item,
idx
)
self.results[idx].append(result)
# Save results after each question is processed
with open(output_file_path, 'w') as f:
json.dump(self.results, f, indent=4)
# Final save at the end
with open(output_file_path, 'w') as f:
json.dump(self.results, f, indent=4)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--run_id", type=str, required=True)
args = parser.parse_args()
zep_search = ZepSearch()
zep_search.process_data_file("../../dataset/locomo10.json", args.run_id, "results/zep_search_results.json")