Docs Update (#2591)

This commit is contained in:
Prateek Chhikara
2025-04-29 08:15:25 -07:00
committed by GitHub
parent 6d13e83001
commit 393a4fd5a6
111 changed files with 2296 additions and 99 deletions

31
evaluation/Makefile Normal file
View File

@@ -0,0 +1,31 @@
# Run the experiments
run-mem0-add:
python run_experiments.py --technique_type mem0 --method add
run-mem0-search:
python run_experiments.py --technique_type mem0 --method search --output_folder results/ --top_k 30
run-mem0-plus-add:
python run_experiments.py --technique_type mem0 --method add --is_graph
run-mem0-plus-search:
python run_experiments.py --technique_type mem0 --method search --is_graph --output_folder results/ --top_k 30
run-rag:
python run_experiments.py --technique_type rag --chunk_size 500 --num_chunks 1 --output_folder results/
run-full-context:
python run_experiments.py --technique_type rag --chunk_size -1 --num_chunks 1 --output_folder results/
run-langmem:
python run_experiments.py --technique_type langmem --output_folder results/
run-zep-add:
python run_experiments.py --technique_type zep --method add --output_folder results/
run-zep-search:
python run_experiments.py --technique_type zep --method search --output_folder results/
run-openai:
python run_experiments.py --technique_type openai --output_folder results/

View File

@@ -0,0 +1,192 @@
# Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
[![arXiv](https://img.shields.io/badge/arXiv-Paper-b31b1b.svg)](https://arxiv.org/abs/XXXX.XXXXX)
[![Website](https://img.shields.io/badge/Website-Project-blue)](https://mem0.ai/research)
This repository contains the code and dataset for our paper: **Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory**.
## 📋 Overview
This project evaluates Mem0 and compares it with different memory and retrieval techniques for AI systems:
1. **Established LOCOMO Benchmarks**: We evaluate against five established approaches from the literature: LoCoMo, ReadAgent, MemoryBank, MemGPT, and A-Mem.
2. **Open-Source Memory Solutions**: We test promising open-source memory architectures including LangMem, which provides flexible memory management capabilities.
3. **RAG Systems**: We implement Retrieval-Augmented Generation with various configurations, testing different chunk sizes and retrieval counts to optimize performance.
4. **Full-Context Processing**: We examine the effectiveness of passing the entire conversation history within the context window of the LLM as a baseline approach.
5. **Proprietary Memory Systems**: We evaluate OpenAI's built-in memory feature available in their ChatGPT interface to compare against commercial solutions.
6. **Third-Party Memory Providers**: We incorporate Zep, a specialized memory management platform designed for AI agents, to assess the performance of dedicated memory infrastructure.
We test these techniques on the LOCOMO dataset, which contains conversational data with various question types to evaluate memory recall and understanding.
## 🔍 Dataset
The dataset is located in the `dataset/` directory:
- `locomo10.json`: Original dataset
- `locomo10_rag.json`: Dataset formatted for RAG experiments
## 📁 Project Structure
```
.
├── src/                   # Source code for different memory techniques
│   ├── memzero/           # Implementation of the Mem0 technique
│   ├── openai/            # Implementation of the OpenAI memory
│   ├── zep/               # Implementation of the Zep memory
│   ├── rag.py             # Implementation of the RAG technique
│   └── langmem.py         # Implementation of the LangMem technique
├── metrics/               # Code for evaluation metrics
├── results/               # Results of experiments
├── dataset/               # Dataset files
├── evals.py               # Evaluation script
├── run_experiments.py     # Script to run experiments
├── generate_scores.py     # Script to generate scores from results
└── prompts.py             # Prompts used for the models
```
## 🚀 Getting Started
### Prerequisites
Create a `.env` file with your API keys and configurations. The following keys are required:
```
# OpenAI API key for GPT models and embeddings
OPENAI_API_KEY="your-openai-api-key"
# Mem0 API keys (for Mem0 and Mem0+ techniques)
MEM0_API_KEY="your-mem0-api-key"
MEM0_PROJECT_ID="your-mem0-project-id"
MEM0_ORGANIZATION_ID="your-mem0-organization-id"
# Model configuration
MODEL="gpt-4o-mini" # or your preferred model
EMBEDDING_MODEL="text-embedding-3-small" # or your preferred embedding model
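# Zep API key (for the Zep technique)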
ZEP_API_KEY="api-key-from-zep"
```
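The scripts load these settings with `python-dotenv` and read them via `os.getenv`. A minimal sketch of how they are consumed (simplified from the evaluation code):
```python
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads the .env file from the working directory
client = OpenAI()  # picks up OPENAI_API_KEY from the environment
model = os.getenv("MODEL")  # e.g. "gpt-4o-mini"
embedding_model = os.getenv("EMBEDDING_MODEL")
```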
### Running Experiments
You can run experiments using the provided Makefile commands:
#### Memory Techniques
```bash
# Run Mem0 experiments
make run-mem0-add # Add memories using Mem0
make run-mem0-search # Search memories using Mem0
# Run Mem0+ experiments (with graph-based search)
make run-mem0-plus-add # Add memories using Mem0+
make run-mem0-plus-search # Search memories using Mem0+
# Run RAG experiments
make run-rag # Run RAG with chunk size 500
make run-full-context # Run RAG with full context
# Run LangMem experiments
make run-langmem # Run LangMem
# Run Zep experiments
make run-zep-add # Add memories using Zep
make run-zep-search # Search memories using Zep
# Run OpenAI experiments
make run-openai # Run OpenAI experiments
```
Alternatively, you can run experiments directly with custom parameters:
```bash
python run_experiments.py --technique_type [mem0|rag|langmem|zep|openai] [additional parameters]
```
#### Command-line Parameters:
| Parameter | Description | Default |
|-----------|-------------|---------|
| `--technique_type` | Memory technique to use (mem0, rag, langmem, zep, openai) | mem0 |
| `--method` | Method to use (add, search) | add |
| `--chunk_size` | Chunk size for processing | 1000 |
| `--top_k` | Number of top memories to retrieve | 30 |
| `--filter_memories` | Whether to filter memories | False |
| `--is_graph` | Whether to use graph-based search | False |
| `--num_chunks` | Number of chunks to process for RAG | 1 |
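For example (illustrative commands; flag names as defined in `run_experiments.py`):
```bash
# RAG with 500-token chunks, retrieving the top 2 chunks per question
python run_experiments.py --technique_type rag --chunk_size 500 --num_chunks 2 --output_folder results/

# Mem0 search with graph memories enabled
python run_experiments.py --technique_type mem0 --method search --is_graph --top_k 30 --output_folder results/
```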
### 📊 Evaluation
To evaluate results, run:
```bash
python evals.py --input_file [path_to_results] --output_file [output_path]
```
This script:
1. Processes each question-answer pair
2. Calculates BLEU and F1 scores automatically
3. Uses an LLM judge to evaluate answer correctness
4. Saves the combined results to the output file
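For example, to score the output of a default Mem0 search run (the input file name below matches the default produced by `run_experiments.py`; substitute the path to your own results file):
```bash
python evals.py --input_file results/mem0_results_top_30_filter_False_graph_False.json --output_file evaluation_metrics.json
```
Keeping the default output name `evaluation_metrics.json` is convenient because `generate_scores.py` reads that file from the current directory.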
### 📈 Generating Scores
Generate final scores with:
```bash
python generate_scores.py
```
This script:
1. Loads the evaluation metrics data
2. Calculates mean scores for each category (BLEU, F1, LLM)
3. Reports the number of questions per category
4. Calculates overall mean scores across all categories
Example output:
```
Mean Scores Per Category:
bleu_score f1_score llm_score count
category
1 0.xxxx 0.xxxx 0.xxxx xx
2 0.xxxx 0.xxxx 0.xxxx xx
3 0.xxxx 0.xxxx 0.xxxx xx
Overall Mean Scores:
bleu_score 0.xxxx
f1_score 0.xxxx
llm_score 0.xxxx
```
## 📏 Evaluation Metrics
We use several metrics to evaluate the performance of different memory techniques:
1. **BLEU Score**: Measures the similarity between the model's response and the ground truth
2. **F1 Score**: Measures the token-level harmonic mean of precision and recall between the response and the ground truth (see the sketch after this list)
3. **LLM Score**: A binary score (0 or 1) determined by an LLM judge evaluating the correctness of responses
4. **Token Consumption**: Number of tokens required to generate the final answer.
5. **Latency**: Time taken to search memories and to generate the response.
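For reference, here is a minimal sketch of the token-level F1 computed in `metrics/utils.py` (the helper name `token_f1` is illustrative; the full implementation also strips punctuation and handles empty inputs):
```python
def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and the ground truth (simplified sketch)."""
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    common = pred_tokens & ref_tokens
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```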
## 📚 Citation
If you use this code or dataset in your research, please cite our paper:
```bibtex
@article{mem0,
title={Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory},
author={---},
journal={arXiv preprint},
year={2025}
}
```
## 📄 License
[MIT License](LICENSE)
## 👥 Contributors
- [Prateek Chhikara](https://github.com/prateekchhikara)
- [Dev Khant](https://github.com/Dev-Khant)
- [Saket Aryan](https://github.com/whysosaket)
- [Taranjeet Singh](https://github.com/taranjeet)
- [Deshraj Yadav](https://github.com/deshraj)

81
evaluation/evals.py Normal file
View File

@@ -0,0 +1,81 @@
import json
import argparse
from metrics.utils import calculate_metrics, calculate_bleu_scores
from metrics.llm_judge import evaluate_llm_judge
from collections import defaultdict
from tqdm import tqdm
import concurrent.futures
import threading
def process_item(item_data):
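"""Score one conversation's question-answer pairs with BLEU-1, token F1, and the LLM judge (category 5 is skipped)."""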
k, v = item_data
local_results = defaultdict(list)
for item in v:
gt_answer = str(item['answer'])
pred_answer = str(item['response'])
category = str(item['category'])
question = str(item['question'])
# Skip category 5
if category == '5':
continue
metrics = calculate_metrics(pred_answer, gt_answer)
bleu_scores = calculate_bleu_scores(pred_answer, gt_answer)
llm_score = evaluate_llm_judge(question, gt_answer, pred_answer)
local_results[k].append({
"question": question,
"answer": gt_answer,
"response": pred_answer,
"category": category,
"bleu_score": bleu_scores["bleu1"],
"f1_score": metrics["f1"],
"llm_score": llm_score
})
return local_results
def main():
parser = argparse.ArgumentParser(description='Evaluate RAG results')
parser.add_argument('--input_file', type=str,
default="results/rag_results_500_k1.json",
help='Path to the input dataset file')
parser.add_argument('--output_file', type=str,
default="evaluation_metrics.json",
help='Path to save the evaluation results')
parser.add_argument('--max_workers', type=int, default=10,
help='Maximum number of worker threads')
args = parser.parse_args()
with open(args.input_file, 'r') as f:
data = json.load(f)
results = defaultdict(list)
results_lock = threading.Lock()
# Use ThreadPoolExecutor with specified workers
with concurrent.futures.ThreadPoolExecutor(max_workers=args.max_workers) as executor:
futures = [executor.submit(process_item, item_data)
for item_data in data.items()]
for future in tqdm(concurrent.futures.as_completed(futures),
total=len(futures)):
local_results = future.result()
with results_lock:
for k, items in local_results.items():
results[k].extend(items)
# Save results to JSON file
with open(args.output_file, 'w') as f:
json.dump(results, f, indent=4)
print(f"Results saved to {args.output_file}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,41 @@
import pandas as pd
import json
# Load the evaluation metrics data
with open('evaluation_metrics.json', 'r') as f:
data = json.load(f)
# Flatten the data into a list of question items
all_items = []
for key in data:
all_items.extend(data[key])
# Convert to DataFrame
df = pd.DataFrame(all_items)
# Convert category to numeric type
df['category'] = pd.to_numeric(df['category'])
# Calculate mean scores by category
result = df.groupby('category').agg({
'bleu_score': 'mean',
'f1_score': 'mean',
'llm_score': 'mean'
}).round(4)
# Add count of questions per category
result['count'] = df.groupby('category').size()
# Print the results
print("Mean Scores Per Category:")
print(result)
# Calculate overall means
overall_means = df.agg({
'bleu_score': 'mean',
'f1_score': 'mean',
'llm_score': 'mean'
}).round(4)
print("\nOverall Mean Scores:")
print(overall_means)

View File

@@ -0,0 +1,127 @@
from openai import OpenAI
import json
from collections import defaultdict
import numpy as np
import argparse
client = OpenAI()
ACCURACY_PROMPT = """
Your task is to label an answer to a question as CORRECT or WRONG. You will be given the following data:
(1) a question (posed by one user to another user),
(2) a gold (ground truth) answer,
(3) a generated answer
which you will score as CORRECT/WRONG.
The point of the question is to ask about something one user should know about the other user based on their prior conversations.
The gold answer will usually be a concise and short answer that includes the referenced topic, for example:
Question: Do you remember what I got the last time I went to Hawaii?
Gold answer: A shell necklace
The generated answer might be much longer, but you should be generous with your grading - as long as it touches on the same topic as the gold answer, it should be counted as CORRECT.
For time related questions, the gold answer will be a specific date, month, year, etc. The generated answer might be much longer or use relative time references (like "last Tuesday" or "next month"), but you should be generous with your grading - as long as it refers to the same date or time period as the gold answer, it should be counted as CORRECT. Even if the format differs (e.g., "May 7th" vs "7 May"), consider it CORRECT if it's the same date.
Now it's time for the real question:
Question: {question}
Gold answer: {gold_answer}
Generated answer: {generated_answer}
First, provide a short (one sentence) explanation of your reasoning, then finish with CORRECT or WRONG.
Do NOT include both CORRECT and WRONG in your response, or it will break the evaluation script.
Just return the label CORRECT or WRONG in a json format with the key as "label".
"""
def evaluate_llm_judge(question, gold_answer, generated_answer):
"""Evaluate the generated answer against the gold answer using an LLM judge."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": ACCURACY_PROMPT.format(
question=question,
gold_answer=gold_answer,
generated_answer=generated_answer
)
}],
response_format={"type": "json_object"},
temperature=0.0
)
label = json.loads(response.choices[0].message.content)['label']
return 1 if label == "CORRECT" else 0
def main():
"""Main function to evaluate RAG results using LLM judge."""
parser = argparse.ArgumentParser(
description='Evaluate RAG results using LLM judge'
)
parser.add_argument(
'--input_file',
type=str,
default="results/default_run_v4_k30_new_graph.json",
help='Path to the input dataset file'
)
args = parser.parse_args()
dataset_path = args.input_file
output_path = f"results/llm_judge_{dataset_path.split('/')[-1]}"
with open(dataset_path, "r") as f:
data = json.load(f)
LLM_JUDGE = defaultdict(list)
RESULTS = defaultdict(list)
index = 0
for k, v in data.items():
for x in v:
question = x['question']
gold_answer = x['answer']
generated_answer = x['response']
category = x['category']
# Skip category 5
if int(category) == 5:
continue
# Evaluate the answer
label = evaluate_llm_judge(question, gold_answer, generated_answer)
LLM_JUDGE[category].append(label)
# Store the results
RESULTS[index].append({
"question": question,
"gt_answer": gold_answer,
"response": generated_answer,
"category": category,
"llm_label": label
})
# Save intermediate results
with open(output_path, "w") as f:
json.dump(RESULTS, f, indent=4)
# Print current accuracy for all categories
print("All categories accuracy:")
for cat, results in LLM_JUDGE.items():
if results: # Only print if there are results for this category
print(f" Category {cat}: {np.mean(results):.4f} "
f"({sum(results)}/{len(results)})")
print("------------------------------------------")
index += 1
# Save final results
with open(output_path, "w") as f:
json.dump(RESULTS, f, indent=4)
# Print final summary
print("PATH: ", dataset_path)
print("------------------------------------------")
for k, v in LLM_JUDGE.items():
print(k, np.mean(v))
if __name__ == "__main__":
main()

224
evaluation/metrics/utils.py Normal file
View File

@@ -0,0 +1,224 @@
"""
Borrowed from https://github.com/WujiangXu/AgenticMemory/blob/main/utils.py
@article{xu2025mem,
title={A-mem: Agentic memory for llm agents},
author={Xu, Wujiang and Liang, Zujie and Mei, Kai and Gao, Hang and Tan, Juntao
and Zhang, Yongfeng},
journal={arXiv preprint arXiv:2502.12110},
year={2025}
}
"""
import re
import string
import numpy as np
from typing import List, Dict, Union
import statistics
from collections import defaultdict
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score
import nltk
from nltk.translate.meteor_score import meteor_score
from sentence_transformers import SentenceTransformer
import logging
from dataclasses import dataclass
from pathlib import Path
from openai import OpenAI
# from load_dataset import load_locomo_dataset, QA, Turn, Session, Conversation
from sentence_transformers.util import pytorch_cos_sim
# Download required NLTK data
try:
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
except Exception as e:
print(f"Error downloading NLTK data: {e}")
# Initialize SentenceTransformer model (this will be reused)
try:
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
except Exception as e:
print(f"Warning: Could not load SentenceTransformer model: {e}")
sentence_model = None
def simple_tokenize(text):
"""Simple tokenization function."""
# Convert to string if not already
text = str(text)
return text.lower().replace('.', ' ').replace(',', ' ').replace('!', ' ').replace('?', ' ').split()
def calculate_rouge_scores(prediction: str, reference: str) -> Dict[str, float]:
"""Calculate ROUGE scores for prediction against reference."""
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, prediction)
return {
'rouge1_f': scores['rouge1'].fmeasure,
'rouge2_f': scores['rouge2'].fmeasure,
'rougeL_f': scores['rougeL'].fmeasure
}
def calculate_bleu_scores(prediction: str, reference: str) -> Dict[str, float]:
"""Calculate BLEU scores with different n-gram settings."""
pred_tokens = nltk.word_tokenize(prediction.lower())
ref_tokens = [nltk.word_tokenize(reference.lower())]
weights_list = [(1, 0, 0, 0), (0.5, 0.5, 0, 0), (0.33, 0.33, 0.33, 0), (0.25, 0.25, 0.25, 0.25)]
smooth = SmoothingFunction().method1
scores = {}
for n, weights in enumerate(weights_list, start=1):
try:
score = sentence_bleu(ref_tokens, pred_tokens, weights=weights, smoothing_function=smooth)
except Exception as e:
print(f"Error calculating BLEU score: {e}")
score = 0.0
scores[f'bleu{n}'] = score
return scores
def calculate_bert_scores(prediction: str, reference: str) -> Dict[str, float]:
"""Calculate BERTScore for semantic similarity."""
try:
P, R, F1 = bert_score([prediction], [reference], lang='en', verbose=False)
return {
'bert_precision': P.item(),
'bert_recall': R.item(),
'bert_f1': F1.item()
}
except Exception as e:
print(f"Error calculating BERTScore: {e}")
return {
'bert_precision': 0.0,
'bert_recall': 0.0,
'bert_f1': 0.0
}
def calculate_meteor_score(prediction: str, reference: str) -> float:
"""Calculate METEOR score for the prediction."""
try:
return meteor_score([reference.split()], prediction.split())
except Exception as e:
print(f"Error calculating METEOR score: {e}")
return 0.0
def calculate_sentence_similarity(prediction: str, reference: str) -> float:
"""Calculate sentence embedding similarity using SentenceBERT."""
if sentence_model is None:
return 0.0
try:
# Encode sentences
embedding1 = sentence_model.encode([prediction], convert_to_tensor=True)
embedding2 = sentence_model.encode([reference], convert_to_tensor=True)
# Calculate cosine similarity
similarity = pytorch_cos_sim(embedding1, embedding2).item()
return float(similarity)
except Exception as e:
print(f"Error calculating sentence similarity: {e}")
return 0.0
def calculate_metrics(prediction: str, reference: str) -> Dict[str, float]:
"""Calculate comprehensive evaluation metrics for a prediction."""
# Handle empty or None values
if not prediction or not reference:
return {
"exact_match": 0,
"f1": 0.0,
"rouge1_f": 0.0,
"rouge2_f": 0.0,
"rougeL_f": 0.0,
"bleu1": 0.0,
"bleu2": 0.0,
"bleu3": 0.0,
"bleu4": 0.0,
"bert_f1": 0.0,
"meteor": 0.0,
"sbert_similarity": 0.0
}
# Convert to strings if they're not already
prediction = str(prediction).strip()
reference = str(reference).strip()
# Calculate exact match
exact_match = int(prediction.lower() == reference.lower())
# Calculate token-based F1 score
pred_tokens = set(simple_tokenize(prediction))
ref_tokens = set(simple_tokenize(reference))
common_tokens = pred_tokens & ref_tokens
if not pred_tokens or not ref_tokens:
f1 = 0.0
else:
precision = len(common_tokens) / len(pred_tokens)
recall = len(common_tokens) / len(ref_tokens)
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
# Calculate all scores
rouge_scores = 0 #calculate_rouge_scores(prediction, reference)
bleu_scores = calculate_bleu_scores(prediction, reference)
bert_scores = 0 # calculate_bert_scores(prediction, reference)
meteor = 0 # calculate_meteor_score(prediction, reference)
sbert_similarity = 0 # calculate_sentence_similarity(prediction, reference)
# Combine all metrics
metrics = {
"exact_match": exact_match,
"f1": f1,
# **rouge_scores,
**bleu_scores,
# **bert_scores,
# "meteor": meteor,
# "sbert_similarity": sbert_similarity
}
return metrics
def aggregate_metrics(all_metrics: List[Dict[str, float]], all_categories: List[int]) -> Dict[str, Dict[str, Union[float, Dict[str, float]]]]:
"""Calculate aggregate statistics for all metrics, split by category."""
if not all_metrics:
return {}
# Initialize aggregates for overall and per-category metrics
aggregates = defaultdict(list)
category_aggregates = defaultdict(lambda: defaultdict(list))
# Collect all values for each metric, both overall and per category
for metrics, category in zip(all_metrics, all_categories):
for metric_name, value in metrics.items():
aggregates[metric_name].append(value)
category_aggregates[category][metric_name].append(value)
# Calculate statistics for overall metrics
results = {
"overall": {}
}
for metric_name, values in aggregates.items():
results["overall"][metric_name] = {
'mean': statistics.mean(values),
'std': statistics.stdev(values) if len(values) > 1 else 0.0,
'median': statistics.median(values),
'min': min(values),
'max': max(values),
'count': len(values)
}
# Calculate statistics for each category
for category in sorted(category_aggregates.keys()):
results[f"category_{category}"] = {}
for metric_name, values in category_aggregates[category].items():
if values: # Only calculate if we have values for this category
results[f"category_{category}"][metric_name] = {
'mean': statistics.mean(values),
'std': statistics.stdev(values) if len(values) > 1 else 0.0,
'median': statistics.median(values),
'min': min(values),
'max': max(values),
'count': len(values)
}
return results

147
evaluation/prompts.py Normal file
View File

@@ -0,0 +1,147 @@
ANSWER_PROMPT_GRAPH = """
You are an intelligent memory assistant tasked with retrieving accurate information from
conversation memories.
# CONTEXT:
You have access to memories from two speakers in a conversation. These memories contain
timestamped information that may be relevant to answering the question. You also have
access to knowledge graph relations for each user, showing connections between entities,
concepts, and events relevant to that user.
# INSTRUCTIONS:
1. Carefully analyze all provided memories from both speakers
2. Pay special attention to the timestamps to determine the answer
3. If the question asks about a specific event or fact, look for direct evidence in the
memories
4. If the memories contain contradictory information, prioritize the most recent memory
5. If there is a question about time references (like "last year", "two months ago",
etc.), calculate the actual date based on the memory timestamp. For example, if a
memory from 4 May 2022 mentions "went to India last year," then the trip occurred
in 2021.
6. Always convert relative time references to specific dates, months, or years. For
example, convert "last year" to "2022" or "two months ago" to "March 2023" based
on the memory timestamp. Ignore the reference while answering the question.
7. Focus only on the content of the memories from both speakers. Do not confuse
character names mentioned in memories with the actual users who created those
memories.
8. The answer should be less than 5-6 words.
9. Use the knowledge graph relations to understand the user's knowledge network and
identify important relationships between entities in the user's world.
# APPROACH (Think step by step):
1. First, examine all memories that contain information related to the question
2. Examine the timestamps and content of these memories carefully
3. Look for explicit mentions of dates, times, locations, or events that answer the
question
4. If the answer requires calculation (e.g., converting relative time references),
show your work
5. Analyze the knowledge graph relations to understand the user's knowledge context
6. Formulate a precise, concise answer based solely on the evidence in the memories
7. Double-check that your answer directly addresses the question asked
8. Ensure your final answer is specific and avoids vague time references
Memories for user {{speaker_1_user_id}}:
{{speaker_1_memories}}
Relations for user {{speaker_1_user_id}}:
{{speaker_1_graph_memories}}
Memories for user {{speaker_2_user_id}}:
{{speaker_2_memories}}
Relations for user {{speaker_2_user_id}}:
{{speaker_2_graph_memories}}
Question: {{question}}
Answer:
"""
ANSWER_PROMPT = """
You are an intelligent memory assistant tasked with retrieving accurate information from conversation memories.
# CONTEXT:
You have access to memories from two speakers in a conversation. These memories contain
timestamped information that may be relevant to answering the question.
# INSTRUCTIONS:
1. Carefully analyze all provided memories from both speakers
2. Pay special attention to the timestamps to determine the answer
3. If the question asks about a specific event or fact, look for direct evidence in the memories
4. If the memories contain contradictory information, prioritize the most recent memory
5. If there is a question about time references (like "last year", "two months ago", etc.),
calculate the actual date based on the memory timestamp. For example, if a memory from
4 May 2022 mentions "went to India last year," then the trip occurred in 2021.
6. Always convert relative time references to specific dates, months, or years. For example,
convert "last year" to "2022" or "two months ago" to "March 2023" based on the memory
timestamp. Ignore the reference while answering the question.
7. Focus only on the content of the memories from both speakers. Do not confuse character
names mentioned in memories with the actual users who created those memories.
8. The answer should be less than 5-6 words.
# APPROACH (Think step by step):
1. First, examine all memories that contain information related to the question
2. Examine the timestamps and content of these memories carefully
3. Look for explicit mentions of dates, times, locations, or events that answer the question
4. If the answer requires calculation (e.g., converting relative time references), show your work
5. Formulate a precise, concise answer based solely on the evidence in the memories
6. Double-check that your answer directly addresses the question asked
7. Ensure your final answer is specific and avoids vague time references
Memories for user {{speaker_1_user_id}}:
{{speaker_1_memories}}
Memories for user {{speaker_2_user_id}}:
{{speaker_2_memories}}
Question: {{question}}
Answer:
"""
ANSWER_PROMPT_ZEP = """
You are an intelligent memory assistant tasked with retrieving accurate information from conversation memories.
# CONTEXT:
You have access to memories from a conversation. These memories contain
timestamped information that may be relevant to answering the question.
# INSTRUCTIONS:
1. Carefully analyze all provided memories
2. Pay special attention to the timestamps to determine the answer
3. If the question asks about a specific event or fact, look for direct evidence in the memories
4. If the memories contain contradictory information, prioritize the most recent memory
5. If there is a question about time references (like "last year", "two months ago", etc.),
calculate the actual date based on the memory timestamp. For example, if a memory from
4 May 2022 mentions "went to India last year," then the trip occurred in 2021.
6. Always convert relative time references to specific dates, months, or years. For example,
convert "last year" to "2022" or "two months ago" to "March 2023" based on the memory
timestamp. Ignore the reference while answering the question.
7. Focus only on the content of the memories. Do not confuse character
names mentioned in memories with the actual users who created those memories.
8. The answer should be less than 5-6 words.
# APPROACH (Think step by step):
1. First, examine all memories that contain information related to the question
2. Examine the timestamps and content of these memories carefully
3. Look for explicit mentions of dates, times, locations, or events that answer the question
4. If the answer requires calculation (e.g., converting relative time references), show your work
5. Formulate a precise, concise answer based solely on the evidence in the memories
6. Double-check that your answer directly addresses the question asked
7. Ensure your final answer is specific and avoids vague time references
Memories:
{{memories}}
Question: {{question}}
Answer:
"""

View File

@@ -0,0 +1,102 @@
import os
import json
from src.memzero.add import MemoryADD
from src.memzero.search import MemorySearch
from src.utils import TECHNIQUES, METHODS
import argparse
from src.rag import RAGManager
from src.langmem import LangMemManager
from src.zep.search import ZepSearch
from src.zep.add import ZepAdd
from src.openai.predict import OpenAIPredict
class Experiment:
def __init__(self, technique_type, chunk_size):
self.technique_type = technique_type
self.chunk_size = chunk_size
def run(self):
print(f"Running experiment with technique: {self.technique_type}, chunk size: {self.chunk_size}")
def main():
parser = argparse.ArgumentParser(description='Run memory experiments')
parser.add_argument('--technique_type', choices=TECHNIQUES, default='mem0',
help='Memory technique to use')
parser.add_argument('--method', choices=METHODS, default='add',
help='Method to use')
parser.add_argument('--chunk_size', type=int, default=1000,
help='Chunk size for processing')
parser.add_argument('--output_folder', type=str, default='results/',
help='Output path for results')
parser.add_argument('--top_k', type=int, default=30,
help='Number of top memories to retrieve')
parser.add_argument('--filter_memories', action='store_true', default=False,
help='Whether to filter memories')
parser.add_argument('--is_graph', action='store_true', default=False,
help='Whether to use graph-based search')
parser.add_argument('--num_chunks', type=int, default=1,
help='Number of chunks to process')
args = parser.parse_args()
# Dispatch to the selected memory technique
print(f"Running experiments with technique: {args.technique_type}, chunk size: {args.chunk_size}")
if args.technique_type == "mem0":
if args.method == "add":
memory_manager = MemoryADD(
data_path='dataset/locomo10.json',
is_graph=args.is_graph
)
memory_manager.process_all_conversations()
elif args.method == "search":
output_file_path = os.path.join(
args.output_folder,
f"mem0_results_top_{args.top_k}_filter_{args.filter_memories}_graph_{args.is_graph}.json"
)
memory_searcher = MemorySearch(
output_file_path,
args.top_k,
args.filter_memories,
args.is_graph
)
memory_searcher.process_data_file('dataset/locomo10.json')
elif args.technique_type == "rag":
output_file_path = os.path.join(
args.output_folder,
f"rag_results_{args.chunk_size}_k{args.num_chunks}.json"
)
rag_manager = RAGManager(
data_path="dataset/locomo10_rag.json",
chunk_size=args.chunk_size,
k=args.num_chunks
)
rag_manager.process_all_conversations(output_file_path)
elif args.technique_type == "langmem":
output_file_path = os.path.join(args.output_folder, "langmem_results.json")
langmem_manager = LangMemManager(dataset_path="dataset/locomo10_rag.json")
langmem_manager.process_all_conversations(output_file_path)
elif args.technique_type == "zep":
if args.method == "add":
zep_manager = ZepAdd(data_path="dataset/locomo10.json")
zep_manager.process_all_conversations("1")
elif args.method == "search":
output_file_path = os.path.join(args.output_folder, "zep_search_results.json")
zep_manager = ZepSearch()
zep_manager.process_data_file(
"dataset/locomo10.json",
"1",
output_file_path
)
elif args.technique_type == "openai":
output_file_path = os.path.join(args.output_folder, "openai_results.json")
openai_manager = OpenAIPredict()
openai_manager.process_data_file("dataset/locomo10.json", output_file_path)
else:
raise ValueError(f"Invalid technique type: {args.technique_type}")
if __name__ == "__main__":
main()

193
evaluation/src/langmem.py Normal file
View File

@@ -0,0 +1,193 @@
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent
from langgraph.store.memory import InMemoryStore
from langgraph.utils.config import get_store
from langmem import (
create_manage_memory_tool,
create_search_memory_tool
)
import time
import multiprocessing as mp
import json
from functools import partial
import os
from tqdm import tqdm
from openai import OpenAI
from collections import defaultdict
from dotenv import load_dotenv
from prompts import ANSWER_PROMPT
load_dotenv()
client = OpenAI()
from jinja2 import Template
ANSWER_PROMPT_TEMPLATE = Template(ANSWER_PROMPT)
def get_answer(question, speaker_1_user_id, speaker_1_memories, speaker_2_user_id, speaker_2_memories):
prompt = ANSWER_PROMPT_TEMPLATE.render(
question=question,
speaker_1_user_id=speaker_1_user_id,
speaker_1_memories=speaker_1_memories,
speaker_2_user_id=speaker_2_user_id,
speaker_2_memories=speaker_2_memories
)
t1 = time.time()
response = client.chat.completions.create(
model=os.getenv("MODEL"),
messages=[{"role": "system", "content": prompt}],
temperature=0.0
)
t2 = time.time()
return response.choices[0].message.content, t2 - t1
def prompt(state):
"""Prepare the messages for the LLM."""
store = get_store()
memories = store.search(
("memories",),
query=state["messages"][-1].content,
)
system_msg = f"""You are a helpful assistant.
## Memories
<memories>
{memories}
</memories>
"""
return [{"role": "system", "content": system_msg}, *state["messages"]]
class LangMem:
def __init__(self):
self.store = InMemoryStore(
index={
"dims": 1536,
"embed": f"openai:{os.getenv('EMBEDDING_MODEL')}",
}
)
self.checkpointer = MemorySaver() # Checkpoint graph state
self.agent = create_react_agent(
f"openai:{os.getenv('MODEL')}",
prompt=prompt,
tools=[
create_manage_memory_tool(namespace=("memories",)),
create_search_memory_tool(namespace=("memories",)),
],
store=self.store,
checkpointer=self.checkpointer,
)
def add_memory(self, message, config):
return self.agent.invoke(
{"messages": [{"role": "user", "content": message}]},
config=config
)
def search_memory(self, query, config):
try:
t1 = time.time()
response = self.agent.invoke(
{"messages": [{"role": "user", "content": query}]},
config=config
)
t2 = time.time()
return response["messages"][-1].content, t2 - t1
except Exception as e:
print(f"Error in search_memory: {e}")
# the agent call may fail before t2 is assigned, so compute elapsed time from t1
return "", time.time() - t1
class LangMemManager:
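"""LangMem baseline: replay each conversation into one agent per speaker, then answer questions from the agents' retrieved memories."""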
def __init__(self, dataset_path):
self.dataset_path = dataset_path
with open(self.dataset_path, 'r') as f:
self.data = json.load(f)
def process_all_conversations(self, output_file_path):
OUTPUT = defaultdict(list)
# Process conversations in parallel with multiple workers
def process_conversation(key_value_pair):
key, value = key_value_pair
result = defaultdict(list)
chat_history = value["conversation"]
questions = value["question"]
agent1 = LangMem()
agent2 = LangMem()
config = {"configurable": {"thread_id": f"thread-{key}"}}
speakers = set()
# Identify speakers
for conv in chat_history:
speakers.add(conv['speaker'])
if len(speakers) != 2:
raise ValueError(f"Expected 2 speakers, got {len(speakers)}")
speaker1 = list(speakers)[0]
speaker2 = list(speakers)[1]
# Add memories for each message
for conv in tqdm(chat_history, desc=f"Processing messages {key}", leave=False):
message = f"{conv['timestamp']} | {conv['speaker']}: {conv['text']}"
if conv['speaker'] == speaker1:
agent1.add_memory(message, config)
elif conv['speaker'] == speaker2:
agent2.add_memory(message, config)
else:
raise ValueError(f"Expected speaker1 or speaker2, got {conv['speaker']}")
# Process questions
for q in tqdm(questions, desc=f"Processing questions {key}", leave=False):
category = q['category']
if int(category) == 5:
continue
answer = q['answer']
question = q['question']
response1, speaker1_memory_time = agent1.search_memory(question, config)
response2, speaker2_memory_time = agent2.search_memory(question, config)
generated_answer, response_time = get_answer(
question, speaker1, response1, speaker2, response2
)
result[key].append({
"question": question,
"answer": answer,
"response1": response1,
"response2": response2,
"category": category,
"speaker1_memory_time": speaker1_memory_time,
"speaker2_memory_time": speaker2_memory_time,
"response_time": response_time,
'response': generated_answer
})
return result
# Use multiprocessing to process conversations in parallel
with mp.Pool(processes=10) as pool:
results = list(tqdm(
pool.imap(process_conversation, list(self.data.items())),
total=len(self.data),
desc="Processing conversations"
))
# Combine results from all workers
for result in results:
for key, items in result.items():
OUTPUT[key].extend(items)
# Save final results
with open(output_file_path, 'w') as f:
json.dump(OUTPUT, f, indent=4)

View File

@@ -0,0 +1,141 @@
from mem0 import MemoryClient
import json
import time
import os
import threading
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor
from dotenv import load_dotenv
load_dotenv()
# Update custom instructions
custom_instructions = """
Generate personal memories that follow these guidelines:
1. Each memory should be self-contained with complete context, including:
- The person's name, do not use "user" while creating memories
- Personal details (career aspirations, hobbies, life circumstances)
- Emotional states and reactions
- Ongoing journeys or future plans
- Specific dates when events occurred
2. Include meaningful personal narratives focusing on:
- Identity and self-acceptance journeys
- Family planning and parenting
- Creative outlets and hobbies
- Mental health and self-care activities
- Career aspirations and education goals
- Important life events and milestones
3. Make each memory rich with specific details rather than general statements
- Include timeframes (exact dates when possible)
- Name specific activities (e.g., "charity race for mental health" rather than just "exercise")
- Include emotional context and personal growth elements
4. Extract memories only from user messages, not incorporating assistant responses
5. Format each memory as a paragraph with a clear narrative structure that captures the person's experience, challenges, and aspirations
"""
class MemoryADD:
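"""Ingest LOCOMO conversations into Mem0, creating one user per speaker per conversation and adding messages in batches."""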
def __init__(self, data_path=None, batch_size=2, is_graph=False):
self.mem0_client = MemoryClient(
api_key=os.getenv("MEM0_API_KEY"),
org_id=os.getenv("MEM0_ORGANIZATION_ID"),
project_id=os.getenv("MEM0_PROJECT_ID")
)
self.mem0_client.update_project(custom_instructions=custom_instructions)
self.batch_size = batch_size
self.data_path = data_path
self.data = None
self.is_graph = is_graph
if data_path:
self.load_data()
def load_data(self):
with open(self.data_path, 'r') as f:
self.data = json.load(f)
return self.data
def add_memory(self, user_id, message, metadata, retries=3):
for attempt in range(retries):
try:
_ = self.mem0_client.add(message, user_id=user_id, version="v2",
metadata=metadata, enable_graph=self.is_graph)
return
except Exception as e:
if attempt < retries - 1:
time.sleep(1) # Wait before retrying
continue
else:
raise e
def add_memories_for_speaker(self, speaker, messages, timestamp, desc):
for i in tqdm(range(0, len(messages), self.batch_size), desc=desc):
batch_messages = messages[i:i+self.batch_size]
self.add_memory(speaker, batch_messages, metadata={"timestamp": timestamp})
def process_conversation(self, item, idx):
conversation = item['conversation']
speaker_a = conversation['speaker_a']
speaker_b = conversation['speaker_b']
speaker_a_user_id = f"{speaker_a}_{idx}"
speaker_b_user_id = f"{speaker_b}_{idx}"
# delete all memories for the two users
self.mem0_client.delete_all(user_id=speaker_a_user_id)
self.mem0_client.delete_all(user_id=speaker_b_user_id)
for key in conversation.keys():
if key in ['speaker_a', 'speaker_b'] or "date" in key or "timestamp" in key:
continue
date_time_key = key + "_date_time"
timestamp = conversation[date_time_key]
chats = conversation[key]
messages = []
messages_reverse = []
for chat in chats:
if chat['speaker'] == speaker_a:
messages.append({"role": "user", "content": f"{speaker_a}: {chat['text']}"})
messages_reverse.append({"role": "assistant", "content": f"{speaker_a}: {chat['text']}"})
elif chat['speaker'] == speaker_b:
messages.append({"role": "assistant", "content": f"{speaker_b}: {chat['text']}"})
messages_reverse.append({"role": "user", "content": f"{speaker_b}: {chat['text']}"})
else:
raise ValueError(f"Unknown speaker: {chat['speaker']}")
# add memories for the two users on different threads
thread_a = threading.Thread(
target=self.add_memories_for_speaker,
args=(speaker_a_user_id, messages, timestamp, "Adding Memories for Speaker A")
)
thread_b = threading.Thread(
target=self.add_memories_for_speaker,
args=(speaker_b_user_id, messages_reverse, timestamp, "Adding Memories for Speaker B")
)
thread_a.start()
thread_b.start()
thread_a.join()
thread_b.join()
print("Messages added successfully")
def process_all_conversations(self, max_workers=10):
if not self.data:
raise ValueError("No data loaded. Please set data_path and call load_data() first.")
with ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = [
executor.submit(self.process_conversation, item, idx)
for idx, item in enumerate(self.data)
]
for future in futures:
future.result()

View File

@@ -0,0 +1,189 @@
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
from mem0 import MemoryClient
import json
import time
from jinja2 import Template
from openai import OpenAI
from prompts import ANSWER_PROMPT_GRAPH, ANSWER_PROMPT
import os
from dotenv import load_dotenv
load_dotenv()
class MemorySearch:
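"""Answer LOCOMO questions by searching each speaker's Mem0 memories and prompting an LLM with the retrieved context."""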
def __init__(self, output_path='results.json', top_k=10, filter_memories=False, is_graph=False):
self.mem0_client = MemoryClient(
api_key=os.getenv("MEM0_API_KEY"),
org_id=os.getenv("MEM0_ORGANIZATION_ID"),
project_id=os.getenv("MEM0_PROJECT_ID")
)
self.top_k = top_k
self.openai_client = OpenAI()
self.results = defaultdict(list)
self.output_path = output_path
self.filter_memories = filter_memories
self.is_graph = is_graph
if self.is_graph:
self.ANSWER_PROMPT = ANSWER_PROMPT_GRAPH
else:
self.ANSWER_PROMPT = ANSWER_PROMPT
def search_memory(self, user_id, query, max_retries=3, retry_delay=1):
start_time = time.time()
retries = 0
while retries < max_retries:
try:
if self.is_graph:
print("Searching with graph")
memories = self.mem0_client.search(query, user_id=user_id, top_k=self.top_k,
filter_memories=self.filter_memories, enable_graph=True, output_format='v1.1')
else:
memories = self.mem0_client.search(query, user_id=user_id, top_k=self.top_k,
filter_memories=self.filter_memories)
break
except Exception as e:
print("Retrying...")
retries += 1
if retries >= max_retries:
raise e
time.sleep(retry_delay)
end_time = time.time()
if not self.is_graph:
semantic_memories = [{'memory': memory['memory'],
'timestamp': memory['metadata']['timestamp'],
'score': round(memory['score'], 2)}
for memory in memories]
graph_memories = None
else:
semantic_memories = [{'memory': memory['memory'],
'timestamp': memory['metadata']['timestamp'],
'score': round(memory['score'], 2)} for memory in memories['results']]
graph_memories = [{"source": relation['source'], "relationship": relation['relationship'], "target": relation['target']} for relation in memories['relations']]
return semantic_memories, graph_memories, end_time - start_time
def answer_question(self, speaker_1_user_id, speaker_2_user_id, question, answer, category):
speaker_1_memories, speaker_1_graph_memories, speaker_1_memory_time = self.search_memory(speaker_1_user_id, question)
speaker_2_memories, speaker_2_graph_memories, speaker_2_memory_time = self.search_memory(speaker_2_user_id, question)
search_1_memory = [f"{item['timestamp']}: {item['memory']}"
for item in speaker_1_memories]
search_2_memory = [f"{item['timestamp']}: {item['memory']}"
for item in speaker_2_memories]
template = Template(self.ANSWER_PROMPT)
answer_prompt = template.render(
speaker_1_user_id=speaker_1_user_id.split('_')[0],
speaker_2_user_id=speaker_2_user_id.split('_')[0],
speaker_1_memories=json.dumps(search_1_memory, indent=4),
speaker_2_memories=json.dumps(search_2_memory, indent=4),
speaker_1_graph_memories=json.dumps(speaker_1_graph_memories, indent=4),
speaker_2_graph_memories=json.dumps(speaker_2_graph_memories, indent=4),
question=question
)
t1 = time.time()
response = self.openai_client.chat.completions.create(
model=os.getenv("MODEL"),
messages=[
{"role": "system", "content": answer_prompt}
],
temperature=0.0
)
t2 = time.time()
response_time = t2 - t1
return response.choices[0].message.content, speaker_1_memories, speaker_2_memories, speaker_1_memory_time, speaker_2_memory_time, speaker_1_graph_memories, speaker_2_graph_memories, response_time
def process_question(self, val, speaker_a_user_id, speaker_b_user_id):
question = val.get('question', '')
answer = val.get('answer', '')
category = val.get('category', -1)
evidence = val.get('evidence', [])
adversarial_answer = val.get('adversarial_answer', '')
response, speaker_1_memories, speaker_2_memories, speaker_1_memory_time, speaker_2_memory_time, speaker_1_graph_memories, speaker_2_graph_memories, response_time = self.answer_question(
speaker_a_user_id,
speaker_b_user_id,
question,
answer,
category
)
result = {
"question": question,
"answer": answer,
"category": category,
"evidence": evidence,
"response": response,
"adversarial_answer": adversarial_answer,
"speaker_1_memories": speaker_1_memories,
"speaker_2_memories": speaker_2_memories,
'num_speaker_1_memories': len(speaker_1_memories),
'num_speaker_2_memories': len(speaker_2_memories),
'speaker_1_memory_time': speaker_1_memory_time,
'speaker_2_memory_time': speaker_2_memory_time,
"speaker_1_graph_memories": speaker_1_graph_memories,
"speaker_2_graph_memories": speaker_2_graph_memories,
"response_time": response_time
}
# Save results after each question is processed
with open(self.output_path, 'w') as f:
json.dump(self.results, f, indent=4)
return result
def process_data_file(self, file_path):
with open(file_path, 'r') as f:
data = json.load(f)
for idx, item in tqdm(enumerate(data), total=len(data), desc="Processing conversations"):
qa = item['qa']
conversation = item['conversation']
speaker_a = conversation['speaker_a']
speaker_b = conversation['speaker_b']
speaker_a_user_id = f"{speaker_a}_{idx}"
speaker_b_user_id = f"{speaker_b}_{idx}"
for question_item in tqdm(qa, total=len(qa), desc=f"Processing questions for conversation {idx}", leave=False):
result = self.process_question(
question_item,
speaker_a_user_id,
speaker_b_user_id
)
self.results[idx].append(result)
# Save results after each question is processed
with open(self.output_path, 'w') as f:
json.dump(self.results, f, indent=4)
# Final save at the end
with open(self.output_path, 'w') as f:
json.dump(self.results, f, indent=4)
def process_questions_parallel(self, qa_list, speaker_a_user_id, speaker_b_user_id, max_workers=1):
def process_single_question(val):
result = self.process_question(val, speaker_a_user_id, speaker_b_user_id)
# Save results after each question is processed
with open(self.output_path, 'w') as f:
json.dump(self.results, f, indent=4)
return result
with ThreadPoolExecutor(max_workers=max_workers) as executor:
results = list(tqdm(
executor.map(process_single_question, qa_list),
total=len(qa_list),
desc="Answering Questions"
))
# Final save at the end
with open(self.output_path, 'w') as f:
json.dump(self.results, f, indent=4)
return results

View File

@@ -0,0 +1,143 @@
from openai import OpenAI
import os
import json
from jinja2 import Template
from tqdm import tqdm
import time
from collections import defaultdict
from dotenv import load_dotenv
import argparse
load_dotenv()
ANSWER_PROMPT = """
You are an intelligent memory assistant tasked with retrieving accurate information from conversation memories.
# CONTEXT:
You have access to memories from a conversation. These memories contain
timestamped information that may be relevant to answering the question.
# INSTRUCTIONS:
1. Carefully analyze all provided memories
2. Pay special attention to the timestamps to determine the answer
3. If the question asks about a specific event or fact, look for direct evidence in the memories
4. If the memories contain contradictory information, prioritize the most recent memory
5. If there is a question about time references (like "last year", "two months ago", etc.),
calculate the actual date based on the memory timestamp. For example, if a memory from
4 May 2022 mentions "went to India last year," then the trip occurred in 2021.
6. Always convert relative time references to specific dates, months, or years. For example,
convert "last year" to "2022" or "two months ago" to "March 2023" based on the memory
timestamp. Ignore the reference while answering the question.
7. Focus only on the content of the memories. Do not confuse character
names mentioned in memories with the actual users who created those memories.
8. The answer should be less than 5-6 words.
# APPROACH (Think step by step):
1. First, examine all memories that contain information related to the question
2. Examine the timestamps and content of these memories carefully
3. Look for explicit mentions of dates, times, locations, or events that answer the question
4. If the answer requires calculation (e.g., converting relative time references), show your work
5. Formulate a precise, concise answer based solely on the evidence in the memories
6. Double-check that your answer directly addresses the question asked
7. Ensure your final answer is specific and avoids vague time references
Memories:
{{memories}}
Question: {{question}}
Answer:
"""
class OpenAIPredict:
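"""Answer LOCOMO questions using memory text stored in memories/{idx}.txt as context for the LLM."""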
def __init__(self, model="gpt-4o-mini"):
self.model = model
self.openai_client = OpenAI()
self.results = defaultdict(list)
def search_memory(self, idx):
with open(f'memories/{idx}.txt', 'r') as file:
memories = file.read()
return memories, 0
def process_question(self, val, idx):
question = val.get('question', '')
answer = val.get('answer', '')
category = val.get('category', -1)
evidence = val.get('evidence', [])
adversarial_answer = val.get('adversarial_answer', '')
response, search_memory_time, response_time, context = self.answer_question(
idx,
question
)
result = {
"question": question,
"answer": answer,
"category": category,
"evidence": evidence,
"response": response,
"adversarial_answer": adversarial_answer,
"search_memory_time": search_memory_time,
"response_time": response_time,
"context": context
}
return result
def answer_question(self, idx, question):
memories, search_memory_time = self.search_memory(idx)
template = Template(ANSWER_PROMPT)
answer_prompt = template.render(
memories=memories,
question=question
)
t1 = time.time()
response = self.openai_client.chat.completions.create(
model=os.getenv("MODEL"),
messages=[
{"role": "system", "content": answer_prompt}
],
temperature=0.0
)
t2 = time.time()
response_time = t2 - t1
return response.choices[0].message.content, search_memory_time, response_time, memories
def process_data_file(self, file_path, output_file_path):
with open(file_path, 'r') as f:
data = json.load(f)
for idx, item in tqdm(enumerate(data), total=len(data), desc="Processing conversations"):
qa = item['qa']
for question_item in tqdm(qa, total=len(qa), desc=f"Processing questions for conversation {idx}", leave=False):
result = self.process_question(
question_item,
idx
)
self.results[idx].append(result)
# Save results after each question is processed
with open(output_file_path, 'w') as f:
json.dump(self.results, f, indent=4)
# Final save at the end
with open(output_file_path, 'w') as f:
json.dump(self.results, f, indent=4)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--output_file_path", type=str, required=True)
args = parser.parse_args()
openai_predict = OpenAIPredict()
openai_predict.process_data_file("../../dataset/locomo10.json", args.output_file_path)

197
evaluation/src/rag.py Normal file
View File

@@ -0,0 +1,197 @@
from openai import OpenAI
import json
import numpy as np
from tqdm import tqdm
from jinja2 import Template
import tiktoken
import time
from collections import defaultdict
import os
from dotenv import load_dotenv
load_dotenv()
PROMPT = """
# Question:
{{QUESTION}}
# Context:
{{CONTEXT}}
# Short answer:
"""
class RAGManager:
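"""RAG baseline: chunk each conversation, embed the chunks, retrieve the top-k most similar chunks per question, and generate a short answer."""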
def __init__(self, data_path="dataset/locomo10_rag.json", chunk_size=500, k=1):
self.model = os.getenv("MODEL")
self.client = OpenAI()
self.data_path = data_path
self.chunk_size = chunk_size
self.k = k
def generate_response(self, question, context):
template = Template(PROMPT)
prompt = template.render(
CONTEXT=context,
QUESTION=question
)
max_retries = 3
retries = 0
while retries <= max_retries:
try:
t1 = time.time()
response = self.client.chat.completions.create(
model=self.model,
messages=[
{"role": "system",
"content": "You are a helpful assistant that can answer "
"questions based on the provided context."
"If the question involves timing, use the conversation date for reference."
"Provide the shortest possible answer."
"Use words directly from the conversation when possible."
"Avoid using subjects in your answer."},
{"role": "user", "content": prompt}
],
temperature=0
)
t2 = time.time()
return response.choices[0].message.content.strip(), t2-t1
except Exception as e:
retries += 1
if retries > max_retries:
raise e
time.sleep(1) # Wait before retrying
def clean_chat_history(self, chat_history):
cleaned_chat_history = ""
for c in chat_history:
cleaned_chat_history += (f"{c['timestamp']} | {c['speaker']}: "
f"{c['text']}\n")
return cleaned_chat_history
def calculate_embedding(self, document):
response = self.client.embeddings.create(
model=os.getenv("EMBEDDING_MODEL"),
input=document
)
return response.data[0].embedding
def calculate_similarity(self, embedding1, embedding2):
return np.dot(embedding1, embedding2) / (
np.linalg.norm(embedding1) * np.linalg.norm(embedding2))
def search(self, query, chunks, embeddings, k=1):
"""
Search for the top-k most similar chunks to the query.
Args:
query: The query string
chunks: List of text chunks
embeddings: List of embeddings for each chunk
k: Number of top chunks to return (default: 1)
Returns:
combined_chunks: The combined text of the top-k chunks
search_time: Time taken for the search
"""
t1 = time.time()
query_embedding = self.calculate_embedding(query)
similarities = [
self.calculate_similarity(query_embedding, embedding)
for embedding in embeddings
]
# Get indices of top-k most similar chunks
if k == 1:
# Original behavior - just get the most similar chunk
top_indices = [np.argmax(similarities)]
else:
# Get indices of top-k chunks
top_indices = np.argsort(similarities)[-k:][::-1]
# Combine the top-k chunks
combined_chunks = "\n<->\n".join([chunks[i] for i in top_indices])
t2 = time.time()
return combined_chunks, t2-t1
def create_chunks(self, chat_history, chunk_size=500):
"""
Create chunks using tiktoken for more accurate token counting
"""
# Get the encoding for the model
encoding = tiktoken.encoding_for_model(os.getenv("EMBEDDING_MODEL"))
documents = self.clean_chat_history(chat_history)
if chunk_size == -1:
return [documents], []
chunks = []
# Encode the document
tokens = encoding.encode(documents)
# Split into chunks based on token count
for i in range(0, len(tokens), chunk_size):
chunk_tokens = tokens[i:i+chunk_size]
chunk = encoding.decode(chunk_tokens)
chunks.append(chunk)
embeddings = []
for chunk in chunks:
embedding = self.calculate_embedding(chunk)
embeddings.append(embedding)
return chunks, embeddings
def process_all_conversations(self, output_file_path):
with open(self.data_path, "r") as f:
data = json.load(f)
FINAL_RESULTS = defaultdict(list)
for key, value in tqdm(data.items(), desc="Processing conversations"):
chat_history = value["conversation"]
questions = value["question"]
chunks, embeddings = self.create_chunks(
chat_history, self.chunk_size
)
for item in tqdm(
questions, desc="Answering questions", leave=False
):
question = item["question"]
answer = item.get("answer", "")
category = item["category"]
if self.chunk_size == -1:
context = chunks[0]
search_time = 0
else:
context, search_time = self.search(
question, chunks, embeddings, k=self.k
)
response, response_time = self.generate_response(
question, context
)
FINAL_RESULTS[key].append({
"question": question,
"answer": answer,
"category": category,
"context": context,
"response": response,
"search_time": search_time,
"response_time": response_time,
})
# Save final results
with open(output_file_path, "w+") as f:
json.dump(FINAL_RESULTS, f, indent=4)

12
evaluation/src/utils.py Normal file
View File

@@ -0,0 +1,12 @@
TECHNIQUES = [
"mem0",
"rag",
"langmem",
"zep",
"openai"
]
METHODS = [
"add",
"search"
]

73
evaluation/src/zep/add.py Normal file
View File

@@ -0,0 +1,73 @@
import argparse
import json
import os
from dotenv import load_dotenv
from tqdm import tqdm
from zep_cloud import Message
from zep_cloud.client import Zep
load_dotenv()
class ZepAdd:
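"""Ingest LOCOMO conversations into Zep, creating one user and one session per conversation."""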
def __init__(self, data_path=None):
self.zep_client = Zep(api_key=os.getenv("ZEP_API_KEY"))
self.data_path = data_path
self.data = None
if data_path:
self.load_data()
def load_data(self):
with open(self.data_path, 'r') as f:
self.data = json.load(f)
return self.data
def process_conversation(self, run_id, item, idx):
conversation = item['conversation']
user_id = f"run_id_{run_id}_experiment_user_{idx}"
session_id = f"run_id_{run_id}_experiment_session_{idx}"
# # delete all memories for the two users
# self.zep_client.user.delete(user_id=user_id)
# self.zep_client.memory.delete(session_id=session_id)
self.zep_client.user.add(user_id=user_id)
self.zep_client.memory.add_session(
user_id=user_id,
session_id=session_id,
)
print("Starting to add memories... for user", user_id)
for key in tqdm(conversation.keys(), desc=f"Processing user {user_id}"):
if key in ['speaker_a', 'speaker_b'] or "date" in key:
continue
date_time_key = key + "_date_time"
timestamp = conversation[date_time_key]
chats = conversation[key]
for chat in tqdm(chats, desc=f"Adding chats for {key}", leave=False):
self.zep_client.memory.add(
session_id=session_id,
messages=[Message(
role=chat['speaker'],
role_type="user",
content=f"{timestamp}: {chat['text']}",
)]
)
def process_all_conversations(self, run_id):
if not self.data:
raise ValueError("No data loaded. Please set data_path and call load_data() first.")
for idx, item in tqdm(enumerate(self.data)):
if idx == 0:
self.process_conversation(run_id, item, idx)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--run_id", type=str, required=True)
args = parser.parse_args()
zep_add = ZepAdd(data_path="../../dataset/locomo10.json")
zep_add.process_all_conversations(args.run_id)

View File

@@ -0,0 +1,148 @@
import argparse
from collections import defaultdict
from dotenv import load_dotenv
from jinja2 import Template
from openai import OpenAI
from tqdm import tqdm
from zep_cloud import EntityEdge, EntityNode
from zep_cloud.client import Zep
import json
import os
import pandas as pd
import time
from prompts import ANSWER_PROMPT_ZEP
load_dotenv()
TEMPLATE = """
FACTS and ENTITIES represent relevant context to the current conversation.
# These are the most relevant facts and their valid date ranges
# format: FACT (Date range: from - to)
{facts}
# These are the most relevant entities
# ENTITY_NAME: entity summary
{entities}
"""
class ZepSearch:
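"""Answer LOCOMO questions by retrieving facts and entities from Zep graph search and prompting an LLM with that context."""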
def __init__(self):
self.zep_client = Zep(api_key=os.getenv("ZEP_API_KEY"))
self.results = defaultdict(list)
self.openai_client = OpenAI()
def format_edge_date_range(self, edge: EntityEdge) -> str:
# return f"{datetime(edge.valid_at).strftime('%Y-%m-%d %H:%M:%S') if edge.valid_at else 'date unknown'} - {(edge.invalid_at.strftime('%Y-%m-%d %H:%M:%S') if edge.invalid_at else 'present')}"
return f"{edge.valid_at if edge.valid_at else 'date unknown'} - {(edge.invalid_at if edge.invalid_at else 'present')}"
def compose_search_context(self, edges: list[EntityEdge], nodes: list[EntityNode]) -> str:
facts = [f' - {edge.fact} ({self.format_edge_date_range(edge)})' for edge in edges]
entities = [f' - {node.name}: {node.summary}' for node in nodes]
return TEMPLATE.format(facts='\n'.join(facts), entities='\n'.join(entities))
def search_memory(self, run_id, idx, query, max_retries=3, retry_delay=1):
start_time = time.time()
retries = 0
while retries < max_retries:
try:
user_id = f"run_id_{run_id}_experiment_user_{idx}"
session_id = f"run_id_{run_id}_experiment_session_{idx}"
edges_results = (self.zep_client.graph.search(user_id=user_id, reranker='cross_encoder', query=query, scope='edges', limit=20)).edges
node_results = (self.zep_client.graph.search(user_id=user_id, reranker='rrf', query=query, scope='nodes', limit=20)).nodes
context = self.compose_search_context(edges_results, node_results)
break
except Exception as e:
print("Retrying...")
retries += 1
if retries >= max_retries:
raise e
time.sleep(retry_delay)
end_time = time.time()
return context, end_time - start_time
def process_question(self, run_id, val, idx):
question = val.get('question', '')
answer = val.get('answer', '')
category = val.get('category', -1)
evidence = val.get('evidence', [])
adversarial_answer = val.get('adversarial_answer', '')
response, search_memory_time, response_time, context = self.answer_question(
run_id,
idx,
question
)
result = {
"question": question,
"answer": answer,
"category": category,
"evidence": evidence,
"response": response,
"adversarial_answer": adversarial_answer,
"search_memory_time": search_memory_time,
"response_time": response_time,
"context": context
}
return result
def answer_question(self, run_id, idx, question):
context, search_memory_time = self.search_memory(run_id, idx, question)
template = Template(ANSWER_PROMPT_ZEP)
answer_prompt = template.render(
memories=context,
question=question
)
t1 = time.time()
response = self.openai_client.chat.completions.create(
model=os.getenv("MODEL"),
messages=[
{"role": "system", "content": answer_prompt}
],
temperature=0.0
)
t2 = time.time()
response_time = t2 - t1
return response.choices[0].message.content, search_memory_time, response_time, context
def process_data_file(self, file_path, run_id, output_file_path):
with open(file_path, 'r') as f:
data = json.load(f)
for idx, item in tqdm(enumerate(data), total=len(data), desc="Processing conversations"):
qa = item['qa']
for question_item in tqdm(qa, total=len(qa), desc=f"Processing questions for conversation {idx}", leave=False):
result = self.process_question(
run_id,
question_item,
idx
)
self.results[idx].append(result)
# Save results after each question is processed
with open(output_file_path, 'w') as f:
json.dump(self.results, f, indent=4)
# Final save at the end
with open(output_file_path, 'w') as f:
json.dump(self.results, f, indent=4)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--run_id", type=str, required=True)
args = parser.parse_args()
zep_search = ZepSearch()
zep_search.process_data_file("../../dataset/locomo10.json", args.run_id, "results/zep_search_results.json")