Memeval Framework Tests AI Agent Memory Performance

DEV.to
Tuesday, June 2, 2026
  • Anupam Gevariya launched Memeval, an open-source testing framework for verifying the memory performance of AI agents.
  • The framework supports standardized benchmarking for providers like Mem0, Zep, Letta, LangGraph, and CrewAI using a unified protocol.
  • Memeval evaluates seven key metrics including recall, relevance, contradiction detection, latency, and privacy isolation through YAML-defined test cases.

Anupam Gevariya released Memeval, an open-source testing framework designed to evaluate the memory capabilities of AI agents. While tools like LangSmith and Ragas exist for prompts and RAG (retrieval-augmented generation) pipelines, memory systems often lack standardized testing, leaving developers to discover failures through user feedback. Memeval fills this gap by executing standardized YAML-based scenarios against various memory backends to identify recall, consistency, and privacy issues.

The framework uses a Standard Memory Protocol (SMP) as the interface between the evaluation harness and the supported providers, which include Mem0, Zep, Letta, LangGraph, and CrewAI. Because every provider is reached through the same interface, developers can run the same metrics across different memory architectures without modifying test code. The tool includes 30 built-in test cases categorized by session, core storage, lifecycle, governance, and operations.
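
To make the adapter idea concrete, here is a minimal sketch of how one harness could drive interchangeable backends through a shared interface. It does not reproduce Memeval's actual SMP: the names `MemoryBackend`, `InMemoryBackend`, `run_scenario`, and the scenario fields are all hypothetical, and the scenario is a plain dictionary standing in for a YAML-defined test case.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class MemoryRecord:
    """One stored memory item (hypothetical shape, not Memeval's schema)."""
    user_id: str
    text: str


class MemoryBackend(Protocol):
    """Hypothetical stand-in for a shared memory protocol such as SMP."""

    def add(self, user_id: str, text: str) -> None: ...
    def search(self, user_id: str, query: str, k: int = 3) -> list[str]: ...


@dataclass
class InMemoryBackend:
    """Toy adapter: keyword-overlap search over a Python list."""
    records: list[MemoryRecord] = field(default_factory=list)

    def add(self, user_id: str, text: str) -> None:
        self.records.append(MemoryRecord(user_id, text))

    def search(self, user_id: str, query: str, k: int = 3) -> list[str]:
        terms = set(query.lower().split())
        scored = []
        for rec in self.records:
            if rec.user_id != user_id:
                continue  # never return another user's memories
            overlap = len(terms & set(rec.text.lower().split()))
            if overlap:
                scored.append((overlap, rec.text))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


def run_scenario(backend: MemoryBackend, scenario: dict) -> bool:
    """Write the scenario's facts, then check the expected text is retrieved."""
    for fact in scenario["facts"]:
        backend.add(scenario["user_id"], fact)
    hits = backend.search(scenario["user_id"], scenario["query"])
    return any(scenario["expect"] in hit for hit in hits)


# Dictionary stand-in for a YAML-defined test case (field names are assumed).
scenario = {
    "user_id": "u1",
    "facts": ["My favorite editor is Vim", "I live in Lisbon"],
    "query": "favorite editor",
    "expect": "Vim",
}
```

Because `run_scenario` depends only on the `add` and `search` methods, a second adapter wrapping a real provider could be passed in without touching the scenario or the check, which is the property the article attributes to SMP.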

Memeval evaluates performance across seven specific dimensions: recall accuracy, relevance, consistency (contradiction detection), update propagation, forgetting quality, latency and cost, and privacy isolation. For example, the consistency metric uses embedding-based detection to identify numeric or structural contradictions in stored facts. When a scenario fails, a built-in diagnostic tool provides specific feedback, such as whether a system retrieved stale data rather than updated information.
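
Two of these dimensions can be illustrated with a short sketch. The code below is a deliberate simplification and not Memeval's implementation: recall is scored by case-insensitive substring matching, and the numeric contradiction check swaps the embedding-based detection described above for a much cruder technique, grouping facts whose wording is identical once numbers are masked. The function names are hypothetical.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def recall_accuracy(expected: list[str], retrieved: list[str]) -> float:
    """Fraction of expected facts found (case-insensitively) in the retrieved text."""
    if not expected:
        return 1.0
    joined = " ".join(retrieved).lower()
    found = sum(1 for fact in expected if fact.lower() in joined)
    return found / len(expected)


def numeric_contradictions(facts: list[str]) -> list[tuple[str, str]]:
    """Pair facts that share wording once numbers are masked but differ in the numbers."""
    by_template: dict[str, list[str]] = {}
    for fact in facts:
        template = NUMBER.sub("<num>", fact.lower())
        by_template.setdefault(template, []).append(fact)

    conflicts = []
    for group in by_template.values():
        for i, first in enumerate(group):
            for second in group[i + 1:]:
                if NUMBER.findall(first) != NUMBER.findall(second):
                    conflicts.append((first, second))
    return conflicts


stored = [
    "The user is 29 years old",
    "The user is 31 years old",
    "The user owns 2 cats",
]
print(numeric_contradictions(stored))
print(recall_accuracy(["vim", "lisbon"], ["My favorite editor is Vim"]))
```

The masking approach only catches contradictions between near-identical sentences; paraphrased facts ("aged 29" versus "is 31 years old") would slip through, which is presumably why a similarity-based method is used in practice.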

Benchmark results show that providers handle these memory tasks differently. Mem0 uses LLM-based fact extraction to improve recall accuracy, but incurs higher latency because its write operations depend on an LLM. LangGraph, by contrast, demonstrates perfect recall and update propagation but weaker relevance ranking, while Zep's asynchronous knowledge graph processing may affect real-time agent performance. Memeval also integrates the LongMemEval benchmark (Wu et al., ICLR 2025) to test retrieval performance on 500 QA pairs derived from multi-session conversations. The framework is available as a Python package, with optional dependencies for specific adapters and machine learning libraries.
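
The latency and retrieval comparisons above rest on two simple measurements, sketched here with the standard library only. This is an illustration under stated assumptions, not Memeval's code: `time_calls`, `summarize`, and `recall_at_k` are hypothetical helpers, and the recall@k shown is a generic formulation that may differ from the metric Memeval or LongMemEval actually reports.

```python
import statistics
import time
from typing import Callable


def time_calls(fn: Callable[[str], object], inputs: list[str]) -> list[float]:
    """Return the wall-clock duration of each call in milliseconds."""
    durations = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def summarize(durations_ms: list[float]) -> dict[str, float]:
    """Median and worst-case latency for a batch of timed calls."""
    return {
        "p50_ms": statistics.median(durations_ms),
        "max_ms": max(durations_ms),
    }


def recall_at_k(ranked_ids: list[str], gold_ids: set[str], k: int) -> float:
    """Share of gold item ids that appear among the top-k retrieved ids."""
    if not gold_ids:
        return 0.0
    return len(gold_ids & set(ranked_ids[:k])) / len(gold_ids)


# Time a stand-in write operation; a real run would pass a backend's write method.
write_times = time_calls(lambda text: text.upper(), ["fact one", "fact two"])
print(summarize(write_times))
```

Timing the write path separately from the search path matters for the comparison in the article, since a backend that calls an LLM on every write can look fast on retrieval while still being slow to ingest.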

Read original (English)·Jun 1, 2026
#memeval#ai agent#memory#benchmark#rag#evaluation#opensource