Memory Agent Evaluation
Memory is foundational for building truly context-aware and personalized AI applications. In this post we explore how to construct a robust Memory Agent that uses langgraph for orchestration. This agent not only remembers facts but also decides when to save new information and how to use it.

This agent performs several memory-related functions, and it can be evaluated on the following criteria:
- Routing correctness: how accurately it calls the memory methods, such as updating memory.
- Writing correctness: whether the memory after an update matches what the user meant.
- Retrieval correctness: whether the agent retrieves the right memories for a given query.
- Temporal correctness: whether time-related fields and status transitions are correct.
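Routing correctness is the easiest of these to score mechanically: label each test utterance with the memory subsystem it should trigger and compare against what the agent actually chose. A minimal sketch, where the test cases and `predicted_route` values are hypothetical illustrations rather than real agent output:

```python
# Hypothetical labeled test cases: `expected_route` is the ground-truth label,
# `predicted_route` stands in for the route the agent actually took.
test_cases = [
    {"input": "My name is Sam",            "expected_route": "user",         "predicted_route": "user"},
    {"input": "Add 'buy milk' to my list", "expected_route": "todo",         "predicted_route": "todo"},
    {"input": "Always answer briefly",     "expected_route": "instructions", "predicted_route": "todo"},
    {"input": "What's the weather?",       "expected_route": "none",         "predicted_route": "none"},
]

def routing_accuracy(cases):
    """Fraction of cases where the agent routed to the expected memory subsystem."""
    hits = sum(c["expected_route"] == c["predicted_route"] for c in cases)
    return hits / len(cases)

print(routing_accuracy(test_cases))  # 3 of 4 routes match -> 0.75
```

The same harness extends naturally to per-route precision and recall if one subsystem (for example, `instructions`) is triggered rarely.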
🌐 High-Level Architecture: From Stateless to Stateful
The memory agent, called task_mAIstro, acts as a central orchestrator that transforms an LLM into a context-aware system. It differentiates itself through four architectural pillars:
- Contextual Persistence: It implements a tiered memory system (Semantic, Episodic, Procedural).
- Dynamic Routing: It utilizes a graph-based state machine to determine when to write to memory versus when to simply respond.
- Verifiability: It rejects “black box” memory in favor of inspectable, structured storage.
- Composability: built on the `langgraph` framework, allowing for modular extension.

The Execution Flow
- Ingestion: User input enters the system.
- Cognitive Routing: A decision tool (`UpdateMemory`) analyzes the intent. Does this input require a change to the user's profile, the task list, or the rules of engagement?
- State Transition: The graph routes the state to a specific update node (`update_profile`, `update_todos`, etc.).
- Consolidation: Data is committed to the `InMemoryStore` (long-term) or `MemorySaver` (short-term/thread-level).
- Feedback Loop: The system verifies the write operation via internal "Spy" listeners or external user confirmation.
🧠 The Taxonomy of AI Memory
To build a truly intelligent agent, we must move beyond a generic “context window.” task_mAIstro segments data into three distinct cognitive categories, mimicking human memory systems:
| Memory Type | Cognitive Equivalent | Description | Data Structure |
|---|---|---|---|
| Semantic | User Profile | General facts and static knowledge about the user (Who they are). | Profile Schema |
| Episodic | ToDo List | Temporal, event-based memory tracking specific tasks and states (What they need to do). | ToDo Schema |
| Procedural | Instructions | Implicit rules and preferences governing how the agent should behave. | String / List |
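One simple way to keep these three memory types separate in a key-value store is to namespace them per user. The sketch below mirrors the taxonomy; the namespace labels are assumptions for illustration, not names mandated by any library:

```python
# Illustrative per-user namespace layout for a tiered KV memory store.
# The labels "profile", "todo", and "instructions" are assumed names
# chosen to match the taxonomy table, not library-defined constants.
def memory_namespaces(user_id: str) -> dict:
    """Map each memory type to its storage namespace for a given user."""
    return {
        "semantic": ("profile", user_id),        # who the user is
        "episodic": ("todo", user_id),           # what they need to do
        "procedural": ("instructions", user_id), # how the agent should behave
    }

print(memory_namespaces("Lance")["procedural"])  # ('instructions', 'Lance')
```

Keeping the user ID inside the namespace tuple means a single store instance can serve many users without cross-contamination.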
Structured Schema Definitions
By using Pydantic, we enforce strict typing on our memory, preventing hallucinations from corrupting the agent’s long-term state.
```python
from pydantic import BaseModel, Field
from typing import Optional, Literal, List
from datetime import datetime

# Semantic Memory (The "Who")
class Profile(BaseModel):
    name: Optional[str] = Field(default=None, description="User's preferred name")
    location: Optional[str] = Field(default=None, description="Current geographical context")
    job: Optional[str] = Field(default=None, description="Professional role")
    interests: List[str] = Field(default_factory=list, description="Topics of engagement")

# Episodic Memory (The "What")
class ToDo(BaseModel):
    task: str = Field(description="Actionable item")
    deadline: Optional[datetime] = Field(default=None, description="Hard constraint for completion")
    solutions: List[str] = Field(default_factory=list, description="Proposed pathways to completion")
    status: Literal["not started", "in progress", "done", "archived"]

# Procedural memory is handled via raw instruction-string injection.
```
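To see why strict typing matters, note that an out-of-vocabulary status is rejected at validation time instead of silently corrupting the agent's long-term state. A minimal, self-contained sketch using the `ToDo` schema above:

```python
from datetime import datetime
from typing import List, Literal, Optional

from pydantic import BaseModel, Field, ValidationError

class ToDo(BaseModel):
    task: str = Field(description="Actionable item")
    deadline: Optional[datetime] = Field(default=None, description="Hard constraint for completion")
    solutions: List[str] = Field(default_factory=list, description="Proposed pathways to completion")
    status: Literal["not started", "in progress", "done", "archived"]

# A well-formed write is accepted...
todo = ToDo(task="Book flights", status="in progress")
print(todo.status)  # in progress

# ...but a hallucinated status never reaches the store.
try:
    ToDo(task="Book flights", status="someday")
except ValidationError:
    print("rejected: 'someday' is not a valid status")
```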
🔗 The StateGraph: Orchestrating Cognition
The nervous system of task_mAIstro is the StateGraph. This defines the deterministic flows between reasoning and memory updates.
Unlike a linear chain, this graph allows for conditional branching:
```python
from typing import Literal, TypedDict

from langgraph.graph import END, StateGraph

# Define the State schema to track routing intent
class AgentState(TypedDict):
    update_type: Literal["user", "todo", "instructions", "none"]
    messages: list

# Initialize the graph
builder = StateGraph(AgentState)

# 1. Define Nodes (The Functional Units)
builder.add_node("task_mAIstro", task_mAIstro_agent)
builder.add_node("update_profile", update_profile_tool)
builder.add_node("update_todos", update_todos_tool)
builder.add_node("update_instructions", update_instructions_tool)

# 2. Define Conditional Routing (The Logic)
def route_memory(state: AgentState):
    """Determines which memory subsystem requires an update."""
    return state.get("update_type", "none")

builder.add_conditional_edges(
    "task_mAIstro",
    route_memory,
    {
        "user": "update_profile",
        "todo": "update_todos",
        "instructions": "update_instructions",
        "none": END,
    },
)

# 3. Define Re-entry (The Feedback Loop)
# After memory is updated, loop back to the agent to confirm or continue
builder.add_edge("update_profile", "task_mAIstro")
builder.add_edge("update_todos", "task_mAIstro")
builder.add_edge("update_instructions", "task_mAIstro")

# 4. Compile
builder.set_entry_point("task_mAIstro")
graph = builder.compile()
```
✅ Trust and Verification
A memory system is only as good as its accuracy. task_mAIstro implements a “Trust but Verify” protocol:
- Observability (the "Spy"): We attach a listener to the tool execution stream. This allows developers to inspect the raw `PatchDoc` calls, seeing exactly how the JSON patch was applied to the memory store. Example log output:

  `Document 0 updated: Plan: Replace 'location' value. Result: 'Sammamish, WA'`

- Explicit User Confirmation: For high-stakes episodic memory (like modifying a deadline), the agent is instructed to request verbal confirmation before committing the state change.
- Direct Store Inspection: Developers can bypass the agent simulation and query the KV (key-value) store directly to debug state issues.

```python
# Direct access to the instruction namespace for user 'Lance'
current_rules = across_thread_memory.get(("instructions", "Lance"))
```
(More evaluation code is upcoming.)
✅ Metrics
Let’s apply one metric as an example to showcase evaluation of the memory agent. Recall the write-correctness metric:
- Write correctness: the agent stores the correct memory after an update.
```python
from langgraph.store.memory import InMemoryStore

store = InMemoryStore()
user_id = "u123"
thread_id = "t123"
profile_ns = ("profile", user_id)

# Initial write
graph.invoke(
    {"messages": [{"role": "user", "content": "My name is Sam"}]},
    config={"configurable": {"user_id": user_id, "thread_id": thread_id}},
    store=store,
)

# Update the profile
graph.invoke(
    {"messages": [{"role": "user", "content": "Actually use my full name Samantha"}]},
    config={"configurable": {"user_id": user_id, "thread_id": thread_id}},
    store=store,
)

# Retrieve the memory and compare against the expected profile
result = store.search(profile_ns)
<your-favorite-evaluator>(result, {"name": "Samantha"})
```
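The evaluator placeholder above can be as simple as an exact-match check over the fields you care about. A minimal sketch, treating each retrieved memory as a plain dict of profile fields for illustration:

```python
def write_correctness(retrieved: list, expected: dict) -> bool:
    """True if any retrieved memory contains all expected field values.

    `retrieved` stands in for the results of a store search; each item is
    modeled here as a plain dict of profile fields for illustration.
    """
    return any(
        all(memory.get(k) == v for k, v in expected.items())
        for memory in retrieved
    )

# After the update, the stored profile should carry the full name.
memories = [{"name": "Samantha", "location": None}]
print(write_correctness(memories, {"name": "Samantha"}))  # True
print(write_correctness(memories, {"name": "Sam"}))       # False
```

An LLM-as-judge evaluator can replace the exact-match check when the memory is free-form text rather than structured fields.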
✅ Memory Benchmark
Public benchmarks such as LongMemEval evaluate long-term interactive memory in chat assistants. They test whether a system can store, retrieve, update, and reason over information that appears across many user-assistant sessions.
Example of Information Extraction
Session 1
User: My cat Luna is 3 years old and loves salmon treats.
Session 2
Q: What is the name of my cat?
Expected Answer: Luna
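In the benchmark's JSON, each test case bundles the session history with a question. A minimal sketch of a test-case record, reduced to the fields the evaluation loop below relies on (real benchmark records carry additional fields, and the per-turn `role`/`content` keys are assumptions of this sketch):

```python
# Minimal shape of a LongMemEval-style test case, reduced to the fields
# used by the replay loop. Real records carry additional fields.
testcase = {
    "question_id": "qa-0001",
    "question": "What is the name of my cat?",
    "haystack_sessions": [
        # Each session is a list of chat turns.
        [
            {"role": "user", "content": "My cat Luna is 3 years old and loves salmon treats."},
            {"role": "assistant", "content": "Noted! Luna sounds lovely."},
        ],
    ],
}

# The replay loop walks every session and every turn; in user-only mode it
# feeds just the user turns to the agent under test.
user_turns = [
    turn["content"]
    for session in testcase["haystack_sessions"]
    for turn in session
    if turn["role"] == "user"
]
print(user_turns[0])  # the fact the agent must remember
```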
Code snippet showcasing memory agent evaluation through the benchmark:
- Step A: Run a given test case

```python
def run_test_case(app, testcase, ..):
    # Get session history
    qid = testcase["question_id"]
    haystack_sessions = testcase["haystack_sessions"]

    # Step 1: Replay session turns
    for si, session in enumerate(haystack_sessions):
        # If running in user-only mode for evaluation, run each user turn
        for ti, turn in enumerate(session):
            _ = invoke_app(
                app,
                user_id="<user_id>",
                thread_id="<thread-id>",
                message="<context prefix such as date>" + turn.get("content", ""),
            )
        # Otherwise, ingest the session as history
        _ = invoke_app(
            app,
            user_id="<user_id>",
            thread_id="<thread-id>",
            message="<context prefix such as date>" + "<concatenated_session_history>",
        )

    # Step 2: Ask the benchmark question in a new turn
    return {
        "hypothesis_id": qid,
        "hypothesis": invoke_app(
            app,
            user_id="<user_id>",
            thread_id="<thread-id-qa>",
            message="<context prefix such as date>" + testcase["question"],
        ),
    }
```

- Step B: Iterate over the dataset to generate predictions

```python
for _, testcase in enumerate(longmemeval_dataset):
    prediction = run_test_case(app, testcase, ..)
    file.write(json.dumps(prediction) + "\n")
```

- Step C: Scoring with the LongMemEval LLM-as-judge

```
python3 evaluate_qa.py \
    gpt-4o \
    <path-to-predictions.jsonl file> \
    <path-to-longmemeval benchmark.json file>
```

This will produce per-item labels and aggregated metrics.
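Once the judge has labeled each prediction, aggregate accuracy is just the fraction of correct labels. A minimal sketch, assuming hypothetical per-item records with a boolean `correct` field (the judge's actual output schema may differ):

```python
# Hypothetical per-item judge labels; the real scoring script's output
# schema may differ, so treat `correct` as an illustrative field name.
judged = [
    {"hypothesis_id": "qa-0001", "correct": True},
    {"hypothesis_id": "qa-0002", "correct": False},
    {"hypothesis_id": "qa-0003", "correct": True},
]

accuracy = sum(item["correct"] for item in judged) / len(judged)
print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.67
```

Slicing the same labels by question type (single-session recall, multi-session reasoning, knowledge updates) reveals which memory subsystem is failing.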
🚀 Conclusion
The memory agent represents a shift from reactive AI to accumulative AI. By unifying semantic, episodic, and procedural memory into a single graph-based architecture, we create agents that do not just process tokens but build relationships and maintain continuity.
This architecture balances the flexibility of LLMs with the rigidity required for reliable software, ensuring your agent becomes a long-term partner rather than a fleeting conversationalist.
