
We measured recall latency. Here is what we found.

2026-06-22 · pgmnemo contributors · ~5 min read

Tags: field-notes, v0.10.0, latency, benchmarks

Every memory provider publishes recall quality (recall@K). None that we have found publishes recall latency for a synchronous agent call. So we measured ours on a live 7,612-lesson corpus and shipped the result as recall_fast(): p50 22–46ms, p95 57–99ms at k=10, 0% timeouts. Here are the numbers, why recall_fast is now the default, and the tradeoff we are not hiding.

The question nobody answers

When an agent fires a memory lookup in the middle of a turn, that lookup is on the critical path of the response. Two questions decide whether memory is usable at that point: how fast is recall, and where in the agent's lifecycle should you attach it. The field answers the first with silence. We checked the public docs for Mem0, Zep, Letta, Constructive, and agentmemory: recall@K tables, yes; a p50/p95 retrieval-latency table for a synchronous call, none that we could find. If a provider publishes one, tell us and we will link it.

What we measured

We ran two paths over the same live corpus: recall_fast() (pure HNSW vector recall, ORDER BY embedding <=> query LIMIT k, no BM25 scan, no graph BFS) and the full recall_hybrid(). The corpus is real: a live production agent-memory deployment with 7,612 active embedded lessons connected by 73,258 temporal/causal edges, written by a busy multi-agent system. The methodology is published alongside the numbers in benchmarks/gate/v0.10.0-recall-at-k.json.

Path (k=10)	p50	p95	Timeout rate
`recall_fast()`	22–46ms	57–99ms	0%
`recall_hybrid()`	92–147ms	164–2,119ms	13.8–27%

The two numbers in each cell are the unfiltered run and a project-filtered run; both are in the JSON. The headline: recall_fast answers a synchronous turn in under 100ms at p95 (99ms in the slower of the two runs), on a corpus that is not a toy.
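For readers who want to check how p50, p95 and a timeout rate fall out of raw samples, here is a minimal sketch. It assumes nearest-rank percentiles and assumes that latency is summarized over completed calls only, with timeouts counted separately; the post does not say which percentile method the benchmark uses, and the names percentile and summarize are ours, not part of pgmnemo.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples to summarize")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, timed_out):
    """latencies_ms: durations of calls that completed.
    timed_out: number of calls that hit the timeout instead."""
    total = len(latencies_ms) + timed_out
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "timeout_rate": timed_out / total,
    }
```

Keeping the timeout count out of the latency list matters: a path can post a respectable p50 while still failing a large share of calls, which is why the table reports both.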

Why recall_fast is the default, and it is not about quality

It would be easy to frame this as "fast path trades accuracy for speed." That is not what the data says. The honest reason recall_fast is the MCP default is reliability: on this corpus, recall_hybrid times out on 13.8–27% of queries. Its recursive graph_walk CTE walks 73k+ edges, which goes sequential on a dense graph; even among the hybrid calls that succeed, p95 reaches 2,119ms in the slower of the two runs. A retrieval path that fails roughly one query in four to seven, and can take two seconds at p95 when it does succeed, is not a sane default for a synchronous agent call. So recall_fast answers by default, and you opt into the heavy path:

-- MCP default: pure-HNSW, sub-100ms p95
SELECT * FROM pgmnemo.recall_fast(query_embedding := $1, k := 10);

-- opt into full fusion (BM25 + graph + recency + provenance scoring)
SELECT * FROM pgmnemo.recall_hybrid(query_text := 'retry storm', k := 10);
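On the client side, the default-plus-opt-in choice could be routed roughly as follows. This is a sketch under assumptions: build_recall_query and its deep flag are hypothetical names of ours, the %s placeholders assume a psycopg-style driver, and the only SQL used is the two statements shown above.

```python
# The two statements from the post, with driver placeholders for the arguments.
FAST_SQL = "SELECT * FROM pgmnemo.recall_fast(query_embedding := %s, k := %s);"
HYBRID_SQL = "SELECT * FROM pgmnemo.recall_hybrid(query_text := %s, k := %s);"


def build_recall_query(query_text, query_embedding, k=10, deep=False):
    """Return (sql, params) for the fast path unless the caller opts in.

    Hypothetical helper: it only selects a statement, it does not execute it.
    """
    if deep:
        return HYBRID_SQL, (query_text, k)
    if query_embedding is None:
        # Fail loudly on the client rather than send a NULL embedding.
        raise ValueError("the fast path needs a query embedding")
    return FAST_SQL, (query_embedding, k)
```

A caller would pass the returned pair to its driver's execute method; the point of the sketch is only that the heavy path is never taken implicitly.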

The tradeoff we are not hiding

recall_fast reaches 80% overlap@10 with full hybrid. That figure is measured, and it corrects an earlier analytical estimate of 90%. At k=20 the median overlap rises to 90%. But the bottom of the distribution is real: overlap@10 p10 = 40%, meaning the worst 10% of queries miss at least 60% of what hybrid would have surfaced. For routine dispatch-time recall (top-5 to top-10 lessons), 80% overlap at 2–6.6× lower latency is the right trade. For exploratory tasks, safety-critical lookups, or sparse-embedding corpora, pass deep=true and take the hybrid path.
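To make the overlap numbers concrete, here is one way overlap@k and its distribution could be computed. The definition is our assumption: the fraction of hybrid's top-k that also appears in the fast path's top-k. The post does not spell out its formula, and overlap_at_k and overlap_summary are illustrative names.

```python
import math
import statistics


def overlap_at_k(fast_ids, hybrid_ids, k):
    """Fraction of hybrid's top-k result ids that the fast path also returned."""
    reference = set(hybrid_ids[:k])
    if not reference:
        raise ValueError("hybrid returned no results to compare against")
    return len(set(fast_ids[:k]) & reference) / len(reference)


def overlap_summary(scores):
    """Median and nearest-rank 10th percentile of per-query overlap scores."""
    if not scores:
        raise ValueError("no scores to summarize")
    ordered = sorted(scores)
    rank = max(1, math.ceil(10 * len(ordered) / 100))
    return {"median": statistics.median(ordered), "p10": ordered[rank - 1]}
```

Reporting the low percentile next to the median is the useful part: a healthy median can hide a tail of queries where the fast path surfaces much less than hybrid would.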

The framing we hold ourselves to: "first published" is a claim about what we can find, not a law of nature. We have not seen p50/p95 synchronous-recall latency published by another SQL-native memory layer. If that changes, this post changes.

v0.10.1: the hybrid path got more honest too

The hotfix that followed hardens the recall_hybrid BM25 path in four ways: a timeout now degrades gracefully to vector-only instead of erroring, the lexical query is capped at 200 characters, the search uses the indexed full_text column, and the text-search config switches to simple for correct tokenization of Cyrillic and structured lesson text. On those inputs the BM25 p95 dropped from 910ms to 44ms. And recall_fast(NULL, …) now raises instead of silently returning NULL scores, which closes a quiet data-corruption path.
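The control flow of that hardening can be illustrated outside the database. The sketch below is not the extension's implementation, which lives in SQL. vector_search and lexical_search are hypothetical callables supplied by the caller, the TimeoutError signal is an assumption, and the order-preserving union is a placeholder for the real fusion scoring, which this post does not describe.

```python
MAX_LEXICAL_CHARS = 200  # lexical query cap mentioned in the post


def recall_with_fallback(query_text, query_embedding,
                         vector_search, lexical_search, k=10):
    """Return (hits, mode), where mode is "hybrid" or "vector_only"."""
    if query_embedding is None:
        # Raise instead of quietly producing meaningless scores.
        raise ValueError("query_embedding must not be None")
    vector_hits = list(vector_search(query_embedding, k))
    try:
        lexical_hits = list(lexical_search(query_text[:MAX_LEXICAL_CHARS], k))
    except TimeoutError:
        # Degrade to vector-only rather than failing the whole call.
        return vector_hits, "vector_only"
    # Placeholder merge: union that keeps first-seen order, vector hits first.
    merged = list(dict.fromkeys(vector_hits + lexical_hits))[:k]
    return merged, "hybrid"
```

Returning the mode alongside the hits lets the caller log how often the degraded path is taken, which is the number to watch after an upgrade.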

Reproduce it

The benchmark script, the corpus description, and the raw JSON are in the repo. Run it against your own Postgres and your own lessons; the only honest latency number is the one measured on the corpus you actually recall against.

ALTER EXTENSION pgmnemo UPDATE TO '0.10.1';
-- benchmarks/gate/v0.10.0-recall-at-k.json holds the methodology + numbers
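If you would rather time your own calls than run the repo's script, a minimal harness might look like this. It is a sketch, not the published benchmark: recall is any callable you supply that issues one query, and we assume it raises TimeoutError when the call times out. time_calls is our name.

```python
import time


def time_calls(recall, queries):
    """Time each call in milliseconds and count the calls that time out.

    Returns (latencies_ms, timeouts); timed-out calls are not included
    in latencies_ms.
    """
    latencies_ms = []
    timeouts = 0
    for query in queries:
        start = time.perf_counter()
        try:
            recall(query)
        except TimeoutError:
            timeouts += 1
            continue
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms, timeouts
```

Measure from the process that will actually make the call, so that network and driver overhead are part of the number you report.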
