Benchmark overview
Official v7.0.0 benchmark snapshot (21 repos, including R language)
Benchmark ID: sigmap-v7.0-main · Date: 2026-06-14
| Metric | Value |
|---|---|
| Hit@5 (18 core repos) | 76% vs 13.6% baseline |
| Token reduction (21 repos) | 97.0% |
| Retrieval lift | 5.6× |
| Prompt reduction | 39.4% (2.84 → 1.72) |
| Task success proxy | 52.2% |
| GPT-4o overflow (without → with) | 16/21 → 0/21 |
This is the landing page for the public benchmark story. It answers four different questions:
| If you want to prove... | Open |
|---|---|
| SigMap reduces context size dramatically | Token reduction |
| SigMap finds the right file more often | Retrieval benchmark |
| SigMap reduces retries and wrong-context answers | Task benchmark |
| SigMap keeps large repos inside model limits | Quality benchmark |
Official v7.0.0 snapshot (with R language support)
Latest saved benchmark run: 2026-06-14 (v7.0.0)
| Metric | Result |
|---|---|
| Token reduction repos | 21 (including R: ggplot2, dplyr, shiny) |
| Retrieval benchmark repos | 18 (core languages) |
| Total tasks | 90 |
| Average token reduction (all 21) | 97.0% |
| Retrieval hit@5 (18 core) | 76% |
| Graph-boosted hit@5 | 76% |
| Random baseline hit@5 | 13.6% |
| Prompt reduction | 39.4% (2.84 → 1.72 prompts) |
| GPT-4o overflow repos without SigMap | 16 / 21 |
| GPT-4o monthly input savings at 10 calls/day | $9,899.90 |
What each benchmark proves
1. Token reduction (21 repositories)
- Raw source across benchmark set: 13,499,894 tokens (21 repos)
- Final SigMap output: ~470,000 tokens
- Overall reduction: 97.0%
- New in v6.11.1: R language support verified
- ggplot2: 94.3% reduction (381.5K → 21.7K tokens)
- dplyr: 93.4% reduction (145.1K → 9.5K tokens)
- shiny: 96.2% reduction (264.6K → 10.0K tokens)
- New in v6.12.0: demand-driven Surgical Context (
ask --mode index+ theget_linesMCP tool) cuts upfrontaskcontext further on top of the figures above by emitting symbol pointers instead of bodies — see the Surgical Context guide. - Line anchors (v6.13.0): extended to JavaScript and to class methods / interface members (TS & JS), raising index-mode token reduction on real repos from ~4.6% to 32–42% (axios 42.1%, fastify 41.1%, svelte 36.8%, vue-core 32.4%).
2. Retrieval quality
- SigMap hit@5: 76%
- Graph-boosted hit@5: 76% (+0.0pp with dependency graph)
- Random baseline: 13.6%
- Lift: 6.0x
This is the best benchmark when the question is: "Does SigMap actually put the right file in context?"
3. Task outcomes
- Correct: 47 / 90 (52.2%)
- Partial: 24 / 90 (26.7%)
- Wrong: 18 / 90 (20%)
- Average prompts: 2.84 → 1.72
This is the best benchmark when the question is: "Does the developer need fewer retries to finish the job?"
4. Quality and overflow
- 16/21 repos overflow GPT-4o's 128K context window without SigMap
- R repos add to overflow risk: ggplot2 and shiny both overflow without SigMap
- 5,200+ files would be hidden from the model in the raw-flow scenario
- 16,500+ symbols are surfaced in SigMap output across all benchmark repos
- With SigMap: 0/21 repos overflow — all repos fit within 128K context
This is the best benchmark when the question is: "Why does token reduction matter operationally?"
Open the HTML dashboard
The easiest way to inspect the latest benchmark run is the self-contained HTML report:
node scripts/run-benchmark-matrix.mjs --save --skip-clone
open benchmarks/reports/benchmark-report.htmlThat generates synchronized JSON plus a dashboard for token, retrieval, quality, and task metrics together.
Reproduce the full benchmark set
node scripts/run-benchmark.mjs --save --skip-clone
node scripts/run-retrieval-benchmark.mjs --save
node scripts/run-quality-benchmark.mjs --save
node scripts/run-task-benchmark.mjs --save
node scripts/run-benchmark-matrix.mjs --save --skip-cloneEach benchmark repository is pinned to a fixed commit in scripts/run-benchmark.mjs (fetched by SHA on clone). The corpus is frozen, so retrieval and token numbers move only when SigMap's own ranking/extraction changes — not when an upstream repo does. This makes hit@5 a true release-over-release signal rather than a moving target.
The matrix run writes:
benchmarks/reports/token-reduction.jsonbenchmarks/reports/retrieval.jsonbenchmarks/reports/quality.jsonbenchmarks/reports/task-benchmark.jsonbenchmarks/reports/benchmark-matrix.jsonbenchmarks/reports/benchmark-report.html
Benchmark resources
- Benchmark suite → — Open-source scripts, 90 real-world coding tasks, and per-repo raw results
- Archived data (Zenodo) → — Full benchmark dataset for reproducibility and independent analysis
- Hacker News discussion → — Community feedback and related work