Under the Continuity protocol, Harbor Hard10 passed while matched baselines failed.
Trust protocol
Proof that survives more than one benchmark chart.
Continuity is not claiming a better model. It is claiming that persistent orientation state lets the same coding agent finish harder work with less drift. That claim needs public benchmarks, hard external tasks, held-out challenges, and real-user telemetry.
Every variant tied on SWE-bench Lite, and Continuity added latency.
Four evidence lanes: public benchmarks, external tasks, held-out tasks, and beta telemetry.
Runtime claims only count when the model, tools, budget, and start state match.
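That matching rule can be made mechanical. A minimal sketch in Python, assuming a hypothetical `RunConditions` record; the field names are illustrative, not the actual harness schema:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunConditions:
    """Conditions that must be identical before a runtime comparison counts."""
    model: str
    tools: tuple          # tool names available to the agent
    time_budget_s: int
    start_commit: str     # identical repo start state
    network_policy: str

def matched(a: RunConditions, b: RunConditions) -> bool:
    """An A/B result is admissible only when every condition matches."""
    return asdict(a) == asdict(b)

baseline = RunConditions("model-x", ("shell", "editor"), 3600, "abc123", "offline")
candidate = RunConditions("model-x", ("shell", "editor"), 3600, "abc123", "offline")
```

Any single mismatch, even a different start commit, disqualifies the pair rather than discounting it.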
What serious teams do
The credible pattern is not one leaderboard. It is a proof stack.
Use expert validation, public rubrics, open subsets, system cards, and explicit caveats. SWE-bench Verified, SWE-Lancer, PaperBench, and GDPval all move toward real work and auditable methodology.
Pair benchmark charts with agent traces, tool details, excluded task lists, system cards, and customer/partner validation.
Open weights, model cards, papers, deployment instructions, benchmark matrices, and responsible-use guidance shift trust from assertion to reproducibility.
CursorBench uses real agent sessions, short ambiguous prompts, larger code changes, offline grading, and online product signals to catch gaps public benchmarks miss.
Autonomous-agent demos and SWE-bench numbers win attention, but skepticism grows when third parties cannot inspect enough of the task and reproduction path.
Continuity proof stack
Four lanes before we earn the missing-product claim.
Run Terminal-Bench, SWE-bench Pro, SWE-bench Verified/Lite, and other public suites with matched model/tool budgets.
Use unrelated repos and synthetic companies. Tasks should be vague, multi-file, long-horizon, and scored by hidden tests plus rubric.
Keep a rotating private set with public categories and rubric, then release artifacts after each rotation.
Measure real users: accepted diffs, abandoned runs, repeated exploration, resume time, corrective turns, and handoff confidence.
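The telemetry lane reduces to a few aggregates over run events. A sketch under assumptions: the event schema and field names below are illustrative, not the real beta instrumentation:

```python
from statistics import mean

# Hypothetical beta-telemetry events; every field name here is an assumption.
events = [
    {"diff_accepted": True,  "corrective_turns": 1, "resume_s": 40},
    {"diff_accepted": False, "corrective_turns": 4, "resume_s": 130},
    {"diff_accepted": True,  "corrective_turns": 0, "resume_s": 25},
]

# Accepted-diff rate, average corrective turns, and average resume time.
acceptance_rate = mean(e["diff_accepted"] for e in events)
avg_corrections = mean(e["corrective_turns"] for e in events)
avg_resume_s = mean(e["resume_s"] for e in events)
```

The same aggregates computed per cohort (Continuity on vs. off) are what turn telemetry into evidence rather than anecdote.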
The protocol lives in evals/insanity-trust/. The evidence auditor checks real artifact paths, same-model comparisons, the presence of negative results, and the current claim level, and verifies that the missing-product claim is still blocked. The research readout lives at docs/research/agent-trust-playbook.md.
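One of those checks, the artifact-path check, might look like this sketch; the directory layout and artifact names are assumptions for illustration, not the actual layout of evals/insanity-trust/:

```python
from pathlib import Path

# Assumed per-run artifact set; the real auditor's list may differ.
REQUIRED_ARTIFACTS = ["prompts", "transcripts", "patches", "verifier_logs"]

def audit_run(run_dir: Path) -> list[str]:
    """Return the audit failures for one recorded run directory."""
    failures = []
    for name in REQUIRED_ARTIFACTS:
        if not (run_dir / name).exists():
            failures.append(f"missing artifact: {name}")
    return failures
```

A run with any failure is excluded from scored results and logged as excluded, so the exclusion list itself stays auditable.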
Claim gates
What has to be true before the big claim is allowed.
| Gate | Pass condition | Why it matters |
|---|---|---|
| Matched A/B | Same model, tools, repo, time budget, and network policy. | Separates runtime advantage from model advantage. |
| External generalization | Most scored tasks are unrelated to this repo and this product domain. | Prevents dogfood overfitting. |
| Full artifacts | Prompts, transcripts, patches, verifier logs, cost, latency, and exclusions are retained. | Makes the result inspectable instead of merely asserted. |
| Anti-cheat checks | Runs are screened for solution lookup, hidden-test leakage, and harness shortcuts. | Protects against benchmark gaming. |
| Human calibration | Rubric-heavy tasks are sampled by human reviewers. | Captures judgment, scope, and maintainability beyond unit tests. |
| Negative results | Failed, neutral, and latency-negative results ship beside wins. | Builds credibility when the result is mixed. |
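The table above reduces to an all-pass rule: every gate must hold before the headline claim unlocks. A sketch with illustrative gate keys and statuses (the `False` entry is an example, not a status report):

```python
# Gate statuses mirror the table; keys and values here are illustrative.
gates = {
    "matched_ab": True,
    "external_generalization": True,
    "full_artifacts": True,
    "anti_cheat": True,
    "human_calibration": False,   # example of one gate still pending
    "negative_results": True,
}

def claim_allowed(gates: dict) -> bool:
    """The missing-product claim unlocks only when every gate passes."""
    return all(gates.values())
```

An all-pass rule is deliberate: gates are not weighted or averaged, so one strong result cannot buy back a failed check.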
Sources