Trust protocol

Proof that survives more than one benchmark chart.

Continuity is not claiming a better model. It is claiming that persistent orientation state lets the same coding agent finish harder work with less drift. That claim needs public benchmarks, hard external tasks, held-out challenges, and real-user telemetry.

Current signal: 10/10

Runs using the Continuity protocol passed Harbor Hard10 while matched baselines failed.

Counter-signal: 2/2

Every variant tied on SWE-bench Lite, and Continuity added latency.

Next bar: 4 lanes

Public benchmarks, external tasks, held-out tasks, and beta telemetry.

Claim rule: same model

Runtime claims only count when the model, tools, budget, and start state match.
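
What counts as "matched" can be written down. A minimal sketch, assuming hypothetical field names rather than the real harness schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    """Setup for one agent run. Field names are illustrative."""
    model: str              # identical model for Continuity and baseline arms
    tools: frozenset        # identical tool surface
    repo_commit: str        # identical start state
    time_budget_s: int      # identical wall-clock budget
    network_policy: str     # identical policy, e.g. "offline" or "allowlist"

def runs_are_matched(a: RunConfig, b: RunConfig) -> bool:
    """A runtime claim only counts when nothing but the runtime differs."""
    return (
        a.model == b.model
        and a.tools == b.tools
        and a.repo_commit == b.repo_commit
        and a.time_budget_s == b.time_budget_s
        and a.network_policy == b.network_policy
    )
```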

What serious teams do

The credible pattern is not one leaderboard. It is a proof stack.

OpenAI: Define the benchmark and publish the limits.

Use expert validation, public rubrics, open subsets, system cards, and explicit caveats. SWE-bench Verified, SWE-Lancer, PaperBench, and GDPval all move toward real work and auditable methodology.

Anthropic: Show the scaffold, not just the score.

Pair benchmark charts with agent traces, tool details, excluded task lists, system cards, and customer/partner validation.

Meta / Qwen: Let outsiders inspect the artifact.

Open weights, model cards, papers, deployment instructions, benchmark matrices, and responsible-use guidance shift trust from assertion to reproducibility.

Cursor: Measure the work users actually ask for.

CursorBench uses real agent sessions, short ambiguous prompts, larger code changes, offline grading, and online product signals to catch gaps public benchmarks miss.

Devin: Category creation needs auditability.

Autonomous-agent demos and SWE-bench numbers win attention, but skepticism grows when third parties cannot inspect enough of the task and reproduction path.

Continuity proof stack

Four lanes before we earn the missing-product claim.

01 Public canonical

Run Terminal-Bench, SWE-bench Pro, SWE-bench Verified/Lite, and other public suites with matched model/tool budgets.
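
A sketch of what the public lane's run matrix could look like; the suite names come from the text above, and the shared-budget fields are assumptions:

```python
# Every suite runs both arms under one shared budget, so scores stay comparable.
SHARED_BUDGET = {"model": "same-model-for-both-arms", "time_budget_s": 3600, "max_tool_calls": 200}

PUBLIC_SUITES = ["terminal-bench", "swe-bench-pro", "swe-bench-verified", "swe-bench-lite"]

RUN_MATRIX = [
    {"suite": suite, "arm": arm, **SHARED_BUDGET}
    for suite in PUBLIC_SUITES
    for arm in ("continuity", "baseline")
]
```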

02 Fresh external

Use unrelated repos and synthetic companies. Tasks should be vague, multi-file, long-horizon, and scored by hidden tests plus rubric.
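
One way such a task could be specified, sketched with assumed field names and weights; hidden tests carry most of the score and the rubric carries the rest:

```python
from dataclasses import dataclass, field

@dataclass
class ExternalTask:
    """One fresh external task. Fields and weights are illustrative."""
    repo_url: str                 # unrelated repo or synthetic company
    prompt: str                   # deliberately short and ambiguous
    min_files_touched: int        # multi-file by construction
    max_turns: int                # long-horizon budget
    hidden_tests: list = field(default_factory=list)  # never shown to the agent
    rubric: dict = field(default_factory=dict)        # criterion -> weight, summing to 1.0

def score(task: ExternalTask, tests_passed: int, rubric_scores: dict) -> float:
    """Hidden tests dominate the score; rubric judgment fills the remainder."""
    test_part = tests_passed / max(len(task.hidden_tests), 1)
    rubric_part = sum(w * rubric_scores.get(k, 0.0) for k, w in task.rubric.items())
    return 0.7 * test_part + 0.3 * rubric_part  # the 70/30 split is an assumption
```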

03 Held-out challenge

Keep a rotating private set with public categories and rubric, then release artifacts after each rotation.
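
A rough sketch of one rotation record, under hypothetical fields: categories and rubric are public up front, task contents stay private until the rotation retires:

```python
from dataclasses import dataclass

@dataclass
class ChallengeRotation:
    """One rotation of the private held-out set. Fields are illustrative."""
    rotation_id: str
    categories: list       # published before the rotation runs
    rubric_version: str    # published before the rotation runs
    task_ids: list         # private while the rotation is live
    retired: bool = False  # flipped when a new rotation replaces this one

    def release_artifacts(self) -> dict:
        """Prompts, transcripts, and scores go public only after retirement."""
        if not self.retired:
            raise ValueError("artifacts are released only after the rotation retires")
        return {
            "rotation": self.rotation_id,
            "tasks": self.task_ids,
            "rubric": self.rubric_version,
        }
```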

04 Production telemetry

Measure real users: accepted diffs, abandoned runs, repeated exploration, resume time, corrective turns, and handoff confidence.
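
The per-session signals could be recorded roughly like this, under assumed names rather than the real telemetry schema:

```python
from dataclasses import dataclass

@dataclass
class SessionOutcome:
    """Signals from one real user session. Names are illustrative."""
    diff_accepted: bool
    abandoned: bool
    repeated_exploration_steps: int  # re-reading the same files or rerunning the same commands
    resume_time_s: float             # time to regain context after an interruption
    corrective_turns: int            # user turns spent steering the agent back on course
    handoff_confidence: int          # 1-5 self-report at handoff

def summarize(sessions: list) -> dict:
    """Aggregate the drift-related signals public benchmarks cannot see."""
    n = max(len(sessions), 1)
    return {
        "accept_rate": sum(s.diff_accepted for s in sessions) / n,
        "abandon_rate": sum(s.abandoned for s in sessions) / n,
        "mean_resume_time_s": sum(s.resume_time_s for s in sessions) / n,
        "mean_corrective_turns": sum(s.corrective_turns for s in sessions) / n,
    }
```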

Current implementation: the protocol, task seed set, evidence ledger, and validators live under evals/insanity-trust/. The evidence auditor verifies real artifact paths, same-model comparisons, the presence of negative results, and the current claim level, and confirms that the missing-product claim is still blocked. The research readout lives at docs/research/agent-trust-playbook.md.
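
Those checks could look roughly like the sketch below; the ledger keys and layout are assumptions, not the actual validators under evals/insanity-trust/:

```python
from pathlib import Path

def audit_evidence(ledger: dict) -> list:
    """Return a list of problems; an empty list means the ledger passes."""
    problems = []
    for run in ledger.get("runs", []):
        if not Path(run.get("artifact_path", "")).exists():
            problems.append(f"missing artifacts: {run.get('artifact_path')}")
        if run.get("continuity_model") != run.get("baseline_model"):
            problems.append(f"model mismatch in run {run.get('run_id')}")
    if not any(run.get("result") in ("fail", "neutral") for run in ledger.get("runs", [])):
        problems.append("no negative results recorded")
    if ledger.get("claim_level") == "missing-product" and not ledger.get("all_gates_passed"):
        problems.append("missing-product claim asserted while still blocked")
    return problems
```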

Claim gates

What has to be true before the big claim is allowed.

Gate | Pass condition | Why it matters
Matched A/B | Same model, tools, repo, time budget, and network policy. | Separates runtime advantage from model advantage.
External generalization | Most scored tasks are unrelated to this repo and this product domain. | Prevents dogfood overfitting.
Full artifacts | Prompts, transcripts, patches, verifier logs, cost, latency, and exclusions are retained. | Makes the result inspectable instead of trust-me-bro.
Anti-cheat checks | Runs are screened for solution lookup, hidden-test leakage, and harness shortcuts. | Protects against benchmark gaming.
Human calibration | Rubric-heavy tasks are sampled by human reviewers. | Captures judgment, scope, and maintainability beyond unit tests.
Negative results | Failed, neutral, and latency-negative results ship beside wins. | Builds credibility when the result is mixed.
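
A minimal sketch of how the gates could combine, using hypothetical gate keys: the claim stays blocked until every gate explicitly passes.

```python
GATES = (
    "matched_ab",
    "external_generalization",
    "full_artifacts",
    "anti_cheat",
    "human_calibration",
    "negative_results",
)

def missing_product_claim_allowed(gate_status: dict) -> bool:
    """The big claim is allowed only when every gate reports an explicit pass."""
    return all(gate_status.get(gate, False) for gate in GATES)
```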

Sources

The protocol is modeled on public benchmark practice.