On the Hard10 baseline, Codex GPT-5.2 and GPT-5.5 both finished cleanly and failed every hidden verifier.
Eval packet
What we ran, what failed, and what it cost.
The headline result is real, but it is not magic. This page shows the first two smoke checks, the single hard ledger slice, the 10-task hard batch, verifier failures, setup errors, latency, cost, the local Harbor job paths behind the landing-page claim, and a separate real interactive Codex TUI sidecar.
With the Continuity protocol, Codex GPT-5.2 passed all 10 tasks in the same family using graph orientation and reconciliation.
The final GPT-5.2 baseline, GPT-5.5 baseline, and Continuity Hard10 runs all completed with 0 run exceptions.
20 valid GPT-5.5 tmux trials, plus 10 unscored harness/setup attempts retained below.
Hard10 accuracy moved from 0% to 100%; wall time increased by about 2m 43s versus the GPT-5.2 baseline rerun.
Run ledger
Every result we are using for the claim.
| Run | Variant | Model | Score | Exceptions | Wall time | Avg agent time | Cost | Tokens in / out |
|---|---|---|---|---|---|---|---|---|
| Smoke statfix | baseline | GPT-5.2 | 1/1 | 0 | 52s | 37s | $0.060 | 148k / 1.2k |
| Smoke statfix | with Continuity | GPT-5.2 | 1/1 | 0 | 1m 29s | 1m 15s | $0.101 | 291k / 2.6k |
| Single ledger | baseline | GPT-5.2 | 0/1 | 0 | 5m 24s | 5m 7s | $0.300 | 445k / 8.0k |
| Single ledger | with Continuity | GPT-5.2 | 1/1 | 0 | 4m 16s | 4m 0s | $0.267 | 569k / 8.9k |
| Hard10 | baseline | GPT-5.2 | 0/10 | 0 | 23m 24s | 4m 5s | $2.633 | 4.17M / 78k |
| Hard10 | baseline | GPT-5.5 | 0/10 | 0 | 7m 5s | 1m 2s | $1.890 | 954k / 25k |
| Hard10 | with Continuity | GPT-5.2 | 10/10 | 0 | 26m 7s | 4m 32s | $3.105 | 5.63M / 100k |
These Harbor runs used codex exec. The first smoke and single-ledger jobs used the older goal-baseline.jinja and goal-insanity.jinja prompt templates, which prefixed the task with /goal. The Hard10 landing-page claim uses baseline.jinja and insanity.jinja, which do not invoke slash commands, so this packet should not be read as a test of interactive Codex thread-goal state.
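
The wall-time and cost deltas between the two GPT-5.2 Hard10 rows can be checked by hand. The sketch below is only a worked version of that arithmetic; the helper name `parse_duration` is hypothetical, and the figures are copied from the run ledger rather than read from any Harbor artifact.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a ledger duration such as '26m 7s' or '52s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m\s*)?(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    seconds = int(match.group(2))
    return minutes * 60 + seconds


# Values copied from the Hard10 rows of the run ledger.
baseline_wall = parse_duration("23m 24s")    # GPT-5.2 baseline rerun
continuity_wall = parse_duration("26m 7s")   # GPT-5.2 with Continuity
baseline_cost, continuity_cost = 2.633, 3.105

wall_delta = continuity_wall - baseline_wall
cost_delta = round(continuity_cost - baseline_cost, 3)

print(f"wall time delta: {wall_delta // 60}m {wall_delta % 60}s")  # 2m 43s
print(f"cost delta: ${cost_delta:.3f}")                            # $0.472
```

The comparison deliberately uses the baseline rerun (23m 24s, $2.633), not the original Hard10 baseline job, because the rerun is the row used in the landing-page comparison.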
Attempt log
Every top-level Harbor job we ran, including superseded runs.
| Job | Scope | Agent / model | Result | Errors | Wall time | Cost | How to read it |
|---|---|---|---|---|---|---|---|
| insanity-goal-ab-oracle | first two statfix smoke tasks | oracle | 0/2 | 1 error, 1 cancelled | - | - | Initial verifier sanity run; superseded after a cancelled trial. |
| insanity-goal-ab-oracle-shell | first two statfix smoke tasks | oracle shell | 2/2 | 0 | 8s | - | Confirmed the smoke verifiers could pass before using Codex. |
| insanity-goal-ab-baseline-codex | baseline statfix smoke | Codex GPT-5.2 Codex | error | 1 agent error | 20s | - | Bad model choice for local ChatGPT auth: gpt-5.2-codex was rejected. |
| insanity-goal-ab-baseline-codex-gpt52 | baseline statfix smoke | Codex GPT-5.2 | 0/1 | 0 run errors | 1m 2s | $0.067 | Visible path passed, but hidden custom input exposed a count/mean bug. |
| insanity-goal-ab-insanity-codex-gpt52 | Continuity statfix smoke | Codex GPT-5.2 | 1/1 | 0 | 3m 5s | $0.209 | Protocol did not break the small task; rerun later for a cleaner paired smoke. |
| insanity-goal-ab-oracle-final | first two statfix smoke tasks | oracle | 2/2 | 0 | 7s | - | Final oracle sanity check for the two smoke tasks. |
| insanity-goal-ab-baseline-codex-gpt52-final | baseline statfix smoke | Codex GPT-5.2 | 1/1 | 0 | 52s | $0.060 | Final baseline smoke result used in the run ledger. |
| insanity-goal-ab-insanity-codex-gpt52-final | Continuity statfix smoke | Codex GPT-5.2 | 1/1 | 0 | 1m 29s | $0.101 | Final Continuity smoke result used in the run ledger. |
| insanity-goal-ab-ledger-oracle | single ledger slice | oracle | 1/2 | 0 run errors | 10s | - | Oracle setup bug: verifier hit awk: cannot open "" for output. |
| insanity-goal-ab-ledger-oracle-fixed | single ledger slice | oracle | 2/2 | 0 | 7s | - | Fixed oracle sanity check for the ledger task. |
| insanity-goal-ab-ledger-baseline-codex-gpt52 | baseline ledger slice | Codex GPT-5.2 | 0/1 | 0 run errors | 5m 24s | $0.300 | Completed cleanly, failed hidden custom ledger fixtures. |
| insanity-goal-ab-ledger-insanity-codex-gpt52 | Continuity ledger slice | Codex GPT-5.2 | 1/1 | 0 | 4m 16s | $0.267 | Passed the same hidden ledger verifier. |
| insanity-protocol-ab-hard10-oracle | Hard10 baseline + Continuity tasks | oracle | 10/20 | 0 run errors | 25s | - | Oracle setup failed on Continuity variants because command-log proof was missing. |
| insanity-protocol-ab-hard10-oracle-fixed | Hard10 baseline + Continuity tasks | oracle | 20/20 | 0 | 20s | - | Final Hard10 oracle sanity check. |
| insanity-protocol-ab-hard10-baseline-codex-gpt52 | baseline Hard10 | Codex GPT-5.2 | 0/10 | 0 run errors | 23m 11s | $2.612 | Original Hard10 baseline result; rerun kept below for a cleaner final number. |
| insanity-protocol-ab-hard10-insanity-codex-gpt52 | Continuity Hard10 | Codex GPT-5.2 | 10/10 | 0 | 26m 7s | $3.105 | Final protocol result behind the 10/10 claim. |
| insanity-protocol-ab-hard10-baseline-codex-gpt52-rerun-20260508 | baseline Hard10 | Codex GPT-5.2 | 0/10 | 0 run errors | 23m 24s | $2.633 | Final GPT-5.2 baseline result used in the landing-page comparison. |
| insanity-protocol-ab-hard10-baseline-codex-gpt55-rerun-20260508 | baseline Hard10 | Codex GPT-5.5 | 0/10 | 0 run errors | 7m 5s | $1.890 | Model-upgrade baseline check; still failed all 10 hidden verifiers. |
Score cells count verifier reward. The errors column counts Harbor run exceptions separately, so a clean run can still be a real model failure when the verifier returns zero.
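
The distinction between a run exception and a verifier failure can be made concrete with a small classifier. This is a minimal sketch under stated assumptions: `JobSummary` and its field names are hypothetical and do not reflect Harbor's actual result schema, and the three labels are a simplification of the types used elsewhere on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSummary:
    """Hypothetical per-job summary; not Harbor's real schema."""
    name: str
    passed: int      # trials where the verifier returned reward 1
    total: int       # trials attempted
    exceptions: int  # run exceptions reported by the harness


def classify(job: JobSummary) -> str:
    """Separate harness problems from genuine model failures."""
    if job.exceptions > 0:
        return "run error"       # the harness itself did not finish cleanly
    if job.passed == job.total:
        return "pass"
    return "verifier fail"       # clean run, but the verifier returned zero


hard10_baseline = JobSummary("hard10-baseline-gpt52", passed=0, total=10, exceptions=0)
hard10_continuity = JobSummary("hard10-continuity-gpt52", passed=10, total=10, exceptions=0)

print(classify(hard10_baseline))    # verifier fail
print(classify(hard10_continuity))  # pass
```

Checking exceptions first matters: a job with a cancelled trial would otherwise be counted as a model failure even though the agent never had a fair attempt.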
Problems attempted
The task suite was not one problem repeated with new names.
Smoke statfix, baseline: repair a tiny stats report script so it prints count, total, min, max, and mean. Hidden checks exercised alternate input/output paths.
Smoke statfix, with Continuity: the same stats repair, but the repo required Continuity orientation before editing. This verified that the protocol did not break a small task.
Single ledger: repair a multi-file ledger reconciliation script with account overrides, refunds, duplicates, invalid amounts, unknown accounts, and custom files.
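
To illustrate what the statfix report has to get right, here is a minimal sketch of the five statistics with caller-supplied input and output paths. It is not the task's script: the real script and its input format are not shown on this page, so the one-number-per-line format, the blank-line handling, and the function name `write_stats_report` are all assumptions.

```python
import tempfile
from pathlib import Path


def write_stats_report(input_path: Path, output_path: Path) -> str:
    """Read one number per line (assumed format) and write a five-line report."""
    values = []
    for line in input_path.read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are not data rows
        values.append(float(line))
    if not values:
        raise ValueError("no numeric rows in input")
    total = sum(values)
    report = "\n".join([
        f"count: {len(values)}",
        f"total: {total:g}",
        f"min: {min(values):g}",
        f"max: {max(values):g}",
        f"mean: {total / len(values):g}",
    ])
    output_path.write_text(report + "\n")
    return report


# Usage with custom paths, the kind of override the hidden checks exercised.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "custom-input.txt"
    dst = Path(tmp) / "custom-report.txt"
    src.write_text("2\n4\n\n9\n")
    demo_report = write_stats_report(src, dst)
    demo_written = dst.read_text()

print(demo_report)
```

Taking both paths as parameters, instead of hard-coding a visible fixture, is the property that separates a script that passes the visible check from one that also passes on custom input.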
| # | Hard10 problem | What it reconciled | GPT-5.2 baseline | GPT-5.5 baseline | Continuity protocol |
|---|---|---|---|---|---|
| 01 | receivables | customer receivables activity into report and exception files | 0 | 0 | 1 |
| 02 | warehouse | inventory movements with invalid, duplicate, and orphan records | 0 | 0 | 1 |
| 03 | subscriptions | subscription balance events and exception ordering | 0 | 0 | 1 |
| 04 | usage | quota usage events with custom fixture overrides | 0 | 0 | 1 |
| 05 | support | support credit/debit activity with hidden edge cases | 0 | 0 | 1 |
| 06 | shipping | shipping claim movements and exception reasons | 0 | 0 | 1 |
| 07 | builds | build minute accounting across posted and invalid events | 0 | 0 | 1 |
| 08 | access | access grant usage with inactive and unknown entities | 0 | 0 | 1 |
| 09 | invoices | invoice corrections, credits, and hidden fixture ordering | 0 | 0 | 1 |
| 10 | credits | credit adjustments with duplicates, orphan credits, and invalid rows | 0 | 0 | 1 |
The hard tasks share a shape because they measure the same capability: can an agent repair a shell reconciliation program while preserving file overrides, row ordering, exception ordering, dependency-free execution, and hidden custom fixtures?
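
The shape of such a reconciliation can be sketched in a few lines. Everything here is illustrative: the real tasks are shell programs whose schemas are not reproduced on this page, so the event tuple layout, the kind names in `KIND_SIGNS`, the exception reason strings, and the ordering rules are assumptions chosen to mirror the failure classes named in this packet.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical event kinds and their effect on a balance.
KIND_SIGNS = {"charge": 1, "refund": -1}


def reconcile(accounts, events):
    """Return (report_rows, exception_rows).

    report_rows are sorted by account name; exception_rows keep input order.
    Each event is (event_id, account, kind, raw_amount).
    """
    balances = {name: Decimal("0") for name in accounts}
    exceptions = []
    seen_ids = set()
    for event_id, account, kind, raw_amount in events:
        if event_id in seen_ids:
            exceptions.append((event_id, "duplicate"))
            continue
        seen_ids.add(event_id)  # first sighting claims the id, even if rejected
        if account not in balances:
            exceptions.append((event_id, "unknown_account"))
            continue
        if kind not in KIND_SIGNS:
            exceptions.append((event_id, "unknown_kind"))
            continue
        try:
            amount = Decimal(raw_amount)
        except InvalidOperation:
            amount = None
        if amount is None or not amount.is_finite() or amount <= 0:
            exceptions.append((event_id, "invalid_amount"))
            continue
        balances[account] += KIND_SIGNS[kind] * amount
    return sorted(balances.items()), exceptions


demo_events = [
    ("e1", "acme", "charge", "10.50"),
    ("e2", "acme", "refund", "2.50"),
    ("e1", "acme", "charge", "10.50"),  # duplicate id
    ("e3", "ghost", "charge", "5"),     # unknown account
    ("e4", "zen", "charge", "abc"),     # invalid amount
    ("e5", "zen", "bonus", "1"),        # unknown kind
]
demo_report, demo_exceptions = reconcile(["zen", "acme"], demo_events)
print(demo_report)
print(demo_exceptions)
```

Even in this toy version, the order of the checks and the ordering of the two outputs are decisions a visible fixture may never exercise, which is the gap the hidden custom fixtures probe.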
Errors and failures
Verifier failures are separate from run exceptions.
| Where | What happened | Type | Status |
|---|---|---|---|
| Early smoke oracle | insanity-goal-ab-oracle left one trial cancelled with CancelledError. | run error | Superseded by insanity-goal-ab-oracle-final, which passed 2/2. |
| First baseline smoke Codex run | insanity-goal-ab-baseline-codex exited with NonZeroAgentExitCodeError and a report diff. | agent error | Superseded by explicit GPT-5.2 runs. |
| Initial GPT-5.2 smoke baseline | The visible fixture passed, but a custom hidden report expected count: 3 and the script produced count: 4. | verifier fail | Final smoke baseline passed 1/1 after rerun. |
| First ledger oracle | The oracle run hit awk: cannot open "" for output and missing report/exceptions files on the baseline variant. | oracle setup | Fixed by insanity-goal-ab-ledger-oracle-fixed, which passed 2/2. |
| First hard10 oracle | Continuity variants failed because the oracle solution did not produce /app/.insanity/command-log.txt. | oracle setup | Fixed by insanity-protocol-ab-hard10-oracle-fixed, which passed 20/20. |
| Hard10 baselines | Both GPT-5.2 and GPT-5.5 completed with 0 exceptions, then failed hidden custom fixtures: invalid amounts, inactive entities, orphan credits, unknown entities, unknown kinds, and duplicates. | verifier fail | Kept as the baseline result: 0/10 and 0/10. |
| Hard10 Continuity protocol | No run exceptions and no verifier failures across all 10 tasks. | pass | Used as the proof result: 10/10. |
Interactive tmux sidecar
20 real Codex TUI runs, scored separately from Harbor.
These runs did not use codex exec. They drove the Codex TUI through tmux with GPT-5.5, accepted the project-trust prompt, captured transcripts, and then ran the same local verifiers. They are useful evidence, but they do not replace the Harbor Hard10 headline.
plain is the task prompt only, goal sends the full task as an interactive /goal command, insanity requires the Continuity orientation workflow, and goal-insanity combines both. The scored matrix used external workspaces under /private/tmp/insanity-tmux-goal-ab-workspaces to avoid contamination from this repo's own AGENTS.md.
plain: 3/5. Average wall time 299s. Passed statfix, warehouse, and invoices.
/goal: 2/5. Average wall time 356s. Underperformed the plain prompt on this matrix.
Continuity: 4/5. Average wall time 306s. Added the receivables pass while matching all three plain wins.
/goal + Continuity: 4/5. Average wall time 358s. Matched Continuity-only accuracy with more latency.
| Scenario | Prompt | plain | /goal | Continuity | /goal + Continuity | Readout |
|---|---|---|---|---|---|---|
| statfix-clear | clear | 1 / 134s (2026-05-08T22-50-32-505Z-statfix-clear-plain-593a3f) | 1 / 177s (2026-05-08T23-00-23-463Z-statfix-clear-goal-0b52f3) | 1 / 184s (2026-05-08T23-03-35-316Z-statfix-clear-insanity-e72f39) | 1 / 254s (2026-05-08T23-06-49-635Z-statfix-clear-goal-insanity-5bdee3) | All variants solved the small stats repair; protocols added latency. |
| ledger-clear | clear | 0 / 225s (2026-05-08T23-12-22-057Z-ledger-clear-plain-d3d967) | 0 / 263s (2026-05-08T23-16-07-358Z-ledger-clear-goal-e7e7b4) | 0 / 327s (2026-05-08T23-20-30-913Z-ledger-clear-insanity-b6419d) | 0 / 279s (2026-05-08T23-25-59-176Z-ledger-clear-goal-insanity-5e944f) | No variant solved this ledger task in the interactive matrix. |
| receivables-vague | vague | 0 / 367s (2026-05-08T23-30-39-415Z-receivables-vague-plain-5624c2) | 0 / 431s (2026-05-08T23-36-46-474Z-receivables-vague-goal-8626cb) | 1 / 367s (2026-05-08T23-43-57-594Z-receivables-vague-insanity-8e7957) | 1 / 446s (2026-05-08T23-50-06-016Z-receivables-vague-goal-insanity-0d4f59) | Continuity variants found the hidden accounting requirements that plain and /goal missed. |
| warehouse-vague | vague | 1 / 392s (2026-05-08T23-57-33-948Z-warehouse-vague-plain-b0c058) | 1 / 517s (2026-05-09T00-04-06-425Z-warehouse-vague-goal-40aaef) | 1 / 326s (2026-05-09T00-12-43-639Z-warehouse-vague-insanity-7c212c) | 1 / 390s (2026-05-09T00-18-11-243Z-warehouse-vague-goal-insanity-334ec6) | All variants solved it; Continuity-only was fastest in the valid set. |
| invoices-vague | vague | 1 / 377s (2026-05-09T00-24-42-713Z-invoices-vague-plain-f2f5e2) | 0 / 390s (2026-05-09T00-30-59-831Z-invoices-vague-goal-c1bf3f) | 1 / 326s (2026-05-09T00-37-30-211Z-invoices-vague-insanity-bc9718) | 1 / 421s (2026-05-09T00-42-58-164Z-invoices-vague-goal-insanity-4fbbd1) | /goal alone lost this one; Continuity variants passed. |
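
The per-variant pass counts and average wall times follow directly from the matrix cells. The sketch below recomputes them; the verifier results and wall times are copied from the table, while the `MATRIX` layout and the `summarize` helper are only illustrative and are not taken from run-matrix.mjs.

```python
# (verifier result, wall time in seconds) per scenario and variant,
# copied from the scored matrix.
MATRIX = {
    "statfix-clear": {"plain": (1, 134), "goal": (1, 177),
                      "insanity": (1, 184), "goal-insanity": (1, 254)},
    "ledger-clear": {"plain": (0, 225), "goal": (0, 263),
                     "insanity": (0, 327), "goal-insanity": (0, 279)},
    "receivables-vague": {"plain": (0, 367), "goal": (0, 431),
                          "insanity": (1, 367), "goal-insanity": (1, 446)},
    "warehouse-vague": {"plain": (1, 392), "goal": (1, 517),
                        "insanity": (1, 326), "goal-insanity": (1, 390)},
    "invoices-vague": {"plain": (1, 377), "goal": (0, 390),
                       "insanity": (1, 326), "goal-insanity": (1, 421)},
}


def summarize(matrix):
    """Return {variant: (passes, mean wall seconds rounded to an integer)}."""
    summary = {}
    variants = next(iter(matrix.values())).keys()
    for variant in variants:
        cells = [row[variant] for row in matrix.values()]
        passes = sum(result for result, _ in cells)
        mean_wall = round(sum(wall for _, wall in cells) / len(cells))
        summary[variant] = (passes, mean_wall)
    return summary


summary = summarize(MATRIX)
for variant, (passes, mean_wall) in summary.items():
    print(f"{variant}: {passes}/5, average wall time {mean_wall}s")
```

With five scenarios per variant, a single verifier flip moves a variant's pass rate by 20 points, so the differences between variants in this matrix should be read as directional.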
| Unscored tmux attempt | Variant | Verifier | Wall time | Why it is not in the matrix |
|---|---|---|---|---|
| 2026-05-08T22-13-42-013Z-statfix-clear-plain-7bbb6f | plain | 0 | 21s | Harness combined incompatible approval/sandbox flags, so Codex exited before receiving the prompt. |
| 2026-05-08T22-14-56-090Z-statfix-clear-plain-156b86 | plain | 0 | 62s | The update notice plus multiline paste left the task prompt in the composer instead of submitting it. |
| 2026-05-08T22-17-33-176Z-statfix-clear-plain-f34df2 | plain | 1 | 158s | Early successful smoke before external workspaces, trust handling, and explicit validTrial fields; superseded by the scored plain statfix run. |
| 2026-05-08T22-31-04-913Z-statfix-clear-goal-81b48a | /goal | 0 | 67s | The resume command landed in the Codex composer instead of the shell, so no project edits were made. |
| 2026-05-08T22-34-13-417Z-statfix-clear-goal-3e9550 | /goal | 0 | 87s | The harness sent /goal, then could not return from the TUI to submit the task prompt. |
| 2026-05-08T22-37-00-021Z-statfix-clear-goal-2ee3e5 | /goal | 1 | 330s | Workspace lived inside the Continuity repo, so repo-level AGENTS.md caused Continuity commands in a nominal no-Continuity variant. |
| 2026-05-08T22-45-49-526Z-statfix-clear-plain-bc09ec | plain | 0 | 75s | External workspace run stopped at the Codex project-trust prompt before the task reached the agent. |
| 2026-05-08T22-48-27-245Z-statfix-clear-plain-586047 | plain | 0 | 75s | Same trust-prompt stop as the previous external workspace attempt. |
| 2026-05-08T22-53-24-371Z-statfix-clear-goal-f79ad7 | /goal | 0 | 232s | The task prompt was sent after /goal had already started executing, so it remained in the composer. |
| 2026-05-08T22-58-20-750Z-statfix-clear-goal-302918 | /goal | 0 | 75s | The full /goal prompt was injected too quickly and remained as pasted composer content. |
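
Most of the unscored attempts above are sequencing problems: input arrived before the TUI was ready, or a paste was never followed by a separate submit. The sketch below shows one way to plan a paced tmux session. It is not the logic in run-trial.mjs: `plan_trial`, the settle delay, the key assumed to accept the trust prompt, and the example workspace subdirectory are all hypothetical. Only the tmux subcommands and flags (`new-session -d -s -c`, `send-keys -t` with `-l` for literal text, `capture-pane -p -t`) are standard.

```python
import subprocess
import time


def plan_trial(session, workspace, launch_cmd, prompt,
               settle_seconds=2.0, trust_key="Enter"):
    """Return ordered ('run', argv) and ('sleep', seconds) steps.

    trust_key is an assumption about how the project-trust prompt is accepted.
    """
    return [
        # Start the TUI detached, in an external workspace directory.
        ("run", ["tmux", "new-session", "-d", "-s", session,
                 "-c", workspace, launch_cmd]),
        ("sleep", settle_seconds),
        # Accept the trust prompt before any task text is sent.
        ("run", ["tmux", "send-keys", "-t", session, trust_key]),
        ("sleep", settle_seconds),
        # Paste the prompt literally, then submit it as a separate keypress.
        ("run", ["tmux", "send-keys", "-t", session, "-l", prompt]),
        ("sleep", settle_seconds),
        ("run", ["tmux", "send-keys", "-t", session, "Enter"]),
    ]


def capture_argv(session):
    """argv that prints the current pane contents for the transcript."""
    return ["tmux", "capture-pane", "-p", "-t", session]


def execute(plan):
    """Run a plan for real; requires tmux, so it is not called here."""
    for kind, value in plan:
        if kind == "sleep":
            time.sleep(value)
        else:
            subprocess.run(value, check=True)


demo_plan = plan_trial(
    session="trial-demo",
    workspace="/private/tmp/insanity-tmux-goal-ab-workspaces/example",  # hypothetical subdirectory
    launch_cmd="codex",  # model and approval flags intentionally omitted
    prompt="Repair the stats report script.",
)
for step in demo_plan:
    print(step)
```

Keeping the plan as data makes the ordering inspectable before anything is sent to a live terminal. A fixed delay is still a guess; polling the pane with `capture_argv` until the expected prompt text appears would be a more robust readiness check.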
Raw artifact paths
Local artifact roots and files behind this page.
jobs/insanity-goal-ab-oracle
jobs/insanity-goal-ab-oracle-shell
jobs/insanity-goal-ab-baseline-codex
jobs/insanity-goal-ab-baseline-codex-gpt52
jobs/insanity-goal-ab-insanity-codex-gpt52
jobs/insanity-goal-ab-oracle-final
jobs/insanity-goal-ab-baseline-codex-gpt52-final
jobs/insanity-goal-ab-insanity-codex-gpt52-final
jobs/insanity-goal-ab-ledger-oracle
jobs/insanity-goal-ab-ledger-oracle-fixed
jobs/insanity-goal-ab-ledger-baseline-codex-gpt52
jobs/insanity-goal-ab-ledger-insanity-codex-gpt52
jobs/insanity-protocol-ab-hard10-oracle
jobs/insanity-protocol-ab-hard10-oracle-fixed
jobs/insanity-protocol-ab-hard10-baseline-codex-gpt52
jobs/insanity-protocol-ab-hard10-insanity-codex-gpt52
jobs/insanity-protocol-ab-hard10-baseline-codex-gpt52-rerun-20260508
jobs/insanity-protocol-ab-hard10-baseline-codex-gpt55-rerun-20260508
evals/tmux-goal-ab/manifest.json
evals/tmux-goal-ab/scripts/run-trial.mjs
evals/tmux-goal-ab/scripts/run-matrix.mjs
jobs/tmux-goal-ab