Eval packet

What we ran, what failed, and what it cost.

The headline result is real, but it is not magic. This page shows the first two smoke checks, the single hard ledger slice, the 10-task hard batch, verifier failures, setup errors, latency, cost, the local Harbor job paths behind the landing-page claim, and a separate real interactive Codex TUI sidecar.

Hard10 baseline 0/10

Codex GPT-5.2 and GPT-5.5 both finished cleanly and failed every hidden verifier.

Hard10 with Continuity 10/10

Codex GPT-5.2 passed the same task family with graph orientation and reconciliation.

Infra exceptions 0

The final GPT-5.2 baseline, the GPT-5.5 baseline, and the Continuity Hard10 runs all completed without run exceptions.

Interactive sidecar 20

Valid GPT-5.5 tmux trials, plus 10 unscored harness/setup attempts retained below.

Delta +100 pts

Accuracy moved from 0% to 100%; wall time increased by about 2m43s versus the GPT-5.2 baseline rerun.

Run ledger

Every result we are using for the claim.

Run Variant Model Score Exceptions Wall time Avg agent time Cost Tokens in / out
Smoke statfix baseline GPT-5.2 1/1 0 52s 37s $0.060 148k / 1.2k
Smoke statfix with Continuity GPT-5.2 1/1 0 1m 29s 1m 15s $0.101 291k / 2.6k
Single ledger baseline GPT-5.2 0/1 0 5m 24s 5m 7s $0.300 445k / 8.0k
Single ledger with Continuity GPT-5.2 1/1 0 4m 16s 4m 0s $0.267 569k / 8.9k
Hard10 baseline GPT-5.2 0/10 0 23m 24s 4m 5s $2.633 4.17M / 78k
Hard10 baseline GPT-5.5 0/10 0 7m 5s 1m 2s $1.890 954k / 25k
Hard10 with Continuity GPT-5.2 10/10 0 26m 7s 4m 32s $3.105 5.63M / 100k
Interpretation: The Continuity run used more context and spent about $0.47 more than the GPT-5.2 baseline rerun. It also converted every hidden verifier from fail to pass.
Adapter boundary: Harbor ran these jobs through its Codex adapter, which invokes codex exec. The first smoke and single-ledger jobs used older goal-baseline.jinja and goal-insanity.jinja prompt templates that prefixed the task with /goal. The Hard10 landing-page claim uses baseline.jinja and insanity.jinja, which do not invoke slash commands, so this packet should not be read as a test of interactive Codex thread-goal state.
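The deltas quoted in the interpretation note can be rechecked from the two GPT-5.2 Hard10 rows of the run ledger. The following is a minimal sketch, not part of the eval harness: the LedgerRow class and helper names are hypothetical, and the numbers are copied by hand from the ledger above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerRow:
    """One run-ledger row (hypothetical field names, for illustration only)."""
    label: str
    passed: int
    total: int
    wall_seconds: int
    cost_usd: float


def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a ledger wall time such as 23m 24s into seconds."""
    return minutes * 60 + seconds


def format_duration(total_seconds: int) -> str:
    """Render seconds back into the XmYYs form used on this page."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}m{seconds:02d}s"


# Values copied from the Hard10 rows of the run ledger.
baseline = LedgerRow("Hard10 baseline GPT-5.2", 0, 10, to_seconds(23, 24), 2.633)
continuity = LedgerRow("Hard10 with Continuity GPT-5.2", 10, 10, to_seconds(26, 7), 3.105)

# Accuracy delta in percentage points, wall-time delta in seconds, cost delta in USD.
accuracy_delta_pts = round(
    100 * (continuity.passed / continuity.total - baseline.passed / baseline.total)
)
wall_delta = continuity.wall_seconds - baseline.wall_seconds
cost_delta = round(continuity.cost_usd - baseline.cost_usd, 3)

print(f"accuracy: +{accuracy_delta_pts} pts")           # accuracy: +100 pts
print(f"wall time: +{format_duration(wall_delta)}")     # wall time: +2m43s
print(f"cost: +${cost_delta:.3f}")                      # cost: +$0.472
```

The comparison uses the GPT-5.2 baseline rerun (23m 24s, $2.633) rather than the original baseline job, because the rerun is the row the landing-page comparison relies on.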

Attempt log

Every top-level Harbor job we ran, including superseded runs.

Job Scope Agent / model Result Errors Wall time Cost How to read it
insanity-goal-ab-oracle first two statfix smoke tasks oracle 0/2 1 error, 1 cancelled - - Initial verifier sanity run; superseded after a cancelled trial.
insanity-goal-ab-oracle-shell first two statfix smoke tasks oracle shell 2/2 0 8s - Confirmed the smoke verifiers could pass before using Codex.
insanity-goal-ab-baseline-codex baseline statfix smoke Codex GPT-5.2 Codex error 1 agent error 20s - Bad model choice for local ChatGPT auth: gpt-5.2-codex was rejected.
insanity-goal-ab-baseline-codex-gpt52 baseline statfix smoke Codex GPT-5.2 0/1 0 run errors 1m 2s $0.067 Visible path passed, but hidden custom input exposed a count/mean bug.
insanity-goal-ab-insanity-codex-gpt52 Continuity statfix smoke Codex GPT-5.2 1/1 0 3m 5s $0.209 Protocol did not break the small task; rerun later for a cleaner paired smoke.
insanity-goal-ab-oracle-final first two statfix smoke tasks oracle 2/2 0 7s - Final oracle sanity check for the two smoke tasks.
insanity-goal-ab-baseline-codex-gpt52-final baseline statfix smoke Codex GPT-5.2 1/1 0 52s $0.060 Final baseline smoke result used in the run ledger.
insanity-goal-ab-insanity-codex-gpt52-final Continuity statfix smoke Codex GPT-5.2 1/1 0 1m 29s $0.101 Final Continuity smoke result used in the run ledger.
insanity-goal-ab-ledger-oracle single ledger slice oracle 1/2 0 run errors 10s - Oracle setup bug: verifier hit awk: cannot open "" for output.
insanity-goal-ab-ledger-oracle-fixed single ledger slice oracle 2/2 0 7s - Fixed oracle sanity check for the ledger task.
insanity-goal-ab-ledger-baseline-codex-gpt52 baseline ledger slice Codex GPT-5.2 0/1 0 run errors 5m 24s $0.300 Completed cleanly, failed hidden custom ledger fixtures.
insanity-goal-ab-ledger-insanity-codex-gpt52 Continuity ledger slice Codex GPT-5.2 1/1 0 4m 16s $0.267 Passed the same hidden ledger verifier.
insanity-protocol-ab-hard10-oracle Hard10 baseline + Continuity tasks oracle 10/20 0 run errors 25s - Oracle setup failed on Continuity variants because command-log proof was missing.
insanity-protocol-ab-hard10-oracle-fixed Hard10 baseline + Continuity tasks oracle 20/20 0 20s - Final Hard10 oracle sanity check.
insanity-protocol-ab-hard10-baseline-codex-gpt52 baseline Hard10 Codex GPT-5.2 0/10 0 run errors 23m 11s $2.612 Original Hard10 baseline result; rerun kept below for a cleaner final number.
insanity-protocol-ab-hard10-insanity-codex-gpt52 Continuity Hard10 Codex GPT-5.2 10/10 0 26m 7s $3.105 Final protocol result behind the 10/10 claim.
insanity-protocol-ab-hard10-baseline-codex-gpt52-rerun-20260508 baseline Hard10 Codex GPT-5.2 0/10 0 run errors 23m 24s $2.633 Final GPT-5.2 baseline result used in the landing-page comparison.
insanity-protocol-ab-hard10-baseline-codex-gpt55-rerun-20260508 baseline Hard10 Codex GPT-5.5 0/10 0 run errors 7m 5s $1.890 Model-upgrade baseline check; still failed all 10 hidden verifiers.

Score cells count verifier reward. The errors column counts Harbor run exceptions separately, so a clean run can still be a real model failure when the verifier returns zero.
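The distinction between a verifier failure and a run exception can be made concrete with a small classifier. This is an illustrative sketch only: JobResult and classify are hypothetical names, not Harbor APIs, and the three buckets are coarser than the types used in the errors table further down (for example, oracle setup problems are not separated out here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobResult:
    """Hypothetical summary of one job; not a Harbor data structure."""
    name: str
    passed: int          # trials whose verifier returned a reward
    total: int           # trials in the job
    run_exceptions: int  # harness-level exceptions, counted separately


def classify(job: JobResult) -> str:
    """Return 'run error', 'verifier fail', or 'pass' for a job."""
    if job.run_exceptions > 0:
        # The harness itself raised, so the score says little about the model.
        return "run error"
    if job.passed < job.total:
        # Clean run, but at least one verifier returned zero.
        return "verifier fail"
    return "pass"


# Scores and exception counts copied from the attempt log above.
hard10_baseline = JobResult(
    "insanity-protocol-ab-hard10-baseline-codex-gpt52-rerun-20260508", 0, 10, 0
)
hard10_continuity = JobResult(
    "insanity-protocol-ab-hard10-insanity-codex-gpt52", 10, 10, 0
)

print(classify(hard10_baseline))    # verifier fail
print(classify(hard10_continuity))  # pass
```

Checking exceptions before scores is the design choice that matters: a 0/10 with zero exceptions is evidence about the model, while a 0/10 caused by harness exceptions would not be.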

Problems attempted

The task suite was not one problem repeated with new names.

Smoke 1 baseline-goal-statfix

Repair a tiny stats report script so it prints count, total, min, max, and mean. Hidden checks exercised alternate input/output paths.

Smoke 2 insanity-goal-statfix

Same stats repair, but the repo required Continuity orientation before editing. This verified the protocol did not break a small task.

First hard slice ledger reconcile

Repair a multi-file ledger reconciliation script with account overrides, refunds, duplicates, invalid amounts, unknown accounts, and custom files.

# Hard10 problem What it reconciled GPT-5.2 baseline GPT-5.5 baseline Continuity protocol
01 receivables customer receivables activity into report and exception files 0 0 1
02 warehouse inventory movements with invalid, duplicate, and orphan records 0 0 1
03 subscriptions subscription balance events and exception ordering 0 0 1
04 usage quota usage events with custom fixture overrides 0 0 1
05 support support credit/debit activity with hidden edge cases 0 0 1
06 shipping shipping claim movements and exception reasons 0 0 1
07 builds build minute accounting across posted and invalid events 0 0 1
08 access access grant usage with inactive and unknown entities 0 0 1
09 invoices invoice corrections, credits, and hidden fixture ordering 0 0 1
10 credits credit adjustments with duplicates, orphan credits, and invalid rows 0 0 1

The hard tasks share a shape because they measure the same capability: can an agent repair a shell reconciliation program while preserving file overrides, row ordering, exception ordering, dependency-free execution, and hidden custom fixtures?

Errors and failures

Verifier failures are separate from run exceptions.

Where What happened Type Status
Early smoke oracle insanity-goal-ab-oracle left one trial cancelled with CancelledError. run error Superseded by insanity-goal-ab-oracle-final, which passed 2/2.
First baseline smoke Codex run insanity-goal-ab-baseline-codex exited with NonZeroAgentExitCodeError and a report diff. agent error Superseded by explicit GPT-5.2 runs.
Initial GPT-5.2 smoke baseline The visible fixture passed, but a custom hidden report expected count: 3 and the script produced count: 4. verifier fail Final smoke baseline passed 1/1 after rerun.
First ledger oracle The oracle run hit awk: cannot open "" for output and missing report/exceptions files on the baseline variant. oracle setup Fixed by insanity-goal-ab-ledger-oracle-fixed, which passed 2/2.
First hard10 oracle Continuity variants failed because the oracle solution did not produce /app/.insanity/command-log.txt. oracle setup Fixed by insanity-protocol-ab-hard10-oracle-fixed, which passed 20/20.
Hard10 baselines Both GPT-5.2 and GPT-5.5 completed with 0 exceptions, then failed hidden custom fixtures: invalid amounts, inactive entities, orphan credits, unknown entities, unknown kinds, and duplicates. verifier fail Kept as the baseline result: 0/10 and 0/10.
Hard10 Continuity protocol No run exceptions and no verifier failures across all 10 tasks. pass Used as the proof result: 10/10.

Interactive tmux sidecar

20 real Codex TUI runs, scored separately from Harbor.

Not part of the Harbor headline score: this sidecar exists because Harbor uses codex exec. These runs drove the Codex TUI through tmux with GPT-5.5, accepted the project-trust prompt, captured transcripts, and then ran the same local verifiers. They are useful evidence, but they do not replace the Harbor hard10 headline.
How to read the variants: plain is the task prompt only, goal sends the full task as an interactive /goal command, insanity requires the Continuity orientation workflow, and goal-insanity combines both. The scored matrix used external workspaces under /private/tmp/insanity-tmux-goal-ab-workspaces to avoid contamination from this repo's own AGENTS.md.
plain 3/5

Average wall time 299s. Passed statfix, warehouse, and invoices.

/goal only 2/5

Average wall time 356s. Underperformed the plain prompt on this matrix.

Continuity only 4/5

Average wall time 306s. Added the receivables pass while matching all three plain wins.

/goal + Continuity 4/5

Average wall time 358s. Matched Continuity-only accuracy with more latency.

Scenario Prompt plain /goal Continuity /goal + Continuity Readout
statfix-clear clear 1 / 134s (2026-05-08T22-50-32-505Z-statfix-clear-plain-593a3f) 1 / 177s (2026-05-08T23-00-23-463Z-statfix-clear-goal-0b52f3) 1 / 184s (2026-05-08T23-03-35-316Z-statfix-clear-insanity-e72f39) 1 / 254s (2026-05-08T23-06-49-635Z-statfix-clear-goal-insanity-5bdee3) All variants solved the small stats repair; protocols added latency.
ledger-clear clear 0 / 225s (2026-05-08T23-12-22-057Z-ledger-clear-plain-d3d967) 0 / 263s (2026-05-08T23-16-07-358Z-ledger-clear-goal-e7e7b4) 0 / 327s (2026-05-08T23-20-30-913Z-ledger-clear-insanity-b6419d) 0 / 279s (2026-05-08T23-25-59-176Z-ledger-clear-goal-insanity-5e944f) No variant solved this ledger task in the interactive matrix.
receivables-vague vague 0 / 367s (2026-05-08T23-30-39-415Z-receivables-vague-plain-5624c2) 0 / 431s (2026-05-08T23-36-46-474Z-receivables-vague-goal-8626cb) 1 / 367s (2026-05-08T23-43-57-594Z-receivables-vague-insanity-8e7957) 1 / 446s (2026-05-08T23-50-06-016Z-receivables-vague-goal-insanity-0d4f59) Continuity variants found the hidden accounting requirements that plain and /goal missed.
warehouse-vague vague 1 / 392s (2026-05-08T23-57-33-948Z-warehouse-vague-plain-b0c058) 1 / 517s (2026-05-09T00-04-06-425Z-warehouse-vague-goal-40aaef) 1 / 326s (2026-05-09T00-12-43-639Z-warehouse-vague-insanity-7c212c) 1 / 390s (2026-05-09T00-18-11-243Z-warehouse-vague-goal-insanity-334ec6) All variants solved it; Continuity-only was fastest in the valid set.
invoices-vague vague 1 / 377s (2026-05-09T00-24-42-713Z-invoices-vague-plain-f2f5e2) 0 / 390s (2026-05-09T00-30-59-831Z-invoices-vague-goal-c1bf3f) 1 / 326s (2026-05-09T00-37-30-211Z-invoices-vague-insanity-bc9718) 1 / 421s (2026-05-09T00-42-58-164Z-invoices-vague-goal-insanity-4fbbd1) /goal alone lost this one; Continuity variants passed.
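The per-variant pass counts and average wall times reported for the sidecar can be recomputed from the scored matrix. The sketch below is an illustration rather than the harness code: the MATRIX dictionary and summarize function are hypothetical, the variant keys follow the suffixes seen in the run IDs, and each cell is a (verifier result, wall seconds) pair copied from the matrix above.

```python
# (verifier result, wall seconds) per scenario and variant, copied from the scored matrix.
MATRIX = {
    "statfix-clear": {
        "plain": (1, 134), "goal": (1, 177), "insanity": (1, 184), "goal-insanity": (1, 254),
    },
    "ledger-clear": {
        "plain": (0, 225), "goal": (0, 263), "insanity": (0, 327), "goal-insanity": (0, 279),
    },
    "receivables-vague": {
        "plain": (0, 367), "goal": (0, 431), "insanity": (1, 367), "goal-insanity": (1, 446),
    },
    "warehouse-vague": {
        "plain": (1, 392), "goal": (1, 517), "insanity": (1, 326), "goal-insanity": (1, 390),
    },
    "invoices-vague": {
        "plain": (1, 377), "goal": (0, 390), "insanity": (1, 326), "goal-insanity": (1, 421),
    },
}

VARIANTS = ["plain", "goal", "insanity", "goal-insanity"]


def summarize(matrix):
    """Return {variant: (passes, trials, average wall seconds rounded to an int)}."""
    summary = {}
    for variant in VARIANTS:
        cells = [row[variant] for row in matrix.values()]
        passes = sum(result for result, _ in cells)
        average_wall = round(sum(seconds for _, seconds in cells) / len(cells))
        summary[variant] = (passes, len(cells), average_wall)
    return summary


for variant, (passes, trials, average_wall) in summarize(MATRIX).items():
    print(f"{variant}: {passes}/{trials}, average wall time {average_wall}s")
# plain: 3/5, average wall time 299s
# goal: 2/5, average wall time 356s
# insanity: 4/5, average wall time 306s
# goal-insanity: 4/5, average wall time 358s
```

With five scenarios per variant, a single trial moves a score by 20 points, so these counts are best read as directional evidence alongside the Harbor Hard10 result.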
Unscored tmux attempt Variant Verifier Wall time Why it is not in the matrix
2026-05-08T22-13-42-013Z-statfix-clear-plain-7bbb6f plain 0 21s Harness combined incompatible approval/sandbox flags, so Codex exited before receiving the prompt.
2026-05-08T22-14-56-090Z-statfix-clear-plain-156b86 plain 0 62s The update notice plus multiline paste left the task prompt in the composer instead of submitting it.
2026-05-08T22-17-33-176Z-statfix-clear-plain-f34df2 plain 1 158s Early successful smoke before external workspaces, trust handling, and explicit validTrial fields; superseded by the scored plain statfix run.
2026-05-08T22-31-04-913Z-statfix-clear-goal-81b48a /goal 0 67s The resume command landed in the Codex composer instead of the shell, so no project edits were made.
2026-05-08T22-34-13-417Z-statfix-clear-goal-3e9550 /goal 0 87s The harness sent /goal, then could not return from the TUI to submit the task prompt.
2026-05-08T22-37-00-021Z-statfix-clear-goal-2ee3e5 /goal 1 330s Workspace lived inside the Continuity repo, so repo-level AGENTS.md caused Continuity commands in a nominal no-Continuity variant.
2026-05-08T22-45-49-526Z-statfix-clear-plain-bc09ec plain 0 75s External workspace run stopped at the Codex project-trust prompt before the task reached the agent.
2026-05-08T22-48-27-245Z-statfix-clear-plain-586047 plain 0 75s Same trust-prompt stop as the previous external workspace attempt.
2026-05-08T22-53-24-371Z-statfix-clear-goal-f79ad7 /goal 0 232s The task prompt was sent after /goal had already started executing, so it remained in the composer.
2026-05-08T22-58-20-750Z-statfix-clear-goal-302918 /goal 0 75s The full /goal prompt was injected too quickly and remained as pasted composer content.

Raw artifact paths

Local artifact roots and files behind this page.

jobs/insanity-goal-ab-oracle
jobs/insanity-goal-ab-oracle-shell
jobs/insanity-goal-ab-baseline-codex
jobs/insanity-goal-ab-baseline-codex-gpt52
jobs/insanity-goal-ab-insanity-codex-gpt52
jobs/insanity-goal-ab-oracle-final
jobs/insanity-goal-ab-baseline-codex-gpt52-final
jobs/insanity-goal-ab-insanity-codex-gpt52-final
jobs/insanity-goal-ab-ledger-oracle
jobs/insanity-goal-ab-ledger-oracle-fixed
jobs/insanity-goal-ab-ledger-baseline-codex-gpt52
jobs/insanity-goal-ab-ledger-insanity-codex-gpt52
jobs/insanity-protocol-ab-hard10-oracle
jobs/insanity-protocol-ab-hard10-oracle-fixed
jobs/insanity-protocol-ab-hard10-baseline-codex-gpt52
jobs/insanity-protocol-ab-hard10-insanity-codex-gpt52
jobs/insanity-protocol-ab-hard10-baseline-codex-gpt52-rerun-20260508
jobs/insanity-protocol-ab-hard10-baseline-codex-gpt55-rerun-20260508
evals/tmux-goal-ab/manifest.json
evals/tmux-goal-ab/scripts/run-trial.mjs
evals/tmux-goal-ab/scripts/run-matrix.mjs
jobs/tmux-goal-ab