Eval packet

What we ran, what failed, and what it cost.

The headline result is real, but it is not magic. This page shows the first two smoke checks, the single hard ledger slice, the 10-task hard batch, verifier failures, setup errors, latency, cost, the local Harbor job paths behind the landing-page claim, and a separate real interactive Codex TUI sidecar.

Hard10 baseline 0/10

Codex GPT-5.2 and GPT-5.5 both finished cleanly and failed every hidden verifier.

Hard10 with Continuity 10/10

Codex GPT-5.2 passed the same task family with graph orientation and reconciliation.

Infra exceptions 0

The final GPT-5.2 baseline, the GPT-5.5 baseline, and the Continuity Hard10 runs all completed without run exceptions.

Interactive sidecar 20

Valid GPT-5.5 tmux trials, plus 10 unscored harness/setup attempts retained below.

Delta +100 pts

Accuracy moved from 0% to 100%; wall time increased by about 2m 43s versus the GPT-5.2 baseline rerun.

Run ledger

Every result we are using for the claim.

Run	Variant	Model	Score	Wall time	Avg agent time	Cost	Tokens in / out
Smoke statfix	baseline	GPT-5.2	1/1	52s	37s	$0.060	148k / 1.2k
Smoke statfix	with Continuity	GPT-5.2	1/1	1m 29s	1m 15s	$0.101	291k / 2.6k
Single ledger	baseline	GPT-5.2	0/1	5m 24s	5m 7s	$0.300	445k / 8.0k
Single ledger	with Continuity	GPT-5.2	1/1	4m 16s	4m 0s	$0.267	569k / 8.9k
Hard10	baseline	GPT-5.2	0/10	23m 24s	4m 5s	$2.633	4.17M / 78k
Hard10	baseline	GPT-5.5	0/10	7m 5s	1m 2s	$1.890	954k / 25k
Hard10	with Continuity	GPT-5.2	10/10	26m 7s	4m 32s	$3.105	5.63M / 100k

Interpretation: The Continuity run used more context (5.63M input tokens versus 4.17M) and cost about $0.47 more than the GPT-5.2 baseline rerun. It also converted every hidden verifier from fail to pass.
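
The deltas quoted on this page can be rechecked directly from the two Hard10 rows of the run ledger. The following is a minimal sketch; the dictionary layout and the helper names are illustrative assumptions, and only the numbers come from the ledger.

```python
# Recompute the headline deltas from the Hard10 ledger rows.
# The dict layout and helper names are illustrative; the values are copied from the ledger.

def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'Xm Ys' wall time into seconds."""
    return minutes * 60 + seconds

baseline = {"passed": 0, "tasks": 10, "wall_s": to_seconds(23, 24), "cost_usd": 2.633}
continuity = {"passed": 10, "tasks": 10, "wall_s": to_seconds(26, 7), "cost_usd": 3.105}

def accuracy_pct(run: dict) -> float:
    """Share of hidden verifiers passed, as a percentage."""
    return 100.0 * run["passed"] / run["tasks"]

delta_points = accuracy_pct(continuity) - accuracy_pct(baseline)
delta_wall_s = continuity["wall_s"] - baseline["wall_s"]
delta_cost = round(continuity["cost_usd"] - baseline["cost_usd"], 3)

print(f"accuracy delta: {delta_points:+.0f} pts")                      # +100 pts
print(f"wall time delta: {delta_wall_s // 60}m {delta_wall_s % 60}s")  # 2m 43s
print(f"cost delta: ${delta_cost:.3f}")                                # $0.472
```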

Adapter boundary: Harbor ran these jobs through its Codex adapter, which invokes codex exec. The first smoke and single-ledger jobs used older goal-baseline.jinja and goal-insanity.jinja prompt templates that prefixed the task with /goal. The Hard10 landing-page claim uses baseline.jinja and insanity.jinja, which do not invoke slash commands, so this packet should not be read as a test of interactive Codex thread-goal state.

Attempt log

Every top-level Harbor job we ran, including superseded runs.

Job	Scope	Agent / model	Result	Errors	Wall time	Cost	How to read it
`insanity-goal-ab-oracle`	first two statfix smoke tasks	oracle	0/2	1 error, 1 cancelled	-	-	Initial verifier sanity run; superseded after a cancelled trial.
`insanity-goal-ab-oracle-shell`	first two statfix smoke tasks	oracle shell	2/2	0	8s	-	Confirmed the smoke verifiers could pass before using Codex.
`insanity-goal-ab-baseline-codex`	baseline statfix smoke	Codex GPT-5.2 Codex	error	1 agent error	20s	-	Bad model choice for local ChatGPT auth: `gpt-5.2-codex` was rejected.
`insanity-goal-ab-baseline-codex-gpt52`	baseline statfix smoke	Codex GPT-5.2	0/1	0 run errors	1m 2s	$0.067	Visible path passed, but hidden custom input exposed a count/mean bug.
`insanity-goal-ab-insanity-codex-gpt52`	Continuity statfix smoke	Codex GPT-5.2	1/1	0	3m 5s	$0.209	Protocol did not break the small task; rerun later for a cleaner paired smoke.
`insanity-goal-ab-oracle-final`	first two statfix smoke tasks	oracle	2/2	0	7s	-	Final oracle sanity check for the two smoke tasks.
`insanity-goal-ab-baseline-codex-gpt52-final`	baseline statfix smoke	Codex GPT-5.2	1/1	0	52s	$0.060	Final baseline smoke result used in the run ledger.
`insanity-goal-ab-insanity-codex-gpt52-final`	Continuity statfix smoke	Codex GPT-5.2	1/1	0	1m 29s	$0.101	Final Continuity smoke result used in the run ledger.
`insanity-goal-ab-ledger-oracle`	single ledger slice	oracle	1/2	0 run errors	10s	-	Oracle setup bug: verifier hit `awk: cannot open "" for output`.
`insanity-goal-ab-ledger-oracle-fixed`	single ledger slice	oracle	2/2	0	7s	-	Fixed oracle sanity check for the ledger task.
`insanity-goal-ab-ledger-baseline-codex-gpt52`	baseline ledger slice	Codex GPT-5.2	0/1	0 run errors	5m 24s	$0.300	Completed cleanly, failed hidden custom ledger fixtures.
`insanity-goal-ab-ledger-insanity-codex-gpt52`	Continuity ledger slice	Codex GPT-5.2	1/1	0	4m 16s	$0.267	Passed the same hidden ledger verifier.
`insanity-protocol-ab-hard10-oracle`	Hard10 baseline + Continuity tasks	oracle	10/20	0 run errors	25s	-	Oracle setup failed on Continuity variants because command-log proof was missing.
`insanity-protocol-ab-hard10-oracle-fixed`	Hard10 baseline + Continuity tasks	oracle	20/20	0	20s	-	Final Hard10 oracle sanity check.
`insanity-protocol-ab-hard10-baseline-codex-gpt52`	baseline Hard10	Codex GPT-5.2	0/10	0 run errors	23m 11s	$2.612	Original Hard10 baseline result; rerun kept below for a cleaner final number.
`insanity-protocol-ab-hard10-insanity-codex-gpt52`	Continuity Hard10	Codex GPT-5.2	10/10	0	26m 7s	$3.105	Final protocol result behind the 10/10 claim.
`insanity-protocol-ab-hard10-baseline-codex-gpt52-rerun-20260508`	baseline Hard10	Codex GPT-5.2	0/10	0 run errors	23m 24s	$2.633	Final GPT-5.2 baseline result used in the landing-page comparison.
`insanity-protocol-ab-hard10-baseline-codex-gpt55-rerun-20260508`	baseline Hard10	Codex GPT-5.5	0/10	0 run errors	7m 5s	$1.890	Model-upgrade baseline check; still failed all 10 hidden verifiers.

Score cells count verifier reward. The errors column counts Harbor run exceptions separately, so a clean run can still be a real model failure when the verifier returns zero.
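
To make the distinction concrete, here is a minimal sketch of how a trial could be sorted into pass, verifier fail, or run error. The `Trial` record, its field names, and the assumption that a reward of 1.0 means a pass are hypothetical; they are not Harbor's actual result schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Trial:
    # Hypothetical record; Harbor's real result format may differ.
    reward: float                    # verifier reward, assumed 1.0 for pass and 0.0 for fail
    exception: Optional[str] = None  # run exception name, or None for a clean run

def classify(trial: Trial) -> str:
    """Run exceptions are counted separately from verifier outcomes."""
    if trial.exception is not None:
        return "run error"
    return "pass" if trial.reward >= 1.0 else "verifier fail"

def summarize(trials: List[Trial]) -> Tuple[str, int]:
    """Return (score cell, errors cell) in the style of the attempt log."""
    passed = sum(1 for t in trials if classify(t) == "pass")
    errors = sum(1 for t in trials if classify(t) == "run error")
    return f"{passed}/{len(trials)}", errors

# Ten clean runs that all fail the verifier: a real model failure with zero run errors.
hard10_baseline = [Trial(reward=0.0) for _ in range(10)]
print(summarize(hard10_baseline))  # ('0/10', 0)
```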

Problems attempted

The task suite was not one problem repeated with new names.

Smoke 1 baseline-goal-statfix

Repair a tiny stats report script so it prints count, total, min, max, and mean. Hidden checks exercised alternate input/output paths.

Smoke 2 insanity-goal-statfix

Same stats repair, but the repo required Continuity orientation before editing. This verified the protocol did not break a small task.

First hard slice ledger reconcile

Repair a multi-file ledger reconciliation script with account overrides, refunds, duplicates, invalid amounts, unknown accounts, and custom files.

#	Hard10 problem	What it reconciled	Continuity protocol
01	receivables	customer receivables activity into report and exception files	1
02	warehouse	inventory movements with invalid, duplicate, and orphan records	1
03	subscriptions	subscription balance events and exception ordering	1
04	usage	quota usage events with custom fixture overrides	1
05	support	support credit/debit activity with hidden edge cases	1
06	shipping	shipping claim movements and exception reasons	1
07	builds	build minute accounting across posted and invalid events	1
08	access	access grant usage with inactive and unknown entities	1
09	invoices	invoice corrections, credits, and hidden fixture ordering	1
10	credits	credit adjustments with duplicates, orphan credits, and invalid rows	1

The hard tasks share a shape because they measure the same capability: can an agent repair a shell reconciliation program while preserving file overrides, row ordering, exception ordering, dependency-free execution, and hidden custom fixtures?
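
As a rough illustration of that shared shape, the sketch below reconciles a few rows while keeping input order for both totals and exceptions. It is not any of the Hard10 tasks: the real tasks are shell programs with their own file formats, and the row layout, reason strings, and check order used here are all assumptions made for the example.

```python
from decimal import Decimal, InvalidOperation

def reconcile(rows, known_accounts):
    """Toy reconciliation. rows are (txn_id, account, amount_text) in input order.

    Returns (totals, exceptions). Reason strings and the order of checks
    (duplicate, then unknown account, then invalid amount) are hypothetical.
    """
    totals = {}      # account -> Decimal; dict order follows the first valid row per account
    exceptions = []  # (txn_id, reason), kept in input order
    seen = set()
    for txn_id, account, amount_text in rows:
        if txn_id in seen:
            exceptions.append((txn_id, "duplicate"))
            continue
        seen.add(txn_id)
        if account not in known_accounts:
            exceptions.append((txn_id, "unknown_account"))
            continue
        try:
            amount = Decimal(amount_text)
        except InvalidOperation:
            amount = None
        if amount is None or not amount.is_finite():  # also rejects NaN and Infinity
            exceptions.append((txn_id, "invalid_amount"))
            continue
        totals[account] = totals.get(account, Decimal("0")) + amount
    return totals, exceptions

rows = [
    ("t1", "acme", "10.00"),
    ("t2", "acme", "oops"),    # invalid amount
    ("t1", "acme", "10.00"),   # duplicate transaction id
    ("t3", "ghost", "5.00"),   # unknown account
    ("t4", "beta", "-2.50"),
]
totals, exceptions = reconcile(rows, known_accounts={"acme", "beta"})
print(totals)
print(exceptions)
```

A visible fixture that contains only well-formed rows would not exercise any of the exception branches, which is the kind of gap a hidden custom fixture is meant to expose.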

Errors and failures

Verifier failures are separate from run exceptions.

Where	What happened	Type	Status
Early smoke oracle	`insanity-goal-ab-oracle` left one trial cancelled with `CancelledError`.	run error	Superseded by `insanity-goal-ab-oracle-final`, which passed 2/2.
First baseline smoke Codex run	`insanity-goal-ab-baseline-codex` exited with `NonZeroAgentExitCodeError` and a report diff.	agent error	Superseded by explicit GPT-5.2 runs.
Initial GPT-5.2 smoke baseline	The visible fixture passed, but a custom hidden report expected `count: 3` and the script produced `count: 4`.	verifier fail	Final smoke baseline passed 1/1 after rerun.
First ledger oracle	The oracle run hit `awk: cannot open "" for output` and missing report/exceptions files on the baseline variant.	oracle setup	Fixed by `insanity-goal-ab-ledger-oracle-fixed`, which passed 2/2.
First hard10 oracle	Continuity variants failed because the oracle solution did not produce `/app/.insanity/command-log.txt`.	oracle setup	Fixed by `insanity-protocol-ab-hard10-oracle-fixed`, which passed 20/20.
Hard10 baselines	Both GPT-5.2 and GPT-5.5 completed with 0 exceptions, then failed hidden custom fixtures: invalid amounts, inactive entities, orphan credits, unknown entities, unknown kinds, and duplicates.	verifier fail	Kept as the baseline result: 0/10 and 0/10.
Hard10 Continuity protocol	No run exceptions and no verifier failures across all 10 tasks.	pass	Used as the proof result: 10/10.

Interactive tmux sidecar

20 real Codex TUI runs, scored separately from Harbor.

Not part of the Harbor headline score: this sidecar exists because Harbor uses codex exec. These runs drove the Codex TUI through tmux with GPT-5.5, accepted the project-trust prompt, captured transcripts, and then ran the same local verifiers. They are useful evidence, but they do not replace the Harbor Hard10 headline.

How to read the variants: plain is the task prompt only, goal sends the full task as an interactive /goal command, insanity requires the Continuity orientation workflow, and goal-insanity combines both. The scored matrix used external workspaces under /private/tmp/insanity-tmux-goal-ab-workspaces to avoid contamination from this repo's own AGENTS.md.

plain 3/5

Average wall time 299s. Passed statfix, warehouse, and invoices.

/goal only 2/5

Average wall time 356s. Underperformed the plain prompt on this matrix.

Continuity only 4/5

Average wall time 306s. Added the receivables pass while matching all three plain wins.

/goal + Continuity 4/5

Average wall time 358s. Matched Continuity-only accuracy with more latency.

Scenario	Prompt	plain	/goal	Continuity	/goal + Continuity	Readout
statfix-clear	clear	1 / 134s `2026-05-08T22-50-32-505Z-statfix-clear-plain-593a3f`	1 / 177s `2026-05-08T23-00-23-463Z-statfix-clear-goal-0b52f3`	1 / 184s `2026-05-08T23-03-35-316Z-statfix-clear-insanity-e72f39`	1 / 254s `2026-05-08T23-06-49-635Z-statfix-clear-goal-insanity-5bdee3`	All variants solved the small stats repair; protocols added latency.
ledger-clear	clear	0 / 225s `2026-05-08T23-12-22-057Z-ledger-clear-plain-d3d967`	0 / 263s `2026-05-08T23-16-07-358Z-ledger-clear-goal-e7e7b4`	0 / 327s `2026-05-08T23-20-30-913Z-ledger-clear-insanity-b6419d`	0 / 279s `2026-05-08T23-25-59-176Z-ledger-clear-goal-insanity-5e944f`	No variant solved this ledger task in the interactive matrix.
receivables-vague	vague	0 / 367s `2026-05-08T23-30-39-415Z-receivables-vague-plain-5624c2`	0 / 431s `2026-05-08T23-36-46-474Z-receivables-vague-goal-8626cb`	1 / 367s `2026-05-08T23-43-57-594Z-receivables-vague-insanity-8e7957`	1 / 446s `2026-05-08T23-50-06-016Z-receivables-vague-goal-insanity-0d4f59`	Continuity variants found the hidden accounting requirements that plain and `/goal` missed.
warehouse-vague	vague	1 / 392s `2026-05-08T23-57-33-948Z-warehouse-vague-plain-b0c058`	1 / 517s `2026-05-09T00-04-06-425Z-warehouse-vague-goal-40aaef`	1 / 326s `2026-05-09T00-12-43-639Z-warehouse-vague-insanity-7c212c`	1 / 390s `2026-05-09T00-18-11-243Z-warehouse-vague-goal-insanity-334ec6`	All variants solved it; Continuity-only was fastest in the valid set.
invoices-vague	vague	1 / 377s `2026-05-09T00-24-42-713Z-invoices-vague-plain-f2f5e2`	0 / 390s `2026-05-09T00-30-59-831Z-invoices-vague-goal-c1bf3f`	1 / 326s `2026-05-09T00-37-30-211Z-invoices-vague-insanity-bc9718`	1 / 421s `2026-05-09T00-42-58-164Z-invoices-vague-goal-insanity-4fbbd1`	`/goal` alone lost this one; Continuity variants passed.
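
The per-variant pass counts and average wall times quoted above the matrix can be rederived from its cells. This is a minimal sketch: the nested-dict layout and the `summarize` helper are illustrative assumptions, while the (verifier result, wall seconds) pairs are copied from the matrix.

```python
# (verifier result, wall time in seconds) per scenario and variant, copied from the matrix.
# Variant keys follow the trial ID suffixes: plain, goal, insanity, goal-insanity.
matrix = {
    "statfix-clear":     {"plain": (1, 134), "goal": (1, 177), "insanity": (1, 184), "goal-insanity": (1, 254)},
    "ledger-clear":      {"plain": (0, 225), "goal": (0, 263), "insanity": (0, 327), "goal-insanity": (0, 279)},
    "receivables-vague": {"plain": (0, 367), "goal": (0, 431), "insanity": (1, 367), "goal-insanity": (1, 446)},
    "warehouse-vague":   {"plain": (1, 392), "goal": (1, 517), "insanity": (1, 326), "goal-insanity": (1, 390)},
    "invoices-vague":    {"plain": (1, 377), "goal": (0, 390), "insanity": (1, 326), "goal-insanity": (1, 421)},
}

def summarize(variant: str):
    """Return (passes, trials, average wall seconds rounded to the nearest second)."""
    cells = [scenario[variant] for scenario in matrix.values()]
    passes = sum(result for result, _ in cells)
    avg_wall = round(sum(wall for _, wall in cells) / len(cells))
    return passes, len(cells), avg_wall

for variant in ("plain", "goal", "insanity", "goal-insanity"):
    passes, trials, avg_wall = summarize(variant)
    print(f"{variant}: {passes}/{trials}, avg {avg_wall}s")
# plain: 3/5, avg 299s
# goal: 2/5, avg 356s
# insanity: 4/5, avg 306s
# goal-insanity: 4/5, avg 358s
```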

Unscored tmux attempt	Variant	Verifier	Wall time	Why it is not in the matrix
`2026-05-08T22-13-42-013Z-statfix-clear-plain-7bbb6f`	plain	0	21s	Harness combined incompatible approval/sandbox flags, so Codex exited before receiving the prompt.
`2026-05-08T22-14-56-090Z-statfix-clear-plain-156b86`	plain	0	62s	The update notice plus multiline paste left the task prompt in the composer instead of submitting it.
`2026-05-08T22-17-33-176Z-statfix-clear-plain-f34df2`	plain	1	158s	Early successful smoke before external workspaces, trust handling, and explicit `validTrial` fields; superseded by the scored plain statfix run.
`2026-05-08T22-31-04-913Z-statfix-clear-goal-81b48a`	/goal	0	67s	The resume command landed in the Codex composer instead of the shell, so no project edits were made.
`2026-05-08T22-34-13-417Z-statfix-clear-goal-3e9550`	/goal	0	87s	The harness sent `/goal`, then could not return from the TUI to submit the task prompt.
`2026-05-08T22-37-00-021Z-statfix-clear-goal-2ee3e5`	/goal	1	330s	Workspace lived inside the Continuity repo, so repo-level `AGENTS.md` caused Continuity commands in a nominal no-Continuity variant.
`2026-05-08T22-45-49-526Z-statfix-clear-plain-bc09ec`	plain	0	75s	External workspace run stopped at the Codex project-trust prompt before the task reached the agent.
`2026-05-08T22-48-27-245Z-statfix-clear-plain-586047`	plain	0	75s	Same trust-prompt stop as the previous external workspace attempt.
`2026-05-08T22-53-24-371Z-statfix-clear-goal-f79ad7`	/goal	0	232s	The task prompt was sent after `/goal` had already started executing, so it remained in the composer.
`2026-05-08T22-58-20-750Z-statfix-clear-goal-302918`	/goal	0	75s	The full `/goal` prompt was injected too quickly and remained as pasted composer content.
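
Several of the unscored attempts above failed because text reached the composer before the TUI was ready for it, or was never submitted. The sketch below shows one way a driver could order its tmux calls and gate each step on screen contents. It is not the `run-trial.mjs` harness: the function names, the launch command argument, and the readiness marker are hypothetical, and the sketch only builds the tmux argument lists without executing them.

```python
from typing import List

def tmux_plan(session: str, workdir: str, launch_cmd: str, prompt: str) -> List[List[str]]:
    """Build the tmux invocations for one trial, in order.

    launch_cmd is whatever starts the TUI (an assumption left to the caller).
    The prompt is collapsed to one line so embedded newlines are not typed mid-prompt.
    """
    one_line_prompt = " ".join(prompt.split())
    return [
        # Start a detached session running the TUI in the external workspace.
        ["tmux", "new-session", "-d", "-s", session, "-c", workdir, launch_cmd],
        # Type the prompt literally (-l disables key-name lookup).
        ["tmux", "send-keys", "-t", session, "-l", one_line_prompt],
        # Submit it as a separate keypress.
        ["tmux", "send-keys", "-t", session, "Enter"],
        # Print the pane contents so the transcript can be saved.
        ["tmux", "capture-pane", "-p", "-t", session],
    ]

def screen_ready(screen_text: str, marker: str) -> bool:
    """Gate for the next step: True once the captured screen shows the marker.

    A real driver would poll capture-pane with this check before typing,
    for example to confirm a trust prompt has been dismissed.
    """
    return marker in screen_text

plan = tmux_plan("trial-1", "/tmp/workspace", "codex", "Fix the report\nscript.")
for argv in plan:
    print(" ".join(argv))
```

Sending the prompt text and the Enter key as separate steps leaves room for a readiness check between them.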

Raw artifact paths

Local artifact roots and files behind this page.

jobs/insanity-goal-ab-oracle
jobs/insanity-goal-ab-oracle-shell
jobs/insanity-goal-ab-baseline-codex
jobs/insanity-goal-ab-baseline-codex-gpt52
jobs/insanity-goal-ab-insanity-codex-gpt52
jobs/insanity-goal-ab-oracle-final
jobs/insanity-goal-ab-baseline-codex-gpt52-final
jobs/insanity-goal-ab-insanity-codex-gpt52-final
jobs/insanity-goal-ab-ledger-oracle
jobs/insanity-goal-ab-ledger-oracle-fixed
jobs/insanity-goal-ab-ledger-baseline-codex-gpt52
jobs/insanity-goal-ab-ledger-insanity-codex-gpt52
jobs/insanity-protocol-ab-hard10-oracle
jobs/insanity-protocol-ab-hard10-oracle-fixed
jobs/insanity-protocol-ab-hard10-baseline-codex-gpt52
jobs/insanity-protocol-ab-hard10-insanity-codex-gpt52
jobs/insanity-protocol-ab-hard10-baseline-codex-gpt52-rerun-20260508
jobs/insanity-protocol-ab-hard10-baseline-codex-gpt55-rerun-20260508
evals/tmux-goal-ab/manifest.json
evals/tmux-goal-ab/scripts/run-trial.mjs
evals/tmux-goal-ab/scripts/run-matrix.mjs
jobs/tmux-goal-ab