AI Coding Bake-off Ep. 6: Four Models, One Landmine — GPT-5.5 & GLM 5.2 Tie at 94 (Issue #1134)

i. The assignment

A knob I couldn't reach.

In plain English — my bot decides when to bail out of a trade. Two of those exits are safety nets: a stop-loss ("sell if it drops past this line so I don't get wiped out") and a trailing stop (a net that follows the price up to lock in gains). How much room each net gives the price is tuned to the market — calm days get a tighter net, wild days a looser one. Right now those numbers are baked into the program. Fine, until I want to retune them across the whole bot at once — then I'd have to edit the setting in every single strategy by hand, or edit the code and rebuild the whole thing. Issue #1134 asked for the obvious fix: one knob I set once that every safety net picks up — unless a specific strategy wants to override it. Just an added setting. Sounds boring. It isn't — because of one strategy type.
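For readers who think in code, here is a minimal Python sketch of the precedence the issue asks for: a strategy's own override wins, then the global knob, then the built-in number. Every name and number in it (`BUILT_IN_STOP_WIDTH`, `resolve_stop_width`, the "calm" and "wild" labels) is made up for illustration; the bot's real code is not shown in this write-up and may look nothing like this.

```python
from typing import Mapping, Optional

# Hypothetical built-in numbers: how much room the net gives, per market condition.
BUILT_IN_STOP_WIDTH = {"calm": 1.5, "wild": 3.0}


def resolve_stop_width(
    regime: str,
    global_knob: Optional[Mapping[str, float]] = None,
    strategy_override: Optional[Mapping[str, float]] = None,
) -> float:
    """Pick the width for one regime: strategy override > global knob > built-in."""
    for source in (strategy_override, global_knob):
        if source is not None and regime in source:
            return source[regime]
    return BUILT_IN_STOP_WIDTH[regime]
```

With no knob set, `resolve_stop_width("calm")` falls through to the built-in value; setting the knob changes every strategy that has no override of its own, which is the "set it once" behavior the issue describes.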

Read the full issue #1134 on GitHub ↗

ii. The landmine

A lazy fix switches off a live safety net.

Here's the trap. Some of my strategies run in "manual" mode, and those already get their safety-net settings from a different place I built earlier. The new knob is only supposed to feed the normal strategies. The problem: to the computer, a manual strategy's safety net looks exactly like a normal one. So the obvious approach — "find every safety net on the default settings and feed it the new numbers" — would quietly reach into every manual strategy and overwrite a live safety net with the wrong numbers. No error. No failed test. The task calls this out by name and demands the fix skip manual strategies. So the real challenge wasn't "can you add a setting." It was "can you add it without switching off a real-money safety net — and prove it with a test."

⚠️ The trap. The bug doesn't crash. It doesn't fail a test. It just points a real-money safety net at the wrong settings, quietly, and waits. A green checkmark proves nothing here — the only thing that proves it is reading the code and confirming the fix skips those manual strategies.
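A toy version of the trap, sketched in Python under heavy assumptions: the `Strategy` class, the `manual` flag, and `apply_global_knob` are hypothetical names, not the bot's real ones. The point it illustrates is that a manual strategy's net holds the same default numbers as a normal one, so only an explicit skip keeps the new knob away from it.

```python
from dataclasses import dataclass, field
from typing import Dict, List

DEFAULT_WIDTHS = {"calm": 1.5, "wild": 3.0}  # hypothetical built-in numbers


@dataclass
class Strategy:
    name: str
    manual: bool = False
    stop_widths: Dict[str, float] = field(default_factory=lambda: dict(DEFAULT_WIDTHS))


def apply_global_knob(strategies: List[Strategy], knob: Dict[str, float]) -> None:
    """Feed the global numbers to normal strategies that are still on the defaults."""
    for s in strategies:
        if s.manual:
            continue  # manual strategies are configured elsewhere: leave them alone
        if s.stop_widths == DEFAULT_WIDTHS:
            s.stop_widths = dict(knob)


def test_manual_strategy_is_left_alone() -> None:
    normal = Strategy("trend")
    manual = Strategy("hand-run", manual=True)
    apply_global_knob([normal, manual], {"calm": 1.0, "wild": 4.0})
    assert normal.stop_widths == {"calm": 1.0, "wild": 4.0}
    assert manual.stop_widths == DEFAULT_WIDTHS
```

Delete the `if s.manual` check and nothing crashes; only the second assertion in the test notices. That is the kind of test the task demanded.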

iii. The contestants

OpenAI · Codex · high

iv. How they were judged

Four separate reviews. Every test re-run.

Each entry got its own reviewer, walled off from the other three so nobody was swayed by another's work. Each one pulled up the actual code, went through the task's checklist item by item, checked every promise against what the code really does — pointing to the exact line as proof — and re-ran all the automated tests itself instead of trusting the "all passing" badge on the submission.

Read the task. List exactly what the fix must do, and the safety trap it warned about.

Pull up the real code. Check every claim against what the code actually does, line by line.

Re-run the tests. Both copies of the bot — live and practice mode — from scratch.

Score out of 100 against the checklist. Report every miss, with proof.

⚙️ A note on the judge. The reviewing was all done by one model — Opus 4.8 (Anthropic's) — and one of the contestants is also Opus 4.8. Hold that thought. We'll come back to it at the bottom, next to the result it produced — because the result is what clears it.

v. The highest floor yet

Everyone cleared the bar — completely.

This was not a "spot the broken one" episode — it had the highest floor of any episode yet. All four dodged the trap: each one made the new knob skip manual strategies, and each shipped a test that sets up a manual strategy and checks its safety net is left alone. All four updated both copies of the bot (live + practice mode) to match, made a typo in the new setting fail loudly instead of doing nothing, updated the docs, and came with passing tests my reviewers re-ran themselves. Nobody broke the build, nobody touched the real-money safety nets, nobody faked a test. So it came down to how thoroughly they proved it — and the stats below are flavor, never part of the score.
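"Fail loudly on a typo" is easy to say and easy to get wrong, so here is a rough Python sketch of the idea. It is not any contestant's code: `ALLOWED_REGIMES` and `validate_knob` are hypothetical names, and the real setting may be shaped differently.

```python
from typing import Any, Dict

ALLOWED_REGIMES = {"calm", "wild"}  # hypothetical market-condition labels


def validate_knob(raw: Dict[str, Any]) -> Dict[str, float]:
    """Return a cleaned copy, or raise ValueError on unknown keys or bad widths."""
    unknown = set(raw) - ALLOWED_REGIMES
    if unknown:
        # A misspelled key is rejected instead of being silently ignored.
        raise ValueError(f"unknown regime(s) in global stop setting: {sorted(unknown)}")
    cleaned: Dict[str, float] = {}
    for regime, width in raw.items():
        is_number = isinstance(width, (int, float)) and not isinstance(width, bool)
        if not is_number or width <= 0:
            raise ValueError(f"stop width for {regime!r} must be a positive number, got {width!r}")
        cleaned[regime] = float(width)
    return cleaned
```

The quiet alternative, dropping keys the program doesn't recognize, would leave a misspelled setting looking applied while the old numbers stay in force.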

Model	Lines changed	Files	Time	Cost (raw, mismatched units)
GPT-5.5	+690 / −14	10	~12 min	~37% of a 5-hour usage window (96%→59%)
GLM 5.2	+861 / −5	10	~19 min	~13.3M tokens (Cursor)
Opus 4.8	+561 / −14	9	~20 min (slowest)	22% of its 1M-token context window
Composer 2.5	+503 / −14	9 (least)	~6 min (fastest)	~5.1M tokens (Cursor)

📐 Side note — code volume and speed are not scored. The contestant that wrote the most code tied for first; the one that wrote the least came last; and the fastest first-pass came last too. Hold those numbers in your head — the scores below ignore them on purpose, and that's exactly what makes the result strange.

The reckoning

So who actually won?

An "all tests passing" badge tells you nothing about how a real-money change was made. All four were scored against the task's checklist — every claim checked against the real code, all tests re-run. The gaps are tiny — three points across the whole field. The story is in them.

Counting down, from fourth place…

— Fourth place —

Composer 2.5 A strong debut from Cursor's brand-new model — and third place's near-twin. Smallest change, fastest run, fully correct, trap handled with a passing test, both copies of the bot updated. Two small things kept it last: it has no test for the case where the knob meets the bot's fancier way of labeling market conditions (only the simple labeling is tested), and it applies the new setting before double-checking it — safe in practice, because a later check still catches bad input, but not as careful as checking right away.

91/100
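Why the order of "apply" and "double-check" matters, as a generic Python sketch. This is not Composer's code, and the function names are invented; it only shows the general difference between the two orderings.

```python
from typing import Dict


def check(knob: Dict[str, float]) -> None:
    """Reject non-positive widths."""
    for regime, width in knob.items():
        if width <= 0:
            raise ValueError(f"bad width for {regime!r}: {width!r}")


def apply_then_check(settings: Dict[str, float], knob: Dict[str, float]) -> None:
    settings.update(knob)  # the settings are already changed...
    check(knob)            # ...by the time a bad value is caught


def check_then_apply(settings: Dict[str, float], knob: Dict[str, float]) -> None:
    check(knob)            # a bad value stops here
    settings.update(knob)  # so the settings are only touched when the input is good
```

In this toy version the first ordering leaves the bad number sitting in the settings after the error. In the reviewed entry a later check still caught bad input, which is why the reviewers called it safe in practice and docked it only lightly.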

— Third place —

Opus 4.8 Correct and careful, with a tidy cleanup the others skipped. Its test proves the manual safety net stays untouched in both copies of the bot. It placed third purely on missing tests: no test for the knob meeting the bot's fancier market-labeling system (only the standard labels), and no test for the "do-nothing" case — where setting the knob to match the existing defaults should change nothing at all.

92/100
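The "do-nothing" case is worth a picture. A minimal Python sketch of such a test, assuming a hypothetical resolver `effective_widths` and made-up default numbers (neither comes from the bot):

```python
from typing import Dict, Optional

DEFAULTS = {"calm": 1.5, "wild": 3.0}  # hypothetical built-in numbers


def effective_widths(
    strategies: Dict[str, Dict[str, float]],
    knob: Optional[Dict[str, float]] = None,
) -> Dict[str, Dict[str, float]]:
    """Strategies with no widths of their own take the knob, or the defaults if unset."""
    base = DEFAULTS if knob is None else knob
    return {name: dict(own) if own else dict(base) for name, own in strategies.items()}


def test_knob_equal_to_defaults_changes_nothing() -> None:
    strategies = {"trend": {}, "mean-revert": {"calm": 2.0, "wild": 2.5}}
    before = effective_widths(strategies)                      # knob not set
    after = effective_widths(strategies, knob=dict(DEFAULTS))  # knob set to the defaults
    assert before == after
```

A test like this guards against the new code path changing behavior merely by being switched on.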

✦ Before the result: three things that shouldn't all be true

📊

Most code tied for first

861 vs 503 lines

This flips last episode, where the shortest answer won. Here the model that wrote the most code (GLM, 861 lines) tied for first; the one that wrote the least (Composer, 503) came last. Not a "less is more" lesson — the extra lines are exactly the extra testing that was the difference.

⏱️

The fastest came last

Speed ≠ rank

The fastest model (Composer 2.5, ~6 min) finished last. The slowest (Opus, ~20 min) came third. The two co-winners sat in the middle on time. For the second episode running, speed didn't predict rank.

🔬

One test was the whole gap

94 vs 92 vs 91

Four models, three points apart — and the entire gap between the leaders and the rest was one test: a check that the new knob works with the bot's fancier way of labeling market conditions. The two winners wrote it; the other two didn't. That's the whole story.

🏆

— First place —

It's a tie.

You scrolled the whole way down expecting one winner. There are two — 94 each, separated by nothing.

GPT-5.5

OpenAI · Codex

PR #1142 ↗

94 /100

The cleanest run. Trap handled and tested, both copies of the bot updated, double-checks the setting even where it technically didn't have to — stricter and safer. 690 lines across 10 files; its only soft spot is a couple of requirements that lean on existing machinery without a brand-new test.

tied with

GLM 5.2

Z.ai · Cursor

PR #1143 ↗

94 /100

The most thorough testing. The only one with both the fancier-labeling test and a full end-to-end test that loads a real config and checks the manual safety net is safe. The price: the most code, 861 lines — and a write-up claiming "1,987 tests passed," a number we couldn't reproduce.

Same correctness. Same trap dodged. Both handled a typo'd setting by failing loudly. The gap to the top came down mainly to one test, the check that the new knob works with the bot's fancier market-labeling system, which GPT-5.5 and GLM both wrote and the other two skipped. Opus sits at 92 and Composer at 91, each missing that test plus one smaller item (Opus the "do-nothing" test, Composer checking the setting before applying it); the whole field is three points. The gaps are soft; the dead heat at the top is real.

⚖️ About the judge: read before trusting the result. I judge with Opus 4.8 (Anthropic's), and Opus 4.8 was also a contestant. Here's why that conflict cuts against the judge's own model this time: the Opus entry placed third, at 92, and neither winner is the judge's model (they are OpenAI's GPT-5.5 and Z.ai's GLM 5.2). Whatever bias you'd worry about, it ran the wrong way to help. Extra safeguards: each entry got its own separate reviewer, and the Opus one was held to a harder standard. And none of this is a matter of taste: either an entry made the fix skip manual strategies and proved it with a test, or it didn't; either the missing test is in the code, or it isn't. Pull up the code and check.

The headline isn't who won — it's that four models finished three points apart, the one that wrote the most code tied for first, the fastest came last, and the whole gap was a single test on a tricky case.

✦ Summary

All four, at a glance.

Model	Vendor · harness	Score	Time	Cost (raw)	Verdict
GPT-5.5	OpenAI · Codex	94/100 (co-winner)	~12 min	~37% of a 5-hr window	The cleanest run · trap handled and tested, both copies updated, double-checks even where optional · +690 / −14 across 10 files
GLM 5.2	Z.ai · Cursor	94/100 (co-winner)	~19 min	~13.3M tokens	Most thorough testing · the fancier-labeling test + a full end-to-end check · oversold "1,987 passed" · +861 / −5 across 10 files
Opus 4.8	Anthropic · Claude Code	92/100	~20 min	22% of its 1M-token context window	Correct + careful · tidy cleanup · skipped the fancier-labeling + "do-nothing" tests · +561 / −14 across 9 files
Composer 2.5	Cursor · Cursor · debut	91/100	~6 min (fastest)	~5.1M tokens	Smallest, fastest, fully correct · skipped the fancier-labeling test · applies the setting before double-checking it · +503 / −14 across 9 files

✦ Epilogue: what shipped

Nothing's shipped — yet.

Last episode, I shipped a blend of the two winners and that closed it out. This one is still wide open. As of scoring, nothing has been merged in: all four entries (#1141, #1142, #1143, #1144) sit side by side, and I haven't started a combined "best-of" version.

If I follow last episode's pattern, I'll probably take one winner's version as the base and graft the other's extra tests on top. But that hasn't happened. For now all four entries sit unmerged with nothing decided, and the verdict stands on the code review alone.

Issue #1134 (open) ↗
