A knob I couldn't reach.
In plain English — my bot decides when to bail out of a trade. Two of those exits are safety nets: a stop-loss ("sell if it drops past this line so I don't get wiped out") and a trailing stop (a net that follows the price up to lock in gains). How much room each net gives the price is tuned to the market — calm days get a tighter net, wild days a looser one. Right now those numbers are baked into the program. Fine, until I want to retune them across the whole bot at once — then I'd have to edit the setting in every single strategy by hand, or edit the code and rebuild the whole thing. Issue #1134 asked for the obvious fix: one knob I set once that every safety net picks up — unless a specific strategy wants to override it. Just an added setting. Sounds boring. It isn't — because of one strategy type.
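The "set once, override per strategy" idea can be sketched as a layered lookup. This is only an illustration: the post does not show the bot's real config schema, so `StopWidths`, `BUILT_IN`, `resolve_widths`, the regime names, and every number below are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


# Hypothetical shape for one market regime's safety-net widths.
@dataclass(frozen=True)
class StopWidths:
    stop_loss_pct: float      # how far price may fall before selling
    trailing_stop_pct: float  # how far price may retreat from its peak


# Stand-in for the numbers currently "baked into the program" (made-up values).
BUILT_IN: Dict[str, StopWidths] = {
    "calm": StopWidths(1.0, 0.5),   # tighter net on calm days
    "wild": StopWidths(3.0, 1.5),   # looser net on wild days
}


def resolve_widths(
    regime: str,
    global_override: Optional[Dict[str, StopWidths]] = None,
    strategy_override: Optional[Dict[str, StopWidths]] = None,
) -> StopWidths:
    """Most specific layer wins: strategy override, then global knob, then built-in."""
    for layer in (strategy_override, global_override):
        if layer and regime in layer:
            return layer[regime]
    return BUILT_IN[regime]


global_knob = {"calm": StopWidths(0.8, 0.4)}
one_strategy = {"calm": StopWidths(0.6, 0.3)}
```

With these values, `resolve_widths("calm", global_knob)` picks up the global knob, `resolve_widths("calm", global_knob, one_strategy)` prefers the strategy's own setting, and `resolve_widths("wild", global_knob)` falls back to the built-in numbers because the knob says nothing about wild days.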
A lazy fix switches off a live safety net.
Here's the trap. Some of my strategies run in "manual" mode, and those already get their safety-net settings from a different place I built earlier. The new knob is only supposed to feed the normal strategies. The problem: to the computer, a manual strategy's safety net looks exactly like a normal one. So the obvious approach — "find every safety net on the default settings and feed it the new numbers" — would quietly reach into every manual strategy and overwrite a live safety net with the wrong numbers. No error. No failed test. The task calls this out by name and demands the fix skip manual strategies. So the real challenge wasn't "can you add a setting." It was "can you add it without switching off a real-money safety net — and prove it with a test."
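A minimal sketch of the guard the task demands, under assumed names (`Strategy`, `SafetyNet`, a `mode` field, and `apply_global_knob` are all hypothetical, not the bot's actual code). The point it illustrates: because a manual strategy's net is indistinguishable from a normal one, the check has to look at the owning strategy, not at the net.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SafetyNet:
    stop_loss_pct: float
    source: str = "default"  # where the current value came from


@dataclass
class Strategy:
    name: str
    mode: str  # "normal" or "manual" (assumed field)
    net: SafetyNet


def apply_global_knob(strategies: List[Strategy], new_stop_loss_pct: float) -> List[str]:
    """Feed the global value to normal strategies only; return the names updated."""
    updated = []
    for s in strategies:
        if s.mode == "manual":
            continue  # manual nets are configured elsewhere; never overwrite them
        if s.net.source == "default":
            s.net = SafetyNet(new_stop_loss_pct, source="global")
            updated.append(s.name)
    return updated


strategies = [
    Strategy("trend", "normal", SafetyNet(2.0)),
    Strategy("hand", "manual", SafetyNet(5.0)),  # its net also says "default"
]
updated = apply_global_knob(strategies, 1.5)
```

Dropping the `mode` check would make the loop update both strategies without raising anything, which is the silent failure described above.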
The four entries: GPT-5.5, GLM 5.2, Opus 4.8, and Composer 2.5.
Four separate reviews. Every test re-run.
Each entry got its own reviewer, walled off from the other three so nobody was swayed by another's work. Each one pulled up the actual code, went through the task's checklist item by item, checked every promise against what the code really does — pointing to the exact line as proof — and re-ran all the automated tests itself instead of trusting the "all passing" badge on the submission.
Read the task. List exactly what the fix must do, and the safety trap it warned about.
Pull up the real code. Check every claim against what the code actually does, line by line.
Re-run the tests. Both copies of the bot — live and practice mode — from scratch.
Score out of 100 against the checklist. Report every miss, with proof.
Everyone cleared the bar — completely.
This was not a "spot the broken one" episode — it had the highest floor of any episode yet. All four dodged the trap: each one made the new knob skip manual strategies, and each shipped a test that sets up a manual strategy and checks its safety net is left alone. All four updated both copies of the bot (live + practice mode) to match, made a typo in the new setting fail loudly instead of doing nothing, updated the docs, and came with passing tests my reviewers re-ran themselves. Nobody broke the build, nobody touched the real-money safety nets, nobody faked a test. So it came down to how thoroughly they proved it — and the stats below are flavor, never part of the score.
| Model | Lines changed | Files | Time | Cost (raw, mismatched units) |
|---|---|---|---|---|
| GPT-5.5 | +690 / −14 | 10 | ~12 min | ~37% of a 5-hour usage window (96%→59%) |
| GLM 5.2 | +861 / −5 | 10 | ~19 min | ~13.3M tokens (Cursor) |
| Opus 4.8 | +561 / −14 | 9 | ~20 min (slowest) | 22% of its 1M-token context window |
| Composer 2.5 | +503 / −14 | 9 (least) | ~6 min (fastest) | ~5.1M tokens (Cursor) |
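"Fail loudly on a typo", the behaviour credited to all four entries, might look something like the validator below. It is a sketch under assumptions: the key names and `parse_global_knob` are invented for illustration and are not taken from any of the four submissions.

```python
from typing import Dict

# Hypothetical key names for the new global setting.
ALLOWED_KEYS = {"stop_loss_pct", "trailing_stop_pct"}


def parse_global_knob(raw: Dict[str, object]) -> Dict[str, float]:
    """Reject unknown keys and bad values instead of silently ignoring them."""
    unknown = set(raw) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown safety-net setting(s): {sorted(unknown)}")
    parsed = {}
    for key, value in raw.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
            raise ValueError(f"{key} must be a positive number, got {value!r}")
        parsed[key] = float(value)
    return parsed
```

A misspelled key such as `stop_los_pct` raises a `ValueError` at load time rather than leaving the built-in numbers quietly in place.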
So who actually won?
An "all tests passing" badge tells you nothing about how a real-money change was made. All four were scored against the task's checklist — every claim checked against the real code, all tests re-run. The gaps are tiny — three points across the whole field. The story is in them.
Counting down, from fourth place…
Most code tied for first
This flips last episode, where the shortest answer won. Here the model that wrote the most code (GLM, 861 lines) tied for first; the one that wrote the least (Composer, 503) came last. Not a "less is more" lesson — the extra lines are exactly the extra testing that was the difference.
The fastest came last
The fastest model (Composer 2.5, ~6 min) finished last. The slowest (Opus, ~20 min) came third. The two co-winners sat in the middle on time. For the second episode running, speed didn't predict rank.
One test was the whole gap
Four models, three points apart — and the entire gap between the leaders and the rest was one test: a check that the new knob works with the bot's fancier way of labeling market conditions. The two winners wrote it; the other two didn't. That's the whole story.
You scrolled the whole way down expecting one winner. There are two — 94 each, separated by nothing.
GPT-5.5: the cleanest run. Trap handled and tested, both copies of the bot updated, and it double-checks the setting even where it technically didn't have to, which is stricter and safer. 690 lines added across 10 files; its only soft spot is a couple of requirements that lean on existing machinery without a brand-new test.
GLM 5.2: the most thorough testing. The only one with both the fancier-labeling test and a full end-to-end test that loads a real config and checks the manual safety net is safe. The price: the most code, 861 lines added, and a write-up claiming "1,987 tests passed," a number I couldn't reproduce.
Same correctness. Same trap dodged. Both handled a typo'd setting by failing loudly. The entire gap to the top came down to one test — the check that the new knob works with the bot's fancier market-labeling system — that GPT-5.5 and GLM both wrote and the other two skipped. Opus sits at 92 and Composer at 91 on that single missing test; the whole field is three points. The gaps are soft; the dead heat at the top is real.
The headline isn't who won — it's that four models finished three points apart, the one that wrote the most code tied for first, the fastest came last, and the whole gap was a single test on a tricky case.
Nothing's shipped — yet.
Last episode, I shipped a blend of the two winners and that closed it out. This one is still wide open. As of scoring, nothing has been merged in: all four entries (#1141, #1142, #1143, #1144) sit side by side, and I haven't started a combined "best-of" version.
If I follow last episode's pattern, I'll probably take one winner's version as the base and graft the other's extra tests on top. But that hasn't happened. For now all four entries sit unmerged with nothing decided, and the verdict stands on the code review alone.