Summary
Red-teaming the eigh benchmark harness surfaced a family of reward hacks that leave custom_kernel genuinely correct but fabricate the reported time, plus the underlying structural gaps. I've opened focused PRs for the gaps that have clean fixes (#159, #160, #161); this issue tracks the rest, where the right remedy is a judgment call I'd rather leave to the maintainers.
All findings below were confirmed on the live B200 eigh leaderboard (every test submission deleted immediately after its verdict). A demonstration of the most severe one is currently sitting at rank #1 with a displayed score of 0.000.
Confirmed-accepted reward-hack families and their status
| Family | What it does | Status |
| --- | --- | --- |
| Aggregator underflow | Drives one shape's reported time toward 0 → geomean collapses to 0.000000 | PR #159 (roofline floor) |
| In-process cache / file replay | Solves once, returns cached result on reused timed calls | PR #160 (regenerate inputs per iteration) |
| Lazy output (subclass / instance override) | Returns placeholders, defers the real solve into the untimed checker | PR #161 (reject deferral) |
| Timer / stats patch | Leaves the kernel honest but patches Event.elapsed_time / calculate_stats to report 1/100th the time | this issue |
| Forged result object | Forges the Stats object the timed loop returns to the parent | this issue |
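To make the aggregator-underflow row concrete, here is a minimal arithmetic sketch of why a single near-zero per-shape time drags a geometric mean down to a displayed 0.000000, and how clamping each shape to a lower bound limits the damage. The timings and floor values are made up for illustration, and the clamp is only an assumption about what a roofline floor could look like, not the actual code in PR #159 or in the harness.

```python
import math

def geomean(times):
    """Geometric mean, computed in log space."""
    return math.exp(sum(math.log(t) for t in times) / len(times))

def floored_geomean(times, floors):
    """Clamp each per-shape time to its lower bound before aggregating."""
    return geomean([max(t, f) for t, f in zip(times, floors)])

# Seconds per shape; all numbers are illustrative, not measured.
honest = [1.0e-3, 2.0e-3, 4.0e-3]
hacked = [1.0e-3, 2.0e-3, 1.0e-30]   # one shape driven toward zero
floors = [1.0e-4, 2.0e-4, 4.0e-4]    # hypothetical roofline-derived bounds

print(f"honest:          {geomean(honest):.6f}")
print(f"hacked:          {geomean(hacked):.6f}")
print(f"hacked, floored: {floored_geomean(hacked, floors):.6f}")
```

With a floor in place, the tampered shape can contribute no less than its bound, so the aggregate stays within a physically plausible range even if one reported value is fabricated.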
The remaining gap: the reported time is taken on trust
The timing and the stats reduction happen in the same spawned worker that imports the submission, so a submission can reach and tamper with them (directly, or via aliasing / gc). kernelguard has merged detectors for some of these routes (SinatrasC/kernelguard #277, #278), which helps at the static-scan layer, but:
- those rules are not yet live on the production scanner (a re-test of the aliased-timer hack on 2026-06-28 was still accepted), and
- a static scanner is a pattern chase; the structural fix is to compute the reported statistic where the submission cannot reach it — e.g. time and reduce in the parent process from durations captured before the submission is imported, in a namespace the worker doesn't expose.
That structural change is more invasive than the three PRs above (it touches the harness's process/timing model), so I haven't sent it as an unsolicited large PR. If you'd welcome it, I have a working prototype and am happy to open it; alternatively this may be best handled at the kernelguard layer once the merged rules deploy. Flagging it so the decision is yours.
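As a rough illustration of the direction described above (and not the prototype itself), the sketch below keeps both the clock and the reduction in the parent process. The worker is a separate interpreter reached only through a pipe, so anything the submission patches stays on its side. The worker source, the placeholder `custom_kernel` body, and the helper name `time_in_parent` are all hypothetical stand-ins for this example.

```python
import statistics
import subprocess
import sys
import time

# Stand-in for the worker that imports and runs the submission. Anything the
# submission monkey-patches lives in this child interpreter only.
WORKER_SRC = r"""
import sys

def custom_kernel(n):  # hypothetical placeholder for the submitted kernel
    return sum(i * i for i in range(n))

for line in iter(sys.stdin.readline, ""):  # one request per line
    custom_kernel(int(line))
    sys.stdout.write("done\n")
    sys.stdout.flush()
"""

def time_in_parent(n, iterations=5):
    """Time each request from the parent and reduce the samples here."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER_SRC],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    durations = []
    try:
        for _ in range(iterations):
            start = time.perf_counter()
            proc.stdin.write(f"{n}\n")
            proc.stdin.flush()
            reply = proc.stdout.readline()
            elapsed = time.perf_counter() - start
            if reply.strip() != "done":
                raise RuntimeError("worker exited or sent an unexpected reply")
            durations.append(elapsed)
    finally:
        proc.stdin.close()
        proc.stdout.close()
        proc.wait()
    # The statistics are computed in the parent, out of the worker's reach.
    return {
        "mean": statistics.mean(durations),
        "best": min(durations),
        "runs": len(durations),
    }

if __name__ == "__main__":
    print(time_in_parent(100_000))
```

This trades precision for isolation: each sample includes pipe round-trip overhead, and for GPU work the worker would still have to synchronize the device before replying. A worker that replies before the work is finished is the deferral problem covered separately by PR #161.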
Also: no guards/ dir
Unlike qr_v2, eigh_py ships no guards/ (differential-correctness / invariance) directory, so those defenses don't run here. Worth adding as defense-in-depth.
Happy to provide minimal repros for any of the above.