eigh_py: reward hacks fabricating benchmark time (tracking + the timing-layer gap) #162

Description

@robobryce

Summary

Red-teaming the eigh benchmark harness surfaced a family of reward hacks that leave custom_kernel genuinely correct but fabricate the reported time, plus the underlying structural gaps. I've opened focused PRs for the gaps that have clean fixes (#159, #160, #161); this issue tracks the rest, where the right remedy is a judgment call I'd rather leave to the maintainers.

All findings below were confirmed on the live B200 eigh leaderboard (every test submission deleted immediately after its verdict). A demonstration of the most severe one is currently sitting at rank #1 with a displayed score of 0.000.

Confirmed-accepted reward-hack families and their status

| Family | What it does | Status |
| --- | --- | --- |
| Aggregator underflow | Drives one shape's reported time toward 0 → geomean collapses to 0.000000 | PR #159 (roofline floor) |
| In-process cache / file replay | Solves once, returns cached result on reused timed calls | PR #160 (regenerate inputs per iteration) |
| Lazy output (subclass / instance override) | Returns placeholders, defers the real solve into the untimed checker | PR #161 (reject deferral) |
| Timer / stats patch | Leaves the kernel honest but patches Event.elapsed_time / calculate_stats to report 1/100th the time | this issue |
| Forged result object | Forges the Stats object the timed loop returns to the parent | this issue |
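To make the aggregator-underflow row concrete, here is a minimal sketch of why a single near-zero shape collapses a geometric mean, and how a per-shape lower bound limits the damage. The function names, the timings, and the floor values are all made up for illustration; this is not the code in PR #159, only the general idea of clamping to a floor before reducing.

```python
import math

def geomean(times):
    """Geometric mean computed in log space; inputs must be strictly positive."""
    return math.exp(sum(math.log(t) for t in times) / len(times))

def floored_geomean(times, floors):
    """Clamp each per-shape time to a (hypothetical) physical lower bound, then reduce."""
    clamped = [max(t, f) for t, f in zip(times, floors)]
    return geomean(clamped)

# Illustrative values only, in seconds per shape.
honest = [1.2e-3, 4.5e-3, 1.8e-2]
hacked = [1e-30, 4.5e-3, 1.8e-2]   # one shape's reported time driven toward zero
floors = [1.0e-4, 4.0e-4, 1.5e-3]  # assumed roofline-style floors, not real B200 numbers

print(f"honest geomean:  {geomean(honest):.6f}")
print(f"hacked geomean:  {geomean(hacked):.6f}")
print(f"floored geomean: {floored_geomean(hacked, floors):.6f}")
```

With a floor in place, one fabricated shape can gain at most the ratio between its honest time and its floor, and a time already above the floor is left unchanged.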

The remaining gap: the reported time is taken on trust

The timing and the stats reduction happen in the same spawned worker that imports the submission, so a submission can reach and tamper with them (directly, or via aliasing / gc). kernelguard has merged detectors for some of these routes (SinatrasC/kernelguard #277, #278), which helps at the static-scan layer, but:

  • those rules are not yet live on the production scanner (a re-test of the aliased-timer hack on 2026-06-28 was still accepted), and
  • a static scanner is a pattern chase; the structural fix is to compute the reported statistic where the submission cannot reach it — e.g. time and reduce in the parent process from durations captured before the submission is imported, in a namespace the worker doesn't expose.
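The structural idea in the second bullet can be sketched roughly as follows. This is a toy under stated assumptions, not the prototype mentioned in this issue and not the harness's actual process model: `WORKER_SRC` and `time_in_parent` are hypothetical names, the "kernel" is a placeholder sum, and the parent measures wall-clock round-trip time (which includes IPC overhead) rather than GPU event time. The point it illustrates is only that the clock and the reduction live in the parent, so code running in the worker has no reference to either.

```python
import statistics
import subprocess
import sys
import time

# Hypothetical stand-in for a worker that imports the submission and serves requests.
WORKER_SRC = r"""
import sys
for line in sys.stdin:
    n = int(line)
    total = sum(i * i for i in range(n))  # placeholder for the submitted kernel
    sys.stdout.write(f"{total}\n")
    sys.stdout.flush()
"""

def time_in_parent(n, repeats=5):
    """Time each request from the parent and reduce the durations in the parent."""
    clock = time.perf_counter  # bound here, before the worker (and submission) exists
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER_SRC],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    durations, results = [], []
    try:
        for _ in range(repeats):
            start = clock()
            proc.stdin.write(f"{n}\n")
            proc.stdin.flush()
            results.append(int(proc.stdout.readline()))  # block until the worker answers
            durations.append(clock() - start)
    finally:
        proc.stdin.close()   # EOF ends the worker's loop
        proc.wait()
        proc.stdout.close()
    # The worker only ever returns results; the statistic is computed here.
    return statistics.median(durations), results

median_s, results = time_in_parent(1000)
print(f"median round trip: {median_s:.6f} s over {len(results)} calls")
```

A design like this only removes the worker's ability to edit the timer or the statistics; it does not by itself address caching or deferred work, which the PRs above cover.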

That structural change is more invasive than the three PRs above (it touches the harness's process/timing model), so I haven't sent it as an unsolicited large PR. If you'd welcome it, I have a working prototype and am happy to open it; alternatively this may be best handled at the kernelguard layer once the merged rules deploy. Flagging it so the decision is yours.

Also: no guards/ dir

Unlike qr_v2, eigh_py ships no guards/ (differential-correctness / invariance) directory, so those defenses don't run here. Worth adding as defense-in-depth.
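As a rough illustration of the kind of check such a directory might hold, the sketch below validates a claimed symmetric eigendecomposition by residual, orthonormality, and a trace invariant. It is pure Python for self-containment; the function name `check_eigh`, the row-per-eigenvector layout, and the tolerance are assumptions of this sketch and are not taken from qr_v2's guards.

```python
import math

def matvec(a, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def check_eigh(a, eigenvalues, eigenvectors, tol=1e-9):
    """Return True if the claimed decomposition of symmetric `a` passes all checks.

    `eigenvectors[k]` is the vector paired with `eigenvalues[k]` (assumed layout).
    """
    n = len(a)
    if len(eigenvalues) != n or len(eigenvectors) != n:
        return False
    flat = list(eigenvalues) + [x for v in eigenvectors for x in v]
    if not all(math.isfinite(x) for x in flat):
        return False
    # Residual check: A v must equal lambda v for every pair.
    for lam, v in zip(eigenvalues, eigenvectors):
        if any(abs(x - lam * y) > tol for x, y in zip(matvec(a, v), v)):
            return False
    # Orthonormality check: rejects zero or duplicated placeholder vectors.
    for i in range(n):
        for j in range(n):
            target = 1.0 if i == j else 0.0
            if abs(dot(eigenvectors[i], eigenvectors[j]) - target) > tol:
                return False
    # Invariant check: the eigenvalues must sum to the trace of A.
    return abs(sum(eigenvalues) - sum(a[i][i] for i in range(n))) <= tol

s = 1.0 / math.sqrt(2.0)
a = [[2.0, 1.0], [1.0, 2.0]]
print(check_eigh(a, [1.0, 3.0], [[s, -s], [s, s]]))          # valid decomposition
print(check_eigh(a, [1.0, 3.0], [[0.0, 0.0], [0.0, 0.0]]))   # placeholder output
```

A check of this shape is independent of how the output was produced, so it complements the timing-side fixes rather than replacing them.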

Happy to provide minimal repros for any of the above.
