Show full-precision scores; add submissions show --no-code #63
Two small quality-of-life fixes to the submissions views:
- Print scores at full f64 precision in both `submissions list` and
`submissions show`. They were formatted with `{:.4}`, which rounds the
geomean leaderboard score to 4 decimals (e.g. 0.0017 for two distinct
submissions that actually scored 0.0017318 vs 0.0017449) — enough to
make near-tied submissions indistinguishable. `f64::to_string()` emits
the shortest decimal that round-trips, so no rounding and no trailing
zero noise.
- Add a `--no-code` flag to `submissions show` to omit the (often large)
code block, for when you only want the metadata and per-run scores.
Default behavior is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
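To make the precision point concrete, here is a minimal sketch of the two formatting strategies described above. The helper names `format_rounded` and `format_full` are hypothetical and exist only for this illustration; they are not taken from the patch. The two scores are the ones quoted in the description.

```rust
/// Old behaviour: fixed four decimal places.
/// (Hypothetical helper name, for illustration only.)
fn format_rounded(score: f64) -> String {
    format!("{:.4}", score)
}

/// New behaviour: `f64::to_string()` yields the shortest decimal
/// that parses back to exactly the same `f64`.
/// (Hypothetical helper name, for illustration only.)
fn format_full(score: f64) -> String {
    score.to_string()
}

fn main() {
    let (a, b) = (0.0017318_f64, 0.0017449_f64);

    // With `{:.4}` both scores collapse to the same string.
    println!("rounded: {} vs {}", format_rounded(a), format_rounded(b));

    // With `to_string()` the two scores stay distinguishable.
    println!("full:    {} vs {}", format_full(a), format_full(b));

    // The full-precision string round-trips to the original value.
    assert_eq!(format_full(a).parse::<f64>().unwrap(), a);
}
```

The rounded form prints `0.0017` for both inputs, while the full form prints `0.0017318` and `0.0017449`.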
Codecov Report❌ Patch coverage is
|
codecov/patch is failing; you should fix that. |
|
Your PR description does not have a before and after example of |
|
Mark/other humans: I think `--no-code` or something similar is important for agents to avoid context blowout. I am not sure what the default should be. The best default for agents would be to hide code and require a `--code` flag to opt in. However, that would be a breaking change, so I opted for `--no-code` instead. Thoughts? |
Pull the score-rendering logic (shared by `list` and `show`) into a `format_score` helper and cover it with unit tests, so the precision change has patch coverage (codecov/patch was 0%). Also adds a small `truncate` test. No behavior change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Fixed in 32f4ea5. I pulled the score-rendering logic (shared by |
|
Added a |
|
On the default: I agree code-by-default is the wrong default for agent use (it's the single biggest source of context blowout in
|
Two small quality-of-life fixes to the `submissions` views.

1. Full-precision scores in `list` and `show`

Scores were formatted with `{:.4}`, rounding the geomean leaderboard score to 4 decimals. That's lossy enough to make near-tied submissions indistinguishable: two submissions scoring `0.0017318...` and `0.0017449...` both render as `0.0017`, so you can't tell which is faster from the CLI.

Switch to `f64::to_string()` (via a small `format_score` helper), which prints the shortest decimal that round-trips to the same `f64`: full precision, no trailing-zero noise.

`submissions list`, before / after:

`submissions show 830213`, before / after:

2. `submissions show --no-code`

`submissions show` always prints the full submission code, usually the largest part of the output. Add a `--no-code` flag to omit it when you only want the metadata and per-run scores. Default behavior is unchanged.

Verification

Built the binary and ran both against the live API:

- `list` and `show` print full-precision scores; `show --no-code` prints metadata + runs and omits the code block (default `show` still prints it).
- `cargo fmt --all -- --check`: clean
- `cargo clippy --all-targets --all-features -- -D warnings`: clean
- `cargo test`: 31 passed (added `format_score` + `truncate` unit tests for patch coverage)

🤖 Generated with Claude Code
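For readers who want a picture of the `--no-code` behavior, the following is a rough sketch under stated assumptions, not the code from this PR. The `Submission` struct, its fields, and the `render_show` function are all hypothetical; the real CLI's types and argument parsing are not shown in this thread.

```rust
/// Hypothetical, trimmed-down submission record for illustration.
struct Submission {
    id: u64,
    score: f64,
    code: String,
}

/// Render the `show` output. When `no_code` is true the code block is
/// omitted; when false (the default) it is printed as before.
/// (Assumed shape; the actual function in the PR may differ.)
fn render_show(sub: &Submission, no_code: bool) -> String {
    // `{}` on an f64 uses the same shortest round-trip form as `to_string()`.
    let mut out = format!("id: {}\nscore: {}\n", sub.id, sub.score);
    if !no_code {
        out.push_str("code:\n");
        out.push_str(&sub.code);
        out.push('\n');
    }
    out
}

fn main() {
    let sub = Submission {
        id: 830213,
        score: 0.0017318,
        code: String::from("fn solve() {}"),
    };

    // Default: metadata followed by the code block.
    print!("{}", render_show(&sub, false));

    // With the flag set: metadata only.
    print!("{}", render_show(&sub, true));
}
```

Keeping the flag as an opt-out (`no_code`) rather than an opt-in mirrors the choice discussed in the thread: the default output is unchanged, so existing callers are not broken.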