Skip to content

fix(harbor): Mode A scorer provenance, auto_best admin re-score, fail-closed tiers [fixes 2/4 sidecar]#8

Open
shehabyasser-scale wants to merge 1 commit into
harbor-2-sidecarfrom
harbor-2-sidecar-fixes
Open

fix(harbor): Mode A scorer provenance, auto_best admin re-score, fail-closed tiers [fixes 2/4 sidecar]#8
shehabyasser-scale wants to merge 1 commit into
harbor-2-sidecarfrom
harbor-2-sidecar-fixes

Conversation

@shehabyasser-scale

@shehabyasser-scale shehabyasser-scale commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Stacks on #4 (harbor-2-sidecar). Addresses the trust-boundary findings from the review of that PR. 2 of 4 fix PRs.

What this fixes

# Sev Finding Fix
S1 🔴 tier_for_split fails open (unlisted split → viewable → per-sample labels written to the agent volume) Fail closed: unlisted → no_access, with a warning
S2 🔴 Mode A scorer provenance (the headline). task_project=None loads the @task.evaluation() scorer from the agent's committed repo; evaluate_admin then scores the hidden split with it, so a committed scorer returning 1.0 wins build_components refuses a Mode A config without task_project, so the hidden-split scorer is always sidecar-baked, never the agent's commit
S3 🔴 auto_best ranks candidates purely by the agent-influenced recorded score, so a candidate can inflate its own selection score Recorded score only shortlists; the winner is decided by re-running evaluate_admin (trusted scorer) on the selection split for the top-K and ranking by admin score. Fails closed
S4 🟠 The budget ledger is rebuilt from config on every start, so a sidecar restart resets spent budget to full Reload remaining_* from the persisted ledger when present (fail-safe to configured budgets)
S5 🟠 No adversarial/OS-level tests for the integrity claims Added: no_access→400, a cheating-agent repo proving finalize doesn't run the agent's scorer, token 0o600/unreadable

Greptile findings on PR #4, also fixed here (replies posted on #4)

  • stale result dirs (server.py): _route_results recreates the dir so it reflects exactly the current metered run.
  • reward_mode typed (serve.py): now Literal["submit","auto_best"], rejected at parse time.
  • auto_best dataset scoping (verifier.py): candidates filtered by selection_dataset_id so a foreign dataset's same-named split can't select the winner.

Architecture question

This PR keeps tier_for_split as the sidecar's write-routing resolver and flips it fail-closed; the engine-side real-time access gate lands in the core PR. See the design note (answering "should split access be a dataset property with a real-time check?") for the full reasoning. Short answer: yes, and the core PR implements the minimal version.

Tests

20 harbor unit tests pass (protocol, verifier, app). The 3 real-eval serve tests pass end-to-end (built via uv against a real task project): the honest path, the cheating-agent scorer-provenance proof (reward stays 0.0, not the agent's 1.0), and the ledger-reload-across-restart test.

🤖 Generated with Claude Code

Greptile Summary

This PR tightens Harbor sidecar evaluation and reward handling. The main changes are:

  • Unlisted splits now fail closed to no_access.
  • Mode A startup now requires a trusted task_project.
  • Budget ledgers are reloaded from persisted state across restarts.
  • auto_best now admin re-scores a shortlist before choosing a winner.
  • Result directories are recreated to avoid stale per-sample files.
  • Tests cover scorer provenance, no-access handling, token permissions, ledger reload, and verifier selection.

Confidence Score: 2/5

Not merge-safe until the auto-selection and budget reload paths are corrected.

The changes address several trust-boundary issues, but runtime checks reproduced cases where candidate selection can still be controlled by recorded scores, scorer failures can abort finalization prematurely, and persisted budget state can override the current baked configuration.

vero/src/vero/harbor/verifier.py and vero/src/vero/harbor/serve.py need attention.

T-Rex T-Rex Logs

What T-Rex did

  • Ran a focused pytest harness against the Verifier auto_best path with five validation candidates and a recorded-score shortlist that excluded the trusted best commit.
  • The run showed evaluate_admin was called only for bad_1, bad_2, and bad_3, with the final scoring run on bad_3 and the true_best (0.99) not evaluated because its mean_score ranked below the shortlist.
  • The test failed asserting the selected winner should match the full admin best, demonstrating auto_best picked bad_3 instead of true_best.
  • Ran a focused ledger-restart repro that wrote a persisted ledger.json with dataset-a/test budgets (100 samples, 10 runs, max 50) and then restarted ledger loading with reduced budgets (10 samples, 1 run, max 5).
  • The loaded ledger still contained the persisted old values, allowed check(dataset-a, test, 6) despite the current max_samples_per_run being 5, and, when restarted with an empty budget_cfgs list, the removed dataset-a/test split remained evaluable from the persisted ledger.
  • Ran a Harbor Verifier auto_best harness with two candidates ordered by recorded score; monkeypatched evaluate_admin to raise for first_raises and return 0.95 for later_success, and finalize raised RuntimeError after evaluating only first_raises, so later_success was never considered.
  • Before: a fail-open tier showed viewable split_accesses; After: the code path moved to a fail-closed tier with no_access, and a repro script constructs the failing path for the tier switch.
  • Before: mode A provenance showed harbor=None, task_project=None, initialized components, token written, and exit code 0; After: head raised ValueError about missing task_project and the fail-closed guard activated, with exit code 0; Supplemental: a fail-closed pytest run appeared successful, though the trusted-scorer test did not run due to a FileNotFoundError during fixture setup.
  • Before: reward_mode='bogus' with no selection_dataset_id and a chosen foreign_ds2_commit; After: the head rejected the bogus reward mode with ValidationError, exposed selection_dataset_id, narrowed ranking to ds1, and selected intended_ds1_commit.

View all artifacts

T-Rex Ran code and verified through T-Rex

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
vero/src/vero/harbor/verifier.py:135-141
**Shortlist still controls selection.** The admin re-score only runs on candidates admitted by the agent-influenced `mean_score` ranking. An agent can run three throwaway commits with inflated recorded scores and leave the actually best commit below `rescore_top_k`; the trusted scorer never evaluates that lower-recorded commit, so `auto_best` can select a worse winner than the admin score would choose. Re-score every eligible candidate, or choose the shortlist from a source that is not ranked by agent-written scores.

### Issue 2 of 3
vero/src/vero/harbor/serve.py:102-123
**Persisted ledger overrides config.** When `ledger.json` exists, this path rebuilds the budget ledger only from persisted entries and ignores the current baked `budget_cfgs`. If a restart follows a config change that lowers a budget, removes a split, or changes `max_samples_per_run`, the old persisted entry remains active with its old limits and can keep that split evaluable. Rebuild from `budget_cfgs` as the authority, overlay persisted `remaining_*` only for matching `(split, dataset_id)` entries, and clamp remaining values to the configured totals.

### Issue 3 of 3
vero/src/vero/harbor/verifier.py:147-152
**Failed re-score aborts finalize.** Any exception from `evaluate_admin` inside the shortlist loop aborts `_best_from_db` before the remaining candidates are considered. Because the shortlist is chosen from recorded scores, a high-recorded commit that fails under the trusted scorer can prevent finalization even when lower-ranked candidates would re-score successfully. Treat failed re-scores as `default_minimum_score` or skip them after logging, and only raise `NoCandidateError` if every candidate fails.

Reviews (1): Last reviewed commit: "fix(harbor): Mode A scorer provenance, a..." | Re-trigger Greptile

Greptile also left 3 inline comments on this PR.

…ers, budget reload

Trust-boundary hardening for the eval sidecar (review findings on PR #4):

- protocol: tier_for_split fails CLOSED (unlisted split -> no_access, with a
  warning), instead of fail-open to viewable.
- serve: Mode A now requires task_project. build_components refuses to start a
  Mode A config without it, so the hidden-split/admin scorer is always loaded
  from the sidecar-baked task project, never the agent's committed repo (a
  committed scorer returning 1.0 can no longer win the reward).
- verifier: auto_best no longer trusts the agent-influenced recorded score. It
  shortlists by recorded score, then re-runs evaluate_admin (the trusted scorer)
  on the selection split for the top-K and ranks by the admin score; fails closed.
- serve: reload the persisted budget ledger on startup so a sidecar restart does
  not reset spent budget to full.
- tests: adversarial coverage (no_access -> 400, finalize does not run an
  agent-supplied scorer via a cheating-agent repo, admin token 0o600/unreadable).

Also folds in the valid Greptile findings on this PR: clear stale per-sample
result files on re-eval, type reward_mode as Literal[submit, auto_best], and
filter auto_best candidates by selection dataset id.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines +135 to +141
# Shortlist by recorded score (cheap, agent-influenced -> not trusted as
# final), one row per candidate (highest recorded score wins the slot).
ranked = split_df.sort_values(
by=["mean_score", "candidate_created_at"], ascending=[False, False]
).iloc[0]
return best["candidate_commit"]
)
ranked = ranked.drop_duplicates(subset=["candidate_commit"], keep="first")
shortlist = ranked.head(max(1, self.rescore_top_k))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Shortlist still controls selection. The admin re-score only runs on candidates admitted by the agent-influenced mean_score ranking. An agent can run three throwaway commits with inflated recorded scores and leave the actually best commit below rescore_top_k; the trusted scorer never evaluates that lower-recorded commit, so auto_best can select a worse winner than the admin score would choose. Re-score every eligible candidate, or choose the shortlist from a source that is not ranked by agent-written scores.

Artifacts

Repro: focused pytest harness constructing recorded-score shortlist that excludes the trusted best candidate

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: verbose pytest output showing shortlist membership, evaluate_admin calls, skipped true_best, selected bad_3, and failing assertion

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/verifier.py
Line: 135-141

Comment:
**Shortlist still controls selection.** The admin re-score only runs on candidates admitted by the agent-influenced `mean_score` ranking. An agent can run three throwaway commits with inflated recorded scores and leave the actually best commit below `rescore_top_k`; the trusted scorer never evaluates that lower-recorded commit, so `auto_best` can select a worse winner than the admin score would choose. Re-score every eligible candidate, or choose the shortlist from a source that is not ranked by agent-written scores.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Comment on lines +102 to +123
if persist_path.exists():
try:
persisted = json.loads(persist_path.read_text())
budgets: list[SplitBudget] = []
for entry in persisted:
b = SplitBudget(
split=entry["split"],
dataset_id=entry.get("dataset_id", ""),
total_sample_budget=entry.get("total_sample_budget"),
total_run_budget=entry.get("total_run_budget"),
max_samples_per_run=entry.get("max_samples_per_run"),
)
# __post_init__ reset remaining_* to total_*; restore spent state.
b.remaining_sample_budget = entry.get("remaining_sample_budget")
b.remaining_run_budget = entry.get("remaining_run_budget")
budgets.append(b)
logger.info(
"Reloaded persisted budget ledger from %s (%d splits).",
persist_path,
len(budgets),
)
return BudgetLedger(budgets, persist_path=persist_path)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Persisted ledger overrides config. When ledger.json exists, this path rebuilds the budget ledger only from persisted entries and ignores the current baked budget_cfgs. If a restart follows a config change that lowers a budget, removes a split, or changes max_samples_per_run, the old persisted entry remains active with its old limits and can keep that split evaluable. Rebuild from budget_cfgs as the authority, overlay persisted remaining_* only for matching (split, dataset_id) entries, and clamp remaining values to the configured totals.

Artifacts

Repro: focused ledger restart script

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Stack trace captured during the T-Rex run

  • Keeps the raw stack trace available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/serve.py
Line: 102-123

Comment:
**Persisted ledger overrides config.** When `ledger.json` exists, this path rebuilds the budget ledger only from persisted entries and ignores the current baked `budget_cfgs`. If a restart follows a config change that lowers a budget, removes a split, or changes `max_samples_per_run`, the old persisted entry remains active with its old limits and can keep that split evaluable. Rebuild from `budget_cfgs` as the authority, overlay persisted `remaining_*` only for matching `(split, dataset_id)` entries, and clamp remaining values to the configured totals.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Comment on lines +147 to +152
exp = await self.engine.evaluate_admin(
task=self.selection_task,
dataset_id=dataset_id,
split=self.selection_split,
commit=commit,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Failed re-score aborts finalize. Any exception from evaluate_admin inside the shortlist loop aborts _best_from_db before the remaining candidates are considered. Because the shortlist is chosen from recorded scores, a high-recorded commit that fails under the trusted scorer can prevent finalization even when lower-ranked candidates would re-score successfully. Treat failed re-scores as default_minimum_score or skip them after logging, and only raise NoCandidateError if every candidate fails.

Artifacts

Repro: focused auto_best verifier harness

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: harness output showing finalize aborts before later candidate

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/verifier.py
Line: 147-152

Comment:
**Failed re-score aborts finalize.** Any exception from `evaluate_admin` inside the shortlist loop aborts `_best_from_db` before the remaining candidates are considered. Because the shortlist is chosen from recorded scores, a high-recorded commit that fails under the trusted scorer can prevent finalization even when lower-ranked candidates would re-score successfully. Treat failed re-scores as `default_minimum_score` or skip them after logging, and only raise `NoCandidateError` if every candidate fails.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant