fix(harbor): Mode A scorer provenance, auto_best admin re-score, fail-closed tiers [fixes 2/4 sidecar]#8
Conversation
…ers, budget reload Trust-boundary hardening for the eval sidecar (review findings on PR #4): - protocol: tier_for_split fails CLOSED (unlisted split -> no_access, with a warning), instead of fail-open to viewable. - serve: Mode A now requires task_project. build_components refuses to start a Mode A config without it, so the hidden-split/admin scorer is always loaded from the sidecar-baked task project, never the agent's committed repo (a committed scorer returning 1.0 can no longer win the reward). - verifier: auto_best no longer trusts the agent-influenced recorded score. It shortlists by recorded score, then re-runs evaluate_admin (the trusted scorer) on the selection split for the top-K and ranks by the admin score; fails closed. - serve: reload the persisted budget ledger on startup so a sidecar restart does not reset spent budget to full. - tests: adversarial coverage (no_access -> 400, finalize does not run an agent-supplied scorer via a cheating-agent repo, admin token 0o600/unreadable). Also folds in the valid Greptile findings on this PR: clear stale per-sample result files on re-eval, type reward_mode as Literal[submit, auto_best], and filter auto_best candidates by selection dataset id. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| # Shortlist by recorded score (cheap, agent-influenced -> not trusted as | ||
| # final), one row per candidate (highest recorded score wins the slot). | ||
| ranked = split_df.sort_values( | ||
| by=["mean_score", "candidate_created_at"], ascending=[False, False] | ||
| ).iloc[0] | ||
| return best["candidate_commit"] | ||
| ) | ||
| ranked = ranked.drop_duplicates(subset=["candidate_commit"], keep="first") | ||
| shortlist = ranked.head(max(1, self.rescore_top_k)) |
There was a problem hiding this comment.
Shortlist still controls selection. The admin re-score only runs on candidates admitted by the agent-influenced
mean_score ranking. An agent can run three throwaway commits with inflated recorded scores and leave the actually best commit below rescore_top_k; the trusted scorer never evaluates that lower-recorded commit, so auto_best can select a worse winner than the admin score would choose. Re-score every eligible candidate, or choose the shortlist from a source that is not ranked by agent-written scores.
Artifacts
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/verifier.py
Line: 135-141
Comment:
**Shortlist still controls selection.** The admin re-score only runs on candidates admitted by the agent-influenced `mean_score` ranking. An agent can run three throwaway commits with inflated recorded scores and leave the actually best commit below `rescore_top_k`; the trusted scorer never evaluates that lower-recorded commit, so `auto_best` can select a worse winner than the admin score would choose. Re-score every eligible candidate, or choose the shortlist from a source that is not ranked by agent-written scores.
How can I resolve this? If you propose a fix, please make it concise.| if persist_path.exists(): | ||
| try: | ||
| persisted = json.loads(persist_path.read_text()) | ||
| budgets: list[SplitBudget] = [] | ||
| for entry in persisted: | ||
| b = SplitBudget( | ||
| split=entry["split"], | ||
| dataset_id=entry.get("dataset_id", ""), | ||
| total_sample_budget=entry.get("total_sample_budget"), | ||
| total_run_budget=entry.get("total_run_budget"), | ||
| max_samples_per_run=entry.get("max_samples_per_run"), | ||
| ) | ||
| # __post_init__ reset remaining_* to total_*; restore spent state. | ||
| b.remaining_sample_budget = entry.get("remaining_sample_budget") | ||
| b.remaining_run_budget = entry.get("remaining_run_budget") | ||
| budgets.append(b) | ||
| logger.info( | ||
| "Reloaded persisted budget ledger from %s (%d splits).", | ||
| persist_path, | ||
| len(budgets), | ||
| ) | ||
| return BudgetLedger(budgets, persist_path=persist_path) |
There was a problem hiding this comment.
Persisted ledger overrides config. When
ledger.json exists, this path rebuilds the budget ledger only from persisted entries and ignores the current baked budget_cfgs. If a restart follows a config change that lowers a budget, removes a split, or changes max_samples_per_run, the old persisted entry remains active with its old limits and can keep that split evaluable. Rebuild from budget_cfgs as the authority, overlay persisted remaining_* only for matching (split, dataset_id) entries, and clamp remaining values to the configured totals.
Artifacts
Repro: focused ledger restart script
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Stack trace captured during the T-Rex run
- Keeps the raw stack trace available without making the summary code-heavy.
Ran code and verified through T-Rex
Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/serve.py
Line: 102-123
Comment:
**Persisted ledger overrides config.** When `ledger.json` exists, this path rebuilds the budget ledger only from persisted entries and ignores the current baked `budget_cfgs`. If a restart follows a config change that lowers a budget, removes a split, or changes `max_samples_per_run`, the old persisted entry remains active with its old limits and can keep that split evaluable. Rebuild from `budget_cfgs` as the authority, overlay persisted `remaining_*` only for matching `(split, dataset_id)` entries, and clamp remaining values to the configured totals.
How can I resolve this? If you propose a fix, please make it concise.| exp = await self.engine.evaluate_admin( | ||
| task=self.selection_task, | ||
| dataset_id=dataset_id, | ||
| split=self.selection_split, | ||
| commit=commit, | ||
| ) |
There was a problem hiding this comment.
Failed re-score aborts finalize. Any exception from
evaluate_admin inside the shortlist loop aborts _best_from_db before the remaining candidates are considered. Because the shortlist is chosen from recorded scores, a high-recorded commit that fails under the trusted scorer can prevent finalization even when lower-ranked candidates would re-score successfully. Treat failed re-scores as default_minimum_score or skip them after logging, and only raise NoCandidateError if every candidate fails.
Artifacts
Repro: focused auto_best verifier harness
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: harness output showing finalize aborts before later candidate
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/verifier.py
Line: 147-152
Comment:
**Failed re-score aborts finalize.** Any exception from `evaluate_admin` inside the shortlist loop aborts `_best_from_db` before the remaining candidates are considered. Because the shortlist is chosen from recorded scores, a high-recorded commit that fails under the trusted scorer can prevent finalization even when lower-ranked candidates would re-score successfully. Treat failed re-scores as `default_minimum_score` or skip them after logging, and only raise `NoCandidateError` if every candidate fails.
How can I resolve this? If you propose a fix, please make it concise.
Stacks on #4 (
harbor-2-sidecar). Addresses the trust-boundary findings from the review of that PR. 2 of 4 fix PRs.What this fixes
tier_for_splitfails open (unlisted split →viewable→ per-sample labels written to the agent volume)no_access, with a warningtask_project=Noneloads the@task.evaluation()scorer from the agent's committed repo;evaluate_adminthen scores the hidden split with it, so a committed scorer returning1.0winsbuild_componentsrefuses a Mode A config withouttask_project, so the hidden-split scorer is always sidecar-baked, never the agent's commitauto_bestranks candidates purely by the agent-influenced recorded score, so a candidate can inflate its own selection scoreevaluate_admin(trusted scorer) on the selection split for the top-K and ranking by admin score. Fails closedremaining_*from the persisted ledger when present (fail-safe to configured budgets)no_access→400, a cheating-agent repo provingfinalizedoesn't run the agent's scorer, token0o600/unreadableGreptile findings on PR #4, also fixed here (replies posted on #4)
server.py):_route_resultsrecreates the dir so it reflects exactly the current metered run.reward_modetyped (serve.py): nowLiteral["submit","auto_best"], rejected at parse time.auto_bestdataset scoping (verifier.py): candidates filtered byselection_dataset_idso a foreign dataset's same-named split can't select the winner.Architecture question
This PR keeps
tier_for_splitas the sidecar's write-routing resolver and flips it fail-closed; the engine-side real-time access gate lands in the core PR. See the design note (answering "should split access be a dataset property with a real-time check?") for the full reasoning. Short answer: yes, and the core PR implements the minimal version.Tests
20 harbor unit tests pass (protocol, verifier, app). The 3 real-eval serve tests pass end-to-end (built via
uvagainst a real task project): the honest path, the cheating-agent scorer-provenance proof (reward stays0.0, not the agent's1.0), and the ledger-reload-across-restart test.🤖 Generated with Claude Code
Greptile Summary
This PR tightens Harbor sidecar evaluation and reward handling. The main changes are:
no_access.task_project.auto_bestnow admin re-scores a shortlist before choosing a winner.Confidence Score: 2/5
Not merge-safe until the auto-selection and budget reload paths are corrected.
The changes address several trust-boundary issues, but runtime checks reproduced cases where candidate selection can still be controlled by recorded scores, scorer failures can abort finalization prematurely, and persisted budget state can override the current baked configuration.
vero/src/vero/harbor/verifier.pyandvero/src/vero/harbor/serve.pyneed attention.What T-Rex did
Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "fix(harbor): Mode A scorer provenance, a..." | Re-trigger Greptile