fix(harbor): Mode A scorer provenance, auto_best admin re-score, fail-closed tiers [fixes 2/4 sidecar] by shehabyasser-scale · Pull Request #8 · scaleapi/vero

shehabyasser-scale · 2026-06-30T10:21:12Z

Stacks on #4 (harbor-2-sidecar). Addresses the trust-boundary findings from the review of that PR. 2 of 4 fix PRs.

What this fixes

#	Sev	Finding	Fix
S1	🔴	`tier_for_split` fails open (unlisted split → `viewable` → per-sample labels written to the agent volume)	Fail closed: unlisted → `no_access`, with a warning
S2	🔴	Mode A scorer provenance (the headline). `task_project=None` loads the `@task.evaluation()` scorer from the agent's committed repo; `evaluate_admin` then scores the hidden split with it, so a committed scorer returning `1.0` wins	`build_components` refuses a Mode A config without `task_project`, so the hidden-split scorer is always sidecar-baked, never the agent's commit
S3	🔴	`auto_best` ranks candidates purely by the agent-influenced recorded score, so a candidate can inflate its own selection score	Recorded score only shortlists; the winner is decided by re-running `evaluate_admin` (trusted scorer) on the selection split for the top-K and ranking by admin score. Fails closed
S4	🟠	The budget ledger is rebuilt from config on every start, so a sidecar restart resets spent budget to full	Reload `remaining_*` from the persisted ledger when present (fail-safe to configured budgets)
S5	🟠	No adversarial/OS-level tests for the integrity claims	Added: `no_access`→400, a cheating-agent repo proving `finalize` doesn't run the agent's scorer, token `0o600`/unreadable

Greptile findings on PR #4, also fixed here (replies posted on #4)

stale result dirs (server.py): _route_results recreates the dir so it reflects exactly the current metered run.
reward_mode typed (serve.py): now Literal["submit","auto_best"], rejected at parse time.
auto_best dataset scoping (verifier.py): candidates filtered by selection_dataset_id so a foreign dataset's same-named split can't select the winner.

Architecture question

This PR keeps tier_for_split as the sidecar's write-routing resolver and flips it fail-closed; the engine-side real-time access gate lands in the core PR. See the design note (answering "should split access be a dataset property with a real-time check?") for the full reasoning. Short answer: yes, and the core PR implements the minimal version.

Tests

20 harbor unit tests pass (protocol, verifier, app). The 3 real-eval serve tests pass end-to-end (built via uv against a real task project): the honest path, the cheating-agent scorer-provenance proof (reward stays 0.0, not the agent's 1.0), and the ledger-reload-across-restart test.

🤖 Generated with Claude Code

Greptile Summary

This PR tightens Harbor sidecar evaluation and reward handling. The main changes are:

Unlisted splits now fail closed to no_access.
Mode A startup now requires a trusted task_project.
Budget ledgers are reloaded from persisted state across restarts.
auto_best now admin re-scores a shortlist before choosing a winner.
Result directories are recreated to avoid stale per-sample files.
Tests cover scorer provenance, no-access handling, token permissions, ledger reload, and verifier selection.

Confidence Score: 2/5

Not merge-safe until the auto-selection and budget reload paths are corrected.

The changes address several trust-boundary issues, but runtime checks reproduced cases where candidate selection can still be controlled by recorded scores, scorer failures can abort finalization prematurely, and persisted budget state can override the current baked configuration.

vero/src/vero/harbor/verifier.py and vero/src/vero/harbor/serve.py need attention.

T-Rex Logs

What T-Rex did

Ran a focused pytest harness against the Verifier auto_best path with five validation candidates and a recorded-score shortlist that excluded the trusted best commit.
The run showed evaluate_admin was called only for bad_1, bad_2, and bad_3, with the final scoring run on bad_3 and the true_best (0.99) not evaluated because its mean_score ranked below the shortlist.
The test failed asserting the selected winner should match the full admin best, demonstrating auto_best picked bad_3 instead of true_best.
Ran a focused ledger-restart repro that wrote a persisted ledger.json with dataset-a/test budgets (100 samples, 10 runs, max 50) and then restarted ledger loading with reduced budgets (10 samples, 1 run, max 5).
The loaded ledger still contained the persisted old values, allowed check(dataset-a, test, 6) despite the current max_samples_per_run being 5, and, when restarted with an empty budget_cfgs list, the removed dataset-a/test split remained evaluable from the persisted ledger.
Ran a Harbor Verifier auto_best harness with two candidates ordered by recorded score; monkeypatched evaluate_admin to raise for first_raises and return 0.95 for later_success, and finalize raised RuntimeError after evaluating only first_raises, so later_success was never considered.
Before: a fail-open tier showed viewable split_accesses; After: the code path moved to a fail-closed tier with no_access, and a repro script constructs the failing path for the tier switch.
Before: mode A provenance showed harbor=None, task_project=None, initialized components, token written, and exit code 0; After: head raised ValueError about missing task_project and the fail-closed guard activated, with exit code 0; Supplemental: a fail-closed pytest run appeared successful, though the trusted-scorer test did not run due to a FileNotFoundError during fixture setup.
Before: reward_mode='bogus' with no selection_dataset_id and a chosen foreign_ds2_commit; After: the head rejected the bogus reward mode with ValidationError, exposed selection_dataset_id, narrowed ranking to ds1, and selected intended_ds1_commit.

_{Ran code and verified through T-Rex}

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
vero/src/vero/harbor/verifier.py:135-141
**Shortlist still controls selection.** The admin re-score only runs on candidates admitted by the agent-influenced `mean_score` ranking. An agent can run three throwaway commits with inflated recorded scores and leave the actually best commit below `rescore_top_k`; the trusted scorer never evaluates that lower-recorded commit, so `auto_best` can select a worse winner than the admin score would choose. Re-score every eligible candidate, or choose the shortlist from a source that is not ranked by agent-written scores.

### Issue 2 of 3
vero/src/vero/harbor/serve.py:102-123
**Persisted ledger overrides config.** When `ledger.json` exists, this path rebuilds the budget ledger only from persisted entries and ignores the current baked `budget_cfgs`. If a restart follows a config change that lowers a budget, removes a split, or changes `max_samples_per_run`, the old persisted entry remains active with its old limits and can keep that split evaluable. Rebuild from `budget_cfgs` as the authority, overlay persisted `remaining_*` only for matching `(split, dataset_id)` entries, and clamp remaining values to the configured totals.

### Issue 3 of 3
vero/src/vero/harbor/verifier.py:147-152
**Failed re-score aborts finalize.** Any exception from `evaluate_admin` inside the shortlist loop aborts `_best_from_db` before the remaining candidates are considered. Because the shortlist is chosen from recorded scores, a high-recorded commit that fails under the trusted scorer can prevent finalization even when lower-ranked candidates would re-score successfully. Treat failed re-scores as `default_minimum_score` or skip them after logging, and only raise `NoCandidateError` if every candidate fails.

_{Reviews (1): Last reviewed commit: "fix(harbor): Mode A scorer provenance, a..." | Re-trigger Greptile}

Greptile also left 3 inline comments on this PR.

…ers, budget reload Trust-boundary hardening for the eval sidecar (review findings on PR #4): - protocol: tier_for_split fails CLOSED (unlisted split -> no_access, with a warning), instead of fail-open to viewable. - serve: Mode A now requires task_project. build_components refuses to start a Mode A config without it, so the hidden-split/admin scorer is always loaded from the sidecar-baked task project, never the agent's committed repo (a committed scorer returning 1.0 can no longer win the reward). - verifier: auto_best no longer trusts the agent-influenced recorded score. It shortlists by recorded score, then re-runs evaluate_admin (the trusted scorer) on the selection split for the top-K and ranks by the admin score; fails closed. - serve: reload the persisted budget ledger on startup so a sidecar restart does not reset spent budget to full. - tests: adversarial coverage (no_access -> 400, finalize does not run an agent-supplied scorer via a cheating-agent repo, admin token 0o600/unreadable). Also folds in the valid Greptile findings on this PR: clear stale per-sample result files on re-eval, type reward_mode as Literal[submit, auto_best], and filter auto_best candidates by selection dataset id. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-06-30T10:27:44Z

+        # Shortlist by recorded score (cheap, agent-influenced -> not trusted as
+        # final), one row per candidate (highest recorded score wins the slot).
+        ranked = split_df.sort_values(
            by=["mean_score", "candidate_created_at"], ascending=[False, False]
-        ).iloc[0]
-        return best["candidate_commit"]
+        )
+        ranked = ranked.drop_duplicates(subset=["candidate_commit"], keep="first")
+        shortlist = ranked.head(max(1, self.rescore_top_k))


Shortlist still controls selection. The admin re-score only runs on candidates admitted by the agent-influenced mean_score ranking. An agent can run three throwaway commits with inflated recorded scores and leave the actually best commit below rescore_top_k; the trusted scorer never evaluates that lower-recorded commit, so auto_best can select a worse winner than the admin score would choose. Re-score every eligible candidate, or choose the shortlist from a source that is not ranked by agent-written scores.

Artifacts

Repro: focused pytest harness constructing recorded-score shortlist that excludes the trusted best candidate

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: verbose pytest output showing shortlist membership, evaluate_admin calls, skipped true_best, selected bad_3, and failing assertion

Keeps the command output available without making the summary code-heavy.

_{Ran code and verified through T-Rex}

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/verifier.py Line: 135-141 Comment: **Shortlist still controls selection.** The admin re-score only runs on candidates admitted by the agent-influenced `mean_score` ranking. An agent can run three throwaway commits with inflated recorded scores and leave the actually best commit below `rescore_top_k`; the trusted scorer never evaluates that lower-recorded commit, so `auto_best` can select a worse winner than the admin score would choose. Re-score every eligible candidate, or choose the shortlist from a source that is not ranked by agent-written scores. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-30T10:27:45Z

+    if persist_path.exists():
+        try:
+            persisted = json.loads(persist_path.read_text())
+            budgets: list[SplitBudget] = []
+            for entry in persisted:
+                b = SplitBudget(
+                    split=entry["split"],
+                    dataset_id=entry.get("dataset_id", ""),
+                    total_sample_budget=entry.get("total_sample_budget"),
+                    total_run_budget=entry.get("total_run_budget"),
+                    max_samples_per_run=entry.get("max_samples_per_run"),
+                )
+                # __post_init__ reset remaining_* to total_*; restore spent state.
+                b.remaining_sample_budget = entry.get("remaining_sample_budget")
+                b.remaining_run_budget = entry.get("remaining_run_budget")
+                budgets.append(b)
+            logger.info(
+                "Reloaded persisted budget ledger from %s (%d splits).",
+                persist_path,
+                len(budgets),
+            )
+            return BudgetLedger(budgets, persist_path=persist_path)


Persisted ledger overrides config. When ledger.json exists, this path rebuilds the budget ledger only from persisted entries and ignores the current baked budget_cfgs. If a restart follows a config change that lowers a budget, removes a split, or changes max_samples_per_run, the old persisted entry remains active with its old limits and can keep that split evaluable. Rebuild from budget_cfgs as the authority, overlay persisted remaining_* only for matching (split, dataset_id) entries, and clamp remaining values to the configured totals.

Artifacts

Repro: focused ledger restart script

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Stack trace captured during the T-Rex run

Keeps the raw stack trace available without making the summary code-heavy.

_{Ran code and verified through T-Rex}

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/serve.py Line: 102-123 Comment: **Persisted ledger overrides config.** When `ledger.json` exists, this path rebuilds the budget ledger only from persisted entries and ignores the current baked `budget_cfgs`. If a restart follows a config change that lowers a budget, removes a split, or changes `max_samples_per_run`, the old persisted entry remains active with its old limits and can keep that split evaluable. Rebuild from `budget_cfgs` as the authority, overlay persisted `remaining_*` only for matching `(split, dataset_id)` entries, and clamp remaining values to the configured totals. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-30T10:27:46Z

+            exp = await self.engine.evaluate_admin(
+                task=self.selection_task,
+                dataset_id=dataset_id,
+                split=self.selection_split,
+                commit=commit,
+            )


Failed re-score aborts finalize. Any exception from evaluate_admin inside the shortlist loop aborts _best_from_db before the remaining candidates are considered. Because the shortlist is chosen from recorded scores, a high-recorded commit that fails under the trusted scorer can prevent finalization even when lower-ranked candidates would re-score successfully. Treat failed re-scores as default_minimum_score or skip them after logging, and only raise NoCandidateError if every candidate fails.

Artifacts

Repro: focused auto_best verifier harness

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: harness output showing finalize aborts before later candidate

Keeps the command output available without making the summary code-heavy.

_{Ran code and verified through T-Rex}

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/verifier.py Line: 147-152 Comment: **Failed re-score aborts finalize.** Any exception from `evaluate_admin` inside the shortlist loop aborts `_best_from_db` before the remaining candidates are considered. Because the shortlist is chosen from recorded scores, a high-recorded commit that fails under the trusted scorer can prevent finalization even when lower-ranked candidates would re-score successfully. Treat failed re-scores as `default_minimum_score` or skip them after logging, and only raise `NoCandidateError` if every candidate fails. How can I resolve this? If you propose a fix, please make it concise.

shehabyasser-scale mentioned this pull request Jun 30, 2026

docs(harbor): honest integrity guarantees, GAIA leak caveat, Mode A example [fixes 4/4 docs] #10

Open

greptile-apps Bot reviewed Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(harbor): Mode A scorer provenance, auto_best admin re-score, fail-closed tiers [fixes 2/4 sidecar]#8

fix(harbor): Mode A scorer provenance, auto_best admin re-score, fail-closed tiers [fixes 2/4 sidecar]#8
shehabyasser-scale wants to merge 1 commit into
harbor-2-sidecarfrom
harbor-2-sidecar-fixes

shehabyasser-scale commented Jun 30, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

greptile-apps Bot Jun 30, 2026

Uh oh!

greptile-apps Bot Jun 30, 2026

Uh oh!

greptile-apps Bot Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

shehabyasser-scale commented Jun 30, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this fixes

Greptile findings on PR #4, also fixed here (replies posted on #4)

Architecture question

Tests

Greptile Summary

Confidence Score: 2/5

T-Rex Logs

Uh oh!

greptile-apps Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

shehabyasser-scale commented Jun 30, 2026 •

edited by greptile-apps Bot

Loading