[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A)#4
[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A)#4varunursekar wants to merge 1 commit into
Conversation
The evaluation engine, run in a sidecar container, with the trust boundary that makes an optimization run leaderboard-gradeable: - `EvaluationSidecar` (server.py): agent-facing handlers — commit transfer from the untrusted agent repo (git fetch, hooks off, object copy) and tier-gated write-routing of results across the agent-readable and admin volumes. - `Verifier` (verifier.py): commit selection (submit | auto_best) + hidden-split scoring. - Per-trial admin token (auth.py), written root:600 so the optimizer (de-privileged) cannot read it; only the verifier (root, shared mode) can. - FastAPI surface (app.py): /eval, /submit, /status for the agent (metered, redacted); /finalize for the verifier (token-gated). `vero harbor serve` (serve.py) assembles the engine + sidecar + verifier from a ServeConfig and runs it under uvicorn. - `vero harbor` CLI clients (cli.py): serve | eval | submit | status | finalize (build/run land with the compiler). HarborConfig + the Mode-B dataset partition helpers (config.py, dataset.py) are included so the harbor package imports cleanly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| commit = experiment.run.candidate.commit | ||
| dest = self.agent_volume / "results" / f"{split}__{commit[:12]}" |
There was a problem hiding this comment.
This result path only uses the split and commit, so two evaluations of the same commit on different sample sets reuse the same directory. A full visible eval can write 1.json and 2.json; a later sample_ids=[0] eval overwrites summary.json with n_samples=1 but leaves the old per-sample files in place. The returned result_path can then expose stale files that were not part of the current metered run.
Artifacts
Repro: generated Harbor sidecar stale results harness
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: harness execution output showing stale per-sample files after second eval
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/server.py
Line: 164-165
Comment:
**Avoid stale results**
This result path only uses the split and commit, so two evaluations of the same commit on different sample sets reuse the same directory. A full visible eval can write `1.json` and `2.json`; a later `sample_ids=[0]` eval overwrites `summary.json` with `n_samples=1` but leaves the old per-sample files in place. The returned `result_path` can then expose stale files that were not part of the current metered run.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Agree, addressing this in the incoming fix. The dest is agent_volume / "results" / f"{split}__{commit[:12]}" with no scoping to the sample set, and the writer only rewrites summary.json plus the per-sample files for the current run, it never clears the directory first. So a subset re-eval on a commit previously evaluated in full leaves orphaned {sample_id}.json files behind, and the returned result_path then surfaces per-sample results that weren't part of the current metered run.
The fix clears the per-sample projection before writing (remove stale *.json other than the freshly written set) so the directory reflects exactly the current run. summary.json's n_samples already reflects the current run; the gap was purely the leftover per-sample files.
(Landed in #8.)
| split_df = df[df["dataset_subset_split"] == self.selection_split] | ||
| if self.base_commit is not None: | ||
| split_df = split_df[split_df["candidate_commit"] != self.base_commit] | ||
| if len(split_df) == 0: | ||
| raise NoCandidateError( | ||
| f"auto_best mode but no candidate experiments on split " | ||
| f"'{self.selection_split}'." | ||
| ) | ||
| best = split_df.sort_values( | ||
| by=["mean_score", "candidate_created_at"], ascending=[False, False] | ||
| ).iloc[0] |
There was a problem hiding this comment.
auto_best filters only by split before picking the highest score. If the DB contains validation runs for more than one dataset, a high score from other_ds/validation can select that commit for scoring on this target's hidden dataset. The verifier can then finalize a commit that was not the best candidate for the intended selection dataset.
Artifacts
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/verifier.py
Line: 103-113
Comment:
**Filter selection dataset**
`auto_best` filters only by split before picking the highest score. If the DB contains validation runs for more than one dataset, a high score from `other_ds/validation` can select that commit for scoring on this target's hidden dataset. The verifier can then finalize a commit that was not the best candidate for the intended selection dataset.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Agree, addressing this in the incoming fix. _best_from_db filters split_df = df[df["dataset_subset_split"] == self.selection_split] and (optionally) excludes the base commit, but it never constrains by dataset id, and the Verifier doesn't currently carry a selection dataset id at all. In a deployment whose experiment DB has accumulated runs across more than one dataset on the same split name, the top score could come from a foreign dataset and select the wrong commit.
The fix threads the selection dataset id into the verifier and adds a dataset_subset_dataset_id == selection_dataset_id filter alongside the existing split filter, so auto_best only ever ranks candidates that were scored on the intended selection dataset. In the current single-dataset compiled-task topology this is latent rather than firing, but it's a correct hardening for the shared-DB case.
(Landed in #8.)
| harbor: dict | None = None # HarborConfig kwargs | ||
|
|
||
| # selection / reward | ||
| reward_mode: str = "auto_best" |
There was a problem hiding this comment.
reward_mode accepts any string here, and Verifier._select_commit() treats every value except exactly submit as auto_best. A typo such as submitted on a submit-based task will silently ignore submission.json and either select from prior experiments or fail with no auto-best candidates.
Artifacts
Repro: focused script exercising ServeConfig and Verifier reward_mode selection
- Contains supporting evidence from the run (text/x-python; charset=utf-8).
Repro: script output showing invalid reward_mode accepted and treated as auto_best
- Keeps the command output available without making the summary code-heavy.
Ran code and verified through T-Rex
Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/serve.py
Line: 63
Comment:
**Validate reward mode**
`reward_mode` accepts any string here, and `Verifier._select_commit()` treats every value except exactly `submit` as `auto_best`. A typo such as `submitted` on a submit-based task will silently ignore `submission.json` and either select from prior experiments or fail with no auto-best candidates.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Agree, addressing this in the incoming fix. ServeConfig.reward_mode is str = "auto_best", but the contract is genuinely binary: Verifier.__init__ already types it Literal["submit", "auto_best"] and _select_commit does if self.reward_mode == "submit": ... else: return self._best_from_db(), so any non-exact value (e.g. "submitted") silently routes to auto_best and ignores submission.json.
The fix tightens ServeConfig.reward_mode to Literal["submit", "auto_best"] so a bad value is rejected at config-parse time (model_validate_json) inside the compiler rather than failing open at finalize. Good catch on the fail-open direction being the dangerous one here.
(Landed in #8.)
|
|
||
| # Mode A | ||
| task: str | None = None | ||
| task_project: str | None = None |
There was a problem hiding this comment.
Security, the headline one. In Mode A this defaults to None, so the Evaluator runs uv in the agent's own project and loads the @task.evaluation() scorer from the agent's committed code. Verifier.finalize → evaluate_admin then scores the hidden split with that agent-controlled scorer, so a committed scorer that returns 1.0 wins the leaderboard. The token/volume/object-copy machinery secures transport, not scorer provenance. For a gradeable boundary the verifier must score with a sidecar-baked task project: require task_project/inner_task for Mode A and reject configs where the scorer would resolve from the agent repo. This is the default posture and the single most important gap in the stack.
| tok = generate_token() | ||
| p = write_admin_token(tmp_path / "t", tok) | ||
| assert read_admin_token(p) == tok | ||
| assert (p.stat().st_mode & 0o777) == 0o600 |
There was a problem hiding this comment.
For the security-critical PR, the OS-level claims are asserted only as a 0o600 mode bit plus happy-path mocks. Please add adversarial coverage: (1) agent /eval on a no_access split → rejected end-to-end; (2) an integration test that finalize does not execute an agent-supplied scorer (the Mode A provenance gap); (3) a uid-drop test that agent.user cannot read the token file.
| workspace = await GitWorkspace.create(config.repo_path) | ||
|
|
||
| budget = BudgetLedger( | ||
| [SplitBudget(**b) for b in config.budgets], |
There was a problem hiding this comment.
build_components rebuilds the BudgetLedger from config.budgets on every start, and the ledger writes persist_path on each mutation but never reads it back. So a sidecar restart mid-trial silently resets spent budget — metering is not crash-safe despite the durable-ledger framing. Add a load-if-exists path that seeds from the persisted ledger when the file exists, and from config.budgets only when it does not.
| """Best candidate by recorded score on the selection split (excludes baseline).""" | ||
| if self.engine.db is None: | ||
| raise NoCandidateError("auto_best mode but no experiment database.") | ||
| df = self.engine.db.get_experiments_df(fill_score=default_minimum_score) |
There was a problem hiding this comment.
auto_best ranks candidates solely by the recorded mean_score from engine.db.get_experiments_df, but that DB is written by the agent's metered /eval calls and the verifier shares the same in-process engine/DB. The only admin-independent scoring at finalize is evaluate_admin on the test target — after selection, and only on the chosen winner. So a candidate that records an inflated selection score selects itself. (Mode B in #5 sharpens this: the recorded selection score is the reward a nested run of the candidate emitted, which the optimizer controls.) Before trusting the ranking, re-score under admin: re-run evaluate_admin on the selection split for the top-K and re-rank, or at minimum re-score the winner and fail closed on deviation.
| for sa in split_accesses: | ||
| if sa.split == split: | ||
| return sa.access | ||
| return SplitAccessLevel.viewable |
There was a problem hiding this comment.
This fails open, the wrong default for a trust boundary. tier_for_split is the gate _route_results uses to decide whether to write full per-sample SampleResults (labels included) to the agent volume; an author who budgets a split but forgets to add it to split_accesses lands here and gets viewable → ground truth on the agent-readable volume. Default to the most restrictive tier (no_access) and update the docstring, which currently advertises the fail-open as intended. If an unlisted split should ever be visible, make that explicit and logged.
Draft · Stack 2 of 4 — targets
harbor-1-core(review [1/4] first; its diff is the base here).The evaluation engine run in a sidecar container, plus the trust boundary that makes a run leaderboard-gradeable. This is the security-critical PR — a focused/security review is worthwhile here.
EvaluationSidecar(server.py): commit transfer from the untrusted agent repo (git fetch, hooks off, object copy) + tier-gated result write-routing across the agent-readable and admin volumes.Verifier(verifier.py): commit selection (submit | auto_best) + hidden-split scoring.root:600— unreadable by the de-privileged optimizer, readable by the verifier (root, shared mode)./eval/submit/status(agent; metered, redacted) and/finalize(verifier; token-gated).vero harbor serveassembles engine+sidecar+verifier from aServeConfig.HarborConfig/partition helpers so the package imports cleanly (build/runlight up in [3/4]).Stack: [1/4] core → this → [3/4] compiler → [4/4] docs.
🤖 Generated with Claude Code
Greptile Summary
This PR adds the Harbor eval sidecar and verifier path. The main changes are:
Confidence Score: 4/5
Several correctness issues in result isolation, commit selection, configuration validation, and startup behavior should be addressed before merging.
The changed code is covered by focused tests and the review identified concrete runtime-impacting paths with reproducible behavior, while the remaining surface is relatively contained to the Harbor sidecar and verifier components.
vero/src/vero/harbor/server.py, vero/src/vero/harbor/verifier.py, and vero/src/vero/harbor/serve.py
What T-Rex did
Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "Harbor eval sidecar: verifier, token aut..." | Re-trigger Greptile