[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A) by varunursekar · Pull Request #4 · scaleapi/vero

varunursekar · 2026-06-24T18:12:23Z

Draft · Stack 2 of 4 — targets harbor-1-core (review [1/4] first; its diff is the base here).

The evaluation engine run in a sidecar container, plus the trust boundary that makes a run leaderboard-gradeable. This is the security-critical PR — a focused/security review is worthwhile here.

EvaluationSidecar (server.py): commit transfer from the untrusted agent repo (git fetch, hooks off, object copy) + tier-gated result write-routing across the agent-readable and admin volumes.
Verifier (verifier.py): commit selection (submit | auto_best) + hidden-split scoring.
Admin token (auth.py): per-trial, written root:600 — unreadable by the de-privileged optimizer, readable by the verifier (root, shared mode).
FastAPI (app.py): /eval /submit /status (agent; metered, redacted) and /finalize (verifier; token-gated). vero harbor serve assembles engine+sidecar+verifier from a ServeConfig.
CLI clients (cli.py) + HarborConfig/partition helpers so the package imports cleanly (build/run light up in [3/4]).

Stack: [1/4] core → this → [3/4] compiler → [4/4] docs.

🤖 Generated with Claude Code

Greptile Summary

This PR adds the Harbor eval sidecar and verifier path. The main changes are:

New FastAPI endpoints for eval, submit, status, health, and finalize.
Admin bearer-token helpers for verifier-only finalize access.
Sidecar commit transfer from the agent repo and tier-based result routing.
Verifier commit selection and hidden-split reward scoring.
Harbor config, dataset partition helpers, CLI clients, and coverage tests.

Confidence Score: 4/5

Several correctness issues in result isolation, commit selection, configuration validation, and startup behavior should be addressed before merging.

The changed code is covered by focused tests and the review identified concrete runtime-impacting paths with reproducible behavior, while the remaining surface is relatively contained to the Harbor sidecar and verifier components.

vero/src/vero/harbor/server.py, vero/src/vero/harbor/verifier.py, and vero/src/vero/harbor/serve.py

T-Rex Logs

What T-Rex did

- Reproduced stale Harbor sidecar results behavior by running the harness twice on the same split/commit, verifying the second run reused summary.json with n_samples=1 and left stale per-sample files from the first run.
- Reproduced the cross-dataset selection path with the focused verifier harness, observing target_ds 0.51 vs other_ds 0.99 and auto_best choosing other_ds commit.
- Reproduced reward_mode handling in the repro, demonstrating that reward_mode='submitted' is accepted, submission.json is used for selection, and no candidate results were found when experiments are missing.
- Reproduced Guard Mode B startup path and observed failure due to ModuleNotFoundError: No module named 'vero.harbor.runner'.
- Reproduced sidecar trust routing pre/post captures, including a base state with ModuleNotFoundError and an after state with real git fetch and expected routing outputs.
- Inspected token-based access and endpoint behavior under trust routing, verifying token mode configuration, token round-trip, and unauthenticated vs. authenticated endpoint results.
- Reproduced verifier selection harness results, showing before lack of vero.harbor import and after submission results with NoCandidateError paths using a fake DB and evaluator.
- Reproduced mode-a serve assembly flow, with before showing missing harbor import and after demonstrating successful imports, JSON parsing, dataset construction, and serve app creation.

_{Ran code and verified through T-Rex}

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
vero/src/vero/harbor/server.py:164-165
**Avoid stale results**

This result path only uses the split and commit, so two evaluations of the same commit on different sample sets reuse the same directory. A full visible eval can write `1.json` and `2.json`; a later `sample_ids=[0]` eval overwrites `summary.json` with `n_samples=1` but leaves the old per-sample files in place. The returned `result_path` can then expose stale files that were not part of the current metered run.

### Issue 2 of 3
vero/src/vero/harbor/verifier.py:103-113
**Filter selection dataset**

`auto_best` filters only by split before picking the highest score. If the DB contains validation runs for more than one dataset, a high score from `other_ds/validation` can select that commit for scoring on this target's hidden dataset. The verifier can then finalize a commit that was not the best candidate for the intended selection dataset.

### Issue 3 of 3
vero/src/vero/harbor/serve.py:63
**Validate reward mode**

`reward_mode` accepts any string here, and `Verifier._select_commit()` treats every value except exactly `submit` as `auto_best`. A typo such as `submitted` on a submit-based task will silently ignore `submission.json` and either select from prior experiments or fail with no auto-best candidates.

_{Reviews (1): Last reviewed commit: "Harbor eval sidecar: verifier, token aut..." | Re-trigger Greptile}

Greptile also left 3 inline comments on this PR.

The evaluation engine, run in a sidecar container, with the trust boundary that makes an optimization run leaderboard-gradeable: - `EvaluationSidecar` (server.py): agent-facing handlers — commit transfer from the untrusted agent repo (git fetch, hooks off, object copy) and tier-gated write-routing of results across the agent-readable and admin volumes. - `Verifier` (verifier.py): commit selection (submit | auto_best) + hidden-split scoring. - Per-trial admin token (auth.py), written root:600 so the optimizer (de-privileged) cannot read it; only the verifier (root, shared mode) can. - FastAPI surface (app.py): /eval, /submit, /status for the agent (metered, redacted); /finalize for the verifier (token-gated). `vero harbor serve` (serve.py) assembles the engine + sidecar + verifier from a ServeConfig and runs it under uvicorn. - `vero harbor` CLI clients (cli.py): serve | eval | submit | status | finalize (build/run land with the compiler). HarborConfig + the Mode-B dataset partition helpers (config.py, dataset.py) are included so the harbor package imports cleanly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-06-24T18:28:11Z

+        commit = experiment.run.candidate.commit
+        dest = self.agent_volume / "results" / f"{split}__{commit[:12]}"


Avoid stale results

This result path only uses the split and commit, so two evaluations of the same commit on different sample sets reuse the same directory. A full visible eval can write 1.json and 2.json; a later sample_ids=[0] eval overwrites summary.json with n_samples=1 but leaves the old per-sample files in place. The returned result_path can then expose stale files that were not part of the current metered run.

Artifacts

Repro: generated Harbor sidecar stale results harness

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: harness execution output showing stale per-sample files after second eval

Keeps the command output available without making the summary code-heavy.

_{Ran code and verified through T-Rex}

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/server.py Line: 164-165 Comment: **Avoid stale results** This result path only uses the split and commit, so two evaluations of the same commit on different sample sets reuse the same directory. A full visible eval can write `1.json` and `2.json`; a later `sample_ids=[0]` eval overwrites `summary.json` with `n_samples=1` but leaves the old per-sample files in place. The returned `result_path` can then expose stale files that were not part of the current metered run. How can I resolve this? If you propose a fix, please make it concise.

Agree, addressing this in the incoming fix. The dest is agent_volume / "results" / f"{split}__{commit[:12]}" with no scoping to the sample set, and the writer only rewrites summary.json plus the per-sample files for the current run, it never clears the directory first. So a subset re-eval on a commit previously evaluated in full leaves orphaned {sample_id}.json files behind, and the returned result_path then surfaces per-sample results that weren't part of the current metered run.

The fix clears the per-sample projection before writing (remove stale *.json other than the freshly written set) so the directory reflects exactly the current run. summary.json's n_samples already reflects the current run; the gap was purely the leftover per-sample files.

(Landed in #8.)

greptile-apps · 2026-06-24T18:28:12Z

+        split_df = df[df["dataset_subset_split"] == self.selection_split]
+        if self.base_commit is not None:
+            split_df = split_df[split_df["candidate_commit"] != self.base_commit]
+        if len(split_df) == 0:
+            raise NoCandidateError(
+                f"auto_best mode but no candidate experiments on split "
+                f"'{self.selection_split}'."
+            )
+        best = split_df.sort_values(
+            by=["mean_score", "candidate_created_at"], ascending=[False, False]
+        ).iloc[0]


Filter selection dataset

auto_best filters only by split before picking the highest score. If the DB contains validation runs for more than one dataset, a high score from other_ds/validation can select that commit for scoring on this target's hidden dataset. The verifier can then finalize a commit that was not the best candidate for the intended selection dataset.

Artifacts

Repro: focused verifier harness that seeds two dataset validation rows and asserts cross-dataset commit selection

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: command output showing seeded DB rows, selected other_ds commit, and target_ds hidden scoring call

Keeps the command output available without making the summary code-heavy.

_{Ran code and verified through T-Rex}

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/verifier.py Line: 103-113 Comment: **Filter selection dataset** `auto_best` filters only by split before picking the highest score. If the DB contains validation runs for more than one dataset, a high score from `other_ds/validation` can select that commit for scoring on this target's hidden dataset. The verifier can then finalize a commit that was not the best candidate for the intended selection dataset. How can I resolve this? If you propose a fix, please make it concise.

Agree, addressing this in the incoming fix. _best_from_db filters split_df = df[df["dataset_subset_split"] == self.selection_split] and (optionally) excludes the base commit, but it never constrains by dataset id, and the Verifier doesn't currently carry a selection dataset id at all. In a deployment whose experiment DB has accumulated runs across more than one dataset on the same split name, the top score could come from a foreign dataset and select the wrong commit.

The fix threads the selection dataset id into the verifier and adds a dataset_subset_dataset_id == selection_dataset_id filter alongside the existing split filter, so auto_best only ever ranks candidates that were scored on the intended selection dataset. In the current single-dataset compiled-task topology this is latent rather than firing, but it's a correct hardening for the shared-DB case.

(Landed in #8.)

greptile-apps · 2026-06-24T18:28:12Z

+    harbor: dict | None = None  # HarborConfig kwargs
+
+    # selection / reward
+    reward_mode: str = "auto_best"


Validate reward mode

reward_mode accepts any string here, and Verifier._select_commit() treats every value except exactly submit as auto_best. A typo such as submitted on a submit-based task will silently ignore submission.json and either select from prior experiments or fail with no auto-best candidates.

Artifacts

Repro: focused script exercising ServeConfig and Verifier reward_mode selection

Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: script output showing invalid reward_mode accepted and treated as auto_best

Keeps the command output available without making the summary code-heavy.

_{Ran code and verified through T-Rex}

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/serve.py Line: 63 Comment: **Validate reward mode** `reward_mode` accepts any string here, and `Verifier._select_commit()` treats every value except exactly `submit` as `auto_best`. A typo such as `submitted` on a submit-based task will silently ignore `submission.json` and either select from prior experiments or fail with no auto-best candidates. How can I resolve this? If you propose a fix, please make it concise.

Agree, addressing this in the incoming fix. ServeConfig.reward_mode is str = "auto_best", but the contract is genuinely binary: Verifier.__init__ already types it Literal["submit", "auto_best"] and _select_commit does if self.reward_mode == "submit": ... else: return self._best_from_db(), so any non-exact value (e.g. "submitted") silently routes to auto_best and ignores submission.json.

The fix tightens ServeConfig.reward_mode to Literal["submit", "auto_best"] so a bad value is rejected at config-parse time (model_validate_json) inside the compiler rather than failing open at finalize. Good catch on the fail-open direction being the dangerous one here.

(Landed in #8.)

shehabyasser-scale · 2026-06-30T08:33:30Z

+
+    # Mode A
+    task: str | None = None
+    task_project: str | None = None


Security, the headline one. In Mode A this defaults to None, so the Evaluator runs uv in the agent's own project and loads the @task.evaluation() scorer from the agent's committed code. Verifier.finalize → evaluate_admin then scores the hidden split with that agent-controlled scorer, so a committed scorer that returns 1.0 wins the leaderboard. The token/volume/object-copy machinery secures transport, not scorer provenance. For a gradeable boundary the verifier must score with a sidecar-baked task project: require task_project/inner_task for Mode A and reject configs where the scorer would resolve from the agent repo. This is the default posture and the single most important gap in the stack.

shehabyasser-scale · 2026-06-30T08:33:37Z

+        tok = generate_token()
+        p = write_admin_token(tmp_path / "t", tok)
+        assert read_admin_token(p) == tok
+        assert (p.stat().st_mode & 0o777) == 0o600


For the security-critical PR, the OS-level claims are asserted only as a 0o600 mode bit plus happy-path mocks. Please add adversarial coverage: (1) agent /eval on a no_access split → rejected end-to-end; (2) an integration test that finalize does not execute an agent-supplied scorer (the Mode A provenance gap); (3) a uid-drop test that agent.user cannot read the token file.

shehabyasser-scale · 2026-06-30T08:33:44Z

+    workspace = await GitWorkspace.create(config.repo_path)
+
+    budget = BudgetLedger(
+        [SplitBudget(**b) for b in config.budgets],


build_components rebuilds the BudgetLedger from config.budgets on every start, and the ledger writes persist_path on each mutation but never reads it back. So a sidecar restart mid-trial silently resets spent budget — metering is not crash-safe despite the durable-ledger framing. Add a load-if-exists path that seeds from the persisted ledger when the file exists, and from config.budgets only when it does not.

shehabyasser-scale · 2026-06-30T08:33:52Z

+        """Best candidate by recorded score on the selection split (excludes baseline)."""
+        if self.engine.db is None:
+            raise NoCandidateError("auto_best mode but no experiment database.")
+        df = self.engine.db.get_experiments_df(fill_score=default_minimum_score)


auto_best ranks candidates solely by the recorded mean_score from engine.db.get_experiments_df, but that DB is written by the agent's metered /eval calls and the verifier shares the same in-process engine/DB. The only admin-independent scoring at finalize is evaluate_admin on the test target — after selection, and only on the chosen winner. So a candidate that records an inflated selection score selects itself. (Mode B in #5 sharpens this: the recorded selection score is the reward a nested run of the candidate emitted, which the optimizer controls.) Before trusting the ranking, re-score under admin: re-run evaluate_admin on the selection split for the top-K and re-rank, or at minimum re-score the winner and fail closed on deviation.

shehabyasser-scale · 2026-06-30T08:35:45Z

+    for sa in split_accesses:
+        if sa.split == split:
+            return sa.access
+    return SplitAccessLevel.viewable


This fails open, the wrong default for a trust boundary. tier_for_split is the gate _route_results uses to decide whether to write full per-sample SampleResults (labels included) to the agent volume; an author who budgets a split but forgets to add it to split_accesses lands here and gets viewable → ground truth on the agent-readable volume. Default to the most restrictive tier (no_access) and update the docstring, which currently advertises the fail-open as intended. If an unlisted split should ever be visible, make that explicit and logged.

varunursekar mentioned this pull request Jun 24, 2026

Add Harbor integration: optimization-as-a-Harbor-task #2

Closed

varunursekar requested a review from a team June 24, 2026 18:17

varunursekar marked this pull request as ready for review June 24, 2026 18:21

greptile-apps Bot reviewed Jun 24, 2026

View reviewed changes

shehabyasser-scale reviewed Jun 30, 2026

View reviewed changes

shehabyasser-scale mentioned this pull request Jun 30, 2026

[Harbor 4/4] architecture docs, tutorial, and the GAIA example #6

Open

shehabyasser-scale reviewed Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A)#4

[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A)#4
varunursekar wants to merge 1 commit into
harbor-1-corefrom
harbor-2-sidecar

varunursekar commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		commit = experiment.run.candidate.commit
		dest = self.agent_volume / "results" / f"{split}__{commit[:12]}"

Uh oh!

Conversation

varunursekar commented Jun 24, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

T-Rex Logs

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

varunursekar commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading