[Harbor 3/4] Mode B (nested harbor run) + the vero harbor build compiler by varunursekar · Pull Request #5 · scaleapi/vero

varunursekar · 2026-06-24T18:12:44Z

Draft · Stack 3 of 4 — targets harbor-2-sidecar.

Mode B (runner.py): HarborRunner, an EvalStrategy that — per candidate — runs a nested harbor run of the agent over the selected Harbor tasks (e.g. on Modal) and collates the verifier rewards into vero SampleResults. One Harbor task = one sample; inference is delegated, scoring comes from Harbor.
The compiler (build/): vero harbor build renders a BuildConfig into a runnable Harbor task — a Docker Compose env (optimizer workbench + eval sidecar + 3 volumes), two Dockerfiles, instruction.md, tests/test.sh, seed/solve scripts — baking the dataset/scorer/baseline repo and the sidecar's ServeConfig. Mode A and Mode B.

Review focus: the nested-run + collation, and the trust-boundary plumbing baked into the generated compose (volumes, root:600 token, secrets → sidecar only). Also un-ignores src/vero/harbor/build/ (the repo's build/ rule was hiding the package).

Stack: [1/4] core → [2/4] sidecar → this → [4/4] docs.

🤖 Generated with Claude Code

Greptile Summary

This PR adds Mode B nested-harbor-run evaluation (HarborRunner) and the vero harbor build compiler that renders a BuildConfig into a self-contained Harbor task directory (Docker Compose workbench + eval sidecar, Dockerfiles, scripts, and baked ServeConfig). The trust-boundary design is sound — secrets are sidecar-only, the admin token is root:600, and the optimizer runs as a de-privileged agent user.

runner.py: HarborRunner.produce_sample_results shells out to a nested harbor run, then collates per-trial result.json files into SampleResults; resume logic skips already-persisted samples.
compiler.py: compile_task materialises the baseline repo via git archive, registers the dataset into a baked VERO_HOME, emits a ServeConfig, and Jinja2-renders all environment files in one pass.
config.py: BuildConfig cleanly separates Mode A (dataset + scorer baked in) from Mode B (partition + inner Harbor task baked sidecar-only).

Confidence Score: 3/5

The trust-boundary plumbing (secrets sidecar-only, admin token root:600, agent de-privileged) is correctly wired. Three bugs need fixing before this should run in production: two in the runner and one in the compiler.

The runner silently ignores n_attempts/max_retries from HarborConfig, so retry behaviour is always the harbor default regardless of what the caller configures — directly affecting budget usage on live runs. _load_trials overwrites earlier results for the same task_name in filesystem-iteration order, meaning a successful retry can be replaced by a failed one, producing incorrect scores. In the compiler, git archive exit code is discarded: a failed archive populates an empty baseline directory whose SHA is recorded in the ServeConfig and baked into the image without any error signal.

Both runner.py (retry flag forwarding and collation deduplication) and compiler.py (git archive error handling and mkdtemp cleanup) need attention before this ships.

Important Files Changed

Filename	Overview
vero/src/vero/harbor/runner.py	New HarborRunner EvalStrategy for Mode B: silently drops n_attempts/max_retries from harbor run, and _load_trials has non-deterministic last-write-wins when harbor retries produce multiple result.json files for the same task_name.
vero/src/vero/harbor/build/compiler.py	New Harbor build compiler: git archive exit code is silently ignored in _prepare_baseline_repo (corrupt baseline possible), and mkdtemp() scratch dir is never deleted.
vero/src/vero/harbor/build/config.py	BuildConfig Pydantic model cleanly captures Mode A/B differences; relative-path resolution in from_file() is correct and portable.
vero/src/vero/harbor/build/templates/docker-compose.yaml.j2	Compose template correctly isolates secrets to eval-sidecar only; admin/token volumes never mounted to main; healthcheck properly gates depends_on.
vero/tests/test_harbor_build.py	Comprehensive build compiler tests covering structure, ServeConfig validation, file parsing, path rewriting, and shared baseline SHA.
vero/tests/test_harbor_runner.py	Good coverage for command building, reward extraction, collation, and resume logic; correctly tests that missing trials produce error samples.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant V as vero (HarborRunner)
    participant H as harbor run (nested)
    participant S as eval-sidecar (vero harbor serve)
    participant FS as jobs_dir (filesystem)

    V->>V: "_task_names_for(params) -> [(sample_id, task_name)]"
    V->>V: filter pending (skip existing SampleResults)
    V->>H: uv run --project worktree harbor run -i task0 -i task1 --jobs-dir ...
    H->>S: POST /eval (agent commit)
    S-->>H: reward JSON
    H->>FS: write jobs/ts/trial/result.json
    H-->>V: exit code (non-zero tolerated)
    V->>FS: "rglob result.json -> task_name to result_dict"
    V->>V: _sample_result() per (sample_id, task_name)
    V->>V: "save_sample_result() -> VERO_HOME/sessions/"

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant V as vero (HarborRunner)
    participant H as harbor run (nested)
    participant S as eval-sidecar (vero harbor serve)
    participant FS as jobs_dir (filesystem)

    V->>V: "_task_names_for(params) -> [(sample_id, task_name)]"
    V->>V: filter pending (skip existing SampleResults)
    V->>H: uv run --project worktree harbor run -i task0 -i task1 --jobs-dir ...
    H->>S: POST /eval (agent commit)
    S-->>H: reward JSON
    H->>FS: write jobs/ts/trial/result.json
    H-->>V: exit code (non-zero tolerated)
    V->>FS: "rglob result.json -> task_name to result_dict"
    V->>V: _sample_result() per (sample_id, task_name)
    V->>V: "save_sample_result() -> VERO_HOME/sessions/"

Comments Outside Diff (2)

vero/src/vero/harbor/runner.py, line 739-752 (link)

n_attempts and max_retries silently dropped from harbor run

HarborConfig exposes n_attempts and max_retries as typed fields (not extra_args pass-throughs), signalling that callers expect to configure retry behavior. Neither field is translated to a CLI flag in _build_command, so every harbor run invocation uses whatever the harbor defaults are regardless of what the caller configured. A user who sets max_retries=0 to disable retries on a budget-sensitive run will get the default retry behavior and may blow their budget.

Prompt To Fix With AI

This is a comment left during a code review.
Path: vero/src/vero/harbor/runner.py
Line: 739-752

Comment:
**`n_attempts` and `max_retries` silently dropped from harbor run**

`HarborConfig` exposes `n_attempts` and `max_retries` as typed fields (not `extra_args` pass-throughs), signalling that callers expect to configure retry behavior. Neither field is translated to a CLI flag in `_build_command`, so every `harbor run` invocation uses whatever the harbor defaults are regardless of what the caller configured. A user who sets `max_retries=0` to disable retries on a budget-sensitive run will get the default retry behavior and may blow their budget.

How can I resolve this? If you propose a fix, please make it concise.

vero/src/vero/harbor/runner.py, line 799-813 (link)

Last-write-wins on duplicate task_name when harbor retries a task

rglob("result.json") visits all result files across all timestamp directories under jobs_dir. If harbor's n_attempts or max_retries causes the same task to produce multiple trial entries with the same task_name, trials[task_name] = data simply overwrites with whichever file rglob happens to yield last — filesystem iteration order is not defined. This can cause a passing retry to be silently replaced by a failing one (or vice versa), yielding an incorrect score for that sample.

Prompt To Fix With AI

This is a comment left during a code review.
Path: vero/src/vero/harbor/runner.py
Line: 799-813

Comment:
**Last-write-wins on duplicate `task_name` when harbor retries a task**

`rglob("result.json")` visits all result files across all timestamp directories under `jobs_dir`. If harbor's `n_attempts` or `max_retries` causes the same task to produce multiple trial entries with the same `task_name`, `trials[task_name] = data` simply overwrites with whichever file `rglob` happens to yield last — filesystem iteration order is not defined. This can cause a passing retry to be silently replaced by a failing one (or vice versa), yielding an incorrect score for that sample.

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 4
vero/src/vero/harbor/runner.py:739-752
**`n_attempts` and `max_retries` silently dropped from harbor run**

`HarborConfig` exposes `n_attempts` and `max_retries` as typed fields (not `extra_args` pass-throughs), signalling that callers expect to configure retry behavior. Neither field is translated to a CLI flag in `_build_command`, so every `harbor run` invocation uses whatever the harbor defaults are regardless of what the caller configured. A user who sets `max_retries=0` to disable retries on a budget-sensitive run will get the default retry behavior and may blow their budget.

### Issue 2 of 4
vero/src/vero/harbor/build/compiler.py:120-130
**`git archive` exit code is silently discarded**

`archive.wait()` is called but its return value is not checked. If `git archive` fails (e.g., the `rel` subpath doesn't exist at `HEAD`, or the ref is invalid), `tar` receives an empty or truncated stream. Depending on the tar implementation, it may exit 0 with partial content, causing `_prepare_baseline_repo` to return a commit SHA from a near-empty directory and bake that corrupt baseline into both containers. Additionally, if `subprocess.run(tar, ..., check=True)` raises before reaching `archive.wait()`, the `git archive` subprocess is never reaped until GC collects the `Popen` object.

### Issue 3 of 4
vero/src/vero/harbor/runner.py:799-813
**Last-write-wins on duplicate `task_name` when harbor retries a task**

`rglob("result.json")` visits all result files across all timestamp directories under `jobs_dir`. If harbor's `n_attempts` or `max_retries` causes the same task to produce multiple trial entries with the same `task_name`, `trials[task_name] = data` simply overwrites with whichever file `rglob` happens to yield last — filesystem iteration order is not defined. This can cause a passing retry to be silently replaced by a failing one (or vice versa), yielding an incorrect score for that sample.

### Issue 4 of 4
vero/src/vero/harbor/build/compiler.py:237-239
**Temp directory created by `mkdtemp()` is never cleaned up**

The directory created by `tempfile.mkdtemp()` on line 239 is used only to stage the dataset before registration, but is never deleted — not on success and not on exception. Over repeated builds this accumulates large scratch directories (the dataset can be gigabytes). Wrapping `tmp` creation with `tempfile.TemporaryDirectory()` as a context manager would clean it up automatically.

_{Reviews (1): Last reviewed commit: "Harbor: Mode B (nested harbor run) + the..." | Re-trigger Greptile}

Greptile also left 2 inline comments on this PR.

- Mode B (runner.py): `HarborRunner`, an `EvalStrategy` that — for each candidate — runs a *nested* `harbor run` of the agent over the selected Harbor tasks (e.g. on Modal) and collates the verifier rewards into vero `SampleResult`s. One Harbor task = one sample; inference is fully delegated, scoring comes from Harbor's verifier. - The compiler (build/): `vero harbor build` renders a `BuildConfig` into a runnable Harbor task directory — a Docker Compose environment (optimizer workbench `main` + the eval sidecar + three volumes), two Dockerfiles, instruction.md, tests/test.sh, and the seed/solve scripts — baking the dataset/scorer/baseline repo and the sidecar's ServeConfig. Supports Mode A (local dataset/scorer) and Mode B (a registry or local Harbor benchmark, passed through to the HarborConfig). - `.gitignore`: un-ignore src/vero/harbor/build/ (the repo's `build/` rule was hiding the compiler package). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-06-24T18:32:31Z

+
+    sessions = vero_home / "sessions"
+    datasets = vero_home / "datasets"
+    (sessions / SESSION_ID).mkdir(parents=True, exist_ok=True)
+    datasets.mkdir(parents=True, exist_ok=True)
+    if not isinstance(dataset, str):  # a DatasetDict -> save_to_disk first
+        path = tmp / "ds"
+        dataset.save_to_disk(str(path))
+        dataset = str(path)
+    return resolve_and_save_dataset(dataset, sessions, datasets, SESSION_ID)
+


git archive exit code is silently discarded

archive.wait() is called but its return value is not checked. If git archive fails (e.g., the rel subpath doesn't exist at HEAD, or the ref is invalid), tar receives an empty or truncated stream. Depending on the tar implementation, it may exit 0 with partial content, causing _prepare_baseline_repo to return a commit SHA from a near-empty directory and bake that corrupt baseline into both containers. Additionally, if subprocess.run(tar, ..., check=True) raises before reaching archive.wait(), the git archive subprocess is never reaped until GC collects the Popen object.

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/build/compiler.py Line: 120-130 Comment: **`git archive` exit code is silently discarded** `archive.wait()` is called but its return value is not checked. If `git archive` fails (e.g., the `rel` subpath doesn't exist at `HEAD`, or the ref is invalid), `tar` receives an empty or truncated stream. Depending on the tar implementation, it may exit 0 with partial content, causing `_prepare_baseline_repo` to return a commit SHA from a near-empty directory and bake that corrupt baseline into both containers. Additionally, if `subprocess.run(tar, ..., check=True)` raises before reaching `archive.wait()`, the `git archive` subprocess is never reaped until GC collects the `Popen` object. How can I resolve this? If you propose a fix, please make it concise.

Agree, addressing this in the incoming fix. Both halves are real. archive.wait() is called but its returncode is discarded, so a failed git archive (e.g. rel not present at HEAD, or a detached/invalid ref) feeds tar a short stream and we'd commit a baseline over a near-empty tree. And because subprocess.run(["tar", ...], check=True) runs before archive.wait(), a tar failure raises with the git archive child still unreaped.

The fix checks the producer's exit code and guarantees reaping: wrap the pipeline so archive.wait() runs in a finally, and raise if archive.returncode != 0 (with stderr captured) before trusting the extracted tree. That turns a silent corrupt-baseline into a hard build failure, which is the right behavior for the compile step.

(Landed in #9.)

greptile-apps · 2026-06-24T18:32:33Z

+        name=config.name,
+        description=config.description,
+        mode=config.mode,


Temp directory created by mkdtemp() is never cleaned up

The directory created by tempfile.mkdtemp() on line 239 is used only to stage the dataset before registration, but is never deleted — not on success and not on exception. Over repeated builds this accumulates large scratch directories (the dataset can be gigabytes). Wrapping tmp creation with tempfile.TemporaryDirectory() as a context manager would clean it up automatically.

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/build/compiler.py Line: 237-239 Comment: **Temp directory created by `mkdtemp()` is never cleaned up** The directory created by `tempfile.mkdtemp()` on line 239 is used only to stage the dataset before registration, but is never deleted — not on success and not on exception. Over repeated builds this accumulates large scratch directories (the dataset can be gigabytes). Wrapping `tmp` creation with `tempfile.TemporaryDirectory()` as a context manager would clean it up automatically. How can I resolve this? If you propose a fix, please make it concise.

Agree, addressing this in the incoming fix. tmp = Path(tempfile.mkdtemp()) at line 208 is the staging dir for a DatasetDict.save_to_disk before resolve_and_save_dataset copies it into the baked VERO_HOME, and nothing ever removes it, so every compile leaks a scratch tree (and for Mode A datasets that can be gigabytes).

The fix scopes the staging dir to a with tempfile.TemporaryDirectory() as tmp: covering the dataset registration block, so it's cleaned up on both success and exception. The registered dataset already lives under env_dir/sidecar/vero_home/datasets by then, so tearing down the staging copy is safe.

(Minor note for the line ref: the leaked mkdtemp is created at line 208 and consumed by _register via tmp / "ds"; line 239 is inside the template-render block, but the underlying point about the un-cleaned staging dir is correct.)

(Landed in #9.)

shehabyasser-scale · 2026-06-30T08:34:12Z

+    # Execute
+    # ------------------------------------------------------------------
+
+    def _build_command(


HarborConfig.n_attempts / max_retries are typed fields, which signals they are honored, but _build_command never emits flags for them — so every nested run uses harbor defaults. On a budget-graded run that is a real cost/correctness bug. Either map them to the real harbor run flags here, or drop the fields. (Confirms Greptile's P1.)

shehabyasser-scale · 2026-06-30T08:34:19Z

+                continue
+            task_name = data.get("task_name")
+            if task_name:
+                trials[task_name] = data


trials[task_name] = data is last-write-wins over an unordered rglob; with retries or a reused jobs_dir, a failing attempt can clobber a passing one, feeding a wrong score into auto_best. Collect all matches and pick deterministically (latest mtime / highest attempt index), and/or namespace jobs_dir per attempt. (Confirms Greptile's P1.)

shehabyasser-scale · 2026-06-30T08:34:26Z

+        values = [float(v) for v in rewards.values()]
+        return sum(values) / len(values) if values else 0.0
+
+    def _existing(self, params: EvaluationParameters, sample_id: int) -> SampleResult | None:


_existing treats a persisted error sample as 'done', so a sample whose nested run failed once is permanently skipped on resume and silently degrades the candidate's score. Either don't persist error samples, or make the pending/skip predicate treat error results as not-done.

shehabyasser-scale · 2026-06-30T08:34:32Z

+            ["tar", "xf", "-", "--strip-components", str(strip)],
+            cwd=dest, stdin=archive.stdout, check=True,
+        )
+        archive.wait()


archive.wait()'s return code is discarded; if git archive fails, tar can still exit 0 on a truncated stream and you commit a near-empty baseline whose SHA gets baked into both images. Assert archive.wait() == 0 (and reap the Popen on the tar-failure path). (Confirms Greptile.)

shehabyasser-scale · 2026-06-30T08:34:39Z

+    environment:
+      VERO_HOME_DIR: "/opt/vero_home"
+{% for secret in secrets %}
+      {{ secret }}: "${{ '{' }}{{ secret }}{{ '}' }}"


This renders SECRET: "${SECRET}", resolved from the run host's env, not anything the compiler validates. An unset var interpolates to empty and the sidecar comes up credential-less, failing every eval opaquely. Consider failing the build / emitting a required-secrets manifest when a declared secret is missing. Note test_rendered_files_parse only asserts the literal ${...} string, which hides this.

shehabyasser-scale · 2026-06-30T08:35:52Z

+# ...except locked paths (e.g. the scorer): root-owned + unwritable.
+if [ -e "/work/agent/{{ p }}" ]; then
+  chown -R root:root "/work/agent/{{ p }}"
+  chmod -R a-w "/work/agent/{{ p }}"


This chown root:root + chmod -R a-w applies to the working tree under /work/agent, but the verifier never executes the working tree: _transfer_commit fetches git blobs and the evaluator checks the commit out into a fresh temp copy. The agent can git add -f a modified scorer and commit it regardless of working-tree perms. So read_only_paths is not a tamper control — please document it as advisory-only and move scorer-provenance enforcement into the sidecar (see the Mode A scorer-provenance comment on #4).

varunursekar mentioned this pull request Jun 24, 2026

Add Harbor integration: optimization-as-a-Harbor-task #2

Closed

varunursekar requested a review from a team June 24, 2026 18:15

varunursekar marked this pull request as ready for review June 24, 2026 18:22

greptile-apps Bot reviewed Jun 24, 2026

View reviewed changes

shehabyasser-scale mentioned this pull request Jun 30, 2026

[Harbor 2/4] eval sidecar: verifier, token auth, HTTP API (Mode A) #4

Open

shehabyasser-scale reviewed Jun 30, 2026

View reviewed changes

shehabyasser-scale mentioned this pull request Jun 30, 2026

[Harbor 4/4] architecture docs, tutorial, and the GAIA example #6

Open

shehabyasser-scale reviewed Jun 30, 2026

View reviewed changes

shehabyasser-scale mentioned this pull request Jun 30, 2026

fix(harbor): nested-run flags/dedup/resume, archive + secret guards [fixes 3/4 compiler] #9

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Harbor 3/4] Mode B (nested harbor run) + the vero harbor build compiler#5

[Harbor 3/4] Mode B (nested harbor run) + the vero harbor build compiler#5
varunursekar wants to merge 1 commit into
harbor-2-sidecarfrom
harbor-3-compiler

varunursekar commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

greptile-apps Bot Jun 24, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

shehabyasser-scale Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

varunursekar commented Jun 24, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (2)

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

varunursekar commented Jun 24, 2026 •

edited by greptile-apps Bot

Loading