
[Harbor 4/4] architecture docs, tutorial, and the GAIA example #6

Open
varunursekar wants to merge 1 commit into harbor-3-compiler from harbor-4-docs

@varunursekar varunursekar commented Jun 24, 2026


Draft · Stack 4 of 4 — targets harbor-3-compiler. Additive, low-risk.

  • docs/harbor/architecture.md — what it is, the compiled-task topology, the two modes, the component map, and the leaderboard-integrity model.
  • docs/harbor/tutorial.md — build + run end to end (both modes, the agent-side protocol); README Harbor section.
  • examples/gaia-optimization — a Mode-B example optimizing a GaiaAgent (thin Terminus2 subclass with an editable prompt) on gaia/gaia via a nested harbor run on Modal.

Start here for the big picture, then dive into [1/4]–[3/4].

Stack: [1/4] core → [2/4] sidecar → [3/4] compiler → this.

🤖 Generated with Claude Code

Greptile Summary

This PR is the final stack entry (4/4) for the Harbor integration, adding architecture docs, a tutorial, and a runnable Mode-B example (gaia-optimization) that optimizes a GaiaAgent prompt against real GAIA tasks via a nested harbor run on Modal.

  • Documentation (docs/harbor/architecture.md, docs/harbor/tutorial.md): covers the compiled-task topology, the two evaluation modes (A = vero scores, B = nested Harbor run scores), the trust boundary / leaderboard-integrity model, and the full CLI walkthrough end to end.
  • GAIA example (examples/gaia-optimization): a thin Terminus2 subclass that redirects the prompt-template path to an editable prompts/ directory, a build.yaml wiring up Mode B on Modal, and the copied prompt templates that form the optimization surface.
  • README update: adds a Harbor integration section with a quick-start snippet and links to the new docs.

Confidence Score: 4/5

Entirely additive — new docs and an example package with no changes to vero core; safe to merge.

The changes are documentation and an example that adds no new runtime paths to vero itself. The two issues found are minor: typos in the XML prompt template (which is the optimization surface — an optimizer would fix them during a run anyway) and a potential @staticmethod vs instance-method mismatch on version() in GaiaAgent that could surface only if Harbor calls GaiaAgent.version() as a class-level static.

src/gaia_agent/agent.py and src/gaia_agent/prompts/terminus-xml-plain.txt have the two flagged issues; all other files are clean.

Important Files Changed

| Filename | Overview |
| --- | --- |
| vero/README.md | Adds a Harbor integration section with install snippet and links to docs/examples; purely additive and accurate. |
| vero/docs/harbor/architecture.md | New architecture doc covering the compiled-task topology, two modes (A/B), trust boundary, and component map; thorough and internally consistent. |
| vero/docs/harbor/tutorial.md | New tutorial doc showing both modes end to end, the agent-side protocol, and how to inspect runs; matches the architecture doc. |
| vero/examples/gaia-optimization/README.md | Clear example README covering prerequisites, run instructions, caveats, and attribution for the copied prompt files. |
| vero/examples/gaia-optimization/build.yaml | Mode-B build config with sensible budget (3 train evals), correct split visibility tiers, and well-commented placeholder task IDs. |
| vero/examples/gaia-optimization/pyproject.toml | Minimal package config; force-include correctly bundles the editable prompts directory into the wheel so agent.py's file-relative path resolves after install. |
| vero/examples/gaia-optimization/src/gaia_agent/agent.py | Thin Terminus2 subclass redirecting the prompt-template path; accesses private _parser_name and has a version()/name() static/instance inconsistency worth aligning with the base class. |
| vero/examples/gaia-optimization/src/gaia_agent/prompts/terminus-json-plain.txt | JSON prompt template copied from Harbor's terminus_2; correctly uses double-braced literals and single-braced format variables; no issues. |
| vero/examples/gaia-optimization/src/gaia_agent/prompts/terminus-xml-plain.txt | XML prompt template with two typos ("apprpriate" and "In is always possible") that will appear verbatim in every agent call. |
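The "double-braced literals and single-braced format variables" remark for the JSON template can be illustrated with a minimal sketch. It assumes the template is rendered with Python's `str.format`, which the brace convention suggests but the summary does not confirm; the template text and the `instruction` variable name are made up for illustration.

```python
# Hypothetical template text, not the real terminus-json-plain.txt content.
# "{{" and "}}" are escapes for literal braces; "{instruction}" is a format variable.
template = 'Respond with JSON like {{"commands": []}}.\nTask: {instruction}'

# str.format collapses the doubled braces and substitutes the named variable.
rendered = template.format(instruction="List the files in /tmp")

print(rendered)
```

Under that assumption, a single-braced JSON literal in the template would be parsed as a format field and raise an error at render time, which is why the doubled braces matter in an editable prompt.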

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Dev as Developer
    participant VeroHarbor as vero harbor CLI
    participant Main as main container (optimizer)
    participant Sidecar as eval-sidecar (vero harbor serve)
    participant Modal as Modal (nested harbor run)
    participant Verifier as tests/test.sh (shared verifier)

    Dev->>VeroHarbor: vero harbor build -c build.yaml -o /tmp/task
    VeroHarbor-->>Dev: Harbor task dir (compose + Dockerfiles + instruction.md)

    Dev->>Main: harbor run -p /tmp/task -a claude-code -e docker
    activate Main
    activate Sidecar
    Note over Sidecar: vero harbor serve starts, writes per-trial admin token (root:600)

    Main->>Main: optimizer edits prompts/, commits
    Main->>Sidecar: "POST /eval?split=train"
    Sidecar->>Sidecar: git fetch commit (file://, hooks disabled)
    Sidecar->>Modal: harbor run GaiaAgent on train tasks
    Modal-->>Sidecar: per-task verifier rewards
    Sidecar-->>Main: aggregate score + remaining budget

    Note over Main: repeat edits + evals within budget

    Main->>Verifier: trial end — tests/test.sh runs
    Verifier->>Sidecar: POST /finalize (admin token)
    Sidecar->>Sidecar: select best train commit
    Sidecar->>Modal: harbor run on hidden validation tasks
    Modal-->>Sidecar: accuracy
    Sidecar-->>Verifier: reward.json
    deactivate Sidecar
    deactivate Main
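The agent-side loop in the diagram (edit, commit, evaluate on train, repeat within budget) can be sketched roughly as follows. This is an illustration only: the `evaluate` callable stands in for the `POST /eval?split=train` call, and the response fields `score` and `remaining_budget` are assumed names, not the sidecar's documented schema.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed response shape: {"score": float, "remaining_budget": float}.
EvalResult = Dict[str, float]


def optimize(
    candidates: List[str],
    evaluate: Callable[[str], EvalResult],
) -> Optional[Tuple[str, float]]:
    """Evaluate candidate commits in order and track the best train score.

    Stops as soon as the (assumed) remaining budget reaches zero.
    """
    best: Optional[Tuple[str, float]] = None
    for commit in candidates:
        result = evaluate(commit)  # stand-in for POST /eval?split=train
        if best is None or result["score"] > best[1]:
            best = (commit, result["score"])
        if result["remaining_budget"] <= 0:
            break  # budget exhausted; no further evals are possible
    return best
```

Injecting `evaluate` keeps the sketch runnable without a sidecar; a real client would replace it with an HTTP call against whatever address the compiled task exposes.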


Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
vero/examples/gaia-optimization/src/gaia_agent/prompts/terminus-xml-plain.txt:26-28
Two typos on this line: "apprpriate" should be "appropriate", and directly after on the next line "In is always possible" should be "It is always possible". Since this file is the LLM prompt template served verbatim to the inner GAIA agent, these errors appear in every agent turn and could subtly degrade instruction-following.

```suggestion
The `duration` attribute of <keystrokes> specifies the number of seconds to wait for the command to complete (default: 1.0) before the next command will be executed. On immediate tasks (e.g., cd, ls, echo, cat) set a duration of 0.1 seconds. On commands (e.g., gcc, find, rustc) set a duration of 1.0 seconds. On slow commands (e.g., make, python3 [long running script], wget [file]) set an appropriate duration as you determine necessary.

It is better to set a smaller duration than a longer duration. It is always possible to wait again if the prior output has not finished, by running <keystrokes duration="10.0"></keystrokes> on subsequent requests to wait longer. Never wait longer than 60 seconds; prefer to poll to see intermediate result status.
```

### Issue 2 of 2
vero/examples/gaia-optimization/src/gaia_agent/agent.py:30-31
`version()` signature mismatch with `name()`

`name()` is declared as a `@staticmethod` but `version()` is an instance method. If `Terminus2` defines `version()` as a `@staticmethod` (the typical pattern when `name()` is also static), then `GaiaAgent.version()` won't properly override it when called as `GaiaAgent.version()` on the class rather than on an instance — the base class static will shadow it. This is worth aligning with however `Terminus2` declares `version()`.

Reviews (1): Last reviewed commit: "Harbor: architecture docs, tutorial, and..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

- docs/harbor/architecture.md — what the integration is, the compiled-task topology,
  the two evaluation modes, the component map, and the leaderboard-integrity model.
- docs/harbor/tutorial.md — build and run an optimization task end to end (both modes,
  the agent-side protocol), and a Harbor section in the README.
- examples/gaia-optimization — a Mode-B example optimizing a GaiaAgent (a thin Terminus2
  subclass with an editable prompt) on gaia/gaia via a nested harbor run on Modal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@varunursekar varunursekar requested a review from a team June 24, 2026 18:18
@varunursekar varunursekar marked this pull request as ready for review June 24, 2026 18:22
Comment on lines +30 to +31
    def version(self) -> str:
        return "0.1.0"


P2 version() signature mismatch with name()

name() is declared as a @staticmethod but version() is an instance method. If Terminus2 defines version() as a @staticmethod (the typical pattern when name() is also static), then GaiaAgent.version() won't properly override it when called as GaiaAgent.version() on the class rather than on an instance — the base class static will shadow it. This is worth aligning with however Terminus2 declares version().


Collaborator

This one is a non-issue: the signatures already match the base class. In Terminus2 (and its BaseAgent ABC), the two methods are intentionally declared differently: name() is a @staticmethod @abstractmethod, while version() is a plain instance method (def version(self) -> str | None:, returning "2.0.0"). GaiaAgent mirrors that exactly: static name(), instance version(). So the override is correct and there's no shadowing.

The comment's premise ("if Terminus2 defines version() as a @staticmethod") doesn't hold against the installed harbor source (harbor/agents/terminus_2/terminus_2.py, lines 354-360, and harbor/agents/base.py, lines 75-81), so no change is needed here.
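The declaration pattern described in this reply can be reproduced with stand-in classes. The sketch below does not import Harbor; the `*Sketch` class names and the returned name strings are hypothetical, and only the static-`name()` / instance-`version()` shape and the two version strings are taken from the discussion.

```python
from abc import ABC, abstractmethod
from typing import Optional


class BaseAgentSketch(ABC):
    """Stand-in for the base ABC: static abstract name(), instance version()."""

    @staticmethod
    @abstractmethod
    def name() -> str:
        ...

    def version(self) -> Optional[str]:
        return None


class Terminus2Sketch(BaseAgentSketch):
    @staticmethod
    def name() -> str:
        return "terminus-2"  # hypothetical value

    def version(self) -> Optional[str]:
        return "2.0.0"


class GaiaAgentSketch(Terminus2Sketch):
    @staticmethod
    def name() -> str:
        return "gaia-agent"  # hypothetical value

    def version(self) -> Optional[str]:
        return "0.1.0"


# name() is callable on the class; version() is resolved on an instance.
print(GaiaAgentSketch.name(), GaiaAgentSketch().version())
```

With both levels declaring `version()` as an instance method, normal method resolution picks the subclass override, so nothing from the base class shadows it.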


This lets anyone optimize a coding agent with plain `harbor run`, and makes the result
leaderboard-gradeable — the optimizer cannot read hidden labels, modify the scorer, or
bypass its budget.

Collaborator

This reads as a hard guarantee ('the optimizer cannot read hidden labels, modify the scorer, or bypass its budget'), but the code makes each best-effort and the shipped GAIA example undercuts the first one (see the build.yaml comment). Suggest softening to something like: 'vero never writes per-sample labels to the agent's volume and meters every agent evaluation; OS-level mechanisms (read-only paths, a root:600 finalize token) keep the scorer and test split out of the agent's reach on a best-effort basis.'
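For context on the "root:600 finalize token" mentioned here, a token file restricted to its owner could be created as in the sketch below. This is not vero's implementation: `write_token` is a hypothetical helper, it assumes a POSIX system, and it only sets the mode bits. Ownership by root would follow from the process running as root, which the sketch does not enforce.

```python
import os
import secrets


def write_token(path: str) -> str:
    """Create `path` with mode 0600 and write a fresh random token into it."""
    token = secrets.token_hex(32)  # 32 random bytes, 64 hex characters
    # O_EXCL makes creation fail if the file already exists, so an existing
    # file (or a pre-planted symlink) is never reused.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.fchmod(fd, 0o600)  # set the mode explicitly, independent of umask
        os.write(fd, token.encode("ascii"))
    finally:
        os.close(fd)
    return token
```

As the comment says, this kind of control is best-effort: it keeps other users in the same container from reading the token but does nothing against a container escape.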


splits:
- { split: train, access: non_viewable } # optimizer sees aggregate scores only
- { split: validation, access: no_access } # hidden; never reaches the optimizer

Collaborator

'hidden; never reaches the optimizer' isn't true for this config: build.yaml is git-tracked and agent_repo is ., so vero harbor build seeds this whole file — including these validation task IDs — into /work/agent via git archive HEAD. The optimizer can read the held-out task IDs, and GAIA answers are public. Move the partition out of the agent_repo subtree, or caveat that for public benchmarks the held-out identity is visible (only per-sample scores are withheld). This is the example that backs the headline 'cannot read hidden labels' claim, so it is worth getting airtight.
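One way to catch this kind of leak before building is to scan the directory used as agent_repo for held-out task IDs. The sketch below is a hypothetical pre-build check, not part of vero: `find_id_leaks` is an invented helper, and it scans the working tree rather than the exact file set `git archive HEAD` would export, so treat it as an approximation.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def find_id_leaks(repo_dir: str, held_out_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Map each text file under repo_dir to the held-out IDs it mentions."""
    ids = list(held_out_ids)
    leaks: Dict[str, List[str]] = {}
    for path in sorted(Path(repo_dir).rglob("*")):
        # Skip directories and git internals.
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        hits = [task_id for task_id in ids if task_id in text]
        if hits:
            leaks[str(path.relative_to(repo_dir))] = hits
    return leaks
```

A non-empty result for the example as configured would flag build.yaml itself, which matches the concern raised in this comment.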

Comment thread vero/README.md

- [`docs/harbor/architecture.md`](docs/harbor/architecture.md) — what it is, the topology, and the leaderboard-integrity model.
- [`docs/harbor/tutorial.md`](docs/harbor/tutorial.md) — build and run a task end to end.
- [`examples/gsm8k-agent`](examples/gsm8k-agent) (Mode A) and [`examples/gaia-optimization`](examples/gaia-optimization) (Mode B).

Collaborator

examples/gsm8k-agent is cited as the Mode A example but it has no build.yaml (it's the older Policy-API example). The Harbor Mode A example that ships a build.yaml is examples/doubler-agent. Repoint here, or add a build.yaml to gsm8k-agent.

The optimizer is untrusted. Integrity rests on a few mechanisms, all best-effort at
the OS/process level (a container escape is out of scope):

- **3-tier split visibility** (`SplitAccessLevel`): `visible` (aggregate + per-sample

Collaborator

Worth one explicit line here: tier_for_split defaults any split not listed to viewable (full per-sample results), so omission fails open. Tell authors to list every split explicitly. (Pairs with the protocol.py fail-open comment on #4.)
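The fail-open concern can be made concrete with a small sketch of the alternative: a check that reports unlisted splits, plus a lookup that defaults to the most restrictive tier. Everything here is hypothetical. `AccessSketch` is a stand-in enum whose member names are assumed rather than copied from `SplitAccessLevel`, and neither function is vero's `tier_for_split`.

```python
from enum import Enum
from typing import Dict, Iterable, List


class AccessSketch(Enum):
    """Stand-in for the three visibility tiers; names are assumptions."""

    VIEWABLE = "viewable"
    NON_VIEWABLE = "non_viewable"
    NO_ACCESS = "no_access"


def unlisted_splits(
    configured: Dict[str, AccessSketch], dataset_splits: Iterable[str]
) -> List[str]:
    """Return dataset splits that the config does not mention explicitly."""
    return sorted(s for s in dataset_splits if s not in configured)


def tier_fail_closed(configured: Dict[str, AccessSketch], split: str) -> AccessSketch:
    """Look up a split's tier, treating omission as NO_ACCESS (fail closed)."""
    return configured.get(split, AccessSketch.NO_ACCESS)
```

Running `unlisted_splits` at build time and refusing to build on a non-empty result would turn an accidental omission into a visible error instead of full per-sample access.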

- **Commit transfer**: the sidecar `git fetch`es the agent's commit from the mounted
repo into its *own* repo with hooks disabled and `file://` (object copy, no
alternates), so the evaluated tree is fully owned by the sidecar and tamper-evident.
- **Protected scorer / write-access**: the scorer is sidecar-only; `read_only_paths`

Collaborator

'the scorer is sidecar-only' holds for Mode B but not Mode A, where the scorer lives in the agent's editable repo, protected only by chown root:root + chmod -R a-w on read_only_paths (which isn't a real tamper control — see #5). Recommend splitting this claim by mode.
