feat(evals): add OdysseysBench agent benchmark#2275

Open
miguelg719 wants to merge 5 commits into main from miguelgonzalez/evals-odysseysbench

miguelg719 (Collaborator) commented Jun 25, 2026

Why

OdysseysBench is a 200-task web-agent benchmark (45 easy / 46 medium / 109 hard) where every task ships a weighted rubric (weights sum to 1.0). It slots naturally into the rubric-based verifier path — like WebTailBench — so we can score process and outcome against the published criteria instead of generating rubrics.

What Changed

  • Dataset (packages/evals/datasets/odysseysbench/): committed source snapshot (source/tasks.json, mirrored from https://odysseysbench.com/assets/data/tasks.json) plus the generated OdysseysBench_data.jsonl (200 rows).
  • Converter (scripts/build-odysseysbench-dataset.ts): deterministic transform of each task's rubrics map → the verifier's precomputed_rubric ({ items: [{ criterion, description, max_points }] }). Rubric weights scale to integer points (sum ≈ 100; scale is immaterial since the process score is a ratio). Run with --fetch to refresh the snapshot.
  • Suite (suites/odysseysbench.ts): buildOdysseysBenchTestcases, mirroring the WebTailBench suite. Env knobs: EVAL_ODYSSEYSBENCH_LIMIT (default 25), EVAL_ODYSSEYSBENCH_SAMPLE, EVAL_ODYSSEYSBENCH_LEVEL (easy/medium/hard filter), EVAL_ODYSSEYSBENCH_IDS.
  • Bench task (tasks/bench/agent/odysseysbench.ts): runs the agent through TrajectoryRecorder + V3Evaluator.verify() with the precomputed rubric.
  • Wiring: dataset fan-out in index.eval.ts (respects EVAL_DATASET=odysseysbench); external_agent_benchmarks category override in taskConfig.ts and cli-legacy.ts.
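
The converter step described above can be sketched roughly as follows. This is a minimal illustration, not the script itself: the source-side field names (`requirement`, `verification`, `weight`), their mapping onto `criterion` and `description`, the `scale` parameter, and the clamp to 1 point are all assumptions made for the example.

```typescript
// Hypothetical source-side shape; the field names are assumptions.
interface SourceRubric {
  requirement: string;
  verification: string;
  weight: number; // fraction of the task score; weights sum to 1.0
}

// Target shape named in the PR description.
interface RubricItem {
  criterion: string;
  description: string;
  max_points: number;
}

interface PrecomputedRubric {
  items: RubricItem[];
}

// Convert one task's rubrics map into the verifier's precomputed_rubric shape.
// Object.values follows JavaScript's property order, so the same input
// always yields the same output.
function toPrecomputedRubric(
  rubrics: Record<string, SourceRubric>,
  scale: number,
): PrecomputedRubric {
  const items = Object.values(rubrics).map((r) => ({
    criterion: r.requirement,
    description: r.verification,
    // Assumption: clamp to 1 so a tiny weight never rounds to zero points.
    max_points: Math.max(1, Math.round(r.weight * scale)),
  }));
  return { items };
}

const example = toPrecomputedRubric(
  {
    r1: {
      requirement: "Opens the product page",
      verification: "URL matches the product",
      weight: 0.25,
    },
    r2: {
      requirement: "Adds the item to the cart",
      verification: "Cart shows one item",
      weight: 0.75,
    },
  },
  100,
);
// example.items[0].max_points is 25 and example.items[1].max_points is 75.
```

Because the process score is a ratio of earned to available points, only the relative sizes of `max_points` matter, which is why the scale factor is left as a parameter here.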

How to run

pnpm evals --eval-name agent/odysseysbench
EVAL_ODYSSEYSBENCH_LEVEL=hard EVAL_ODYSSEYSBENCH_LIMIT=10 pnpm evals --eval-name agent/odysseysbench

Tests

  • pnpm --filter @browserbasehq/stagehand-evals run typecheck — clean
  • prettier --check on all changed files — clean
  • Dataset fidelity: 200/200 rows; instructions, websites, levels match source; rubric counts + order preserved; all max_points ≥ 1; task_ids unique.
  • Discovery smoke: agent/odysseysbench registers under external_agent_benchmarks; suite builds testcases with rubric attached; EVAL_ODYSSEYSBENCH_LEVEL=hard returns exactly 109 tasks.

Summary by cubic

Adds OdysseysBench as a built-in agent benchmark with precomputed rubrics for outcome and process scoring across 200 web tasks. Tightens env parsing and rubric validation to preserve scoring fidelity and avoid sampling bypasses; fully wires modern CLI and external harness support under external_agent_benchmarks.

  • New Features

    • Dataset: committed source snapshot and generated OdysseysBench_data.jsonl; task rubrics converted to verifier precomputed_rubric.
    • Converter: packages/evals/scripts/build-odysseysbench-dataset.ts (deterministic; --fetch refreshes upstream).
    • Suite: packages/evals/suites/odysseysbench.ts with limit/sample/level/ids knobs.
    • Bench task: packages/evals/tasks/bench/agent/odysseysbench.ts via TrajectoryRecorder + V3Evaluator.verify(); success mode via EVAL_SUCCESS_MODE (outcome|process|both).
    • Wiring: dataset fan-out in packages/evals/index.eval.ts; category override to external_agent_benchmarks; run with pnpm evals --eval-name agent/odysseysbench.
  • Bug Fixes

    • Legacy CLI: register in packages/evals/evals.config.json and packages/evals/cli-legacy.ts so b:odysseysbench resolves.
    • Modern CLI: register in packages/evals/tui/commands/parse.ts and packages/evals/framework/benchPlanner.ts, and add packages/evals/framework/externalHarnessPlan.ts support so b:odysseysbench runs and external harnesses get instruction/startUrl; ensure discovery lists under external_agent_benchmarks.
    • Suite: sanitize EVAL_MAX_K/EVAL_ODYSSEYSBENCH_LIMIT/EVAL_ODYSSEYSBENCH_SAMPLE to prevent NaN from bypassing caps.
    • Bench task: hard-fail if a task is missing precomputed_rubric.
    • Rubric points: scale weights x1000 to avoid rounding distortion of small criteria.
    • Converter: validate task_id/confirmed_task; validate each rubric item (non-empty fields; weight in (0,1]); ensure weights sum to ~1.0; assert row count.
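
The per-item and aggregate rubric checks listed in the summary could look something like the sketch below. The function name, the field names, and the tolerance value are assumptions for illustration; the PR only states that fields must be non-empty, each weight must lie in (0,1], and the weights must sum to roughly 1.0.

```typescript
// Hypothetical source-side shape; the field names are assumptions.
interface SourceRubric {
  requirement: string;
  verification: string;
  weight: number;
}

// Throws on the first invalid entry so a bad upstream refresh fails loudly.
function validateRubrics(
  taskId: string,
  rubrics: SourceRubric[],
  tolerance = 1e-6, // assumed tolerance for the "~1.0" sum check
): void {
  let sum = 0;
  rubrics.forEach((r, i) => {
    if (r.requirement.trim() === "" || r.verification.trim() === "") {
      throw new Error(`${taskId}: rubric ${i} has an empty field`);
    }
    if (!Number.isFinite(r.weight) || r.weight <= 0 || r.weight > 1) {
      throw new Error(`${taskId}: rubric ${i} weight must be in (0, 1]`);
    }
    sum += r.weight;
  });
  // Floating-point sums such as 0.3 + 0.7 are not exactly 1, hence the tolerance.
  if (Math.abs(sum - 1) > tolerance) {
    throw new Error(`${taskId}: rubric weights sum to ${sum}, expected ~1.0`);
  }
}

// Small helper for the usage examples: true when the callback throws.
function rejects(fn: () => void): boolean {
  try {
    fn();
    return false;
  } catch {
    return true;
  }
}

const good: SourceRubric[] = [
  { requirement: "a", verification: "check a", weight: 0.3 },
  { requirement: "b", verification: "check b", weight: 0.7 },
];
const zeroWeight: SourceRubric[] = [
  { requirement: "a", verification: "check a", weight: 0 },
  { requirement: "b", verification: "check b", weight: 1 },
];
const shortSum: SourceRubric[] = [
  { requirement: "a", verification: "check a", weight: 0.4 },
  { requirement: "b", verification: "check b", weight: 0.5 },
];
```

Checking each weight individually matters because an out-of-range weight can be offset by another entry and still leave the aggregate sum looking correct.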

Written for commit 29ccd1a. Summary will update on new commits.


OdysseysBench (https://odysseysbench.com) is a 200-task web-agent benchmark
(45 easy / 46 medium / 109 hard). Each task ships a weighted rubric whose
weights sum to 1.0; build-odysseysbench-dataset.ts converts those into the
verifier's precomputed_rubric format so process + outcome scoring use the
published criteria directly (no rubric generation).

- datasets/odysseysbench: committed source snapshot + generated JSONL
- scripts/build-odysseysbench-dataset.ts: deterministic converter (--fetch to refresh)
- suites/odysseysbench.ts: testcase builder with limit/sample/level/ids knobs
- tasks/bench/agent/odysseysbench.ts: bench task via TrajectoryRecorder + verifier
- index.eval.ts / taskConfig.ts / cli-legacy.ts: dataset fan-out + category wiring

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

changeset-bot Bot commented Jun 25, 2026

🦋 Changeset detected

Latest commit: 29ccd1a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@browserbasehq/stagehand-evals Minor


cubic-dev-ai Bot (Contributor) left a comment

3 issues found across 9 files

Confidence score: 3/5

  • packages/evals/tasks/bench/agent/odysseysbench.ts currently falls back to a generated rubric when precomputed_rubric is missing/invalid, which can silently change scoring behavior and undermine benchmark fidelity in published results. Treat missing rubric data as a hard failure for OdysseysBench cases before merging.
  • packages/evals/suites/odysseysbench.ts parses env limits/samples without validating numeric values, so NaN can slip through and bypass intended caps, leading to accidental full-dataset runs and unpredictable runtime/cost. Guard these inputs to finite positive integers before they reach sampling.
  • packages/evals/scripts/build-odysseysbench-dataset.ts uses path.join for repo-internal dataset paths, which can emit Windows backslashes and break the repo’s forward-slash path convention. Normalize to '/'-style paths in generated metadata/scripts to avoid cross-platform inconsistencies.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/scripts/build-odysseysbench-dataset.ts">

<violation number="1" location="packages/evals/scripts/build-odysseysbench-dataset.ts:27">
P3: Uses path.join for repo-internal dataset paths; this emits backslashes on Windows and violates the repo’s '/' path convention.

(Based on your team's feedback about forward-slash path separators.)</violation>
</file>

<file name="packages/evals/suites/odysseysbench.ts">

<violation number="1" location="packages/evals/suites/odysseysbench.ts:36">
P2: Unvalidated numeric env parsing can turn limit/sample into `NaN`, causing limit bypass and unexpected full-dataset runs. Sanitize env values to finite positive integers before passing to sampling.</violation>
</file>

<file name="packages/evals/tasks/bench/agent/odysseysbench.ts">

<violation number="1" location="packages/evals/tasks/bench/agent/odysseysbench.ts:62">
P2: Missing/invalid `precomputed_rubric` silently triggers generated-rubric fallback, which breaks benchmark fidelity. Reject the case when OdysseysBench rubric data is absent instead of continuing.</violation>
</file>
Architecture diagram
sequenceDiagram
    participant CLI as Eval CLI
    participant Suite as buildOdysseysBenchTestcases
    participant Dataset as OdysseysBench_data.jsonl
    participant Task as agent/odysseysbench task
    participant Verifier as runWithVerifier
    participant Agent as Stagehand Agent
    participant Browser as Browser Page

    Note over CLI,Browser: OdysseysBench Evaluation Flow

    CLI->>Suite: EVAL_DATASET=odysseysbench
    CLI->>Suite: env knobs (LIMIT, SAMPLE, LEVEL, IDS)

    Suite->>Dataset: readJsonlFile()
    Dataset-->>Suite: 200 JSONL rows with precomputed_rubric

    alt EVAL_ODYSSEYSBENCH_IDS set
        Suite->>Suite: Filter by explicit task_ids
    else EVAL_ODYSSEYSBENCH_LEVEL set
        Suite->>Suite: Filter by difficulty level
        Suite->>Suite: applySampling()
    else default
        Suite->>Suite: applySampling() limit=25
    end
    Suite->>Suite: normalizeAgentModelEntries()

    loop For each model × task combination
        Suite-->>Task: Testcase { input, params, metadata }
    end

    CLI->>Task: Execute test case
    Task->>Task: Validate confirmed_task param

    Task->>Browser: page.goto(startUrl)
    Browser-->>Task: Page loaded

    Task->>Agent: agent({ mode, model, systemPrompt })
    Task->>Task: Build TaskSpec with precomputedRubric

    Task->>Verifier: runWithVerifier({ agent, taskSpec })
    Verifier->>Agent: Execute agent on task
    Agent->>Browser: Navigate & interact
    Browser-->>Agent: Page state

    Agent-->>Verifier: Trajectory + results
    Verifier->>Verifier: V3Evaluator.verify() with rubric

    alt EVAL_SUCCESS_MODE=outcome
        Verifier-->>Task: outcomeSuccess
    else EVAL_SUCCESS_MODE=process
        Verifier-->>Task: processScore
    else EVAL_SUCCESS_MODE=both
        Verifier-->>Task: both scores
    end

    Task->>Task: evaluationResultToSuccess()
    Task-->>CLI: { _success, scores, trajectoryDir, logs }
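
The selection branch in the diagram (explicit IDs, then level filter, then default sampling) can be sketched as a small pure function. This is an assumption-heavy illustration: `selectRows` and `Knobs` are hypothetical names, `slice` stands in for `applySampling()`, and the PR does not state whether the default limit of 25 also applies after a level filter, so this sketch simply applies it.

```typescript
type Level = "easy" | "medium" | "hard";

interface Row {
  task_id: string;
  level: Level;
}

// Hypothetical bundle of the parsed env knobs.
interface Knobs {
  ids?: string[];
  level?: Level;
  limit?: number;
}

function selectRows(rows: Row[], knobs: Knobs): Row[] {
  // Explicit task IDs win and skip sampling entirely.
  if (knobs.ids && knobs.ids.length > 0) {
    const wanted = new Set(knobs.ids);
    return rows.filter((r) => wanted.has(r.task_id));
  }
  // Otherwise optionally narrow by difficulty, then cap the pool.
  const pool = knobs.level
    ? rows.filter((r) => r.level === knobs.level)
    : rows;
  const limit = knobs.limit ?? 25;
  return pool.slice(0, limit); // stand-in for applySampling()
}

const sampleRows: Row[] = [
  { task_id: "a", level: "easy" },
  { task_id: "b", level: "hard" },
  { task_id: "c", level: "hard" },
  { task_id: "d", level: "medium" },
];

const byIds = selectRows(sampleRows, { ids: ["c", "a"] });
const byLevel = selectRows(sampleRows, { level: "hard" });
const capped = selectRows(sampleRows, { limit: 1 });
```

Filtering keeps dataset order, so `byIds` lists task `a` before task `c` even though the IDs were requested in the opposite order.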

Comment thread packages/evals/suites/odysseysbench.ts Outdated
Comment thread packages/evals/tasks/bench/agent/odysseysbench.ts Outdated

const SOURCE_URL = "https://odysseysbench.com/assets/data/tasks.json";

const DATASET_DIR = path.join(

cubic-dev-ai Bot (Contributor) commented Jun 25, 2026

P3: Uses path.join for repo-internal dataset paths; this emits backslashes on Windows and violates the repo’s '/' path convention.

(Based on your team's feedback about forward-slash path separators.)

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/scripts/build-odysseysbench-dataset.ts, line 27:

<comment>Uses path.join for repo-internal dataset paths; this emits backslashes on Windows and violates the repo’s '/' path convention.

(Based on your team's feedback about forward-slash path separators.)</comment>

<file context>
@@ -0,0 +1,151 @@
+
+const SOURCE_URL = "https://odysseysbench.com/assets/data/tasks.json";
+
+const DATASET_DIR = path.join(
+  path.resolve(import.meta.dirname, ".."),
+  "datasets",
</file context>

miguelg719 (Collaborator, Author) replied:

Declining this one: path.join is the correct choice for runtime filesystem paths (it emits the OS-native separator, which is what fs wants on Windows), and it matches the sibling dev script scripts/backfill-webtailbench-rubrics.ts, which also uses path.join for its dataset paths. The forward-slash convention applies to in-code/URL/import paths; the suite loader that builds an embedded dataset path does use /. Keeping path.join here for consistency with the existing converter script.
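
The distinction drawn in this reply can be seen with Node's own `path` module, which exposes both separator flavors on any OS. The snippet is only an illustration of the two behaviors; `toForwardSlashes` is a hypothetical helper, not something in this PR.

```typescript
import * as path from "node:path";

// OS-native join: the form fs calls expect at runtime.
const nativeDir = path.join("datasets", "odysseysbench");

// Explicit flavors, available regardless of the host OS.
const windowsStyle = path.win32.join("datasets", "odysseysbench"); // "datasets\\odysseysbench"
const posixStyle = path.posix.join("datasets", "odysseysbench"); // "datasets/odysseysbench"

// Hypothetical helper: normalize a native path before embedding it in
// metadata or logs, where a forward-slash convention applies.
function toForwardSlashes(p: string): string {
  return p.split(path.sep).join("/");
}

const embedded = toForwardSlashes(nativeDir); // "datasets/odysseysbench" on any OS
```

Keeping the native form for filesystem access and normalizing only at the point where a path is written into generated output would satisfy both the runtime and the convention.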

cubic-dev-ai Bot (Contributor) replied:

Thanks for the feedback. This comment was influenced by this learning. Open the link to edit it, or reply here to edit or delete it.

- Complete legacy CLI wiring: register odysseysbench in evals.config.json
  benchmarks + cli-legacy benchmarkMap so `b:odysseysbench` resolves instead
  of erroring 'Unknown benchmark' (CATEGORY_OVERRIDES alone left it half-wired).
- Rubric points: scale weights x1000 (was x100) so rounding no longer distorts
  the relative weighting of small criteria; per-task share error is now 0.
- Converter: validate task_id/confirmed_task and that rubric weights sum to ~1.0,
  and assert row count, so a re-fetched upstream change fails loud instead of
  silently dropping or mis-weighting tasks.
- Drop dead `key` param in toRubricItem; tighten task `level` to the union type.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
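
A small worked example shows why the scale factor in the commit above matters. The weights are invented for illustration, and the zero-error result at ×1000 holds for them because each has at most three decimal places; it is not a general guarantee.

```typescript
// Scale fractional weights to integer points.
function toPoints(weights: number[], scale: number): number[] {
  return weights.map((w) => Math.round(w * scale));
}

// Largest gap between a criterion's intended weight and its share of total points.
function maxShareError(weights: number[], points: number[]): number {
  const total = points.reduce((a, b) => a + b, 0);
  return Math.max(
    ...weights.map((w, i) => Math.abs((points[i] ?? 0) / total - w)),
  );
}

const weights = [0.125, 0.125, 0.75]; // illustrative weights, sum to 1.0

const at100 = toPoints(weights, 100); // [13, 13, 75], total 101
const at1000 = toPoints(weights, 1000); // [125, 125, 750], total 1000

const errorAt100 = maxShareError(weights, at100); // greater than 0
const errorAt1000 = maxShareError(weights, at1000); // exactly 0
```

At ×100 the two small criteria each round up from 12.5 to 13, which inflates their share and shrinks the large criterion's share; at ×1000 every weight maps to an exact integer.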

cubic-dev-ai Bot (Contributor) left a comment

1 issue found across 5 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/scripts/build-odysseysbench-dataset.ts">

<violation number="1" location="packages/evals/scripts/build-odysseysbench-dataset.ts:27">
P3: Uses path.join for repo-internal dataset paths; this emits backslashes on Windows and violates the repo’s '/' path convention.

(Based on your team's feedback about forward-slash path separators.)</violation>
</file>


Comment thread packages/evals/scripts/build-odysseysbench-dataset.ts
miguelg719 and others added 3 commits June 25, 2026 12:12
- suites/odysseysbench.ts: sanitize EVAL_MAX_K/LIMIT/SAMPLE via parsePositiveIntEnv
  so a non-numeric value no longer becomes NaN and bypasses the sampling cap.
- tasks/bench/agent/odysseysbench.ts: hard-fail a case whose precomputed_rubric is
  missing instead of silently falling back to a generated rubric (benchmark fidelity).
- scripts/build-odysseysbench-dataset.ts: validate each rubric entry individually
  (non-empty requirement/verification, weight finite in (0,1]) in addition to the
  aggregate sum, so a bad individual weight can't slip through.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
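
The env sanitizing described in the commit above might be sketched as follows. The signature is an assumption: the real `parsePositiveIntEnv` is not shown in this conversation, so this stand-in takes the raw string and a fallback rather than reading the environment itself. The `capSkipped` line illustrates one plausible way a NaN limit defeats a cap, not necessarily the exact code path in the suite.

```typescript
// Hypothetical stand-in for parsePositiveIntEnv: returns the fallback unless
// the raw value is a finite positive integer.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return Number.isInteger(n) && n > 0 ? n : fallback;
}

// Why NaN is dangerous: every comparison against NaN is false, so a guard
// such as `if (rows.length > limit)` never fires and the cap is skipped.
const unsafeLimit = Number("ten"); // NaN
const capSkipped = !(200 > unsafeLimit); // true

const fromGarbage = parsePositiveInt("ten", 25); // 25
const fromValid = parsePositiveInt("10", 25); // 10
const fromNegative = parsePositiveInt("-3", 25); // 25
const fromFraction = parsePositiveInt("2.5", 25); // 25
const fromUnset = parsePositiveInt(undefined, 25); // 25
```

Passing the raw string in, for example `parsePositiveInt(process.env.EVAL_ODYSSEYSBENCH_LIMIT, 25)`, keeps the parser pure and easy to unit test.
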
framework/discovery.ts has its own CATEGORY_OVERRIDES (used by the cli list/run
path, separate from taskConfig.ts); without this entry agent/odysseysbench fell
into the plain 'agent' category in the modern CLI instead of external_agent_benchmarks.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ss planner

The modern 'evals run' path uses its own registries, separate from taskConfig /
index.eval.ts: parse.ts SUPPORTED_BENCHMARKS (b:/benchmark: resolver),
benchPlanner suiteMap (testcase fan-out), and externalHarnessPlan (claude_code/
codex instruction+startUrl extraction). Without these, 'run b:odysseysbench' (or
--harness claude_code/codex) errored 'Unknown benchmark'. Adds odysseysbench to
all three for parity with webtailbench.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cubic-dev-ai Bot (Contributor) left a comment

1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/framework/externalHarnessPlan.ts">

<violation number="1" location="packages/evals/framework/externalHarnessPlan.ts:71">
P3: New OdysseysBench planner path lacks focused unit coverage. Add tests for success mapping and missing confirmed_task failure to lock behavior.</violation>
</file>


};
}

if (input.name === "agent/odysseysbench") {

cubic-dev-ai Bot (Contributor) commented Jun 25, 2026

P3: New OdysseysBench planner path lacks focused unit coverage. Add tests for success mapping and missing confirmed_task failure to lock behavior.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/framework/externalHarnessPlan.ts, line 71:

<comment>New OdysseysBench planner path lacks focused unit coverage. Add tests for success mapping and missing confirmed_task failure to lock behavior.</comment>

<file context>
@@ -68,7 +68,22 @@ export function buildExternalHarnessTaskPlan(
     };
   }
 
+  if (input.name === "agent/odysseysbench") {
+    const instruction = readString(params, "confirmed_task");
+    if (!instruction) {
</file context>
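
The two behaviors this comment asks to lock down (a successful mapping and a failure when `confirmed_task` is missing) could be exercised along these lines. Everything here is a hypothetical stand-in rather than the real planner: `planOdysseysBench`, this version of `readString`, the `website` param key for the start URL, and the choice to signal failure by throwing are all assumptions, since the diff excerpt is cut off before the failure branch.

```typescript
interface HarnessPlan {
  instruction: string;
  startUrl: string | undefined;
}

// Hypothetical stand-in for the params lookup seen in the diff excerpt.
function readString(
  params: Record<string, unknown>,
  key: string,
): string | undefined {
  const value = params[key];
  return typeof value === "string" && value.trim() !== "" ? value : undefined;
}

// Hypothetical stand-in for the agent/odysseysbench branch; not the real planner.
function planOdysseysBench(params: Record<string, unknown>): HarnessPlan {
  const instruction = readString(params, "confirmed_task");
  if (!instruction) {
    // Assumption: a missing instruction surfaces as a thrown Error.
    throw new Error("agent/odysseysbench: missing confirmed_task");
  }
  // Assumption: the start URL comes from a `website` param.
  return { instruction, startUrl: readString(params, "website") };
}

// True when the callback throws.
function expectThrows(fn: () => unknown): boolean {
  try {
    fn();
    return false;
  } catch {
    return true;
  }
}

const okPlan = planOdysseysBench({
  confirmed_task: "Find the cheapest flight",
  website: "https://example.com",
});
const missingFails = expectThrows(() =>
  planOdysseysBench({ website: "https://example.com" }),
);
```

In the repository these checks would presumably live in the package's existing test runner against `buildExternalHarnessTaskPlan` itself.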
