feat(evals): add OdysseysBench agent benchmark by miguelg719 · Pull Request #2275 · browserbase/stagehand

miguelg719 · 2026-06-25T18:45:11Z

Why

OdysseysBench is a 200-task web-agent benchmark (45 easy / 46 medium / 109 hard) where every task ships a weighted rubric (weights sum to 1.0). It slots naturally into the rubric-based verifier path — like WebTailBench — so we can score process and outcome against the published criteria instead of generating rubrics.

What Changed

Dataset (packages/evals/datasets/odysseysbench/): committed source snapshot (source/tasks.json, mirrored from https://odysseysbench.com/assets/data/tasks.json) plus the generated OdysseysBench_data.jsonl (200 rows).
Converter (scripts/build-odysseysbench-dataset.ts): deterministic transform of each task's rubrics map → the verifier's precomputed_rubric ({ items: [{ criterion, description, max_points }] }). Rubric weights scale to integer points (sum ≈ 100 in the first commit, rescaled ×1000 in a follow-up; the absolute scale is immaterial since the process score is a ratio). Run with --fetch to refresh the snapshot.
Suite (suites/odysseysbench.ts): buildOdysseysBenchTestcases, mirroring the WebTailBench suite. Env knobs: EVAL_ODYSSEYSBENCH_LIMIT (default 25), EVAL_ODYSSEYSBENCH_SAMPLE, EVAL_ODYSSEYSBENCH_LEVEL (easy/medium/hard filter), EVAL_ODYSSEYSBENCH_IDS.
Bench task (tasks/bench/agent/odysseysbench.ts): runs the agent through TrajectoryRecorder + V3Evaluator.verify() with the precomputed rubric.
Wiring: dataset fan-out in index.eval.ts (respects EVAL_DATASET=odysseysbench); external_agent_benchmarks category override in taskConfig.ts and cli-legacy.ts.
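
As a rough illustration of the converter step listed above, the sketch below maps a task's rubrics map to the precomputed_rubric shape. The source entry field names (`requirement`, `verification`, `weight`) are taken from the later commit messages, but mapping them onto `criterion`/`description`, the `toPrecomputedRubric` name, the `POINT_SCALE` constant (1000, following the scaling fix in the summary), and the floor of 1 point are assumptions for illustration, not the code in build-odysseysbench-dataset.ts.

```typescript
// Hypothetical source shape; the real upstream schema may differ.
interface SourceRubricEntry {
  requirement: string;
  verification: string;
  weight: number; // fraction of the task score, expected in (0, 1]
}

interface RubricItem {
  criterion: string;
  description: string;
  max_points: number;
}

interface PrecomputedRubric {
  items: RubricItem[];
}

// Assumed scale factor: weights become integer points out of roughly this total.
const POINT_SCALE = 1000;

function toPrecomputedRubric(
  rubrics: Record<string, SourceRubricEntry>,
): PrecomputedRubric {
  // Object.values keeps insertion order for non-numeric string keys,
  // so the rubric order of the source map is preserved.
  const items = Object.values(rubrics).map((entry) => ({
    criterion: entry.requirement,
    description: entry.verification,
    // Floor at 1 so that no criterion ends up worth zero points.
    max_points: Math.max(1, Math.round(entry.weight * POINT_SCALE)),
  }));
  return { items };
}

// Made-up task rubric, for illustration only.
const example = toPrecomputedRubric({
  r1: {
    requirement: "Opens the search page",
    verification: "Trajectory shows the search page loaded",
    weight: 0.25,
  },
  r2: {
    requirement: "Reports the final price",
    verification: "Final answer includes a price",
    weight: 0.75,
  },
});
```

Because the process score is a ratio of earned to available points, only the relative sizes of `max_points` matter; the scale factor mainly controls how much rounding error the integers carry.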

How to run

pnpm evals --eval-name agent/odysseysbench
EVAL_ODYSSEYSBENCH_LEVEL=hard EVAL_ODYSSEYSBENCH_LIMIT=10 pnpm evals --eval-name agent/odysseysbench

Tests

pnpm --filter @browserbasehq/stagehand-evals run typecheck — clean
prettier --check on all changed files — clean
Dataset fidelity: 200/200 rows; instructions, websites, levels match source; rubric counts + order preserved; all max_points ≥ 1; task_ids unique.
Discovery smoke: agent/odysseysbench registers under external_agent_benchmarks; suite builds testcases with rubric attached; EVAL_ODYSSEYSBENCH_LEVEL=hard returns exactly 109 tasks.

Summary by cubic

Adds OdysseysBench as a built-in agent benchmark with precomputed rubrics for outcome and process scoring across 200 web tasks. Tightens env parsing and rubric validation to preserve scoring fidelity and avoid sampling bypasses; fully wires modern CLI and external harness support under external_agent_benchmarks.

New Features
- Dataset: committed source snapshot and generated OdysseysBench_data.jsonl; task rubrics converted to verifier precomputed_rubric.
- Converter: packages/evals/scripts/build-odysseysbench-dataset.ts (deterministic; --fetch refreshes upstream).
- Suite: packages/evals/suites/odysseysbench.ts with limit/sample/level/ids knobs.
- Bench task: packages/evals/tasks/bench/agent/odysseysbench.ts via TrajectoryRecorder + V3Evaluator.verify(); success mode via EVAL_SUCCESS_MODE (outcome|process|both).
- Wiring: dataset fan-out in packages/evals/index.eval.ts; category override to external_agent_benchmarks; run with pnpm evals --eval-name agent/odysseysbench.
Bug Fixes
- Legacy CLI: register in packages/evals/evals.config.json and packages/evals/cli-legacy.ts so b:odysseysbench resolves.
- Modern CLI: register in packages/evals/tui/commands/parse.ts and packages/evals/framework/benchPlanner.ts, and add packages/evals/framework/externalHarnessPlan.ts support so b:odysseysbench runs and external harnesses get instruction/startUrl; ensure discovery lists under external_agent_benchmarks.
- Suite: sanitize EVAL_MAX_K/EVAL_ODYSSEYSBENCH_LIMIT/EVAL_ODYSSEYSBENCH_SAMPLE to prevent NaN from bypassing caps.
- Bench task: hard-fail if a task is missing precomputed_rubric.
- Rubric points: scale weights x1000 to avoid rounding distortion of small criteria.
- Converter: validate task_id/confirmed_task; validate each rubric item (non-empty fields; weight in (0,1]); ensure weights sum to ~1.0; assert row count.
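
The converter checks in the last bullet could look roughly like the sketch below. It is a sketch under stated assumptions: `validateRubric`, the `RubricEntry` shape, and the `SUM_TOLERANCE` value are hypothetical, and the PR only says the weights must sum to "~1.0" without giving a tolerance.

```typescript
// Hypothetical entry shape, based on the field names in the commit messages.
interface RubricEntry {
  requirement: string;
  verification: string;
  weight: number;
}

// Assumed tolerance for "sums to ~1.0"; the real value is not stated in the PR.
const SUM_TOLERANCE = 1e-6;

function validateRubric(taskId: string, entries: RubricEntry[]): void {
  if (entries.length === 0) {
    throw new Error(`${taskId}: rubric is empty`);
  }
  let sum = 0;
  entries.forEach((entry, index) => {
    // Each item must carry non-empty text fields.
    if (entry.requirement.trim() === "" || entry.verification.trim() === "") {
      throw new Error(`${taskId}: rubric item ${index} has an empty field`);
    }
    // Each weight must be a finite number in the half-open interval (0, 1].
    if (!Number.isFinite(entry.weight) || entry.weight <= 0 || entry.weight > 1) {
      throw new Error(`${taskId}: rubric item ${index} weight is outside (0, 1]`);
    }
    sum += entry.weight;
  });
  // The aggregate check catches dropped or duplicated items.
  if (Math.abs(sum - 1) > SUM_TOLERANCE) {
    throw new Error(`${taskId}: rubric weights sum to ${sum}, expected ~1.0`);
  }
}
```

Checking each item as well as the sum matters because an invalid pair such as 1.2 and -0.2 still sums to 1.0.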

Written for commit 29ccd1a. Summary will update on new commits.

OdysseysBench (https://odysseysbench.com) is a 200-task web-agent benchmark (45 easy / 46 medium / 109 hard). Each task ships a weighted rubric whose weights sum to 1.0; build-odysseysbench-dataset.ts converts those into the verifier's precomputed_rubric format so process + outcome scoring use the published criteria directly (no rubric generation).

- datasets/odysseysbench: committed source snapshot + generated JSONL
- scripts/build-odysseysbench-dataset.ts: deterministic converter (--fetch to refresh)
- suites/odysseysbench.ts: testcase builder with limit/sample/level/ids knobs
- tasks/bench/agent/odysseysbench.ts: bench task via TrajectoryRecorder + verifier
- index.eval.ts / taskConfig.ts / cli-legacy.ts: dataset fan-out + category wiring

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

changeset-bot · 2026-06-25T18:45:20Z

🦋 Changeset detected

Latest commit: 29ccd1a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@browserbasehq/stagehand-evals	Minor


cubic-dev-ai

3 issues found across 9 files

Confidence score: 3/5

packages/evals/tasks/bench/agent/odysseysbench.ts currently falls back to a generated rubric when precomputed_rubric is missing/invalid, which can silently change scoring behavior and undermine benchmark fidelity in published results. Treat missing rubric data as a hard failure for OdysseysBench cases before merging.
packages/evals/suites/odysseysbench.ts parses env limits/samples without validating numeric values, so NaN can slip through and bypass intended caps, leading to accidental full-dataset runs and unpredictable runtime/cost. Guard these inputs to finite positive integers before they reach sampling.
packages/evals/scripts/build-odysseysbench-dataset.ts uses path.join for repo-internal dataset paths, which can emit Windows backslashes and break the repo’s forward-slash path convention. Normalize to '/'-style paths in generated metadata/scripts to avoid cross-platform inconsistencies.
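
To make the NaN concern in the second finding concrete, here is a minimal sketch. A later commit in this PR names a `parsePositiveIntEnv` helper, but the signature and fallback behavior below are assumptions, and `capUnsafe` is a hypothetical stand-in for whatever comparison the sampling code performs.

```typescript
type Env = Record<string, string | undefined>;

// Assumed behavior: return the fallback unless the value is a positive integer.
function parsePositiveIntEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  return Number.isInteger(value) && value > 0 ? value : fallback;
}

// Hypothetical cap: every comparison against NaN is false, so the guard never fires.
function capUnsafe<T>(rows: T[], limit: number): T[] {
  return rows.length > limit ? rows.slice(0, limit) : rows;
}

const rows = Array.from({ length: 200 }, (_, i) => i);

// A typo such as EVAL_ODYSSEYSBENCH_LIMIT=ten parses to NaN with a bare Number().
const unsafeLimit = Number("ten");
const uncapped = capUnsafe(rows, unsafeLimit); // keeps all 200 rows

// The sanitized value falls back to the default of 25 instead.
const safeLimit = parsePositiveIntEnv(
  { EVAL_ODYSSEYSBENCH_LIMIT: "ten" },
  "EVAL_ODYSSEYSBENCH_LIMIT",
  25,
);
const capped = capUnsafe(rows, safeLimit); // keeps 25 rows
```

Passing the environment as an argument keeps the helper testable; in the suite it would presumably be called with `process.env`.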

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/scripts/build-odysseysbench-dataset.ts">

<violation number="1" location="packages/evals/scripts/build-odysseysbench-dataset.ts:27">
P3: Uses path.join for repo-internal dataset paths; this emits backslashes on Windows and violates the repo’s '/' path convention.

(Based on your team's feedback about forward-slash path separators.)</violation>
</file>

<file name="packages/evals/suites/odysseysbench.ts">

<violation number="1" location="packages/evals/suites/odysseysbench.ts:36">
P2: Unvalidated numeric env parsing can turn limit/sample into `NaN`, causing limit bypass and unexpected full-dataset runs. Sanitize env values to finite positive integers before passing to sampling.</violation>
</file>

<file name="packages/evals/tasks/bench/agent/odysseysbench.ts">

<violation number="1" location="packages/evals/tasks/bench/agent/odysseysbench.ts:62">
P2: Missing/invalid `precomputed_rubric` silently triggers generated-rubric fallback, which breaks benchmark fidelity. Reject the case when OdysseysBench rubric data is absent instead of continuing.</violation>
</file>
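
One way to address the last violation is to validate the rubric before the agent runs and throw instead of falling back. The sketch below is illustrative only: `requirePrecomputedRubric` and `isRubricItem` are hypothetical names, and the item shape is the one given in the PR description.

```typescript
interface RubricItem {
  criterion: string;
  description: string;
  max_points: number;
}

interface PrecomputedRubric {
  items: RubricItem[];
}

// Structural check for a single rubric item of unknown origin.
function isRubricItem(value: unknown): value is RubricItem {
  if (typeof value !== "object" || value === null) return false;
  const { criterion, description, max_points } = value as Record<string, unknown>;
  return (
    typeof criterion === "string" &&
    typeof description === "string" &&
    typeof max_points === "number" &&
    max_points >= 1
  );
}

// Throws rather than letting the caller fall back to a generated rubric.
function requirePrecomputedRubric(taskId: string, value: unknown): PrecomputedRubric {
  if (typeof value !== "object" || value === null) {
    throw new Error(`OdysseysBench task ${taskId} is missing precomputed_rubric`);
  }
  const { items } = value as { items?: unknown };
  if (!Array.isArray(items) || items.length === 0 || !items.every(isRubricItem)) {
    throw new Error(`OdysseysBench task ${taskId} has an invalid precomputed_rubric`);
  }
  return { items };
}
```

Failing the case outright keeps a scored run from silently mixing published and generated rubrics.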

Architecture diagram

sequenceDiagram
    participant CLI as Eval CLI
    participant Suite as buildOdysseysBenchTestcases
    participant Dataset as OdysseysBench_data.jsonl
    participant Task as agent/odysseysbench task
    participant Verifier as runWithVerifier
    participant Agent as Stagehand Agent
    participant Browser as Browser Page

    Note over CLI,Browser: OdysseysBench Evaluation Flow

    CLI->>Suite: EVAL_DATASET=odysseysbench
    CLI->>Suite: env knobs (LIMIT, SAMPLE, LEVEL, IDS)

    Suite->>Dataset: readJsonlFile()
    Dataset-->>Suite: 200 JSONL rows with precomputed_rubric

    alt EVAL_ODYSSEYSBENCH_IDS set
        Suite->>Suite: Filter by explicit task_ids
    else EVAL_ODYSSEYSBENCH_LEVEL set
        Suite->>Suite: Filter by difficulty level
        Suite->>Suite: applySampling()
    else default
        Suite->>Suite: applySampling() limit=25
    end
    Suite->>Suite: normalizeAgentModelEntries()

    loop For each model × task combination
        Suite-->>Task: Testcase { input, params, metadata }
    end

    CLI->>Task: Execute test case
    Task->>Task: Validate confirmed_task param

    Task->>Browser: page.goto(startUrl)
    Browser-->>Task: Page loaded

    Task->>Agent: agent({ mode, model, systemPrompt })
    Task->>Task: Build TaskSpec with precomputedRubric

    Task->>Verifier: runWithVerifier({ agent, taskSpec })
    Verifier->>Agent: Execute agent on task
    Agent->>Browser: Navigate & interact
    Browser-->>Agent: Page state

    Agent-->>Verifier: Trajectory + results
    Verifier->>Verifier: V3Evaluator.verify() with rubric

    alt EVAL_SUCCESS_MODE=outcome
        Verifier-->>Task: outcomeSuccess
    else EVAL_SUCCESS_MODE=process
        Verifier-->>Task: processScore
    else EVAL_SUCCESS_MODE=both
        Verifier-->>Task: both scores
    end

    Task->>Task: evaluationResultToSuccess()
    Task-->>CLI: { _success, scores, trajectoryDir, logs }
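
The selection branch at the top of the diagram (explicit IDs, else level filter, else default sampling) can be sketched as below. This is a simplification under assumptions: `selectRows` and `buildSyntheticRows` are hypothetical names, sampling is reduced to taking the first N rows, and the rows are synthetic stand-ins built from the published 45 / 46 / 109 level counts rather than the real dataset.

```typescript
type Level = "easy" | "medium" | "hard";

interface Row {
  task_id: string;
  level: Level;
}

interface SelectOptions {
  ids?: string[]; // explicit task_ids; takes priority when non-empty
  level?: Level; // difficulty filter
  limit?: number; // cap applied after filtering; omitted means no cap
}

function selectRows(rows: Row[], options: SelectOptions): Row[] {
  // Explicit IDs bypass level filtering and the cap, as in the diagram's first branch.
  if (options.ids && options.ids.length > 0) {
    const wanted = new Set(options.ids);
    return rows.filter((row) => wanted.has(row.task_id));
  }
  const level = options.level;
  const filtered = level ? rows.filter((row) => row.level === level) : rows;
  return options.limit === undefined ? filtered : filtered.slice(0, options.limit);
}

// Synthetic stand-in for the 200-row dataset, using the published level counts.
function buildSyntheticRows(): Row[] {
  const counts: Array<[Level, number]> = [
    ["easy", 45],
    ["medium", 46],
    ["hard", 109],
  ];
  const rows: Row[] = [];
  for (const [level, count] of counts) {
    for (let i = 0; i < count; i++) {
      rows.push({ task_id: `${level}-${i}`, level });
    }
  }
  return rows;
}

const syntheticRows = buildSyntheticRows();
const hardOnly = selectRows(syntheticRows, { level: "hard" });
const defaultRun = selectRows(syntheticRows, { limit: 25 });
const byId = selectRows(syntheticRows, { ids: ["easy-0", "hard-3"], limit: 1 });
```

Whether the default limit of 25 still applies after a level filter is not spelled out in the PR, so the sketch leaves the cap to the caller.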


cubic-dev-ai · 2026-06-25T18:50:08Z

+
+const SOURCE_URL = "https://odysseysbench.com/assets/data/tasks.json";
+
+const DATASET_DIR = path.join(


P3: Uses path.join for repo-internal dataset paths; this emits backslashes on Windows and violates the repo’s '/' path convention.

(Based on your team's feedback about forward-slash path separators.)


Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At packages/evals/scripts/build-odysseysbench-dataset.ts, line 27:
<comment>Uses path.join for repo-internal dataset paths; this emits backslashes on Windows and violates the repo’s '/' path convention. (Based on your team's feedback about forward-slash path separators.)</comment>
<file context>
@@ -0,0 +1,151 @@
+
+const SOURCE_URL = "https://odysseysbench.com/assets/data/tasks.json";
+
+const DATASET_DIR = path.join(
+  path.resolve(import.meta.dirname, ".."),
+  "datasets",
</file context>

Declining this one: path.join is the correct choice for runtime filesystem paths (it emits the OS-native separator, which is what fs wants on Windows), and it matches the sibling dev script scripts/backfill-webtailbench-rubrics.ts, which also uses path.join for its dataset paths. The forward-slash convention applies to in-code/URL/import paths; the suite loader that builds an embedded dataset path does use /. Keeping path.join here for consistency with the existing converter script.
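
For readers following the separator discussion, the difference can be shown with Node's `path.win32` and `path.posix` variants, which behave the same on any host OS. The path segments and the `toForwardSlashes` helper are illustrative assumptions, not code from this PR.

```typescript
import * as path from "node:path";

// path.win32 and path.posix expose each platform's behavior regardless of the host OS.
const windowsNative = path.win32.join("packages", "evals", "datasets", "odysseysbench");
// windowsNative is "packages\\evals\\datasets\\odysseysbench"

const posixStyle = path.posix.join("packages", "evals", "datasets", "odysseysbench");
// posixStyle is "packages/evals/datasets/odysseysbench"

// Hypothetical helper: normalize a native path to forward slashes, for example
// before embedding it in generated metadata rather than passing it to fs.
function toForwardSlashes(nativePath: string): string {
  return nativePath.split(path.sep).join("/");
}

const embedded = toForwardSlashes(path.join("datasets", "odysseysbench"));
// embedded is "datasets/odysseysbench" on every platform
```

Plain `path.join` suits paths handed straight to `fs`, while `path.posix.join` or a normalizing helper suits paths that are written into files or compared as strings.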

Thanks for the feedback. This comment was influenced by this learning. Open the link to edit it, or reply here to edit or delete it.

- Complete legacy CLI wiring: register odysseysbench in evals.config.json benchmarks + cli-legacy benchmarkMap so `b:odysseysbench` resolves instead of erroring 'Unknown benchmark' (CATEGORY_OVERRIDES alone left it half-wired).
- Rubric points: scale weights x1000 (was x100) so rounding no longer distorts the relative weighting of small criteria; per-task share error is now 0.
- Converter: validate task_id/confirmed_task and that rubric weights sum to ~1.0, and assert row count, so a re-fetched upstream change fails loud instead of silently dropping or mis-weighting tasks.
- Drop dead `key` param in toRubricItem; tighten task `level` to the union type.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
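
The rounding effect behind the x100 to x1000 change can be checked with a small worked example. The weights below are made up, and `toPoints` and `maxShareError` are hypothetical helpers; the result is exact at x1000 only because these weights have at most three decimal places.

```typescript
// Convert fractional weights to integer points at a given scale (floor of 1 point).
function toPoints(weights: number[], scale: number): number[] {
  return weights.map((weight) => Math.max(1, Math.round(weight * scale)));
}

// Largest gap between a criterion's intended share and its share of the integer total.
function maxShareError(weights: number[], scale: number): number {
  const points = toPoints(weights, scale);
  const total = points.reduce((sum, value) => sum + value, 0);
  return Math.max(...weights.map((weight, i) => Math.abs(points[i] / total - weight)));
}

// Made-up rubric with two small criteria and one dominant one.
const weights = [0.012, 0.034, 0.954];

// x100 rounds to [1, 3, 95] (total 99), so every share drifts from its weight.
const errorAt100 = maxShareError(weights, 100);

// x1000 gives [12, 34, 954] (total 1000), which reproduces the weights.
const errorAt1000 = maxShareError(weights, 1000);
```

At x100 the smallest criterion ends up with 1/99 of the score instead of 1.2%, which is the kind of distortion the commit describes.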

cubic-dev-ai

1 issue found across 5 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/scripts/build-odysseysbench-dataset.ts">

<violation number="1" location="packages/evals/scripts/build-odysseysbench-dataset.ts:27">
P3: Uses path.join for repo-internal dataset paths; this emits backslashes on Windows and violates the repo’s '/' path convention.

(Based on your team's feedback about forward-slash path separators.)</violation>
</file>


- suites/odysseysbench.ts: sanitize EVAL_MAX_K/LIMIT/SAMPLE via parsePositiveIntEnv so a non-numeric value no longer becomes NaN and bypasses the sampling cap.
- tasks/bench/agent/odysseysbench.ts: hard-fail a case whose precomputed_rubric is missing instead of silently falling back to a generated rubric (benchmark fidelity).
- scripts/build-odysseysbench-dataset.ts: validate each rubric entry individually (non-empty requirement/verification, weight finite in (0,1]) in addition to the aggregate sum, so a bad individual weight can't slip through.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

framework/discovery.ts has its own CATEGORY_OVERRIDES (used by the cli list/run path, separate from taskConfig.ts); without this entry agent/odysseysbench fell into the plain 'agent' category in the modern CLI instead of external_agent_benchmarks.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ss planner

The modern 'evals run' path uses its own registries, separate from taskConfig / index.eval.ts: parse.ts SUPPORTED_BENCHMARKS (b:/benchmark: resolver), benchPlanner suiteMap (testcase fan-out), and externalHarnessPlan (claude_code/codex instruction+startUrl extraction). Without these, 'run b:odysseysbench' (or --harness claude_code/codex) errored 'Unknown benchmark'. Adds odysseysbench to all three for parity with webtailbench.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cubic-dev-ai

1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/evals/framework/externalHarnessPlan.ts">

<violation number="1" location="packages/evals/framework/externalHarnessPlan.ts:71">
P3: New OdysseysBench planner path lacks focused unit coverage. Add tests for success mapping and missing confirmed_task failure to lock behavior.</violation>
</file>


cubic-dev-ai · 2026-06-25T23:51:08Z

    };
  }

+  if (input.name === "agent/odysseysbench") {


P3: New OdysseysBench planner path lacks focused unit coverage. Add tests for success mapping and missing confirmed_task failure to lock behavior.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At packages/evals/framework/externalHarnessPlan.ts, line 71:
<comment>New OdysseysBench planner path lacks focused unit coverage. Add tests for success mapping and missing confirmed_task failure to lock behavior.</comment>
<file context>
@@ -68,7 +68,22 @@ export function buildExternalHarnessTaskPlan(
     };
   }
 
+  if (input.name === "agent/odysseysbench") {
+    const instruction = readString(params, "confirmed_task");
+    if (!instruction) {
</file context>


miguelg719 and others added 3 commits June 25, 2026 12:12


feat(evals): add OdysseysBench agent benchmark #2275
miguelg719 wants to merge 5 commits into main from miguelgonzalez/evals-odysseysbench
