feat(evals): add OdysseysBench agent benchmark #2275
OdysseysBench (https://odysseysbench.com) is a 200-task web-agent benchmark (45 easy / 46 medium / 109 hard). Each task ships a weighted rubric whose weights sum to 1.0; build-odysseysbench-dataset.ts converts those into the verifier's precomputed_rubric format so process + outcome scoring use the published criteria directly (no rubric generation).

- datasets/odysseysbench: committed source snapshot + generated JSONL
- scripts/build-odysseysbench-dataset.ts: deterministic converter (--fetch to refresh)
- suites/odysseysbench.ts: testcase builder with limit/sample/level/ids knobs
- tasks/bench/agent/odysseysbench.ts: bench task via TrajectoryRecorder + verifier
- index.eval.ts / taskConfig.ts / cli-legacy.ts: dataset fan-out + category wiring

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
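For readers who have not opened the converter, the rubric conversion described above might look roughly like the sketch below. The field names (`requirement`, `verification`, `weight` on the source side; `criterion`, `description`, `max_points` on the verifier side) appear elsewhere in this PR, but the exact types, the mapping between them, the `toPrecomputedRubric` name, and the `pointsScale` parameter are assumptions made for illustration, not the script's actual code.

```typescript
// Hypothetical shapes: field names come from this PR's text,
// but the real types in the repo may differ.
interface SourceRubricEntry {
  requirement: string;
  verification: string;
  weight: number; // fraction of the task score; a task's weights sum to 1.0
}

interface RubricItem {
  criterion: string;
  description: string;
  max_points: number;
}

interface PrecomputedRubric {
  items: RubricItem[];
}

// Convert one task's rubric map into the precomputed_rubric shape.
// Items follow the order returned by Object.entries, and no randomness is
// involved, so the same input always yields the same output.
function toPrecomputedRubric(
  rubrics: Record<string, SourceRubricEntry>,
  pointsScale: number,
): PrecomputedRubric {
  const items = Object.entries(rubrics).map(([, entry]) => ({
    criterion: entry.requirement, // assumed mapping
    description: entry.verification, // assumed mapping
    max_points: Math.round(entry.weight * pointsScale),
  }));
  return { items };
}
```

Because the process score is a ratio of earned points to total points, the choice of `pointsScale` only matters through rounding.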
🦋 Changeset detected
Latest commit: 29ccd1a
The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
3 issues found across 9 files
Confidence score: 3/5
- `packages/evals/tasks/bench/agent/odysseysbench.ts` currently falls back to a generated rubric when `precomputed_rubric` is missing/invalid, which can silently change scoring behavior and undermine benchmark fidelity in published results. Treat missing rubric data as a hard failure for OdysseysBench cases before merging.
- `packages/evals/suites/odysseysbench.ts` parses env limits/samples without validating numeric values, so `NaN` can slip through and bypass intended caps, leading to accidental full-dataset runs and unpredictable runtime/cost. Guard these inputs to finite positive integers before they reach sampling.
- `packages/evals/scripts/build-odysseysbench-dataset.ts` uses `path.join` for repo-internal dataset paths, which can emit Windows backslashes and break the repo’s forward-slash path convention. Normalize to '/'-style paths in generated metadata/scripts to avoid cross-platform inconsistencies.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/scripts/build-odysseysbench-dataset.ts">
<violation number="1" location="packages/evals/scripts/build-odysseysbench-dataset.ts:27">
P3: Uses path.join for repo-internal dataset paths; this emits backslashes on Windows and violates the repo’s '/' path convention.
(Based on your team's feedback about forward-slash path separators.) [FEEDBACK_USED].</violation>
</file>
<file name="packages/evals/suites/odysseysbench.ts">
<violation number="1" location="packages/evals/suites/odysseysbench.ts:36">
P2: Unvalidated numeric env parsing can turn limit/sample into `NaN`, causing limit bypass and unexpected full-dataset runs. Sanitize env values to finite positive integers before passing to sampling.</violation>
</file>
<file name="packages/evals/tasks/bench/agent/odysseysbench.ts">
<violation number="1" location="packages/evals/tasks/bench/agent/odysseysbench.ts:62">
P2: Missing/invalid `precomputed_rubric` silently triggers generated-rubric fallback, which breaks benchmark fidelity. Reject the case when OdysseysBench rubric data is absent instead of continuing.</violation>
</file>
Architecture diagram
sequenceDiagram
participant CLI as Eval CLI
participant Suite as buildOdysseysBenchTestcases
participant Dataset as OdysseysBench_data.jsonl
participant Task as agent/odysseysbench task
participant Verifier as runWithVerifier
participant Agent as Stagehand Agent
participant Browser as Browser Page
Note over CLI,Browser: OdysseysBench Evaluation Flow
CLI->>Suite: EVAL_DATASET=odysseysbench
CLI->>Suite: env knobs (LIMIT, SAMPLE, LEVEL, IDS)
Suite->>Dataset: readJsonlFile()
Dataset-->>Suite: 200 JSONL rows with precomputed_rubric
alt EVAL_ODYSSEYSBENCH_IDS set
Suite->>Suite: Filter by explicit task_ids
else EVAL_ODYSSEYSBENCH_LEVEL set
Suite->>Suite: Filter by difficulty level
Suite->>Suite: applySampling()
else default
Suite->>Suite: applySampling() limit=25
end
Suite->>Suite: normalizeAgentModelEntries()
loop For each model × task combination
Suite-->>Task: Testcase { input, params, metadata }
end
CLI->>Task: Execute test case
Task->>Task: Validate confirmed_task param
Task->>Browser: page.goto(startUrl)
Browser-->>Task: Page loaded
Task->>Agent: agent({ mode, model, systemPrompt })
Task->>Task: Build TaskSpec with precomputedRubric
Task->>Verifier: runWithVerifier({ agent, taskSpec })
Verifier->>Agent: Execute agent on task
Agent->>Browser: Navigate & interact
Browser-->>Agent: Page state
Agent-->>Verifier: Trajectory + results
Verifier->>Verifier: V3Evaluator.verify() with rubric
alt EVAL_SUCCESS_MODE=outcome
Verifier-->>Task: outcomeSuccess
else EVAL_SUCCESS_MODE=process
Verifier-->>Task: processScore
else EVAL_SUCCESS_MODE=both
Verifier-->>Task: both scores
end
Task->>Task: evaluationResultToSuccess()
Task-->>CLI: { _success, scores, trajectoryDir, logs }
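The selection branch at the top of the diagram can be read as the sketch below. The `Row` and `Knobs` shapes, the `selectRows` and `takeFirst` names, and the use of `slice` in place of `applySampling()` are all assumptions for illustration; the real suite may sample differently and may apply limits in other ways.

```typescript
type Level = "easy" | "medium" | "hard";

interface Row {
  task_id: string;
  level: Level;
}

// Hypothetical, already-parsed versions of the EVAL_ODYSSEYSBENCH_* knobs.
interface Knobs {
  ids?: string[];
  level?: Level;
  limit?: number;
}

const DEFAULT_LIMIT = 25; // the default shown in the diagram

// Stand-in for applySampling(): keeps the first `limit` rows,
// or every row when no limit is given.
function takeFirst(rows: Row[], limit: number | undefined): Row[] {
  return limit === undefined ? rows : rows.slice(0, limit);
}

function selectRows(rows: Row[], knobs: Knobs): Row[] {
  // 1. Explicit ids win over every other knob.
  if (knobs.ids && knobs.ids.length > 0) {
    const wanted = new Set(knobs.ids);
    return rows.filter((row) => wanted.has(row.task_id));
  }
  // 2. A level filter narrows the pool; only an explicit limit caps it
  //    (this is one reading of the diagram, not confirmed behavior).
  if (knobs.level) {
    const level = knobs.level;
    return takeFirst(rows.filter((row) => row.level === level), knobs.limit);
  }
  // 3. Otherwise cap the full dataset at the default limit.
  return takeFirst(rows, knobs.limit ?? DEFAULT_LIMIT);
}
```

Under this reading, a level filter with no explicit limit returns every matching task, which would be consistent with the description's claim that `EVAL_ODYSSEYSBENCH_LEVEL=hard` yields all 109 hard tasks.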
const SOURCE_URL = "https://odysseysbench.com/assets/data/tasks.json";

const DATASET_DIR = path.join(
P3: Uses path.join for repo-internal dataset paths; this emits backslashes on Windows and violates the repo’s '/' path convention.
(Based on your team's feedback about forward-slash path separators.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/scripts/build-odysseysbench-dataset.ts, line 27:
<comment>Uses path.join for repo-internal dataset paths; this emits backslashes on Windows and violates the repo’s '/' path convention.
(Based on your team's feedback about forward-slash path separators.)</comment>
<file context>
@@ -0,0 +1,151 @@
+
+const SOURCE_URL = "https://odysseysbench.com/assets/data/tasks.json";
+
+const DATASET_DIR = path.join(
+ path.resolve(import.meta.dirname, ".."),
+ "datasets",
</file context>
Declining this one: path.join is the correct choice for runtime filesystem paths (it emits the OS-native separator, which is what fs wants on Windows), and it matches the sibling dev script scripts/backfill-webtailbench-rubrics.ts, which also uses path.join for its dataset paths. The forward-slash convention applies to in-code/URL/import paths; the suite loader that builds an embedded dataset path does use /. Keeping path.join here for consistency with the existing converter script.
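To make the distinction in the reply concrete, here is a small sketch using Node's `path.win32` and `path.posix` variants, which behave the same on any host OS. The path segments and the `toForwardSlashes` helper are illustrative only and are not taken from the repo.

```typescript
import * as path from "node:path";

// path.join uses the host OS separator. The win32/posix variants below
// show both outcomes regardless of where this runs.
const windowsStyle = path.win32.join("packages", "evals", "datasets", "odysseysbench");
const posixStyle = path.posix.join("packages", "evals", "datasets", "odysseysbench");

console.log(windowsStyle); // packages\evals\datasets\odysseysbench
console.log(posixStyle); // packages/evals/datasets/odysseysbench

// Hypothetical helper for the other case the reply mentions: a path that is
// embedded in generated metadata, where forward slashes are the convention.
function toForwardSlashes(p: string): string {
  return p.split(path.win32.sep).join(path.posix.sep);
}

console.log(toForwardSlashes(windowsStyle)); // packages/evals/datasets/odysseysbench
```

The native form is what `fs` calls expect at runtime; the normalized form only matters when a path is written into output that other platforms will read.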
Thanks for the feedback. This comment was influenced by this learning. Open the link to edit it, or reply here to edit or delete it.
- Complete legacy CLI wiring: register odysseysbench in evals.config.json benchmarks + cli-legacy benchmarkMap so `b:odysseysbench` resolves instead of erroring 'Unknown benchmark' (CATEGORY_OVERRIDES alone left it half-wired).
- Rubric points: scale weights x1000 (was x100) so rounding no longer distorts the relative weighting of small criteria; per-task share error is now 0.
- Converter: validate task_id/confirmed_task and that rubric weights sum to ~1.0, and assert row count, so a re-fetched upstream change fails loud instead of silently dropping or mis-weighting tasks.
- Drop dead `key` param in toRubricItem; tighten task `level` to the union type.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
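The rounding argument in the second bullet can be checked with a small sketch. The weights below are made up for illustration (they are not from the dataset), and `toPoints`, `maxShareError`, `assertWeightsSumToOne`, and the `1e-6` tolerance are hypothetical names and values, not the converter's own.

```typescript
// Scale fractional weights to integer points.
function toPoints(weights: number[], scale: number): number[] {
  return weights.map((w) => Math.round(w * scale));
}

// Largest absolute gap between a criterion's published weight and its
// share of the total points after rounding.
function maxShareError(weights: number[], scale: number): number {
  const total = toPoints(weights, scale).reduce((sum, p) => sum + p, 0);
  const errors = weights.map((w) => Math.abs(Math.round(w * scale) / total - w));
  return Math.max(0, ...errors);
}

// Aggregate check in the spirit of the converter bullet; the tolerance
// value is an assumption.
function assertWeightsSumToOne(weights: number[], tolerance = 1e-6): void {
  const sum = weights.reduce((acc, w) => acc + w, 0);
  if (Math.abs(sum - 1) > tolerance) {
    throw new Error(`rubric weights sum to ${sum}, expected ~1.0`);
  }
}

const exampleWeights = [0.125, 0.125, 0.75]; // illustrative only
assertWeightsSumToOne(exampleWeights);

console.log(toPoints(exampleWeights, 100)); // [ 13, 13, 75 ] (total 101, shares drift)
console.log(toPoints(exampleWeights, 1000)); // [ 125, 125, 750 ] (total 1000, shares exact)
console.log(maxShareError(exampleWeights, 1000)); // 0
```

With a scale of 100, the two 0.125 criteria each round up to 13 points, so the total becomes 101 and every share shifts slightly. With a scale of 1000, any weight with at most three decimal places maps to an exact integer.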
1 issue found across 5 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/scripts/build-odysseysbench-dataset.ts">
<violation number="1" location="packages/evals/scripts/build-odysseysbench-dataset.ts:27">
P3: Uses path.join for repo-internal dataset paths; this emits backslashes on Windows and violates the repo’s '/' path convention.
(Based on your team's feedback about forward-slash path separators.) [FEEDBACK_USED].</violation>
</file>
- suites/odysseysbench.ts: sanitize EVAL_MAX_K/LIMIT/SAMPLE via parsePositiveIntEnv so a non-numeric value no longer becomes NaN and bypasses the sampling cap.
- tasks/bench/agent/odysseysbench.ts: hard-fail a case whose precomputed_rubric is missing instead of silently falling back to a generated rubric (benchmark fidelity).
- scripts/build-odysseysbench-dataset.ts: validate each rubric entry individually (non-empty requirement/verification, weight finite in (0,1]) in addition to the aggregate sum, so a bad individual weight can't slip through.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
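A minimal sketch of what an env sanitizer like this could look like, together with one way an unsanitized `NaN` can defeat a cap. Only the name `parsePositiveIntEnv` comes from the commit message; its signature, the fallback behavior, and the `applyCap` helper are assumptions for illustration.

```typescript
// Returns a positive integer parsed from `raw`, or `fallback` when the value
// is missing, empty, non-numeric, fractional, zero, or negative.
function parsePositiveIntEnv(
  raw: string | undefined,
  fallback?: number,
): number | undefined {
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  return Number.isInteger(value) && value > 0 ? value : fallback;
}

// Hypothetical cap written with a comparison guard. Every comparison with NaN
// is false, so a NaN limit silently returns the full list.
function applyCap<T>(rows: T[], limit: number | undefined): T[] {
  if (limit === undefined) return rows;
  return rows.length > limit ? rows.slice(0, limit) : rows;
}

const tasks = ["t1", "t2", "t3"];

// Unsanitized: Number("abc") is NaN, and the cap is bypassed.
console.log(applyCap(tasks, Number("abc")).length); // 3

// Sanitized: the bad value falls back to the default cap.
console.log(applyCap(tasks, parsePositiveIntEnv("abc", 2)).length); // 2
```

Taking the raw string as a parameter rather than reading `process.env` inside the helper keeps it easy to unit test.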
framework/discovery.ts has its own CATEGORY_OVERRIDES (used by the cli list/run path, separate from taskConfig.ts); without this entry agent/odysseysbench fell into the plain 'agent' category in the modern CLI instead of external_agent_benchmarks.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ss planner

The modern 'evals run' path uses its own registries, separate from taskConfig / index.eval.ts: parse.ts SUPPORTED_BENCHMARKS (b:/benchmark: resolver), benchPlanner suiteMap (testcase fan-out), and externalHarnessPlan (claude_code/codex instruction+startUrl extraction). Without these, 'run b:odysseysbench' (or --harness claude_code/codex) errored 'Unknown benchmark'. Adds odysseysbench to all three for parity with webtailbench.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
1 issue found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/evals/framework/externalHarnessPlan.ts">
<violation number="1" location="packages/evals/framework/externalHarnessPlan.ts:71">
P3: New OdysseysBench planner path lacks focused unit coverage. Add tests for success mapping and missing confirmed_task failure to lock behavior.</violation>
</file>
};
}

if (input.name === "agent/odysseysbench") {
P3: New OdysseysBench planner path lacks focused unit coverage. Add tests for success mapping and missing confirmed_task failure to lock behavior.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/evals/framework/externalHarnessPlan.ts, line 71:
<comment>New OdysseysBench planner path lacks focused unit coverage. Add tests for success mapping and missing confirmed_task failure to lock behavior.</comment>
<file context>
@@ -68,7 +68,22 @@ export function buildExternalHarnessTaskPlan(
};
}
+ if (input.name === "agent/odysseysbench") {
+ const instruction = readString(params, "confirmed_task");
+ if (!instruction) {
</file context>
Why
OdysseysBench is a 200-task web-agent benchmark (45 easy / 46 medium / 109 hard) where every task ships a weighted rubric (weights sum to 1.0). It slots naturally into the rubric-based verifier path — like WebTailBench — so we can score process and outcome against the published criteria instead of generating rubrics.
What Changed
- `packages/evals/datasets/odysseysbench/`: committed source snapshot (`source/tasks.json`, mirrored from `https://odysseysbench.com/assets/data/tasks.json`) plus the generated `OdysseysBench_data.jsonl` (200 rows).
- `scripts/build-odysseysbench-dataset.ts`: deterministic transform of each task's `rubrics` map → the verifier's `precomputed_rubric` (`{ items: [{ criterion, description, max_points }] }`). Rubric weights scale to integer points (sum ≈ 100; scale is immaterial since the process score is a ratio). Run with `--fetch` to refresh the snapshot.
- `suites/odysseysbench.ts`: `buildOdysseysBenchTestcases`, mirroring the WebTailBench suite. Env knobs: `EVAL_ODYSSEYSBENCH_LIMIT` (default 25), `EVAL_ODYSSEYSBENCH_SAMPLE`, `EVAL_ODYSSEYSBENCH_LEVEL` (easy/medium/hard filter), `EVAL_ODYSSEYSBENCH_IDS`.
- `tasks/bench/agent/odysseysbench.ts`: runs the agent through `TrajectoryRecorder` + `V3Evaluator.verify()` with the precomputed rubric.
- `index.eval.ts` (respects `EVAL_DATASET=odysseysbench`); `external_agent_benchmarks` category override in `taskConfig.ts` and `cli-legacy.ts`.

How to run
Tests
- `pnpm --filter @browserbasehq/stagehand-evals run typecheck`: clean
- `prettier --check` on all changed files: clean
- `max_points` ≥ 1; task_ids unique.
- `agent/odysseysbench` registers under `external_agent_benchmarks`; suite builds testcases with rubric attached; `EVAL_ODYSSEYSBENCH_LEVEL=hard` returns exactly 109 tasks.

Summary by cubic
Adds OdysseysBench as a built-in agent benchmark with precomputed rubrics for outcome and process scoring across 200 web tasks. Tightens env parsing and rubric validation to preserve scoring fidelity and avoid sampling bypasses; fully wires modern CLI and external harness support under `external_agent_benchmarks`.

New Features
- `OdysseysBench_data.jsonl`; task rubrics converted to verifier `precomputed_rubric`.
- `packages/evals/scripts/build-odysseysbench-dataset.ts` (deterministic; `--fetch` refreshes upstream).
- `packages/evals/suites/odysseysbench.ts` with limit/sample/level/ids knobs.
- `packages/evals/tasks/bench/agent/odysseysbench.ts` via TrajectoryRecorder + `V3Evaluator.verify()`; success mode via `EVAL_SUCCESS_MODE` (outcome|process|both).
- `packages/evals/index.eval.ts`; category override to `external_agent_benchmarks`; run with `pnpm evals --eval-name agent/odysseysbench`.

Bug Fixes
- Register in `packages/evals/evals.config.json` and `packages/evals/cli-legacy.ts` so `b:odysseysbench` resolves.
- Register in `packages/evals/tui/commands/parse.ts` and `packages/evals/framework/benchPlanner.ts`, and add `packages/evals/framework/externalHarnessPlan.ts` support so `b:odysseysbench` runs and external harnesses get instruction/startUrl; ensure discovery lists under `external_agent_benchmarks`.
- Sanitize `EVAL_MAX_K`/`EVAL_ODYSSEYSBENCH_LIMIT`/`EVAL_ODYSSEYSBENCH_SAMPLE` to prevent NaN from bypassing caps.
- Hard-fail on a missing `precomputed_rubric`.
- Validate `task_id`/`confirmed_task`; validate each rubric item (non-empty fields; weight in (0,1]); ensure weights sum to ~1.0; assert row count.

Written for commit 29ccd1a. Summary will update on new commits.