SWE Atlas Multiturn

SWE Atlas Multiturn is a benchmark of 75 multi-turn software engineering tasks for evaluating coding agents in a realistic, user-driven development setting. This repository contains the task data and example Harbor run configs needed to run them.

Repository Layout

data/
  multiturn/     # multi-turn task data
run_configs/
  multiturn/     # example configs for data/multiturn
  singleturn/    # baseline single-turn example configs

Requirements

Install Harbor:

git clone https://github.com/laude-institute/harbor.git
cd harbor
uv tool install .

Set up Modal for sandbox environments:

uv pip install modal
modal setup
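Before configuring credentials, it can help to confirm that the tools installed above are reachable. The following is a minimal sketch. `missing_tools` is a hypothetical helper, not part of Harbor, uv, or Modal, and it assumes that `uv tool install .` exposes Harbor as an executable named `harbor`.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper: print the name of each command that is
# not found on PATH. Assumes Harbor's CLI is installed as `harbor`;
# adjust the names if your install differs.
missing_tools() {
  local tool
  for tool in "$@"; do
    # `command -v` succeeds only if the name resolves to a command.
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

For example, `missing_tools uv harbor modal` prints nothing when all three are on PATH.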

Environment Variables

Run configs load credentials from harbor/.env relative to this repository root. Create that file before launching a run:

mkdir -p harbor
$EDITOR harbor/.env
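As an alternative to starting from an empty file, the file can be seeded with the placeholder lines from the common block below and then edited. This is a sketch; `init_env_file` is a hypothetical helper, and it deliberately leaves an existing file untouched so that real keys are never overwritten.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create an env file (default harbor/.env) containing
# the common-block placeholders, unless the file already exists.
# The placeholders must be replaced with real values before a run.
init_env_file() {
  local env_file="${1:-harbor/.env}"
  if [ -e "$env_file" ]; then
    echo "$env_file already exists; leaving it unchanged" >&2
    return 0
  fi
  mkdir -p "$(dirname "$env_file")"
  # Quoted 'EOF' keeps the <...> placeholders literal.
  cat > "$env_file" <<'EOF'
OPENAI_API_KEY=<your-gateway-api-key>
OPENAI_API_BASE=<openai-compatible-gateway-url>/v1
EOF
}
```

Run `init_env_file` from the repository root, then open harbor/.env in `$EDITOR` to fill in the values.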

Common User/Rubric Settings

Put this block in harbor/.env. It covers the shared simulated-user setup (GPT-5.5 high) and RF rubric grading. By default, RF tasks are graded with Anthropic Claude Opus 4.5, matching the original SWE Atlas Refactoring task:

OPENAI_API_KEY=<your-gateway-api-key>
OPENAI_API_BASE=<openai-compatible-gateway-url>/v1

OPENAI_API_BASE must serve both openai/gpt-5.5 and the default RF rubric model, anthropic/claude-opus-4-5-20251101. A LiteLLM gateway works for this. If you point OPENAI_API_BASE directly at https://api.openai.com/v1, the Anthropic rubric model is unavailable, so you must also override EVAL_MODEL with an OpenAI model.
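A direct-OpenAI setup might look like the sketch below. It rests on two assumptions that this README does not confirm: that the run configs read EVAL_MODEL from harbor/.env like the other variables, and that `openai/gpt-5.5` is an acceptable rubric model. Check your run config before relying on it. `write_direct_openai_env` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an env file for a direct OpenAI endpoint.
# ASSUMPTIONS: EVAL_MODEL is read from this file, and openai/gpt-5.5 is a
# valid rubric model override. This OVERWRITES the target file.
write_direct_openai_env() {
  local env_file="$1" api_key="$2"
  mkdir -p "$(dirname "$env_file")"
  cat > "$env_file" <<EOF
OPENAI_API_KEY=${api_key}
OPENAI_API_BASE=https://api.openai.com/v1
EVAL_MODEL=openai/gpt-5.5
EOF
}
```

Note that RF rubric scores graded by an OpenAI model may not be comparable to those graded by the default Anthropic model.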

Per-Model Settings

Add only the variables needed for the agent config you run:

Config	Additional `harbor/.env` setting
`gpt-5p5-high_codex.sh`	None beyond the common block
`opus-4p8-high_claude-code.sh`	`ANTHROPIC_API_KEY=<your-anthropic-api-key>`
`sonnet-4p6-high_claude-code.sh`	`ANTHROPIC_API_KEY=<your-anthropic-api-key>`
`gemini-3p5-flash-high_opencode.sh`	`GEMINI_API_KEY=<your-gemini-api-key>`
`kimi-k2p6_kimi-cli.sh`	`OPENAI_API_KEY=<key-for-openai-compatible-endpoint>` and `OPENAI_API_BASE=<endpoint-url>` or `OPENAI_BASE_URL=<endpoint-url>`
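The table can be turned into a quick pre-flight check. This is a sketch in which `required_vars` and `missing_vars` are hypothetical helpers. The mapping is copied by hand from the table above rather than parsed from the scripts, so it needs updating if the configs change.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: list the harbor/.env variables a config needs
# (per the README table) and report any that are absent or empty.
# gpt-5p5-high_codex.sh and kimi-k2p6_kimi-cli.sh need only the
# OpenAI-compatible common variables; kimi may instead use
# OPENAI_BASE_URL, which this sketch does not model.
required_vars() {
  # Common block, needed by every config.
  echo OPENAI_API_KEY
  echo OPENAI_API_BASE
  case "$(basename "$1")" in
    opus-4p8-high_claude-code.sh|sonnet-4p6-high_claude-code.sh)
      echo ANTHROPIC_API_KEY ;;
    gemini-3p5-flash-high_opencode.sh)
      echo GEMINI_API_KEY ;;
  esac
}

missing_vars() {
  local env_file="$1" config="$2" var
  for var in $(required_vars "$config"); do
    # Counts as set only if a line VAR=<non-empty> exists.
    # Unreplaced placeholder values still count as set.
    grep -Eq "^${var}=.+" "$env_file" 2>/dev/null || echo "$var"
  done
}
```

For example, `missing_vars harbor/.env opus-4p8-high_claude-code.sh` prints `ANTHROPIC_API_KEY` if that key has not been added yet.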

Running

Run commands from the repository root.

Multi-turn example:

bash run_configs/multiturn/gpt-5p5-high_codex.sh

Multi-turn run configs set the simulated user model to openai/gpt-5.5 via SIM_USER_MODEL.
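Because harbor/.env and results/ are located relative to the repository root, a small guard can catch runs launched from the wrong directory. This is a minimal sketch; `run_from_root` is a hypothetical wrapper that treats any directory containing both `data/` and `run_configs/` as the root.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a config only when the current directory
# looks like the repository root (contains data/ and run_configs/).
run_from_root() {
  local config="$1"
  if [ ! -d data ] || [ ! -d run_configs ]; then
    echo "error: run this from the SWE Atlas Multiturn repository root" >&2
    return 1
  fi
  if [ ! -f "$config" ]; then
    echo "error: no such config: $config" >&2
    return 1
  fi
  bash "$config"
}
```

Usage from the repository root: `run_from_root run_configs/multiturn/gpt-5p5-high_codex.sh`.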

To run the baseline single-turn example:

bash run_configs/singleturn/gpt-5p5-high_codex.sh

The scripts write outputs under results/. To make a custom config, copy an existing script and update the agent, model, sampling count, or Harbor arguments.
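The copy step can be guarded so that an existing config is never clobbered. This is a sketch. `new_config` is a hypothetical helper, and the target name in the usage line below is only an example. The edits to the agent, model, sampling count, or Harbor arguments still have to be made by hand in the copy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy an existing run config to a new path,
# refusing to overwrite a file that already exists.
new_config() {
  local src="$1" dst="$2"
  if [ ! -f "$src" ]; then
    echo "error: source config not found: $src" >&2
    return 1
  fi
  if [ -e "$dst" ]; then
    echo "error: $dst already exists" >&2
    return 1
  fi
  cp "$src" "$dst"
}
```

For example: `new_config run_configs/multiturn/gpt-5p5-high_codex.sh run_configs/multiturn/my-config.sh`, then edit `my-config.sh` and run it with `bash` from the repository root.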

License

This repository is released under the Apache License 2.0. See LICENSE.