ParallelKernelBench (PKB) is a benchmark with the goal of enabling LLMs to optimize multi-GPU kernels. Specifically, we investigate model capabilities on turning existing PyTorch + NCCL reference code into fine-grained CUDA (or related DSLs).
The design is heavily inspired by KernelBench.
π Paper Β Β·Β π€ Hugging Face Β Β·Β π Project website
PKB asks models to optimize multi-GPU kernels: each problem has a PyTorch + NCCL reference under reference/; candidates go in solutions_<backend>/ (CUDA, Triton, ParallelKittens, or run-specific trees from generation).
Correctness: eval mode runs reference and candidate on the same inputs and compares per-rank outputs (rank_*.pt) within --atol / --rtol.
Performance: optional timing reports speedup vs reference. We follow ThunderKittens 2 β benchmark rigor: 500 warmup iterations, 100 timed iterations (see worker / perf utilities).
Roofline (approximate): reference_rooflines_code/ provides utilization estimates; contributions welcome.
PKB uses uv for reproducible Python environments.
- OS: Linux with NVIDIA GPUs (multi-GPU runs need matching
torchrun/ NCCL). - Driver: Recent enough for CUDA 12.8 wheels (H100 nodes typically satisfy this).
- ParallelKittens backend (optional): clone ThunderKittens and set
THUNDERKITTENS_ROOTto the repo root (Modal/Together images do this automatically).
# Install uv (skip if already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
cd ParallelKernelBench
cd kernelgen
git clone https://github.com/SWE-agent/mini-swe-agent.git
cd ..
uv sync
# Verify the environment
uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
uv run pytest tests/ -qThis creates .venv/ at the repo root and installs all dependencies from the pinned uv.lock (commit that file when you change pyproject.toml).
- Google models:
GEMINI_API_KEYorGOOGLE_API_KEY. - Together models:
TOGETHER_API_KEY(seeALLOWED_MODELSinkernelgen/generate_kernel.py). - Anthropic models:
ANTHROPIC_API_KEY. - OpenAI models:
OPENAI_API_KEY.
[kernelgen/generate_kernel.py](kernelgen/generate_kernel.py) assembles a kernel-generation prompt from [kernelgen/prompts.toml](kernelgen/prompts.toml) and optionally calls an LLM. You can (1) print the prompt only (--print-prompt) or (2) generate a solution file for one problem and backend.
# [print-prompt] Inspect the assembled user prompt; no API call, no file written
# --precision: fp32 | fp16 | bf16 (must match an entry in prompts.toml)
# --hardware: h100_8 | b200_72 (optional; omitted β "none" in the output directory name)
# --backend: cuda | triton | parallelkittens (must match [backends] in prompts.toml)
# --model: must be in ALLOWED_MODELS in generate_kernel.py (ignored with --print-prompt)
uv run python kernelgen/generate_kernel.py \
--precision bf16 \
--hardware h100_8 \
--problem 1 \
--model zai-org/GLM-5.1 \
--backend cuda \
--print-prompt
# [generate] Call the LLM and write e.g. solutions_cuda_bf16_h100_8_together_zai-org_GLM-5.1/1_allreduce_cuda.py
uv run python kernelgen/generate_kernel.py \
--precision bf16 \
--hardware h100_8 \
--problem 1 \
--model zai-org/GLM-5.1 \
--backend cuda
# Other backends (same flags; different output dir prefix and filename suffix)
uv run python kernelgen/generate_kernel.py --precision bf16 --hardware h100_8 --problem 1 --model gemini-3-pro-preview --backend triton
uv run python kernelgen/generate_kernel.py --precision bf16 --hardware h100_8 --problem 1 --model gemini-3-pro-preview --backend parallelkittens
# Optional: custom prompt template
uv run python kernelgen/generate_kernel.py --paths-to-prompts-template /path/to/prompts.toml --precision bf16 --problem 1 --backend cuda --print-promptOutputs: without --print-prompt, each run writes under solutions_<backend>_<precision>_<hardware|none>_<provider>_<model_slug>/ as {stem}_{backend}.py (for example 1_allreduce_cuda.py). Pass that directory to [run_local.py](run_local.py) via --solutions-dir when evaluating.
We provide a script ([kernelgen/generate_kernel_agent.py](kernelgen/generate_kernel_agent.py)) that uses mini-swe-agent in generating a kernel. Note that we verified functionality on Google models.
To use this script, you must install mini-swe-agent separately:
pip install -e kernelgen/mini-swe-agent
cd kernelgen
git clone https://github.com/SWE-agent/mini-swe-agent.gitAn example command:
python kernelgen/generate_kernel_agent.py \
--problem 1 \
--backend cuda \
--model gemini-3-flash-preview \
--step-limit 3 \
--timeout 600 \
--remote-dryrun-command \
'python run_local.py --num-procs-per-node 4 --mode dryrun --problem {problem_arg} --solution {backend} --measure-perf' \
--remote-eval-command \
'python run_local.py --num-procs-per-node 4 --mode eval --problem {problem_arg} --solution {backend} --measure-perf'We provide convienient functionality to the --problem flag to make it simple to generate for multiple problems:
# generate every problem under reference/
python kernelgen/generate_kernel.py \
--precision bf16 \
--hardware h100_8 \
--problem all \
--model deepseek-ai/DeepSeek-V4-Pro \
--backend cuda
# generate a specific subset of problems
python kernelgen/generate_kernel.py \
--precision bf16 \
--hardware h100_8 \
--problem '[72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88]' \
--model zai-org/GLM-5.1 \
--backend cuda[run_local.py](run_local.py) is the local multi-GPU harness (via torchrun). Assuming you have an environment with multiple GPUs connected via NVLink, you can (1) run one backend in isolation (dryrun) or (2) compare a candidate kernel against the reference (eval).
# [dryrun] Launches a single torchrun job for --solution. Does not compare to reference. Use this to verify a kernel compiles, runs, and produces rank outputs.
# Reference PyTorch + NCCL baseline (default solutions dir: reference/)
uv run python run_local.py \
--nproc-per-node 4 \
--mode dryrun \
--problem 1 \
--dtype bfloat16 \
--solution reference \ # (reference / cuda / triton / parallelkittens)
--download \ # Optional: save rank_*.pt under logs/problem_<stem>/<solution>/
--measure-perf # Optional: warmup + timed iterations; prints and saves rank_*_perf.json summary
# CUDA solution (in a custom directory)
uv run python run_local.py \
--nproc-per-node 4 \
--mode dryrun \
--problem 1 \
--dtype bfloat16 \
--solution cuda \
--solutions-dir /path/to/custom/cuda/solution \
--download \
--measure-perf \
--profile # Optional: save PyTorch profiler traces (written under traces/)
# [eval] For each of --trials random inputs (default 5), runs reference then --solution.
# Compares rank_*.pt tensors, and stops on first mismatch. --solution must NOT be reference.
python run_local.py \
--nproc-per-node 4 \ # number of GPUs to evaluate on
--mode eval \
--problem 1 \
--dtype bfloat16 \
--solution cuda \
--solutions-dir /path/to/custom/cuda/solution \
--trials 5 \ # number of RNG trials (default 5)
--atol 1e-5 \ # tensor compare tolerances (default 1e-5 each)
--rtol 1e-5 \
--download \ # keep per-trial artifacts under logs/
--measure-perf # time reference and solutionArtifacts: with --download, outputs land in logs/problem_<stem>/{reference|cuda|triton|...}/ (per-rank .pt files, optional rank_*_perf.json). Inspect them offline with [utils/compare_outputs.py](utils/compare_outputs.py) and [utils/compare_performance.py](utils/compare_performance.py) (see Inspect downloaded outputs).
Unfortunately, not everyone can buy a DGX 8xH100 compute node at their local Costco. So we provide the ability to evaluate using Together AI's cloud container service (Together Containters). Dedicated GPU jobs use the Together Jig flow (together beta jig).
# find your together API key from your profile/dashboard
export TOGETHER_API_KEY="..."
# creates a dockerfile, then build, push, deploy to the cloud
together beta jig deploy
# once the image is uploaded ot the cloud server, you can evaluate just like run_local.py
# e.g.
python run_together.py --mode dryrun --problem 1
python run_together.py --mode dryrun --problem 1 --solution cuda --download
python run_together.py --mode dryrun --problem 1 --solution cuda --download --measure-perf --profile
python run_together.py --mode eval --problem 1
python run_together.py --mode eval --problem 1 --solution cuda --download
python run_together.py --mode eval --problem 1 --solution cuda --download --measure-perf --profile
# monitor cloud status
together beta jig status
# view cloud machine logs
together beta jig logs --follow
# save resources
together beta jig destroyYou can also reuse a tagged image without a full rebuild:
together beta jig build --tag pkb_no_nvshmem
together beta jig push --tag pkb_no_nvshmem
together beta jig deploy --image registry.together.xyz/<your-registry-path>:pkb_no_nvshmemAfter run_local.py --download (and optionally --measure-perf), artifacts are organized as:
logs/problem_<stem>/
reference/ # rank_0.pt, rank_1.pt, β¦ and optional rank_*_perf.json
cuda/ # same layout for the candidate backend
comparison_cuda.json # written by compare_performance.py
--problem accepts the same ids as run_local.py (e.g. 1 resolves to 1_allreduce β logs/problem_1_allreduce/).
# Compare reference vs CUDA tensors (default --solution cuda)
uv run python utils/compare_outputs.py --problem 1
# Compare against another backend; tighten tolerances if needed
uv run python utils/compare_outputs.py --problem 1_allreduce --solution triton --atol 1e-3 --rtol 1e-3
# Inspect one backend folder without comparing (print shapes / sample values per rank)
uv run python utils/compare_outputs.py --problem 1 --inspect reference
uv run python utils/compare_outputs.py -p 1 -i cuda
# Compare wall-clock perf from rank_*_perf.json (requires --measure-perf on both runs)
uv run python utils/compare_performance.py --problem 1
uv run python utils/compare_performance.py --problem 1 --solution cuda
# Custom logs root (default: <repo>/logs)
uv run python utils/compare_outputs.py --problem 1 --logs-dir /path/to/logsThis started as an internship project: we would greatly welcome contributions!
- Internode-specific problems (e.g. prefill-decode disaggregation)
- Internode support (NCCL GIN, NVSHMEM)
- More backends for modern DSLs (e.g. CuTe)
- Add
init_<backend>/finalize_<backend>inutils/init_and_finalize_backends.pyand wire them inscripts/worker.pyinsiderun_worker(). - Modify the prompt template in
kernelgenand add at least one sample solution if applicable. - The current infra assumes we have a single
.pyfile per solution.
- Fix tensor shapes and allocation in
utils/input_tensors.py(shared across backends). - Define performance rules in
utils/performance.pyif needed. - Add the reference under
reference/and optional solutions under the appropriatesolutions_* tree.
If you use ParallelKernelBench in your work, please cite:
@misc{chan2026parallelkernelbench,
title={ParallelKernelBench: Can LLMs Write Fast Multi-GPU Kernels?},
author={Willy Chan and Nathan Paek and Simon Guo and Simran Arora and Daniel Fu},
year={2026},
howpublished={\url{https://github.com/Willy-Chan/ParallelKernelBench}},
note={Benchmark and evaluation harness for multi-GPU kernel generation.},
}Special thanks to Stuart Sul, William Hu, Austin Silveria, Hayden Prairie, Jonah Yi for their invaluable assistance and feedback! And of course, thank you to Together AI for sponsoring the compute for this project.

