[MLX] Gemma4-31B ondevice sampling by kiymetakdemir · Pull Request #20561 · pytorch/executorch

kiymetakdemir · 2026-06-27T01:59:44Z

Summary

Lets the MLX-exported Gemma 4 31B model sample the next token on-device instead of returning logits for host-side sampling. Sampling is opt-in at export (--sample); temperature, top_p, and seed are runtime inputs, and the runner increments the seed per token.

Changes

export.py: the --sample flag wraps the model so that forward(tokens, input_pos, temperature, top_p, seed) → int64 token, and records a use_sampling constant-method flag. Non-sample export is unchanged.
gemma4_31b_engine.cpp: reads use_sampling from metadata; when set, consumes the int64 token directly instead of calling logits_to_token, feeds the scalar inputs in prefill/decode (across the min/max prefill chunking), and manages the per-token seed schedule. top_k is still rejected; top_p is range-checked to (0, 1]; top_p/seed are rejected on non-sample models.
main.cpp: --top_p / --seed flags are wired into SamplingConfig; an unset seed is randomized only for sampling models (non-sample models keep seed 0, so they don't trip the guard).
tests/test_mlx_pipeline.py: adds test_export_to_pte_with_sampling, a tiny-model MLX export with --sample that asserts the use_sampling flag, the int64 token output, and same-seed reproducibility.
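
As a rough illustration of the runtime guards listed above, here is a minimal Python sketch. It is not the engine's C++ code: the `SamplingConfig` field defaults, the `0` sentinel for an unset `top_k`, and the function name `validate_sampling` are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SamplingConfig:
    # Field defaults are assumptions for this sketch, not the runner's values.
    temperature: float = 1.0
    top_k: int = 0       # 0 = unset (assumed sentinel)
    top_p: float = 1.0   # 1.0 = no nucleus truncation
    seed: int = 0


def validate_sampling(cfg: SamplingConfig, use_sampling: bool) -> None:
    """Raise ValueError for the configurations described as rejected."""
    # top_k is rejected regardless of how the model was exported.
    if cfg.top_k != 0:
        raise ValueError("top_k is not supported")
    # top_p must lie in the half-open interval (0, 1].
    if not (0.0 < cfg.top_p <= 1.0):
        raise ValueError(f"top_p must be in (0, 1], got {cfg.top_p}")
    # Non-sample models return logits, so top_p/seed have no effect there.
    if not use_sampling and (cfg.top_p != 1.0 or cfg.seed != 0):
        raise ValueError("top_p/seed require a model exported with --sample")
```

With these assumed defaults, a non-sample model passes only when `top_p` and `seed` are left untouched, which is why an unset seed has to stay at 0 for such models.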

pytorch-bot · 2026-06-27T01:59:51Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20561

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Pending, 2 Unclassified Failures

As of commit 373b79d with merge base 825bd30:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Build Aarch64 Linux Wheels / pytorch/executorch / build-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
/__w/executorch/executorch/pytorch/executorch/backends/apple/coreml/runtime/inmemoryfs/inmemory_filesystem.cpp:722:48: error: ‘inmemoryfs::InMemoryFileSystem::InMemoryNode::Kind’ has not been declared
Build Aarch64 Linux Wheels / pytorch/executorch / upload / upload-wheel-py3_10-cpu-aarch64 (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Unable to download artifact(s): Artifact not found for name: pytorch_executorch__3.10_cpu_aarch64

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-27T02:00:41Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

metascroy · 2026-06-27T22:18:51Z

    print("Done.")


+class _MLXSampleWrapper(nn.Module):


Ideally SamplingHead from backends/mlx/llm would be directly usable. The fact that we keep defining wrappers suggests it should be refactored so that it is.

In terms of shape, I expect sampling to work on (B, vocab) and return (B) (tokens).

I don't see a case for (B, S, vocab).

metascroy · 2026-06-27T22:19:17Z

@claude review this code

claude · 2026-06-27T22:20:05Z

Claude finished @metascroy's task in 2m 18s

Code review

Read changed files (export.py, engine, main.cpp, tests)
Analyze sampling logic, seed handling, guards
Address reviewer's question re: reusing SamplingHead
Post review feedback

Overall this is a clean, well-guarded port of the Qwen3.5 MoE on-device sampling pattern (#20545). The runtime guards, range-checks, and the per-token seed schedule are consistent with that PR. A few notes below.

On @metascroy's question: why not reuse `SamplingHead`?

You're right, and the shape mismatch is the reason a wrapper is needed today:

backends/mlx/llm/sampling.py:36-40 — SamplingHead.forward expects the wrapped model to return (B, S, vocab) and does last = logits[:, -1, :] to get (B, vocab) before calling torch.ops.mlx.sample.
The Gemma MLX source transform already returns the last-token logits: mlx_source_transformations.py:150 (last = self.lm_head(x[:, -1, :])) produces (B, vocab) — there is no S axis. Feeding that into SamplingHead would make [:, -1, :] index the wrong axis (or fail), so _MLXSampleWrapper (export.py:312) skips the slice and calls mlx.sample(logits, ...) directly.

So the two runners disagree on the head's input contract: Qwen feeds (B, S, vocab) and relies on SamplingHead to slice; Gemma pre-slices to (B, vocab). That's exactly the (B,S,vocab) vs (B,vocab) split you flagged.

The clean fix is the one you suggested: make SamplingHead operate on (B, vocab) → (B) and drop the internal [:, -1, :] slice (which is really a model-output concern, not a sampling concern). Then:

Gemma can use SamplingHead directly (its MLX forward already yields (B, vocab)).
Qwen keeps its last-token slice in the model forward (or a tiny adapter), matching how mlx.sample itself is documented (custom_ops.py:405 — logits: [B, vocab]).

That removes _MLXSampleWrapper here and the divergent contract. Worth doing as a small follow-up so the next runner doesn't add a third wrapper. If you'd like, I can take that refactor on (touching SamplingHead, the Gemma export, and the Qwen export).
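
To make the proposed contract concrete, the sketch below separates the two concerns in plain Python. It is a stand-in, not `torch.ops.mlx.sample` or the real `SamplingHead`: the function names `last_token_logits` and `sample_head`, the nested-list tensors, and the use of `random.Random` are all assumptions for illustration, and the sampled values will not match the MLX op.

```python
import math
import random


def last_token_logits(logits_bsv):
    """Model-side concern: reduce (B, S, vocab) to (B, vocab)."""
    return [seq[-1] for seq in logits_bsv]


def sample_head(logits_bv, temperature, top_p, seed):
    """Sampling-side concern: (B, vocab) -> (B) token ids.

    Assumes temperature > 0 and top_p in (0, 1].
    """
    rng = random.Random(seed)
    tokens = []
    for row in logits_bv:
        # Temperature-scaled softmax, shifted by the max for stability.
        scaled = [x / temperature for x in row]
        peak = max(scaled)
        exps = [math.exp(x - peak) for x in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Nucleus filter: keep the most likely ids until their mass reaches top_p.
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        weights = [probs[i] for i in kept]
        tokens.append(rng.choices(kept, weights=weights, k=1)[0])
    return tokens
```

Under this split, a model that already yields (B, vocab) calls `sample_head` directly, while one that yields (B, S, vocab) applies `last_token_logits` first; the sampler itself never sees an S axis.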

Other observations

1. Sampling runs on every prefill chunk, output discarded. In run_prefill_chunk (gemma4_31b_engine.cpp:527+), each chunk feeds the scalar inputs and the baked-in sampler runs, but only the final chunk's token is used. The seed is held constant across chunks and incremented once after prefill (:355), so this is correct, just mildly wasteful. Sampling cost is negligible vs. the model, so fine to leave — flagging only for awareness.
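
The seed schedule described in this item can be sketched as follows. This is a simplified Python model rather than the engine's code: `fake_step` stands in for one forward call of the exported program, only a maximum chunk size is modeled (the real runner also has a minimum), and all names here are hypothetical.

```python
def chunked(tokens, max_chunk):
    """Yield consecutive prompt slices of at most max_chunk tokens."""
    for start in range(0, len(tokens), max_chunk):
        yield tokens[start:start + max_chunk]


def fake_step(chunk, seed):
    """Hypothetical stand-in for one forward call that returns a token id."""
    return (sum(chunk) + seed) % 100


def generate(prompt, num_new, base_seed, max_chunk, step=fake_step):
    """Return (generated tokens, seed passed to each forward call)."""
    seeds_used = []
    seed = base_seed
    token = None
    # Prefill: the sampler runs on every chunk with the same seed, and only
    # the token from the final chunk is kept.
    for chunk in chunked(prompt, max_chunk):
        token = step(chunk, seed)
        seeds_used.append(seed)
    seed += 1  # incremented once after prefill
    out = [token]
    # Decode: one forward call per token, with the seed advanced per token.
    for _ in range(num_new - 1):
        token = step([token], seed)
        seeds_used.append(seed)
        seed += 1
        out.append(token)
    return out, seeds_used
```

For a 5-token prompt with `max_chunk=2` and `base_seed=10`, the three prefill calls all see seed 10 and the following decode calls see 11, 12, and so on, so the number of chunks does not shift the seeds used during decode.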

2. No dtype/shape check on the int64 token read. read_sampled_token (:104-105) does output.const_data_ptr<int64_t>()[0] when use_sampling is set. This trusts that the metadata flag matches the actual output tensor. Given that use_sampling_ is read from the model's own constant method, that's a safe coupling, but a defensive ET_CHECK on scalar_type() == Long would fail loudly instead of reinterpreting bytes if the two ever drift.
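
The failure mode in this item is easy to reproduce outside the runtime. The Python sketch below is only an analogy to the suggested ET_CHECK: the byte buffer plus dtype string is a hypothetical stand-in for an output tensor, and this `read_sampled_token` is not the engine's function.

```python
import struct


def read_sampled_token(buf: bytes, dtype: str) -> int:
    """Read one little-endian int64 token, refusing any other dtype."""
    # Fail loudly on a mismatch instead of reinterpreting the bytes.
    if dtype != "int64":
        raise TypeError(f"expected int64 token output, got {dtype}")
    if len(buf) < 8:
        raise ValueError("output buffer is too small for one int64")
    return struct.unpack_from("<q", buf, 0)[0]


# Without the check, float32 logits read as int64 yield a meaningless value.
logits_bytes = struct.pack("<ff", 1.0, 2.0)
reinterpreted = struct.unpack("<q", logits_bytes)[0]
```

Here `reinterpreted` is an arbitrary large integer, not a token id, which is the silent corruption the check is meant to turn into an immediate error.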

3. Test executes forward on the lowered program directly. test_export_to_pte_with_sampling guards on MLXPartitioner import but then runs forward.execute(...) via Runtime. The reproducibility assertion relies on the host reference mlx.sample (custom_ops.py:416-419 explicitly notes it is not bit-identical to the on-device graph). On non-Mac CI this only exercises the CPU reference — worth a comment in the test that same-seed reproducibility is being checked against the host reference, not the delegate, so the assertion's scope is clear.

4. Seed semantics look correct. -1 → random only for sample models, non-sample stays at 0 to avoid the guard (main.cpp:178-188), explicit --seed 0 on a sample model is reproducible (not re-randomized), and top_p/seed are correctly rejected on non-sample models in both prefill_tokens and decode_one. valid_top_p of (0, 1] matches the op's documented domain. 👍
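
A compact way to state the seed rules in this item is the Python sketch below. It mirrors the described behavior rather than main.cpp itself: the function name `resolve_seed`, the injectable `rng`, and the range of the randomized seed are assumptions.

```python
import random

UNSET_SEED = -1  # CLI sentinel for "no --seed given", per the review


def resolve_seed(cli_seed, use_sampling, rng=None):
    """Pick the base seed the runner would start from."""
    # An explicit seed, including 0, is always honored so runs stay
    # reproducible. On a non-sample model a nonzero value is rejected
    # later by the runtime guard, not here.
    if cli_seed != UNSET_SEED:
        return cli_seed
    # Non-sample models keep seed 0 so they do not trip that guard.
    if not use_sampling:
        return 0
    # Sample models get a fresh random seed; the range is an assumption.
    if rng is None:
        rng = random.SystemRandom()
    return rng.randrange(0, 2**31)
```

Keeping the sentinel at -1 rather than 0 is what lets `--seed 0` mean "reproducible with seed zero" instead of "pick one for me".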

None of these are blockers. The main actionable item is the SamplingHead refactor you raised — I'd recommend landing this as-is if it unblocks you and doing that refactor as a fast follow across both runners.
Branch: gemma-ondevice-sampling

Add on-device sampling to the Gemma 4 31B MLX runner

373b79d

meta-cla bot added the CLA Signed label on Jun 27, 2026

metascroy reviewed Jun 27, 2026

kiymetakdemir wants to merge 1 commit into pytorch:main from kiymetakdemir:gemma-ondevice-sampling
