Dynamic MTP by shihaobai · Pull Request #1375 · ModelTC/LightLLM

shihaobai · 2026-06-30T15:01:43Z

No description provided.

…1265) Co-authored-by: niushengxiao <niushengxiao@sensetime.com> Co-authored-by: wangzaijun <wzjhelloworld@qq.com> Co-authored-by: hiworldwzj <30762946+hiworldwzj@users.noreply.github.com> Co-authored-by: shihaobai <1798930569@qq.com>

gemini-code-assist

Code Review

This pull request introduces support for the Gemma-4 model, including multimodal vision processing, custom prefill attention with image bidirectional masking, and logit softcapping. It also refactors memory management by introducing KvCacheAllocator and decoupled operators, adds SymmMem and FlashInfer all-reduce fast paths, supports SM100 GPUs with FP4/FP8 Mega MoE, and implements an experimental Anthropic Messages API compatibility layer. Feedback on the changes highlights several issues: a call to the subscripted generic set[int] in prefill_cuda_graph.py, which raises a TypeError on Python versions before 3.9 and is unidiomatic on later ones; potential AttributeErrors in basemodel.py and grouped_fused_moe_ep.py due to missing attributes or uninitialized buffers; and an ignored stride parameter in the gather_token_id.py Triton kernel.

gemini-code-assist · 2026-06-30T15:05:28Z

+        graph_handle_token_nums = [e for e in graph_handle_token_nums if e <= self.max_handle_token_num]
        graph_handle_token_nums.append(self.max_handle_token_num)
+
+        graph_handle_token_nums = list(set[int](graph_handle_token_nums))


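Calling the subscripted generic set[int] raises TypeError: 'type' object is not subscriptable on Python 3.8 and earlier. On Python 3.9+ the call succeeds, because a generic alias forwards calls to the underlying type and returns a plain set, but the type parameter has no runtime effect and the form is unidiomatic. It should be changed to the standard set constructor.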

Suggested change

-        graph_handle_token_nums = list(set[int](graph_handle_token_nums))
+        graph_handle_token_nums = list(set(graph_handle_token_nums))
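
The filter, append, and deduplicate steps from the diff can be sketched as one small function. This is a minimal illustration, not LightLLM code: the helper name `normalize_capture_sizes` is hypothetical, and the final `sorted` call is an added assumption, since the diff does not show whether the list is sorted later and a set does not guarantee ascending iteration order.

```python
def normalize_capture_sizes(candidates, max_handle_token_num):
    """Hypothetical helper: filter, cap, dedupe and sort graph capture sizes."""
    # Drop sizes above the cap, mirroring the list comprehension in the diff.
    sizes = [e for e in candidates if e <= max_handle_token_num]
    # Always include the cap itself; this may introduce a duplicate.
    sizes.append(max_handle_token_num)
    # set() removes duplicates; sorted() makes the order deterministic.
    return sorted(set(sizes))


print(normalize_capture_sizes([1, 2, 4, 8, 8, 16, 64], 16))  # prints [1, 2, 4, 8, 16]
```

Using the plain `set` constructor keeps the code valid on every Python version, whereas `set[int](...)` depends on generic-alias support added in Python 3.9.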

gemini-code-assist · 2026-06-30T15:05:28Z

+        if self.args.enable_prefill_decode_mixed and model_input.b_is_decode_req is not None:
+            gather_token_prefill_decode_mixed(
+                input_ids=model_input.input_ids,
+                req_to_next_token_ids=self.req_manager.req_sampling_params_manager.req_to_next_token_ids,


The ReqManager class does not define a req_sampling_params_manager attribute. Accessing self.req_manager.req_sampling_params_manager.req_to_next_token_ids will raise an AttributeError. You should access req_to_next_token_ids directly from self.req_manager.

Suggested change

-                req_to_next_token_ids=self.req_manager.req_sampling_params_manager.req_to_next_token_ids,
+                req_to_next_token_ids=self.req_manager.req_to_next_token_ids,

gemini-code-assist · 2026-06-30T15:05:28Z

+    if use_sm100_mega_moe(quant_method):
+        return mega_moe_impl(hidden_states, w13, w2, topk_weights, topk_idx, quant_method)
+
+    buffer = dist_group_manager.ep_buffer if is_prefill else dist_group_manager.ep_low_latency_buffer


If is_sm100_gpu() is True but use_sm100_mega_moe(quant_method) is False (for example, when running FP8 MoE on an SM100 GPU), dist_group_manager.ep_low_latency_buffer remains None because it is only initialized when not is_sm100_gpu(). This will cause an AttributeError during decode. Consider ensuring that ep_low_latency_buffer is initialized or falling back to ep_buffer when ep_low_latency_buffer is None.
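
The fallback suggested in this comment could look roughly like the sketch below. Everything here is hypothetical: `FakeGroupManager` is a stand-in for `dist_group_manager`, the buffers are plain strings rather than real communication buffers, and `select_ep_buffer` is an illustrative name. Whether the normal buffer is actually usable on the decode path depends on the expert-parallel library in use, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeGroupManager:
    """Hypothetical stand-in for dist_group_manager; buffers are plain objects."""
    ep_buffer: Optional[object] = None
    ep_low_latency_buffer: Optional[object] = None


def select_ep_buffer(manager, is_prefill):
    """Pick the buffer for this phase, falling back to ep_buffer in decode."""
    if is_prefill:
        buffer = manager.ep_buffer
    else:
        buffer = manager.ep_low_latency_buffer
        if buffer is None:
            # The low-latency buffer was never initialized; use the normal one.
            buffer = manager.ep_buffer
    if buffer is None:
        # Fail early with a clear message instead of an AttributeError later.
        raise RuntimeError("no expert-parallel buffer has been initialized")
    return buffer


manager = FakeGroupManager(ep_buffer="normal")
print(select_ep_buffer(manager, is_prefill=False))  # prints normal
```

Raising an explicit error when neither buffer exists turns a late `AttributeError` on `None` into a message that points at the initialization path.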

gemini-code-assist · 2026-06-30T15:05:29Z

+    cur_next_token_id = tl.load(
+        req_to_next_token_ids + cur_req_idx * req_to_next_token_ids_stride + cur_mtp_index, mask=block_mask
+    )


The stride parameter req_to_next_token_ids_stride_1 is passed to the kernel but ignored inside the load instruction, effectively hardcoding the stride of the second dimension to 1. If req_to_next_token_ids is non-contiguous, this will result in incorrect memory access.

Suggested change

-    cur_next_token_id = tl.load(
-        req_to_next_token_ids + cur_req_idx * req_to_next_token_ids_stride + cur_mtp_index, mask=block_mask
-    )
+    cur_next_token_id = tl.load(
+        req_to_next_token_ids + cur_req_idx * req_to_next_token_ids_stride + cur_mtp_index * req_to_next_token_ids_stride_1, mask=block_mask
+    )
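
The effect of dropping the second stride can be reproduced without a GPU. The sketch below emulates strided addressing over a flat Python list; the names `load_2d` and `storage` and the 2 x 4 layout are illustrative assumptions, not part of the kernel under review.

```python
def load_2d(flat, row, col, stride_0, stride_1=1):
    """Read element (row, col) of a strided 2-D view over a flat buffer."""
    return flat[row * stride_0 + col * stride_1]


# Flat storage for a 2 x 4 table of token ids.
storage = [10, 11, 12, 13,
           20, 21, 22, 23]

# A non-contiguous view that keeps every second column:
# logical shape 2 x 2, strides (4, 2).
stride_0, stride_1 = 4, 2

# Correct: logical element (1, 1) is storage[1 * 4 + 1 * 2] = 22.
correct = load_2d(storage, 1, 1, stride_0, stride_1)
# Buggy: ignoring stride_1 reads storage[1 * 4 + 1] = 21.
wrong = load_2d(storage, 1, 1, stride_0)

print(correct, wrong)  # prints 22 21
```

For a contiguous table the second stride is 1 and both forms agree, which is why this kind of bug tends to stay hidden until a sliced or transposed tensor is passed in.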

hiworldwzj and others added 30 commits April 15, 2026 18:02

tpsp optimization (#1269)

bbc9eba

optimization prefill dp banlance, support multimodal dp balance. (#1271)

529f9ca

feat(api): add Anthropic Messages API compatibility endpoint (#1272)

6fa2f23

fix: upgrade flashinfer to 0.6.8.post1 (#1280)

5034706

qwen3 omni support long audio (#1268)

3368043

feat(api): consolidate HTTP API endpoints and fixes (#1282)

e28f984

Co-authored-by: shihaobai <1798930569@qq.com>

fix: typo prefll -> prefill in cudagraph option (#1283)

4208b76

add --performance_mode start args (#1285)

0d5e122

auto set tool call parser and reasoning_parser (#1284)

1f54d60

Co-authored-by: shihaobai <1798930569@qq.com>

fix: honor visual infer batch size (#1293)

162df8b

use pinned device_ptr to init cpu cache tensor (#1287)

3d08cba

Communication opt (#1286)

28254a9

Co-authored-by: niushengxiao <niushengxiao@sensetime.com> Co-authored-by: wangzaijun <wzjhelloworld@qq.com>

feat(triton): support 256 headdim in attention decode kernels (#1291)

b8eee5c

Co-authored-by: shihaobai <1798930569@qq.com>

fix(httpserver): quiet client-disconnect log path, return 499 (#1288)

447fc40

Co-authored-by: wzj <wzjhelloworld@qq.com>

remove lightllm_kernel (#1296)

cc7e8f4

support prefill cudagraph for gdn (#1294)

e1f8723

Co-authored-by: wangzaijun <wzjhelloworld@qq.com>

auto-derive max_req_total_len from model config (#1297)

38609d1

Co-authored-by: wangzaijun <wzjhelloworld@qq.com>

fix(basemodel): Format AssertionError message for max_seq_length vs max_total_token_num (#1300)

592cad2

feat: support invalid_token_ids in sampling params (#1305)

8bcd28b

refactor(kv-cache): embed KvCacheAllocator in MemoryManager as allocator + test model skills. (#1301)

70cdb07

fix(multimodal): detect truncated images at the frontend via pixel-le… (#1307)

8141c56

Co-authored-by: wangzaijun <wzjhelloworld@qq.com>

feat(multimodal): add max_image_token_count guard with OOM risk guidance (#1308)

f41b8c4

improve multimodal image preprocessing with max_image_pixels auto-resize (#1309)

45e8cca

Fix window size for sliding attention layer (#1311)

171204e

Fix sliding window size for token attention kernel (#1312)

eaf0f42

muliturn benchmark (#1313)

f850264

fix: fix cache length (#1314)

4c069d3

Co-authored-by: niushengxiao <niushengxiao@sensetime.com>

support gemma4 (#1304)

1b38e8d

add enable_prefill_decode_mixed start args (#1315)

eaa3b28

blueswhen and others added 27 commits May 26, 2026 17:58

opt: refine cpu cache start time (#1321)

e696aed

Co-authored-by: niushengxiao <niushengxiao@sensetime.com>

fix: fp8 group_fuse_moe (#1323)

375ad57

fix health check (#1322)

520c041

feat: deep_ep v2 (#1303)

466651c

Co-authored-by: niushengxiao <niushengxiao@sensetime.com> Co-authored-by: shihaobai <1798930569@qq.com>

Refine prefill CUDA graph capture sizes (#1331)

63269d5

fix: v32 tokenizer for transformers 5.x (#1326)

b16ccfa

Co-authored-by: niushengxiao <niushengxiao@sensetime.com>

fix: update ci to cuda13.0 (#1332)

105d57f

Co-authored-by: niushengxiao <niushengxiao@sensetime.com>

pd nixl upgrade write mode to transfer kv (#1324)

3863844

fix prefill_params when prefill num_reqs > 1024 (#1336)

5514e24

refactor(mtp): extract BaseMTPModel mixin shared by existing MTP draft models (#1337)

e03ef9a

revert(mtp): drop shared BaseMTPModel base, keep per-model is_mtp_draft_model (revert #1337) (#1339)

da9dfb8

nixl pd support qwen3.5 (#1340)

78e34a7

add Flashinfer sampling backend (#1328)

2740083

Co-authored-by: niushengxiao <niushengxiao@sensetime.com>

remove nccl pd mode. (#1342)

8196f35

fix lmeval start speed (#1343)

316b398

fix: correct 'Unsupport' typo to 'Unsupported' in error messages (#1320)

6a70412

feat(metrics): add model_name label and new throughput/cache metrics (#1344)

d471c21

Co-authored-by: shihaobai <1798930569@qq.com>

fix duplicate reasoning and reasoning_content (#1345)

3a15cb0

fix(linear-att): fix latent prefix-cache ref/buffer leaks (#1348)

b0231de

Co-authored-by: wzj <wzjhelloworld@qq.com>

basic Profiler support (#1247)

41ed8e9

Return 400 for chat template build errors (#1356)

630f7b8

Fix config utils (#1357)

9e4f552

fix: truncate oversized output token strings (#1359)

b28eeac

perf(qwen3next): drop q/k/v/a/b contiguous copies in GDN fused_recurrent decode (#1349)

9ae6a5d

Co-authored-by: wangzaijun <wzjhelloworld@qq.com>

feat: add fused moe shared-expert and add-rmsnorm optimization (#1353)

84c7fe8

Co-authored-by: niushengxiao <niushengxiao@sensetime.com> Co-authored-by: shihaobai <1798930569@qq.com> Co-authored-by: wangzaijun <wzjhelloworld@qq.com>

improve moe align (#1369)

7876db3

Fix linear attention CPU cache tail index buffer (#1372)

13c7017

gemini-code-assist Bot reviewed Jun 30, 2026

hiworldwzj and others added 2 commits July 1, 2026 18:46

fix position_delta in decode. (#1377)

cf55326

Co-authored-by: niushengxiao <niushengxiao@sensetime.com>

feat: opt flashinfer (#1367)

1ff35a0

Co-authored-by: niushengxiao <niushengxiao@sensetime.com> Co-authored-by: wangzaijun <wzjhelloworld@qq.com>

shihaobai wants to merge 61 commits into mtp_optimization from main

8 participants
