Dynamic MTP #1375

Open
shihaobai wants to merge 61 commits into mtp_optimization from main
@shihaobai

Collaborator

No description provided.

hiworldwzj and others added 30 commits April 15, 2026 18:02
Co-authored-by: shihaobai <1798930569@qq.com>
…1265)

Co-authored-by: niushengxiao <niushengxiao@sensetime.com>
Co-authored-by: wangzaijun <wzjhelloworld@qq.com>
Co-authored-by: hiworldwzj <30762946+hiworldwzj@users.noreply.github.com>
Co-authored-by: shihaobai <1798930569@qq.com>
Co-authored-by: shihaobai <1798930569@qq.com>
Co-authored-by: niushengxiao <niushengxiao@sensetime.com>
Co-authored-by: wangzaijun <wzjhelloworld@qq.com>
Co-authored-by: wangzaijun <wzjhelloworld@qq.com>
Co-authored-by: wangzaijun <wzjhelloworld@qq.com>
Co-authored-by: niushengxiao <niushengxiao@sensetime.com>
blueswhen and others added 27 commits May 26, 2026 17:58
Co-authored-by: niushengxiao <niushengxiao@sensetime.com>
Co-authored-by: niushengxiao <niushengxiao@sensetime.com>
Co-authored-by: shihaobai <1798930569@qq.com>
Co-authored-by: niushengxiao <niushengxiao@sensetime.com>
Co-authored-by: niushengxiao <niushengxiao@sensetime.com>
Co-authored-by: niushengxiao <niushengxiao@sensetime.com>
…ent decode (#1349)

Co-authored-by: wangzaijun <wzjhelloworld@qq.com>
Co-authored-by: niushengxiao <niushengxiao@sensetime.com>
Co-authored-by: shihaobai <1798930569@qq.com>
Co-authored-by: wangzaijun <wzjhelloworld@qq.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for the Gemma-4 model, including multimodal vision processing, custom prefill attention with image bidirectional masking, and logit softcapping. It also refactors memory management by introducing KvCacheAllocator and decoupled operators, adds SymmMem and FlashInfer all-reduce fast paths, supports SM100 GPUs with FP4/FP8 Mega MoE, and implements an experimental Anthropic Messages API compatibility layer. Feedback on the changes highlights several critical issues: a runtime TypeError in prefill_cuda_graph.py from using set[int] as a callable, potential AttributeErrors in basemodel.py and grouped_fused_moe_ep.py due to missing attributes or uninitialized buffers, and an ignored stride parameter in the gather_token_id.py Triton kernel.

graph_handle_token_nums = [e for e in graph_handle_token_nums if e <= self.max_handle_token_num]
graph_handle_token_nums.append(self.max_handle_token_num)

graph_handle_token_nums = list(set[int](graph_handle_token_nums))

critical

Calling set[int](...) here is unnecessary and not portable. On Python 3.9+ a parameterized alias such as set[int] is callable and simply builds a plain set, so the line runs, but on Python 3.8 and older subscripting the built-in raises TypeError: 'type' object is not subscriptable. It should be changed to the standard set constructor.

Suggested change
- graph_handle_token_nums = list(set[int](graph_handle_token_nums))
+ graph_handle_token_nums = list(set(graph_handle_token_nums))
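
The filter, append, and de-duplicate steps from the snippet above can be sketched in isolation as follows. The function name `dedupe_token_nums` and the final `sorted()` call are assumptions made for this illustration and are not part of the PR. The last lines show how a parameterized alias behaves when called on Python 3.9+.

```python
import sys


def dedupe_token_nums(token_nums, max_token_num):
    """Drop sizes above the cap, ensure the cap itself is present, remove duplicates."""
    kept = [n for n in token_nums if n <= max_token_num]
    kept.append(max_token_num)
    # Plain set() works on every Python 3 version. sorted() is an addition of
    # this sketch (not in the PR) so that the resulting order is deterministic.
    return sorted(set(kept))


print(dedupe_token_nums([1, 8, 8, 64, 16], max_token_num=16))  # [1, 8, 16]

if sys.version_info >= (3, 9):
    # A parameterized alias forwards the call to its origin type (PEP 585),
    # so this builds an ordinary set rather than raising.
    alias_built = set[int]([1, 2, 2])
    print(type(alias_built) is set, alias_built == {1, 2})  # True True
```

Using the bare constructor keeps the line valid on older interpreters and avoids implying that the element type is checked at runtime, which it is not.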

if self.args.enable_prefill_decode_mixed and model_input.b_is_decode_req is not None:
gather_token_prefill_decode_mixed(
input_ids=model_input.input_ids,
req_to_next_token_ids=self.req_manager.req_sampling_params_manager.req_to_next_token_ids,

high

The ReqManager class does not define a req_sampling_params_manager attribute. Accessing self.req_manager.req_sampling_params_manager.req_to_next_token_ids will raise an AttributeError. You should access req_to_next_token_ids directly from self.req_manager.

Suggested change
- req_to_next_token_ids=self.req_manager.req_sampling_params_manager.req_to_next_token_ids,
+ req_to_next_token_ids=self.req_manager.req_to_next_token_ids,

if use_sm100_mega_moe(quant_method):
return mega_moe_impl(hidden_states, w13, w2, topk_weights, topk_idx, quant_method)

buffer = dist_group_manager.ep_buffer if is_prefill else dist_group_manager.ep_low_latency_buffer

high

If is_sm100_gpu() is True but use_sm100_mega_moe(quant_method) is False (for example, when running FP8 MoE on an SM100 GPU), dist_group_manager.ep_low_latency_buffer remains None because it is only initialized when not is_sm100_gpu(). This will cause an AttributeError during decode. Consider ensuring that ep_low_latency_buffer is initialized or falling back to ep_buffer when ep_low_latency_buffer is None.
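
One way to express the suggested fallback is a small selection helper, sketched below. The helper name `select_ep_buffer` and the `SimpleNamespace` stand-in for `dist_group_manager` are hypothetical and used only for illustration. Whether `ep_buffer` is actually suitable for the decode path is not confirmed by the PR, so treat this as a sketch under that assumption.

```python
from types import SimpleNamespace


def select_ep_buffer(manager, is_prefill):
    """Pick the prefill buffer, or the low-latency buffer with a fallback for decode."""
    if is_prefill:
        return manager.ep_buffer
    low_latency = getattr(manager, "ep_low_latency_buffer", None)
    # Assumption of this sketch: falling back to ep_buffer is acceptable when
    # the low-latency buffer was never initialized (e.g. on an SM100 GPU).
    return low_latency if low_latency is not None else manager.ep_buffer


# Hypothetical stand-in for dist_group_manager on an SM100 GPU without Mega MoE.
manager = SimpleNamespace(ep_buffer="normal_buffer", ep_low_latency_buffer=None)
print(select_ep_buffer(manager, is_prefill=False))  # normal_buffer
```

Resolving the buffer in one place means a missing low-latency buffer surfaces as an explicit choice, not as an attribute access on None deep inside the decode path.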

Comment on lines +162 to +164
cur_next_token_id = tl.load(
req_to_next_token_ids + cur_req_idx * req_to_next_token_ids_stride + cur_mtp_index, mask=block_mask
)

medium

The stride parameter req_to_next_token_ids_stride_1 is passed to the kernel but ignored inside the load instruction, effectively hardcoding the stride of the second dimension to 1. If req_to_next_token_ids is non-contiguous, this will result in incorrect memory access.

Suggested change
- cur_next_token_id = tl.load(
-     req_to_next_token_ids + cur_req_idx * req_to_next_token_ids_stride + cur_mtp_index, mask=block_mask
- )
+ cur_next_token_id = tl.load(
+     req_to_next_token_ids + cur_req_idx * req_to_next_token_ids_stride + cur_mtp_index * req_to_next_token_ids_stride_1, mask=block_mask
+ )
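
The effect of dropping the second stride can be reproduced without Triton. The sketch below uses a plain Python list as flat storage for a 3x2 logical matrix laid out column by column, so the second-dimension stride is 3, not 1. The helper `flat_offset` and the sample values are made up for this illustration.

```python
def flat_offset(row, col, stride_0, stride_1=1):
    """Flat storage offset of logical element (row, col) for the given strides."""
    return row * stride_0 + col * stride_1


rows, cols = 3, 2
# Column-major storage: element (r, c) lives at c * rows + r,
# which means stride_0 == 1 and stride_1 == rows.
storage = [0] * (rows * cols)
for r in range(rows):
    for c in range(cols):
        storage[c * rows + r] = 10 * r + c
stride_0, stride_1 = 1, rows

# Using both strides reads logical element (2, 1), whose value is 21.
correct = storage[flat_offset(2, 1, stride_0, stride_1)]
# Ignoring stride_1 (implicitly 1) lands on offset 3, which holds element (0, 1).
wrong = storage[flat_offset(2, 1, stride_0)]
print(correct, wrong)  # 21 1
```

For a contiguous row-major tensor the second stride is 1 and both expressions agree, which is why the omission can go unnoticed until a non-contiguous view is passed in.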

hiworldwzj and others added 2 commits July 1, 2026 18:46
Co-authored-by: niushengxiao <niushengxiao@sensetime.com>
Co-authored-by: niushengxiao <niushengxiao@sensetime.com>
Co-authored-by: wangzaijun <wzjhelloworld@qq.com>

8 participants