Dynamic MTP #1375
Co-authored-by: shihaobai <1798930569@qq.com>
…1265) Co-authored-by: niushengxiao <niushengxiao@sensetime.com> Co-authored-by: wangzaijun <wzjhelloworld@qq.com> Co-authored-by: hiworldwzj <30762946+hiworldwzj@users.noreply.github.com> Co-authored-by: shihaobai <1798930569@qq.com>
Co-authored-by: wzj <wzjhelloworld@qq.com>
…ax_total_token_num (#1300)
…tor + test model skills. (#1301)
#1307) Co-authored-by: wangzaijun <wzjhelloworld@qq.com>
…1344) Co-authored-by: shihaobai <1798930569@qq.com>
…ent decode (#1349) Co-authored-by: wangzaijun <wzjhelloworld@qq.com>
Code Review
This pull request introduces support for the Gemma-4 model, including multimodal vision processing, custom prefill attention with image bidirectional masking, and logit softcapping. It also refactors memory management by introducing KvCacheAllocator and decoupled operators, adds SymmMem and FlashInfer all-reduce fast paths, supports SM100 GPUs with FP4/FP8 Mega MoE, and implements an experimental Anthropic Messages API compatibility layer.
Feedback on the changes highlights several issues: a set[int](...) call in prefill_cuda_graph.py that should use the plain set constructor (subscripting set raises a TypeError before Python 3.9), potential AttributeErrors in basemodel.py and grouped_fused_moe_ep.py due to a missing attribute and an uninitialized buffer, and an ignored stride parameter in the gather_token_id.py Triton kernel.
    graph_handle_token_nums = [e for e in graph_handle_token_nums if e <= self.max_handle_token_num]
    graph_handle_token_nums.append(self.max_handle_token_num)

    graph_handle_token_nums = list(set[int](graph_handle_token_nums))
Calling the subscripted generic set[int](...) is non-idiomatic and version-dependent. On Python 3.9+ it runs, because a types.GenericAlias forwards the call to its origin class and returns a plain set, but on older versions the subscription itself raises TypeError: 'type' object is not subscriptable. It should be changed to the standard set constructor.
Suggested change:
    - graph_handle_token_nums = list(set[int](graph_handle_token_nums))
    + graph_handle_token_nums = list(set(graph_handle_token_nums))
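The filter, append, and de-duplicate steps in this hunk can be sketched as one small helper. The function name normalize_graph_token_nums is hypothetical, and the final sort is an assumption added for a deterministic order; the diff itself only shows list(set(...)).

```python
def normalize_graph_token_nums(token_nums, max_handle_token_num):
    """Cap, extend, and de-duplicate candidate capture sizes (illustrative helper)."""
    # Drop sizes above the limit, mirroring the list comprehension in the diff.
    kept = [n for n in token_nums if n <= max_handle_token_num]
    # Always include the limit itself, as the diff does with append().
    kept.append(max_handle_token_num)
    # The plain set constructor removes duplicates on every Python version.
    # Sorting is an assumption here; the original code does not show it.
    return sorted(set(kept))


sizes = normalize_graph_token_nums([1, 2, 2, 8, 16, 32], max_handle_token_num=16)
print(sizes)  # [1, 2, 8, 16]
```

Because set iteration order is not guaranteed, any later code that relies on ascending sizes would need an explicit sort like the one assumed here.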
    if self.args.enable_prefill_decode_mixed and model_input.b_is_decode_req is not None:
        gather_token_prefill_decode_mixed(
            input_ids=model_input.input_ids,
            req_to_next_token_ids=self.req_manager.req_sampling_params_manager.req_to_next_token_ids,
The ReqManager class does not define a req_sampling_params_manager attribute. Accessing self.req_manager.req_sampling_params_manager.req_to_next_token_ids will raise an AttributeError. You should access req_to_next_token_ids directly from self.req_manager.
Suggested change:
    - req_to_next_token_ids=self.req_manager.req_sampling_params_manager.req_to_next_token_ids,
    + req_to_next_token_ids=self.req_manager.req_to_next_token_ids,
    if use_sm100_mega_moe(quant_method):
        return mega_moe_impl(hidden_states, w13, w2, topk_weights, topk_idx, quant_method)

    buffer = dist_group_manager.ep_buffer if is_prefill else dist_group_manager.ep_low_latency_buffer
If is_sm100_gpu() is True but use_sm100_mega_moe(quant_method) is False (for example, when running FP8 MoE on an SM100 GPU), dist_group_manager.ep_low_latency_buffer remains None because it is only initialized when not is_sm100_gpu(). This will cause an AttributeError during decode. Consider ensuring that ep_low_latency_buffer is initialized or falling back to ep_buffer when ep_low_latency_buffer is None.
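One way to read the fallback suggested above is sketched here. FakeDistGroupManager and select_ep_buffer are hypothetical stand-ins, not the project's API, and whether ep_buffer is actually usable on the decode path is an assumption that would need checking against the real buffer types.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeDistGroupManager:
    """Hypothetical stand-in that only models the two buffer attributes."""
    ep_buffer: Optional[Any] = None
    ep_low_latency_buffer: Optional[Any] = None


def select_ep_buffer(manager, is_prefill):
    """Pick a buffer, falling back to ep_buffer when the low-latency one is unset."""
    if is_prefill:
        buffer = manager.ep_buffer
    else:
        buffer = manager.ep_low_latency_buffer
        if buffer is None:
            # Assumed fallback: reuse the normal buffer for decode.
            buffer = manager.ep_buffer
    if buffer is None:
        # Fail early with a clear message instead of an AttributeError later.
        raise RuntimeError("no expert-parallel buffer has been initialized")
    return buffer


manager = FakeDistGroupManager(ep_buffer="normal")
decode_buffer = select_ep_buffer(manager, is_prefill=False)
```

Raising an explicit error when both buffers are missing keeps the failure at the selection point rather than at the first method call on None.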
    cur_next_token_id = tl.load(
        req_to_next_token_ids + cur_req_idx * req_to_next_token_ids_stride + cur_mtp_index, mask=block_mask
    )
The stride parameter req_to_next_token_ids_stride_1 is passed to the kernel but ignored inside the load instruction, effectively hardcoding the stride of the second dimension to 1. If req_to_next_token_ids is non-contiguous, this will result in incorrect memory access.
Suggested change:
    - cur_next_token_id = tl.load(
    -     req_to_next_token_ids + cur_req_idx * req_to_next_token_ids_stride + cur_mtp_index, mask=block_mask
    - )
    + cur_next_token_id = tl.load(
    +     req_to_next_token_ids + cur_req_idx * req_to_next_token_ids_stride + cur_mtp_index * req_to_next_token_ids_stride_1, mask=block_mask
    + )
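The effect of dropping the second stride can be shown without Triton. The sketch below models memory as a flat Python list and uses element strides, the same convention as the kernel arguments; load_2d and the sample data are illustrative only.

```python
def load_2d(flat, row, col, stride_0, stride_1):
    """Read element (row, col) of a strided 2D view over a flat buffer."""
    return flat[row * stride_0 + col * stride_1]


# A 2x4 row-major buffer: row 0 is 10..13, row 1 is 20..23.
flat = [10, 11, 12, 13, 20, 21, 22, 23]

# A non-contiguous view that keeps every second column:
# logical shape (2, 2), element strides (4, 2).
stride_0, stride_1 = 4, 2

# Using both strides reads the intended element of the view.
correct = load_2d(flat, 1, 1, stride_0, stride_1)

# Hardcoding the second stride to 1 reads a neighbouring, wrong element.
wrong = load_2d(flat, 1, 1, stride_0, 1)

print(correct, wrong)  # 22 21
```

For a contiguous tensor stride_1 equals 1 and both forms agree, which is why the bug only appears with non-contiguous inputs.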
No description provided.