vLLM Integrates aiter Backend for NVFP4 MOE on AMD gfx950 (MI350/MI355), Delivering 30-40% End-to-End Speedup | thinkgap

Repos | Source: GitHub | July 3, 2026 | Importance: 4/5

English summary

This pull request adds an aiter backend to vLLM to accelerate NVFP4 Mixture of Experts (MOE) inference on AMD gfx950 GPUs (MI350/MI355). It integrates the 2-stage fused NVFP4 MOE implementation from the ROCm/aiter project, enabled with the --moe-backend aiter flag. Benchmarks show a roughly 2-3x speedup on the fused MOE operation and a 30-40% improvement in end-to-end throughput for NVFP4 MOE models such as Qwen3-30B-A3B-FP4. The implementation still uses BF16 MFMA instructions and depends on ROCm/aiter PR #4021 being merged upstream.
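As a rough illustration of how the flag might be passed, the sketch below only assembles a `vllm serve` command line with `--moe-backend aiter`; it does not start a server. The model string is a placeholder, since the summary names the model (Qwen3-30B-A3B-FP4) but not a full checkpoint path, and the helper name `build_serve_command` is hypothetical.

```python
import shlex

# Placeholder: the summary names Qwen3-30B-A3B-FP4 but gives no full
# repository path, so substitute the NVFP4 MOE checkpoint you actually use.
MODEL = "<path-or-repo-id-of-an-NVFP4-MOE-checkpoint>"


def build_serve_command(model: str, moe_backend: str = "aiter") -> list[str]:
    """Return the argv for `vllm serve` with the MOE backend flag from the PR."""
    return ["vllm", "serve", model, "--moe-backend", moe_backend]


cmd = build_serve_command(MODEL)
# shlex.join quotes the arguments so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Per the summary, the flag is expected to take effect only on gfx950 hardware and only once the aiter dependency is available, so treat the printed command as a starting point rather than a verified invocation.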

Key points

Adds an aiter backend (--moe-backend aiter) for NVFP4 MOE on AMD gfx950 (MI350/MI355) in vLLM.
Provides a ~2-3x speedup on fused NVFP4 MOE kernels and a 30-40% end-to-end speedup for NVFP4 MOE models.
The implementation relies on BF16 MFMA and requires ROCm/aiter PR #4021 to be merged.
Benchmarks show higher throughput (e.g., output token throughput rising from 713 to 1057 tok/s at high concurrency) and lower latency.
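The figures above mix kernel-level and end-to-end numbers, so a small back-of-the-envelope calculation may help relate them. The sketch below uses only the quoted 713 and 1057 tok/s values for the throughput arithmetic: that single data point works out to roughly 48% higher throughput, or equivalently about 33% less time per output token. The Amdahl's-law step is an assumption of this example, not something the pull request reports; the inputs 1.35x and 2.5x are simply midpoints of the quoted ranges.

```python
def throughput_gain(before: float, after: float) -> float:
    """Fractional increase in throughput (tokens per second)."""
    return after / before - 1.0


def time_reduction(before: float, after: float) -> float:
    """Fractional decrease in time per token for the same throughput change."""
    return 1.0 - before / after


def implied_moe_fraction(e2e_speedup: float, kernel_speedup: float) -> float:
    """Invert Amdahl's law, S = 1 / ((1 - f) + f / k), to estimate f,
    the share of baseline runtime spent in the accelerated kernel."""
    return (1.0 - 1.0 / e2e_speedup) / (1.0 - 1.0 / kernel_speedup)


gain = throughput_gain(713, 1057)
cut = time_reduction(713, 1057)
# Assumed midpoints of the quoted ranges: 30-40% end-to-end, 2-3x on the kernel.
fraction = implied_moe_fraction(e2e_speedup=1.35, kernel_speedup=2.5)

print(f"throughput gain: {gain:.1%}")
print(f"time-per-token reduction: {cut:.1%}")
print(f"implied fused-MOE share of baseline runtime: {fraction:.1%}")
```

Under these assumed midpoints the fused MOE kernel would account for a bit under half of the baseline runtime, which is one plausible way a 2-3x kernel speedup could yield a 30-40% end-to-end gain.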