vLLM Integrates aiter Backend for NVFP4 MOE on AMD gfx950 (MI350/MI355), Delivering 30-40% End-to-End Speedup
Summary
This pull request adds support for the aiter backend in vLLM to accelerate NVFP4 Mixture of Experts (MOE) inference on AMD's gfx950 GPUs (MI350/MI355). It incorporates the 2-stage fused NVFP4 MOE implementation from the ROCm/aiter project, enabled with the --moe-backend aiter flag. Benchmarks show an approximately 2-3x speedup on the fused MOE operation and a 30-40% improvement in end-to-end throughput for NVFP4 MOE models such as Qwen3-30B-A3B-FP4. The implementation still uses BF16 MFMA instructions, and it depends on ROCm/aiter PR #4021 being merged upstream.
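As a rough illustration of how the flag might be passed, the sketch below assembles a launch command in Python. It assumes the standard `vllm serve` entry point and the `--moe-backend aiter` flag named in this PR. The helper `build_serve_command` and the model placeholder are hypothetical, and the flag is only expected to work on a build that includes this change together with ROCm/aiter PR #4021.

```python
import shlex


def build_serve_command(model: str, moe_backend: str = "aiter") -> list[str]:
    """Return an argv list that launches vLLM with the chosen MOE backend.

    `model` is a placeholder for wherever the NVFP4 checkpoint lives;
    `--moe-backend` is the flag described in this PR.
    """
    return ["vllm", "serve", model, "--moe-backend", moe_backend]


# Hypothetical model location; substitute the real path or repository id.
cmd = build_serve_command("<path-or-repo-of-Qwen3-30B-A3B-FP4>")

# Print a shell-quoted form that can be pasted into a terminal.
print(shlex.join(cmd))
```

Building the command as a list keeps it suitable for `subprocess.run(cmd)` without shell quoting concerns.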
Key points
Adds the aiter backend (--moe-backend aiter) for NVFP4 MOE on AMD gfx950 (MI350/MI355) in vLLM.
Provides a ~2-3x speedup on the fused NVFP4 MOE kernels and a 30-40% end-to-end speedup for NVFP4 MOE models.
The implementation still relies on BF16 MFMA instructions and requires ROCm/aiter PR #4021 to be merged.
Benchmarks show higher throughput (for example, output token throughput rises from 713 to 1057 tok/s at high concurrency) and lower latency.
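The high-concurrency figures above can be read two ways, and a short calculation makes the difference explicit. The sketch below uses only the 713 and 1057 tok/s values quoted in the key points; the function names are illustrative. The summary does not say which of the two measures the 30-40% figure refers to, so treat this as a reading aid rather than a restatement of the PR's methodology.

```python
def throughput_gain(before_tok_s: float, after_tok_s: float) -> float:
    """Fractional increase in tokens per second."""
    return after_tok_s / before_tok_s - 1.0


def time_per_token_reduction(before_tok_s: float, after_tok_s: float) -> float:
    """Fractional decrease in time spent per output token."""
    return 1.0 - before_tok_s / after_tok_s


baseline_tok_s = 713.0   # output token throughput without the aiter backend
aiter_tok_s = 1057.0     # output token throughput with --moe-backend aiter

gain = throughput_gain(baseline_tok_s, aiter_tok_s)
reduction = time_per_token_reduction(baseline_tok_s, aiter_tok_s)

print(f"throughput gain: {gain:.1%}")                   # 48.2%
print(f"time-per-token reduction: {reduction:.1%}")     # 32.5%
```

For this one data point, throughput rises by about 48%, which is the same as saying each output token takes about 33% less time.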