ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving
Chinese title: ELDR:面向预填-解码分离式混合专家模型服务的专家局部性感知解码路由
English abstract
ELDR is a decode router for prefill-decode (PD) disaggregated serving of mixture-of-experts (MoE) models. It targets the latency differences between decode workers that arise when batches activate different sets of experts. ELDR builds an expert signature from a request's prefill activations to predict which experts the request will use during generation. Offline, balanced K-means partitions the signature space across decode workers; online, a locality-band policy routes each request to the least-loaded worker among those that best match its signature. A signature cache, co-indexed with the KV cache at KV-block granularity, keeps signatures exact under prefix caching. Implemented in vLLM and evaluated on up to 40 GPUs across three MoE models and two workloads, ELDR reduces median time-per-output-token (TPOT) by 5.9–13.9% relative to the strongest of four load-balancing baselines while leaving model outputs unchanged.
Chinese abstract
ELDR 是一种面向预填-解码分离式混合专家模型服务的解码路由器,解决了因每批次不同的专家激活模式导致的延迟差异问题。它根据请求的预填激活构建专家签名,预测生成阶段将要激活的专家,然后通过离线平衡 K-means 将签名空间划分到解码节点上,并采用在线局部性带 (locality-band) 策略,将每个请求路由到与其签名最匹配的若干节点中负载最低的一个。签名缓存以 KV 块粒度与 KV 缓存协同索引,保证了前缀缓存下的精确签名。在 vLLM 中实现并在最多 40 块 GPU 上对三个混合专家模型和两种工作负载进行了评估,ELDR 将中位单 Token 生成时间 (TPOT) 比四种负载均衡基线中最优者降低了 5.9% 至 13.9%,且模型输出保持不变。
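The online routing step described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the signature is taken to be a normalized histogram of the expert ids activated during prefill, the match score is cosine similarity against one centroid per decode worker, and the "locality band" is the set of workers whose score lies within a hypothetical tolerance `band` of the best score. The offline balanced K-means step is not shown; its centroids are assumed to be given.

```python
import math
from collections import Counter

def expert_signature(prefill_expert_ids, num_experts):
    """Assumed signature form: normalized histogram of prefill expert activations."""
    counts = Counter(prefill_expert_ids)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_experts
    return [counts.get(e, 0) / total for e in range(num_experts)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def route(signature, centroids, loads, band=0.1):
    """Pick the least-loaded worker among those whose match score is within
    `band` of the best score (hypothetical definition of the locality band)."""
    scores = [cosine(signature, c) for c in centroids]
    best = max(scores)
    candidates = [w for w, s in enumerate(scores) if s >= best - band]
    # Ties on load are broken in favor of the better-matching worker.
    return min(candidates, key=lambda w: (loads[w], -scores[w]))
```

With `band=0.0` this sketch degenerates to pure best-match routing, and with a band wide enough to admit every worker it degenerates to plain least-loaded routing, which is one way to see the policy as a trade-off between locality and load balance.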
Key points
Existing load-balancing decode routers for PD-disaggregated MoE serving ignore expert locality, leading to latency variance between equally loaded workers.
现有的面向预填-解码分离式混合专家模型服务的负载均衡解码路由器忽略了专家局部性,导致负载相同的节点间存在延迟差异。
ELDR uses a request's prefill expert activations to form an expert signature that predicts which experts will be activated during decoding.
ELDR 利用请求的预填专家激活形成专家签名,预测解码阶段将激活哪些专家。
Offline balanced K-means partitions signature space across workers; online locality-band routing sends each request to the least-loaded worker among those best matching its signature.
离线阶段用平衡 K-means 将签名空间划分到各节点;在线阶段采用局部性带 (locality-band) 路由,将请求发送到与其签名最匹配的若干节点中负载最低的一个。
A signature cache co-indexed with the KV cache at block granularity preserves exact signatures when prefix caching is used.
签名缓存以块粒度与 KV 缓存协同索引,确保使用前缀缓存时签名仍保持精确。
Implemented in vLLM and tested on up to 40 GPUs with three MoE models and two workloads, ELDR reduces median TPOT by 5.9–13.9% relative to the best load-balancing baseline while producing identical outputs.
在 vLLM 中实现并在最多 40 块 GPU、三个混合专家模型和两种工作负载上测试,中位 TPOT 较最优负载均衡基线降低 5.9%–13.9%,模型输出完全一致。
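To make the signature-cache point concrete, here is a minimal Python sketch of one way such a cache could work. It rests on assumptions not stated in the summary: that a prefix-cache hit means the cached tokens are not run through the experts again, so their activation counts must be stored per KV block and added back; that blocks are keyed by a chained hash over the whole prefix; and that `run_prefill` is a hypothetical callback returning, for each uncached token, the list of expert ids it activated. The class and function names are illustrative and are not vLLM APIs.

```python
import hashlib
from collections import Counter

BLOCK_SIZE = 4  # tokens per KV block (illustrative value only)

def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Chained hash per full block, so each key identifies the entire prefix."""
    keys, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        keys.append(parent)
    return keys

class SignatureCache:
    """Maps a block key to the expert-activation counts of that block's tokens."""
    def __init__(self):
        self._counts = {}

    def put(self, key, expert_counts):
        self._counts[key] = Counter(expert_counts)

    def get(self, key):
        return self._counts.get(key)

def signature_counts(token_ids, cache, run_prefill):
    """Return (expert counts for the full prompt, number of cached blocks reused)."""
    keys = block_keys(token_ids)
    total, hit_blocks = Counter(), 0
    for key in keys:  # reuse counts for the longest cached prefix
        cached = cache.get(key)
        if cached is None:
            break
        total += cached
        hit_blocks += 1
    # Only the uncached suffix is prefilled (hypothetical callback).
    fresh = run_prefill(token_ids[hit_blocks * BLOCK_SIZE:])
    for b in range(hit_blocks, len(keys)):  # store counts for newly filled blocks
        lo = (b - hit_blocks) * BLOCK_SIZE
        block_counts = Counter(e for tok in fresh[lo:lo + BLOCK_SIZE] for e in tok)
        cache.put(keys[b], block_counts)
        total += block_counts
    tail_lo = (len(keys) - hit_blocks) * BLOCK_SIZE
    total += Counter(e for tok in fresh[tail_lo:] for e in tok)  # partial last block
    return total, hit_blocks
```

Under these assumptions, a request that shares a cached prefix obtains the same counts it would have obtained from a full prefill, which is the sense in which the signature stays exact; normalizing the returned counts would then give the signature itself.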