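DeepSeek and Peking University release DSpark inference acceleration framework, raising single-user generation speed by 60-85% under high concurrency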
Summary
DeepSeek and Peking University have jointly released DSpark, an inference acceleration framework that targets LLM inference bottlenecks in high-concurrency production environments. DSpark uses a semi-autoregressive draft model that pairs a parallel trunk with lightweight sequential dependency heads (a Markov head or an RNN head). This design mitigates the decay of candidate-token acceptance rates with position seen in fully parallel draft models such as DFlash, improving both acceptance rates and verification efficiency. A confidence-based dynamic verification scheduler allocates target-model computation to the tokens with the highest global survival probability, maximizing overall throughput. Deployed in the DeepSeek-V4-Flash and V4-Pro preview services, DSpark delivers 60-85% faster single-user generation than the MTP-1 baseline. On V4-Flash, aggregate throughput improves by up to 51% at an 80 tok/s SLA, and the advantage reaches 661% under a strict 120 tok/s SLA. Training code, evaluation scripts, and model checkpoints are open-sourced on GitHub under the DeepSpec project.
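The scheduling idea in the summary can be illustrated with a minimal Python sketch. It is not DSpark's implementation: the function names, the input format, and the modeling choice that a draft token's survival probability equals the product of the draft confidences up to its position are all assumptions made for illustration. Under that assumption, a fixed global verification budget is spent on the tokens most likely to survive, regardless of which request they belong to.

```python
import heapq
from itertools import accumulate
from operator import mul


def survival_probs(confidences):
    """Prefix products: token k survives only if tokens 1..k are all accepted.

    Assumes (hypothetically) that per-token draft confidences approximate
    independent acceptance probabilities.
    """
    return list(accumulate(confidences, mul))


def allocate_verification(drafts, budget):
    """Pick how many draft tokens to verify per request under a global budget.

    drafts: dict mapping request id -> list of per-token confidences in (0, 1].
    budget: total number of draft tokens the target model may verify.
    Returns a dict mapping request id -> verification length.
    """
    candidates = []
    for rid, confs in drafts.items():
        for pos, p in enumerate(survival_probs(confs)):
            candidates.append((p, pos, rid))

    # Highest survival probability first; on ties prefer earlier positions so
    # that each request's selected tokens always form a contiguous prefix.
    chosen = heapq.nlargest(budget, candidates, key=lambda c: (c[0], -c[1]))

    lengths = {rid: 0 for rid in drafts}
    for _, _, rid in chosen:
        lengths[rid] += 1
    return lengths


if __name__ == "__main__":
    drafts = {"a": [0.9, 0.9, 0.9], "b": [0.5, 0.5, 0.5]}
    print(allocate_verification(drafts, budget=4))
    print(allocate_verification(drafts, budget=2))
```

Because survival probabilities never increase along a draft, a global top-k selection automatically yields a prefix for each request. A smaller budget, as would apply under heavier load, shortens the verified prefixes of low-confidence requests first.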
Key Points
DSpark is a joint release by DeepSeek and Peking University that targets LLM inference acceleration in high-concurrency production scenarios.
It uses a semi-autoregressive draft model that combines a parallel trunk with sequential dependency heads, and it achieves longer acceptance lengths than both the autoregressive Eagle3 baseline and the parallel DFlash baseline on math, code, and chat benchmarks.
A confidence-driven dynamic verification scheduler adaptively decides how many tokens to verify for each request, shortening the verification length under high load to avoid resource contention, so that global throughput is maximized while generation quality is maintained.
In production deployments on DeepSeek-V4-Flash and V4-Pro, single-user generation is 60-85% faster than the MTP-1 baseline; on V4-Flash under a 120 tok/s SLA, the throughput advantage reaches 661%.
Training code, evaluation scripts, and model checkpoints for DSpark, DFlash, and Eagle3 are open-sourced on GitHub under the DeepSpec project.
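To see why acceptance rates that decay with draft position matter, the following sketch computes the expected number of tokens emitted per verification step under the standard speculative-decoding accounting: one token always comes from the target model, plus every draft token whose whole prefix was accepted. The per-position rates are illustrative numbers, not measurements of DSpark, DFlash, or Eagle3, and treating positions as independent is a simplifying assumption.

```python
from itertools import accumulate
from operator import mul


def expected_tokens_per_step(accept_rates):
    """Expected tokens emitted per target-model verification step.

    accept_rates[k] is the (assumed independent) probability that the draft
    token at position k is accepted, given that all earlier ones were.
    The leading 1.0 is the token the target model always contributes.
    """
    return 1.0 + sum(accumulate(accept_rates, mul))


if __name__ == "__main__":
    flat = [0.8, 0.8, 0.8, 0.8]      # hypothetical: rate holds across positions
    decaying = [0.8, 0.7, 0.6, 0.5]  # hypothetical: rate decays with position
    print(round(expected_tokens_per_step(flat), 4))
    print(round(expected_tokens_per_step(decaying), 4))
```

With the same first-position rate, the decaying profile yields fewer tokens per step, which is the gap that a draft model with sequential dependency heads aims to narrow.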