Self-Hosted Gemma 2 9B: FP8 Quantization Imposes 58% Prefill Latency Penalty on NVIDIA L4 but Improves Decoding and Frees VRAM
English summary
A real-world evaluation for a resume generation platform compared unquantized Gemma 2 9B with an FP8-quantized variant, both served via vLLM on a single NVIDIA L4 GPU. Under FP8, Time to First Token (TTFT) for long-context prompts rose from 867 ms to 1,372 ms, a 58% penalty attributed to dequantization overhead in the compute-bound prefill phase; an extreme short-context spike reached 1,740 ms. End-to-end latency for medium-length generations improved from 12,290 ms to 11,526 ms (about 6%), because FP8 accelerates the memory-bandwidth-bound decoding loop. Quality was effectively unchanged, with negligible semantic drift across persona-specific resume tailoring tasks. The main practical gain from FP8 is freed VRAM, which leaves more room for the KV cache and allows higher concurrency on the L4. FP8 is therefore recommended for asynchronous or short-to-medium-context workloads, while the unquantized model is preferred for interactive, long-input scenarios.
Chinese summary
一项基于简历生成平台的实际评测对比了在单张 NVIDIA L4 GPU 上通过 vLLM 服务未量化与 FP8 量化版 Gemma 2 9B 的表现。FP8 量化工况下,长文本首个 token 延迟(TTFT)从 867 毫秒升至 1372 毫秒,增加了 58%,源于预填充阶段的计算密集反量化开销;短上下文中曾出现 1740 毫秒的极端尖峰。中等长度生成的总端到端延迟则从 12,290 毫秒降至 11,526 毫秒,因为 FP8 加速了受内存带宽限制的解码循环。质量几乎无下降,在特定人物简历适配中语义漂移可忽略。FP8 的主要收益是释放显存,从而在 L4 上提升 KV 缓存利用率和并发数,适合异步或短中上下文任务;而交互式长输入场景应优先使用未量化模型。
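The percentages quoted in the summary follow directly from the reported latencies. The short Python sketch below reproduces that arithmetic; the helper name `pct_change` is illustrative and not part of the evaluation itself.

```python
def pct_change(baseline_ms: float, new_ms: float) -> float:
    """Relative change from the baseline in percent (positive means slower)."""
    return (new_ms - baseline_ms) / baseline_ms * 100.0


# Reported measurements in milliseconds: unquantized baseline vs FP8.
ttft_delta = pct_change(866.93, 1372.12)   # long-context Time to First Token
e2e_delta = pct_change(12290.0, 11526.0)   # medium-length end-to-end latency

print(f"TTFT change:       {ttft_delta:+.1f}%")   # +58.3%
print(f"End-to-end change: {e2e_delta:+.1f}%")    # -6.2%
```

The prefill penalty is large in relative terms (about 505 ms on the first token), while the end-to-end gain is modest (about 764 ms over the whole response), which is why the recommendation depends on whether first-token responsiveness or total completion time matters more.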
Key points
FP8 quantization on an NVIDIA L4 increases Time to First Token by 58% (from 866.93 ms to 1,372.12 ms) for complex long-context prompts, attributed to dequantization overhead in the compute-bound prefill phase.
在 NVIDIA L4 上,FP8 量化使复杂长上下文提示的首个 token 时间增加 58%(866.93 毫秒对 1372.12 毫秒),原因是计算密集型预填充阶段的反量化开销。
End-to-end generation latency for medium-length sequences improved from 12,290 ms to 11,526 ms (about 6%) with FP8, thanks to faster memory-bandwidth-bound decoding.
借助 FP8 加快受内存带宽限制的解码过程,中等序列的端到端生成延迟从 12,290 毫秒改善至 11,526 毫秒。
FP8 quantization introduced negligible semantic drift for the domain-specific resume tailoring task, maintaining formatting and persona fidelity.
在特定领域的简历适配任务中,FP8 量化带来的语义漂移几乎可忽略,保持了格式和人物特征的高度一致。
The primary practical gain of FP8 is freed VRAM: smaller weights leave more memory for the KV cache, which supports higher concurrency on an L4. Unquantized models are recommended for interactive long-input applications.
FP8 的主要实际收益是释放显存,为 KV 缓存腾出空间以在 L4 上支持更高并发;对于交互式长输入应用,建议使用未量化模型。
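To illustrate the VRAM point, here is a rough back-of-the-envelope sketch in Python. All inputs are assumptions rather than figures from the evaluation: roughly 9.2 billion parameters for Gemma 2 9B, 2 bytes per parameter for the unquantized model (assuming BF16 weights), 1 byte per parameter for FP8, and 24 GB of memory on the L4. It counts weights only and ignores activations, the CUDA context, and any memory cap set in the serving framework, so real headroom will be smaller.

```python
def kv_headroom_gb(params_billion: float, bytes_per_param: float,
                   vram_gb: float = 24.0) -> float:
    """VRAM left after loading weights only, in GB (1 GB = 1e9 bytes).

    Assumption-laden estimate: ignores activations, CUDA context, and
    framework overhead, so it is an upper bound on KV cache space.
    """
    weights_gb = params_billion * bytes_per_param
    return vram_gb - weights_gb


# Assumed values (not from the evaluation): ~9.2B parameters, 24 GB L4.
bf16_headroom = kv_headroom_gb(9.2, 2.0)   # unquantized, assuming BF16 weights
fp8_headroom = kv_headroom_gb(9.2, 1.0)    # FP8 weights

print(f"Headroom with BF16 weights: {bf16_headroom:.1f} GB")
print(f"Headroom with FP8 weights:  {fp8_headroom:.1f} GB")
```

Under these assumptions, halving the bytes per parameter more than doubles the memory left over for the KV cache, which is consistent with the higher concurrency described above.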