Interfaze Ships diffusion-gemma-asr-small, an Open-Source Diffusion ASR Model Transcribing Six Languages via DiffusionGemma’s Parallel Denoising Decoder
Summary
Interfaze has open-sourced diffusion-gemma-asr-small, the first multilingual diffusion-based ASR model. The system trains only a 42M-parameter adapter on top of a frozen Whisper-small encoder and Google’s frozen 26B DiffusionGemma backbone, and it decodes with a parallel denoising decoder instead of autoregressive generation. CTC-aided training is used to overcome convergence issues, and a single adapter transcribes English, German, French, Spanish, Hindi, and Mandarin. At 16 denoising steps, the model reaches 6.6% WER on LibriSpeech test-clean, ahead of other diffusion ASR models but behind autoregressive Whisper. Transcription cost is governed by the number of denoising steps rather than by audio length, and decoding converges in roughly 8 parallel passes.
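To make the idea of step-bounded decoding concrete, here is a minimal toy sketch of parallel denoising over a fixed-length canvas. It is not Interfaze's implementation: the summary does not describe the unmasking schedule, so the confidence-ranked commit rule below is an assumption, and `toy_denoiser`, `denoise`, the `<mask>` and `<pad>` tokens, and the canvas length of 32 are all hypothetical names and values chosen for illustration. The stand-in denoiser simply reveals a known target with random confidences, where a real model would condition on audio features.

```python
import random

MASK = "<mask>"
PAD = "<pad>"


def toy_denoiser(canvas, padded_target, rng):
    """Hypothetical stand-in for the diffusion backbone.

    For every masked slot it returns a (token, confidence) guess. A real
    model would predict tokens from audio features; here we reveal the
    known target with a random confidence so the loop can be exercised.
    """
    return {
        i: (padded_target[i], rng.random())
        for i, tok in enumerate(canvas)
        if tok == MASK
    }


def denoise(target, canvas_len, num_steps, seed=0):
    """Fill a fixed-length canvas in at most `num_steps` parallel passes.

    Returns the decoded tokens and the number of denoiser passes used.
    Assumed schedule: commit the most confident ceil(masked / steps_left)
    slots after each pass, so the canvas is full by the final step.
    """
    if len(target) > canvas_len:
        raise ValueError("target does not fit on the canvas")
    rng = random.Random(seed)
    padded = list(target) + [PAD] * (canvas_len - len(target))
    canvas = [MASK] * canvas_len
    passes = 0
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        preds = toy_denoiser(canvas, padded, rng)  # one pass over all slots
        passes += 1
        steps_left = num_steps - step
        k = -(-len(masked) // steps_left)  # ceiling division
        most_confident = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            canvas[i] = preds[i][0]
    return [tok for tok in canvas if tok != PAD], passes


if __name__ == "__main__":
    short_text, short_passes = denoise("hello world".split(), 32, 8)
    long_text, long_passes = denoise(
        "the quick brown fox jumps over the lazy dog again and again".split(), 32, 8
    )
    print(short_passes, long_passes)
```

In this toy, a two-word transcript and a twelve-word transcript use the same number of passes, because the loop always works on the whole canvas. An autoregressive decoder would instead need one step per output token.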
Key points
diffusion-gemma-asr-small is the first open-source multilingual diffusion ASR model, released by Interfaze.
It combines a frozen Whisper-small encoder, a 19M-parameter trainable projector, and a frozen 26B DiffusionGemma backbone; only 42M adapter parameters in total are trainable.
Transcription is performed by denoising a fixed-length text canvas in parallel, so decoding cost does not depend on transcript length.
With 16 denoising steps, it achieves 6.6% WER on LibriSpeech test-clean, the best among diffusion ASR models but behind autoregressive Whisper.
A single adapter handles six languages: English, German, French, Spanish, Hindi, and Mandarin.
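For readers unfamiliar with the WER figure quoted above, the following is a minimal sketch of the standard word error rate definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate` is illustrative, and the text normalization applied in the reported LibriSpeech evaluation is not stated in the source, so none is applied here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a 6.6% WER corresponds to roughly one word-level error for every fifteen reference words.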