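Real-time speech model release
Why the AI recommends this
At $2 per 1,000 minutes, it has the lowest unit price among comparable proprietary models, a significant cost advantage.
Summary
Soniox has released Soniox v5 Real-Time, a real-time speech-to-text model. On the AA-WER Streaming benchmark its First Final WER is 4.5% at 0.05 s latency. It supports 60+ languages and costs $2 per 1,000 minutes, below every proprietary model tested. The model sits in the middle of the Pareto frontier: more accurate than the faster Deepgram Flux, and faster than the more accurate ElevenLabs Scribe v2 Realtime.
Full text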
Soniox has released Soniox v5 Real-Time: a low-latency streaming Speech to Text model on the Pareto frontier for accuracy and latency, at the lowest price of any proprietary model tested
Soniox v5 Real-Time is @soniox_ai's latest streaming Speech to Text (STT) model, joining Soniox v5 Async, their non-streaming model released last week. On AA-WER Streaming it occupies the middle of the Pareto frontier: faster than the most accurate models (Cartesia Ink-2, ElevenLabs Scribe v2 Realtime) and more accurate than the fastest (Deepgram Flux, Nova-3), at a lower price than all of them.
AA-WER Streaming Overview
AA-WER Streaming reports WER and latency as a pair, measured from Silero VAD-detected end of speech on the same ~8 hours of audio as our non-streaming STT benchmark, AA-WER v2.0. We report both at two points: First Final (first final-denoted transcript, best for accuracy) and First Partial (first transcript-bearing event, best for when speed matters most).
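For readers unfamiliar with the metric, here is a minimal sketch of the standard word error rate definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions for illustration only. The text normalization AA-WER actually applies is not described in this post, so treat this as the textbook formula rather than the benchmark's implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Textbook WER: word-level edit distance / number of reference words.

    Normalization here (lowercase + whitespace split) is an assumption,
    not the benchmark's own normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A 4.5% WER under this definition corresponds to roughly 45 word-level errors per 1,000 reference words.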
Key takeaways
➤ First Final Transcription: Soniox v5 Real-Time achieves a 4.5% WER at 0.05s after end of speech, more accurate than the faster Deepgram Flux (7.4%, 0.02s) and Deepgram Nova-3 Realtime (6.7%, 0.06s), and faster than the more accurate Cartesia Ink-2 external endpoints (3.7%, 0.09s) and ElevenLabs Scribe v2 Realtime (3.6%, 0.14s)
➤ First Partial Transcription: The model achieves a 4.7% WER at 0.05s after end of speech, behind only Cartesia Ink-2 external endpoints (4.3%, 0.07s) and ElevenLabs Scribe v2 Realtime (3.6%, 0.13s) on accuracy, while faster than both
➤ Price: The model costs $2 per 1,000 minutes, the lowest price of any proprietary streaming model tested, below Cartesia Ink-2 ($4), Deepgram Nova-3 Realtime ($4.80) and ElevenLabs Scribe v2 Realtime ($6.50)
➤ Language support: The model supports over 60 languages, providing language identification and real-time translation across multilingual conversations.
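The Pareto frontier claim can be checked against the First Final figures quoted above. The sketch below uses only the five (WER, latency) pairs from this post, so it is illustrative: the full benchmark covers more models, and the names `Result`, `dominates` and `pareto_frontier` are hypothetical helpers, not part of any Artificial Analysis tooling. A model is on the frontier if no other model is at least as good on both WER and latency and strictly better on one.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    wer_pct: float    # First Final WER, lower is better
    latency_s: float  # seconds after end of speech, lower is better


# First Final figures as quoted in this post (subset of the benchmark).
FIRST_FINAL = [
    Result("Soniox v5 Real-Time", 4.5, 0.05),
    Result("Deepgram Flux", 7.4, 0.02),
    Result("Deepgram Nova-3 Realtime", 6.7, 0.06),
    Result("Cartesia Ink-2 external endpoints", 3.7, 0.09),
    Result("ElevenLabs Scribe v2 Realtime", 3.6, 0.14),
]


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.wer_pct <= b.wer_pct and a.latency_s <= b.latency_s
    strictly_better = a.wer_pct < b.wer_pct or a.latency_s < b.latency_s
    return no_worse and strictly_better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Return the non-dominated results, ordered from fastest to slowest."""
    frontier = [
        r for r in results
        if not any(dominates(other, r) for other in results)
    ]
    return sorted(frontier, key=lambda r: r.latency_s)


if __name__ == "__main__":
    for r in pareto_frontier(FIRST_FINAL):
        print(f"{r.model}: {r.wer_pct}% WER at {r.latency_s}s")
```

Among these five models, Deepgram Nova-3 Realtime is the only one dominated (Soniox v5 Real-Time is both more accurate and faster), and Soniox v5 Real-Time falls between Deepgram Flux and the two more accurate models when the frontier is ordered by latency.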
See more details below ⬇️

Artificial Analysis (@ArtificialAnlys): Soniox v5 Real-Time is available for $2 per 1,000 minutes of audio via the Soniox console. https://t.co/fUEyXwrk0l
Artificial Analysis (@ArtificialAnlys): Full results: https://t.co/wDb6a2nhqV
Methodology: https://t.co/ePPoyfUXXm