AI Curated Updates · Smart score: 68

Real-Time Speech Model Release

Source: Twitter following list

Author: Artificial Analysis (@ArtificialAnlys)

Published: 2026-06-17

Collected: 2026-06-17

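AI recommendation rationale

At $2 per 1,000 minutes, it has the lowest unit price of any proprietary model tested, a significant cost advantage.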

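Core analysis

Soniox has released Soniox v5 Real-Time, a real-time speech-to-text model. On the AA-WER Streaming benchmark its First Final WER is 4.5% at 0.05 s after end of speech. It supports 60+ languages and costs $2 per 1,000 minutes, below every proprietary model tested. It sits in the middle of the Pareto frontier: more accurate than the fastest models (such as Deepgram Flux) and faster than the most accurate ones (such as ElevenLabs Scribe v2 Realtime).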

Full text

Soniox has released Soniox v5 Real-Time: a low latency streaming Speech to Text model on the Pareto frontier for accuracy and latency, at the lowest price of any proprietary model tested

Soniox v5 Real-Time is @soniox_ai's latest streaming Speech to Text (STT) model, joining Soniox v5 Async, their non-streaming model released last week. On AA-WER Streaming it occupies the middle of the Pareto frontier: faster than the most accurate models (Cartesia Ink-2, ElevenLabs Scribe v2 Realtime) and more accurate than the fastest (Deepgram Flux, Nova-3), while at a lower price than all of them.

AA-WER Streaming Overview

AA-WER Streaming reports WER and latency as a pair, measured from Silero VAD-detected end of speech on the same ~8 hours of audio as our non-streaming STT benchmark, AA-WER v2.0. We report both at two points: First Final (first final-denoted transcript, best for accuracy) and First Partial (first transcript-bearing event, best for when speed matters most).

Key takeaways

➤ First Final Transcription: Soniox v5 Real-Time achieves a 4.5% WER at 0.05s after end of speech, more accurate than the faster Deepgram Flux (7.4%, 0.02s) and Deepgram Nova-3 Realtime (6.7%, 0.06s), and faster than the more accurate Cartesia Ink-2 external endpoints (3.7%, 0.09s) and ElevenLabs Scribe v2 Realtime (3.6%, 0.14s)

➤ First Partial Transcription: The model achieves a 4.7% WER at 0.05s after end of speech, behind only Cartesia Ink-2 external endpoints (4.3%, 0.07s) and ElevenLabs Scribe v2 Realtime (3.6%, 0.13s) on accuracy, while faster than both

➤ Price: The model costs $2 per 1,000 minutes, representing the lowest of any proprietary streaming model tested, below Cartesia Ink-2 ($4), Deepgram Nova-3 Realtime ($4.80) and ElevenLabs Scribe v2 Realtime ($6.50)

➤ Language support: The model supports over 60 languages, providing language identification and real-time translation across multilingual conversation.

See more details below ⬇️

![photo](https://pbs.twimg.com/media/HLBfdxxacAAwlGh.jpg)

Artificial Analysis (@ArtificialAnlys): Soniox v5 Real-Time is available for $2 per 1,000 minutes of audio via the Soniox console. https://t.co/fUEyXwrk0l

Artificial Analysis (@ArtificialAnlys): Full results: https://t.co/wDb6a2nhqV Methodology: https://t.co/ePPoyfUXXm
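
The Pareto-frontier claim can be checked against the First Final figures quoted in the post. The following is a minimal sketch, not Artificial Analysis's methodology: it uses only the five (WER, latency) pairs listed above, which are rounded and are a subset of the full benchmark, and the names `Result`, `dominates`, and `pareto_frontier` are illustrative.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str
    wer_pct: float    # First Final WER in percent (lower is better)
    latency_s: float  # seconds after end of speech (lower is better)


# First Final figures as quoted in the post (rounded values).
FIRST_FINAL = [
    Result("Soniox v5 Real-Time", 4.5, 0.05),
    Result("Deepgram Flux", 7.4, 0.02),
    Result("Deepgram Nova-3 Realtime", 6.7, 0.06),
    Result("Cartesia Ink-2 external endpoints", 3.7, 0.09),
    Result("ElevenLabs Scribe v2 Realtime", 3.6, 0.14),
]


def dominates(a: Result, b: Result) -> bool:
    """True if `a` is no worse than `b` on both metrics and strictly better on one."""
    no_worse = a.wer_pct <= b.wer_pct and a.latency_s <= b.latency_s
    strictly_better = a.wer_pct < b.wer_pct or a.latency_s < b.latency_s
    return no_worse and strictly_better


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Return the results that no other result dominates, sorted by latency."""
    frontier = [r for r in results if not any(dominates(o, r) for o in results)]
    return sorted(frontier, key=lambda r: r.latency_s)


if __name__ == "__main__":
    for r in pareto_frontier(FIRST_FINAL):
        print(f"{r.model}: {r.wer_pct}% WER at {r.latency_s}s")
```

With these rounded figures, Deepgram Nova-3 Realtime is the only listed model that drops off the frontier, because Soniox v5 Real-Time has both a lower WER and a lower latency. The other four remain, with Soniox v5 Real-Time between Deepgram Flux and Cartesia Ink-2 when ordered by latency. The full results may include further models that change this picture.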

#ModelRelease #Technology #IndustryNews
