Artificial Analysis releases a benchmark for Speech to Speech models
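AI recommendation rationale
The benchmark compares leading Speech to Speech models on performance, speed, and cost, which helps in choosing a model suited to a given use case.
Key points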
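Artificial Analysis has released the Speech to Speech Index, which combines three datasets (Big Bench Audio, Full Duplex Bench, and τ-Voice) to evaluate the quality of native Speech to Speech models. GPT-Realtime-2 (High) leads at 77.2%, with Grok Voice Think Fast 1.0 close behind at 75.7%; GPT-Realtime-1.5 and Gemini 3.1 Flash Live Preview (High) score 72.0% and 69.5%, respectively. On speed, Deepslate Opal is the fastest (TTFA 0.44s). On cost, Gemini 3.1 Flash Live Preview (Minimal) is the lowest ($1.50 per hour of input audio).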
Full text
Announcing the Artificial Analysis Speech to Speech Index, our new synthesis metric for native Speech to Speech model quality, comprising Big Bench Audio, Full Duplex Bench, and 𝜏-Voice.
The index provides a single measure of how well native Speech to Speech models perform, assessing Speech Reasoning (Big Bench Audio), Conversational Dynamics (Full Duplex Bench subset), and Agentic Performance (𝜏-Voice). Weighting is equal across all three datasets, and models must have valid results for all three to be included.
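The aggregation rule described above can be sketched in a few lines. This is a minimal illustration that assumes "equal weighting" means a plain arithmetic mean of the three component scores; the post does not spell out the exact formula. The function name, the dictionary keys, and the example numbers are hypothetical placeholders, not published results.

```python
from statistics import mean
from typing import Optional

# Hypothetical keys for the three components named in the post.
COMPONENTS = (
    "speech_reasoning",         # Big Bench Audio
    "conversational_dynamics",  # Full Duplex Bench subset
    "agentic_performance",      # tau-Voice
)


def speech_to_speech_index(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Return the equal-weighted mean of the three component scores (in percent).

    Returns None when any component is missing, mirroring the rule that a
    model needs valid results on all three datasets to be included.
    Assumption: equal weighting is implemented as an arithmetic mean.
    """
    values = [scores.get(name) for name in COMPONENTS]
    if any(v is None for v in values):
        return None
    return mean(values)


# Placeholder numbers for illustration only; these are not real benchmark results.
complete = {
    "speech_reasoning": 80.0,
    "conversational_dynamics": 70.0,
    "agentic_performance": 60.0,
}
incomplete = {"speech_reasoning": 80.0, "conversational_dynamics": 70.0}

print(speech_to_speech_index(complete))
print(speech_to_speech_index(incomplete))
```

Returning None for an incomplete model, rather than averaging whatever is available, keeps models with partial coverage from being ranked against models evaluated on all three datasets.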
Key takeaways
➤ Model performance: @OpenAI GPT-Realtime-2 (High) leads at 77.2%, followed by @xAI Grok Voice Think Fast 1.0 at 75.7%, GPT-Realtime-1.5 at 72.0%, and @GoogleAI Gemini 3.1 Flash Live Preview (High) at 69.5%. Conversational Dynamics and Agentic Performance are key differentiators of frontier models, with GPT-Realtime-2 leading in Conversational Dynamics, and Grok Voice Think Fast 1.0 leading in Agentic Performance.
➤ Speed: Deepslate Opal is the fastest model in the index with a TTFA of 0.44s, followed by GPT-Realtime-1.5 at 0.82s and Grok Voice Think Fast 1.0 at 1.25s. GPT-Realtime-2 (High) records 2.33s, with Gemini 3.1 Flash Live Preview (High) recording 2.98s.
➤ Cost: Gemini 3.1 Flash Live Preview (Minimal) is the lowest cost model in the index at $1.50 per hour of input audio, followed by Gemini 3.1 Flash Live Preview (High) at $1.75, Grok Voice Think Fast 1.0 at $3.00, and GPT-Realtime-2 (High) at $4.14.
➤ Datasets incorporated: Big Bench Audio - 1,000 reasoning questions across Formal Fallacies, Navigate, Object Counting, and Web of Lies; Full Duplex Bench - pause handling, turn taking, interruption and backchannel handling; 𝜏-Voice - end-to-end customer service task completion across Airline, Retail, and Telecom situations.
As always, we will continue to iterate on these benchmarks and plan to add more models.
More details below ⬇️

Artificial Analysis (@ArtificialAnlys): Full breakdown: https://t.co/Ld90Hvwwsh
Methodology: https://t.co/XcGPHYRZtO
Artificial Analysis (@ArtificialAnlys): Gemini 3.1 Flash Live Preview (Minimal) has the lowest cost per hour of input audio in the index at $1.50, scoring 56.6%. Gemini 3.1 Flash Live Preview (High) costs $1.75 at 69.5%, Grok Voice Think Fast 1.0 costs $3.00 at 75.7%, and GPT-Realtime-2 (High) costs $4.14 at 77.2%. https://t.co/zH0wYis3GQ
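One rough way to read the cost and score figures quoted above together is index points per dollar of input audio per hour. The sketch below uses only the four cost and score pairs stated in the thread. The "points per dollar" ratio is a hypothetical helper for illustration, not a metric Artificial Analysis reports, and it ignores latency and the per-dataset breakdown.

```python
# (cost in USD per hour of input audio, index score in percent), as quoted in the thread.
MODELS = {
    "Gemini 3.1 Flash Live Preview (Minimal)": (1.50, 56.6),
    "Gemini 3.1 Flash Live Preview (High)": (1.75, 69.5),
    "Grok Voice Think Fast 1.0": (3.00, 75.7),
    "GPT-Realtime-2 (High)": (4.14, 77.2),
}


def points_per_dollar(cost_usd: float, score_pct: float) -> float:
    """Hypothetical value ratio: index points per dollar of hourly input-audio cost."""
    return score_pct / cost_usd


# Sort from best to worst ratio.
ranked = sorted(
    MODELS.items(),
    key=lambda item: points_per_dollar(*item[1]),
    reverse=True,
)

for name, (cost, score) in ranked:
    ratio = points_per_dollar(cost, score)
    print(f"{name}: {ratio:.1f} points per dollar (${cost:.2f}/h, {score:.1f}%)")
```

A ratio like this favors cheaper models by construction, so it is best read alongside the absolute scores rather than as a ranking on its own.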