
Artificial Analysis releases a Speech to Speech model benchmark

Source: Twitter following list

Author: Artificial Analysis (@ArtificialAnlys)

Published: 2026-06-23

Added: 2026-06-23

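AI recommendation rationale

The benchmark compares leading speech-to-speech models on performance, speed, and cost, which helps in choosing a model suited to a given use case.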

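Key points

Artificial Analysis has released the Speech to Speech Index, which combines three datasets (Big Bench Audio, Full Duplex Bench, and τ-Voice) to evaluate the quality of native speech-to-speech models. GPT-Realtime-2 (High) leads at 77.2%, closely followed by Grok Voice Think Fast 1.0 at 75.7%; GPT-Realtime-1.5 and Gemini 3.1 Flash Live Preview (High) score 72.0% and 69.5%, respectively. On speed, Deepslate Opal is the fastest (TTFA 0.44s); on cost, Gemini 3.1 Flash Live Preview (Minimal) is the cheapest ($1.50 per hour of input audio).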

Full text

Announcing the Artificial Analysis Speech to Speech Index, our new synthesis metric for native Speech to Speech model quality, comprising Big Bench Audio, Full Duplex Bench, and 𝜏-Voice.

The index provides a single measure of how well native Speech to Speech models perform, assessing Speech Reasoning (Big Bench Audio), Conversational Dynamics (Full Duplex Bench subset), and Agentic Performance (𝜏-Voice). Weighting is equal across all three datasets, and models must have valid results for all three to be included.

Key takeaways

➤ Model performance: @OpenAI GPT-Realtime-2 (High) leads at 77.2%, followed by @xAI Grok Voice Think Fast 1.0 at 75.7%, GPT-Realtime-1.5 at 72.0%, and @GoogleAI Gemini 3.1 Flash Live Preview (High) at 69.5%. Conversational Dynamics and Agentic Performance are key differentiators of frontier models, with GPT-Realtime-2 leading in Conversational Dynamics and Grok Voice Think Fast 1.0 leading in Agentic Performance.

➤ Speed: Deepslate Opal is the fastest model in the index with a TTFA of 0.44s, followed by GPT-Realtime-1.5 at 0.82s and Grok Voice Think Fast 1.0 at 1.25s. GPT-Realtime-2 (High) records 2.33s, and Gemini 3.1 Flash Live Preview (High) records 2.98s.

➤ Cost: Gemini 3.1 Flash Live Preview (Minimal) is the lowest-cost model in the index at $1.50, followed by Gemini 3.1 Flash Live Preview (High) at $1.75, Grok Voice Think Fast 1.0 at $3.00, and GPT-Realtime-2 (High) at $4.14.

➤ Datasets incorporated: Big Bench Audio - 1,000 reasoning questions across Formal Fallacies, Navigate, Object Counting, and Web of Lies; Full Duplex Bench - pause handling, turn taking, interruption and backchannel handling; 𝜏-Voice - end-to-end customer service task completion across Airline, Retail, and Telecom situations.

As always, we will continue to iterate on these benchmarks and plan to add more models.
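The aggregation rule stated in the announcement (equal weighting across the three datasets, with a model excluded unless it has valid results for all three) can be sketched as below. This is only an illustration under assumptions: the dataset keys, the function name `speech_to_speech_index`, and the per-dataset scores in the example are hypothetical, and the published methodology may normalize or subset the datasets in ways the post does not describe.

```python
from statistics import mean
from typing import Mapping, Optional

# Hypothetical keys for the three datasets named in the announcement.
DATASETS = ("big_bench_audio", "full_duplex_bench", "tau_voice")


def speech_to_speech_index(scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Return the equal-weighted mean of the three dataset scores.

    Returns None when any dataset score is missing, mirroring the rule that a
    model needs valid results on all three datasets to be included.
    """
    values = [scores.get(name) for name in DATASETS]
    if any(value is None for value in values):
        return None
    return mean(values)


# Made-up per-dataset scores (in %), NOT taken from the benchmark.
example = {"big_bench_audio": 80.0, "full_duplex_bench": 70.0, "tau_voice": 60.0}
incomplete = {"big_bench_audio": 80.0, "tau_voice": 60.0}

print(speech_to_speech_index(example))     # 70.0
print(speech_to_speech_index(incomplete))  # None
```

Returning `None` for incomplete results, rather than averaging whatever is available, keeps a model with one missing dataset from being compared on an unequal footing.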
More details below ⬇️

![photo](https://pbs.twimg.com/media/HLgcEL1bkAEh-r4.jpg)

Artificial Analysis (@ArtificialAnlys): Full breakdown: https://t.co/Ld90Hvwwsh Methodology: https://t.co/XcGPHYRZtO

Artificial Analysis (@ArtificialAnlys): Gemini 3.1 Flash Live Preview (Minimal) has the lowest cost per hour of input audio in the index at $1.50, scoring 56.6%. Gemini 3.1 Flash Live Preview (High) costs $1.75 at 69.5%, Grok Voice Think Fast 1.0 costs $3.00 at 75.7%, and GPT-Realtime-2 (High) costs $4.14 at 77.2%. https://t.co/zH0wYis3GQ
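One way to read the cost and score figures quoted in the thread together is to divide each model's index score by its cost per hour of input audio. The sketch below does that using only the four cost/score pairs stated above. The "index points per USD" ratio and the name `points_per_dollar` are illustrative assumptions, not metrics published by Artificial Analysis.

```python
# (model, USD per hour of input audio, index score in %), as quoted in the thread.
RESULTS = [
    ("Gemini 3.1 Flash Live Preview (Minimal)", 1.50, 56.6),
    ("Gemini 3.1 Flash Live Preview (High)", 1.75, 69.5),
    ("Grok Voice Think Fast 1.0", 3.00, 75.7),
    ("GPT-Realtime-2 (High)", 4.14, 77.2),
]


def points_per_dollar(cost_usd: float, score_pct: float) -> float:
    """Illustrative ratio: index points obtained per USD of hourly input-audio cost."""
    return score_pct / cost_usd


# Sort from the highest to the lowest ratio.
ranked = sorted(RESULTS, key=lambda row: points_per_dollar(row[1], row[2]), reverse=True)

for model, cost, score in ranked:
    print(f"{model}: {points_per_dollar(cost, score):.1f} index points per USD")
```

On this ratio, Gemini 3.1 Flash Live Preview (High) ranks first (about 39.7 points per USD) and GPT-Realtime-2 (High) last (about 18.6). The ratio ignores latency and absolute quality, so it is at most one input to choosing a model.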

#Benchmark #AI #Model
