AI Curated Feed · Smart Score: 60

MaineCoon-22B real-time text-to-audio-video model released

Source: Twitter following list
Author: Rohan Paul (@rohanpaul_ai)
Published: 2026-06-23
Indexed: 2026-06-23
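Why AI recommends this
The key differentiator is real-time streaming audio-video generation at very low cost, suited to real-time interaction with AI characters. The technical blog and paper are worth reading in depth.
Key takeaways
catnips_ai has released MaineCoon, a 22B-parameter real-time text-to-audio-video model that uses a dual-stream Diffusion Transformer and reinforced online-policy distillation (ROPD). The model reaches up to 47.5 FPS on a single H100 GPU, produces the first frame in under 1 second, and brings audio-video generation cost below $0.001 per second, with cost continuing to fall. It supports more than 10 minutes of continuous streaming generation and maintains consistency through agentic cache management and long-context rollout.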
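To put the headline figures in perspective, the quick calculation below turns them into a per-frame time budget, a real-time factor, and an upper bound on cost. It assumes 30 FPS playback (the real-time rate the post quotes for the RTX Pro 6000) and treats $0.001 per second as a ceiling, since the post only says the cost is below that. The function names are illustrative, not from the source.

```python
def frame_budget_ms(gen_fps: float) -> float:
    """Average time available to produce one frame at the given rate."""
    return 1000.0 / gen_fps


def real_time_factor(gen_fps: float, playback_fps: float) -> float:
    """How many seconds of video are generated per second of wall-clock time."""
    return gen_fps / playback_fps


def cost_ceiling_usd(duration_s: float, price_per_s: float = 0.001) -> float:
    """Upper bound on cost, assuming the quoted price is a ceiling."""
    return duration_s * price_per_s


budget = frame_budget_ms(47.5)          # H100 figure from the post
rtf = real_time_factor(47.5, 30.0)      # 30 FPS playback is an assumption
ten_minutes = cost_ceiling_usd(10 * 60)

print(f"{budget:.1f} ms per frame")         # 21.1 ms per frame
print(f"{rtf:.2f}x real time")              # 1.58x real time
print(f"under ${ten_minutes:.2f} for 10 minutes")  # under $0.60 for 10 minutes
```

In other words, at the quoted rates a 10-minute stream would cost less than 60 cents, and the H100 figure leaves roughly 58% headroom over 30 FPS playback.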
Full text
AI video is moving into its real-time reaction era, with MaineCoon now leading in low-latency AI video.

@catnips_ai just introduced MaineCoon, a 22B real-time text-to-audio-video model built for live AI characters rather than offline video generation. The goal is to make AI video feel live by generating synced speech and visuals in real time.

A record-breaking frame rate of up to 47.5 FPS on a single H100 GPU.

Audio-visual generation cost drops significantly below $0.001 per second and continues to fall.

It puts forward the paradigm of social world models for socially interactive use. MaineCoon serves as the first generative core toward this paradigm and provides a technical foundation for next-generation AI-native social platforms.

It proposes a multi-stage, forcing-free streaming training paradigm that includes self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). These components enable native, efficient streaming audio-visual training at 22B scale.

It designs an agentic streaming inference framework that supports generation at thousand-second scale or longer while mitigating drift through agentic cache management, chunk commitment, long-context rollout, and prompt planning.

The big deal is long-duration streaming at low cost. Text goes in, the first frame appears in under 1 second, and the model keeps producing synced video and audio while playback is already happening. So it is not making a full video first and dubbing it later. It generates forward in small chunks, and each chunk continues from the last one.

That is hard because tiny chunks usually break consistency. Faces drift. Voices change. Motion gets weird. Audio and mouth movement separate.

MaineCoon tries to solve this with a dual-stream Diffusion Transformer: one stream for video, one stream for audio, and cross-stream attention between them so that expression, lip motion, voice, timing, and body movement stay tied together.
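The chunk-by-chunk behaviour the post describes can be pictured with a small generator. This is only a toy sketch: `fake_denoise_chunk` is a hypothetical stand-in for the real model, the chunk size and history length are made-up values, and nothing here reflects MaineCoon's actual code. It only illustrates the control flow, in which each chunk is conditioned on earlier chunks and handed to playback as soon as it exists.

```python
from typing import Iterator, List, Tuple

# One chunk = (video frame ids, audio segment ids); ids stand in for real tensors.
Chunk = Tuple[List[int], List[int]]


def fake_denoise_chunk(prompt: str, history: List[Chunk], frames: int) -> Chunk:
    """Hypothetical model call: continue numbering where the history ends."""
    start = history[-1][0][-1] + 1 if history else 0
    ids = list(range(start, start + frames))
    # Audio and video share indices here, mimicking streams that stay in sync.
    return ids, list(ids)


def stream(
    prompt: str,
    num_chunks: int,
    frames_per_chunk: int = 4,
    max_history: int = 8,
) -> Iterator[Chunk]:
    """Yield chunks one at a time so playback can start before generation ends."""
    history: List[Chunk] = []
    for _ in range(num_chunks):
        chunk = fake_denoise_chunk(prompt, history, frames_per_chunk)
        history.append(chunk)
        history = history[-max_history:]  # bounded memory of earlier chunks
        yield chunk  # emitted chunks are never revised afterwards


chunks = list(stream("a character waves hello", num_chunks=3))
for video, audio in chunks:
    print(video, audio)
```

Because the generator yields each chunk immediately, a consumer can begin playing the first chunk while later ones are still being produced, which is the difference from generating a full clip and dubbing it afterwards.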
MaineCoon also uses a history key-value cache and an attention sink. In plain words, the model keeps useful memory from previous chunks, so the next chunk does not feel like a new, disconnected clip.

The speed claim is also big: up to 47.5 FPS on a single H100, and real-time 30 FPS on a single RTX Pro 6000 GPU. That is the low-cost part. You do not need a huge multi-GPU serving setup just to get real-time audio-video generation.

They also describe an agentic streaming system that can keep generation going for more than 10 minutes while holding identity, voice, scene state, visual quality, and synced audio. If the stream starts drifting, the system repairs future chunks instead of editing already-shown frames.

So MaineCoon is best understood as a streaming-native visual reaction layer: fast first frame, continuous audio-video output, long-horizon memory, and low inference cost.

🧵 1/n.

![photo](https://pbs.twimg.com/media/HLhSRMmakAAl_uo.jpg)
https://video.twimg.com/amplify_video/2069494554965966848/vid/avc1/480x832/2fqKCMXqSfqCPn79.mp4?tag=28
https://video.twimg.com/amplify_video/2069494554932445185/vid/avc1/480x832/uWlg2VEYPQAeqmPk.mp4?tag=28
https://video.twimg.com/amplify_video/2069494554915708929/vid/avc1/960x1664/_qOqWMxrUbhMN1nj.mp4?tag=28

Rohan Paul (@rohanpaul_ai): Some more results from MaineCoon-22B https://t.co/BRZjLFqTzl

Rohan Paul (@rohanpaul_ai): Read the detailed technical blog on MaineCoon-22B here; you can also apply for access. https://t.co/rDA12pGEeP

Rohan Paul (@rohanpaul_ai): 🧵 10. From the official technical paper of MaineCoon-22B https://t.co/rF66wwi6ie https://t.co/sr0V0qYxsK

Rohan Paul (@rohanpaul_ai): 🧵 7. The full runtime of MaineCoon-22B becomes a fast/slow system. MaineCoon handles sub-second audio-video reaction. A slower planning brain runs behind it, using the feeling simulator to guide future chunks without stopping the live stream. https://t.co/q6ru395OHa
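As a rough illustration of the history key-value cache and attention sink mentioned in the thread, the sketch below keeps the first few cache entries permanently (the sink) plus a sliding window of the most recent ones, and evicts everything in between. This follows the general attention-sink idea used in streaming language models. Whether MaineCoon evicts entries this way is an assumption, and `SinkCache`, `num_sink`, and `window` are hypothetical names.

```python
from collections import deque
from typing import Deque, Generic, List, TypeVar

T = TypeVar("T")


class SinkCache(Generic[T]):
    """Bounded cache: a fixed 'sink' prefix plus a window of recent entries."""

    def __init__(self, num_sink: int, window: int) -> None:
        if num_sink < 0 or window <= 0:
            raise ValueError("num_sink must be >= 0 and window must be > 0")
        self.num_sink = num_sink
        self.sink: List[T] = []
        # deque with maxlen silently drops the oldest entry when full.
        self.recent: Deque[T] = deque(maxlen=window)

    def append(self, entry: T) -> None:
        """Fill the sink first; afterwards entries go to the sliding window."""
        if len(self.sink) < self.num_sink:
            self.sink.append(entry)
        else:
            self.recent.append(entry)

    def entries(self) -> List[T]:
        """Entries the next chunk would attend to, oldest first."""
        return self.sink + list(self.recent)


cache = SinkCache(num_sink=2, window=3)
for chunk_id in range(10):  # chunk ids stand in for per-chunk key/value tensors
    cache.append(chunk_id)

print(cache.entries())  # [0, 1, 7, 8, 9]
```

The cache size stays constant no matter how long the stream runs, which is one plausible way to keep per-chunk cost flat over a stream lasting many minutes.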
#TechnicalBreakthrough #ModelRelease #Multimodal