RE: LeoThread 2025-11-09 20-32

Part 7/12:

In voice AI, Microsoft has released VibeVoice 1.5B, an open-source model that can generate up to 90 minutes of sustained, natural-sounding conversation with up to four distinct speakers. Unlike traditional text-to-speech models, which typically handle a single voice in short clips, VibeVoice can simulate multi-speaker dialogue with emotional nuance, switching between voices and languages, including cross-lingual speech and even singing.

Built on the Qwen2.5-1.5B language model, VibeVoice pairs an acoustic tokenizer, which compresses raw audio into a compact token sequence, with a semantic tokenizer that captures what is being said. A diffusion-based decoder then adds lifelike detail such as emotion and intonation, producing speech that can be hard to tell apart from a human voice.
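
To get a feel for why the audio compression matters for 90-minute output, here is a rough back-of-the-envelope sketch. The 24 kHz sample rate and the 3200x downsampling factor are assumptions taken from the publicly described model setup, not figures stated in this thread, and `frames_for` is a hypothetical helper written only for illustration.

```python
# Rough estimate of how many token frames a long conversation needs.
# ASSUMPTIONS (not stated in this thread): 24 kHz audio, and a tokenizer
# that emits one frame per 3200 audio samples.
SAMPLE_RATE_HZ = 24_000
DOWNSAMPLE = 3_200


def frames_for(minutes: float,
               sample_rate_hz: int = SAMPLE_RATE_HZ,
               downsample: int = DOWNSAMPLE) -> int:
    """Return the number of token frames for `minutes` of audio."""
    samples = int(minutes * 60 * sample_rate_hz)
    return samples // downsample


frame_rate = SAMPLE_RATE_HZ / DOWNSAMPLE  # frames per second of audio
print(f"frame rate: {frame_rate} Hz")                 # frame rate: 7.5 Hz
print(f"frames for 90 minutes: {frames_for(90)}")     # frames for 90 minutes: 40500
```

Under those assumptions, 90 minutes of audio comes to about 40,500 frames instead of roughly 130 million raw samples, which is what makes it plausible for a language model to keep an entire conversation in context.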