Part 7/12:
In voice AI, Microsoft has released VibeVoice 1.5B, an open-source model that can generate up to 90 minutes of sustained, natural-sounding conversation among multiple speakers. Unlike traditional text-to-speech models, which typically read short passages in a single voice, VibeVoice can simulate multi-speaker dialogue with emotional nuance, switching between voices and languages. It can also handle quick cross-lingual translation and even singing.
Built on the Qwen2.5-1.5B language model, VibeVoice combines two tokenizers: an acoustic tokenizer that compresses raw audio into a compact representation, and a semantic tokenizer that captures what is being said. A diffusion-based decoder then adds lifelike detail such as emotion and intonation, producing speech that is described as indistinguishable from human voices.
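To see why efficient audio tokenization matters for 90-minute generation, here is a rough back-of-the-envelope sketch in Python. The 24 kHz sample rate and 7.5 Hz token frame rate are assumptions taken from public descriptions of the model, not figures stated in this summary, and the function names are purely illustrative.

```python
# Rough sequence-length arithmetic for long-form speech generation.
# ASSUMPTIONS (not stated in the summary above): 24 kHz audio and a
# tokenizer that emits 7.5 frames per second of audio.

ASSUMED_SAMPLE_RATE_HZ = 24_000   # raw audio samples per second (assumed)
ASSUMED_FRAME_RATE_HZ = 7.5       # tokenizer frames per second (assumed)


def raw_samples(minutes: float, sample_rate_hz: int = ASSUMED_SAMPLE_RATE_HZ) -> int:
    """Number of raw audio samples in a clip of the given length."""
    return round(minutes * 60 * sample_rate_hz)


def token_frames(minutes: float, frame_rate_hz: float = ASSUMED_FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames the language model would have to handle."""
    return round(minutes * 60 * frame_rate_hz)


def compression_ratio(sample_rate_hz: int = ASSUMED_SAMPLE_RATE_HZ,
                      frame_rate_hz: float = ASSUMED_FRAME_RATE_HZ) -> float:
    """How many raw samples each tokenizer frame stands in for."""
    return sample_rate_hz / frame_rate_hz


if __name__ == "__main__":
    minutes = 90
    print("raw samples:      ", raw_samples(minutes))
    print("tokenizer frames: ", token_frames(minutes))
    print("samples per frame:", compression_ratio())
```

Under these assumptions, a 90-minute conversation shrinks from over a hundred million raw samples to a few tens of thousands of frames, which is a sequence length a language model can plausibly attend over in one pass.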