Microsoft reports a major advance in AI speech generation with its VALL-E 2 text-to-speech (TTS) system. According to the company, VALL-E 2 has achieved human parity, meaning that the voices it generates are rated on par with recordings of real people. From just a few seconds of sample audio, the system can learn and replicate a speaker’s voice.
According to Microsoft, tests on the LibriSpeech and VCTK speech datasets show that VALL-E 2’s voice quality matches or even surpasses that of human speech. The system introduces two techniques, “Repetition Aware Sampling” and “Grouped Code Modeling”, which help it handle complex sentences and repetitive phrases naturally and produce smooth, realistic speech.
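To give a rough sense of what “Repetition Aware Sampling” could look like in practice, here is a minimal Python sketch. It is not Microsoft's code, which has not been released. It assumes the general idea as publicly described: pick the next audio token with nucleus (top-p) sampling, and if that token already dominates the recent decoding history, draw instead from the full probability distribution. The function names, the window size, and the thresholds are all illustrative assumptions.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample a token index from the smallest set of tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


def repetition_aware_sample(probs, history, top_p=0.8, window=10,
                            max_ratio=0.1, rng=random):
    """Hypothetical sketch: nucleus sampling with a repetition check.

    probs     -- next-token probabilities (a list that sums to 1)
    history   -- previously generated token indices
    window    -- how many recent tokens to inspect (assumed value)
    max_ratio -- repetition ratio above which we fall back (assumed value)
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # The candidate is over-represented in recent output, so sample
            # from the full distribution to give other tokens a chance.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token
```

With a distribution such as `[0.9, 0.05, 0.05]` and `top_p=0.8`, plain nucleus sampling can only ever return token 0. The fallback is what lets the decoder escape a run of identical tokens, which is the kind of loop that makes synthesized speech stall or stutter.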
Although it has shared audio samples, Microsoft has decided not to release VALL-E 2 to the public at this time, citing concerns about potential misuse such as voice spoofing. This cautious approach reflects a broader industry recognition of the ethical implications of voice technology, as exemplified by OpenAI’s restrictions on its own voice technology. While VALL-E 2 represents a significant breakthrough, it remains a research project for the time being.