ElevenLabs, a year-old voice cloning and synthesis startup founded by former Google and Palantir employees, today announced the launch of AI Dubbing, a dedicated product that can translate any speech, including long-form content, into more than 20 different languages.
Available to all platform users, the offering comes as a new way to dub audio and video content and can transform an area that has largely been manual for years.
More importantly, it can break language barriers for smaller content creators who don’t have the resources to hire human translators to convert their content and take it global.
“We have tested and iterated this feature in collaboration with hundreds of content creators to dub their content and make it more accessible to wider audiences,” Mati Staniszewski, CEO and co-founder of ElevenLabs, told VentureBeat. “We see huge potential for independent creatives – such as those creating video content and podcasts – all the way through to film and TV studios.”
ElevenLabs claims the feature can deliver high-quality translated audio in minutes (depending on the length of the content) while retaining the original voice of the speaker, complete with their emotions and intonation.
However, in this age of AI, when almost every enterprise is looking at language models to drive efficiencies, ElevenLabs is not the only company exploring speech-to-speech translation.
AI Dubbing: How it works
While AI-driven translation involves multiple layers of work, from noise removal to speech translation, users at the front end don’t have to go through any of those steps. They just have to select the AI Dubbing tool on ElevenLabs, create a new project, select the source and target languages and upload the content file.
Once the content is uploaded, the tool automatically detects the number of speakers and gets to work with a progress bar appearing on the screen. This is just like any other conversion tool on the internet. After completion, the file can be downloaded and used.
Behind the scenes, the tool uses ElevenLabs’ proprietary method to remove background noise, separating music and noise from the speakers’ actual dialogue. It recognizes which speaker is talking when, keeping their voices distinct, and transcribes what they say in their original language using a speech-to-text model. Then, this text is translated, adapted (so lengths match) and voiced in the target language to produce the desired speech while retaining the speaker’s original voice characteristics.
Finally, the translated speech is synced back with the music and background noise originally removed from the file, preparing the dubbed output for use. ElevenLabs claims this work is the culmination of its research on voice cloning, text and audio processing and multilingual speech synthesis.
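For readers who want a concrete picture of the per-speaker flow described above, the following is a minimal Python sketch of the translate, adapt and voice steps. It is not ElevenLabs’ implementation, which is proprietary. Every name here (Segment, fit_to_duration, dub_segments, the translate and synthesize callables, and the 15-characters-per-second budget) is a hypothetical stand-in, and the length adaptation is a crude word-trimming heuristic rather than the rephrasing a production system would likely use.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Segment:
    """One stretch of speech, as a diarization + transcription step might emit it."""
    speaker: str   # hypothetical speaker label, e.g. "spk_0"
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # transcript in the source language


def fit_to_duration(text: str, duration: float, chars_per_second: float = 15.0) -> str:
    """Crude stand-in for the 'adapted (so lengths match)' step.

    Drops trailing words until the text fits a character budget derived
    from the original segment's duration. Always keeps at least one word.
    The 15 chars/second rate is an illustrative assumption.
    """
    budget = int(duration * chars_per_second)
    words = text.split()
    while len(words) > 1 and len(" ".join(words)) > budget:
        words.pop()
    return " ".join(words)


def dub_segments(
    segments: List[Segment],
    translate: Callable[[str], str],
    synthesize: Callable[[str, str], str],
) -> List[Tuple[float, str, str, str]]:
    """Translate, adapt and voice each segment, keeping speaker and timing.

    `translate` maps source text to target text; `synthesize` takes
    (text, speaker) and returns audio. Both are placeholders supplied
    by the caller, not real library calls.
    """
    dubbed = []
    for seg in segments:
        translated = translate(seg.text)
        adapted = fit_to_duration(translated, seg.end - seg.start)
        audio = synthesize(adapted, seg.speaker)
        dubbed.append((seg.start, seg.speaker, adapted, audio))
    return dubbed
```

In this sketch, noise removal and the final remix with the original music and background track would sit before and after dub_segments; the start times it returns are what such a remix step would use to place each dubbed clip.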
Source: VentureBeat