Meta releases multilingual speech translation model

Illustration by Alex Castro / The Verge

Meta released a new speech-to-text model called SeamlessM4T that can translate nearly 100 languages, as the company continues its effort to build a universal translator.

SeamlessM4T stands for Massively Multilingual and Multimodal Machine Translation. The company said the model can translate speech-to-text and text-to-text for nearly 100 languages. For speech-to-speech and text-to-speech actions, it recognizes 100 input languages and converts them into 35 output languages.

The model is released under a Creative Commons CC BY-NC 4.0 license, which permits noncommercial use, allowing researchers to iterate upon it.

Along with SeamlessM4T, Meta also released the metadata for its open translation dataset SeamlessAlign. 

“Building a universal language translator, like the fictional Babel Fish in The Hitchhiker’s Guide to the Galaxy, is challenging because existing speech-to-speech and speech-to-text systems only cover a small fraction of the world’s languages,” Meta said. 

The Hitchhiker’s Guide Babel Fish, as conceived by author Douglas Adams, is a fish you can place in your ear to instantly understand any language. If you’re a Doctor Who fan, you could compare Meta’s tool to a translation matrix in the TARDIS that turns even alien words into English.

Meta said SeamlessM4T represents “a significant breakthrough” because this new model performs the entire translation task in one go, unlike other large translation models that divide translation across different systems. 

One of the interesting features of SeamlessM4T, if it functions correctly, is its alleged ability to recognize when a speaker is code-switching, that is, moving between two or more languages in one sentence. For instance, Meta demonstrated in a video that the model immediately differentiates between Hindi, Telugu, and English. I haven’t tested the model, but I frequently code-switch between my two native languages (Filipino and English), as do most people who speak different languages, and from personal experience, it’s not something most AI speech recognition software picks up on quickly.

SeamlessM4T builds on previous translation models from Meta. Last year, Meta released its No Language Left Behind text-to-text machine translation model, which supported 200 languages. It also developed SpeechMatrix, a dataset for multilingual speech-to-speech translation, and Massively Multilingual Speech for speech recognition. Meta demoed its Universal Speech Translator last year, converting spoken Hokkien, a widely used language in China that does not have an official writing system, to English.

Language translation is important for companies like Meta, which employ thousands of people to moderate a flood of Facebook and Instagram posts in different languages. Very often, non-major languages have smaller teams and end up relying on automated moderation that works poorly with those languages. AI, if given access to a dataset of these smaller languages, can be a tool for companies like Meta to improve moderation.

To build SeamlessM4T, Meta said it redesigned its Fairseq sequence modeling toolkit to create more lightweight models and handle more information. 

While developing SeamlessM4T, Meta said it built a system that identifies toxic or sensitive words. Meta defines toxic words as instances where the “translation may incite hate, violence, profanity, or abuse.” The goal is to be able to detect when the output translation introduces toxicity that wasn’t present in the original material.

“We filtered unbalanced toxicity in training data. If input or output contained different amounts of toxicity, we removed that training sequence,” Meta said. 
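
The filtering rule Meta describes can be sketched in a few lines of Python. This is a minimal illustration, not Meta's implementation: the word lists, the `count_toxic` and `filter_unbalanced` helpers, and the reading of "different amounts" as unequal word counts are all assumptions made for the example.

```python
import re

# Hypothetical per-language word lists; placeholders, not Meta's actual lists.
TOXIC_WORDS = {
    "en": {"damn", "idiot"},
    "es": {"maldito", "idiota"},
}


def count_toxic(text, lang):
    """Count tokens in `text` that appear in the word list for `lang`."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for tok in tokens if tok in TOXIC_WORDS[lang])


def filter_unbalanced(pairs, src_lang, tgt_lang):
    """Keep only pairs whose source and target have equal toxic-word counts.

    Assumption: "different amounts of toxicity" means unequal counts.
    """
    return [
        (src, tgt)
        for src, tgt in pairs
        if count_toxic(src, src_lang) == count_toxic(tgt, tgt_lang)
    ]


pairs = [
    ("The doctor is here.", "El doctor está aquí."),          # balanced: 0 and 0
    ("The doctor is here.", "El maldito doctor está aquí."),  # toxicity added
    ("You idiot!", "¡Idiota!"),                               # balanced: 1 and 1
]
kept = filter_unbalanced(pairs, "en", "es")
```

In this sketch the second pair is dropped because the translation contains a listed word that the source lacks, while the third pair is kept because both sides carry the same count.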

Researchers also tried to clean up datasets that mistranslate some profanity so the model more accurately detects when profanity is being used.

Meta claims the model also recognizes gender bias in languages and can quantify gender bias in translations. SeamlessM4T can check whether a sentence uses a gendered form of a word, say doctora in Spanish, and assign a female pronoun in a target language without equivalently gendered grammar if needed. Approaching it similarly to toxicity, Meta said SeamlessM4T counts how many times a translation adds gendered words to terms that were not specifically gendered in the original language, for example, automatically assuming a doctor is male when the word has no gender distinction in English.
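
The count Meta describes can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the gendered-word lists and the `gendered_count` and `count_added_gender` helpers are hypothetical, and real systems would need far richer linguistic resources than a word list.

```python
import re

# Hypothetical gendered-word lists; assumptions for illustration only.
GENDERED = {
    "en": {"he", "she", "him", "her", "his", "hers"},
    "es": {"el", "la", "él", "ella", "doctor", "doctora"},
}


def gendered_count(text, lang):
    """Count tokens in `text` that appear in the gendered list for `lang`."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for tok in tokens if tok in GENDERED[lang])


def count_added_gender(pairs, src_lang, tgt_lang):
    """Count pairs where the source has no gendered words but the target does."""
    return sum(
        1
        for src, tgt in pairs
        if gendered_count(src, src_lang) == 0
        and gendered_count(tgt, tgt_lang) > 0
    )


pairs = [
    ("The doctor is here.", "El doctor está aquí."),  # gender added in target
    ("She is a doctor.", "Ella es doctora."),         # gendered in the source
    ("Good morning.", "Buenos días."),                # no listed words
]
added = count_added_gender(pairs, "en", "es")
```

Only the first pair is counted here: the English source carries no listed gendered word, while the Spanish output uses masculine forms.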

Meta has been releasing many of its AI models to developers and researchers in a more or less open-source fashion. It recently put out AudioCraft, code that allows for text-to-sound generation. Meta also provided access to its large language model Llama 2.