

Meet SeamlessM4T: Meta AI’s New Foundation Model for Speech Translation

Last Updated on August 25, 2023 by Editorial Team

Author(s): Jesus Rodriguez

Originally published on Towards AI.

The model provides a unique architecture and breakthrough performance across different speech translation tasks.

Created Using Midjourney

I recently started an AI-focused educational newsletter that already has over 160,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing to TheSequence on Substack.

Speech is rapidly becoming one of the next frontiers of foundation models. While domains such as language and computer vision still dominate the headlines, speech is an increasingly important modality. Areas such as speech-to-speech translation (S2ST) have long relied on cascaded architectures that chain many components together to perform translation progressively, and as a result the space hasn't shown the same progress as other areas of foundation models. Recently, Meta AI Research unveiled the research behind SeamlessM4T (Massively Multilingual and Multimodal Machine Translation), a unified speech foundation model for different speech translation tasks.

In today’s foundation model ecosystem, existing machine translation (MT) systems revolve predominantly around text, with speech support sidelined, if it exists at all. The integration of speech into the MT landscape has often been treated as secondary to its text-based counterpart. Despite the accomplishments of individual unimodal models, unified S2ST models of comparable breadth and quality remain distant. This modality gap can be attributed to several factors, but the scarcity of audio data and the limits of current modeling techniques persist as the most prominent hurdles. The very complexity that makes speech a harder problem from an MT perspective, namely its capacity to encode richer information and expressive elements, is also what makes it superior in conveying intent and building strong social connections between conversational participants.

The current landscape of such systems is marked by three principal deficiencies.

1. The focus of speech translation models predominantly gravitates toward high-resource languages such as English, Spanish, and French, often neglecting low-resource languages.

2. They predominantly cater to translations from source languages into English, rather than the reciprocal direction.

3. The majority of S2ST systems at present lean heavily on cascaded frameworks, comprised of multiple successive subsystems that handle translation in stages: beginning with automatic speech recognition (ASR), moving to text-to-text translation (T2TT), and culminating in text-to-speech (TTS) synthesis in a three-tiered architecture.

Efforts to unify these multifaceted capabilities within a single cohesive entity have given rise to initial versions of end-to-end speech translation systems. However, these systems have not yet matched the performance benchmarks set by their cascaded counterparts.
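The cascaded design described above can be pictured as a chain of independent stages, each consuming the previous stage's output. The sketch below uses hypothetical stubs (none of these functions are Meta AI's actual modules) to make the structure concrete; in a real system, any error the ASR stage makes propagates through T2TT and TTS with no way to recover.

```python
# Sketch of a three-stage cascaded S2ST pipeline. All stages are
# invented stubs for illustration, not real SeamlessM4T components.

def asr(audio: list) -> str:
    """Automatic speech recognition: audio -> source-language text."""
    return "bonjour"  # stub transcription

def t2tt(text: str, tgt_lang: str) -> str:
    """Text-to-text translation via a toy lookup table."""
    lexicon = {("bonjour", "swh"): "habari"}
    return lexicon.get((text, tgt_lang), text)

def tts(text: str) -> list:
    """Text-to-speech synthesis: text -> waveform samples (stub)."""
    return [0.0] * (len(text) * 160)  # pretend 160 samples per character

def cascaded_s2st(audio: list, tgt_lang: str) -> list:
    """The three stages chained together: ASR -> T2TT -> TTS."""
    return tts(t2tt(asr(audio), tgt_lang))

waveform = cascaded_s2st([0.1, 0.2], "swh")
```

A unified model like SeamlessM4T replaces this chain with a single network, so the intermediate hand-offs (and their compounding errors) disappear.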


SeamlessM4T (Massively Multilingual and Multimodal Machine Translation) is an integrated platform encompassing ASR, T2TT, speech-to-text translation (S2TT), text-to-speech translation (T2ST), and S2ST functionalities. The model builds on a long history of Meta AI breakthroughs in the speech translation space. Notably, Meta AI introduced No Language Left Behind (NLLB) in the previous year, a text-to-text machine translation model engineered to cover an impressive 200 languages. In the following months, Meta AI showcased the pioneering Universal Speech Translator, a system that facilitated direct speech-to-speech translation for Hokkien, a language characterized by its absence of a widely adopted writing system. This endeavor also yielded SpeechMatrix, a monumental multilingual speech-to-speech translation dataset. This dataset, born from the innovation of SpeechLASER, marked a milestone in the realm of supervised representation learning. A subsequent stride materialized earlier in the current year with the unveiling of Massively Multilingual Speech, a comprehensive offering encompassing automatic speech recognition, language identification, and speech synthesis capabilities spanning an expansive array of over 1,100 languages.

Image Credit: Meta AI

SeamlessM4T synthesizes insights gleaned from these diverse projects. The outcome is a multilingual and multimodal translation experience stemming from a single model, meticulously constructed from an extensive spectrum of spoken data sources and delivering state-of-the-art results.

To construct a unified model, Meta AI requires a lightweight sequence modeling toolkit that can seamlessly integrate with other modern PyTorch ecosystem libraries. To fulfill this need, Meta AI has reengineered fairseq, its original sequence modeling toolkit. By incorporating more efficient modeling and data loader APIs, fairseq2 now plays a pivotal role in driving the underlying modeling processes of SeamlessM4T.

Image Credit: Meta AI

At the core of the model lies the multitask UnitY model architecture, designed to perform a range of functions, including generating translated text and speech. This architecture also facilitates automatic speech recognition, text-to-text translation, text-to-speech conversion, speech-to-text translation, and speech-to-speech translation — features that are already inherent in the vanilla UnitY model. The multitask UnitY model is structured around three primary sequential components. Text and speech encoders are entrusted with the task of recognizing speech input across nearly 100 languages. Subsequently, the text decoder transforms that meaning into various languages for textual content, followed by a text-to-unit model that decodes it into discrete acoustic units tailored for 36 speech languages. Through pre-training of the self-supervised encoder, speech-to-text, text-to-text translation components, and text-to-unit model, the quality of the model is enhanced, and its training stability is ensured. The resultant decoded discrete units are then transformed into speech using a multilingual HiFi-GAN unit vocoder.
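The data flow through those sequential components can be sketched as a chain of placeholder functions. The internals below are invented stand-ins purely to show how the encoder, text decoder, text-to-unit model, and vocoder hand off to one another; they are not Meta AI's implementation.

```python
# Data flow through a multitask UnitY-style pipeline, with dummy internals.

def speech_encoder(audio):
    """w2v-BERT-style encoder: waveform -> internal representation (stub)."""
    return [sum(audio) / len(audio)]  # dummy fixed-size representation

def text_decoder(representation, tgt_lang):
    """Decode the encoded meaning into target-language text (stub)."""
    return f"<{tgt_lang}> translated text"

def text_to_unit(text):
    """T2U model: target text -> discrete acoustic unit IDs (stub)."""
    return [hash(ch) % 1000 for ch in text]  # dummy unit IDs in [0, 1000)

def vocoder(units):
    """HiFi-GAN-style unit vocoder: units -> waveform samples (stub)."""
    return [0.0] * (len(units) * 320)  # pretend 320 samples per unit

def s2st(audio, tgt_lang):
    rep = speech_encoder(audio)
    text = text_decoder(rep, tgt_lang)
    units = text_to_unit(text)
    return text, vocoder(units)

text, wave = s2st([0.1, -0.2, 0.3], "swh")
```

The key property the sketch illustrates is that text and speech outputs share one trunk: the same decoded text feeds both the S2TT output and, via the T2U model and vocoder, the S2ST output.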

Meta AI employs a self-supervised speech encoder known as w2v-BERT 2.0 — an enhanced iteration of w2v-BERT distinguished by improved training stability and representation quality. This encoder is trained to discern structure and meaning within speech patterns, drawing insights from vast volumes of multilingual speech spanning millions of hours. Functionally, the encoder dissects the audio signal into smaller segments, constructing an internal representation of the spoken content. Given that spoken language comprises various sounds and characters, a length adaptor is employed to map these elements to corresponding words, albeit in an approximate manner.
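Because the encoder emits many frame-level vectors per second of audio, far more than there are words or characters, the length adaptor shrinks the sequence. The windowed mean-pooling below is a simplification chosen for illustration; SeamlessM4T's actual adaptor is a learned module.

```python
# Illustrative length adaptor: collapse a long sequence of frame-level
# encoder vectors into a shorter one by mean-pooling fixed-size windows.
# This pooling rule is an assumption for illustration only.

def length_adaptor(frames, window=4):
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled

# 12 frames of 2-dimensional features -> 3 pooled vectors
frames = [[float(i), float(-i)] for i in range(12)]
shortened = length_adaptor(frames)
```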

Similarly, Meta AI employs a text encoder grounded in the NLLB model. This text encoder is trained to comprehend textual content spanning nearly 100 languages, generating representations that prove valuable in translation tasks.

Meta AI’s text decoder is adept at processing encoded speech representations or textual representations. This capability is harnessed for tasks within the same language, including automatic speech recognition and multilingual translation endeavors. For instance, when a speaker utters the word “bonjour” in French, the corresponding translated text in Swahili, “habari,” is seamlessly generated. Through multitask training, Meta AI leverages the prowess of a robust text-to-text translation model (NLLB) to guide the speech-to-text translation model via token-level knowledge distillation.
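Token-level knowledge distillation means that, at each target-token position, the student (the speech-to-text path) is trained to match the teacher's (NLLB's) probability distribution over the vocabulary rather than only the single reference token. A toy version of that loss, with invented three-word distributions:

```python
import math

# Token-level knowledge distillation, sketched with toy distributions.
# The loss is the mean per-token cross-entropy H(teacher, student),
# which is minimized when the student matches the teacher exactly.

def kd_loss(teacher_probs, student_probs, eps=1e-12):
    total = 0.0
    for t_dist, s_dist in zip(teacher_probs, student_probs):
        total += -sum(t * math.log(s + eps) for t, s in zip(t_dist, s_dist))
    return total / len(teacher_probs)

teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]   # toy 3-word vocabulary
aligned = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]   # student matches teacher
uniform = [[1/3, 1/3, 1/3], [1/3, 1/3, 1/3]]   # uninformed student

print(kd_loss(teacher, aligned) < kd_loss(teacher, uniform))  # True
```

Distilling from distributions rather than hard labels is what lets the strong text-to-text teacher transfer its knowledge to the speech path even where direct speech supervision is scarce.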

In the context of speech production, Meta AI leverages acoustic units to represent the target speech. The text-to-unit (T2U) component within the UnitY model orchestrates the creation of discrete speech units based on the textual output. This component undergoes pre-training on ASR data prior to the UnitY fine-tuning phase. Subsequently, a multilingual HiFi-GAN unit vocoder is employed to convert these discrete units into audio waveforms.
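Discrete acoustic units are typically emitted one per short frame, so the raw sequence contains long runs of repeats; unit-based pipelines commonly operate on a "reduced" sequence with consecutive duplicates collapsed, leaving the vocoder to restore durations when rendering audio. A toy illustration of that reduction (the specific unit IDs are invented):

```python
# Collapse consecutive duplicate acoustic-unit IDs into a reduced
# sequence, as unit-based speech pipelines commonly do.

def reduce_units(units):
    reduced = []
    for u in units:
        if not reduced or reduced[-1] != u:
            reduced.append(u)
    return reduced

raw = [17, 17, 17, 42, 42, 5, 5, 5, 5, 42]
print(reduce_units(raw))  # [17, 42, 5, 42]
```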

Data-driven models like SeamlessM4T derive significant benefits from substantial volumes of high-quality end-to-end data — specifically speech-to-text and speech-to-speech data. However, relying solely on human-transcribed and translated speech data is inadequate to address the complexities of speech translation for 100 languages. In response, Meta AI builds upon its pioneering work in text-to-text mining, employing a similarity measure in a unified embedding space, alongside initial explorations in speech mining, to generate additional resources for SeamlessM4T model training.
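The mining idea is that speech utterances and text sentences embedded into the same vector space can be paired by similarity: candidate pairs whose embeddings score above a threshold are treated as parallel data. The sketch below uses plain cosine similarity with made-up 2-dimensional embeddings; Meta AI's actual mining pipeline uses margin-based scoring over nearest-neighbor sets, so this is a deliberate simplification.

```python
import math

# Mining parallel pairs in a shared embedding space (simplified):
# score every candidate pair by cosine similarity, keep high scorers.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mine_pairs(src_embs, tgt_embs, threshold=0.9):
    """Return (src_index, tgt_index) pairs whose embeddings align."""
    pairs = []
    for i, s in enumerate(src_embs):
        for j, t in enumerate(tgt_embs):
            if cosine(s, t) >= threshold:
                pairs.append((i, j))
    return pairs

src = [[1.0, 0.0], [0.0, 1.0]]    # e.g. speech utterance embeddings
tgt = [[0.0, 0.9], [0.95, 0.1]]   # e.g. text sentence embeddings
print(mine_pairs(src, tgt))  # [(0, 1), (1, 0)]
```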

The Results

With a singular model, Meta AI’s SeamlessM4T attains cutting-edge outcomes across an impressive spectrum of nearly 100 languages. This accomplishment is augmented by its multitasking capabilities, spanning automatic speech recognition, speech-to-text, speech-to-speech, text-to-speech, and text-to-text translation functionalities.

Notably, the system extends its advancements to encompass languages with low and mid-level resource availability, significantly enhancing their performance. This augmentation is accompanied by the system’s unwavering excellence in delivering robust outcomes for high-resource languages.

In the pursuit of accurate system evaluation, Meta AI introduces an extended metric, BLASER 2.0, which moves beyond text-based assessment to score both speech and text outputs with accuracy comparable to its predecessor. In robustness testing, the system shows exceptional resilience in speech-to-text tasks: against background noise and variation in speaker characteristics, it achieves average improvements of 37% and 48%, respectively, over the current state-of-the-art model.

Image Credit: Meta AI

SeamlessM4T is certainly one of the most exciting foundation models in speech translation ever built. Hopefully, we will see it integrated into Meta AI’s multimodal efforts.


