Meta: Leading the Race in AI Research

Fahim Ahmed Aurko

Meta AI [Wired]

Established in December 2015, Meta AI is an artificial intelligence research center connected to Meta Platforms, Inc. (formerly Facebook, Inc.). The center focuses on developing and enhancing artificial intelligence and augmented reality technology. Advances in AI research benefit not only the greater scientific community but the world as a whole.

The Fundamental AI Research (FAIR) team at Meta is dedicated to furthering our fundamental understanding of AI and covers the entire spectrum of topics related to it. The team's primary objective is to engage in cutting-edge applied research that can improve and power new product experiences for the Meta community. The FAIR team has worked on a number of cutting-edge research models.

Audiobox is a unified audio generation model based on flow-matching that can generate a variety of audio modalities. It is the successor to Voicebox, a state-of-the-art model that could perform speech generation tasks such as editing, sampling, and stylizing. Unlike its predecessor, Audiobox unites speech generation with high-level editing and can produce a variety of sound effects, such as car horns or lightning strikes.
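
To make the flow-matching idea concrete, here is a minimal toy sketch of how such a model is trained and sampled. Everything below is illustrative, with a tiny network standing in for the real audio transformer; it is not Meta's code.

    import torch
    import torch.nn as nn

    class VectorField(nn.Module):
        """Tiny stand-in for the large audio transformer in Audiobox."""
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim)
            )

        def forward(self, x, t):
            # Condition on time by appending t to the input features.
            return self.net(torch.cat([x, t.expand(x.size(0), 1)], dim=-1))

    def flow_matching_loss(model, x1):
        """Regress the model onto the straight path from noise to data."""
        x0 = torch.randn_like(x1)          # noise sample
        t = torch.rand(x1.size(0), 1)      # random time in [0, 1]
        xt = (1 - t) * x0 + t * x1         # point along the path
        target = x1 - x0                   # constant velocity of that path
        return ((model(xt, t) - target) ** 2).mean()

    @torch.no_grad()
    def generate(model, dim=64, steps=32):
        """Sample by integrating the learned velocity field (Euler steps)."""
        x = torch.randn(1, dim)
        for i in range(steps):
            t = torch.full((1, 1), i / steps)
            x = x + model(x, t) / steps
        return x

At generation time the model starts from pure noise and follows the learned velocity field toward a realistic audio representation, which is what makes flow-matching an efficient and controllable way to sample.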

A notable feature of Audiobox is that it lets people use natural language prompts to describe the sound or speech they want to generate. For example, to generate a soundscape, a user can give the model a text prompt like, “A busy café with loud customers and an espresso machine”.

Audiobox users can also combine an audio voice input with a text style prompt to synthesize speech in that voice in any environment or emotion. As of this writing, Audiobox is the first AI speech generation model to accept dual inputs, a voice prompt and a text description prompt, for freeform voice restyling. Audiobox offers a high degree of controllability over speech and sound effect generation. The research team’s experiments showed that it significantly surpasses the prior best models (AudioLDM2, VoiceLDM, and TANGO) on quality and relevance (faithfulness to the text description) in subjective evaluations. Audiobox also outperformed its predecessor, Voicebox, on style similarity by over 30 percent.
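
As a rough illustration of what that dual-input interface might look like in use, here is a hypothetical sketch; the audiobox package, the Audiobox class, and every method and checkpoint name are invented for this example and are not a published Meta API.

    from audiobox import Audiobox  # hypothetical package name

    model = Audiobox.load("audiobox-base")  # hypothetical checkpoint

    # Text-only prompt: describe a soundscape in natural language.
    cafe = model.generate_sound(
        "A busy café with loud customers and an espresso machine"
    )

    # Dual input: a recorded voice sample plus a text style prompt.
    speech = model.generate_speech(
        voice_prompt="my_voice.wav",       # audio input: whose voice to imitate
        description="speaking sadly in a large cathedral",  # style and environment
        text="The meeting has moved to Friday.",            # words to synthesize
    )
    speech.save("restyled.wav")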

Seamless is a family of AI translation models that enable cross-lingual communication in real time while retaining expressive elements of speech, such as tone, pauses, and emphasis. The family is built atop SeamlessM4T v2, the latest version of the foundational model first released in August 2023.

SeamlessM4T is a massively multilingual and multimodal machine translation model that supports approximately 100 languages. Alongside SeamlessM4T v2, the family of audio translation models includes SeamlessExpressive and SeamlessStreaming.

SeamlessExpressive is a model designed to preserve expression in speech-to-speech translation. It can preserve a speaker’s emotion and style while also maintaining the speaker’s speech rate and pauses for rhythm. SeamlessStreaming is a model that generates translations while a speaker of a different language is still talking, with only about two seconds of latency. SeamlessExpressive and SeamlessStreaming are combined into Seamless, a unified model featuring real-time multilingual and expressive translation.
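
The streaming behavior comes down to an incremental read/write loop: the model consumes small chunks of source audio and decides, chunk by chunk, when it has enough context to emit translated output. The sketch below shows that loop with a dummy stand-in for the real model; the class and its fixed waiting policy are illustrative only, not Meta's implementation.

    class DummyTranslator:
        """Stand-in for an incremental translation model."""
        def __init__(self, min_context=4):
            self.buffer = []
            self.min_context = min_context  # chunks to wait before emitting

        def read(self, chunk):
            self.buffer.append(chunk)

        def ready_to_emit(self):
            # A real model applies a learned read/write policy here;
            # the dummy simply waits for a fixed amount of context.
            return len(self.buffer) >= self.min_context

        def emit(self):
            context, self.buffer = self.buffer, []
            return f"<translation of {len(context)} chunks>"

    def streaming_translate(audio_chunks, translator):
        """Yield partial translations while audio is still arriving."""
        for chunk in audio_chunks:
            translator.read(chunk)
            while translator.ready_to_emit():
                yield translator.emit()

    # With 0.5-second chunks and min_context=4, the first output arrives
    # after roughly two seconds of speech, matching the latency cited above.
    for partial in streaming_translate(range(12), DummyTranslator()):
        print(partial)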

Emu is an AI image foundation model used for text-to-image generation. Emu generates high-quality images through a “quality tuning” process. Unlike traditional text-to-image models that are trained solely on large numbers of image-text pairs, Emu focuses on “aesthetic alignment” after pre-training, using a relatively small set of exceptionally visually appealing images.
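
One plausible reading of quality tuning is a short, low-learning-rate fine-tune on the curated set, reusing the same objective as pre-training. The sketch below assumes a generic model exposing a training_loss method; that method and the data loader are placeholders for illustration, not Emu's actual training code.

    import torch

    def quality_tune(model, curated_loader, max_steps=2000, lr=1e-5):
        """Fine-tune a pre-trained text-to-image model on curated pairs.

        curated_loader yields (caption, image) pairs from a small set
        chosen purely for visual appeal, in contrast to the billions
        of pairs used during pre-training.
        """
        opt = torch.optim.AdamW(model.parameters(), lr=lr)  # small lr: nudge, don't retrain
        for step, (caption, image) in enumerate(curated_loader):
            loss = model.training_loss(caption, image)  # placeholder objective
            opt.zero_grad()
            loss.backward()
            opt.step()
            if step + 1 >= max_steps:
                break
        return model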

Emu’s pre-training dataset is massive: 1.1 billion text-image pairs collected from Instagram and Facebook. Further additions to the base model include Emu Video and Emu Edit. Emu Video is a high-quality text-to-video generator based on diffusion models; its unified architecture for video generation tasks can respond to a variety of inputs: text only, image only, or both text and image. Emu Edit is a precise image editing model that is also capable of free-form editing through instructions. It can perform local and global editing, remove or add backgrounds, change colors, apply geometry transformations, and handle detection, segmentation, and more. Emu Edit follows instructions precisely and ensures that pixels unrelated to the editing instruction remain untouched. It was trained on a dataset of 10 million synthesized samples, the largest of its kind to date.
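
In use, instruction-based editing of this kind would look something like the sketch below; the emu_edit module and every name in it are hypothetical, invented to illustrate the behavior described above rather than any published Meta API.

    from emu_edit import load_model  # hypothetical package

    editor = load_model("emu-edit")  # hypothetical checkpoint

    edited = editor.edit(
        image="park.jpg",
        instruction="Replace the bench with a red bicycle",
    )
    # Pixels unrelated to the instruction (sky, trees, path) should come
    # back unchanged, per the precision property described above.
    edited.save("park_edited.jpg")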

Tags: AI Technology 
