Machine Learning

MIT’s Deep Neural Network Reconstructs Faces Using Only Voice Audio

13 Jun 2019 10:48am

Even if we’ve never laid eyes on a person, the sound of their voice can convey a lot of information: whether they are male or female, old or young, or perhaps an accent hinting at where they hail from. But while we might hazard a rough guess at someone’s facial features, few of us could clearly piece together what someone’s face looks like from the sound of their voice alone.

However, it’s a different matter when machines are put to the task, as researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have discovered in developing an AI that can vividly reconstruct people’s faces with relatively impressive detail, using only short audio clips of their voices as reference.

The team explains in their preprint paper how they trained a deep neural network — a type of multilayered artificial neural network loosely inspired by the brain’s non-linear architecture — on millions of Internet videos featuring over 100,000 talking heads. It is from these videos that the team’s Speech2Face AI is able to “learn” the correlations between someone’s facial features and the sounds those features are most likely to produce.

“There is a strong connection between speech and appearance, part of which is a direct result of the mechanics of speech production: age, gender (which affects the pitch of our voice), the shape of the mouth, facial bone structure, thin or full lips — all can affect the sound we generate,” wrote the team. “In addition, other voice-appearance correlations stem from the way in which we talk: language, accent, speed, pronunciations — such properties of speech are often shared among nationalities and cultures, which can, in turn, translate to common physical features.”

Self-Supervised Machine Learning

While there has been previous work on predicting the associations between faces and voices, one of the big hurdles is that these approaches require humans to manually classify and label the audio input, linking it to some particular attribute, whether a facial feature, gender or age. As one might imagine, this is a costly and time-consuming process for human supervisors — not to mention that such an approach constrains the predicted face to a rigidly predefined set of facial attributes.

To overcome this limitation, Speech2Face uses self-supervised learning, a relatively new machine learning technique that is still considered a form of supervised learning, but one where the training labels are derived automatically from the data itself: the model learns the connections between co-occurring input signals, without those attributes ever having to be modeled explicitly. The approach is particularly suited to situations where an AI gathers information on its own in a dynamic, diverse environment such as the Internet.
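A minimal sketch of that idea, using toy data: a “voice encoder” (here just a linear map) is trained to regress the face embedding taken from a co-occurring video frame, so the supervision signal comes from the paired data itself rather than from human labels. All names, dimensions and the synthetic data below are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for video data: each "clip" yields a voice feature and a
# face embedding from a co-occurring frame. The face embedding is the
# training target, so no human annotation is needed (self-supervision).
n_pairs, voice_dim, face_dim = 200, 16, 8
true_map = rng.normal(size=(voice_dim, face_dim))      # hidden voice->face relation
voice_feats = rng.normal(size=(n_pairs, voice_dim))
face_embeds = voice_feats @ true_map + 0.01 * rng.normal(size=(n_pairs, face_dim))

# Minimal "voice encoder": one linear layer fit by gradient descent
# on an L2 loss against the paired face embeddings.
W = np.zeros((voice_dim, face_dim))
lr = 0.1
for _ in range(300):
    pred = voice_feats @ W
    grad = voice_feats.T @ (pred - face_embeds) / n_pairs
    W -= lr * grad

final_loss = np.mean((voice_feats @ W - face_embeds) ** 2)
print(f"mean squared error after training: {final_loss:.5f}")
```

The point of the sketch is the supervision source, not the architecture: swapping the linear map for a deep CNN changes the capacity of the encoder, but the labels still come “for free” from the co-occurrence of voice and face in the same video.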

Besides using self-supervised learning techniques, Speech2Face builds on VGG-Face, an existing face recognition model pre-trained on a large dataset of faces. Speech2Face’s “voice encoder” is a convolutional neural network (CNN) that processes a spectrogram, a visual representation of the audio in clips running three to six seconds in length. A separately trained “face decoder” then takes the resulting face features and generates a prediction of what the speaker’s face might look like. The system was trained on AVSpeech, a dataset of millions of speech-face pairs.
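As a rough illustration of what the voice encoder consumes, here is a minimal magnitude-spectrogram computation in plain NumPy. The framing parameters are arbitrary assumptions (not the paper’s), and a pure 440 Hz tone stands in for speech:

```python
import numpy as np

def spectrogram(waveform, frame_len=512, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform:
    frame the signal, apply a Hann window, and take the FFT magnitude
    of each frame -- yielding the 2-D 'image' a CNN encoder can consume."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([
        waveform[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # rows: freq bins, cols: time

# A 6-second clip at a 16 kHz sampling rate.
sr = 16000
t = np.arange(6 * sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (frequency bins, time frames)
```

A real pipeline would typically log-scale the magnitudes and feed the resulting 2-D array to convolutional layers, exactly as one would feed an image to an image-classification CNN.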

Comparing results: Original video screenshot of person speaking is in the first column; second column is reconstruction from image; third column is reconstruction from audio.

As one can see, some of the outputs from the team’s experiments bear an eerie likeness to the actual person, while others are a bit off. But overall, the results are quite impressive — even in special cases where someone might speak two different languages without an accent, the system was able to predict with relative accuracy the facial structure and even the ethnicity of the speaker.

As the team points out: “Our goal is not to predict a recognizable image of the exact face, but rather to capture dominant facial traits of the person that are correlated with the input speech.” Ultimately, such technology would be useful in a variety of situations, such as in telecommunications, where a reconstructed image or caricatured avatar of the person speaking might appear on the receiving cellular device, or in video-conferencing scenarios.

Images via MIT CSAIL.