How to Synthesize a Fake Obama Video with Artificial Neural Networks

It seems that hardly a day passes without someone proclaiming “fake news,” that now-infamous phrase that rose to prominence during the last American election and is now being bandied about ad nauseam.
But as any intelligent person knows, you can’t always believe what you read or see on (or off) the internet. Doctored images abound online, thanks to photo-editing software that lets people create staged situations that look real but never actually happened.
Now, with the help of artificial intelligence, we might be facing the prospect of an explosion of fake news videos too. At least that’s what we might assume from new findings by researchers at the University of Washington, who created this rather convincing but bogus video of former U.S. president Barack Obama. They used an artificial neural net trained on many hours of video footage featuring the former president, overlaid with an actual audio clip of him speaking last year about the Orlando mass shootings. Watch and see if you can determine what’s real and what’s not, and how it was done:
According to the researchers’ paper, they used what is called a recurrent neural network (RNN), a type of artificial neural network whose connections loop back on themselves, so that the result of processing one input carries over to the next. Like other neural networks, which arrange nodes of artificial neurons in a way loosely modeled on the human brain, these networks are fed massive amounts of data in order to “learn” how to perform a task or solve a problem.
We’ve seen recurrent neural networks applied to things like speech recognition and text-to-speech synthesis: anything that requires some kind of internal memory to process varying sequences of inputs.
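To make that idea of internal memory concrete, here is a minimal Python sketch of a single recurrent unit. It is not the researchers' network, which is far larger and trained on real footage; the function names and the weights `w_x`, `w_h` and `b` are made-up values chosen purely for illustration.

```python
import math


def rnn_step(x, h, w_x, w_h, b):
    """One step of a single-unit recurrent cell.

    The new state depends on the current input x AND the previous state h.
    """
    return math.tanh(w_x * x + w_h * h + b)


def run_rnn(inputs, w_x=0.5, w_h=0.9, b=0.0):
    """Feed a sequence through the cell, carrying the state from step to step."""
    h = 0.0  # the "memory" starts out empty
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


# Illustrative weights only (assumed, not learned from data).
states = run_rnn([1.0, 0.0, 0.0])
```

In this toy run only the first input is non-zero, yet the later states stay above zero: the unit is still "remembering" what it saw earlier, which is the property that makes recurrent networks suited to sequences such as speech.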
In this case, the researchers lifted the audio of Obama speaking in one video and dubbed it over another video of him in a completely different location. Using about 14 hours of footage in the public domain, sourced from Obama’s weekly addresses, the recurrent neural net was able to “learn” how to recreate a composite of the facial and mouth movements that corresponded to various sounds.
To do this, the neural network synthesized a “sparse mouth shape,” on top of which mouth textures could then be applied and blended into an altered target video, giving the talking head an appearance of natural movement. The result is an eerily plausible lip sync.
Surprisingly, though, this isn’t the first time that researchers have tried to do this kind of thing. As mentioned in the video above, there have been other versions of the same concept, but this time around, the University of Washington team added a time delay to the process to make the results look much more realistic.
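The article's sources do not spell out here how the delay is wired in. One plausible reading, and it is only an assumption in this sketch, is that the network gets to hear a little of the upcoming audio before it commits to a mouth shape. In Python, that amounts to pairing each audio frame with the mouth shape from a few frames earlier when preparing training examples. The function name `delayed_pairs` and the placeholder frame labels are hypothetical.

```python
def delayed_pairs(audio_features, mouth_shapes, delay):
    """Pair the audio frame at time t with the mouth shape at time t - delay.

    By the time the model must output the mouth shape for frame t - delay,
    it has already received audio up to frame t.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    if len(audio_features) != len(mouth_shapes):
        raise ValueError("audio and mouth sequences must have equal length")
    return [
        (audio_features[t], mouth_shapes[t - delay])
        for t in range(delay, len(audio_features))
    ]


# Placeholder labels stand in for real audio features and mouth shapes.
pairs = delayed_pairs(["a0", "a1", "a2", "a3"], ["m0", "m1", "m2", "m3"], delay=2)
# pairs == [("a2", "m0"), ("a3", "m1")]
```

The trade-off in a scheme like this is that the first few audio frames have no target, and the output always lags the audio by the chosen number of frames.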
In addition, the neural network focused on synthesizing the parts of the face most associated with producing speech — namely, the mouth and the surrounding area, lips and teeth, with special attention being paid to the subtle wrinkles and shadows in the skin that would be made while speaking. Even the jaw line is warped to match the chin in the target video.
“Given the mouth shape at each time instant, we synthesize high-quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track,” wrote the team. “Our approach produces photorealistic results.”
But manufacturing fake news isn’t the main intention here. The research team foresees that the technology could be used for other, more practical, applications.
“Realistic audio-to-video conversion has practical applications like improving video conferencing for meetings, as well as futuristic ones such as being able to hold a conversation with a historical figure in virtual reality by creating visuals just from audio,” said study co-author Ira Kemelmacher-Shlizerman on ScienceDaily. “This is the kind of breakthrough that will help enable those next steps.”
And even if the technology is used for manipulating the masses for political ends, that same technology can be used to determine whether a video is real or if it’s been faked — by detecting the blended teeth and mouth movements.
“This may be not noticeable by human eyes, but a program that compares the blurriness of the mouth region to the rest of the video can easily be developed and will work quite reliably,” paper co-author Supasorn Suwajanakorn told IEEE Spectrum.
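Suwajanakorn does not describe such a program in detail, so the following Python sketch is just one way the comparison might look. It assumes a grayscale frame stored as a list of rows and a mouth bounding box that is already known, and it uses the variance of a simple Laplacian filter as a sharpness score, which is a common heuristic chosen here for illustration rather than anything taken from the paper. All names are hypothetical.

```python
def laplacian_variance(gray, top, left, bottom, right):
    """Sharpness score: variance of a 4-neighbour Laplacian inside a region.

    Rows top..bottom-1 and columns left..right-1 are scored; pixels on the
    frame border are skipped because they lack a full set of neighbours.
    """
    values = []
    for r in range(max(top, 1), min(bottom, len(gray) - 1)):
        for c in range(max(left, 1), min(right, len(gray[0]) - 1)):
            lap = (gray[r - 1][c] + gray[r + 1][c]
                   + gray[r][c - 1] + gray[r][c + 1]
                   - 4 * gray[r][c])
            values.append(lap)
    if not values:
        raise ValueError("region too small to score")
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def mouth_blur_ratio(gray, mouth_box):
    """Mouth sharpness divided by whole-frame sharpness.

    Values well below 1.0 mean the mouth is blurrier than the frame overall.
    """
    top, left, bottom, right = mouth_box
    mouth = laplacian_variance(gray, top, left, bottom, right)
    whole = laplacian_variance(gray, 0, 0, len(gray), len(gray[0]))
    if whole == 0:
        return 1.0  # completely flat frame: no evidence either way
    return mouth / whole


# Synthetic 8x8 frame: sharp checkerboard with a flat (blurry) centre patch.
frame = [[255 if (r + c) % 2 else 0 for c in range(8)] for r in range(8)]
for r in range(2, 6):
    for c in range(2, 6):
        frame[r][c] = 128
ratio = mouth_blur_ratio(frame, (2, 2, 6, 6))
```

A real detector would also need to locate the mouth and pick a threshold for the ratio, both of which are left out of this sketch.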
Cold comfort, perhaps, but at least it’s a fair warning of what we might expect in the future.
Images: University of Washington