Deep Learning AI Generates Convincing Deepfake Videos of Mona Lisa

More often than not, the promise of artificial intelligence is portrayed as something positive and cutting-edge: after all, who doesn't like self-driving cars and smart machines that can quickly diagnose cancer or predict earthquake aftershocks?
Of course, a tool is only as good as the intentions of the person using it, and AI can certainly be used for less altruistic ends. A prime example is the phenomenon of deepfakes, in which images and footage of real people are synthesized into convincing fake videos using deep learning techniques. Such eerily convincing counterfeit videos have already been used as revenge porn, and could also be used to generate hoaxes or even "fake news" videos of politicians, making deepfakes a potentially dangerous weapon in the wrong hands.
Fortunately, there are some limitations to the process. The deep learning models used in such applications typically need prohibitively large amounts of training data and plenty of GPU-hours to generate deepfakes; some of these models juggle millions of distinct parameters. However, such constraints may soon fall away, as researchers at the Samsung AI Center in Moscow recently demonstrated with a series of jaw-droppingly impressive deepfake videos produced from only a handful of images, including some created from well-known historical paintings. Behold:
Out of the Uncanny Valley
As the researchers point out in their preprint paper, synthesizing photorealistic talking-head models is no easy task, as human heads are geometrically complex, especially when they are in motion. Beyond that, there is the added complexity of modeling clothes, hair and the subject's mouth while speaking. Moreover, the human eye is quite sensitive to even minor discrepancies in an artificial representation, so rather than creating a sense of affinity, a badly executed model can instead evoke an eerie revulsion in the viewer, a response known as the uncanny valley effect.
To overcome these problems, the team utilized a convolutional neural network (CNN or ConvNet), a type of deep learning neural network that is typically used for analyzing visual images. In addition, the team's system is refined using techniques gleaned from generative adversarial networks (GANs), a type of machine learning system in which two neural networks are pitted against each other, working concurrently to generate images that look increasingly close to the original training images. Not surprisingly, GANs have been used to produce so-called adversarial images that can fool both humans and computers.
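To make that adversarial setup concrete, here is a minimal sketch of a single GAN training step. It is written in PyTorch with toy network shapes and random stand-in data; none of this is the team's actual code, only an illustration of the generator-versus-discriminator dynamic described above:

```python
# Minimal GAN training step: a sketch, not the paper's implementation.
# The network sizes and the random "images" are toy assumptions.
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 28 * 28

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),  # a single real/fake logit
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_images = torch.rand(32, image_dim)  # stand-in for a training batch

# Discriminator step: learn to tell real images from generated ones.
fake_images = generator(torch.randn(32, latent_dim)).detach()
d_loss = (bce(discriminator(real_images), torch.ones(32, 1)) +
          bce(discriminator(fake_images), torch.zeros(32, 1)))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: produce images the discriminator scores as real.
g_loss = bce(discriminator(generator(torch.randn(32, latent_dim))),
             torch.ones(32, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```

Repeating these two alternating steps is what drives each network to improve against the other, the dynamic that powers the team's system.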
By designing the system in this way, the team achieved what is called "few-shot learning" capability, where the model "learns" from only a few images before going on to synthesize completely new, artificially generated images. In fact, the system is even capable of "one-shot learning," where it can generate a reasonable result from only one source image, though adding more images increases the accuracy of the final representation.
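One simple way to picture few-shot operation is an embedder network that maps each of the K available source images to an identity vector and then averages them; with K = 1 this reduces to the one-shot case. The `Embedder` module below is a hypothetical stand-in for illustration, not the paper's architecture:

```python
# Sketch of few-shot identity embedding: average one vector per source
# image. The Embedder architecture here is a hypothetical stand-in.
import torch
import torch.nn as nn

class Embedder(nn.Module):
    """Maps one RGB image to a fixed-length identity vector."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):
        return self.net(x)

embedder = Embedder()
# K source images of the same person (toy data); K = 1 is one-shot.
source_images = torch.rand(8, 3, 256, 256)
identity = embedder(source_images).mean(dim=0)  # average over the K shots
print(identity.shape)  # torch.Size([512])
```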
The system works by first processing the source image or images through an off-the-shelf "face landmark tracker" algorithm, which traces the placement of the subject's eyes, eyebrows, nose, jaw and lips. The system's few-shot learning capabilities are then gained through an extensive "meta-learning" stage, during which it is exposed to videos of a completely different person, whose face landmarks are similarly extracted, frame by frame.
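As a sketch of what that off-the-shelf step could look like, here is landmark extraction with the widely used dlib library. The paper does not specify this particular tracker, and the predictor model file is an assumption (dlib's 68-point predictor must be downloaded separately from dlib.net):

```python
# Sketch of off-the-shelf facial landmark extraction using dlib.
# This is one common stand-in, not necessarily the tracker the team used.
import dlib

detector = dlib.get_frontal_face_detector()
# Assumes the 68-point predictor file has been downloaded beforehand.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = dlib.load_rgb_image("frame.jpg")  # a single video frame
for face in detector(image):
    shape = predictor(image, face)
    # 68 (x, y) points covering the eyes, eyebrows, nose, jaw and lips.
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```

Run per frame over a driving video, this yields the landmark sequences that the meta-learning stage consumes.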
Working much like a tightly coupled generative adversarial network, the system can then produce a completely new video of the source subject, a "learned talking head." First, an "embedder" network translates the facial landmarks taken from the videos into vectors; a "generator" network then adapts those vectors to create a sequence of moving images based on the original photo. Yet another network, the "discriminator," serves as the adversarial component: it learns to judge whether the videos created by the generator are real or fake. Its verdicts are fed back into the system, essentially upping the ante and pushing the generator to produce ever more realistic results that can 'fool' the discriminator.
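To make that three-network flow concrete, here is a heavily simplified PyTorch sketch of one generated frame passing through the pipeline. The layer sizes, the `Generator` class, and the idea of rasterizing landmarks into a three-channel sketch image are illustrative assumptions, not the paper's actual architecture:

```python
# Simplified sketch of the generator/discriminator flow; the identity
# vector is assumed to come from an embedder like the one sketched above.
import torch
import torch.nn as nn

embed_dim, H, W = 512, 256, 256

class Generator(nn.Module):
    """Hypothetical generator: landmark sketch + identity -> RGB frame."""
    def __init__(self):
        super().__init__()
        # 3 landmark-sketch channels plus the broadcast identity vector.
        self.net = nn.Sequential(
            nn.Conv2d(3 + embed_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, landmarks, identity):
        # Tile the identity embedding across every spatial location.
        id_map = identity.view(1, embed_dim, 1, 1).expand(
            landmarks.size(0), embed_dim, H, W)
        return self.net(torch.cat([landmarks, id_map], dim=1))

discriminator = nn.Sequential(
    # Judges (frame, landmark-sketch) pairs: real footage vs. generated.
    nn.Conv2d(6, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 3, stride=2, padding=1),
)

identity = torch.randn(embed_dim)           # from the embedder network
driving_landmarks = torch.rand(1, 3, H, W)  # rasterized landmark sketch
frame = Generator()(driving_landmarks, identity)
realism = discriminator(torch.cat([frame, driving_landmarks], dim=1))
```

In the full system, the discriminator's realism scores feed the adversarial training loop sketched earlier, which is what steadily sharpens the generator's output.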

Diagram showing the system architecture.
As one can see, the system spits out some pretty remarkable results, from synthetic videos of Marilyn Monroe, Salvador Dali, Rasputin and Einstein, to the rather disconcerting moving images of an ever-smiling Mona Lisa. As one might expect, the team's novel system is a huge improvement over comparable systems, achieving "perfect realism" scores in a user study when using 32 source images. As the team explains, such an approach could ultimately be used for special effects in films, or to create photorealistic, animated avatars for users of telepresence applications such as video conferencing or multiplayer games, and of course, more deepfakes that are unsettlingly indistinguishable from the real thing.
Read more in the team's preprint paper, "Few-Shot Adversarial Learning of Realistic Neural Talking Head Models."