Microsoft's new AI makes the Mona Lisa rap. How does it work?

(CNN) — Microsoft's new artificial intelligence technology can make the Mona Lisa do more than smile.

Last week, Microsoft researchers unveiled a new artificial intelligence model that can take a still image of a person's face and an audio clip of someone speaking and automatically create a realistic video of that person speaking. Videos can be created from animated faces, cartoons or illustrations – complete with solid lip sync and natural face and head movements.

In a demonstration video, the researchers showed how they animated the Mona Lisa to sing a comedic rap by actress Anne Hathaway.

The results of the AI model, called VASA-1, are as entertaining as they are a little jarring in their realism. According to Microsoft, the technology could be used in education, to “improve accessibility for people with communication difficulties” or to create virtual assistants for humans. But it's easy to see how the tool could be misused to impersonate real people.

This is a concern that goes beyond Microsoft: As more tools emerge to create AI-generated images, videos and audio, experts are concerned that their misuse could lead to new forms of misinformation. Some also worry that the technology could further disrupt creative industries, from film to advertising.

For now, Microsoft does not plan to release VASA-1 to the public. The move is similar to how Microsoft partner OpenAI is handling concerns surrounding its AI-generated video tool, Sora. OpenAI unveiled Sora in February, but so far it has only been made available to a handful of professional users and cybersecurity academics for testing purposes.

“We oppose any behavior that creates misleading or harmful content from real people,” Microsoft researchers said in a blog post. They added that the company has no plans to publicly release the product “until we are confident that the technology will be used responsibly and in accordance with appropriate regulations.”

Faces move

Microsoft's new AI model was trained on numerous videos of people's faces while speaking, and was designed to “recognize natural movements of the face and head, including lip movement, (non-lip) expression, gaze and blinking,” the researchers explained. As a result, when VASA-1 animates a still photo, the video looks more realistic.

For example, in one demo clip, the talking face has furrowed brows and pursed lips, appearing agitated while playing video games.

The AI tool can also be directed to create a video in which the subject looks in a certain direction or expresses a certain emotion.

If you look closely, there are still signs that the videos are machine-generated, such as infrequent blinking and exaggerated eyebrow movements. But Microsoft believes its model “outperforms” other similar tools and “paves the way for real-time interaction with realistic avatars that mimic human conversational behaviors.”