DeepMind, Google’s AI research lab, says it is developing AI technology to create audio clips for videos.
In a post on its official blog, DeepMind says it sees V2A (video-to-audio) technology as a key piece of the AI-generated media puzzle. While many organizations, including DeepMind, have developed AI models for video generation, those models can't create sound effects synced to the videos they generate.
“Video generation models are advancing at an amazing pace, but many current systems can only generate silent output,” DeepMind writes. “V2A technology [could] become a promising approach for bringing generated movies to life.”
DeepMind’s V2A technology takes a description of a soundtrack (e.g., “jellyfish pulsating underwater, marine life, ocean”) paired with a video to create music, sound effects, and even dialogue that matches the characters and tone of the video, watermarked by DeepMind’s deepfake-combating SynthID technology. DeepMind says the AI model powering V2A, a diffusion model, was trained on a combination of sounds and dialogue transcripts as well as video clips.
“By training on video, audio, and additional annotations, our technology learns to associate specific audio events with different visual scenes, responding to information provided in the annotations or text,” according to DeepMind.
Mum’s the word on whether any of the training data was copyrighted, and whether the data’s creators were informed of DeepMind’s work. We’ve reached out to DeepMind for clarification and will update this post if we hear back.
AI-powered sound-generating tools aren’t new. Startup Stability AI released one just last week, and ElevenLabs launched one in May. Nor are models to create video sound effects. A Microsoft project can generate talking and singing videos from a still image, and platforms like Pika and GenreX have trained models to take a video and make a best guess at what music or effects are appropriate in a given scene.
But DeepMind claims its V2A technology is unique in that it can understand the raw pixels of a video and sync generated sounds with the video automatically, optionally without a description.
V2A isn’t perfect, and DeepMind acknowledges this. Because the underlying model wasn’t trained on many videos containing artifacts or distortions, it doesn’t create particularly high-quality audio for those clips. And in general, the generated audio isn’t especially convincing; my colleague Natasha Lomas described it as “a smorgasbord of stereotypical sounds,” and I can’t say I disagree.
For these reasons, and to prevent misuse, DeepMind says it won’t release the technology to the public anytime soon, if ever.
“To ensure our V2A technology can have a positive impact on the creative community, we gather diverse perspectives and insights from leading creators and filmmakers, and use this valuable feedback to guide our ongoing research and development,” DeepMind wrote. “Before we consider opening access to the wider public, our V2A technology will undergo rigorous safety evaluations and testing.”
DeepMind pitches its V2A technology as an especially useful tool for archivists and people working with historical footage. But generative AI along these lines also threatens to upend the film and TV industry. It’s going to take some seriously strong labor protections to ensure that generative media tools don’t eliminate jobs, or, as the case may be, entire professions.