Nvidia unveils a new AI model to 'revolutionize' the audio industry: it can create music and modify voices

Posted by 芊芊551 on 2024-12-8 18:59:55


According to reports, Nvidia has developed a new artificial intelligence (AI) model that can create sound effects, change how people's voices sound, and generate music from natural language prompts.
The model is named Fugatto, short for Foundational Generative Audio Transformer Opus 1, and is a research project. Nvidia said it is not announcing any plans to release the technology, but it could have a wide-ranging impact on industries from music and entertainment to translation services.
Bryan Catanzaro, Vice President of Applied Deep Learning Research at Nvidia, said in an interview, "The most exciting thing about Fugatto is that it is a model you can simply ask to make sound in a certain way, which really opens up your imagination about what it could be used for."
He further explained that among other models on the market, some can synthesize speech and some can add sound effects to music, but Fugatto can do all of these. Catanzaro said it can be seen as a complement to video and image generation models such as Stability AI's Stable Video Diffusion or OpenAI's Sora.
"The most fundamental improvement here is... we are able to use language to synthesize audio, which I believe opens up new prospects for tools that people can use to create amazing audio," he added.
According to Nvidia, Fugatto is the first foundation model to show emergent properties, meaning it can combine elements it was trained on and follow "free-form instructions".
Specifically, the model can generate audio from standard text prompts and can also process audio files you upload. So, if you have a recording of someone speaking, you can translate that person's words into another language while keeping it sounding like their voice. You can also take a simple tune and make it sound like an orchestral performance, or add different beats to a piece of music.
In addition, you can upload a document for the model to read aloud in any voice you like. You can also instruct the model to produce sounds with an emotional quality.
However, Catanzaro added that the model is not always perfect. Moreover, like models that generate images and videos, Fugatto raises concerns among artists, sound engineers, and professionals in related fields. Catanzaro said his hope is that the technology will help musicians.
"I hope this is a new tool for artists to explore. I think audio has always been a productive field of exploration. You know, when we acquire new audio tools, sometimes we acquire new forms of music," he said.