“It’s a safe prediction that generative AI will soon dominate the field of audio production. It’s only a matter of time before you can upload your mockup—of any quality—to the cloud, and generate multiple versions, all sounding like perfectly authentic orchestral recordings.”
…I noticed some skepticism in the replies regarding the potential of generative AI.
For this reason, I’d like to show the skeptics this project from just a few months ago: it’s called ViolinDiff (search for it on Google and GitHub). It includes a free, installable implementation, with audio examples.
As you can hear, the quality is impressive, and there’s really no doubt that it will become hyper-realistic very soon. Arne is absolutely right.
And this was done using just a simple MIDI! Just imagine what this will become once expressive input is added—like what NotePerformer already does.
I’m curious to hear your thoughts and to know if you’re aware of similar experiments involving other instruments.
I suggest you look into the links that illustrate the project in detail because your doubts really don’t make any sense.
Just as with VEO 3 you can already barely tell the difference between a real video and an artificial one, the same will happen with generative AI applied to sounds. And if something doesn’t convince you, you’ll tell the AI yourself what to modify.
“I really do not want to hear a realistically bad performance” <— this is exactly the opposite of what generative AI allows you to achieve, since you can intervene on every detail. So, I have to repeat myself: your observation makes no sense. If you are skeptical, that’s another matter and I respect your skepticism, but the statement above is simply false.
I think it’s just stubbornness and denial for most.
In time, AI-generated (or assisted, or whatever) music and video will be indistinguishable from real life for 98% of the population. Look at the tech advancements that have come in just the last 5 years. What will it be in 50 years? It will happen.
For example, no one sitting in a theater listening to an AI-assisted score is going to have any clue what’s happening.
“Hey wait, I think that third clarinet’s timbre is off…”
To put it in different words: a good performance isn’t about good or bad sounds; it happens when you can feel the performer as a person telling you a story of their life, or when a group of artists, like a choir or an orchestra, collectively convinces you to change your life.
These are exactly the points I am not convinced of. No, you cannot intervene on every detail by prompting. The AI model is just a black box mimicking the real performances submitted as training data. The first attempts generated by AI may sound very good; however, as soon as you want to modify the output a little bit, you are stuck in an endless loop of trial and error. Here, the combinatorics of possible interactions inevitably starts to show the artificial nature of AI. The model does not understand you. It just tries to provide you with the most probable answer.
I like it when generative AI outputs are transparent, so that you can freely modify the outcome to your liking. Therefore, if I wanted an AI-assisted mockup, I would ask the model to generate automation data for all the tracks. This is the correct and ethical use of AI. I would not want it to produce the final rendering from my score since this would be the wrong use of AI in my opinion.
Regarding the points Valsoslim is not convinced of: this is, indeed, stubbornness (sorry for using explicit words). The human brain does exactly this: it creates events probabilistically. In fact, no performance of any musical phrase is ever identical to another, even when played by the same person. What truly matters is CONVERGENCE toward a result that is considered correct, and rest assured, AI converges even more effectively than a human. Therefore, the probabilistic model is precisely what is needed to achieve a performance that is not only hyper-realistic, but also well-executed. There will be no need for trial and error; it will simply be a matter of choosing the response that is most convincing in terms of interpretation (unless a hallucination occurs, which would affect realism), from a set of increasingly convergent responses.
I think the main point is more that you, as a performer, composer, etc., enjoy the process of doing this, no matter if AI could do it faster or better. I don’t feel the same excitement and satisfaction if I hit a button and get the results immediately as when I work on something for months and then see the final results. I want to have the personal experience and fun of doing this myself.
I cannot see any artistic value in these attempts. Okay, we trained a generative AI model on a large number of human performances of a particular style, and prompted this model with harmonic changes and lyrics from another style. Here, I can imagine that no fine-tuning and no long interactions with humans were needed. They just took what the model generated and that was it.
To be clear, I am convinced that AI is able to produce a human-like performance (just like a WAV file is) but I am not sure that you can easily produce a realistic mockup directly with an AI renderer exactly as you want it as an author. In other words, the fine-tuning loop does not work with (current) models, IMHO.
Imitating to the point that we can no longer distinguish the copy from the original might be very interesting from a technical point of view. From an artistic one, it is totally irrelevant.
For things that are ‘secondary’, I would welcome AI tools. So: rendering a performance for a mockup with AI would be great if it’s better than what we already have. Likewise, I wouldn’t object to music for film being generated by AI (some people would argue that’s not secondary, I know).
But for things that are ‘primary’, I don’t think AI has artistic value, so AI creating a piece of music for performance in its own right would not interest me. Images generated by AI leave me cold.
AI is a tool, nothing more, in spite of the tech involved.
How the tool is used will determine the artistry of the result, and like all art, it will be subject to radically different reactions from various perspectives.
Beyond that, I doubt anyone knows enough about what is in store to be able to do more than speculate about what is ahead.