So…we have Spleeter, Demucs, various 3rd party code variations…and apparently…Peter Jackson’s team’s secret demixing code.
While I primarily use SpectraLayers for various manual signal cleanup tasks, I often use its Spleeter-tweaked routine for some types of quick demixing. Then I move on to other 3rd party solutions to decide if one thing works better than another.
I’ve been listening to Giles’ work on Beatles stuff, including Revolver. I’m not personally convinced that the PJ code is all that many light years ahead of everything else.
But if it is…how and more importantly…why?
There are certainly more brains out there working on demix code than those on the PJ team.
Anything up your sleeve, Robin?
I don’t see any particular reason PJ would have a better demixing model than the known state of the art at a given time. AI for demixing is making progress (new papers on the subject appear almost every month), and sometimes a particular AI model can fit into a product, but generally speaking the technology progresses at approximately the same pace for everyone.
To be honest, it doesn’t matter how many algorithms are created; I don’t believe A.I. (or any “secret algorithms”) will ever perfect it. Music has far too many variations, and so many variables play a part, that even the best algorithms struggle to produce a clean extraction.
The noise profile alone (that’s all the effects, reverb, delay, distortion and white noise) makes it damn near impossible to get a clean extraction. The best algorithms would have to distinguish noise (white noise) from the other sources. With speech (for example “the cocktail party problem”), it’s easier for algorithms to separate sources (because the human voice is distinctive and you can trace two different sources within a recording across the spectrum), but when you add music, other instruments and noise, even the best algorithms struggle.
Obviously we are just speculating - but you have to remember that Peter Jackson had access to the vast resources of Weta when making Get Back.
Their developers have extensive experience of AI/machine learning, and I think it’s very likely they used Spleeter, but who knows how they optimised it for that very specific job.
They also had the co-operation of Apple (Records!), so they had access to lots of very specific Beatles material to further train it, plus lots (and lots) of very high-powered compute to train it on. The amount of compute needed for audio is a tiny percentage of what is needed for video.
It’s possible that this training means that this version only works well for Beatles records so it may not have any advantages for a general user.
I don’t know what ‘perfect’ is, but I think it’s shortsighted to think ‘near perfect’ isn’t possible. The results using Spleeter can already be astounding - and if somebody had demonstrated that 30 years ago it would have felt like magic.
AI or machine learning isn’t a simple (or secret) linear algorithm - it (basically!) ‘learns’ what to do from ‘marking its own homework’, and it improves…it’s getting better and better all the time - and compute power continues to increase at an incredible rate…
The amount of compute needed for audio is a tiny percentage of what is needed for video.
Not true. In fact, audio is just as intensive on CPU power as video (with all the effects/after effects and post processing). Resampling and interpolation in real time alone cause the best server-grade GPUs to struggle. Morphing audio in REAL TIME (especially at the PCM wave level) is intensive. When you factor in other things (such as de-bleeding or convolution reverb, which have to be done offline), it should be clear that audio is just as intensive on compute power as video.
I’m guessing you haven’t spent much time in the world of Weta Digital (VFX)? If you think you can do that kind of VFX in real time, I’d like to see what kind of equipment you have at home.
Just so you’re aware, rendering a single frame of VFX in a movie @ 4K (1/24th of a second!) could easily take a couple of hours…or very occasionally a couple of days.
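To put rough numbers on the audio-versus-video argument, here is a back-of-the-envelope sketch comparing raw data rates. The formats are assumptions, not something anyone in the thread specified: CD-quality stereo audio (44.1 kHz, 16-bit) and uncompressed 8-bit RGB video at 3840x2160 and 24 frames per second. Raw data rate is only a crude proxy for compute cost, so take it as an order-of-magnitude illustration rather than a measurement of either workload.

```python
# Assumed formats (not from the thread): CD-quality stereo audio and
# uncompressed 8-bit RGB UHD video at 24 frames per second.
AUDIO_SAMPLE_RATE = 44_100      # samples per second, per channel
AUDIO_CHANNELS = 2              # stereo
AUDIO_BYTES_PER_SAMPLE = 2      # 16-bit PCM

VIDEO_WIDTH, VIDEO_HEIGHT = 3840, 2160
VIDEO_BYTES_PER_PIXEL = 3       # 8 bits each for R, G, B
VIDEO_FPS = 24


def audio_bytes_per_second() -> int:
    """Raw PCM data rate for the assumed audio format."""
    return AUDIO_SAMPLE_RATE * AUDIO_CHANNELS * AUDIO_BYTES_PER_SAMPLE


def video_bytes_per_second() -> int:
    """Raw pixel data rate for the assumed video format."""
    return VIDEO_WIDTH * VIDEO_HEIGHT * VIDEO_BYTES_PER_PIXEL * VIDEO_FPS


ratio = video_bytes_per_second() / audio_bytes_per_second()
print(f"audio: {audio_bytes_per_second():,} bytes/s")
print(f"video: {video_bytes_per_second():,} bytes/s")
print(f"video carries roughly {ratio:,.0f}x more raw data per second")
```

This says nothing about how heavy a particular audio effect or a particular VFX render is; it only shows how much more raw material a video pipeline has to touch every second under these assumptions.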
Yeah, sometimes the algorithms get it almost right; however, it’s mostly when the music is simple, without a lot of variation or noise (for example music from the 70’s or 80’s). With today’s music (for example house music where the vocal is being sidechained/ducked under other instruments), it would be extremely difficult for the best algorithms to tell a vocal from the other instruments. Then when you factor in noise, the best algorithms would miss every time.
Try this experiment: look at white noise on a spectrogram, then look at a quiet part of a song (for example a filler or a breakdown where the vocal is fading out or in and a downlifter/upriser is playing at the same time), and tell me if you see much of a difference.
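For anyone who wants to try a numeric version of that comparison, here is a minimal sketch using spectral flatness (geometric mean of the power spectrum divided by its arithmetic mean), which is close to 0 for a pure tone and much higher for white noise. Everything in it is an assumption for illustration: the signals are synthetic (Gaussian noise, a sine wave, and a quiet sine buried in the same noise) rather than a real song, and a naive DFT is used so it runs with the standard library only.

```python
import cmath
import math
import random


def power_spectrum(signal):
    """Naive DFT power for bins 1 .. N/2 - 1 (DC and Nyquist are skipped)."""
    n = len(signal)
    powers = []
    for k in range(1, n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(signal))
        powers.append(abs(acc) ** 2)
    return powers


def spectral_flatness(signal, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum."""
    p = power_spectrum(signal)
    log_mean = sum(math.log(v + eps) for v in p) / len(p)
    return math.exp(log_mean) / (sum(p) / len(p) + eps)


N = 256
rng = random.Random(0)  # fixed seed so the run is repeatable

noise = [rng.gauss(0.0, 1.0) for _ in range(N)]                 # white noise
tone = [math.sin(2 * math.pi * 16 * t / N) for t in range(N)]   # pure tone
quiet_mix = [n + 0.1 * s for n, s in zip(noise, tone)]          # quiet tone in noise

flat_noise = spectral_flatness(noise)
flat_tone = spectral_flatness(tone)
flat_mix = spectral_flatness(quiet_mix)

print(f"white noise:         {flat_noise:.3f}")
print(f"pure tone:           {flat_tone:.3g}")
print(f"quiet tone in noise: {flat_mix:.3f}")
```

On this toy data the quiet tone buried in noise scores almost the same as the noise alone, which is the poster's point about quiet passages. It does not show what a trained separation model can or cannot recover from the same signal.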
Of course something simple is easier for the AI to learn; the processing is more straightforward and the results will usually be better.
The question isn’t whether I (a human!?) can tell the difference - it’s whether the AI can learn to tell the difference, and with an AI that is sophisticated enough and lots of computing power, it most likely can. There is nothing ‘special’ about digital audio - it’s just ones-and-noughts. All the data it needs is most likely in those ones-and-noughts; it just needs to extract it.
Hopefully we can agree that the current technology is amazing…it can definitely do things I couldn’t do manually with EQ or spectral editing. AI is better at many, many things than humans already. It sees (or hears) things that we can’t. It can process in ways that the human brain cannot.
That will continue to improve. More compute power and more LEARNING will ensure that. As Robin mentioned, there is lots of research in this area too…so the AI will improve.
I mean this very respectfully, but I suggest you take a look at Spleeter:
It comes “pre trained”, but you can train it further with your own examples. The more it learns the better it gets.
This is not some kind of fringe technology; this stuff is mainstream. Interesting times that we live in.
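For anyone who hasn’t tried it, here is a minimal sketch of running the pre-trained two-stem model (vocals and accompaniment) from Python. It assumes the `spleeter` package is installed and that the model can be downloaded on first use; the helper name `split_stems` and the file paths are hypothetical, chosen only for this example. Further training is a separate workflow and is not shown here.

```python
def split_stems(audio_path: str, output_dir: str) -> None:
    """Separate one file into vocals and accompaniment with Spleeter.

    Hypothetical helper: assumes the `spleeter` package is installed.
    The import is kept inside the function so that defining the helper
    does not require Spleeter to be present.
    """
    from spleeter.separator import Separator

    # 'spleeter:2stems' is the pre-trained vocals/accompaniment model.
    separator = Separator("spleeter:2stems")

    # Writes the separated stems as audio files under output_dir.
    separator.separate_to_file(audio_path, output_dir)


# Example call (hypothetical paths, not run here):
# split_stems("my_song.mp3", "separated")
```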