ASIO Guard and threading issues are not just a Windows issue

So……

I bought a nice Mac Studio M3 Ultra…

Very expensive here in the UK… My reasoning was that my 10-core M1 Max machine threaded really well and could in some instances outperform my 16-core/32-thread 9950X. So I presumed that the issues were to do with Windows/Cubase… so the option, it seemed, was to buy a Mac Studio…

Well… after setting it all up… I fired up my Mix bus test…

Exactly the same issue on macOS with a high core count CPU… Facepalm smiley… that’s an expensive experiment for me here…

Check out the screen grab. The project was idle when I took it, and it was just about hitting the red when in play… Look how much CPU is NOT being used again. The exact same issue I was seeing in Windows with my 9950X.

Eye-roll smiley

Now, if this were a single-core issue, i.e. the buses each used a single core, then surely you’d see 8 cores maxing out. But you don’t: no single core is above 30%, so why is ASIO Guard at 100%?

This is definitely something to do with bussing IMHO, but I’m not sure what. If I remove the busses from this project, I can duplicate the entire project into hundreds of tracks/plugins and it threads fine with 3x as many plugins overall. As soon as you start routing to busses etc., ASIO Guard falls over.

As a comparison, with the exact project replicated in Reaper I can add over 100 mix bus plugins no problem… I can also do the same in Cubase if I host them locally in AudioGridder.


Are you saying you’re seeing different behavior on your M1?

What are your channel latencies?

Oh man, that’s one expensive gamble, partly due to the Macheads with larger systems refusing to help with the testing over the last 18 months.

Mic Drop !

I understand now why some were so reluctant to help, because I suspect they already knew it was also broken on macOS!

They keep cycling back to it all being about non-parallelization of serial processing, etc., but then, as you noted, why isn’t the processing being properly assigned across at least the number of busses?

Yeh, as soon as groups/busses are introduced the issue is twofold: one part being the thread management and scheduling routines, and the second being the premature accumulation and overrun of ASIOGuard, which have to be directly related.

And no, we don’t have to know the exact under-the-bonnet code behind why it’s happening; we can simply observe it in real-world (RW) application!

The above wasn’t directed at you obviously Mate :slight_smile:

Yep, there should be no need for 3rd-party hosts to tap into the remaining resource overheads, and it’s happening across a large range of session layouts and workflows.

Not all, but enough to be a significant problem.

I have 3 end-user RW sessions now replicated in DAWbench, yours and 2 others, all different, all collapsing ASIOGuard prematurely, and all with some level of Grouping/Busses. The most recent is Magnus’ session, which he posted about on threads and which I reached out to help with; it is now also a DAWbench session using alternate 3rd-party plugins, to take Acoustica out of the equation.

The end result was identical to his RW project, where we even took AudioGridder out of the equation to be 100% on the same page.

It’s also worth noting that Magnus’ session is pretty light re Grouping/Bussing; it’s just a few Groups, nowhere near as complex as your session, and it still fell over.

Steinberg are well aware of what is collapsing ASIOGuard, and now that it’s also evident on higher-core Macs, this has just got a whole lot messier!


Yes, a lower core count means Cubase threads more evenly across the board, so opening a similar mix on the M1 would show all 10 cores probably nearing a max of 90%, which would more or less align with the ASIO Guard meter.

This is the reason I presumed this was a Windows issue and went ahead and bought the M3 Ultra…

M


Can you upload the project you’re using for this test?

I find it rather comforting that there is no major difference between Mac and Windows, because I always thought that the general code of the audio processing engine should be more or less identical on both platforms, abstracted away from the specific threading APIs… So it makes sense to me.

I have done my own test, and had a look at Cubase threads in WinDbg…

On my system (AMD 7950X with 32 HT), Cubase creates exactly 32 threads named “Audio Prefetch 0-31”:

 228  Id: 4cdc.6378 Suspend: 1 Teb: 00000000`209d2000 Unfrozen "Audio Prefetch 0"

I assume those are the ASIOGuard threads, one for each core, maybe with a lower priority than the real-time threads, which are called “Audio Realtime 1-14”. Why 14? No idea. Maybe 16 physical cores minus two for other tasks.

Then there are two “Audio Prefetch Trigger” threads, and six “Audio Catchup 0-5”. No idea what they are good for.

Test 1

I created a test project with 8 audio tracks going into 8 groups, loaded with the very CPU intensive IK Tape 440 plugin, one instance per group. No realtime usage, no control room.

You can roughly see where the plugin threads run. There are two or three more than the 8 groups, but that is to be expected. ASIO load is ~30%, a bit more than the highest core saturation, but OK, not far off.

Test 2

8 audio tracks going into 8 groups, two plugin instances each.

The number of loaded cores is roughly the same as with Test 1, just with more usage.

ASIO load hasn’t even doubled. The most saturated core is maybe 60-65%.

I think this confirms that Cubase - as we assumed - processes plugins on one channel always serially in one thread, effectively calling the processing functions in a loop.

From a computing point of view, this is very efficient: there is very little overhead, and you benefit from cache and memory locality.

The disadvantage is that the other cores stay underutilized, but if you want to spread the plugins of one channel evenly on other threads, you potentially lose the cache and memory locality, and things get really complicated with thread and data synchronization, which definitely introduces overhead of its own. Maybe there are other ways that I don’t know, maybe it would be worth it for modern CPUs, I cannot say, I am not a developer.
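As a rough illustration of the model described above (plugins on one channel run serially in one thread, while separate channels are spread across threads), here is a minimal Python sketch. It is not Cubase’s code: `make_gain`, `process_channel` and `process_block` are hypothetical names, and the “plugins” are trivial gain stages standing in for real processing.

```python
from concurrent.futures import ThreadPoolExecutor


def make_gain(factor):
    """Hypothetical stand-in for a plugin: scales every sample by `factor`."""
    def process(buffer):
        return [sample * factor for sample in buffer]
    return process


def process_channel(plugins, buffer):
    # Plugins on one channel run back to back in the same thread:
    # each plugin consumes the output of the previous one, so the
    # chain cannot be split without extra synchronization.
    for plugin in plugins:
        buffer = plugin(buffer)
    return buffer


def process_block(channels, buffer, workers):
    # Channels do not depend on each other here, so each one can be
    # handed to a different worker thread.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_channel, plugins, list(buffer))
                   for plugins in channels]
        return [future.result() for future in futures]


# 8 groups with two "plugin" instances each, as in Test 2.
channels = [[make_gain(0.5), make_gain(2.0)] for _ in range(8)]
out = process_block(channels, [1.0, -1.0, 0.25], workers=8)
print(len(out), out[0])
```

Adding a second plugin to a channel lengthens that channel’s loop rather than occupying another thread, which matches the observation that Test 2 loads roughly the same number of cores as Test 1, only harder.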

Test 3

16 audio tracks going into 16 groups with one plugin each.

I count ca. 19 rather loaded cores, three more than the groups. Probably whatever else Cubase needs to do, dunno. But this test shows that it scales fine in width.

ASIO load is ~75%, most loaded core around 65%, similar to Test 2. Why ASIO load is higher here I cannot explain; maybe internal thread synchronization overhead.

Test 4

For fun: 32 tracks into 32 groups, one plugin.

This is completely overloading. All cores saturated to 90%+. Somewhat expected.

24/24 works, with ASIO load ~90%, but playback OK.

Note that I didn’t really check for dropouts; this was more a simple scaling test to see how plugins get distributed to cores. If you have more ASIOGuard tracks than prefetch threads, things will be different again. I assume some of you have done similar tests…


@fese That makes sense and is what I’d expect to see, but if you look at my screen grab, ASIO Guard is maxed and no core is above 30%. This was exactly the same on my 9950X machine: ASIO Guard was maxed and no core was above 20-30%.

From the test project, loading the busses as we did, I’d expect to see each of those busses on an individual core, maxing out in the task manager… This wasn’t the case though. You can see, similar to your results, that the busses/groups are going to individual cores, with the extra ones like you said, but they’re only at 20-30% when ASIO Guard is maxed out… That’s the issue.

M

I think @TAFKAT hasn’t released this yet as part of the DAWbench package. He’ll probably reply to this and give you an update as to when it will be.

M


Yes, I noticed that, and I also fail to find a reasonable explanation for this… It probably depends very much on how the rest of the project is structured: routings, plugins on source tracks and so on.

I still think it might be related to thread synchronization, as Cubase needs to synchronize the real-time threads with the prefetch threads, do the summing and in the end write the correctly calculated buffer to the ASIO driver. This is something that AudioGridder, for example, doesn’t need to care about at all: it just needs to return the values whenever it is ready, with its extra latency, and Cubase has to do the hard work of making sure the whole plugin graph is correctly in time and in sync. It is no wonder that AG is more efficient here, but of course for us users it would be desirable if Cubase could handle those situations more efficiently itself.
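To make that hypothesis concrete, here is a toy Python model, purely a sketch under invented assumptions and not a description of how Cubase actually schedules work. It assumes processing happens in dependent stages (tracks, then groups, then a master bus), that each stage must finish before the next can start, and that every stage boundary costs a fixed synchronization wait. The function name `model` and all timing numbers are hypothetical.

```python
def model(stages, sync_overhead_us, period_us):
    """Toy model: `stages` is a list of stages, each a list of per-task
    costs in microseconds, with one task per core. All values are invented."""
    n_cores = max(len(tasks) for tasks in stages)
    busy = [0] * n_cores
    elapsed = 0
    for tasks in stages:
        for core, cost in enumerate(tasks):
            busy[core] += cost          # time this core actually computes
        # The next stage cannot start until the slowest task is done,
        # plus an assumed fixed wait for thread synchronization.
        elapsed += max(tasks) + sync_overhead_us
    deadline_load = elapsed / period_us       # what a deadline-style meter would show
    core_util = [b / period_us for b in busy]  # what a per-core CPU graph would show
    return deadline_load, core_util


# 8 tracks -> 8 groups -> 1 master bus, with a 10 ms buffer period.
stages = [[1000] * 8, [2000] * 8, [1000]]
load, util = model(stages, sync_overhead_us=2000, period_us=10000)
print(f"deadline load: {load:.0%}, busiest core: {max(util):.0%}")
# prints "deadline load: 100%, busiest core: 40%"
```

In this model the buffer deadline is fully consumed while no core is more than 40% busy, because the waits between dependent stages count against the deadline but not against CPU utilization. Whether anything like this is what happens inside the Cubase engine is exactly the open question.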

In the end, this is of course something only the Steinberg engine devs can solve. Matthias Quellmann has already confirmed that this isn’t an easy task for them (understandably, with code that is probably in parts 25+ years old), and of course they have been burned in the past by changes to the audio engine.

It is most likely also a management prioritization thing, because as annoying as this may be for power users, it doesn’t necessarily sell updates…


Wow, O.K, each to their own I guess.

Realtime threads are locked to MMCSS, which is limited to 14 total threads available within the Cubendo engine. This was covered extensively during the whole MMCSS thread-limiting mess in 2017-18. MMCSS in W10+ is limited to 32 total threads; 4 are reserved, leaving 28, and each call takes 2 threads, leaving 14 Logical Cores Total*.
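The arithmetic behind that figure, written out as a small Python snippet. The numbers are the ones quoted in this post; the variable names are only for illustration.

```python
# Figures as quoted above for MMCSS on Windows 10 and later.
MMCSS_TOTAL_THREADS = 32
RESERVED_THREADS = 4
THREADS_PER_CALL = 2

available = MMCSS_TOTAL_THREADS - RESERVED_THREADS  # 32 - 4 = 28
realtime_threads = available // THREADS_PER_CALL    # 28 / 2 = 14

print(available, realtime_threads)  # prints "28 14"
```

That lines up with the “Audio Realtime 1-14” threads seen in WinDbg earlier in the thread.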

See Steinberg article below.

Your tests are no different to what we have done already using my standard DAWbench empirical saturation tests, where we can easily configure/replicate your style of tests to show preferable thread management routines, which Steinberg just dismiss as “theoretical”.

This is where the DAWbench MIX test (which I developed using a RW session of Marcus’) comes into play. The session (and numerous others I have since developed) is based on real-world session logistics/layouts and workflows, all different but all triggering the ASIOGuard accumulation and premature overrun. This has been the focus of numerous other threads here in recent weeks.

This thread, however, is specifically about the thread management behavior also being evident on macOS, which we believed was not the case until Marcus hit the same wall with his new Ultra system.

Why are you posting about your 7950X here ?


Sorry, I’m not familiar with the backstory on Windows. So if this issue doesn’t occur on a 10-core Mac, what is the core count where it becomes noticeable?

That’s a good point and one I’ve been thinking about too. I would suspect it will be the same on Intel/AMD.

It would be interesting to test a lower-core Intel/AMD CPU and see if that exhibits the same behaviour.

Interesting that disabling hyperthreading didn’t help in any way; I remember that was one of the first things I tested when I became aware of this.

M

M

Not really, Marcus. I have a 12600K (6 P-cores/12 threads plus 4 E-cores) dev system that displays similar behavior, and also a 12-core Snapdragon X WoA laptop that shows the same dynamic. Whatever is causing this is inherent to Cubendo’s threading routines. The only thing that has changed is that initially we thought it was confined to Windows, but you have now discovered that it’s also misbehaving on the higher-core Macs.

Probably still above whatever trigger/ceiling they have imposed?

If you remember, I did a stack of testing on a 10980XE, dropping core counts, disabling HT and monitoring the threading behaviour, and it was all over the place.

But I digress, let’s focus on the topic at hand here.

Thread management is not behaving as expected on the higher-core Macs. I am doing my best to be polite, LOL!

The reason I ask is: if this is only apparent when using an Ultra AS CPU, it may be difficult to find many people who could attempt to reproduce your results.

By the way, and again I don’t know the backstory of this on Windows, so this is probably a stupid, naive question, but isn’t the simple explanation that this is just an affinity issue?

One thread that is maxed out which is moving across cores could give the results you’re seeing, right? That might also explain why this only happens when you do bussing to intentionally unbalance the thread loads. Sorry if this is something that has already been discussed.
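A tiny Python simulation of that idea, as a sketch only: it assumes a single thread that is busy 100% of the time and a scheduler that moves it to the next core on every time slice. Strict round-robin migration is a deliberate simplification (real schedulers do not behave this neatly), and `per_core_utilisation` is a hypothetical helper name.

```python
def per_core_utilisation(n_cores, n_slices):
    """One always-busy thread, moved to the next core on every time slice
    (simplified round-robin migration, assumed for illustration)."""
    busy = [0] * n_cores
    for time_slice in range(n_slices):
        busy[time_slice % n_cores] += 1  # the thread runs somewhere in every slice
    return [b / n_slices for b in busy]


util = per_core_utilisation(n_cores=4, n_slices=1000)
print("thread busy 100% of the time, per-core utilisation:", util)
# prints "thread busy 100% of the time, per-core utilisation: [0.25, 0.25, 0.25, 0.25]"
```

The thread is saturated, yet a per-core graph shows every core at 25%. A per-core view alone therefore cannot rule this explanation in or out; a per-thread view is needed for that.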

No, look at the screen grab; you can clearly see no thread is higher than 20-30%.

m

I see cores that are running at no more than 30%, but unless I’m missing something, I don’t see anything that shows the cpu utilization of threads.

The cores are the threads; there’s no hyperthreading on ARM CPUs… same as Intel these days. You just have performance (P) and efficiency (E) cores. My M3 Ultra has 20 P cores and 8 E cores.

M

The word “thread” means different things depending on the context, so it can sometimes be confusing to use that word without the additional context.

CPU companies talk about threads of execution on the silicon. HT means two threads per core, non-HT means one thread per core. “CPU Threads”

But OS threads are a different thing. Any given OS running right now has hundreds, if not thousands, of simultaneous threads running. The threads have different priorities and different allocated time-slices. “OS Threads” or “Application Threads”. If you talk with a software developer, this is what they usually mean by “Thread”. Similarly, this is what Task Manager shows in the Details tab if you show the Threads column.

  • Opening up Cubase 15 and without creating a new project, there are 256 threads created. (when I create an empty project, this jumps to 300)
  • Excel has 220 threads created on my PC right now
  • Visual Studio has 193 threads
  • etc.

(Background services tend to use fewer threads because they don’t have Window/User interaction to deal with)

In an ideal world, there would be a single OS thread for a single CPU thread. But operating systems are multi-tasking, and the apps running on them have many different units of work to account for.
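For anyone who wants to see the difference for themselves, here is a minimal Python sketch that compares the number of logical CPUs with the number of threads a single process can have alive at once. Note that `threading.active_count()` only counts threads started through Python’s `threading` module, not every OS thread in the process, and the figure of 64 workers is arbitrary.

```python
import os
import threading

# Event the workers wait on, so they stay alive but idle.
stop = threading.Event()

# Start 64 OS threads in this one process; they sleep until `stop` is set.
workers = [threading.Thread(target=stop.wait, daemon=True) for _ in range(64)]
for worker in workers:
    worker.start()

logical_cpus = os.cpu_count()           # "CPU threads" visible to the OS (may be None)
os_threads = threading.active_count()   # Python-level threads alive, incl. the main thread

print(f"logical CPUs: {logical_cpus}, threads alive in this process: {os_threads}")

# Let the workers finish and clean up.
stop.set()
for worker in workers:
    worker.join()
```

On most machines the second number will exceed the first, and the OS scheduler decides which of those threads get time on which core.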

Pete
Microsoft
