Max threads supported by Cubase?

Hi everyone,

I want to upgrade my workstation to a dual Xeon system with 20 cores and 40 threads,
but I'm wondering whether Cubase 8 supports all CPU threads.
If it doesn't support 40 threads, I'll choose a 16-core / 32-thread system instead.

In theory Cubase will support as many cores as you have, and there is some evidence this is improving. However, many people have found that CPUs with better individual core performance work better with Cubase, as there seems to be a bias towards overloading core 0. I guess the software will only get better in this area.

From my observations Cubase seems to start a new thread for every distinct track. However, all the overhead of bigger and bigger projects seems to put increasing load on core 0, though I have absolutely no proof of this other than what Resource Monitor seems to show. Reaper and Mixbus 3 seem to spread load more evenly across all cores on my machine. So in summary, I think Cubase multicore support is pretty good but could be better.

I don't know for sure whether Cubase does, but I recall reading that a Steinberg forum mod asked, a year or two ago, whether anyone was using a system with 16 cores or more; Steinberg was interested in experience with that many cores.

I hope someone does have experience with that many cores and will share it with us.
But also consider that your OS could have limitations on CPU core usage.

You have to keep in mind that all the cores/CPUs in an SMP system share the same memory bus. Memory is already a bottleneck on a single-core system, which is why fast cache memories are used. The bottleneck is so dominant that modern systems have three levels of memory cache, with increasing size and decreasing speed.

The problem with a multicore system is that memory bandwidth is divided between the cores. Only if the data being processed stays within the individual core caches can you use the processing power fully. That is rarely achieved even on single-processor systems, as CPU caches are still small compared to the typical data volumes processed today. It gets worse, though, as the CPU cache is also shared between the cores on that CPU. Only the lowest-level cache, by far the smallest, is private to each core.

This means that unless you do a lot of processing on a rather small data set, your multicore system is waiting for data most of the time, and the speed that data is delivered at is the same as for a single-core system (*). In addition, many computational problems do not scale well with parallel processing, because the tasks that need to be processed are not independent. Multi-track audio processing typically scales to some extent, but far from ideally. In terms of data flow, however, audio processing is pretty much the worst case for multiprocessing, because you need to stream data, necessarily invalidating the caches regularly.

That said, with current computer system architecture, Cubase (or any other audio processing app) will not be able to scale well. Practically this means you can expect to gain performance from the first four cores; beyond that it depends on the exact processing. Adding more cores can actually make things worse. But even if you gain something, it will be drastically sub-linear.
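The sub-linear scaling described above is essentially Amdahl's law. A quick Python illustration (the 80% parallel fraction is an assumed figure for the sake of the example, not a measurement of any DAW):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: theoretical speedup when only part of the work parallelises."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Hypothetical workload where 80% of the processing parallelises perfectly.
for cores in (1, 2, 4, 8, 16, 40):
    print(cores, round(amdahl_speedup(0.8, cores), 2))
```

With these assumptions, four cores give a 2.5x speedup but 40 cores still stay under 5x: the serial fraction dominates long before you run out of cores, which is why gains beyond the first few cores depend so heavily on the exact processing.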

Therefore, investing in a modern architecture with lower core count is usually the best idea for DAWs.



(*) OK, this is not exactly true, but almost. There are CPUs designed specifically for SMP systems with several memory buses or channels. For example, modern Xeon systems use a quad-channel memory system that allows four concurrent memory reads, but only when each of those reads targets a different memory module. This happens rarely in practice and makes the average relative performance closer to 1 than to 4. So unless you can reserve memory individually for each processing thread in a specific memory range known to have its own channel, and you keep the processing thread locked to a specific core, there's much less benefit than you'd expect. Multiprocessing is great if you have a problem that fits it in both problem decomposability and data layout. The vast majority of problems don't.

Well Jazz, take a program like Handbrake: a video encoder working on one file to produce one output file, with therefore many potential clashes over the same resources. You wouldn't expect that to scale very well, but it does, extremely well.

The latency of waiting for memory access compared to waiting for any form of I/O is trivial, as is the volume of data passing through to be processed, even for large projects. DAWs like lots of RAM, but putting faster RAM in makes almost no difference, which proves the point that memory latency is not an issue for DAW applications. I suspect audio processing generally uses a very tight set of instructions, most of which will be sitting in level 1 or 2 cache. IMHO the hardest thing to achieve is co-ordinating all the running processes (tracks), and if that's tied to one core to make programming easier, that will be your limiting factor. I think this is what hurts Cubase, and it's also why Cubase runs better on processors with fewer but faster cores; this is not the case for other DAWs.

Video encoding is in fact one of the tasks that scale well. All current encoders work on finite tiles that are processed independently. Also, the processing-to-data ratio is rather good, so you can perform lots of operations within the same cache context. Delivering the video data is not the bottleneck in this case; it's the processing. But you can saturate the processing if you employ enough cores.
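The tile idea can be sketched roughly like this; `process_tile` is a stand-in for real per-tile encoding work, and the names are illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor

def process_tile(tile):
    # Stand-in for per-tile encoding work: each tile is independent,
    # so tiles can run on separate cores with no shared writes.
    return sum(tile)

def encode_frame(frame, tile_size):
    # Split a flat "frame" buffer into contiguous tiles, then
    # process the tiles in parallel.
    tiles = [frame[i:i + tile_size] for i in range(0, len(frame), tile_size)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(process_tile, tiles))

frame = list(range(16))           # toy 16-sample "frame"
print(encode_frame(frame, 4))     # → [6, 22, 38, 54]
```

The key property is that no tile reads or writes another tile's data, so the work decomposes cleanly and the contiguous layout keeps memory accesses cache-friendly.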

Your argument about data throughput doesn't work. It's not just the amount of data, but the ratio of data read/write operations to processing operations, plus the dependency relations between the data, together with cache invalidations. Even if the data throughput is small, data that is not in the cache and has to be loaded there first is going to be slow. The memory transfer rates you read in processor data sheets are burst rates that you can only achieve for large-block, core-exclusive memory transactions.

What makes things worse is that the typical system scheduler's unpredictability forces audio DSP processes to assume rather large worst-case response times, which come with correspondingly big data granularity. That means you need to load and invalidate cache lines through all three cache levels all the time, producing far more memory transactions than just reading/writing the audio data once.
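As a rough illustration of the buffer-size/latency trade-off behind that granularity argument (the numbers are illustrative, not any DAW's actual settings):

```python
def buffer_latency_ms(buffer_samples, sample_rate):
    # One processing block's worth of audio, in milliseconds.
    return 1000.0 * buffer_samples / sample_rate

# A DSP graph must finish each block before the next deadline. If the OS
# scheduler can delay a thread by several milliseconds, the buffer must be
# at least that large, which increases the data moved per thread wakeup.
for n in (64, 256, 1024):
    print(n, round(buffer_latency_ms(n, 48000), 2), "ms")
```

At 48 kHz, a 64-sample buffer is about 1.33 ms of audio while a 1024-sample buffer is about 21.33 ms: the less predictable the scheduler, the bigger the blocks, and the more cache traffic each block generates.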

Scheduling, or “co-ordinating” as you say, the audio processes is not that hard. If you have a dependency graph of the different jobs, all you need to do is walk down that graph and assign each job to a waiting thread in a pool. If you're clever, you can even make sure a job runs on the CPU/core that already has the required audio data in cache, by preferring the same pool thread for a directly dependent job and working with core affinities. I'm sure Steinberg has that figured out quite well.
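The graph walk described here can be sketched as a toy topological scheduler. This is a minimal sketch, not Steinberg's implementation; the mixer graph and job names are hypothetical:

```python
from collections import defaultdict, deque
from concurrent.futures import ThreadPoolExecutor

def run_graph(jobs, deps, workers=4):
    """Run jobs (name -> callable) respecting deps (name -> prerequisite names).

    A job is handed to the pool as soon as all its prerequisites have
    finished, mirroring the "walk the dependency graph and assign each
    job to a waiting thread" idea.
    """
    remaining = {j: len(deps.get(j, [])) for j in jobs}
    dependents = defaultdict(list)
    for job, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(job)

    order = []
    ready = deque(j for j, n in remaining.items() if n == 0)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while ready:
            batch = list(ready)          # every job whose inputs are ready
            ready.clear()
            for job, _ in zip(batch, pool.map(lambda j: jobs[j](), batch)):
                order.append(job)
                for d in dependents[job]:
                    remaining[d] -= 1
                    if remaining[d] == 0:
                        ready.append(d)
    return order

# Toy mixer graph: two tracks feed a bus, the bus feeds the master.
jobs = {name: (lambda name=name: name) for name in ("track1", "track2", "bus", "master")}
deps = {"bus": ["track1", "track2"], "master": ["bus"]}
print(run_graph(jobs, deps))
```

Note that the two independent track jobs run in the same pool batch, while the bus and master jobs each wait for their inputs; a real engine would additionally pin descendant jobs to the core holding the data in cache.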

As part of my job I’ve been designing signal processing environments on both embedded dedicated systems and high level consumer OSes and SMP architectures. I have spent a lot of time understanding the problems of modern parallel computing and I can assure you, they’re not obvious. I doubt that you share the same insight or experience. Otherwise you would recognise what I’m talking about.



Allow me to add the following, as I think it has not become quite clear from my previous post. Memory coherence is one of the biggest helpers when it comes to efficient parallelisation on a multi-core system, and one of the key differences between the example of a video encoder and the signal processing inside a DAW.

The video frame that is tiled up and sent to the different independent processing threads lies in a contiguous block of memory. That means the memory system works at the highest possible efficiency, allowing fast block burst transfers into the caches, and most importantly, the data in shared cache levels is also shared by the jobs running on the corresponding cores. That means you get very few cache misses and practically no full cache invalidations at all.

A more or less freely routable audio signal processing environment in a DAW faces a completely different situation. Blocks of data are scattered all over the available memory, the small granularity of the processing blocks makes bulk transfers rare, and caches need to be flushed regularly. Scattered memory blocks and incoherent data drastically increase the memory overhead and are among the biggest issues when it comes to scaling performance across additional processing units.



added some info:

The somewhat higher-end SMP Xeon systems have a NUMA architecture, which means each processor socket has its own memory channels and a region of local memory reserved to it.
Other sockets can only reach that memory indirectly, with a performance penalty, and cannot interfere with its memory channels. This is typically server hardware but may apply to high-end workstations based on SMP Xeon motherboards (Mac Pro?). HP servers have NUMA as standard; it is common practice in database server environments.

The high-end Dell, Lenovo and HP workstations have had this feature for some years now.

Yes, this is what I was also referring to in the footnote in one of my posts above, and the limitations I listed there apply. It's nearly useless for audio DSP in a DAW environment, however, because the buffers that need to be processed cannot be allocated with core affinity; they're handed through the processing chain.

The big problem with this kind of architecture is that if one processor needs to access another processor's local memory, there are severe performance penalties on both sides. So the architecture is only really useful if you have tasks and memory that are separated and independent. Databases typically are, which is why you find this architecture mostly in dedicated database servers.



All modern CPU architectures manage cache coherence over some sort of high-speed bus (HyperTransport in the case of my AMD CPU), so you do not have to go out to main memory just because the data is sitting in another core's L1 cache. We knew these problems 40-odd years ago, after Amdahl published his “law”. My point is that even spread across many threads, the data throughput of even 300 DAW tracks will come nowhere near saturating a multicore CPU's memory interface. Anecdotal perhaps, but I have intentionally slowed memory performance to note the effect it has on DAW performance; very little is the answer.

Of course adding more cores will give you diminishing returns, as a portion of the workload will always be serial. This effect is markedly different from DAW to DAW, which is why I pointed out that Steinberg still have some work to do in this area. Yes, you will get millions of stalls waiting for data, but again, the tiny volume of data we are talking about means it's irrelevant; the processors will still easily keep up despite the wasted cycles.

Well, if you insist. Why don't you use your superior insight to prove the entire industry wrong and design a scalable low-latency audio routing/mixing engine for SMP systems? I'd love to see your attempt. My guess is that you'll be very surprised how much you have to learn.



I'm not proving anyone wrong; other DAWs have already proved me right. On Reaper and Mixbus 3 I can add track after track and the (8-core) loading will be very evenly spread. On Cubase I seem to hit a single-core limit quickly, causing audio dropouts and glitches. This comes down to how well the serial nature of the task management is handled. Although it must be said, the latest 8.5 patch has improved this for me.

You can be as condescending as you like, but you are missing some very simple points. Firstly, I have written assembler (and in some applications machine code!) commercially for 40 years now; I do understand how processors work. Secondly, I have already acknowledged the diminishing returns you'll get with more cores.

Rewind to the good old days: we were happily running dozens of tracks plus VSTs on our DAWs with weedy processors, single-threading on task-switching operating systems. How bad can a stall be compared to that? What does this tell you? It tells you that on a track-by-track basis the performance isn't that demanding. As long as each core, stalls and all, can achieve the performance of, say, a 300MHz Pentium II reading data over a 33MHz 16-bit interface, it'll get the job done! This is why, as long as the serial nature of process management doesn't become the limiting factor (on one core) and stall the whole output stream, DAWs should scale very well indeed. They won't be efficient (which doesn't matter in this application, see below), but they will scale.

You are obviously coding signalling applications where every lost cycle counts; I've written similar applications around military RADAR signal processing. This is not the problem we face with DAWs: we need to cobble together a ~500KByte-per-second stream of data from perhaps hundreds of ~200KByte-per-second streams. Even if a 2.5+GHz core is spending most of its time stalled waiting for data, this workload is trivial, and the throughput plus processing requirements will be easily achieved.
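Taking the poster's figures at face value, the arithmetic looks like this; the 25 GB/s main-memory bandwidth is an assumed round number for a modern system, not a quoted spec:

```python
# Poster's figures: hundreds of ~200 KByte/s track streams mixed
# into a ~500 KByte/s output stream.
tracks = 300
per_track_bps = 200 * 1024          # ~200 KByte/s per track
total_bps = tracks * per_track_bps  # aggregate raw audio throughput

mem_bandwidth_bps = 25 * 1024**3    # assumed ~25 GB/s main-memory bandwidth

print(f"{total_bps / 1024**2:.1f} MB/s of audio "
      f"({100 * total_bps / mem_bandwidth_bps:.3f}% of memory bandwidth)")
```

Even 300 such tracks amount to under 60 MB/s, a fraction of a percent of the assumed bandwidth, which is the raw-throughput side of this argument; the opposing posts above argue that cache behaviour, not raw throughput, is the real cost.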

For the reasons you gave, DAWs don't sound like good multicore candidates, but given the light data-throughput and processing requirements on each core (track), it actually turns out to be the opposite. As long as you keep process management tight, DAWs should scale pretty well.