Software can be structured to spread processing fairly evenly across cores, or it can confine too much work to one thread, so that the core running that thread is busier than the others. Hopefully, the way the load is split across threads improves with each version.
When I tested Cubase, I ran it on its own, so that it would tend to follow the OS's core-allocation preference. At the time (possibly before Windows 8.1), that preference seemed to be to use the lowest-numbered available core first, and the behaviour was fairly consistent. However, if other services or apps had been running, the low-numbered cores could well have been busy when I started Cubase, which would have given inconsistent results.
As you hint, proper evaluative testing is a PITA to design and set up, and it is hard to get consistent, reliable results from it. There are too many unknown inputs, and every element keeps evolving as its designers ‘move the cheese’.