Description of parameters around FFT-Size

Is there a detailed explanation of the use and effects for the

  • FFT Size
  • Resolution
  • Refinement
  • Window function for the FFT. (How do they affect the outcome of the FFT transformation?)
    available somewhere?
    I would like to see an exact definition of these parameters from a mathematical point of view, in order to understand exactly what they do during processing.
    OK, FFT size is fairly clear, but the other two… I am not so sure.
    How do they actually influence sound quality when using Unmix/Process/Transform and selecting?

Numerical issues are also of interest.
For instance, when unmixing into tonal, transient and noise components, the three layers should add up to the original, right?
But in reality, they do not. There are differences from the original, slight but measurable. Those differences also depend on the parameters mentioned above.
So what exactly is going on here?

I have a scientific background, so the basics are clear. But I am not an expert on FFT analysis.


Different in what way? Could you explain exactly how? The sum of the audio should always equal the original (unless you specifically do something to change the phase).

As far as FFT explanations and definitions go, I’ll let @Robin_Lobel explain, because (if I remember correctly) earlier versions of SpectraLayers actually allowed an x8 resolution, and I’m not sure why it was reduced to only x4. Only @Robin_Lobel can explain that, and I’d also be curious to know (although I assume it was for performance reasons).

1 Like

Check easily for yourself.
32-bit/192 kHz: unmix into tonal, transient and noise. Save each layer separately, reload them into Audacity, add them up and subtract the original file.
There are slight residues.

The residues also depend on the FFT size and the FFT window type.
So, yes, there are numerical issues here.

And performance should not be an issue here anymore.
The FFT and the calculations around it can be handled with ease by any modern PC.
Multicore CPUs and graphics cards with massively parallel linear algebra are no problem anymore.
Maybe Steinberg should implement some more advanced algorithms?


Can you upload an example here so I can test for myself (preferably the same file you claim you’re receiving residuals from)?

Sure, do you have an email address so I can send it via WeTransfer?

But you can also create your own example:

  1. Take an (original) sound file and import it into SpectraLayers
  2. Unmix Components (Tonal, Transient, Noise)
  3. Export the three files (Tonal, Transient, Noise)
  4. Start Audacity
  5. Load the original sound file into Audacity
  6. Also load the three files (Tonal, Transient, Noise) into Audacity
  7. Invert the original sound file
  8. Mix all four tracks down into one file. This should be completely silent.
  9. Reload this file into Audacity and amplify it by 30 dB or more.
    → There you can see and hear some residues.
    Those residues also depend on the chosen FFT size and FFT window function.
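
The arithmetic of this null test can also be simulated in NumPy without any audio software. The "layers" below are a crude hypothetical split (not a real unmix) that sums back to the original exactly in exact arithmetic, so the only residue left is the rounding introduced by "exporting" each layer at 32-bit float precision; a real unmix can leave much larger residues:

```python
import numpy as np

rng = np.random.default_rng(1)
original = rng.standard_normal(48000).astype(np.float32)  # stand-in for the source file

# Stand-ins for the three layers: a crude split computed in float64 whose
# parts sum back to the original exactly in exact arithmetic.
x = original.astype(np.float64)
tonal64     = np.convolve(x, np.ones(32) / 32, mode="same")  # smoothed part
transient64 = np.diff(x, prepend=x[:1]) * 0.5                # derivative part
noise64     = x - tonal64 - transient64                      # remainder

# "Export" each layer at 32-bit float precision, as the files on disk would be.
tonal, transient, noise = (a.astype(np.float32) for a in (tonal64, transient64, noise64))

# The null test from the steps above: inverted original plus the three layers.
residue = -original + tonal + transient + noise
peak_db = 20 * np.log10(max(float(np.max(np.abs(residue))), 1e-30))
print(f"residual peak: {peak_db:.1f} dBFS")  # rounding floor only, well below -100 dB
```

Anything substantially louder than this rounding floor points at the processing itself rather than at file precision.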

And for this reason alone, I would like to see a deep and complete description of the algorithms used in SpectraLayers, and of the use and effects of FFT size and window function for all transformations in this (fantastic) program.
Since I and others here have a scientific background, there is no need to be superficial or sketchy in the description.

Okay! So you’re right. There seems to be some type of sample rate/bit depth issue. Usually those problems occur during conversion between different bit depths.

It does not have to do with sample rate or bit depth. There seem to be some numerical or fundamental issues with the FFT transformations in both directions.
Suppose, for instance, an FFT is made with one window function and the inverse operation uses another window function; the outcome will not be exactly the same as the input.
But that’s not the issue here. The sound file gets split into three layers (Tonal, Transient, Noise), which in theory should add up to the original. But they do not, because of numerical issues during the calculation (I guess).
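
The window-mismatch effect described here is easy to check numerically. In this generic NumPy sketch (not SpectraLayers code), a periodic Hann window at 50% overlap satisfies the constant-overlap-add condition, so a plain FFT → iFFT → overlap-add round trip reconstructs the signal almost exactly, while a Blackman window at the same overlap does not:

```python
import numpy as np

N, hop = 1024, 512  # frame length, 50% overlap
n = np.arange(N)
hann     = 0.5  - 0.5 * np.cos(2 * np.pi * n / N)   # periodic Hann
blackman = (0.42 - 0.5 * np.cos(2 * np.pi * n / N)
                 + 0.08 * np.cos(4 * np.pi * n / N))  # periodic Blackman

rng = np.random.default_rng(0)
x = rng.standard_normal(32 * hop)

def roundtrip_error(window):
    """Analysis window -> FFT -> iFFT -> plain overlap-add (no normalization)."""
    y = np.zeros_like(x)
    for start in range(0, len(x) - N + 1, hop):
        y[start:start + N] += np.fft.irfft(np.fft.rfft(window * x[start:start + N]))
    mid = slice(N, len(x) - N)  # skip the partially covered edges
    return float(np.max(np.abs(y[mid] - x[mid])))

print(roundtrip_error(hann))      # tiny: Hann at 50% overlap sums to a constant (COLA)
print(roundtrip_error(blackman))  # large: Blackman at 50% overlap does not
```

In practice STFT processors compensate for this with a matching synthesis window and normalization, which is why window choice alone should not cause audible residues.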

Maybe there are more sophisticated algorithms, but I don’t know. I am not that deep into the FFT and its applications.

It would be nice if Steinberg were more transparent about what SpectraLayers is actually doing, from an algorithmic and numerical point of view.
Maybe Steinberg has a closed policy here (proprietary code) and it will remain a mystery.
So, from a scientific point of view, the actual transformations of the sound files could probably be done much more accurately, at the cost of more computation time.
I wonder if iZotope RX or Adobe Audition, or even more advanced software, can perform the FFT transformations more accurately than SpectraLayers.
But of course, SpectraLayers is not intended for a scientific community first and foremost, but rather for the musical artist. So that might be excusable.

Nevertheless, SpectraLayers is a fantastic program with ingenious concepts and very good overall usability.


Ohhhhhhh Yes! I forgot about that.

So! I still believe the issue is, fundamentally, a bit depth issue at its core. When you mentioned the FFT, it brought back to my memory that changing the FFT size does indeed affect how those components are output. I don’t believe this issue has anything to do with algorithms at its root.

Of course, if you’re a “scientist” (like you say), you can recreate the experiment in a controlled environment (pun intended). Take a 32-bit audio file (preferably music, preferably with high dynamics), convert the bit depth to 24 or 16 bits, flip the phase against the original 32-bit file and merge the two; if done correctly, you will experience those same oddities.
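
That bit-depth experiment can be sketched with a toy signal instead of a music file (file I/O omitted): quantizing to 16-bit and nulling against the 32-bit original leaves a residual right around the 16-bit quantization floor:

```python
import numpy as np

rng = np.random.default_rng(7)
x32 = (rng.standard_normal(44100) * 0.2).astype(np.float32)  # stand-in for a 32-bit file

# "Convert" to 16-bit PCM and back, as a lossy bit-depth change would.
x16 = np.round(np.clip(x32, -1.0, 1.0) * 32767).astype(np.int16) / 32767.0

# Phase-flip the 32-bit original against the 16-bit copy and merge.
residual = x16.astype(np.float32) - x32
peak_db = 20 * np.log10(float(np.max(np.abs(residual))))
print(f"quantization residual peak: {peak_db:.1f} dBFS")  # roughly -96 dB for 16-bit
```

So a pure bit-depth artifact sits near -96 dBFS (or about -144 dBFS for 24-bit); residues much louder than that would need another explanation.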

@Marc_von_Bredow Definition of the parameters:

  • FFT Size, you know about it
  • Resolution is a combination of increased time overlap and frequency overlap, balanced to provide a progressive increase in precision and computation load. Resolution 1: time overlap x4, freq overlap x1; Resolution 2: time overlap x8, freq overlap x1; Resolution 3: time overlap x8, freq overlap x2; Resolution 4: time overlap x16, freq overlap x2
  • Refinement is a proprietary algorithm that makes the spectrogram look sharper, but it doesn’t actually change the underlying data or analysis; it’s just a visual effect on top of the spectrogram
  • Window function is common knowledge, like the FFT Size, but if I had to define its impact I would say that it influences the dynamic range you can expect from a spectrogram analysis, and the thickness of the frequency lines. Its role is somewhat close to the role of the FFT Size.
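
One plausible reading of those overlap factors (this is an interpretation for illustration, not confirmed SpectraLayers internals) is that the time overlap shrinks the STFT hop size while the frequency overlap corresponds to a zero-padded, finer-spaced transform:

```python
# Hypothetical translation of the stated overlap factors into STFT parameters.
fft_size = 2048      # samples
sample_rate = 44100  # Hz

resolutions = {1: (4, 1), 2: (8, 1), 3: (8, 2), 4: (16, 2)}  # (time, freq) overlap

for res, (t_ov, f_ov) in resolutions.items():
    hop = fft_size // t_ov                    # frames advance by fewer samples
    bin_hz = sample_rate / (fft_size * f_ov)  # finer bin spacing via zero-padding
    frames_per_s = sample_rate / hop
    print(f"Resolution {res}: hop {hop:4d} samples, "
          f"{frames_per_s:6.1f} frames/s, bin spacing {bin_hz:5.2f} Hz")
```

Under this reading, Resolution 4 computes four times as many frames per second as Resolution 1, and at twice the bin density, which matches the stated precision/computation trade-off.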

Regarding your null test, I’m not sure why you went to Audacity; doing it all in SpectraLayers would prove to you that there’s no data left over, damaged or residual. Just open a file, duplicate the layer, unmix its components, invert the phase of the first layer (the untouched one) and activate the Composite View: you’ll see that it perfectly sums to zero.

1 Like

I have to carefully disagree here and also want to raise awareness to some possible issues.
I did some tests myself due to this thread (all null tests done within SpectraLayers), and found that there are indeed differences, although most of them are at very low volume (below -150 dB), so I wouldn’t consider them a problem at all.
What caught my attention more were some somewhat regular spikes/pops/clicks that reminded me of another issue from a few years ago in SpectraLayers 7/8: Strange clicks after editing.

So I did multiple tests with Unmix > Song, Components and Levels at different sample rates and always the same spectral view settings (FFT Size 2048 samples; FFT Window Hann; Resolution x3; Refinement 0%; Amplitude Min -210 dB to Max -18 dB).
The source file was always the same (a lossy OGG Vorbis file, about 93 seconds duration, stereo, 44100 Hz, decoded in SpectraLayers as 32 bit float), sample rate conversion was done in SpectraLayers.
The frequency axis is always displayed linearly at full range; the waveform is zoomed to about -57 dB / -60 dB.

Unmix Song
Not sure if there even is a realistic use case for 32 kHz, but this produces the most artifacts, especially in the high frequencies (maybe due to sample rate conversion?). At 44.1 kHz there are only two short paired spikes with a consistent repetition. The spikes peak around the -60 dB mark.
At 48 kHz and above, the unmixed result is basically perfect (ignoring the very low residual noise).

Unmix Components
With increasing sample rate, a strange pattern emerges with an increasing number of spikes. Peaks up to around -20 dB.

Unmix Levels
The level threshold for the process was set to -75 dB. Pretty consistent pattern of artifacts; doubling the sample rate doubles the spikes. Peaks around -57 dB.

Yes, these peaks are overall not that loud and will most likely not be noticed in a mix. Still, I can hear them pretty clearly when isolated via a null test (especially when there is mainly low-frequency content). Probably OK from a musical standpoint, not ideal from a scientific standpoint (to get back to the original post). Maybe other spectral view settings will result in better processing.
So unless I’m the only one getting these artifacts, there is still room for improvements.

Windows 11 Pro 23H2 | SpectraLayers Pro 10.0.50 Standalone with active GPU acceleration | AMD Ryzen 5900X | RTX 3080

1 Like

Thanks a lot for this detailed analysis. I also noted the spikes. So in summary, we come to the conclusion that after unmixing, the files do NOT add up to the original file. There ARE artifacts/residues that may not be resolvable in principle.
As an analogy, it reminds me of the Gibbs phenomenon (see here for more details: Gibbs phenomenon - Wikipedia). Since SpectraLayers can only work with finite summation during the FFT process, no matter the settings there will always be errors that cannot be eliminated. This has to be considered when applying several transformations in a row.
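
The Gibbs analogy is easy to reproduce: the partial Fourier sums of a square wave overshoot near the jump by about 9% of the jump height, no matter how many harmonics are kept:

```python
import numpy as np

# Partial Fourier sums of a ±1 square wave, examined just after the jump at t = 0.
t = np.linspace(0.0, 0.05, 20001)

for n_terms in (20, 80, 320):
    k = np.arange(1, 2 * n_terms, 2)  # odd harmonics 1, 3, 5, ...
    partial = (4 / np.pi) * (np.sin(2 * np.pi * np.outer(k, t)) / k[:, None]).sum(axis=0)
    overshoot = (partial.max() - 1.0) / 2 * 100  # percent of the full jump (height 2)
    print(f"{n_terms:4d} harmonics: overshoot {overshoot:.2f}% of the jump")
```

Adding harmonics narrows the overshoot in time but never removes it, which is the classic Gibbs behavior.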
I think Steinberg should make a statement on this.
Maybe some more details of the underlying algorithms and their particular implementation would help clarify the issue here.

1 Like

I thought it a good idea to add up the files in other software, outside of SpectraLayers.
One could even compare the summed file with the original file at the byte level, just to make sure.
They are not the same.

1 Like

@Laturec Thanks for the detailed tests. I just did my quick null test on a 10-second file, which wasn’t showing it. I’ll look into that, as well as the 32 kHz issue.

@Marc_von_Bredow It’s definitely not an issue with the FFT/iFFT precision, as proven by the sections between the spikes, which sum perfectly to zero. It’s rather a buffer transition issue, which is an algorithmic tweak rather than a precision limitation.

1 Like

Considering this, would you say that certain window function settings are better suited to certain tasks, e.g. for achieving the best possible result when extracting/unmixing a drum layer, and/or unmixing drums?
Or is the window function not involved in the unmixing processes at all?

This is detailed in the documentation, see the bottom of this page:

1 Like

And does the same go for the window functions, i.e. Hamming, Hann etc., which I don’t find in the documentation?

The parameters are briefly described here, but regarding window functions specifically: in SpectraLayers they are ordered from lowest dynamic range (and finest frequency lines) to highest dynamic range (and thickest frequency lines).
Rectangle (lowest dyn range) < Bartlett < Hamming < Hann < Blackman < Blackman-Harris < Kaiser-15 (highest dyn range)

SpectraLayers defaults to Blackman-Harris, which provides great dynamic range and still reasonable frequency thickness.

As a rule of thumb, anything from Hann to Blackman-Harris is good. Kaiser-15 is a little too extreme and should probably be reserved for specific scientific analyses that require very high dynamic range, and Rectangle/Bartlett for very specific experiments; they are not suited for daily work.
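
For the curious, that dynamic-range ordering can be roughly checked by measuring each window's highest sidelobe from a heavily zero-padded FFT. Assumptions in this sketch: "Kaiser-15" is taken to mean a Kaiser window with beta = 15, and Blackman-Harris uses the standard minimum 4-term coefficients (NumPy has no built-in for it). Note the peak sidelobe is only one facet; sidelobe roll-off matters too, which is presumably why Hann can rank above Hamming here even though Hamming's nearest sidelobe is lower:

```python
import numpy as np

N = 1024
n = np.arange(N)
# Minimum 4-term Blackman-Harris window (standard coefficients).
bh = (0.35875 - 0.48829 * np.cos(2 * np.pi * n / (N - 1))
              + 0.14128 * np.cos(4 * np.pi * n / (N - 1))
              - 0.01168 * np.cos(6 * np.pi * n / (N - 1)))

windows = {
    "Rectangle":       np.ones(N),
    "Bartlett":        np.bartlett(N),
    "Hamming":         np.hamming(N),
    "Hann":            np.hanning(N),
    "Blackman":        np.blackman(N),
    "Blackman-Harris": bh,
    "Kaiser-15":       np.kaiser(N, 15.0),  # assuming "15" is the Kaiser beta
}

def peak_sidelobe_db(w, pad=32):
    """Highest sidelobe level of a window, in dB below its main-lobe peak."""
    spec = np.abs(np.fft.rfft(w, pad * len(w)))
    spec = 20 * np.log10(spec / spec.max() + 1e-300)
    i = 0
    while i + 1 < len(spec) and spec[i + 1] < spec[i]:
        i += 1                 # walk down the main lobe to its first null
    return spec[i:].max()      # highest peak beyond the main lobe

for name, w in windows.items():
    print(f"{name:16s} peak sidelobe ~ {peak_sidelobe_db(w):7.1f} dB")
```

Lower (more negative) sidelobes mean more dynamic range in the spectrogram; the price is a wider main lobe, i.e. thicker frequency lines.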

1 Like

Many thanks!

Thanks for the hint, but this documentation is aimed at the artist, not at the scientifically inclined reader.
It’s OK, but it’s definitely not enough.
I am looking for thorough documentation that covers every detail of the options the program offers.
Yes, SpectraLayers is intended as a (relatively) simple-to-use workhorse for the acoustically inclined user, but since so many open questions remain, there should be better and more complete documentation, at least somewhere.
I guess iZotope RX and Adobe Audition are no better here. The programs are there, certain algorithms have been implemented, and the math behind them is kept below the threshold of discouraging the normal user. Which is fine and understandable, but for some users, like me and others, this is definitely not enough.
So, to put it all together in one simple question: is there a detailed explanation of all the parameters and algorithms used in SpectraLayers available to the public? Yes or no?

For instance: when unmixing a sound to get the transients, I guess it is somehow the first derivative of something, so the steepness of the signal is the measure here. Once the steepness passes a certain internally set value, it is considered a transient. It would be nice to know what exactly is going on, and even better, the transient detection should be parameterizable through the user interface.
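
For what it’s worth, here is a minimal sketch of the kind of derivative-plus-threshold detection guessed at above, using spectral flux (a common textbook onset measure). This is purely illustrative and not SpectraLayers’ actual method; the threshold is an arbitrary, made-up parameter:

```python
import numpy as np

def transient_frames(x, fft_size=1024, hop=256, threshold=1.0):
    """Flag STFT frames whose magnitude spectrum jumps up sharply (spectral flux)."""
    window = np.hanning(fft_size)
    mags = np.array([np.abs(np.fft.rfft(window * x[i:i + fft_size]))
                     for i in range(0, len(x) - fft_size, hop)])
    # "First derivative" across frames, keeping only positive (onset-like) changes.
    flux = np.maximum(mags[1:] - mags[:-1], 0.0).sum(axis=1)
    return np.flatnonzero(flux > threshold) + 1

# Toy signal: a steady sine with two sudden clicks; the detector should flag
# the frames around each click and nothing else.
sr = 8000
t = np.arange(2 * sr) / sr
x = 0.3 * np.sin(2 * np.pi * 220 * t)
x[sr // 2] += 1.0      # click at 0.5 s
x[3 * sr // 2] += 1.0  # click at 1.5 s
print(transient_frames(x))
```

A user-facing steepness threshold would simply expose a parameter like `threshold` here; whether SpectraLayers works anything like this is exactly the open question.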
A detailed explanation of every accessible function in the user interface would be convenient, don’t you think?
Well, I guess at least the programmers know the details, right?

As you’ve rightly guessed, such details are not exposed on purpose, so as not to lose non-scientific users and to keep things simple. Future SpectraLayers documentation could indeed include a few more details about all the display parameters we’ve mentioned here, since some users want to know the exact details.

However, regarding the process and unmixing algorithms, it’s unlikely they will be detailed in the documentation, but feel free to ask me questions about them (I’m the sole developer of SpectraLayers since v1).

Regarding Unmix Components, it’s purely AI-based. There were almost a hundred attempts at solving it algorithmically, with no fully satisfying results, because no formula would handle all possible patterns. Fortunately deep learning came at the right time to “solve” it. It’s not 100% perfect in how it discriminates between the 3 components, but it’s still much better and more accurate than any algorithmic attempt before. So it’s basically a UNet-like AI model trained on thousands and thousands of random noise+frequency+transient patterns.