Dev Log: Building a German Voice
We just wrapped a studio session with a singer to start building Cantai’s first German vocal library — and I wanted to share a bit about the process, because it’s genuinely fascinating how this works.
Pre-order Cantai for Dorico now!
Why German, and Why Now?
German is one of the most requested languages from our users, and for good reason. Between lieder, opera, choral works, and contemporary composition, there’s an enormous body of music written for German text. If you’re scoring anything in that tradition, you need a voice that can handle it.
The Session
Here’s the thing about building a singing voice: the training data is the voice. Every stylistic detail the model learns comes directly from what the singer actually performed in the studio. So the repertoire choices matter a lot.
We built a setlist of German lieder and arias — real songs, chosen to give us broad phonetic coverage across the language. German has some sounds that don’t exist in English (the “ch” in ich, the Umlauts), and the best way to capture those naturally is to record them in context, embedded in actual musical phrases, rather than running through clinical phonetic exercises.
We tracked across a full dynamic range — quiet, intimate passages through to full operatic power — in a dry studio environment. Clean signal is everything here; the less room sound baked into the recording, the more flexibility the model has downstream.
How It Becomes a Voice
This is the part people are usually most curious about, so here’s the short version:
First, we take the recordings and create a precise time-alignment between the audio, the lyrics, and the notes. Every syllable gets mapped to exactly where it sits in the performance — when it starts, when it ends, how it transitions into the next sound.
Then we extract the acoustic detail: the pitch contour (not just “what note,” but the continuous movement of pitch — the vibrato, the subtle slides between notes that make a voice sound alive) and the spectral texture of the singer’s tone.
All of that feeds into a diffusion model. If you’re not familiar with the term: imagine starting with pure noise — like static — and then gradually, iteratively refining it, guided by the score and the acoustic data from our session, until what’s left is a clear, natural vocal performance. The model learns how this singer sounds when singing German, and can then generate new performances from any score you give it.
What’s Next
We’ll be evaluating the first renders soon and filling any gaps in the phonetic coverage. More updates to come — and if there are other languages you’re hoping to see, let us know.
