Voice cloning

By “voice cloning” I mean the process of analyzing an existing recording of a human voice and then applying the characteristics of that voice, the formants and ‘sound’ of that voice etc., to a target file or set of files.

Hypothetical use-case:

  1. Analyze a clean, unedited recording of the “talent” (an actor, narrator etc.) to yield a “voiceprint”, or “clone”, of that voice’s tonality.

  2. Re-record narration/VO/dialog that an editor has made into a “frankenbite” (a sentence assembled from edits of many sources, which is often very time-consuming to fix, if it’s possible at all).

  3. Apply the “voiceprint” of the cloned voice-actor / narrator to my new recording.

I’m willing to bet that in audio post-production this will be the next major leap forward for dialog editing and maybe even ADR: re-reading a line and applying the talent’s tonality to the new recording is far faster than editing, assuming of course that our performance is similar to the talent’s.

I think something like this included in Spectralayers (and thus in Nuendo through ARA) would be a really big advantage over the competition.

I’d be interested in hearing if people think this would be a good addition.

Cheers,
m

PS: I’m thinking of technologies such as machine learning to accomplish this (and please, if using a GPU for this, allow it to run on AMD, Intel and Nvidia)…

Adding a tag for #nuendo people…

Outside of the use cases you describe it would also be great for sound design.

+1 This would be amazing!

Could be a massive hit on the deep-fake market … :thinking:

Of course.

Though I think they probably already have, or soon will have, those tools readily available. As a matter of fact, I think some in our (post) industry already use other companies’ solutions; it’s just that they’re currently clunky third-party solutions without good workflow integration.

Sooner or later other DAWs will have this. Or iZotope RX. It’s just a matter of time.

Big +1!!
This could be incredibly useful, nefarious purposes aside… or possibly including ‘nefarious’ purposes, depending on your viewpoint :wink:

I’ve already seen something like this. I can’t remember where or what it was called, but what I remember is that they didn’t use formants; they took the vibrato/tremolo (of each harmonic of the voice) and built a “Voice Profile” from it.

Not really sure if this would be suitable or appropriate for something like Spectralayers, because it is focused more on “spectral editing”.

A feature I would like to see in the future (related to this very topic) is “Source Tracking” (or “Voice Tracking”). For example, it would be cool to use the “Harmonic selections” tool and lock on to a specific source. Say you have two singers singing at the same time, in a similar region of the frequency spectrum, and you want to separate one of them. If you use the “Harmonic selections” tool, it latches on to both vocalists because it is not smart enough to detect two different sources. Likewise, if a violin is playing at the same time as a cello, the tool isn’t smart enough to lock on to one source and reject the other. So, in other words: a “smarter selections” tool, or a “Source Tracker” built upon the “Harmonic selections” tool.
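To illustrate the overlap problem described above, here is a rough sketch of a generic harmonic "comb" selection. This is not how Spectralayers actually implements its tool (that isn't public as far as I know); the function name `harmonic_bins`, the 48 kHz sample rate, the 4096-point FFT and the 15 Hz tolerance are all assumptions made up for the example.

```python
import math


def harmonic_bins(f0_hz, sample_rate=48000, fft_size=4096, tolerance_hz=15.0):
    """Return the FFT bin indices lying within tolerance_hz of any harmonic of f0_hz.

    Hypothetical helper: a plain harmonic comb, with no notion of which
    source a given bin belongs to.
    """
    if f0_hz <= 0:
        raise ValueError("f0_hz must be positive")
    bin_hz = sample_rate / fft_size      # width of one FFT bin in Hz
    nyquist = sample_rate / 2
    selected = set()
    for k in range(1, int(nyquist // f0_hz) + 1):
        harmonic = k * f0_hz
        lo = max(0, math.ceil((harmonic - tolerance_hz) / bin_hz))
        hi = min(fft_size // 2, math.floor((harmonic + tolerance_hz) / bin_hz))
        selected.update(range(lo, hi + 1))
    return selected


# Two sources a fifth apart (220 Hz and 330 Hz) share every harmonic
# that is a multiple of 660 Hz, so their combs overlap.
singer_a = harmonic_bins(220.0)
singer_b = harmonic_bins(330.0)
shared = singer_a & singer_b
```

Any bin in `shared` is claimed by both combs, so a selection based on harmonics alone cannot say which singer it belongs to. A real "source tracker" would need extra cues (timbre, vibrato, continuity over time) to split those bins.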

I’m usually using RX for various reasons, but I seem to recall that there already is a tool for matching tonal balance, like an analyzer that sets an EQ curve or something like that. I think this is sort of an extension of the same thing conceptually (not technically).
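For anyone wondering what "an analyzer that sets an EQ curve" boils down to, here is a minimal sketch of the general idea: average the spectrum of a reference and of a target, then take the per-bin difference in dB as the EQ curve. It is not RX's actual algorithm. The function names, the 64-sample frame and the `floor` value are assumptions for illustration, and the naive DFT is only there to keep the example free of dependencies (real code would use an FFT library and smooth the curve).

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (only sensible for short frames)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def average_spectrum(signal, frame_size=64):
    """Mean magnitude per bin over non-overlapping frames of the signal."""
    frames = [signal[i:i + frame_size]
              for i in range(0, len(signal) - frame_size + 1, frame_size)]
    if not frames:
        raise ValueError("signal is shorter than one frame")
    spectra = [magnitude_spectrum(f) for f in frames]
    return [sum(column) / len(spectra) for column in zip(*spectra)]


def match_eq_gains_db(reference, target, frame_size=64, floor=1e-9):
    """Per-bin gain in dB that moves the target's average spectrum onto the reference's.

    Magnitudes below `floor` are clamped so silent bins give 0 dB
    instead of a division by zero.
    """
    ref = average_spectrum(reference, frame_size)
    tgt = average_spectrum(target, frame_size)
    return [20 * math.log10(max(r, floor) / max(t, floor)) for r, t in zip(ref, tgt)]
```

This only matches long-term tonal balance. It knows nothing about phrasing, pitch or how the spectrum moves over time, which is why voice cloning is an extension of the idea conceptually rather than technically.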

But sure, if it existed as a plugin / offline process that too would be great. I’ll perhaps clone my request to the Nuendo forum…

Human perception of speech is MUCH more acute than for other sound sources such as musical instruments. Resonances from skull cavities, chest etc. all give clues to the size, shape, age and even health of the speaker.
These factors all make a task like this significantly more complex than a simple “tonal match”, but probably not impossible given enough raw data about the original voice. If you try shifting the formants from an acoustic guitar recording you can go quite a way before it stops sounding like a guitar, but even a slight shift in a voice will make it sound like a different person.
I suspect this kind of processing would require a whole new dedicated software application.
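As a rough illustration of what "shifting the formants" means in practice, here is a minimal sketch that stretches a spectral envelope along the frequency axis by a ratio, using linear interpolation. It assumes the envelope has already been estimated (a list with one magnitude per frequency bin); the function name `shift_envelope` and the toy numbers are made up for the example, and real formant shifters do considerably more than this.

```python
def shift_envelope(envelope, ratio):
    """Warp a spectral envelope so a peak at bin b moves to roughly bin b * ratio.

    Hypothetical helper: output bin k reads the input at position k / ratio
    (linearly interpolated). Positions past the last bin reuse the last value.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    last = len(envelope) - 1
    shifted = []
    for k in range(len(envelope)):
        src = k / ratio
        if src >= last:
            shifted.append(envelope[last])
            continue
        i = int(src)
        frac = src - i
        shifted.append((1 - frac) * envelope[i] + frac * envelope[i + 1])
    return shifted


# Toy envelope with a single triangular "formant" peaking at bin 10.
toy_envelope = [max(0.0, 1 - abs(k - 10) / 5) for k in range(32)]
raised = shift_envelope(toy_envelope, 1.2)   # peak moves up to around bin 12
```

The pitch (the harmonics underneath the envelope) stays where it is; only the resonances move, which is exactly the cue our ears use to judge the size and identity of a speaker.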

Yes, I used “” to make clear I’m struggling for an appropriate technical term for this.

It essentially already exists and I think it simply uses machine learning. I mean, I don’t mean to say it’s “simple” simple, just… you know, it already exists.

Hello MattiasNYC,
If it already exists I’d be VERY interested to find out more about it. Do you know the name of the software or who it is produced by?
Thanks
Steve www.redtapemusic.net

Thanks Mattias, I hadn’t come across the ReSpeecher software before, very interesting. However, as Abigail explains in the video, she is not emulating the voice of another person in a convincing way, but altering her own voice to sound like another person of a different age or gender, which is a far simpler task and doesn’t require the copying of phrasing, intonation, accent etc. The formant shift required to simulate a change of gender or age is widely available in lots of software, like Melodyne or Cubase’s VariAudio.
S

I don’t understand your objection. I also don’t understand the difference between “emulating the voice of another person in a convincing way” and “altering her own voice to sound like another person”.

If I have speech in a TV show and it’s a frankenbite, then an integrated tool like this would be perfect. I could simply record what the person is saying with my own voice and if the person’s voice has been cloned my voice would be altered to sound like that other person. The gain here is that I don’t have to deal with nasty edits that barely work, if at all.

As for the video, you could imagine her using a cloned voice of Morgan Freeman, for example, substituting words as needed in a movie. She’d just record herself saying what the director/producer wants Freeman to have said, and then apply the “Freeman” preset. Now it sounds like Freeman said what they wanted. You’d never know the difference.

Fair enough, you’re right, I didn’t explain that very well. I was thinking of the difference between altering my voice to sound like someone who is not me, but no specific other person (which is what Abigail is doing, and can already be done quite easily), and altering my voice to sound convincingly like (for instance) Leonardo DiCaprio, which is a very much more complicated task, involving the emulation of phrasing, accents, differing pronunciations, breathing patterns, fricatives and phonemes etc.
I don’t know of any software able to do that convincingly… yet!
S

Voice Cloning Software for Content Creators | Respeecher

Ahhhh okay, I see what you’re saying. So you want this technology to be licensed for use within Spectralayers? That would be cool to see, but I highly doubt Steinberg would be able to acquire an exclusive license to this technology, as the developers seem to have already been working with big-budget Hollywood studios (and most likely want to keep their technology open to those studios).

It would be cool to have Steinberg get the source code for this technology and incorporate it into Spectralayers. This would definitely put iZotope to shame.

As far as “Machine Learning” goes, the main developer of “Spectralayers” is highly skilled with A.I./Machine Learning; however, IRONICALLY, there is a lack of A.I./Machine Learning capabilities within “Spectralayers” (but then again, “Spectralayers” is more of a straightforward automated editing tool, like the idea of Photoshop but for audio). However, just like “Adobe After Effects” (where more tools/effects have been added over the years), more tools could be added to “Spectralayers” (such as “Respeecher”).

As of right now the strength of “Spectralayers” is its toolset, and I believe that if Steinberg invests in more toolsets (features) it can put it 10 years ahead of iZotope or any other spectral editing program.

One thing that I dislike about “Spectralayers” is that it is poorly optimized: the GUI is extremely laggy/clunky, and when you turn up the resolution/refinement/FFT size and try to edit, it becomes almost impossible to work. I believe this has to do with “Legacy Code”. I have a powerful machine that can game in 4K on all high settings, but working with “Spectralayers” feels like working on a computer from 10 years ago, and it can become unbearable when you have to wait 5-10 seconds for the GUI to zoom in on a location. Someone here complained about how poorly “Spectralayers” is optimized and IRONICALLY one of the administrators/moderators locked the topic. However, the person who complained was right in the sense that “Spectralayers” needs to be optimized, and I believe Steinberg should invest more in optimizing it so it can be fluid and easy to work with on a powerful machine. I believe Steinberg should bear the responsibility for adding more tools such as “Respeecher” and for optimizing “Spectralayers”, and not the main developer, as I believe he should rest/retire.

No, that’s not what I said. I used that company and its tech as an example of how this can be implemented in practice. That’s all.

Exactly! That’s what I meant. That’s why I mentioned that the original developer of “Spectralayers” is highly skilled with Machine Learning/A.I., and something like this could easily be implemented within “Spectralayers” with his expertise. I mainly referenced Steinberg licensing this technology to point out that it’s a good idea and to make the point that they (Steinberg) are the ones responsible for new features. In case you don’t know, “Spectralayers” was owned by another company before Steinberg bought it, so it is up to Steinberg to decide how much further they want to develop “Spectralayers” and invest in it. Also, I more or less suggested that Steinberg license this technology as an idea (reference point/rough draft/blueprint), so they can learn from it and create something new, or their own technology, that makes sense within “Spectralayers”. The main developer intended “Spectralayers” to be like “Adobe Photoshop” but for audio editing. However, “Adobe Photoshop” is light-years ahead of “Spectralayers” in its editing capabilities. I remember the days when “Adobe Photoshop” was growing in popularity and people used it to create memes, and women were heavily editing themselves with large breasts and calling themselves “natural”, and people used to say “ohhhhh that’s definitely ‘Photoshopped’, that is not real”. That’s where the term “Photoshopped” came from (because of its use in creating something that is not real). It would be cool for Steinberg to invest so heavily in “Spectralayers” that people would use it to create content. It would be cool if “Spectralayers” became so advanced that I could hear The President’s voice saying something like “I would cum all over her so fast” as “leaked audio”, and the fact checkers would say something like “ohhh that’s fake, they’re using ‘Spectralayers’ to do this. It’s not real”.

Again, this technology (“Respeecher”) wouldn’t necessarily be appropriate for something like “Spectralayers” (because it’s a spectral editing app and not a “voice synthesizer”). That’s why I hinted that, for this technology to be implemented within “Spectralayers”, the feature would have to make sense and fit within the realm of “spectral editing”. So, for example, it could be implemented as a “Voice Profile” where you can insert someone else’s voice into a classic “Frank Sinatra” song and adjust certain parameters (such as consonants, vibrato/tremolo/harmonics, syllables, speech patterns, speech recognition, sources, etc.).