Text to Speech Feature for SpectraLayers

Gregory_McCollum · October 17, 2023, 5:24pm

Adding a text to speech feature to SLP would be extremely popular. Open source code is already published as per the video below:

Highly-Controversial · October 17, 2023, 5:59pm

Ummm, text to speech, me love a good text to speech, but is it what SpectraLayers is about?

Unmixing · October 17, 2023, 6:09pm

I have been observing a lot of users requesting Voice/Speech features (such as A.I. voice cloning) within SpectraLayers. I’m not totally against it; however, because it seems like only one developer is working on SpectraLayers, that might make the development cycle extremely slow. I think the text-to-speech and A.I. voice cloning ideas are great, and I’d like to be able to record someone’s voice and manipulate it into a voice like Mariah Carey (with the correct formants and the correct articulation in the voice print profile), but implementing that within SpectraLayers would probably take a development team rather than just one developer.

The best way I can reiterate what I said above (so others can understand) is to think of Serum and Steve Duda. Steve Duda is a talented developer, but he didn’t build Serum by himself; he outsourced what was outside his expertise to a mathematician (so that things like DSP would be much more efficient at a low level), and that allowed him to focus on other important features. With SpectraLayers (and I can’t speak for Steinberg, I’m only going off my observations and assumptions) it seems like only one developer is working on the application, and it doesn’t seem like Steinberg has any incentive to invest in SpectraLayers for features like A.I. voice cloning. I would like to see voice cloning and more features that use A.I. to do various things with the voice, including one that would let you type a sentence and have A.I. sing back what you typed (in Mariah Carey’s voice), but for that to become a reality Steinberg would have to decide whether those are features they want to invest in. I can’t fathom one developer doing all that by himself, and even if that were possible, I’d imagine it would take years to implement.

Gregory_McCollum · October 17, 2023, 7:26pm

If you are correct and there is only one SLP developer, I would agree. There are plenty of features and fixes that current users would prefer over a TTS feature. But SLP already does half the job, allowing one to edit audio, clean up noise and transcribe the result. Given the commercial potential of TTS (as evidenced by the plethora of websites offering TTS for $), I think it would be a good investment for Steinberg. I haven’t written any code since machine language, so I don’t know the magnitude of the job. But the code is already written. It just needs a user-friendly interface.

Nspace · October 19, 2023, 7:32pm

The concept of the number of developers and of a development team has completely blurred in recent years. There are strategic alliances per project and even per task, between teams and between collaborators, with specific licenses. You may set up a Discord, a GitHub, a Slack channel or another platform, and collaborations may mount depending on what is requested and uploaded.
For instance, Monsieur Lobel himself set up years ago an open community effort called PyTorch, with tools and frameworks to build libraries and support AI. One of its endeavors is PyTorch Edge.
This is just one example. Together with his previous (or ongoing?) scientific research, and let’s not forget the alliances/agreements he set first with Sony Creative more than a decade ago, next with Magix and then with Steinberg, all of this constitutes a complex web of inputs for SpectraLayers development.
Therefore, saying “one developer” does not represent what is really behind this development.

Unmixing · October 19, 2023, 9:31pm

@Nspace

Yeah, I agree, but a feature like “A.I. Voice cloning” is extremely specific (and overall outside of spectral editing). It’s not just frameworks and toolkits and libraries; you’re talking about a whole area of expertise in machine learning/deep learning and neural networks. Then on top of that, new research needs to be done (for example, a lot of the A.I. voice clone songs that I hear with celebrity voices have a common, noticeable problem with voice articulation. The vowels and consonants sound accurate, but the voice articulation and how that individual pronounces each individual syllable is way off). Then again, something like A.I. voice cloning can become a problem where people will abuse it (not to mention all the proposed regulations surrounding A.I.).

True, I don’t know everything on the development side, but I’m pretty sure that it’s going to take more than an alliance or libraries or GitHub code to implement something like A.I. voice cloning.

Puma0382 · October 20, 2023, 1:17pm

I think @Gregory_McCollum has made a fair request and raised an interesting topic. I’d be keen to read the developer’s thoughts on the matter.

Crucially, the video posted shows TTS to be perfectly possible right now, using several open-source tools. It was quite an insight. The voice model ‘training’ part seemed pretty ‘friction-free’.

The task of removing background noise, in this case, could be done quite separately and won’t always be needed. Though that is one of SL’s specialities, of course…

But I can imagine it is quite a deal of effort to build similar (to open-source) TTS capability right inside a commercial app sold on the open market.

BTW - I have no expert/programming/software knowledge either way; am just another end user.

Robin_Lobel · October 20, 2023, 3:27pm

There are a lot of AI features considered for the next version. TTS is indeed one of them.
It’s hard to say what features will make it into SL11 at this point, but this one will certainly be considered during development.

DosWasBest · October 20, 2023, 7:12pm

@unmixing said…

A.I. Voice cloning* . I would like to see voice features like voice cloning and would like to see more features that uses A.I. to do various things with the voice and would like to see a feature that would allow you to type a sentence and have A.I. sing back what you typed(in Mariah Carey’s…

WELL…

I’m not Yamaha, not Steinberg, not Mariah, but I am historically plaintiff in similar territory lawsuits and I’ll offer my opinion.

In the US, via case law/statute over the past few decades, it’s been the case that decisions have come down that one can’t…for example…copyright a snare drum hit. etc.

There’ve been cases that have pronounced…you can’t copyright your voice.

However…and this would I think be very important to a commercial software product maker or a bunch of github guys or similar (who I believe are being sued)…it IS possible for a celebrity with a distinctive voice to trademark their voice…which can carry very sharp teeth in a court of law…in the US.

Now…if Mariah Carey does indeed have a registered trademark…and…

JimsFantasticCelebrityAiVoiceClone software goes on sale for $39.95 somewhere on the planet…and contains a Mariah preset…or a preset cleverly named Moriah or MarAI and it sounds pretty much like the real Mariah after 16yr old BobGarage types in text to the program…creates a song that sounds like Mariah singing and puts the file up on youtube for free…

IF there is a trademark…guess who the LA attorneys are gonna go after? The maker of the software. The enabler.

Me…I wouldn’t want any remote prospect of being involved with creating the product.

I’m pretty sure it’s this type of thing that buried csp back at the turn of the century although I’ve never been sure.

All that being said, the cork is out of the AI bottle so to speak. I just wouldn’t want to be the one nailed with a lawsuit.

Unmixing · October 20, 2023, 7:58pm

On the other hand, you could technically add a “match voice” process (with a fairly simple input and output process), and that shouldn’t violate any copyrights or trademarks. For example, if there were a fairly simple process to input 20 Mariah Carey acapellas and have an A.I. process round up all 20 acapellas and match them to another recorded vocal (like the idea of “match EQ” and the concept of the “de-bleed process”, where you input your sources and it outputs the results), matching the timbre, the tone, and the voice profile to another voice, it shouldn’t violate any copyright or trademarks. It wouldn’t work (in terms of completely cloning another voice) because there are lead and background vocals within acapellas, and the sum of those 20 acapellas would round up all the background vocals along with the lead vocals, so the timbre/tone would sound artificial. So technically it would be very difficult to abuse.
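
For anyone curious what the “match EQ” idea looks like in its simplest form, here is a rough sketch in Python. It is only an illustration under my own assumptions, not how SpectraLayers or any voice-cloning tool works: it averages the magnitude spectra of some reference frames, does the same for the target, and returns one gain per frequency bin. The function names, the `floor` and `max_gain` parameters, and the naive DFT are all made up for this example.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (only sensible for short demo frames)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def average_spectrum(frames):
    """Average the magnitude spectra of several equal-length frames."""
    spectra = [magnitude_spectrum(f) for f in frames]
    return [sum(column) / len(spectra) for column in zip(*spectra)]


def match_gains(reference_frames, target_frames, floor=1e-9, max_gain=8.0):
    """Per-bin gains that pull the target's average spectrum toward the reference's.

    `floor` avoids division by zero and `max_gain` caps the boost on bins
    where the target is nearly silent (both are hypothetical defaults).
    """
    ref = average_spectrum(reference_frames)
    tgt = average_spectrum(target_frames)
    return [min(r / max(t, floor), max_gain) for r, t in zip(ref, tgt)]
```

If the reference is a sine wave and the target is the same sine at half the amplitude, the gain for that sine’s bin comes out at about 2. Note that this only matches the long-term tonal balance; it knows nothing about articulation or pronunciation, which is part of why averaging a pile of acapellas would not give you a convincing clone.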

DosWasBest · October 20, 2023, 11:40pm

I gotta reiterate…go look at the current lawsuits.

Last I looked, there were three or four but today, I notice it’s spiraled into dozens.

Anyone running a recorded celebrity voice through software in an attempt to recreate an ai match…is fair game for a lawsuit…including…the fact that one “uses” a recording of let’s say the Beach Boys… in order to “analyze”…demix for analysis…etc…

suddenly, you’re screwed on a bunch of legal fronts before you get to the ai analysis…specifically…you grabbed a specific “recording” to begin the analysis…guess what?

The “recording” is covered under SR copyright registration. You’re not allowed to source it. Busted!

Now…let’s say one uses a field audio recorder to manually record Beyonce…or Taylor Swift…as they’re standing in an airport, talking on their cellphone…and you manage to get a clean recording of Taylor Swift speaking the words “hey, I’m in Dallas”…and you go back to your cave and run that through software to somehow…somehow…create a “Taylor Swift” preset that has turned her speaking voice into pitches, assigned to every word in the English language, complete with nuances…just from her speaking “hey, I’m in Dallas”.

Hmmm…that may work…“see judge, I made my ai from my own recording of Taylor Swift, talking at the airport on her cellphone”

Which isn’t covered by an SR copyright.

Judge says “ok, what was the date you made the recording?”

You say the date.

Bam…Taylor’s massive lawyers hit you with “invasion of privacy”…it’s illegal to record someone on the phone

Let’s go back…let’s say you lie and say you didn’t source the ai from a recording of the celebrity…guess what?

You’ll have to “prove” you didn’t
And you’ll lose that one

Sounds like all this is getting out there in far-fetched land?

Believe me, these things can go on until they bleed you dry financially.

Rights-holders can get very very very angry and legally vindictive.

Go on…check out the mushrooming ai cases…RIAA, ABKCO, Universal Music Group…on and on and on!!! They’re suing outfits left and right. Even I was surprised how many new cases there are.

This is gonna be wayyyyy bigger than the Napster fiasco!!

I would not dare touch this voice-clone stuff as a software developer!

steve · October 20, 2023, 11:56pm

excellent.

Gregory_McCollum · October 21, 2023, 1:19am

You are certainly correct about the explosion of lawsuits around generative AI. However, I would be interested if you know of any lawsuits filed against a software developer for the task the software was designed to accomplish. Just wondering.

Unmixing · October 21, 2023, 5:12am

It doesn’t necessarily have to be “voice cloning” per se; it could be a feature like a “tonal match” or “timbre match” process (where you can take any source and morph its characteristics into another). For example, in Serum you can morph (both spectrally and on a wave DSP level) two waveforms into each other in real time. It would be interesting to take a voice of someone singing and morph it into the characteristics of a sawtooth wave synthesizer (the timbre/tonal aspects).

The bigger picture is a feature that can not only morph one voice into another but can morph anything into anything (for example a voice into a synthesizer or, vice versa, a synthesizer into a voice). The casting/molding feature is an excellent example of a concept like this being brought to life.

Voice cloning (although it wouldn’t necessarily be the bigger picture) is just one aspect of it. Imagine the ability to take someone’s voice and morph its characteristics into a synthesizer.
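
To make the morphing idea a bit more concrete, here is a minimal sketch in Python of the simplest possible morph between two single-cycle waveforms: a plain time-domain crossfade. This is my own toy illustration and not Serum’s actual algorithm (and it is not spectral morphing either); the function names and the `amount` parameter are made up for the example.

```python
import math


def sine_cycle(n):
    """One cycle of a sine wave sampled at n points."""
    return [math.sin(2 * math.pi * i / n) for i in range(n)]


def saw_cycle(n):
    """One cycle of a naive sawtooth rising from -1 toward 1."""
    return [2 * i / n - 1 for i in range(n)]


def morph(wave_a, wave_b, amount):
    """Linear crossfade: amount=0 gives wave_a, amount=1 gives wave_b."""
    if len(wave_a) != len(wave_b):
        raise ValueError("waveforms must have the same length")
    if not 0.0 <= amount <= 1.0:
        raise ValueError("amount must be between 0 and 1")
    return [(1 - amount) * a + amount * b for a, b in zip(wave_a, wave_b)]


# Halfway between a sine and a sawtooth, 64 samples per cycle.
halfway = morph(sine_cycle(64), saw_cycle(64), 0.5)
```

Sweeping `amount` from 0 to 1 over time gives the gradual “one sound turning into another” effect. Doing the same thing convincingly with a sung voice as one of the two sources is a much harder problem, since the voice’s pitch and timbre keep changing from moment to moment.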

DosWasBest · October 21, 2023, 9:40am

@unmixing I don’t have an opinion on what happens if you yourself capture a recording of Mariah Carey’s “Hero”, demix it and turn the main vocal into a morphed Hammond B3 instead…or tuba.

If you get a worldwide hit, do you intend to do interviews blabbering “yeah, I made that tuba from a Mariah Carey vocal”?

My red flags were from your proposal…

“would like to see a feature that would allow you to type a sentence and have A.I. sing back what you typed(in Mariah Carey’s voice), but for that to become a reality Steinberg would have to make a decision and decide if those are features they want to invest in”

Unmixing · October 21, 2023, 1:46pm

I understand. The reason you’re thinking red flag is that you’re only looking at it from one perspective, whereas I’m looking at it from the possibility of sound design. For example, there are dozens of videos on YouTube demonstrating the casting and molding feature, and some YouTubers use that feature to do sound design.

Plus (like I said), the A.I. vocal songs already demonstrated on YouTube have numerous artifacts, and people can tell that those vocals are artificial (mainly because lead vocals are being fed into the algorithms together with background vocals).

tony181 · November 21, 2023, 12:28am

Try otter.ai. It’s dedicated text to speech and the best I’ve found. But it still makes plenty of mistakes. I vote for SpectraLayers to continue its development path. Right now I’d rate the product as promising, probably enough for me to buy a license. I do a lot of voice cleanups and am particularly interested in unmixing.

Gregory_McCollum · November 21, 2023, 12:43am

Tortoise is good also.

tony181 · November 21, 2023, 1:00am

Thank you. Always looking for new apps.

MattiasNYC · December 8, 2023, 4:52am

Sorry about the semi-necro, but I’ll add my vote for voice-cloning.

However, my use case is for post production (sound-to-picture) where we often get crap audio that needs to be repaired. I think it would be much faster in a lot of cases to just re-read the dialog and apply cloning. I don’t think it would be a legal issue if it’s merely a matter of rebuilding what a person has already said.

Just my two cents.

PS: Oh, and btw - either Steinberg does this or a different company does, at which point dialog editors will simply switch to that other software. Guaranteed.