We hear machine-synthesized voices every day: in smart assistants, in navigation apps, and even in some news broadcasts.
AI anchors automatically synthesize voice and video to produce news content every day.
How to synthesize a voice
Speech synthesis may seem complicated, but actually synthesizing a voice is something we can do ourselves at home:
For example, record phrases like "Alipay arrives," "yuan," and "one, two, three, four, five, six," play them back spliced together in the right order, and you get the payment notification that 800 million Alipay users in China have heard.
Of course, the voice you record yourself may not sound as sweet, and nobody will pay you for it.
This approach of splicing pre-recorded audio clips together to produce speech is called the splicing method.
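The splicing idea can be sketched in a few lines of Python. This is only an illustrative sketch, not how Alipay actually builds its notification: the clip bank, the function names (`make_clip_bank`, `amount_to_units`, `splice`), and the digit-by-digit reading of the amount are all assumptions made for the example, and each "clip" is a placeholder byte string standing in for recorded audio.

```python
# Hypothetical clip bank: in a real system each value would be recorded audio.
# Here each "clip" is a short placeholder byte string so the sketch runs anywhere.
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def make_clip_bank():
    """Build a name -> clip mapping for the fixed phrases and the ten digits."""
    names = ["Alipay arrives", "yuan"] + DIGITS
    return {name: f"<{name}>".encode("utf-8") for name in names}


def amount_to_units(amount):
    """Turn a non-negative integer amount into the ordered clip names to play.

    Simplification: the amount is read digit by digit.
    """
    if amount < 0:
        raise ValueError("amount must be non-negative")
    digit_units = [DIGITS[int(ch)] for ch in str(amount)]
    return ["Alipay arrives"] + digit_units + ["yuan"]


def splice(units, bank):
    """Concatenate the clips in order; an unrecorded unit raises KeyError."""
    return b"".join(bank[unit] for unit in units)


bank = make_clip_bank()
units = amount_to_units(123)
audio = splice(units, bank)
print(units)
```

Note that `splice` can only play what was recorded: asking for a unit that is missing from the bank fails outright, which is exactly the limitation of fixed-phrase splicing.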
In everyday life, spliced synthetic speech goes back at least as far as the talking calculator: the oversized one the cashier auntie used to punch at the corner kiosk you visited after school.
"Plus one, plus one, plus two, equals, zero!"
The most primitive splicing method relied on recordings of a few fixed phrases, so it only worked in narrow settings such as navigation. Ask it what the weather was like today and it could not tell you, even if it knew the answer.
Then someone clever had a brilliant idea: record every word in Chinese, and any sentence can be pieced together.
However, speech spliced together this way had another major flaw: intonation and pauses.
Zhan Yan, the voice actress behind the Alipay notification, has revealed that she recorded several pronunciations of "four" so that the synthesized result would sound right in different contexts. The splicing method also has no way of knowing where to pause: it cannot tell "Xiao Ming can't be found / Mom and Dad are very anxious" apart from "Xiao Ming can't find Mom and Dad / and is very anxious."
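The pause problem can be made concrete with a small Python sketch. It is purely illustrative: the `PAUSE` marker and the `with_pause` helper are hypothetical, and the word list uses English glosses for the four Chinese words in the example sentence. The same clips, in the same order, give two different readings depending only on where the pause falls, and nothing in the clips themselves says which one is right.

```python
PAUSE = "<pause>"  # hypothetical marker for a short stretch of silence


def with_pause(words, break_after):
    """Return the playback sequence with one pause after index break_after."""
    if not 0 <= break_after < len(words) - 1:
        raise ValueError("the pause must fall between two words")
    return words[: break_after + 1] + [PAUSE] + words[break_after + 1:]


# English glosses of the four words in the example sentence.
words = ["Xiao Ming", "can't find", "Mom and Dad", "very anxious"]

# Pause after the second word: Mom and Dad are the anxious ones.
reading_a = with_pause(words, 1)
# Pause after the third word: Xiao Ming is the anxious one.
reading_b = with_pause(words, 2)

print(reading_a)
print(reading_b)
```

A pure splicing system has to be told `break_after` by someone who understands the sentence; choosing it automatically is the job that falls to an algorithm.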
The splicing method solves the problem of how to pronounce each word, but the result certainly doesn't sound like a normal person speaking.
So making synthetic speech sound more realistic became everyone's top optimization priority.
How to make speech more realistic
At this point, in addition to increasing the sample size, we must also introduce another key technology: algorithms.
Synthesized speech driven by an algorithm is like speech with a soul injected into it. Put bluntly, it is "a smart algorithm that knows how to handle the intonation and pauses of a sentence."
This way of using algorithms to help generate synthesized speech is called the parametric method.
The parametric method naturally places much higher demands on the recordings: removing silent segments and recording in a professional environment are the norm and, crucially, what needs to be recorded is no longer the pronunciation of words.