ChatTTS has become popular, but its documentation is vague, especially on how to control tone, rhythm, and speakers. After repeated testing and troubleshooting, I worked out a few things and am recording them below.
Source code for the web UI: https://github.com/jianchang512/chattts-ui
Available Control Symbols in Text
Control symbols can be inserted into the original text to be synthesized. Currently, the controllable elements are laughter and pauses.
[laugh] Represents laughter
[uv_break] Represents a pause
Here's an example text:
text="Hello [uv_break] friends, I heard today is a good day, isn't it [uv_break] [laugh]?"
In actual synthesis, [laugh] will be replaced by laughter, and a pause will be added at each [uv_break] location.
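Because a mistyped symbol is easy to miss, it can help to scan the text for bracketed tokens before synthesis. The following is a minimal sketch, not part of ChatTTS: `KNOWN_SYMBOLS` and `find_control_symbols` are hypothetical names, and the set only contains the two symbols named above.

```python
import re

# Control symbols named in the text above; anything else is reported as unknown.
KNOWN_SYMBOLS = {"laugh", "uv_break"}

def find_control_symbols(text: str) -> list[tuple[str, bool]]:
    """Return (symbol, is_known) pairs for every [bracketed] token, in order."""
    return [(name, name in KNOWN_SYMBOLS) for name in re.findall(r"\[(\w+)\]", text)]

text = "Hello [uv_break] friends, I heard today is a good day, isn't it [uv_break] [laugh]?"
for name, known in find_control_symbols(text):
    print(f"[{name}]", "ok" if known else "unknown symbol")
```

This only checks spelling against the known set; it says nothing about how the model will render each symbol.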
The intensity of laughter and pauses can be controlled by passing prompts in the params_refine_text parameter.
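laugh_(0-2) Possible values: laugh_0, laugh_1, laugh_2. Laughter presumably becomes more intense as the number increases (or possibly the reverse; this is unclear).
break_(0-7) Possible values: break_0, break_1, break_2, break_3, break_4, break_5, break_6, break_7. Pauses presumably become more noticeable as the number increases (or possibly the reverse; this is unclear).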
Code:
chat.infer([text],params_refine_text={"prompt":'[oral_2][laugh_0][break_6]'})
chat.infer([text],params_refine_text={"prompt":'[oral_2][laugh_2][break_4]'})
However, actual testing reveals that there is no obvious difference between [break_0] and [break_7], and similarly, no significant difference between [laugh_0] and [laugh_2].
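To try different combinations without typos, the prompt string can be assembled by a small helper. This is a sketch under stated assumptions: `build_refine_prompt` is a hypothetical name, the laugh and break ranges are the ones listed above, and the oral range is not stated here, so only negative values are rejected.

```python
def build_refine_prompt(oral: int, laugh: int, pause: int) -> str:
    """Build a prompt such as '[oral_2][laugh_0][break_6]'."""
    if oral < 0:
        # The valid upper bound for oral_N is not given in the text above.
        raise ValueError("oral must be non-negative")
    if not 0 <= laugh <= 2:
        raise ValueError("laugh must be in 0-2")
    if not 0 <= pause <= 7:
        raise ValueError("break must be in 0-7")
    return f"[oral_{oral}][laugh_{laugh}][break_{pause}]"

params_refine_text = {"prompt": build_refine_prompt(2, 0, 6)}
print(params_refine_text)
# With a loaded model, this would be passed as:
# chat.infer([text], params_refine_text=params_refine_text)
```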
Skip the Refine Text Stage
During actual synthesis, a refine text stage rewrites the input and rearranges the control symbols. For example, the text above (shown here in its original Chinese form) is eventually refined to:
你 好 啊 [uv_break] 啊 [uv_break] 嗯 [uv_break] 朋 友 们 , 听 说 今 天 是 个 好 日 子 , 难 道 [uv_break] 嗯 [uv_break] 不 是 吗 [laugh] ? [uv_break]
As you can see, the control symbols no longer match the ones you marked, so the synthesized audio may contain unwanted pauses, noise, or laughter. How do you force synthesis to follow the text exactly as written?
Set the skip_refine_text parameter to True to skip the refine text stage.
chat.infer([text],skip_refine_text=True,params_refine_text={"prompt":'[oral_2][laugh_0][break_6]'})
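To judge whether refinement changed your markers, and so whether skipping it is worthwhile, you can count the bracketed symbols before and after. A minimal sketch using the two sample strings from this post; `count_symbols` and `added_symbols` are hypothetical helpers, not ChatTTS functions, and the refined string is the Chinese output quoted above.

```python
import re
from collections import Counter

def count_symbols(text: str) -> Counter:
    """Count every [bracketed] control symbol in the text."""
    return Counter(re.findall(r"\[(\w+)\]", text))

def added_symbols(marked: str, refined: str) -> Counter:
    """Symbols that occur more often in the refined text than in the marked text."""
    # Counter subtraction keeps only positive differences.
    return count_symbols(refined) - count_symbols(marked)

marked = "Hello [uv_break] friends, I heard today is a good day, isn't it [uv_break] [laugh]?"
refined = ("你 好 啊 [uv_break] 啊 [uv_break] 嗯 [uv_break] 朋 友 们 , 听 说 今 天 是 个 好 日 子 , "
           "难 道 [uv_break] 嗯 [uv_break] 不 是 吗 [laugh] ? [uv_break]")

print(added_symbols(marked, refined))  # Counter({'uv_break': 4})
```

Here the refine stage added four pauses that were never marked, which is the kind of drift skip_refine_text=True avoids.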
Fix the Speaker's Tone
By default, a different tone (voice timbre) is picked at random for each synthesis, which is inconvenient, and the documentation does not describe how to select one.
To fix the speaker, first set the random seed manually; different seeds produce different tones.
torch.manual_seed(2222)
Then get a random speaker
rand_spk = chat.sample_random_speaker()
Then pass it through the params_infer_code parameter.
chat.infer([text], use_decoder=True,params_infer_code={'spk_emb': rand_spk})
In my tests, seeds 2222, 7869, and 6653 produce male tones, and seeds 3333, 4099, and 5099 produce female tones. More tones can be found by trying other seeds.
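The seed-then-sample steps can be wrapped in one helper so the same seed always goes with the same speaker. This is a sketch, not the library's own API: `sample_speaker` and `KNOWN_SEEDS` are hypothetical names, and the `manual_seed` argument exists only so the helper can be tried with a stand-in before loading torch. It assumes `chat` exposes `sample_random_speaker()` as shown above.

```python
from typing import Any, Callable, Optional

# Seeds reported above, with the tone heard in testing.
KNOWN_SEEDS = {2222: "male", 7869: "male", 6653: "male",
               3333: "female", 4099: "female", 5099: "female"}

def sample_speaker(chat: Any, seed: int,
                   manual_seed: Optional[Callable[[int], Any]] = None) -> Any:
    """Seed the random generator, then draw a speaker embedding from `chat`."""
    if manual_seed is None:
        import torch  # imported lazily; only needed when no seeding function is given
        manual_seed = torch.manual_seed
    manual_seed(seed)
    return chat.sample_random_speaker()

# With a loaded model:
# rand_spk = sample_speaker(chat, 2222)
# chat.infer([text], use_decoder=True, params_infer_code={'spk_emb': rand_spk})
```

Seeding immediately before sampling matters: any other random draw in between would shift the generator state and change the speaker.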
Speech Rate Control
You can control the speech rate by setting prompt in the params_infer_code parameter of chat.infer.
chat.infer([text], use_decoder=True,params_infer_code={'spk_emb': rand_spk,'prompt':'[speed_5]'})
The range of speed values is not specified; the default in the source code is speed_5, but no obvious difference was found when testing speed_0 and speed_7.
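For comparing several speeds with the same speaker, the params_infer_code dictionary can be built by a small helper. A sketch only: `build_infer_params` is a hypothetical name, the default of 5 follows the source-code default mentioned above, and since the valid range is not specified the helper only rejects negative values.

```python
from typing import Any

def build_infer_params(spk_emb: Any, speed: int = 5) -> dict:
    """Return a params_infer_code dict with a fixed speaker and a speed prompt."""
    if speed < 0:
        # The upper bound for speed_N is not specified in the text above.
        raise ValueError("speed must be non-negative")
    return {"spk_emb": spk_emb, "prompt": f"[speed_{speed}]"}

# With a loaded model and a sampled speaker:
# for speed in (0, 5, 7):
#     chat.infer([text], use_decoder=True,
#                params_infer_code=build_infer_params(rand_spk, speed))
```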
WebUI and Integrated Package
Source code and downloads: https://github.com/jianchang512/chatTTS-ui
To use the integrated package, decompress it and double-click app.exe.
To run from source, deploy the code according to the repository instructions.