ChatTTS has become popular quickly, but its documentation is thin, especially on controlling tone, rhythm, and specific speakers. After repeated testing and troubleshooting, I have written down what I learned below.
Open-source code for the UI: https://github.com/jianchang512/chattts-ui
Available Control Symbols in Text
Control symbols can be inserted into the original text to be synthesized. Currently, the following two types can be controlled: laughter and pauses.
[laugh] Represents laughter
[uv_break] Represents a pause
Example text:
text="你好啊[uv_break]朋友们,听说今天是个好日子,难道[uv_break]不是吗[laugh]?"
During actual synthesis, [laugh] will be replaced by laughter, and a pause will be added at [uv_break].
The intensity of laughter and pauses can be controlled by passing a prompt in the params_refine_text parameter.
laugh_(0-2) Available values: laugh_0 laugh_1 laugh_2. Higher values should make the laughter more intense, though the direction is not confirmed.
break_(0-7) Available values: break_0 break_1 break_2 break_3 break_4 break_5 break_6 break_7. Higher values should make the pauses more pronounced, though the direction is not confirmed.
Code:
chat.infer([text],params_refine_text={"prompt":'[oral_2][laugh_0][break_6]'})
chat.infer([text],params_refine_text={"prompt":'[oral_2][laugh_2][break_4]'})
However, actual testing found that there is no obvious difference between [break_0] and [break_7], and similarly, no obvious difference between [laugh_0] and [laugh_2].
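Because the prompt is just a concatenation of bracketed tokens, a small helper can make it harder to pass an out-of-range level. The following is a minimal sketch: build_refine_prompt is a hypothetical helper name, not part of ChatTTS, and the oral argument is passed through unchecked because its valid range is not described in this article.

```python
def build_refine_prompt(oral: int = 2, laugh: int = 0, pause: int = 6) -> str:
    """Return a prompt such as '[oral_2][laugh_0][break_6]'.

    Hypothetical helper for illustration; it is not part of ChatTTS.
    """
    # The ranges for laugh (0-2) and break (0-7) are the ones listed above.
    if laugh not in range(0, 3):
        raise ValueError("laugh must be between 0 and 2")
    if pause not in range(0, 8):
        raise ValueError("break must be between 0 and 7")
    # The valid range of oral is not given in this article, so it is not checked.
    return f"[oral_{oral}][laugh_{laugh}][break_{pause}]"


prompt = build_refine_prompt(oral=2, laugh=2, pause=4)
print(prompt)  # [oral_2][laugh_2][break_4]
```

The resulting string can be passed as params_refine_text={"prompt": prompt} in the chat.infer calls shown above.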
Skipping the Refine Text Stage
During actual synthesis, the text is re-organized (refined) to insert control symbols. For example, the example text above will eventually be organized as:
你 好 啊 [uv_break] 啊 [uv_break] 嗯 [uv_break] 朋 友 们 , 听 说 今 天 是 个 好 日 子 , 难 道 [uv_break] 嗯 [uv_break] 不 是 吗 [laugh] ? [uv_break]
As you can see, the control symbols no longer match the ones you marked, so the synthesized audio may contain unwanted pauses, noises, or laughter. So how do you force the synthesis to follow the text as written?
Set the skip_refine_text parameter to True to skip the refine text stage.
chat.infer([text],skip_refine_text=True,params_refine_text={"prompt":'[oral_2][laugh_0][break_6]'})
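To see how far the refine stage drifts from your own markup, you can count the control symbols before and after refinement. The sketch below covers only the two symbols described in this article and reuses the two example strings from above; count_control_symbols is an illustrative name, not a ChatTTS function.

```python
import re
from collections import Counter

# Only the two control symbols covered in this article.
CONTROL_SYMBOL = re.compile(r"\[(?:laugh|uv_break)\]")


def count_control_symbols(text: str) -> Counter:
    """Count how often each control symbol appears in the text."""
    return Counter(CONTROL_SYMBOL.findall(text))


# The example text as written by hand.
original = "你好啊[uv_break]朋友们,听说今天是个好日子,难道[uv_break]不是吗[laugh]?"
# The same text after the refine stage, copied from the example above.
refined = (
    "你 好 啊 [uv_break] 啊 [uv_break] 嗯 [uv_break] 朋 友 们 , 听 说 今 天 是 个 好 日 子 , "
    "难 道 [uv_break] 嗯 [uv_break] 不 是 吗 [laugh] ? [uv_break]"
)

print(count_control_symbols(original))
print(count_control_symbols(refined))
```

For this example the hand-written text contains two [uv_break] markers and the refined text contains six, which is the kind of drift that skip_refine_text=True avoids.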
Fixing the Speaker's Tone
By default, each synthesis randomly picks a different tone (voice timbre), which is inconvenient, and the documentation does not describe how to select a tone.
To fix the speaker, first set the random seed manually. Different seeds produce different tones.
torch.manual_seed(2222)
Then sample a random speaker:
rand_spk = chat.sample_random_speaker()
Then pass it through the params_infer_code parameter:
chat.infer([text], use_decoder=True,params_infer_code={'spk_emb': rand_spk})
In my tests, seeds 2222, 7869, and 6653 produce male tones, and seeds 3333, 4099, and 5099 produce female tones. You can try other seeds to find more voices.
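The seeds reported above can be kept in a small lookup table so that a voice is chosen by label rather than by a bare number. This is only a sketch: SEED_VOICES, seeds_for, and fixed_speaker are hypothetical names, the labels are just the results of the tests described here, and fixed_speaker assumes a chat object that has already been created and loaded.

```python
from typing import Dict, List

# Seeds and labels as reported in this article; the table itself is illustrative.
SEED_VOICES: Dict[int, str] = {
    2222: "male",
    7869: "male",
    6653: "male",
    3333: "female",
    4099: "female",
    5099: "female",
}


def seeds_for(voice: str) -> List[int]:
    """Return the known seeds for a voice label, in ascending order."""
    return sorted(seed for seed, label in SEED_VOICES.items() if label == voice)


def fixed_speaker(chat, seed: int):
    """Seed torch, then sample a speaker from an already loaded chat object."""
    import torch  # imported lazily so the lookup helpers work without torch

    torch.manual_seed(seed)
    return chat.sample_random_speaker()


print(seeds_for("female"))  # [3333, 4099, 5099]
```

With a loaded model, rand_spk = fixed_speaker(chat, seeds_for("male")[0]) would give the embedding to pass as spk_emb in params_infer_code.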
Speech Rate Control
You can control the speech rate by setting prompt in the params_infer_code parameter of chat.infer:
chat.infer([text], use_decoder=True,params_infer_code={'spk_emb': rand_spk,'prompt':'[speed_5]'})
The range of available speed values is not specified. The default in the source code is speed_5, but testing speed_0 and speed_7 showed no obvious difference.
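Since the effect of the speed value is hard to hear, it can help to synthesize the same sentence at several speeds and compare the results side by side. The sketch below only builds the params_infer_code dictionaries for such a comparison; speed_sweep is a hypothetical helper, and values are not validated beyond being non-negative because the real range is not documented.

```python
from typing import Any, Dict, Iterable, List


def speed_sweep(spk_emb: Any, speeds: Iterable[int] = (0, 5, 7)) -> List[Dict[str, Any]]:
    """Build one params_infer_code dictionary per speed value.

    Hypothetical helper for illustration; it is not part of ChatTTS.
    """
    params = []
    for speed in speeds:
        # The upper bound is unknown, so only negative values are rejected.
        if speed < 0:
            raise ValueError("speed must be non-negative")
        params.append({"spk_emb": spk_emb, "prompt": f"[speed_{speed}]"})
    return params


# A placeholder string stands in for the real speaker embedding here.
for p in speed_sweep(spk_emb="<speaker embedding>"):
    print(p["prompt"])
```

Each dictionary can then be passed in turn as params_infer_code to chat.infer([text], use_decoder=True, params_infer_code=p), keeping the speaker fixed so that speed is the only thing that changes.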
WebUI Interface and Integrated Package
Source code and download: https://github.com/jianchang512/chatTTS-ui
To use the integrated package, unpack it and double-click app.exe.
To deploy from source instead, follow the instructions in the repository.