
ChatTTS has become popular quickly, but its documentation is not detailed enough, especially regarding the control of tone, rhythm, and specific speakers. After repeated testing and troubleshooting, I've gained some understanding and documented it below.

The UI code is open source: https://github.com/jianchang512/chattts-ui

Available Control Symbols in Text

Control symbols can be inserted into the original text to be synthesized. Currently, the following two types can be controlled: laughter and pauses.

[laugh] Represents laughter

[uv_break] Represents a pause

Example text:

text="你好啊[uv_break]朋友们,听说今天是个好日子,难道[uv_break]不是吗[laugh]?"

During actual synthesis, [laugh] will be replaced by laughter, and a pause will be added at [uv_break].
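It can help to check the text for mistyped symbols before synthesis. The following is a minimal sketch, assuming the only in-text symbols are the two listed above; the helper name unknown_symbols is made up for illustration and is not part of ChatTTS.

```python
import re

# The two in-text control symbols described above.
KNOWN_SYMBOLS = {"laugh", "uv_break"}


def unknown_symbols(text: str) -> list[str]:
    """Return bracketed symbols in `text` that are not known control symbols."""
    # ASCII-only pattern on purpose: \w would also match Chinese characters.
    found = re.findall(r"\[([A-Za-z0-9_]+)\]", text)
    return [f"[{name}]" for name in found if name not in KNOWN_SYMBOLS]


text = "你好啊[uv_break]朋友们,听说今天是个好日子,难道[uv_break]不是吗[laugh]?"
print(unknown_symbols(text))               # []
print(unknown_symbols("嗯[uv_brake]好的"))  # ['[uv_brake]']
```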

The intensity of laughter and pauses can be controlled by passing prompts in the params_refine_text parameter.

laugh_(0-2): available values are laugh_0, laugh_1, and laugh_2. Higher values presumably mean more intense laughter, though the direction is not confirmed.

break_(0-7): available values are break_0 through break_7. Higher values presumably mean more pronounced pauses, though again the direction is not confirmed.

Code:

chat.infer([text],params_refine_text={"prompt":'[oral_2][laugh_0][break_6]'})
chat.infer([text],params_refine_text={"prompt":'[oral_2][laugh_2][break_4]'})

However, actual testing found that there is no obvious difference between [break_0] and [break_7], and similarly, no obvious difference between [laugh_0] and [laugh_2].
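The prompt string can also be assembled programmatically so that out-of-range levels are caught early. This is a minimal sketch using the laugh and break ranges listed above; build_refine_prompt is a hypothetical helper, not a ChatTTS function, and the oral level is passed through unchecked because its range is not described here.

```python
from typing import Optional


def build_refine_prompt(laugh: int, pause: int, oral: Optional[int] = None) -> str:
    """Build a prompt such as '[oral_2][laugh_0][break_6]'."""
    if not 0 <= laugh <= 2:
        raise ValueError("laugh must be between 0 and 2")
    if not 0 <= pause <= 7:
        raise ValueError("break must be between 0 and 7")
    parts = []
    if oral is not None:
        # The range of oral levels is not described in this article, so it is not checked.
        parts.append(f"[oral_{oral}]")
    parts.append(f"[laugh_{laugh}]")
    parts.append(f"[break_{pause}]")
    return "".join(parts)


params_refine_text = {"prompt": build_refine_prompt(laugh=0, pause=6, oral=2)}
print(params_refine_text)  # {'prompt': '[oral_2][laugh_0][break_6]'}
```

The resulting dict can then be passed as chat.infer([text], params_refine_text=params_refine_text).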

Skipping the Refine Text Stage

During actual synthesis, the text is re-organized (refined) to insert control symbols. For example, the example text above will eventually be organized as:

你 好 啊 [uv_break] 啊 [uv_break] 嗯 [uv_break] 朋 友 们 , 听 说 今 天 是 个 好 日 子 , 难 道 [uv_break] 嗯 [uv_break] 不 是 吗 [laugh] ? [uv_break]

As you can see, the control symbols are not consistent with the ones you marked, and the actual synthesis effect may have unwanted pauses, noises, laughter, etc. So, how do you force the synthesis to follow the actual text?

Set the skip_refine_text parameter to True to skip the refine text stage.

chat.infer([text],skip_refine_text=True,params_refine_text={"prompt":'[oral_2][laugh_0][break_6]'})
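To make the drift between the marked text and the refined text concrete, the two example strings above can be compared by counting their control symbols. This is a minimal sketch; count_symbols is a hypothetical helper, not a ChatTTS function.

```python
import re
from collections import Counter


def count_symbols(text: str) -> Counter:
    """Count bracketed control symbols such as [uv_break] in `text`."""
    return Counter(re.findall(r"\[([A-Za-z0-9_]+)\]", text))


# The text as marked by hand, and the refined text quoted above.
marked = "你好啊[uv_break]朋友们,听说今天是个好日子,难道[uv_break]不是吗[laugh]?"
refined = (
    "你 好 啊 [uv_break] 啊 [uv_break] 嗯 [uv_break] 朋 友 们 , 听 说 今 天 是 个 好 日 子 , "
    "难 道 [uv_break] 嗯 [uv_break] 不 是 吗 [laugh] ? [uv_break]"
)

# Counter subtraction keeps only symbols that appear more often after refining.
added = count_symbols(refined) - count_symbols(marked)
print(dict(added))  # {'uv_break': 4}
```

In this example the refine stage added four pauses that were never marked, which is the kind of change that skip_refine_text=True avoids.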

Fixing the Speaker's Tone

By default, each synthesis uses a randomly chosen voice tone, which makes results hard to reproduce, and the documentation does not describe how to select one.

To keep the speaker fixed, first set the random seed manually. Different seeds produce different tones.

import torch
torch.manual_seed(2222)

Then sample a random speaker:

rand_spk = chat.sample_random_speaker()

Then pass it in through the params_infer_code parameter:

chat.infer([text], use_decoder=True,params_infer_code={'spk_emb': rand_spk})

In my tests, seeds 2222, 7869, and 6653 produced male voices, and seeds 3333, 4099, and 5099 produced female voices. You can try other seeds to find more voices.
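The seed-then-sample steps can be wrapped in one small function so the order is never mixed up. This is a sketch under assumptions: sample_speaker and VOICE_SEEDS are names invented here, and the seeding function is passed in as an argument (torch.manual_seed in real use) so the sketch itself does not depend on torch.

```python
from typing import Any, Callable

# Seeds reported in this article; the labels are listening impressions,
# not metadata provided by the library.
VOICE_SEEDS = {
    "male": [2222, 7869, 6653],
    "female": [3333, 4099, 5099],
}


def sample_speaker(chat: Any, seed: int, manual_seed: Callable[[int], Any]) -> Any:
    """Seed the random generator, then draw a speaker from `chat`.

    `manual_seed` is expected to be torch.manual_seed; it is injected so this
    sketch does not import torch itself.
    """
    manual_seed(seed)  # seeding must happen before sampling
    return chat.sample_random_speaker()


# Intended use with a loaded ChatTTS instance (not executed here):
#   import torch
#   rand_spk = sample_speaker(chat, VOICE_SEEDS["female"][0], torch.manual_seed)
#   chat.infer([text], use_decoder=True, params_infer_code={'spk_emb': rand_spk})
```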

Speech Rate Control

You can control the speech rate by setting prompt in the params_infer_code parameter of chat.infer:

chat.infer([text], use_decoder=True,params_infer_code={'spk_emb': rand_spk,'prompt':'[speed_5]'})

The range of available speed values is not specified. The default in the source code is speed_5, but no obvious difference was found when testing speed_0 and speed_7.
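Since the speaker and the speed prompt travel in the same dict, a small builder keeps them together. This is a minimal sketch; build_infer_params is a hypothetical helper, and it enforces no upper bound on speed because the valid range is not specified.

```python
from typing import Any, Dict


def build_infer_params(spk_emb: Any, speed: int = 5) -> Dict[str, Any]:
    """Return a params_infer_code dict with a fixed speaker and a speed prompt."""
    if speed < 0:
        raise ValueError("speed must be a non-negative integer")
    # No upper bound is enforced because the valid range is not documented.
    return {"spk_emb": spk_emb, "prompt": f"[speed_{speed}]"}


# A placeholder string stands in for the real speaker embedding here.
print(build_infer_params("rand_spk_placeholder", speed=7))
# {'spk_emb': 'rand_spk_placeholder', 'prompt': '[speed_7]'}
```

In real use the first argument would be the rand_spk value sampled earlier, and the result would be passed as params_infer_code.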

WebUI Interface and Integrated Package

Source code and downloads: https://github.com/jianchang512/chatTTS-ui

If you use the integrated package, decompress it and double-click app.exe.

To run from source instead, deploy the code by following the instructions in the repository.

UI Interface Preview