
ChatTTS has become popular, but its documentation is vague, especially about how to control tone, rhythm, and speakers precisely. After repeated testing and troubleshooting I have figured out a few things, which I'm recording below.

The open-source code for the UI is at https://github.com/jianchang512/chattts-ui

Available Control Symbols in Text

Control symbols can be inserted into the original text to be synthesized. Currently, the controllable elements are laughter and pauses.

[laugh] represents laughter

[uv_break] represents a pause

Here's an example text:

text="Hello [uv_break] friends, I heard today is a good day, isn't it [uv_break] [laugh]?"

In actual synthesis, [laugh] will be replaced by laughter, and a pause will be added at the [uv_break] location.
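
As a quick illustration, a minimal synthesis call for this text might look like the sketch below, assuming chat is an already-initialized ChatTTS.Chat instance with its models loaded (infer returns a list of waveforms, one per input string):

# Minimal sketch: synthesize text containing control symbols.
# Assumes `chat` is an initialized ChatTTS.Chat instance with models loaded.
text = "Hello [uv_break] friends, I heard today is a good day, isn't it [uv_break] [laugh]?"
wavs = chat.infer([text])  # one waveform per input string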

The intensity of laughter and pauses can be controlled by passing prompts in the params_refine_text parameter.

laugh_(0-2) Possible values: laugh_0, laugh_1, laugh_2. Laughter is supposed to become more intense as the number increases (or possibly the reverse).

break_(0-7) Possible values: break_0 through break_7. Pauses are supposed to become progressively more noticeable as the number increases (or possibly the reverse).

Code:


chat.infer([text], params_refine_text={"prompt": "[oral_2][laugh_0][break_6]"})
chat.infer([text], params_refine_text={"prompt": "[oral_2][laugh_2][break_4]"})

However, actual testing reveals that there is no obvious difference between [break_0] and [break_7], and similarly, no significant difference between [laugh_0] and [laugh_2].

Skip the Refine Text Stage

During actual synthesis, ChatTTS first rewrites the input into "refined text", re-inserting and re-arranging the control symbols. For example, the example sentence above (in its original Chinese form) eventually gets refined as:

你 好 啊 [uv_break] 啊 [uv_break] 嗯 [uv_break] 朋 友 们 , 听 说 今 天 是 个 好 日 子 , 难 道 [uv_break] 嗯 [uv_break] 不 是 吗 [laugh] ? [uv_break]

As you can see, the control symbols no longer match the ones you marked, so the actual synthesis may produce unwanted pauses, filler sounds, laughter, and so on. So how do you force it to synthesize the text exactly as written?

Set the skip_refine_text parameter to True to skip the refine text stage.

chat.infer([text], skip_refine_text=True, params_refine_text={"prompt": "[oral_2][laugh_0][break_6]"})

Fix the Speaker's Tone

By default, a different voice is picked at random for each synthesis, which is very inconvenient, and there is no documented way to select a specific voice.

To fix the speaking role, you first need to set a random number seed manually; different seeds produce different voices.

torch.manual_seed(2222)

Then get a random speaker

rand_spk = chat.sample_random_speaker()

Then pass it through the params_infer_code parameter.

chat.infer([text], use_decoder=True, params_infer_code={"spk_emb": rand_spk})

In my tests, seeds 2222, 7869, and 6653 give male voices, while 3333, 4099, and 5099 give female voices. More voices can be found by trying other seed values.
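
Putting these steps together, here is a minimal end-to-end sketch. It assumes ChatTTS and torchaudio are installed and the models have been downloaded; the exact loading method name may differ between ChatTTS versions (older releases use load_models, newer ones use load), and 24000 Hz is the sample rate ChatTTS outputs.

import torch
import torchaudio
import ChatTTS

chat = ChatTTS.Chat()
chat.load_models()  # may be chat.load() in newer versions

# Fixing the seed makes sample_random_speaker() return the same speaker every run
torch.manual_seed(2222)
rand_spk = chat.sample_random_speaker()

text = "Hello [uv_break] friends, I heard today is a good day, isn't it [uv_break] [laugh]?"
wavs = chat.infer([text], use_decoder=True, params_infer_code={"spk_emb": rand_spk})

# ChatTTS generates 24 kHz audio; save the first result as a wav file
torchaudio.save("output.wav", torch.from_numpy(wavs[0]), 24000)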

Speech Rate Control

You can control the speech rate by setting prompt in the params_infer_code parameter of chat.infer.

chat.infer([text], use_decoder=True, params_infer_code={"spk_emb": rand_spk, "prompt": "[speed_5]"})

The valid range of speed values is not documented; the default in the source code is [speed_5], but I found no obvious difference when testing [speed_0] and [speed_7].
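
If you want to compare for yourself, a small sketch like the one below renders the same text and speaker at the speed values mentioned above and saves each result for listening (it reuses chat, text, and rand_spk from the previous examples; the output file names are just illustrative):

import torch
import torchaudio

# A/B test: same text and speaker, different speed prompts
for speed in ("[speed_0]", "[speed_5]", "[speed_7]"):
    wavs = chat.infer([text], use_decoder=True,
                      params_infer_code={"spk_emb": rand_spk, "prompt": speed})
    torchaudio.save(speed.strip("[]") + ".wav", torch.from_numpy(wavs[0]), 24000)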

WebUI Interface and Integrated Package

Source code and downloads: https://github.com/jianchang512/chatTTS-ui

After decompressing the integrated package, double-click app.exe.

Deploy the source code according to the repository instructions.

UI Interface Preview