ChatTTS is trending! However, the official documentation is vague, especially regarding fine-grained control over intonation, rhythm, and speaker selection. After extensive testing and troubleshooting, I've gained some understanding and am documenting it here.
UI interface code open source address: https://github.com/jianchang512/chattts-ui
Available Control Symbols in Text
You can insert control symbols into the original text to be synthesized. Currently, you can control laughter and pauses.
[laugh] represents laughter.
[uv_break] represents a pause.
Example text:
text="Hello [uv_break] everyone, I heard today is a good day, isn't it [laugh]?"
During actual synthesis, [laugh] will be replaced with laughter, and a pause will be inserted at [uv_break].
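For context, the snippets in this post assume a loaded ChatTTS instance. A minimal setup sketch (the load_models() name follows the ChatTTS README at the time of writing; newer releases rename it to load()):

```python
# Example text with control symbols: [uv_break] marks a pause, [laugh] laughter.
TEXT = "Hello [uv_break] everyone, I heard today is a good day, isn't it [laugh]?"

if __name__ == "__main__":
    import ChatTTS

    chat = ChatTTS.Chat()
    chat.load_models()          # newer ChatTTS versions may use chat.load()
    wavs = chat.infer([TEXT])   # returns one waveform per input text
```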
The intensity of the laughter and pauses can be controlled by passing a prompt in the params_refine_text parameter.
laugh_(0-2) Available values: laugh_0 laugh_1 laugh_2 - higher values appear to produce more intense laughter.
break_(0-7) Available values: break_0 break_1 break_2 break_3 break_4 break_5 break_6 break_7 - higher values appear to produce longer, more noticeable pauses.
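These prompt strings are easy to mistype, so a small helper (my own convenience, not part of ChatTTS) can build and range-check them. Only the laugh 0-2 and break 0-7 ranges come from the notes above; the oral 0-9 range is an assumption:

```python
def refine_prompt(oral: int = 2, laugh: int = 0, brk: int = 6) -> str:
    """Build a params_refine_text prompt such as '[oral_2][laugh_0][break_6]'."""
    if not 0 <= oral <= 9:       # assumed range, not confirmed by the docs
        raise ValueError("oral level must be 0-9")
    if not 0 <= laugh <= 2:
        raise ValueError("laugh level must be 0-2")
    if not 0 <= brk <= 7:
        raise ValueError("break level must be 0-7")
    return f"[oral_{oral}][laugh_{laugh}][break_{brk}]"
```

Usage: chat.infer([text], params_refine_text={"prompt": refine_prompt(laugh=2, brk=4)})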
Code:
chat.infer([text], params_refine_text={"prompt": '[oral_2][laugh_0][break_6]'})
chat.infer([text], params_refine_text={"prompt": '[oral_2][laugh_2][break_4]'})
However, in actual testing, the difference between [break_0] and [break_7] is not obvious, and the same is true for [laugh_0] through [laugh_2].
Skipping the Refine Text Stage
During actual synthesis, the text is reorganized (refined) and control symbols are inserted. For example, the text above might be refined to:
Hello [uv_break] everyone, I heard today is a good day, isn't it [laugh] ? [uv_break]
As you can see, the control symbols are not consistent with what you marked, and the actual synthesis effect may include unwanted pauses, noise, laughter, etc. So how do you force the synthesis to follow the exact input?
Set the skip_refine_text parameter to True to skip the refine-text stage.
chat.infer([text], skip_refine_text=True, params_refine_text={"prompt": '[oral_2][laugh_0][break_6]'})
Locking the Speaker's Voice
By default, a different voice is randomly selected for each synthesis, which is not ideal, and there is no specific documentation on voice selection.
To simply lock the speaking role, you first need to manually set a random number seed. Different seeds will produce different voices.
torch.manual_seed(2222)
Then, get a random speaker.
rand_spk = chat.sample_random_speaker()
Then pass it through the params_infer_code parameter.
chat.infer([text], use_decoder=True, params_infer_code={'spk_emb': rand_spk})
After testing, seeds 2222, 7869, and 6653 produce male voices, while 3333, 4099, and 5099 produce female voices. You can try other seed values to find more voices.
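One way to find more voices is a small seed sweep that synthesizes the same sentence once per seed and saves each result for listening. The seed lists below just record the results reported above; the extra candidate seeds and the soundfile-based save (commented out) are my own assumptions:

```python
# Seeds reported above; every other value is untested.
MALE_SEEDS = [2222, 7869, 6653]
FEMALE_SEEDS = [3333, 4099, 5099]

if __name__ == "__main__":
    import torch
    import ChatTTS

    chat = ChatTTS.Chat()
    chat.load_models()          # newer ChatTTS versions may use chat.load()
    text = "Hello everyone, this is a seed audition."
    for seed in MALE_SEEDS + FEMALE_SEEDS + [1234, 5678]:  # add candidates here
        torch.manual_seed(seed)
        spk = chat.sample_random_speaker()
        wav = chat.infer([text], use_decoder=True,
                         params_infer_code={"spk_emb": spk})[0]
        # Save each result for listening, e.g. with soundfile:
        # import soundfile as sf; sf.write(f"seed_{seed}.wav", wav[0], 24000)
```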
Speech Rate Control
You can control the speech rate by setting the prompt in the params_infer_code parameter of chat.infer.
chat.infer([text], use_decoder=True, params_infer_code={'spk_emb': rand_spk, 'prompt': '[speed_5]'})
The available range of speed values is not documented. The default value in the source code is speed_5, but testing speed_0 and speed_7 did not reveal significant differences.
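Putting the pieces together - a fixed seed for a stable voice, skip_refine_text to keep the markup literal, and a speed prompt - can be sketched as follows (the infer_kwargs helper is my own convenience, not part of ChatTTS):

```python
DEFAULT_SPEED = "[speed_5]"  # the default in the ChatTTS source

def infer_kwargs(spk_emb, speed_prompt: str = DEFAULT_SPEED) -> dict:
    """Assemble keyword arguments for chat.infer."""
    return {
        "use_decoder": True,
        "skip_refine_text": True,   # keep control symbols exactly as written
        "params_infer_code": {"spk_emb": spk_emb, "prompt": speed_prompt},
    }

if __name__ == "__main__":
    import torch
    import ChatTTS

    chat = ChatTTS.Chat()
    chat.load_models()              # newer ChatTTS versions may use chat.load()
    torch.manual_seed(2222)         # one of the male-voice seeds found above
    spk = chat.sample_random_speaker()
    text = "Hello [uv_break] everyone, nice to meet you [laugh]."
    wavs = chat.infer([text], **infer_kwargs(spk, "[speed_3]"))
```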
WebUI Interface and Integrated Package
Open source and download address: https://github.com/jianchang512/chatTTS-ui
After extracting the integrated package, double-click app.exe.
For source code deployment, follow the instructions in the repository.