Spark-TTS is a highly anticipated open-source voice cloning project developed collaboratively by several universities, including the Hong Kong University of Science and Technology, Northwestern Polytechnical University, and Shanghai Jiao Tong University. Local testing shows that its performance is comparable to F5-TTS.
Spark-TTS supports voice cloning in both Chinese and English, and the installation and deployment process is not complicated. This article will detail how to install and deploy it, and modify it to be compatible with the F5-TTS API interface, so that it can be used directly in the F5-TTS dubbing channel of the pyVideoTrans software.
Prerequisites: Make sure you have installed Python version 3.10, 3.11, or 3.12.
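To confirm the prerequisite before going further, a short check like the following can be run with the interpreter you plan to use. It is only a sketch; the helper name is_supported is made up for this example.

```python
import sys

# Minor versions of Python 3 that this article lists as prerequisites.
SUPPORTED_MINORS = {10, 11, 12}


def is_supported(version_info=sys.version_info):
    """Return True if the given version tuple is Python 3.10, 3.11, or 3.12."""
    return version_info[0] == 3 and version_info[1] in SUPPORTED_MINORS


if is_supported():
    print("Python version is supported:", sys.version.split()[0])
else:
    print("Unsupported Python version:", sys.version.split()[0])
```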
1. Download the Spark-TTS Source Code
First, create a folder on a non-system drive whose name contains only English letters or digits, such as D:/spark. Using only letters and digits and avoiding the system drive prevents errors caused by Chinese characters in the path, insufficient permissions, and similar issues.
Then, visit the official Spark-TTS code repository: https://github.com/SparkAudio/Spark-TTS
As shown in the image below, click to download the ZIP package of the source code:
After the download is complete, unzip the file and copy all the files and folders it contains into the D:/spark folder. The directory structure after copying should look like this:
2. Create a Virtual Environment and Install Dependencies
- Create a Virtual Environment
Type cmd in the address bar of the folder window and press Enter. In the black terminal window that opens, execute the following command:
python -m venv venv
As shown in the figure:
After execution, a venv folder will appear in the D:/spark directory:
Note: If you see the message "python is not recognized as an internal or external command" during execution, Python is either not installed or has not been added to the system PATH environment variable. Please refer to relevant articles to install Python.
Next, execute venv\scripts\activate to activate the virtual environment. After activation, (venv) appears at the beginning of the terminal prompt, indicating success. All subsequent commands must be executed in this environment, so check that it is activated before running each one.
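Besides looking for the (venv) prefix, the interpreter itself can report whether it is running inside a virtual environment. The sketch below relies on the standard sys.prefix and sys.base_prefix attributes, which differ when a venv is active; the function name in_virtualenv is chosen only for this example.

```python
import sys


def in_virtualenv(prefix=sys.prefix, base_prefix=sys.base_prefix):
    """Return True when the interpreter prefix differs from the base installation."""
    return prefix != base_prefix


if in_virtualenv():
    print("Virtual environment is active:", sys.prefix)
else:
    print("No virtual environment detected; run venv\\scripts\\activate first.")
```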
- Install Dependencies
In the activated virtual environment, continue to execute the following command in the terminal to install all dependencies:
pip install -r requirements.txt
The installation may take a long time; please be patient.
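Once pip finishes, a quick way to see whether the installation succeeded is to check that key packages can be found. The sketch below only checks huggingface_hub and gradio, the two packages that the code in this article uses directly; it assumes both are installed by requirements.txt, and the helper name missing_modules is made up for this example.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages referenced by the code in this article (assumed to come from requirements.txt).
missing = missing_modules(["huggingface_hub", "gradio"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("Required packages were found.")
```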
3. Download the Model
Models required for open-source AI projects are usually hosted on Hugging Face (huggingface.co). Since this website is blocked in China, you need to use a proxy to download the model. Please make sure that the proxy environment is configured correctly and that the system proxy is set up.
Create a text file named down.txt in the current directory D:/spark, paste the following code into it, and save it:
from huggingface_hub import snapshot_download
snapshot_download("SparkAudio/Spark-TTS-0.5B", local_dir="pretrained_models/Spark-TTS-0.5B")
print('Download complete')
Then, execute the following command in the terminal window of the activated virtual environment:
python down.txt
Check that (venv) appears at the start of the command line:
Wait for the terminal to indicate that the download is complete.
If you see a message similar to the following, the connection to Hugging Face failed, most likely because the proxy environment is not configured correctly:
Returning existing local_dir `pretrained_models\Spark-TTS-0.5B` as remote repo cannot be accessed in `snapshot_download` ((MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/SparkAudio/Spark-TTS-0.5B/revision/main (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x000001BC4C8A4430>, 'Connection to huggingface.co timed out. (connect timeout=None)'))"), '(Request ID: aa61d1fb-ffc7-4479-9a99-2258c1bc0aee)')).
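If the system proxy is not picked up, one option is to set the standard proxy environment variables inside the download script before the download starts. The following is a sketch under assumptions: the address http://127.0.0.1:7890 is a hypothetical local proxy and must be replaced with your own, the helper names are made up for this example, and it assumes the HTTP client used by huggingface_hub honours HTTP_PROXY and HTTPS_PROXY.

```python
import os


def apply_proxy(proxy_url, env=None):
    """Set the standard proxy variables in env (defaults to this process's environment)."""
    env = os.environ if env is None else env
    env["HTTP_PROXY"] = proxy_url
    env["HTTPS_PROXY"] = proxy_url
    return env


def download_model(proxy_url):
    """Apply the proxy, then download the model exactly as down.txt does."""
    apply_proxy(proxy_url)
    # Imported after the proxy is set so the variables are in place before any request.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        "SparkAudio/Spark-TTS-0.5B",
        local_dir="pretrained_models/Spark-TTS-0.5B",
    )


# Example call (hypothetical proxy address, replace with your own):
# download_model("http://127.0.0.1:7890")
```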
4. Start the Web Interface
After the model is downloaded, you can start and open the Web interface.
Execute the following command in the terminal of the activated virtual environment:
python webui.py
Wait until the following information appears, indicating that the startup is complete:
You can now open the address http://127.0.0.1:7860 in your browser. The Web interface is shown in the figure below:
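If the page does not load, it helps to check whether anything is listening on port 7860 at all. The sketch below uses only the standard socket module; the function name port_is_open is made up for this example.

```python
import socket


def port_is_open(host="127.0.0.1", port=7860, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts alike.
        return False


if port_is_open():
    print("The Web interface is reachable at http://127.0.0.1:7860")
else:
    print("Nothing is listening on port 7860 yet; check the webui.py terminal output.")
```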
5. Voice Cloning Test
As shown in the figure below, select an audio file with the voice you want to clone (3-10 seconds long, clear pronunciation, clean background).
Then, enter the text corresponding to the audio in the Text of prompt speech box on the right, enter the text you want to generate on the left, and finally click the Generate button at the bottom to start execution.
After execution completes, the result appears as shown in the figure below.
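The recommended 3-10 second length of the reference audio can be checked before uploading. The sketch below uses the standard wave module, so it only works for uncompressed PCM WAV files; other formats would need a different library. The function names are made up for this example.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_good_reference_length(seconds, low=3.0, high=10.0):
    """Return True if the clip falls within the recommended 3-10 second range."""
    return low <= seconds <= high


# Example ("ref.wav" is a hypothetical file name):
# seconds = wav_duration_seconds("ref.wav")
# print(seconds, is_good_reference_length(seconds))
```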
6. Use in pyVideoTrans Software
Spark-TTS is very similar to F5-TTS. With a simple modification, Spark-TTS can be used directly in the F5-TTS dubbing channel of pyVideoTrans. If you don’t know how to modify it, you can download the modified version and overwrite webui.py with it. Download address: https://pvt9.com/spark-use-f5-webui.zip
- Open the webui.py file and paste the following code above line 135 (approximately):
def basic_tts(gen_text_input, ref_text_input, ref_audio_input, remove_silence=None, speed_slider=None):
    """
    Gradio callback to clone voice using text and optional prompt speech.
    - gen_text_input: The input text to be synthesised.
    - ref_text_input: Additional textual info for the prompt (optional).
    - ref_audio_input: Audio file used as reference.
    - remove_silence, speed_slider: Accepted for F5-TTS API compatibility; unused.
    """
    prompt_speech = ref_audio_input
    # Treat a prompt text shorter than 2 characters as "no prompt text".
    prompt_text_clean = None if len(ref_text_input) < 2 else ref_text_input

    audio_output_path = run_tts(
        gen_text_input,
        model,
        prompt_text=prompt_text_clean,
        prompt_speech=prompt_speech
    )
    return audio_output_path, prompt_text_clean
Special Note: Python relies on consistent indentation with spaces, and misaligned code raises an error. Indent the pasted block so that it lines up with the surrounding code at the insertion point. To avoid mistakes, do not open webui.py with Notepad; use a proper code editor such as Notepad++ or VSCode, both of which are free.
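Indentation mistakes can be caught before restarting by asking Python to parse the edited file. The sketch below uses the standard ast module; the helper name syntax_error_in is made up for this example. It only detects code that fails to parse (including indentation errors), not a block that parses but was pasted at the wrong nesting level.

```python
import ast


def syntax_error_in(source, filename="webui.py"):
    """Return a short error description if source does not parse, otherwise None."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as exc:  # IndentationError and TabError are subclasses
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None


# Example, run from the D:/spark folder:
# with open("webui.py", encoding="utf-8") as f:
#     print(syntax_error_in(f.read()) or "webui.py parses without syntax errors")
```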
- Then, find the line generate_buttom_clone = gr.Button("Generate") around line 190. Paste the following code above it, again paying attention to indentation:
generate_buttom_clone2 = gr.Button("Generate2", visible=False)
generate_buttom_clone2.click(
    basic_tts,
    inputs=[
        text_input,
        prompt_text_input,
        prompt_wav_upload,
        text_input,  # placeholder for remove_silence (ignored by basic_tts)
        text_input   # placeholder for speed_slider (ignored by basic_tts)
    ],
    outputs=[audio_output, prompt_text_input],
    api_name="basic_tts"
)
- After saving the file, restart webui.py:
python webui.py
- Enter the address http://127.0.0.1:7860 as the API address under "Menu" -> "TTS Settings" -> "F5-TTS" in pyVideoTrans, and you can start using it. The reference audio is placed and filled in the same way as for F5-TTS.
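To test the new endpoint without pyVideoTrans, it can also be called from a script. The following is a rough sketch, not a description of how pyVideoTrans itself makes the call: it assumes the gradio_client package is available in the virtual environment, that webui.py is running with the modification above, and that the endpoint takes its five inputs in the order of the inputs list registered for api_name="basic_tts". The helper names and the file name ref.wav are hypothetical.

```python
def build_basic_tts_args(gen_text, ref_text, ref_audio):
    """Arrange the arguments in the order of the inputs list wired to basic_tts."""
    # The two trailing empty strings fill the remove_silence and speed_slider
    # slots, which basic_tts accepts but ignores.
    return [gen_text, ref_text, ref_audio, "", ""]


def clone_via_api(gen_text, ref_text, ref_audio_path, base_url="http://127.0.0.1:7860"):
    """Call the basic_tts endpoint and return whatever the server sends back."""
    # Imported here so the helper above works even without gradio_client installed.
    from gradio_client import Client, handle_file

    client = Client(base_url)
    args = build_basic_tts_args(gen_text, ref_text, handle_file(ref_audio_path))
    return client.predict(*args, api_name="/basic_tts")


# Example (requires a running webui.py; "ref.wav" is a hypothetical reference clip):
# result = clone_via_api("Hello from Spark-TTS.", "Transcript of the reference clip.", "ref.wav")
# print(result)
```

Since basic_tts returns two values, the result is expected to hold the generated audio and the cleaned prompt text, in that order.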