clone-voice Voice Cloning Tool
The model used in this project is xtts_v2 from coqui.ai, licensed under the Coqui Public Model License 1.0.0. Please comply with this license when using the project. Full license text available at https://coqui.ai/cpml.txt.
This is a voice cloning tool that can use any human voice to synthesize text into speech with that voice or convert one voice to another using the target voice.
It's very easy to use and works without an NVIDIA GPU. Download the pre-compiled version, double-click app.exe to open a web interface, and use it with simple clicks.
Supports 16 languages, including Chinese, English, Japanese, Korean, French, German, and Italian, and allows recording a voice online from a microphone.
For best synthesis results, record a clear and accurate voice clip of 5 to 20 seconds without background noise.
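For WAV recordings, the 5-20 second guideline can be checked with the standard library. This is a hypothetical helper, not part of the project, and it handles uncompressed WAV only:

```python
import wave

def clip_duration_ok(path, lo=5.0, hi=20.0):
    """Return True if the WAV file at `path` is between lo and hi seconds long."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    return lo <= duration <= hi
```

For mp3 or flac clips you would need an external tool such as ffprobe, or a tag library, to read the duration.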
English results are excellent, while Chinese results are acceptable.
How to Use the Pre-compiled Windows Version (Other Systems Can Deploy from Source)
Click here to open the Releases download page, then download the pre-compiled main file (1.7GB) and the model (3GB).
Extract the downloaded files to a location, e.g., E:/clone-voice.
Double-click app.exe and wait for the web window to open automatically. Carefully read the text prompts in the cmd window; any errors will be displayed here.
After downloading the model, extract it to the tts folder in the software directory.
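Whether the model archive landed in the right place can be sanity-checked by counting the model subfolders under tts. This is a hypothetical snippet, not part of the project, and the exact folder names depend on which models you downloaded:

```python
import os

def tts_models_present(tts_dir, expected=3):
    """Return True when the tts directory exists and contains at least
    `expected` model subfolders."""
    if not os.path.isdir(tts_dir):
        return False
    subdirs = [d for d in os.listdir(tts_dir)
               if os.path.isdir(os.path.join(tts_dir, d))]
    return len(subdirs) >= expected
```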
Conversion Steps:
Select the 【Text->Voice】 button, enter text in the text box or import an SRT subtitle file, then click "Start Now".
Select the 【Voice->Voice】 button, click or drag the audio file to convert (mp3/wav/flac), then choose the voice to clone from the "Voice File to Use" dropdown. If none are satisfactory, click "Upload Local" to select a pre-recorded 5-20s wav/mp3/flac voice file, or click "Start Recording" to record your own voice online for 5-20s, then click "Use" after recording. Finally, click "Start Now".
If the machine has an NVIDIA GPU and CUDA environment correctly configured, CUDA acceleration will be automatically used.
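The automatic selection described above typically reduces to a device check like the following sketch; torch is one of the project's dependencies, and the fallback keeps the snippet importable on machines without it:

```python
def pick_device():
    """Return "cuda" when torch is installed with a working CUDA build
    and an NVIDIA GPU is visible; otherwise fall back to "cpu"."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass  # torch not installed; run on CPU
    return "cpu"
```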
Source Code Deployment (Linux, Mac, Windows)
The source code version requires setting HTTP_PROXY in the .env file (e.g., HTTP_PROXY=http://127.0.0.1:7890). Models are downloaded from https://huggingface.co and https://github.com, which are inaccessible in some regions. Ensure the proxy is stable and reliable; otherwise, large model downloads may fail midway.
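How such a .env entry typically takes effect: the variable is loaded into the process environment, which HTTP clients then honor. A minimal sketch of such a loader (the file path and variable name follow the text above; the parser is simplified and ignores quoting):

```python
import os

def load_env(path=".env"):
    """Load KEY=VALUE lines from a .env file into os.environ (simplified:
    skips blanks and # comments, does not handle quoting or export)."""
    if not os.path.isfile(path):
        return
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```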
Requirements: Python 3.9–3.11, plus the git-cmd tool installed in advance. Download here.
Create an empty directory, e.g., E:/clone-voice, and open a cmd window in that directory by typing cmd in the address bar and pressing Enter.

1. Use git to pull the source code into the current directory: git clone [email protected]:jianchang512/clone-voice.git .
2. Create a virtual environment: python -m venv venv
3. Activate the environment. On Windows: E:/clone-voice/venv/scripts/activate
4. Install dependencies: pip install -r requirements.txt --no-deps
   For CUDA acceleration on Windows and Linux (requires an NVIDIA GPU and a properly configured CUDA environment), additionally run pip uninstall -y torch to uninstall, then pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
5. On Windows, extract ffmpeg.7z and place ffmpeg.exe in the same directory as app.py. On Linux and Mac, download the corresponding ffmpeg build from the ffmpeg official website, extract the ffmpeg binary, and place it in the root directory; the executable ffmpeg must end up in the same directory as app.py.
6. For the first run, execute python code_dev.py. When prompted to agree to the license, enter y, then wait for the model download to complete. Downloading the model requires a global proxy; the models are very large, and an unstable proxy causes many errors. Most errors are proxy-related.
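The requirements above (Python 3.9–3.11, ffmpeg next to app.py) can be self-checked with a small script. This is a hypothetical helper, not part of the repo:

```python
import os
import sys

def check_prereqs(root="."):
    """Return a list of problems found with the deployment prerequisites:
    interpreter version in the 3.9-3.11 range, ffmpeg beside app.py."""
    problems = []
    if not ((3, 9) <= sys.version_info[:2] <= (3, 11)):
        problems.append("Python %d.%d is outside the supported 3.9-3.11 range"
                        % sys.version_info[:2])
    ffmpeg_name = "ffmpeg.exe" if os.name == "nt" else "ffmpeg"
    if not os.path.isfile(os.path.join(root, ffmpeg_name)):
        problems.append(ffmpeg_name + " not found next to app.py")
    return problems
```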
If multiple models show as successfully downloaded but a "Downloading WavLM model" error still appears, modify the library file \venv\Lib\site-packages\aiohttp\client.py: around line 535, above the line if proxy is not None:, add your proxy address, e.g. proxy="http://127.0.0.1:10809".
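The effect of that one-line edit, shown in isolation: the inserted assignment overrides whatever proxy value (usually None) the caller passed, so the `if proxy is not None:` branch below it always fires and every request is routed through your proxy. The URL here is an example:

```python
def effective_proxy(caller_proxy=None, patched="http://127.0.0.1:10809"):
    """Mimic the suggested client.py edit: the inserted assignment replaces
    the caller's value, so the proxy branch always runs."""
    proxy = caller_proxy
    proxy = patched  # the single line the guide tells you to insert
    if proxy is not None:
        return proxy
    return None
```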
After the download completes, start the app with python app.py.

【Training Instructions】 To train, execute python train.py. Adjust training parameters in param.json, then rerun python train.py.

Each startup connects to external servers to check for or update models; please wait patiently. To skip this check, manually modify the dependency file: open \venv\Lib\site-packages\TTS\utils\manage.py and, around line 389 in the download_model method, comment out the following code:
    if md5sum is not None:
        md5sum_file = os.path.join(output_path, "hash.md5")
        if os.path.isfile(md5sum_file):
            with open(md5sum_file, mode="r") as f:
                if not f.read() == md5sum:
                    print(f" > {model_name} has been updated, clearing model cache...")
                    self.create_dir_and_download_model(model_name, model_item, output_path)
                else:
                    print(f" > {model_name} is already downloaded.")
        else:
            print(f" > {model_name} has been updated, clearing model cache...")
            self.create_dir_and_download_model(model_name, model_item, output_path)

Source code deployment may frequently run into errors, mostly because proxy problems prevent model downloads or leave them incomplete. Use a stable, globally enabled proxy. If downloads still cannot complete, use the pre-compiled version.
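What the commented-out block did: compare the hash recorded in hash.md5 against the expected one, and clear and re-download the model on any mismatch. The same comparison, stripped of the download side effects (hypothetical paths, for illustration only):

```python
import os

def model_up_to_date(output_path, expected_md5):
    """Re-create the disabled check: True when hash.md5 in output_path
    matches the expected hash, i.e. no re-download would be triggered."""
    md5sum_file = os.path.join(output_path, "hash.md5")
    if not os.path.isfile(md5sum_file):
        return False  # no recorded hash: the model would be re-downloaded
    with open(md5sum_file, mode="r") as f:
        return f.read() == expected_md5
```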
Common Issues
The xtts model is for learning and research only, not for commercial use.
After startup, the model needs to be cold-loaded, which takes some time. Wait patiently until http://127.0.0.1:9988 is displayed and the browser page opens automatically, then wait another 2–3 minutes before converting.

Features include:
Text to Speech: input text and generate speech in the selected voice.
Voice to Voice: select a local audio file and generate another audio file in the selected voice.

If the cmd window remains unresponsive for a long time and requires pressing Enter to continue output, click the icon in the top-left corner of the cmd window, select "Properties", and uncheck "Quick Edit Mode" and "Insert Mode".
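The "wait until http://127.0.0.1:9988 is displayed" step can be automated by polling the port with the standard library. A hypothetical helper, not part of the project:

```python
import socket
import time

def wait_ready(host="127.0.0.1", port=9988, timeout=120.0):
    """Poll until something accepts TCP connections on host:port or the
    timeout expires. Returns True once the port is reachable."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(1)  # server still loading; try again
    return False
```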
Pre-compiled Version: Voice-to-voice thread fails to start.
First, confirm the model is correctly downloaded and placed: the tts folder should contain 3 subfolders.
If correctly placed but errors persist, download extra-to-tts_cache.zip, extract the two files, and copy them to the tts_cache folder in the software root directory.
If the above doesn't work, set the proxy address in the .env file, e.g. HTTP_PROXY=http://127.0.0.1:7890. Ensure the proxy is stable and the port is correct.

Prompt: "The text length exceeds the character limit of 182/82 for language"
This happens when a period-delimited sentence is too long. Break long sentences with periods rather than stringing together many commas, or manually raise the limit in clone/character.json.
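Since the limit is applied per period-delimited sentence, long comma chains overflow it. A hypothetical splitter (not the tool's own logic) that breaks text into chunks under a given limit, preferring period boundaries; sentences that are single-handedly over the limit are kept whole, mirroring the error the tool raises:

```python
def split_for_tts(text, limit=82):
    """Split text into chunks of at most `limit` characters, cutting at
    sentence-ending periods (both "." and the full-width "。")."""
    chunks, current = [], ""
    for sentence in text.replace("。", "。\n").replace(". ", ".\n").splitlines():
        sentence = sentence.strip()
        if not sentence:
            continue
        if len(current) + len(sentence) <= limit:
            current += sentence
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```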
Prompt: "symbol not found __svml_cosf8_ha"
Open the webpage https://www.dll-files.com/svml_dispmd.dll.html, click the red "Download" text, download and extract, then copy the dll file to "C:\Windows\System32".
CUDA Acceleration Support
Install CUDA Tools Detailed Installation Guide
If your computer has an NVIDIA graphics card, first update the graphics driver to the latest version, then install the matching CUDA Toolkit 11.8 and cuDNN for CUDA 11.x.
After installation, press Win + R, type cmd, and press Enter. In the window that opens, run nvcc --version and confirm that version information is displayed.
Then run nvidia-smi and confirm that the output shows the CUDA version.
If both commands report correctly, CUDA acceleration is enabled; otherwise, reinstall.
