Voice Cloning Tool

Clone-Voice Open Source Project Repository

The models used in this project are sourced from https://github.com/coqui-ai/TTS and are licensed under CPML. They are for learning and research purposes only and not for commercial use.

This is a voice cloning tool that can use any human voice tone to synthesize speech from text or convert one voice to another using the selected tone.

It is very easy to use; no NVIDIA GPU is required. Download the precompiled version, double-click app.exe to open a web interface, and use it with simple clicks.

Supports 16 languages including Chinese, English, Japanese, Korean, French, German, and Italian, and allows online voice recording via microphone.

For optimal synthesis results, it is recommended to record a clear and accurate voice clip lasting 5 to 20 seconds without background noise.
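For WAV clips, the recommended length can be checked programmatically. A minimal sketch using only the standard library (the 5-20 second bounds come from the recommendation above; the function name is illustrative):

```python
# Minimal sketch: verify a reference WAV clip is within the recommended
# 5-20 second range. The stdlib wave module only handles WAV files;
# mp3/flac would need a third-party library.
import wave

def clip_duration_ok(path, low=5.0, high=20.0):
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / float(w.getframerate())
    return low <= duration <= high
```

This does not check for background noise or clarity, only duration.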

English synthesis quality is excellent, while Chinese quality is acceptable.

How to Use the Precompiled Windows Version (Other Systems Can Deploy from Source)

  1. Open the Releases download page and download the precompiled main file (1.7 GB) and the models (3 GB).

  2. After downloading, extract the files to a location, such as E:/clone-voice.

  3. Double-click app.exe and wait for the web window to open automatically. Carefully read the text prompts in the cmd window, as any errors will be displayed here.

  4. After downloading the models, extract them into the tts folder in the software directory.

  5. Conversion steps:

    • Select the 【Text -> Voice】 button, enter text in the text box or import an SRT subtitle file, then click "Start Now."

    • Select the 【Voice -> Voice】 button, click or drag the audio file to convert (mp3/wav/flac), then choose the voice tone to clone from the "Voice File to Use" dropdown. If there are no satisfactory options, click the "Local Upload" button to select a pre-recorded 5-20 second WAV/MP3/FLAC voice file, or click "Start Recording" to record your own voice online for 5-20 seconds, then click "Use" after recording. Finally, click the "Start Now" button.

  6. If the machine has an NVIDIA GPU and the CUDA environment is correctly configured, CUDA acceleration will be automatically used.
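The automatic device selection in step 6 can be checked with a short script. A minimal sketch, assuming PyTorch is installed (without it, CPU is reported):

```python
# Minimal sketch of the automatic device selection described above:
# use CUDA when PyTorch can see a GPU, otherwise fall back to CPU.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"
print(f"Inference device: {device}")
```

If this prints cpu on a machine with an NVIDIA GPU, the CUDA environment is not configured correctly.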

Source Code Deployment (Linux, Mac, Windows)

The source code version requires a global proxy because models need to be downloaded from https://huggingface.co, which is inaccessible in some regions.
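One way to route the downloads through a proxy is to set the standard proxy environment variables before the download starts; a sketch, where the proxy address is only an example and must be replaced with your own:

```python
# Sketch: route huggingface.co downloads through a local proxy by
# setting the standard proxy environment variables before the model
# download begins. The address 127.0.0.1:10809 is an example only.
import os

os.environ["HTTP_PROXY"] = "http://127.0.0.1:10809"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:10809"
```

Alternatively, set these variables in the shell before launching the scripts below.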

  1. Requires Python 3.9 to 3.11.

  2. Create an empty directory, e.g., E:/clone-voice, open a cmd window in this directory by typing cmd in the address bar and pressing Enter. Use Git to pull the source code to the current directory: git clone [email protected]:jianchang512/clone-voice.git .

  3. Create a virtual environment: python -m venv venv

  4. Activate the environment: on Windows, run E:/clone-voice/venv/scripts/activate; on Linux and Mac, run source venv/bin/activate.

  5. Install dependencies: pip install -r requirements.txt

  6. On Windows, extract ffmpeg.7z and place ffmpeg.exe in the same directory as app.py. On Linux and Mac, download the corresponding version of ffmpeg from the official ffmpeg website, extract it, and place the ffmpeg binary in the root directory, in the same directory as app.py.

    Then run python code_dev.py; when prompted to agree to the license, enter y and wait for the models to download. Downloading the models requires a global proxy. The models are very large, and if the proxy is unstable, many errors may occur, most of which are due to proxy issues.

    If several models download successfully but an error such as "Downloading WavLM model" persists, edit the library file \venv\Lib\site-packages\aiohttp\client.py: around line 535, just above if proxy is not None:, add a line setting your proxy address, e.g., proxy = "http://127.0.0.1:10809".

  7. After downloading, start the application with python app.py.

  8. Each startup will connect to external servers to check or update models; please wait patiently. To avoid this, manually modify the dependency file: Open \venv\Lib\site-packages\TTS\utils\manage.py, around line 389 in the def download_model method, and comment out the following code:

if md5sum is not None:
    md5sum_file = os.path.join(output_path, "hash.md5")
    if os.path.isfile(md5sum_file):
        with open(md5sum_file, mode="r") as f:
            if not f.read() == md5sum:
                print(f" > {model_name} has been updated, clearing model cache...")
                self.create_dir_and_download_model(model_name, model_item, output_path)
            else:
                print(f" > {model_name} is already downloaded.")
    else:
        print(f" > {model_name} has been updated, clearing model cache...")
        self.create_dir_and_download_model(model_name, model_item, output_path)
  Note: The source code version may frequently encounter errors, mostly due to proxy issues preventing model downloads or leaving them incomplete. It is recommended to use a stable global proxy. If downloads remain incomplete, use the precompiled version.
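Before starting, the two local prerequisites from the steps above can be verified with a small script. A sketch, assuming it is run from the project directory; checking PATH via shutil.which is a convenience beyond the README's same-directory requirement:

```python
# Preflight sketch for the deployment steps above: check the Python
# version range from step 1 and look for the ffmpeg binary from step 6.
import os
import shutil
import sys

def preflight():
    problems = []
    if not ((3, 9) <= sys.version_info[:2] <= (3, 11)):
        problems.append(f"Python 3.9-3.11 required, found {sys.version.split()[0]}")
    if shutil.which("ffmpeg") is None and not os.path.isfile("ffmpeg"):
        problems.append("ffmpeg not found next to app.py or on PATH")
    return problems

for problem in preflight():
    print("WARNING:", problem)
```

An empty result means both checks passed; any warnings point at the step to redo.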

CUDA Acceleration Support

Detailed CUDA Installation Guide

Important Notes

The xtts model is for learning and research purposes only and not for commercial use.

  1. The source code version requires a global proxy because models need to be downloaded from https://huggingface.co, which is inaccessible in some regions. Frequent errors during startup are often due to proxy issues. Use a stable global proxy; if downloads remain incomplete, use the precompiled version.

  2. After startup, the models must be loaded into memory (a cold start), which takes some time. Please wait patiently until http://127.0.0.1:9988 is displayed and the browser page opens automatically, then wait an additional 2-3 minutes before performing conversions.

  3. Features include:

    • Text to Speech: Input text and generate speech using the selected voice tone.

    • Voice to Voice: Select a local audio file and generate another audio file using the selected voice tone.

  4. If the cmd window remains unresponsive for a long time and requires pressing Enter to continue, click the icon in the top-left corner of the cmd window, select "Properties," and uncheck the boxes for "Quick Edit Mode" and "Insert Mode."
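The wait in note 2 can also be handled programmatically by polling the local web UI until it responds. A minimal sketch using only the standard library; the URL is the default from this README, while the timeout and interval values are assumptions:

```python
# Sketch: poll the local web UI until it answers, so scripts can wait
# out the cold start before issuing conversions.
import time
import urllib.error
import urllib.request

def wait_for_ready(url="http://127.0.0.1:9988", timeout=300, interval=5):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass
        time.sleep(interval)
    return False
```

Note that the first HTTP response only means the server is up; per note 2, conversions may still need a further 2-3 minutes.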