Voice Cloning Tool

clone-voice Open Source Project on GitHub

The models used in this project all come from https://github.com/coqui-ai/TTS and are released under the CPML license, which permits learning and research use only; commercial use is prohibited.

This is a voice-cloning tool: give it a sample of any human voice and it can synthesize text into speech in that voice, or convert an existing recording into that voice.

It is very simple to use, and no NVIDIA GPU is required: download the pre-compiled version, double-click app.exe to open a web interface, and a few mouse clicks are all it takes.

Supports 16 languages including Chinese, English, Japanese, Korean, French, German, and Italian. You can record your voice online from a microphone.

To ensure the synthesis effect, it is recommended to record for 5 to 20 seconds, with clear and accurate pronunciation and no background noise.
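As a quick sanity check on a reference clip before uploading it, a few lines of standard-library Python can report a WAV file's duration (the function name `clip_duration_seconds` is only illustrative, not part of this project):

```python
import wave

def clip_duration_seconds(path):
    """Return a WAV file's duration in seconds.

    Reference clips for cloning should run roughly 5-20 s,
    per the recording recommendation above.
    """
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())
```

This only covers wav input; for mp3/flac you would need an external decoder such as ffmpeg, which the tool already ships with.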

English output is very good; Chinese output is passable.

How to Use the Windows Pre-compiled Version (for other systems, see Source Code Deployment below)

  1. Click here to open the Releases download page and download the pre-compiled main file (1.7 GB) and the models (3 GB).

  2. After downloading, extract it to a location, such as E:/clone-voice.

  3. Double-click app.exe and wait for the web window to open automatically. Please read the text prompts in the cmd window carefully. Any errors will be displayed here.

  4. After downloading the model, extract it to the tts folder in the software directory.

  5. Conversion operation steps:

    • Select the 【Text -> Voice】 button, enter text in the text box, or click to import an SRT subtitle file, and then click "Start Now".

    • Select the 【Voice -> Voice】 button, then click or drag in the audio file you want to convert (mp3/wav/flac). Choose the voice to clone from the "Voice file to use" drop-down box. If none of the listed voices suits you, click the "Upload Local" button to select a recorded 5-20 s wav/mp3/flac file, or click the "Start Recording" button to record your own voice online for 5-20 s and click "Use" when finished. Finally, click the "Start Now" button.

  6. If the machine has an NVIDIA GPU and the CUDA environment is configured correctly, CUDA acceleration will be used automatically.
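The automatic detection in step 6 can be sketched as a small helper, assuming PyTorch is the CUDA backend (as it is for Coqui TTS); `pick_device` is an illustrative name, not a function from this project:

```python
def pick_device():
    """Return "cuda" when an NVIDIA GPU with a working CUDA runtime
    is visible to PyTorch, else fall back to "cpu"."""
    try:
        import torch  # installed via the project's requirements
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"
```

app.exe makes this decision internally; the sketch only mirrors the behavior described above so you can verify your own CUDA setup.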

Source Code Deployment (Linux, Mac, Windows)

The source code version requires a global proxy because it needs to download models from https://huggingface.co, which is not accessible in China.

  1. Requires Python 3.9 -> 3.11

  2. Create an empty directory, such as E:/clone-voice, and open a cmd window in it (type cmd in the Explorer address bar and press Enter). Pull the source into the current directory: git clone git@github.com:jianchang512/clone-voice.git .

  3. Create a virtual environment: python -m venv venv

  4. Activate the environment. On Windows: E:/clone-voice/venv/scripts/activate; on Linux and Mac: source venv/bin/activate

  5. Install dependencies: pip install -r requirements.txt

  6. On Windows, unzip ffmpeg.7z and place ffmpeg.exe in the same directory as app.py. On Linux and Mac, download the matching build from the ffmpeg official website and place the ffmpeg executable binary in the same directory as app.py.

    First run python code_dev.py. When prompted to accept the license agreement, enter y and wait for the models to download. Downloading requires a global proxy; the models are large, and an unstable proxy causes most of the errors you may encounter.

    If several models download successfully but the run still ends with a "Downloading WavLM model" error, edit the library file \venv\Lib\site-packages\aiohttp\client.py: around line 535, on the line above if proxy is not None:, set your proxy address, for example proxy="http://127.0.0.1:10809".

  7. After the download is complete, start python app.py again.

  8. Each startup connects to the internet to check for and update models, which takes time; please be patient. If you do not want this check on every start, edit the dependency file \venv\Lib\site-packages\TTS\utils\manage.py: around line 389, inside the def download_model method, comment out the following code:

if md5sum is not None:
    md5sum_file = os.path.join(output_path, "hash.md5")
    if os.path.isfile(md5sum_file):
        with open(md5sum_file, mode="r") as f:
            if not f.read() == md5sum:
                print(f" > {model_name} has been updated, clearing model cache...")
                self.create_dir_and_download_model(model_name, model_item, output_path)
            else:
                print(f" > {model_name} is already downloaded.")
    else:
        print(f" > {model_name} has been updated, clearing model cache...")
        self.create_dir_and_download_model(model_name, model_item, output_path)
  9. The source code version often fails on startup, almost always because proxy problems prevent the models from downloading completely or interrupt the download. Use a stable proxy and enable it globally; if the download still cannot complete, use the pre-compiled version instead.
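Before resorting to editing library files, it may help to set the standard proxy environment variables at the top of the entry script or in the shell; Python clients such as requests (and huggingface_hub through it) honor them, though aiohttp only does when its session is created with trust_env=True, which is why the client.py edit above may still be needed. The address below is an example only; substitute your own proxy:

```python
import os

# Example proxy address; replace with your own before running
# code_dev.py or app.py.
os.environ["HTTP_PROXY"] = "http://127.0.0.1:10809"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:10809"
```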

CUDA Acceleration Support

Detailed installation guide for the CUDA toolkit

Precautions

The model xtts can only be used for learning and research, and cannot be used for commercial purposes.

  1. The source code version requires a global proxy, because it downloads models from https://huggingface.co, which is not accessible from China. Startup errors are almost always proxy problems that leave the model download incomplete or interrupted. Use a stable proxy and enable it globally; if the download still cannot complete, use the pre-compiled version instead.

  2. After starting, the model must cold-load, which takes a while. Wait until http://127.0.0.1:9988 is printed and the browser page opens automatically, then allow another two or three minutes before starting a conversion.

  3. Features include:

    • Text to speech: enter text and generate audio in the selected voice.

    • Voice to voice: select an audio file from the local computer and regenerate it in the selected voice.

  4. If the cmd window appears frozen, output may be paused by an accidental text selection; press Enter in the window to resume. To prevent this, click the icon in the window's upper-left corner, choose "Properties", and uncheck the "Quick Edit" and "Insert Mode" checkboxes.
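If you script around the tool, the cold-load wait from note 2 can be automated by polling the local address until it answers. This is a minimal standard-library sketch; the URL comes from the notes above, while the timeout and retry interval are arbitrary choices:

```python
import time
import urllib.request

def wait_for_server(url="http://127.0.0.1:9988", timeout=300):
    """Poll the local web UI until it responds.

    Returns True once the server answers, False if the
    timeout (in seconds) elapses first.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            urllib.request.urlopen(url, timeout=5)
            return True
        except OSError:
            # Connection refused / timed out: server still loading.
            time.sleep(3)
    return False
```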