
Voice Cloning Tool

clone-voice open source project: https://github.com/jianchang512/clone-voice

The models used in this project all come from https://github.com/coqui-ai/TTS. The models are released under the CPML license, which permits use only for learning and research; commercial use is not allowed.

This is a voice cloning tool. It can synthesize text into speech in any reference voice, or convert audio spoken in one voice into another voice (voice-to-voice conversion).

It's very easy to use, and you don't need an NVIDIA GPU. Download the pre-compiled version, double-click app.exe to open a web interface, and you can use it with a few clicks of the mouse.

Supports 16 languages including Chinese, English, Japanese, Korean, French, German, and Italian. You can also record your voice online through a microphone.

To ensure good synthesis quality, record 5 to 20 seconds of clear, accurately pronounced speech with no background noise.
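
If your reference recording is longer than 20 seconds or in another format, you can trim and convert it with ffmpeg (which this project relies on anyway); a minimal sketch, where my_recording.m4a and sample.wav are example file names:

    ffmpeg -i my_recording.m4a -t 15 -ar 44100 -ac 1 sample.wav

This keeps the first 15 seconds of the input, resamples it to 44.1 kHz mono, and writes a wav file that can be used as the voice to clone.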

The English effect is excellent, and the Chinese effect is acceptable.

How to Use the Windows Pre-compiled Version (Other Systems Can Deploy from Source)

  1. Open the Releases download page and download the pre-compiled main file (1.7 GB) and the model package (3 GB).

  2. After downloading, extract it to a location, such as E:/clone-voice.

  3. Double-click app.exe and wait for the web window to open automatically. Read the text prompts in the cmd window carefully; any errors will be displayed there.

  4. After downloading the model, extract it into the tts folder in the software directory.

  5. Conversion steps:

    • Select the "Text -> Voice" button, enter text in the text box or click to import an SRT subtitle file (see the example SRT snippet after this list), and then click "Start Now".

    • Select the "Voice -> Voice" button, click or drag in the audio file you want to convert (mp3/wav/flac), and then choose the voice you want to clone from the "Voice file to use" drop-down box. If none of the listed voices suits you, click the "Upload Local" button to select a recorded 5-20 s wav/mp3/flac file, or click the "Start Recording" button to record your own voice online for 5-20 s and click "Use" when finished. Then click the "Start Now" button.

  6. If the machine has an NVIDIA GPU and the CUDA environment is configured correctly, CUDA acceleration will be used automatically.
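
For reference, an SRT subtitle file imported in the "Text -> Voice" step is plain text in the standard SubRip format; a minimal example with placeholder timestamps and text:

    1
    00:00:01,000 --> 00:00:05,000
    First sentence to be synthesized.

    2
    00:00:06,000 --> 00:00:10,000
    Second sentence to be synthesized.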

Source Code Deployment (Linux, Mac, Windows)

The source code version requires a global proxy because it needs to download models from https://huggingface.co, which is not accessible in China.

  1. Requires Python 3.9 to 3.11.

  2. Create an empty directory, such as E:/clone-voice, and open a cmd window in that directory (type cmd in the address bar and press Enter). Then use git to pull the source code into the current directory: git clone git@github.com:jianchang512/clone-voice.git .

  3. Create a virtual environment: python -m venv venv

  4. Activate the environment. On Windows: E:/clone-voice/venv/scripts/activate. On Linux and Mac: source venv/bin/activate

  5. Install dependencies: pip install -r requirements.txt

  6. On Windows, extract ffmpeg.7z and place ffmpeg.exe in the same directory as app.py. On Linux and Mac, download the corresponding build from the official ffmpeg website and extract the ffmpeg binary into the project root; the ffmpeg executable and app.py must be in the same directory.

  7. First run python code_dev.py. When prompted to agree to the license agreement, enter y and wait for the models to download. Downloading the models requires a global proxy; the models are very large, and if the proxy is not stable you may encounter many errors, most of which are proxy related.

    If multiple models are shown as downloaded successfully but you still get a "Downloading WavLM model" error, you need to modify the library file \venv\Lib\site-packages\aiohttp\client.py. Around line 535, on the line above if proxy is not None:, hard-code your proxy address, for example proxy="http://127.0.0.1:10809", as sketched below.
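
    A rough sketch of that edit (the exact line number and surrounding code vary between aiohttp versions, and 127.0.0.1:10809 is only an example address):

        # venv\Lib\site-packages\aiohttp\client.py, around line 535
        proxy = "http://127.0.0.1:10809"  # added line: hard-code your own proxy address here
        if proxy is not None:             # existing line, leave it unchanged
            ...                           # existing code continues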

  8. After the downloads are complete, start the app by running python app.py.

  9. Each startup connects to the internet to check for or update models; please be patient. If you don't want this check on every startup, manually modify the file in the dependency package: open \venv\Lib\site-packages\TTS\utils\manage.py and, around line 389 in the def download_model method, comment out the following code:

if md5sum is not None:
    md5sum_file = os.path.join(output_path, "hash.md5")
    if os.path.isfile(md5sum_file):
        with open(md5sum_file, mode="r") as f:
            if not f.read() == md5sum:
                print(f" > {model_name} has been updated, clearing model cache...")
                self.create_dir_and_download_model(model_name, model_item, output_path)
            else:
                print(f" > {model_name} is already downloaded.")
    else:
        print(f" > {model_name} has been updated, clearing model cache...")
        self.create_dir_and_download_model(model_name, model_item, output_path)
  10. The source code version may frequently fail on startup; these failures are almost always proxy problems that leave a model download incomplete or interrupted. Use a stable proxy and enable it globally, and if you still cannot complete the downloads, use the pre-compiled version instead.

CUDA Acceleration Support

See the detailed guide on installing the CUDA toolkit.
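
If you are unsure whether the CUDA environment is configured correctly, a quick check from the activated venv (a minimal sketch; it assumes PyTorch was installed as part of requirements.txt):

    import torch

    # True means the NVIDIA GPU and CUDA runtime are visible to PyTorch,
    # so the tool can use CUDA acceleration automatically.
    print(torch.cuda.is_available())
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # name of the detected GPU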

Precautions

The xtts model can only be used for learning and research, not for commercial purposes.

  1. The source code version requires a global proxy because it downloads models from https://huggingface.co, which is not accessible in China. It may frequently fail on startup; these failures are almost always proxy problems that leave a model download incomplete or interrupted. Use a stable proxy and enable it globally, and if you still cannot complete the downloads, use the pre-compiled version instead.

  2. After startup, the model is cold-loaded, which takes some time. Wait patiently until http://127.0.0.1:9988 is displayed and the browser page opens automatically, then allow two or three minutes before performing a conversion (see the quick check at the end of this section).

  3. Functions include:

    Text to speech: enter text and generate audio spoken in the selected voice.

    Voice to voice: select a local audio file and generate a new audio file in the selected voice.
    
  4. If the open cmd window shows no new output for a long time, press Enter in it to resume output. To prevent this, click the icon in the upper-left corner of the cmd window, select "Properties", and uncheck the "Quick Edit Mode" and "Insert Mode" checkboxes.
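
To confirm from another terminal that the web service is reachable, you can request the local address mentioned above (a minimal check; it assumes curl is available):

    curl -I http://127.0.0.1:9988

An HTTP response means the web interface is up; if the connection is refused, the service has not finished starting yet.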