
Voice Cloning Tool

clone-voice open-source project: https://github.com/jianchang512/clone-voice

All models used in this project come from https://github.com/coqui-ai/TTS and are released under the CPML (Coqui Public Model License). They may only be used for learning and research, not for commercial purposes.

This voice cloning tool takes a sample of any human voice and either synthesizes speech from a piece of text in that voice, or converts a recording of one voice so that it sounds like that voice.

It is simple to use and does not require an NVIDIA GPU. Download the pre-compiled version and double-click app.exe to open a web interface, then operate everything with the mouse.

It supports 16 languages, including Chinese, English, Japanese, Korean, French, German, and Italian, and can record a voice sample directly from the microphone in the web interface.

For the best results, record a sample of 5 to 20 seconds with clear, accurate pronunciation and no background noise.
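To check a WAV recording against this 5 to 20 second guideline before uploading it, a small sketch like the one below can help. It is not part of clone-voice: the function names and the MIN_SECONDS/MAX_SECONDS constants are hypothetical, and it only reads uncompressed PCM WAV files through Python's standard wave module (mp3 and flac would need a third-party decoder).

```python
import wave

# Recommended sample length from the guide (hypothetical constant names).
MIN_SECONDS = 5.0
MAX_SECONDS = 20.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        # Frames per channel divided by frames per second gives seconds.
        return wf.getnframes() / wf.getframerate()


def is_good_sample_length(path: str) -> bool:
    """True if the WAV sample lies inside the recommended 5 to 20 second window."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

For example, is_good_sample_length("my_voice.wav") returns False for a 3-second clip.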

English output quality is very good; Chinese output is acceptable.

How to use the Windows pre-compiled version (on other systems, deploy from source code)

  1. Open the Releases download page and download the pre-compiled main package (1.7 GB) and the model package (3 GB).

  2. After downloading, extract the main package to a folder, for example E:/clone-voice.

  3. Double-click app.exe and wait for the web interface to open automatically. Read the messages in the cmd window carefully; any errors are displayed there.

  4. Extract the downloaded model package into the tts folder inside the software directory.

  5. Conversion steps:

    • Text to speech: select the [Text -> Voice] button, type text in the text box or click to import an SRT subtitle file, and then click "Start Now".

    • Voice to voice: select the [Voice -> Voice] button and click or drag in the audio file (mp3/wav/flac) to convert. Then choose the voice to clone from the "Voice file to use" drop-down. If none of the voices there suits you, click "Local Upload" to select a 5 to 20 second wav/mp3/flac recording, or click "Start Recording" to record 5 to 20 seconds of your own voice and click "Use" when you finish. Finally, click "Start Now".

  6. If the machine has an NVIDIA GPU and the CUDA environment is configured correctly, CUDA acceleration is used automatically.

Source code deployment (Linux, macOS, Windows)

The source code version requires a global proxy in mainland China, because it downloads the models from https://huggingface.co, which cannot be accessed there.

  1. Requires Python 3.9 to 3.11.

  2. Create an empty directory, for example E:/clone-voice, and open a cmd window in it (type cmd in the File Explorer address bar and press Enter). Then use git to pull the source code into the current directory: git clone git@github.com:jianchang512/clone-voice.git .

  3. Create a virtual environment: python -m venv venv

  4. Activate the environment: on Windows run E:/clone-voice/venv/scripts/activate; on Linux and macOS run source ./venv/bin/activate.

  5. Install dependencies: pip install -r requirements.txt

  6. On Windows, extract ffmpeg.7z and put ffmpeg.exe in the same directory as app.py. On Linux and macOS, download the matching ffmpeg build from the official ffmpeg website and extract the ffmpeg executable into the project root directory. Either way, the ffmpeg executable and app.py must be in the same directory.

    Run python code_dev.py first, enter y when prompted to accept the license agreement, and wait for the models to download. Downloading requires a global proxy and the models are very large, so an unstable proxy can cause many errors; most download errors come from proxy problems.

    If the output shows that several models downloaded successfully but you still get a "Downloading WavLM model" error, edit the library file \venv\Lib\site-packages\aiohttp\client.py: at around line 535, on the line just above if proxy is not None:, add your proxy address, for example proxy="http://127.0.0.1:10809".

  7. After the download completes, start the app with python app.py.

  8. On every startup the app connects to the internet to check for model updates, which takes some time. To skip this check, edit the dependency file \venv\Lib\site-packages\TTS\utils\manage.py: in the download_model method, at around line 389, comment out the following code:

if md5sum is not None:
    md5sum_file = os.path.join(output_path, "hash.md5")
    if os.path.isfile(md5sum_file):
        with open(md5sum_file, mode="r") as f:
            if not f.read() == md5sum:
                print(f" > {model_name} has been updated, clearing model cache...")
                self.create_dir_and_download_model(model_name, model_item, output_path)
            else:
                print(f" > {model_name} is already downloaded.")
    else:
        print(f" > {model_name} has been updated, clearing model cache...")
        self.create_dir_and_download_model(model_name, model_item, output_path)
  9. The source code version may often fail at startup. This is almost always caused by proxy problems: the model cannot be downloaded from outside the firewall, or the download is interrupted and left incomplete. Use a stable proxy in global mode. If you still cannot download the models completely, use the pre-compiled version instead.
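Steps 1 and 6 are easy to get wrong. The following is a hypothetical helper, not a script shipped with clone-voice, that checks the Python version and whether the ffmpeg executable sits next to app.py. The function names, and the assumption that the binary is named ffmpeg.exe on Windows and ffmpeg elsewhere, are illustrative.

```python
import sys
from pathlib import Path
from typing import List, Optional

# Supported range stated in the guide: Python 3.9 to 3.11.
MIN_PY = (3, 9)
MAX_PY = (3, 11)


def python_version_ok(version=None) -> bool:
    """Return True if the (major, minor) version is within 3.9 to 3.11."""
    if version is None:
        version = sys.version_info
    return MIN_PY <= tuple(version[:2]) <= MAX_PY


def find_ffmpeg(project_dir) -> Optional[Path]:
    """Return the ffmpeg executable next to app.py, or None if it is missing.

    Assumption: the binary is named ffmpeg.exe on Windows and ffmpeg elsewhere.
    """
    name = "ffmpeg.exe" if sys.platform.startswith("win") else "ffmpeg"
    candidate = Path(project_dir) / name
    return candidate if candidate.is_file() else None


def preflight(project_dir=".") -> List[str]:
    """Hypothetical pre-start check; an empty list means every check passed."""
    problems = []
    if not python_version_ok():
        v = sys.version_info
        problems.append(f"Python {v[0]}.{v[1]} found, but 3.9 to 3.11 is required")
    if not (Path(project_dir) / "app.py").is_file():
        problems.append("app.py not found in the given directory")
    if find_ffmpeg(project_dir) is None:
        problems.append("ffmpeg executable not found next to app.py")
    return problems
```

Run it from the project directory, for example print(preflight(".")); an empty list means both checks passed.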

CUDA acceleration support

Detailed installation guide for the CUDA toolkit
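Coqui TTS runs on PyTorch, so for a source code deployment one way to confirm that acceleration will be available is to ask PyTorch from inside the activated virtual environment. The helper name cuda_available is hypothetical; torch.cuda.is_available() is the standard PyTorch call.

```python
def cuda_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA-capable GPU."""
    try:
        import torch  # installed as a dependency of Coqui TTS
    except ImportError:
        # No PyTorch in this environment, so no CUDA acceleration either.
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("CUDA available:", cuda_available())
```

If this prints False on a machine with an NVIDIA GPU, the driver, the CUDA toolkit, or the installed PyTorch build is the likely culprit.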

Precautions

The xtts model can only be used for learning and research, and cannot be used for commercial purposes.

  1. The source code version requires a global proxy in mainland China, because it downloads the models from https://huggingface.co, which cannot be accessed there. Startup errors in the source code version are almost always caused by proxy problems that stop the model download or leave it incomplete. Use a stable proxy in global mode, and if you still cannot download the models completely, use the pre-compiled version.

  2. After startup the model must be cold-loaded, which takes some time. Wait until http://127.0.0.1:9988 is displayed; after the browser page opens automatically, wait another two or three minutes before starting a conversion.

  3. Features:

     Text to speech: enter text and generate speech in the selected voice.

     Voice to voice: select a local audio file and generate a new audio file in the selected voice.
    
  4. If the cmd window shows no new output for a long time, press Enter in it to resume output. To prevent this, click the icon in the upper-left corner of the cmd window, select "Properties", and uncheck the "Quick Edit" and "Insert Mode" checkboxes.
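Instead of watching the cmd window, you can poll the local web address from item 2. This is a hypothetical sketch using only Python's standard library; it assumes the web UI listens on 127.0.0.1:9988 as stated above, and a successful connection only means the web server is up, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 9988,
                  timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll host:port until it accepts a TCP connection.

    Returns True as soon as a connection succeeds, or False once `timeout`
    seconds have passed. A listening port only means the web server is up;
    the model may still be loading.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection; success means "listening".
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, wait_for_port(timeout=600) blocks for up to ten minutes and returns True as soon as the port accepts connections.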