
CosyVoice Open Source Address: https://github.com/FunAudioLLM/CosyVoice

CosyVoice-api Open Source Address: https://github.com/jianchang512/cosyvoice-api

Supports Chinese, English, Japanese, Korean, Cantonese, with corresponding language codes: zh|en|jp|ko|yue

Using in Video Translation Software

  1. First, upgrade the software to version 2.08+.
  2. Ensure the CosyVoice project is deployed, that the api.py from CosyVoice-api has been added to it, and that api.py has been launched successfully (the API service must be running before the translation software can use it).
  3. Open the video translation software, go to Settings (top left) -- CosyVoice: Enter the API address, which defaults to http://127.0.0.1:9233.
  4. Fill in the reference audio and corresponding text.
Reference audio format:

Each line is separated into two parts by the # symbol. The first part is the path to the WAV audio file, and the second part is the corresponding text content. Multiple lines can be entered.

The optimal duration for WAV audio is 5-15 seconds. If the audio file is located in the root directory of the CosyVoice project (same directory as webui.py), you can directly enter the file name here.
If it's located in the wavs directory under the root directory, you need to enter wavs/audio_name.wav

Example reference audio entries:

1.wav#Hello dear friends
wavs/2.wav#Hello friends
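For reference, the `path#text` format above can be parsed line by line. A minimal sketch (the function name is illustrative, not part of the software):

```python
# Parse reference-audio entries of the form "path#text", one per line.
def parse_reference_lines(raw: str):
    entries = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or "#" not in line:
            continue  # skip blank or malformed lines
        path, text = line.split("#", 1)  # keep any "#" inside the text part
        entries.append({"path": path, "text": text})
    return entries

entries = parse_reference_lines("1.wav#Hello dear friends\nwavs/2.wav#Hello friends")
```

Splitting on the first `#` only means the text portion may itself contain `#` characters.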
  5. After filling in the information, select CosyVoice as the voice-over channel and the corresponding role on the main interface. The clone role reproduces the voice timbre of the original video.

For other systems, please deploy CosyVoice first. The specific deployment method is as follows:

Source Code Deployment of the Official CosyVoice Project

Deployment with conda is strongly recommended; otherwise, installation may fail and you may run into many issues, since some dependencies, such as pynini, cannot be installed with pip on Windows.

1. Download and install Miniconda

Miniconda is a minimal installer for conda. It is easy to install on Windows; just follow the prompts as with any normal software installation.

Download address: https://docs.anaconda.com/miniconda/

After downloading, double-click the .exe file.

Note that in the following interface, you need to select the top two checkboxes; otherwise, subsequent operations will be a bit troublesome. Selecting the second checkbox means "Add conda commands to the system environment variables." If you don't select it, you won't be able to use the conda short commands directly.

Then click "install" and wait for it to complete before closing.

2. Download CosyVoice source code

First, create an empty directory, for example, D:/py on the D drive. The following instructions will use this as an example.

Open the CosyVoice repository at https://github.com/FunAudioLLM/CosyVoice and download the source code as a ZIP archive.

After downloading and extracting, copy all the files in the CosyVoice-main directory to D:/py.

3. Create a virtual environment and activate it

Go to the D:/py folder, enter cmd in the address bar, and press Enter. This will open a black cmd window.

In the window, enter the command conda create -n cosyvoice python=3.10 and press Enter. This creates a virtual environment named "cosyvoice" with Python version "3.10".

Continue by entering the command conda activate cosyvoice and press Enter. This activates the virtual environment. Only after activation can you continue with installation, startup, and other operations; otherwise, errors will inevitably occur.

The activated state is indicated by the "(cosyvoice)" prefix at the beginning of the command line.

4. Install the pynini module

This module can only be installed using the conda command on Windows, which is why using conda on Windows is recommended at the beginning.

Continue in the cmd window that you opened and activated above, and enter the command conda install -y -c conda-forge pynini==2.1.5 WeTextProcessing==1.0.3 and press Enter.

Note: During installation, a prompt will appear asking for confirmation. Enter y and press Enter, as shown below.

5. Install other dependencies using the Alibaba Cloud mirror

Open requirements.txt and delete the last line, WeTextProcessing==1.0.3; otherwise the installation will definitely fail, because that module depends on pynini, which cannot be installed with pip on Windows.

Then add 3 lines to requirements.txt: Matcha-TTS, flask, and waitress.
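After editing, the end of requirements.txt should look roughly like this (the pinned packages above it are unchanged):

```
Matcha-TTS
flask
waitress
```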

Continue by entering the command

pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

And press Enter. After a long wait, the installation should be successful.

6. Download the api.py file and place it in the project

Go to this address https://github.com/jianchang512/cosyvoice-api/blob/main/api.py to download the api.py file, and place it together with webui.py.


Start the API service

The API interface address is: http://127.0.0.1:9233

Enter the command python api.py and press Enter to execute.

API Interface List

Synthesize text based on built-in roles

  • Interface address: /tts

  • Simply synthesizes text into speech without voice cloning.

  • Required parameters:

text: The text to be synthesized into speech.

role: Select one of '中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女' (Chinese Female, Chinese Male, Japanese Male, Cantonese Female, English Female, English Male, Korean Female).

  • Successful return: WAV audio data.

  • Example code

import requests

data = {
    "text": "Hello dear friends",
    "role": "中文女"
}

response = requests.post('http://127.0.0.1:9233/tts', data=data, timeout=3600)

with open('./test.wav', 'wb') as f:
    f.write(response.content)

Clone voice timbre in the same language

  • Address: /clone_eq

The pronunciation language of the reference audio matches the language of the text to be synthesized; for example, the reference audio is spoken Chinese, and you synthesize Chinese text into speech based on that audio.

  • Required parameters:

text: The text to be synthesized into speech.

reference_audio: The reference audio for voice cloning, given as a path relative to api.py. For example, if referencing 1.wav and the file is in the same folder as api.py, enter 1.wav.

reference_text: The text corresponding to the reference audio.

  • Successful return: WAV data.

  • Example code

import requests

data = {
    "text": "Hello dear friends.",
    "reference_audio": "10.wav",
    "reference_text": "I hope you are doing better than me."
}

response = requests.post('http://127.0.0.1:9233/clone_eq', data=data, timeout=3600)

Clone voice timbre in different languages

  • Address: /clone

The pronunciation language of the reference audio is inconsistent with the language of the text to be synthesized. For example, you need to synthesize an English text into speech based on a reference audio with Chinese pronunciation.

  • Required parameters:

text: The text to be synthesized into speech.

reference_audio: The reference audio for voice cloning. The path to the reference audio relative to api.py. For example, if referencing 1.wav, and the file is in the same folder as api.py, then enter 1.wav.

  • Successful return: WAV data.

  • Example code

import requests

data = {
    "text": "親友からの誕生日プレゼントを遠くから受け取り、思いがけないサプライズと深い祝福に、私の心は甘い喜びで満たされた!",
    "reference_audio": "10.wav"
}

response = requests.post('http://127.0.0.1:9233/clone', data=data, timeout=3600)

OpenAI TTS Compatibility

  • Interface address: /v1/audio/speech
  • Request method: POST
  • Request type: Content-Type: application/json
  • Request parameters:

input: The text to be synthesized.

model: Fixed to tts-1 for compatibility with OpenAI parameters, but not actually used.

speed: Speech rate, default is 1.0.

response_format: Return format, fixed to WAV audio data.

voice: Used only for plain text synthesis; select one of '中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女' (Chinese Female, Chinese Male, Japanese Male, Cantonese Female, English Female, English Male, Korean Female).

When cloning, fill in the path to the reference audio relative to api.py. For example, if referencing 1.wav, and the file is in the same folder as api.py, then enter 1.wav.

  • Example code
from openai import OpenAI

client = OpenAI(api_key='12314', base_url='http://127.0.0.1:9233/v1')

with client.audio.speech.with_streaming_response.create(
    model='tts-1',
    voice='中文女',
    input='Hello dear friends',
    speed=1.0
) as response:
    with open('./test.wav', 'wb') as f:
        for chunk in response.iter_bytes():
            f.write(chunk)
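For cloning through this OpenAI-compatible endpoint, the voice field carries the reference-audio path (relative to api.py) instead of a built-in role name. A sketch of the underlying HTTP request using only the standard library, shown without sending it so the JSON shape is visible (the helper name and the file name 1.wav are illustrative):

```python
import json
import urllib.request

def build_speech_request(text: str, voice: str, speed: float = 1.0):
    """Build (but do not send) a POST to the OpenAI-compatible endpoint.

    When cloning, `voice` is a reference-audio path relative to api.py
    (e.g. '1.wav'); otherwise it is a built-in role name such as '中文女'.
    """
    payload = {"model": "tts-1", "input": text, "voice": voice, "speed": speed}
    return urllib.request.Request(
        "http://127.0.0.1:9233/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_speech_request("Hello dear friends", "1.wav")
```

Sending the request with urllib.request.urlopen(req) would return WAV audio data, assuming api.py is running locally.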