F5-TTS-api
Project source code: https://github.com/jianchang512/f5-tts-api
This project provides an API and a WebUI for F5-TTS.
F5-TTS is an advanced text-to-speech system that uses deep learning technology to generate realistic, high-quality human voices. With just a 10-second audio sample, it can clone your voice. F5-TTS can accurately reproduce speech and give it rich emotional color.
Demo: original audio ("Queen of the Daughter Kingdom") vs. the cloned audio.
Windows Integrated Package (Includes F5-TTS model and runtime environment)
123 Cloud Drive Download https://www.123684.com/s/03Sxjv-okTJ3
Hugging Face Download Address: https://huggingface.co/spaces/mortimerme/s4/resolve/main/f5-tts-api-v0.3.7z?download=true
Applicable Systems: Windows 10/11 (Download and extract to use)
How to use:
Start the API service: double-click run-api.bat. The API address is http://127.0.0.1:5010/api.
The API service must be running before the translation software can use it.
The integrated package defaults to using CUDA 11.8. If you have an NVIDIA graphics card and have configured the CUDA/cuDNN environment, the system will automatically use GPU acceleration. If you want to use a higher version of CUDA, such as 12.4, please do the following:
Navigate to the folder containing api.py, type cmd in the folder's address bar and press Enter, then run the following commands in the terminal that opens:
.\runtime\python -m pip uninstall -y torch torchaudio
.\runtime\python -m pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
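After swapping the torch build, it is worth confirming which CUDA version is actually in use. The snippet below is a sketch (not part of the project); save it and run it with the bundled interpreter, e.g. .\runtime\python check_cuda.py:

```python
# Quick sanity check after swapping the torch build: report which CUDA
# version torch was built against and whether a GPU is actually visible.
def cuda_summary() -> str:
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    return (f"torch {torch.__version__}, "
            f"CUDA build {torch.version.cuda}, "
            f"GPU available: {torch.cuda.is_available()}")

print(cuda_summary())
```

If the summary reports GPU available: False despite an NVIDIA card, the driver or CUDA/cuDNN setup is usually the culprit.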
The advantage of F5-TTS lies in its efficiency and high-quality voice output. Whereas similar systems require longer audio samples, F5-TTS can generate high-fidelity speech from only a short clip, and it conveys emotion well, improving the listening experience in a way many existing systems cannot match.
Currently, F5-TTS supports English and Chinese.
Usage Tips: Proxy/VPN
The models are downloaded from huggingface.co. Since this site is not accessible from mainland China, set up a system or global proxy in advance, otherwise the model download will fail.
The integrated package bundles most of the required models, but it may still check for updates or download small dependent models, so if the terminal shows an HTTPSConnect error, you still need to set up a system proxy.
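If configuring a system-wide proxy is inconvenient, the standard proxy environment variables can be set before api.py starts, since the download libraries (requests and huggingface_hub) honor them. This is a sketch; the address 127.0.0.1:7890 is a placeholder for your proxy's real host and port:

```python
import os

# Placeholder proxy address -- replace with your proxy's actual host:port.
PROXY = "http://127.0.0.1:7890"

# requests and huggingface_hub both honor these standard variables,
# so model downloads from huggingface.co are routed through the proxy.
os.environ["HTTP_PROXY"] = PROXY
os.environ["HTTPS_PROXY"] = PROXY
```

These lines must run before the first download attempt (e.g. at the top of api.py, or exported in the shell that launches it).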
Using in Video Translation Software
Start the API service; it must be running before the translation software can use it.
Open the video translation software, find the TTS settings, select F5-TTS, and enter the API address (default http://127.0.0.1:5010).
Enter the reference audio and its transcript.
Selecting the f5-tts model is recommended for better generation quality.
Using api.py in a third-party integrated package
- Copy api.py and the configs folder to the root directory of the third-party integrated package.
- Find the path of the python.exe bundled with the third-party package. For example, if it is in the py311 folder, type cmd in the address bar of the package's root directory and press Enter, then run .\py311\python api.py. If it reports module flask not found, first run .\py311\python -m pip install waitress flask.
Using api.py after deploying the official F5-TTS project from source
- Copy api.py and the configs folder to the project folder.
- Install the required modules: pip install flask waitress
- Run: python api.py
API Usage Example
import requests

res = requests.post('http://127.0.0.1:5010/api', data={
    "ref_text": 'Fill in the text content corresponding to 1.wav here',
    "gen_text": '''Fill in the text to be generated here.''',
    "model": 'f5-tts'
}, files={"audio": open('./1.wav', 'rb')})

if res.status_code != 200:
    print(res.text)
    exit()

with open("ceshi.wav", 'wb') as f:
    f.write(res.content)
Compatible with OpenAI TTS interface
The voice parameter must contain the reference audio and its transcript, separated by three # signs, for example:
1.wav###You say that the four elements are empty, but why do you close your eyes? If you open your eyes and look at me, I don't believe you, your eyes are empty.
This means the reference audio is 1.wav, located in the same directory as api.py, and its transcript is "You say that the four elements are empty, but why do you close your eyes? If you open your eyes and look at me, I don't believe you, your eyes are empty."
The returned data is always wav audio data.
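On the client side, the voice convention can be built and checked with a small helper. This is an illustrative sketch (parse_voice is not part of the API); it splits on the first occurrence of the three-# separator:

```python
def parse_voice(voice: str) -> tuple[str, str]:
    """Split an OpenAI-compatible `voice` value into
    (reference audio path, reference transcript)."""
    ref_audio, sep, ref_text = voice.partition("###")
    if not sep or not ref_audio or not ref_text:
        raise ValueError("voice must look like '1.wav###reference transcript'")
    return ref_audio, ref_text

# Example: yields ('1.wav', 'Hello there')
audio_path, transcript = parse_voice("1.wav###Hello there")
```

Validating the string before sending the request gives a clearer error than a failed synthesis call.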
from openai import OpenAI

client = OpenAI(api_key='12314', base_url='http://127.0.0.1:5010/v1')

with client.audio.speech.with_streaming_response.create(
    model='f5-tts',
    voice="1.wav###You say that the four elements are empty, but why do you close your eyes? If you open your eyes and look at me, I don't believe you, your eyes are empty.",
    input='Hello, dear friends',
    speed=1.0
) as response:
    with open('./test.wav', 'wb') as f:
        for chunk in response.iter_bytes():
            f.write(chunk)