SenseVoice is an open-source speech recognition foundation model by Alibaba that supports recognizing speech in Chinese, Japanese, Korean, and English. Compared to previous models, it boasts faster recognition speed and higher accuracy.

However, the official release doesn't include timestamp output, which makes it inconvenient for generating subtitles. To work around this, I use a separate VAD model to pre-segment the audio and then run SenseVoice to recognize each segment. That workflow led to this API project, which is integrated into video translation software for ease of use.

SenseVoice Official Repository: https://github.com/FunAudioLLM/SenseVoice

This API Project: https://github.com/jianchang512/sense-api

Project Functionality

  1. Replaces the official api.py file to enable timestamped SRT subtitle output.
  2. Connects to video translation and dubbing software for seamless integration.
  3. Includes a Windows integrated package. You can launch the API by double-clicking run-api.bat or start the browser interface by double-clicking run-webui.bat.

The api.py in this project omits emotion recognition processing and only supports the recognition of Chinese, Japanese, Korean, and English speech.

Deploying the SenseVoice Project

  1. Deploy from the official source code, which works on Windows, Linux, and macOS. Refer to the SenseVoice project page for the detailed tutorial: https://github.com/FunAudioLLM/SenseVoice. After deployment, download the api.py file from this project and overwrite the api.py included in the official package (this is required for use with the video translation software; otherwise timestamped subtitles will not be generated).

  2. Deploy using the Windows integrated package, which only supports Windows 10/11. Download the compressed package from the Releases page: https://github.com/jianchang512/sense-api/releases. After extracting it, double-click run-api.bat to start the API, or double-click run-webui.bat to open the web interface.

Using the API

The default API address is http://127.0.0.1:5000/asr.

You can change it by editing the following lines in api.py:

HOST='127.0.0.1'
PORT=5000
  1. If you deployed using the official source code, remember to overwrite the api.py file and then execute python api.py.
  2. If you are using the Windows integrated package, simply double-click run-api.bat.
  3. Wait for http://127.0.0.1:5000 to appear in the terminal, indicating successful startup. You can now use the API.

Note: On first use, the model is downloaded from ModelScope over the network, which may take a long time.
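
If you call the API from a script, it can be convenient to wait until the server is actually listening before sending audio, especially on the first run while the model downloads. Below is a minimal sketch that polls the port; the wait_for_api helper and its timeout values are just illustrative, and HOST/PORT are assumed to be the defaults from api.py shown above.

python
import socket
import time

HOST = "127.0.0.1"   # must match HOST in api.py
PORT = 5000          # must match PORT in api.py

def wait_for_api(host=HOST, port=PORT, timeout=600, interval=2):
    """Poll the port until it accepts a TCP connection or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # A successful connection means the API server is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False

if wait_for_api():
    print("API is up; you can start sending requests to /asr")
else:
    print("API did not start within the timeout")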

Using it in Video Translation and Dubbing Tools

Enter the API address in the Menu -> Speech Recognition Settings -> SenseVoice Speech Recognition window.

Calling the API in Source Code

  • API address: http://127.0.0.1:5000/asr, assuming the default host and port.
  • Calling method: POST
  • Request parameters:
    • lang: String type, can be one of zh, ja, ko, or en.
    • file: Audio binary data in WAV format to be recognized.
  • Response:
    • Successful recognition returns: {"code": 0, "msg": "ok", "data": "complete SRT-format subtitle string"}
    • Recognition failure returns: {"code": 1, "msg": "reason for the error"}
    • Other internal errors return: {"detail": "error information"}

Example: Recognize the 10s.wav audio file, where the spoken language is Chinese.

python
import requests

res = requests.post(
    "http://127.0.0.1:5000/asr",
    files={"file": open("c:/users/c1/videos/10s.wav", "rb")},
    data={"lang": "zh"},
    timeout=7200,
)
print(res.json())
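
On success, the data field already contains the complete SRT text, so it can be written directly to a .srt file. The sketch below extends the example above with basic handling of the documented response shapes; the output filename and the sample timestamps in the comment are illustrative.

python
import requests

res = requests.post(
    "http://127.0.0.1:5000/asr",
    files={"file": open("c:/users/c1/videos/10s.wav", "rb")},
    data={"lang": "zh"},
    timeout=7200,
)
result = res.json()

if result.get("code") == 0:
    # result["data"] is a full SRT string, i.e. numbered blocks such as:
    # 1
    # 00:00:00,000 --> 00:00:03,500
    # recognized text ...
    with open("10s.srt", "w", encoding="utf-8") as f:
        f.write(result["data"])
    print("Subtitles saved to 10s.srt")
elif result.get("code") == 1:
    print("Recognition failed:", result.get("msg"))
else:
    # Other internal errors return {"detail": "..."}
    print("Server error:", result.get("detail", result))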

Using the Web UI in a Browser

  1. If you deployed from the official source code, execute python webui.py. When the terminal displays http://127.0.0.1:7860, open that address in your browser.
  2. If you are using the Windows integrated package, double-click run-webui.bat. The browser will open automatically after successful startup.