Speech-to-Text Tool

Open-source repository: https://github.com/jianchang512/stt

This is an offline, locally run speech-to-text tool based on the open-source openai-whisper model. It recognizes speech in video or audio files and converts it to text, with output as JSON, as SRT subtitles with timestamps, or as plain text. It can be deployed on its own as a replacement for cloud speech-recognition services such as OpenAI's speech API or Baidu speech recognition, with accuracy roughly on par with the official OpenAI API.
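Recognition itself is driven by the openai-whisper Python package. As a rough illustration of what that involves, here is a minimal sketch using the standard whisper API (not this tool's own code; the file name is only a placeholder):

```python
import whisper

# Load the bundled "base" model and transcribe a file; whisper auto-detects the
# spoken language if none is given. ffmpeg must be available to decode the file.
model = whisper.load_model("base")
result = model.transcribe("example.mp4")

print(result["text"])            # plain-text result
for seg in result["segments"]:   # timestamped segments, the basis for the SRT/JSON output
    print(round(seg["start"], 2), "->", round(seg["end"], 2), seg["text"])
```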

After downloading or deploying, double-click start.exe; it automatically opens the tool's web page in your local browser.

Drag and drop, or click to select, the audio or video file to be recognized, then choose the spoken language, the output format, and the model to use (the base model is built in). Click Start Recognition and the result appears on the page in the selected format.
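To illustrate the SRT option, the sketch below shows how whisper's timestamped segments are typically rendered as SRT subtitles. The tool's own output code may differ; this only shows the SRT structure:

```python
def to_srt_time(seconds: float) -> str:
    # SRT timestamps have the form HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    # segments: list of dicts with "start", "end", "text" (as returned by whisper)
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f'{i}\n{to_srt_time(seg["start"])} --> {to_srt_time(seg["end"])}\n{seg["text"].strip()}\n'
        )
    return "\n".join(blocks)
```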

The entire process requires no internet connection: everything runs locally, and the tool can be deployed on an intranet.

openai-whisper provides base, small, medium, large, and large-v3 models; the base model is built in. Recognition quality improves from base up to large-v3, at the cost of more compute and memory. Download larger models as needed and place them in the models directory.
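A sketch of how a manually downloaded model in the models directory can be loaded with the standard openai-whisper API (generic whisper usage; only the ./models path follows this project's layout):

```python
import whisper

# download_root points whisper at the project's models directory, so a
# large-v3.pt file placed there is used instead of being fetched online.
model = whisper.load_model("large-v3", download_root="./models")

# Alternatively, load_model also accepts a direct path to a checkpoint file:
# model = whisper.load_model("./models/large-v3.pt")
```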

All Model Download Addresses

Pre-compiled Windows Version Usage (Linux and Mac: deploy from source)

  1. Open the Releases page and download the pre-compiled package

  2. After downloading, extract it to a location of your choice, such as E:/stt

  3. Double-click start.exe and wait for the browser window to open automatically

  4. Click the upload area on the page and choose the audio or video file to recognize in the file dialog, or drag and drop the file onto the upload area. Then select the spoken language, the output format, and the model to use, and click "Start Recognition Immediately". After a short wait, the result is displayed in the text box at the bottom in the selected format

  5. If the machine has an Nvidia GPU and has correctly configured the CUDA environment, CUDA acceleration will be used automatically
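For reference, openai-whisper runs on PyTorch, so GPU use comes down to the usual device check (generic PyTorch/whisper usage, not this project's exact code):

```python
import torch
import whisper

# Use the GPU when a CUDA-capable card and working CUDA setup are present, else fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)
print("running on:", device)
```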

Source Code Deployment (Linux/Mac/Windows)

  1. Requires Python 3.9 to 3.11

  2. Create an empty directory, such as E:/stt. Open a cmd window in that directory (on Windows, type cmd in the Explorer address bar and press Enter).

    Use git to pull the source code into the current directory: git clone git@github.com:jianchang512/stt.git .

  3. Create a virtual environment: python -m venv venv

  4. Activate the environment. On Windows the command is %cd%/venv/scripts/activate; on Linux and Mac it is source ./venv/bin/activate

  5. Install dependencies: pip install -r requirements.txt. If a version conflict error is reported, execute pip install -r requirements.txt --no-deps

  6. On Windows, unzip ffmpeg.7z and place ffmpeg.exe and ffprobe.exe in the project root directory. On Linux and Mac, download the matching build from the ffmpeg website, extract the ffmpeg and ffprobe binaries, and place them in the project root directory.

  7. Download the model archives as needed and place the xx.pt file from each archive in the models folder in the project root directory

  8. Execute python start.py and wait for the local browser window to open automatically.
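Before starting, a quick environment check along these lines can save a failed first run. This is a hypothetical helper, not part of the project; it only reflects the layout described in the steps above (ffmpeg/ffprobe in the project root or on PATH, .pt models in ./models):

```python
import shutil
import sys
from pathlib import Path

def check_environment(root: str = ".") -> bool:
    """Verify ffmpeg/ffprobe, model files, and the whisper package are in place."""
    base = Path(root)
    ok = True

    # ffmpeg and ffprobe: either copied into the project root or available on PATH
    for tool in ("ffmpeg", "ffprobe"):
        in_root = (base / tool).exists() or (base / f"{tool}.exe").exists()
        if not in_root and shutil.which(tool) is None:
            print(f"missing: {tool}")
            ok = False

    # at least one whisper model checkpoint in ./models
    if not list((base / "models").glob("*.pt")):
        print("no .pt model files found in ./models")
        ok = False

    try:
        import whisper  # noqa: F401
    except ImportError:
        print("openai-whisper is not installed (pip install -r requirements.txt)")
        ok = False

    return ok

if __name__ == "__main__":
    sys.exit(0 if check_environment() else 1)
```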