Speech Recognition to Text Tool

Open-source repository of the Speech Recognition to Text Tool

This is an offline, locally run speech-to-text tool built on the open-source openai-whisper model. It recognizes human speech in video or audio files and converts it to text, which it can output as JSON, as SRT subtitles with timestamps, or as plain text. You can deploy it yourself as a replacement for OpenAI's speech recognition API or Baidu Speech Recognition; its accuracy is comparable to that of the official OpenAI API.
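
To illustrate the SRT output format mentioned above, here is a minimal sketch that turns timed segments into SRT text. It assumes each segment is a dict with start, end (in seconds), and text keys, which is the shape openai-whisper's transcribe result uses for its segments; the helper names format_timestamp and segments_to_srt are hypothetical and are not taken from this tool's source.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from segments (assumed keys: start, end, text)."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        # Each SRT cue: sequence number, time range, then the subtitle text.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 1.5, "text": " Hello there."},
        {"start": 1.5, "end": 3.0, "text": " This is a test."},
    ]
    print(segments_to_srt(demo))
```

The same segment list could be serialized with json.dumps for the JSON format, or reduced to the joined text fields for plain text.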

After deployment or download, double-click start.exe to automatically open the local web page in your browser.
Drag and drop or click to select the audio or video file to recognize, then choose the spoken language, output text format, and model to use (base model is included by default). Click "Start Recognition," and after completion, the result will be displayed on the current web page in the selected format.
The entire process runs offline and locally, and can be deployed on an intranet.
The open-source openai-whisper models include base, small, medium, large, and large-v3; only the base model is bundled. Recognition accuracy improves from base to large-v3, but so do the computing resources required. Download additional models as needed and place them in the models directory.
Download address for all models
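
As a rough sketch of how the model list above relates to the models directory, the following lists which of the named models are present and picks the most accurate one. It assumes each model is stored as a file named after the model with a .pt extension (for example base.pt); the function names are hypothetical, and the tool's own lookup logic may differ.

```python
from pathlib import Path

# Model names from the text above, ordered from least to most accurate
# (and from least to most resource-hungry).
MODEL_ORDER = ["base", "small", "medium", "large", "large-v3"]


def installed_models(models_dir):
    """Return the known model names that have a matching .pt file in models_dir."""
    directory = Path(models_dir)
    return [name for name in MODEL_ORDER if (directory / f"{name}.pt").is_file()]


def best_installed_model(models_dir):
    """Return the most accurate installed model name, or None if none is present."""
    found = installed_models(models_dir)
    return found[-1] if found else None


if __name__ == "__main__":
    print("Installed:", installed_models("models"))
    print("Most accurate:", best_installed_model("models"))
```

Choosing the largest installed model is only one possible policy; on a machine with limited memory, a smaller model may be the better choice.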

Pre-compiled Windows Version Usage / Linux and Mac Source Code Deployment

Open the Releases page and download the pre-compiled files.
After downloading, extract the files to a location, e.g., E:/stt.
Double-click start.exe and wait for the browser window to open automatically.
Click the upload area on the page, select the audio or video file to recognize in the pop-up window, or directly drag and drop the file into the upload area. Then choose the spoken language, text output format, and model to use. Click "Start Recognition Immediately," and after a short wait, the recognition result will appear in the bottom text box in the selected format.
If the machine has an NVIDIA GPU and the CUDA environment is correctly configured, it will automatically use CUDA acceleration.
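
The automatic CUDA selection described above can be approximated as follows. This is a sketch under the assumption that detection goes through PyTorch (which openai-whisper depends on); the function name pick_device is hypothetical, and the tool's actual detection code may differ.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a usable GPU, otherwise "cpu"."""
    try:
        import torch  # installed as a dependency of openai-whisper
    except ImportError:
        # Without PyTorch there is nothing to accelerate; fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Recognition would run on:", pick_device())
```

If this prints cpu on a machine with an NVIDIA GPU, the usual causes are a CPU-only PyTorch build or a driver and CUDA version mismatch.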

Source Code Deployment (Linux/Mac/Windows)

Requires Python 3.9 to 3.11.
Create an empty directory, e.g., E:/stt, and open a terminal in it. On Windows, type cmd in the File Explorer address bar of that directory and press Enter.
Use git to pull the source code into the current directory: git clone git@github.com:jianchang512/stt.git .
Create a virtual environment: python -m venv venv.
Activate the environment: on Windows, use %cd%/venv/scripts/activate; on Linux and Mac, use source ./venv/bin/activate.
Install dependencies: pip install -r requirements.txt. If there are version conflict errors, run pip install -r requirements.txt --no-deps.
On Windows, extract ffmpeg.7z and place ffmpeg.exe and ffprobe.exe in the project root directory. On Linux and Mac, download the matching ffmpeg build from the official ffmpeg website, extract it, and place the ffmpeg and ffprobe binaries in the project root directory.
Download the archive for the model you need, then place the xx.pt file from the archive into the models folder in the project root directory.
Run python start.py and wait for the local browser window to open automatically.
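
Before running start.py, the prerequisites from the steps above can be checked with a small script. This is a hypothetical helper, not part of the project: the names python_version_ok, find_tool, and preflight are assumptions, and the fallback to searching PATH for ffmpeg is also an assumption (the steps above only mention the project root directory).

```python
import shutil
import sys
from pathlib import Path


def python_version_ok(version=sys.version_info) -> bool:
    """True when the interpreter is Python 3.9, 3.10, or 3.11."""
    return (3, 9) <= tuple(version[:2]) <= (3, 11)


def find_tool(name, project_dir):
    """Look for an executable in project_dir first, then on PATH (assumption)."""
    project = Path(project_dir)
    for candidate in (project / name, project / f"{name}.exe"):
        if candidate.is_file():
            return str(candidate)
    return shutil.which(name)


def preflight(project_dir="."):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if not python_version_ok():
        problems.append("Python 3.9 to 3.11 is required")
    for tool in ("ffmpeg", "ffprobe"):
        if find_tool(tool, project_dir) is None:
            problems.append(f"{tool} not found in project directory or on PATH")
    models = Path(project_dir) / "models"
    if not (models.is_dir() and any(models.glob("*.pt"))):
        problems.append("no .pt model file in the models folder")
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print("Problem:", issue)
    if not issues:
        print("All checks passed; run python start.py")
```

Run it from the project root directory with the virtual environment activated, so that the Python version checked is the one that will run the tool.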
