This article introduces a free, browser-based speech recognition tool. It supports real-time transcription from the microphone as well as transcription of audio and video files, with no usage restrictions.

https://stt.pyvideotrans.com

Speech recognition, also known as speech-to-text, uses artificial intelligence to convert the speech in audio or video into text. It is widely used for meeting minutes, voice assistants, subtitle generation, and similar tasks.

Currently, there are two main methods of speech recognition:

1. Offline Model-Based Speech Recognition:

This method requires deploying a speech recognition model on a local computer. A popular open-source solution is OpenAI Whisper. Once one of its models (such as large-v2) has been downloaded, it runs offline, with no internet connection and at no cost.

However, this method needs substantial computing resources (such as a powerful graphics card). Without them, recognition is slow, and falling back to a smaller model to compensate reduces accuracy.

2. Online API-Based Speech Recognition:

Some companies provide online speech recognition API services, such as ByteDance and OpenAI.

Users only need to upload audio data to the API to get the transcription results.

This method needs no local hardware resources and is fast and accurate, but it is a paid service.
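
For the offline approach described above, a minimal Python sketch might look like the following. It assumes the open-source openai-whisper package and ffmpeg are installed. The function names transcribe_offline and format_segments are illustrative only, and the heavy import is deferred so that the formatting helper can be used on its own.

```python
from typing import Dict, List


def format_segments(segments: List[Dict]) -> str:
    """Render Whisper-style segments as one '[start - end] text' line each."""
    lines = []
    for seg in segments:
        lines.append(
            f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text'].strip()}"
        )
    return "\n".join(lines)


def transcribe_offline(path: str, model_name: str = "large-v2") -> str:
    """Transcribe a local audio or video file fully offline."""
    # Imported lazily: needs `pip install openai-whisper` and ffmpeg on PATH.
    import whisper

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(path)         # dict with "text" and "segments"
    return format_segments(result["segments"])
```

Calling transcribe_offline("meeting.mp4") (a hypothetical file name) would return one timestamped line per recognized segment. On a machine without a capable GPU, passing a smaller model name such as "small" trades accuracy for speed.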

Real-time Speech Recognition

The above two methods are mainly for existing audio or video files. How, then, can the audio stream from a microphone be transcribed while it is being recorded? For example, how can the speech in a meeting be captured and converted into text in real time?

Real-time speech recognition is similar in principle to file transcription, but technically more difficult. It requires:

  • Real-time Data Stream Processing: Continuously receiving audio data from the microphone.
  • Data Segmentation and Recognition: Dividing the continuous audio stream into smaller segments and recognizing them one by one.
  • Result Integration and Error Correction: Integrating the recognition results of each segment and correcting errors to improve the accuracy of the final transcription. This usually requires more complex algorithms to handle pauses, overlaps, and other situations in speech.
  • Minimal Latency: The delay from audio input to text output must be kept as small as possible so that the transcription feels real-time.
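
The segmentation and integration steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the stream is cut into fixed-size segments (real systems usually cut on pauses in speech), and recognize stands for any function that turns a segment of audio bytes into text. Both function names are hypothetical.

```python
from typing import Callable, Iterable, Iterator


def segment_stream(frames: Iterable[bytes], segment_bytes: int) -> Iterator[bytes]:
    """Regroup arbitrarily sized audio frames into fixed-size segments."""
    buffer = bytearray()
    for frame in frames:
        buffer.extend(frame)
        # Emit every complete segment currently held in the buffer.
        while len(buffer) >= segment_bytes:
            yield bytes(buffer[:segment_bytes])
            del buffer[:segment_bytes]
    if buffer:
        # Flush the final, shorter segment when the stream ends.
        yield bytes(buffer)


def transcribe_stream(
    frames: Iterable[bytes],
    recognize: Callable[[bytes], str],
    segment_bytes: int,
) -> str:
    """Recognize each segment in turn and join the non-empty results."""
    pieces = []
    for segment in segment_stream(frames, segment_bytes):
        text = recognize(segment).strip()
        if text:
            pieces.append(text)
    return " ".join(pieces)
```

The segment size controls the trade-off named in the list: smaller segments reduce latency but give the recognizer less context, which tends to hurt accuracy.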

Technical Principles and Usage Instructions

The tool provides two functions:

  • Real-time Microphone Recognition: Records audio from the microphone and transcribes it in real time.
  • Audio and Video File Recognition: Transcribes local audio or video files selected in the browser.

Technical Principles:

  1. Lightweight Speech Recognition Model (Vosk): To run in a browser environment, we use the compact Vosk speech recognition model. Its accuracy is lower than that of large models, but it keeps resource consumption low and runs smoothly in the browser.

  2. Local Audio Processing (ffmpeg.wasm): ffmpeg.wasm is used to process audio and video files and extract speech within the user's browser, without uploading audio data to the server.

  3. Client-Side Model Loading: The speech recognition model is downloaded and runs in the browser's memory. This prevents us from using larger, more accurate models; only smaller models can be used without risking browser crashes. Even if the user's computer is powerful, large models are currently not supported because of server bandwidth limitations.
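
The first two principles can also be illustrated outside the browser. The sketch below is not the tool's actual code. It uses Python in place of the browser's JavaScript, assumes a 16 kHz mono input (a common choice for small Vosk models, but an assumption here), and relies on the method names of the Vosk Python bindings (AcceptWaveform, Result, FinalResult). The helper names ffmpeg_extract_args and recognize_chunks are hypothetical.

```python
import json
from typing import Iterable, List


def ffmpeg_extract_args(src: str, dst: str, sample_rate: int = 16000) -> List[str]:
    """ffmpeg arguments that drop the video and emit mono 16-bit PCM WAV."""
    return [
        "-i", src,
        "-vn",                    # ignore any video stream
        "-ac", "1",               # downmix to one channel
        "-ar", str(sample_rate),  # resample (16 kHz is assumed here)
        "-c:a", "pcm_s16le",      # uncompressed 16-bit PCM
        dst,
    ]


def recognize_chunks(recognizer, chunks: Iterable[bytes]) -> str:
    """Feed PCM chunks to a Vosk-style recognizer and join the final texts."""
    texts = []
    for chunk in chunks:
        # AcceptWaveform returns a truthy value when an utterance has ended.
        if recognizer.AcceptWaveform(chunk):
            texts.append(json.loads(recognizer.Result()).get("text", ""))
    # Collect whatever is still buffered once the audio is exhausted.
    texts.append(json.loads(recognizer.FinalResult()).get("text", ""))
    return " ".join(t for t in texts if t)


# With the real package (pip install vosk), a recognizer could be created as:
#   from vosk import Model, KaldiRecognizer
#   recognizer = KaldiRecognizer(Model("path/to/model"), 16000)
```

ffmpeg.wasm accepts the same command-line style arguments as the native ffmpeg binary, so a similar argument list applies in the browser. Because recognize_chunks only depends on three method names, it can be exercised with a stand-in object before a real model is downloaded.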

How to Use

  1. Model Loading: Before use, please load the Chinese or English model as needed.
  2. Microphone Recognition: Click the button in the left area to start recording and recognizing speech from the microphone. The recognition results appear in the text box in real time.
  3. File Recognition: Select a local audio or video file in the right area, and the tool will use ffmpeg.wasm for local processing and speech recognition. The result is displayed in the text box.
  4. Result Download: The transcribed text can be downloaded as a TXT file.

Precautions

  1. Mutually Exclusive Functions: Microphone real-time recognition and file recognition functions cannot be used at the same time.
  2. Local Processing: The model and audio processing are all performed locally in the user's browser.
  3. Language Support: Only Chinese and English are currently supported.
  4. Performance Limitations: Due to the use of a lightweight model, the recognition accuracy may not be as good as a large model.

Common Problems

  • Q: What if the recognition accuracy is low? A: We use a lightweight model to ensure browser compatibility and running speed. If you need higher accuracy, it is recommended to download pyVideoTrans and use the large-v2 model locally.
  • Q: Which languages are supported? A: Only Chinese and English are currently supported.
  • Q: Why is it slow? A: This may be due to poor network conditions, browser performance, or insufficient computer resources.
  • Q: How large a file can I upload? A: Files are processed in the browser rather than uploaded to a server, so the size is limited by browser memory and processing power.
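
As a rough illustration of the memory limit mentioned in the last answer, the size of decoded audio can be estimated from its duration. The figures below assume 16 kHz, mono, 16-bit PCM, which is an assumption for this example and not a documented setting of the tool. The original file and the model itself also occupy browser memory.

```python
def pcm_bytes(duration_seconds: float, sample_rate: int = 16000,
              channels: int = 1, bytes_per_sample: int = 2) -> int:
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return int(duration_seconds * sample_rate * channels * bytes_per_sample)


one_hour = pcm_bytes(3600)
print(f"{one_hour / 1024 ** 2:.1f} MiB")  # prints "109.9 MiB"
```

Under these assumptions, one hour of speech decodes to roughly 110 MiB, so multi-hour recordings can put noticeable pressure on a browser tab.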