This article introduces a free, web-based, real-time speech recognition tool. It transcribes live microphone input as well as audio and video files, with no usage limits.

https://stt.pyvideotrans.com

Speech recognition technology, also known as speech-to-text (STT) or voice transcription, uses artificial intelligence to convert spoken language from audio or video into text. This technology has wide applications across various fields, such as meeting minutes, voice assistants, and subtitle generation.

Currently, there are two primary methods for speech recognition:

1. Offline Model-Based Speech Recognition:

This method requires deploying a speech recognition model on your local computer. A popular open-source solution is OpenAI Whisper. Once you have downloaded one of its larger models (e.g., large-v2), you can run it offline, with no internet connection and no payment.

However, this approach demands significant computing resources (such as a powerful graphics card). Without them, recognition is very slow, and falling back to a smaller model to compensate reduces accuracy.
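
As a rough illustration of the offline approach, the sketch below uses the open-source `openai-whisper` Python package. It assumes the package is installed (`pip install openai-whisper`) and that ffmpeg is available on the system path. The helper names `format_segments` and `transcribe_file` are made up for this example.

```python
from typing import Dict, List


def format_segments(segments: List[Dict]) -> List[str]:
    """Turn Whisper-style segments into timestamped text lines."""
    return [
        f"[{s['start']:.2f} -> {s['end']:.2f}] {s['text'].strip()}"
        for s in segments
    ]


def transcribe_file(path: str, model_name: str = "large-v2") -> List[str]:
    """Transcribe a local audio/video file fully offline.

    Assumption: the openai-whisper package is installed. The model
    weights are downloaded on first use and cached locally afterwards.
    """
    import whisper  # imported lazily so the helper above works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return format_segments(result["segments"])


if __name__ == "__main__":
    # "meeting.mp4" is a placeholder file name.
    for line in transcribe_file("meeting.mp4"):
        print(line)
```

On a machine without a capable GPU, passing a smaller model name such as `"base"` runs much faster, at the cost of accuracy.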

2. Online API-Based Speech Recognition:

Some companies, such as ByteDance and OpenAI, provide online speech recognition APIs.

Users simply upload audio data to the API to receive transcription results.

This method eliminates the need for local hardware resources, offering high speed and accuracy, but it typically involves a service fee.

Real-time Speech Recognition

The two methods mentioned above primarily target existing audio or video files. But how can we perform real-time transcription of audio streams recorded directly from a microphone? For example, how can one instantly transcribe spoken words during a meeting?

Real-time speech recognition works on the same principle as file transcription but is technically more demanding. It requires:

  • Real-time Data Stream Processing: Continuously receiving audio data from the microphone.
  • Data Segmentation and Recognition: Dividing the continuous audio stream into smaller segments and transcribing them individually.
  • Result Integration and Error Correction: Combining the recognition results from various segments and performing error correction to enhance the final transcription accuracy. This often involves more complex algorithms to handle pauses, overlaps, and other speech characteristics.
  • Minimal Latency: Minimizing the delay from audio input to text output to ensure real-time responsiveness.
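
The segmentation and integration steps above can be sketched in a few lines. This is only an illustration under simple assumptions, not the method any particular tool uses. It cuts the stream into fixed-size segments that overlap slightly, and it joins the segment transcripts by dropping words repeated at the seams. The function names and the overlap strategy are hypothetical. Real systems typically also use voice activity detection to cut at pauses.

```python
from typing import Iterator, List, Sequence


def split_into_segments(
    samples: Sequence[int], segment_len: int, overlap: int
) -> Iterator[Sequence[int]]:
    """Yield fixed-size segments; consecutive segments share `overlap` samples."""
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must be >= 0 and smaller than segment_len")
    step = segment_len - overlap
    for start in range(0, len(samples), step):
        yield samples[start:start + segment_len]
        if start + segment_len >= len(samples):
            break  # the last segment already reached the end of the stream


def merge_transcripts(pieces: List[str]) -> str:
    """Join segment transcripts, removing words duplicated at the seams."""
    merged: List[str] = []
    for piece in pieces:
        words = piece.split()
        max_k = min(len(merged), len(words))
        # Longest run of words that ends `merged` and also starts `words`.
        k = next(
            (n for n in range(max_k, 0, -1) if merged[-n:] == words[:n]), 0
        )
        merged.extend(words[k:])
    return " ".join(merged)


if __name__ == "__main__":
    print(list(split_into_segments(list(range(10)), segment_len=4, overlap=1)))
    print(merge_transcripts(["the quick brown", "brown fox jumps", "jumps over"]))
```

Shorter segments reduce latency but give the recognizer less context, so segment length is a trade-off between responsiveness and accuracy.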

Technical Principles and Usage Guide

The tool provides two features:

  • Real-time Microphone Audio Recognition: Records audio live from your microphone and transcribes it in real time.
  • Audio/Video File Speech Recognition: Transcribes local audio or video files that you select.

Technical Principles:

  1. Lightweight Speech Recognition Model (Vosk): To run within a browser environment, we adopted a compact Vosk speech recognition model. Its accuracy is lower than that of larger models, but it consumes far fewer resources, which keeps the tool running smoothly in the browser.

  2. Local Audio Processing (ffmpeg.wasm): The tool uses ffmpeg.wasm to process audio and video files and extract the speech track directly within the user's browser, so no audio data is uploaded to a server.

  3. Client-Side Model Loading: The speech recognition model is downloaded and runs in the browser's memory. This limits us to smaller, less accurate models, because larger ones could crash the browser. Even when a user's computer is powerful enough, server bandwidth limitations currently prevent us from offering larger models.
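
The tool itself runs in the browser, but the same two-stage pipeline (extract the audio with ffmpeg, then feed it to Vosk) can be sketched on the desktop in Python for illustration. Several things here are assumptions not confirmed by the tool: the ffmpeg arguments, the 16 kHz mono 16-bit format (a common input format for Vosk models), and the helper names. The `vosk` Python package and a downloaded model directory are also assumed.

```python
import json
import wave
from typing import List


def build_ffmpeg_args(
    input_name: str, output_name: str, sample_rate: int = 16000
) -> List[str]:
    """Arguments that drop the video stream and write mono 16-bit PCM WAV."""
    return [
        "-i", input_name,
        "-vn",                    # ignore video
        "-ac", "1",               # mono
        "-ar", str(sample_rate),  # resample
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        "-f", "wav",
        output_name,
    ]


def transcribe_wav(wav_path: str, model_dir: str) -> str:
    """Stream a WAV file through Vosk in small chunks.

    Assumption: `pip install vosk` and a model unpacked at `model_dir`.
    """
    from vosk import KaldiRecognizer, Model  # lazy import

    model = Model(model_dir)
    texts: List[str] = []
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(model, wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns true when an utterance is complete.
            if rec.AcceptWaveform(data):
                texts.append(json.loads(rec.Result())["text"])
        texts.append(json.loads(rec.FinalResult())["text"])
    return " ".join(t for t in texts if t)


if __name__ == "__main__":
    # Pass these to the ffmpeg command line, e.g. subprocess.run(["ffmpeg", *args]).
    print(build_ffmpeg_args("talk.mp4", "talk.wav"))
```

Feeding the recognizer small chunks is also what makes the microphone mode possible: the same loop can consume live audio buffers instead of frames read from a file.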

How to Use

  1. Model Loading: Before use, please load the Chinese or English model as required.
  2. Microphone Recognition: Click the button in the left area to start recording and recognition with your microphone. Recognition results are displayed in the text box in real time.
  3. File Recognition: In the right area, select a local audio or video file. The tool will process it locally using ffmpeg.wasm and perform speech recognition. The results will appear in the text box.
  4. Download Results: You can download the transcribed text as a TXT file.

Important Notes

  1. Mutually Exclusive Features: Real-time microphone recognition and file recognition cannot be used at the same time.
  2. Local Processing: Both the model and audio processing are performed locally in the user's browser.
  3. Language Support: Currently, only Chinese and English speech recognition are supported.
  4. Performance Limitations: Because the tool uses a lightweight model, recognition accuracy is lower than what larger models achieve.

Frequently Asked Questions

  • Q: What if the recognition accuracy is low? A: We use a lightweight model to ensure browser compatibility and running speed. If you require higher accuracy, we recommend downloading pyVideoTrans for local use with the large-v2 model.
  • Q: What languages are supported? A: Currently, only Chinese and English are supported.
  • Q: Why is it slow? A: This could be due to network conditions (the model must be downloaded before use), browser performance, or insufficient computer resources.
  • Q: What is the maximum file size I can upload? A: File size is limited by the browser's available memory and processing capability, since files are processed locally rather than sent to a server.
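
To get a feel for the memory limit, the short calculation below estimates how much memory decoded audio occupies. It assumes the audio is held as uncompressed 16 kHz mono 16-bit PCM. That is an assumption for illustration only, and the tool's actual internal format may differ.

```python
def pcm_bytes(
    duration_s: float,
    sample_rate: int = 16000,
    channels: int = 1,
    bytes_per_sample: int = 2,
) -> int:
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return int(duration_s * sample_rate * channels * bytes_per_sample)


if __name__ == "__main__":
    one_hour = pcm_bytes(3600)
    print(f"{one_hour / 1_000_000:.1f} MB")  # prints "115.2 MB"
```

The source file, the loaded model, and any ffmpeg.wasm working buffers also need memory, so the practical ceiling is lower than the raw figure suggests.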