There are 14 speech recognition models, grouped into 3 categories. All of them convert the speech in a video into subtitle text.

To keep the download small, the software ships with only the smallest model, 'tiny'. This model has the lowest recognition accuracy, so for better results please download one of the larger models.

Models Usable with Both OpenAI and Faster Modes

  • tiny, tiny.en: The smallest model, offering the fastest speed, lowest resource consumption, and also the lowest accuracy.
  • base, base.en: Slightly larger than 'tiny', trading some speed for better accuracy.
  • small, small.en: Slightly larger than 'base', again slower but more accurate.
  • medium, medium.en: A medium-sized model. For Chinese speech recognition, you should choose at least 'medium' or a larger model.
  • large-v1, large-v2, large-v3: The largest models, offering the highest accuracy. These require 8GB or more (preferably 12GB+) of available VRAM.

Models ending with .en are exclusively for audio/video with English speech.

Models Exclusively for Faster Mode

  • distil-whisper-small.en: Exclusively for English videos.
  • distil-whisper-medium.en: Exclusively for English videos.
  • distil-whisper-large-v2: Requires 8GB+ VRAM. Currently performs well for English videos, but poorly for other languages.
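
The two lists above amount to a lookup from model name to the modes it can run in. A minimal sketch of that lookup follows; the names MODEL_MODES, supported_modes, and check_combination are hypothetical and are not part of the software, and the mode strings 'openai' and 'faster' are simply the labels used in this document.

```python
# Modes each model can run in, transcribed from the two lists above.
BOTH_MODES = frozenset({"openai", "faster"})
FASTER_ONLY = frozenset({"faster"})

MODEL_MODES = {
    **{name: BOTH_MODES for name in (
        "tiny", "tiny.en", "base", "base.en", "small", "small.en",
        "medium", "medium.en", "large-v1", "large-v2", "large-v3",
    )},
    **{name: FASTER_ONLY for name in (
        "distil-whisper-small.en",
        "distil-whisper-medium.en",
        "distil-whisper-large-v2",
    )},
}


def supported_modes(model: str) -> frozenset:
    """Return the modes a model can run in; raise ValueError if unknown."""
    try:
        return MODEL_MODES[model]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}") from None


def check_combination(model: str, mode: str) -> bool:
    """True if `model` is usable in `mode` according to the lists above."""
    return mode in supported_modes(model)
```

For instance, check_combination("distil-whisper-small.en", "openai") is False, because the 'distil' models are listed for 'faster' mode only.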

Category 1: Models with the .en Suffix

Examples are tiny.en, base.en, and medium.en. As the suffix suggests, these models are designed specifically for videos in which the spoken language is English. If your video contains English speech, a model with the .en suffix will give better results than its non-.en equivalent.

Category 2: Models Without the .en Suffix

These models, such as tiny and large-v1, can be used for any supported language.

Category 3: Models Starting with 'distil'

Currently, there are only three models in this category. All of them are designed to process videos where the original language spoken is English. Even if they don't have the .en suffix, it is strongly recommended to use them only for English speech videos, as their performance for other languages will be very poor.

A key characteristic of these models is their faster processing speed. Please note that 'distil' models can only be used in 'faster' mode and are not compatible with 'openai' mode.

  • distil-whisper-small.en
  • distil-whisper-medium.en
  • distil-whisper-large-v2
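
The three categories can be told apart from the model name alone. The sketch below is one way to express that rule; the function names and the short language codes ("en", "zh") are assumptions made for illustration, not identifiers used by the software.

```python
def category(model: str) -> int:
    """Return this document's category number (1, 2, or 3) for a model name."""
    if model.startswith("distil"):
        # Checked first: distil-whisper-small.en also ends in .en,
        # but the document places it in Category 3.
        return 3
    if model.endswith(".en"):
        return 1
    return 2


def suits_language(model: str, language: str) -> bool:
    """True if the model is a sensible choice for speech in `language`.

    `language` is a hypothetical short code such as "en" or "zh".
    """
    if category(model) == 2:
        return True  # models without the .en suffix cover all supported languages
    return language == "en"  # Categories 1 and 3 are meant for English speech
```

Note that this only encodes the English-only restriction; the advice to use at least 'medium' for Chinese is a quality recommendation and is not modeled here.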

Faster Model Download

All models can be downloaded from this address: https://github.com/jianchang512/stt/releases/tag/0.0

On the release page, choose the download that matches the mode you want to use. We recommend the 'faster' models because they process more quickly.

Each 'faster' model is distributed as a compressed archive that contains a folder. Extract the archive and copy that folder into the 'models' directory within your software installation.

For example, the archive for the 'medium' model contains a folder; copy that folder, not the archive itself, into the 'models' directory.
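
The extract-and-copy step can also be scripted. The following is a minimal sketch that assumes the archive is in a format Python's standard library can unpack (zip or tar variants); if the download uses another format, such as 7z, extract it with an external tool instead. The function name install_faster_model and both path arguments are hypothetical.

```python
import shutil
from pathlib import Path


def install_faster_model(archive: Path, models_dir: Path) -> list:
    """Unpack `archive` into `models_dir` and return the folders it added."""
    models_dir.mkdir(parents=True, exist_ok=True)
    before = {p for p in models_dir.iterdir() if p.is_dir()}
    # unpack_archive picks the format from the file extension (.zip, .tar.gz, ...)
    shutil.unpack_archive(str(archive), str(models_dir))
    after = {p for p in models_dir.iterdir() if p.is_dir()}
    # Report only the folders that were not there before unpacking.
    return sorted(after - before)
```

Returning the newly created folders makes it easy to confirm that the model folder, rather than loose files, ended up directly inside 'models'.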

OpenAI Model Download

The download link is the same: https://github.com/jianchang512/stt/releases/tag/0.0

Scroll down the page and download the file with the .pt extension for the model you want. Copy this file directly into the 'models' directory; it does not need to be extracted.
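
As with the 'faster' models, the copy can be scripted. This is a small sketch under the assumption that the .pt file has already been downloaded; install_openai_model and its arguments are hypothetical names chosen for illustration.

```python
import shutil
from pathlib import Path


def install_openai_model(pt_file: Path, models_dir: Path) -> Path:
    """Copy a downloaded .pt model file into `models_dir` and return its new path."""
    if pt_file.suffix != ".pt":
        raise ValueError(f"expected a .pt file, got: {pt_file.name}")
    models_dir.mkdir(parents=True, exist_ok=True)
    # copy2 keeps the file's timestamps; the file name is left unchanged.
    return Path(shutil.copy2(pt_file, models_dir / pt_file.name))
```

The suffix check is only a guard against accidentally copying a 'faster' archive into place as if it were an OpenAI model file.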