
There are 14 speech recognition models, divided into 3 categories. All of them recognize human speech in a video and turn it into subtitle text.

To reduce the download size, the software ships with only the smallest model, "tiny". This model has the lowest recognition accuracy. If you need better results, download one of the larger models.

Models that can be used in both openai and faster modes

  • tiny, tiny.en: Smallest model; fastest, uses the fewest resources, and has the lowest accuracy
  • base, base.en: Slightly larger than tiny
  • small, small.en: Slightly larger than base
  • medium, medium.en: Medium-sized model; for Chinese recognition, choose medium or a larger model
  • large-v1, large-v2, large-v3: Largest models, highest accuracy; require 8 GB, 12 GB, or more of available video memory

Models ending with .en can only be used for audio and video in which the spoken language is English

Models only available in faster mode

  • distil-whisper-small.en: Only for English videos
  • distil-whisper-medium.en: Only for English videos
  • distil-whisper-large-v2: Requires 8 GB or more of video memory. It currently works well for English videos but very poorly for other languages

The first type is models with the suffix .en

For example, tiny.en, base.en, and medium.en. As the name suggests, these models are only for videos whose original language is English. If the speech in the video is English, a model with the .en suffix gives better results than the equivalent model without it.

The second type is models without .en

These models, such as tiny and large-v1, can be used for all supported languages.

The third type is models starting with distil

There are currently only three models in this category, and they are intended for videos whose original language is English. This applies even to the one without the .en suffix: results on videos in other languages are very poor.

The advantage of these models is speed. Note that distil models work only in faster mode, not in openai mode.

  • distil-whisper-small.en
  • distil-whisper-medium.en
  • distil-whisper-large-v2
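
The selection rules above can be summarized in a short sketch. The model names come from the lists on this page; the function `usable_models` and its parameters are hypothetical and are not part of the software. It simply encodes the three rules: distil models need faster mode, .en models need English audio, and distil models are not recommended for other languages.

```python
# Model names are copied from the lists on this page.
SHARED_MODELS = [
    "tiny", "tiny.en", "base", "base.en", "small", "small.en",
    "medium", "medium.en", "large-v1", "large-v2", "large-v3",
]
FASTER_ONLY_MODELS = [
    "distil-whisper-small.en",
    "distil-whisper-medium.en",
    "distil-whisper-large-v2",
]


def usable_models(mode: str, english_audio: bool) -> list[str]:
    """Return the models that fit a mode and a spoken language.

    Hypothetical helper for illustration only.
    """
    if mode not in ("openai", "faster"):
        raise ValueError(f"unknown mode: {mode!r}")

    candidates = list(SHARED_MODELS)
    if mode == "faster":
        # distil models are only available in faster mode
        candidates += FASTER_ONLY_MODELS

    if english_audio:
        return candidates

    # For other languages, drop the .en models and the distil models,
    # since this page advises against using them on non-English speech.
    return [
        name for name in candidates
        if not name.endswith(".en") and not name.startswith("distil")
    ]
```

For a Chinese video, the result still contains tiny, base, and small; the advice to pick medium or larger for Chinese is a quality recommendation and is deliberately not enforced here.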

Faster model download

All models are downloaded from this address: https://github.com/jianchang512/stt/releases/tag/0.0

After opening the page, choose the download that matches the mode you want to use. The faster models are recommended because they run faster.

A faster model is downloaded as a compressed package that contains a single folder. Copy that folder into the models folder in the software directory.

For example, after downloading the medium model, open the compressed package and copy the folder you see inside to the models directory.
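
If you prefer to script the copy step, the following is a minimal sketch. It assumes you have already extracted the compressed package with your usual archive tool; the function name `copy_model_folder` and both paths are hypothetical placeholders, not part of the software.

```python
import shutil
from pathlib import Path


def copy_model_folder(extracted_folder: Path, models_dir: Path) -> Path:
    """Copy an already-extracted model folder into the models directory.

    Hypothetical helper: both paths are supplied by the caller.
    """
    if not extracted_folder.is_dir():
        raise NotADirectoryError(f"not a folder: {extracted_folder}")

    # Create the models directory if it does not exist yet.
    models_dir.mkdir(parents=True, exist_ok=True)

    # Keep the folder name unchanged inside the models directory.
    target = models_dir / extracted_folder.name
    shutil.copytree(extracted_folder, target, dirs_exist_ok=True)
    return target
```

The folder name is kept exactly as it appears in the package, since renaming it might prevent the software from finding the model.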

openai model download

The openai models are downloaded from the same address: https://github.com/jianchang512/stt/releases/tag/0.0

Scroll down the page to find them. Each download is a single file with the .pt suffix; copy it directly into the models directory.
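
To check what ended up in the models directory, a sketch like the one below can help. It relies only on what this page states: faster models are installed as folders and openai models as .pt files. The function `list_installed` is hypothetical, and treating every subfolder as a faster model is an assumption made for illustration.

```python
from pathlib import Path


def list_installed(models_dir: Path) -> dict[str, list[str]]:
    """Group the entries of the models directory by mode.

    Hypothetical helper: every subfolder is assumed to be a faster model,
    and every .pt file is assumed to be an openai model.
    """
    faster: list[str] = []
    openai: list[str] = []
    if models_dir.is_dir():
        for entry in sorted(models_dir.iterdir()):
            if entry.is_dir():
                faster.append(entry.name)
            elif entry.suffix == ".pt":
                # Report the file name without the .pt suffix.
                openai.append(entry.stem)
    return {"faster": faster, "openai": openai}
```

An empty list for the mode you intend to use suggests the download was copied to the wrong place.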