Speech Recognition Model Classification and Explanation | pyVideoTrans, Open Source Video Translation Tool (pyvideotrans.com, github.com/jianchang512/pyvideotrans)

There are 14 speech recognition models in total, which fall into 3 categories. All of them transcribe human speech in a video into subtitle text.

To keep the download small, the software ships with only the smallest model, "tiny", which also has the lowest recognition accuracy. For better results, download one of the larger models.

Models Usable in Both OpenAI and Faster Modes

tiny, tiny.en: The smallest models. Fastest and least resource-intensive, but also the least accurate.
base, base.en: Slightly larger than tiny.
small, small.en: Slightly larger than base.
medium, medium.en: Medium-sized models. For Chinese recognition, choose medium or a larger model.
large-v1, large-v2, large-v3: The largest models, with the highest accuracy. They need a lot of available video memory (VRAM): 8GB, 12GB, or more.

Models ending in .en can only be used for audio and video in which the spoken language is English.

Models Only Usable in Faster Mode

distil-whisper-small.en: Only for English videos.
distil-whisper-medium.en: Only for English videos.
distil-whisper-large-v2: Requires 8GB or more of VRAM. Currently gives good results for English videos but poor results for other languages.

The First Category: Models with the .en Suffix

For example, tiny.en, base.en, and medium.en. As the suffix suggests, these models are only for videos whose spoken language is English. For such videos, a model with the .en suffix gives better results than the model of the same size without it.

The Second Category: Models without the .en Suffix

Models such as tiny and large-v1. They can be used for all supported languages.

The Third Category: Models Starting with distil

There are currently only three models in this category, and all of them are intended for videos whose spoken language is English. This includes distil-whisper-large-v2: although its name has no .en suffix, it gives very poor results for other languages, so use it only for English speech.

The advantage of these models is speed. Note that distil models work only in faster mode and cannot be used in openai mode.

distil-whisper-small.en
distil-whisper-medium.en
distil-whisper-large-v2
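
The rules in the three categories above can be summarized as a small check. This is only an illustrative sketch: the function name check_model, its arguments, and the "openai"/"faster" and language-code values are assumptions made for this example and are not pyVideoTrans settings or APIs. Only the model names come from the lists above.

```python
# Model names exactly as listed in this document.
GENERAL_MODELS = {
    "tiny", "tiny.en", "base", "base.en", "small", "small.en",
    "medium", "medium.en", "large-v1", "large-v2", "large-v3",
}
DISTIL_MODELS = {
    "distil-whisper-small.en",
    "distil-whisper-medium.en",
    "distil-whisper-large-v2",
}


def check_model(model: str, mode: str, language: str) -> list:
    """Return a list of problems with a choice; an empty list means it is fine.

    Hypothetical helper: `mode` is "openai" or "faster", and `language`
    is a code such as "en" or "zh". Neither is a real pyVideoTrans option.
    """
    if model not in GENERAL_MODELS | DISTIL_MODELS:
        return [f"unknown model: {model}"]
    if mode not in ("openai", "faster"):
        return [f"unknown mode: {mode}"]

    problems = []
    # Third category: distil models cannot be used in openai mode.
    if model in DISTIL_MODELS and mode != "faster":
        problems.append("distil models can only be used in faster mode")
    # First and third categories: English speech only.
    english_only = model.endswith(".en") or model in DISTIL_MODELS
    if english_only and language != "en":
        problems.append(f"{model} should only be used for English speech")
    return problems


if __name__ == "__main__":
    print(check_model("medium", "openai", "zh"))
    print(check_model("distil-whisper-large-v2", "openai", "zh"))
```

A choice such as medium in openai mode for Chinese passes, while distil-whisper-large-v2 in openai mode for Chinese is flagged twice: once for the mode and once for the language.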

Faster Model Download

All models are downloaded from this address: https://github.com/jianchang512/stt/releases/tag/0.0

Open the page and pick the download that matches the mode you want to use. The faster models are recommended because they run faster.

A downloaded faster model is a compressed package that contains a folder. Extract the package and copy that folder into the models folder in the software directory.

For example, after downloading the medium model, open the compressed package and you will see a single folder inside. Copy this folder to the models directory.
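
The manual steps above could also be scripted. The following is a minimal sketch, assuming the package is in a format that Python's shutil.unpack_archive understands (for example .zip or .tar.gz); the actual format of the release files is not stated here, and a .7z package would need a different tool. The function name install_faster_model and the paths in the usage comment are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def install_faster_model(archive: Path, models_dir: Path) -> Path:
    """Unpack `archive` and copy the single folder inside it into `models_dir`.

    Hypothetical helper; it mirrors the manual steps and checks nothing
    about the model files themselves.
    """
    models_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        # Extract into a scratch directory first so a bad package
        # never leaves partial files in the models folder.
        shutil.unpack_archive(str(archive), tmp)
        folders = [p for p in Path(tmp).iterdir() if p.is_dir()]
        if len(folders) != 1:
            raise ValueError(
                f"expected one folder in {archive.name}, found {len(folders)}"
            )
        target = models_dir / folders[0].name
        shutil.copytree(folders[0], target, dirs_exist_ok=True)
    return target


# Example call (both paths are placeholders):
# install_faster_model(Path("downloaded-model.zip"), Path("models"))
```

Extracting to a temporary directory and copying afterwards keeps the models folder unchanged if the package turns out not to contain exactly one folder.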

OpenAI Model Download

OpenAI models are downloaded from the same address: https://github.com/jianchang512/stt/releases/tag/0.0

Scroll down and download the file with the .pt suffix. Copy the file directly to the models directory.
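
To confirm that the files ended up in the right place, the models directory can be inspected with a short script. This is a sketch under the layout described above: .pt files placed directly in the models directory are treated as OpenAI-mode models, and subfolders are treated as faster-mode models. The function name list_installed is hypothetical, and the script does not verify that a file or folder is a valid model.

```python
from pathlib import Path


def list_installed(models_dir: Path) -> dict:
    """Group the contents of `models_dir` by the mode that would use them.

    Hypothetical helper: it only looks at names and file types.
    """
    entries = sorted(models_dir.iterdir()) if models_dir.is_dir() else []
    return {
        # OpenAI mode: single files with the .pt suffix.
        "openai": [p.name for p in entries if p.is_file() and p.suffix == ".pt"],
        # Faster mode: one folder per model.
        "faster": [p.name for p in entries if p.is_dir()],
    }


if __name__ == "__main__":
    # "models" is a placeholder for the models folder in the software directory.
    print(list_installed(Path("models")))
```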
