
Speech recognition, the conversion of the human speech in an audio or video file into text, is the first step in video translation, and its quality determines the quality of the subsequent dubbing and subtitles. The software currently supports two local offline recognition modes: faster-whisper local and openai-whisper local.
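The recognized speech ultimately becomes timed subtitle entries. As a rough illustration only (not the software's actual code), a list of (start, end, text) segments can be written out in SRT subtitle format like this:

```python
# Illustrative only: convert recognized (start, end, text) segments to SRT.
def srt_time(seconds):
    """Format seconds as an SRT timestamp, e.g. 3.5 -> "00:00:03,500"."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Render segments as numbered SRT blocks separated by blank lines."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(blocks)
```

The better the recognition model, the more accurate the text and timestamps that end up in these subtitle blocks.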


The two are very similar: faster-whisper is essentially a reworked, optimized version of openai-whisper. Recognition accuracy is basically the same, but faster-whisper is faster; the trade-off is that it is more demanding about environment configuration when using CUDA acceleration.

faster-whisper Local Recognition Mode

This is the mode the software recommends by default: it is faster and more efficient.

In this mode, the available models range from small to large: tiny -> base -> small -> medium -> large-v1 -> large-v3
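Under the hood, this mode calls the faster-whisper library with whichever model name you select. The sketch below is illustrative, with assumed defaults (model name, device, and compute type are choices you would make, not values fixed by the software):

```python
# A minimal sketch of a faster-whisper transcription call.
def transcribe(audio_path, model_name="medium", use_cuda=False):
    # Imported inside the function so the sketch can be defined even when
    # faster-whisper is not installed (pip install faster-whisper).
    from faster_whisper import WhisperModel

    device = "cuda" if use_cuda else "cpu"
    # float16 is typical on GPU; int8 keeps CPU memory use low.
    compute_type = "float16" if use_cuda else "int8"
    model = WhisperModel(model_name, device=device, compute_type=compute_type)
    # transcribe() returns a lazy generator of segments plus metadata;
    # iterating the generator is what actually runs recognition.
    segments, info = model.transcribe(audio_path)
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

The first call with a given model name downloads that model, which is why the size differences described below matter.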


From first to last, model size grows from about 60 MB to 2.7 GB, and the required RAM, VRAM, and CPU/GPU load grow with it. If you have less than 10 GB of available VRAM, large-v3 is not recommended; it may crash or freeze.

From tiny to large-v3, recognition accuracy rises along with size and resource consumption. tiny/base/small are small models: they recognize quickly and use few resources, but their accuracy is low.

medium is a medium-sized model. For videos with Chinese speech, use at least medium, or the results will be poor.

If your CPU is strong enough and you have enough RAM, you can choose large-v1/v2 even without CUDA acceleration. Accuracy improves greatly over the small and medium models, though recognition is slower.

large-v3 consumes more resources and is not recommended unless your computer is powerful enough. Use large-v3-turbo instead: its accuracy matches large-v3, but it is faster and lighter.

Models whose names end with .en or start with distil work only on English-language audio; do not use them for videos in other languages.

openai-whisper Local Recognition Mode

The models in this mode are basically the same as in faster-whisper, and they also range from small to large: tiny -> base -> small -> medium -> large-v1 -> large-v3. The same precautions apply: tiny/base/small are small models, and large-v1/v2/v3 are large models.
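For comparison with the faster-whisper sketch, the equivalent call through the openai-whisper library looks roughly like this (model name and audio path are illustrative assumptions):

```python
# A minimal sketch of an openai-whisper transcription call.
def transcribe_openai(audio_path, model_name="medium"):
    # Imported lazily; the PyPI package is named openai-whisper,
    # but it is imported as "whisper" (pip install openai-whisper).
    import whisper

    model = whisper.load_model(model_name)  # downloads the model on first use
    result = model.transcribe(audio_path)
    # result["segments"] holds dicts with "start", "end", and "text" keys.
    return [(s["start"], s["end"], s["text"]) for s in result["segments"]]
```

The output shape is the same kind of timed segment list, which is why the two modes are interchangeable from the user's point of view.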

Summary of Selection Methods

  1. Prefer the faster-whisper local mode. Fall back to the openai-whisper local mode only if you want CUDA acceleration but keep hitting environment errors.
  2. In either mode, use at least the medium model for Chinese-language videos and at least the small model for English-language videos. If your computer has the resources, use large-v3-turbo.
  3. Models ending with .en or starting with distil work only on English-language videos.
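The selection rules above can be condensed into a tiny helper. This is purely illustrative; the function name and parameters are assumptions, not part of the software:

```python
# Illustrative helper encoding the model-selection rules above.
def pick_model(language, resources_ample=False):
    """Return a recommended whisper model name.

    language: an ISO-style code such as "zh" or "en".
    resources_ample: True if the machine can comfortably run large models.
    """
    if resources_ample:
        return "large-v3-turbo"  # same accuracy as large-v3, faster, lighter
    if language.startswith("zh"):
        return "medium"          # floor for Chinese-language audio
    return "small"               # floor for English-language audio
```

For example, `pick_model("zh")` returns "medium", while `pick_model("en", resources_ample=True)` returns "large-v3-turbo".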