Speech recognition is the process of converting the human speech in audio or video into text. It is the first step in video translation, and its quality largely determines the quality of the subsequent dubbing and subtitles. Currently, the software mainly supports two local offline recognition modes: faster-whisper local and openai-whisper local.
The two are very similar: faster-whisper is essentially an optimized reimplementation of openai-whisper. Recognition accuracy is basically the same, but faster-whisper is faster. However, faster-whisper requires more environment configuration when CUDA acceleration is used.
faster-whisper Local Recognition Mode
This is the software's default and recommended mode: it is faster and more efficient.
In this mode, the available model sizes, ordered from small to large, are tiny -> base -> small -> medium -> large-v1 -> large-v2 -> large-v3.
From front to back, model size grows from roughly 60 MB to 2.7 GB, and the required RAM, VRAM, and CPU/GPU load grow with it. If less than 10 GB of VRAM is available, large-v3 is not recommended, as it may crash or freeze.
From tiny to large-v3, recognition accuracy increases along with size and resource consumption. tiny/base/small are small models with fast recognition and low resource usage, but very low accuracy. medium is a mid-sized model; to recognize videos with Chinese speech, it is recommended to use at least the medium model, otherwise the results are poor.
If the CPU is strong enough and there is plenty of RAM, you can choose the large-v1/v2 models even without CUDA acceleration. Accuracy will be much higher than with the small and medium models, though recognition will be slower.
large-v3 consumes a lot of resources and is not recommended unless the computer is powerful enough. It is recommended to use large-v3-turbo instead: the two have the same accuracy, but large-v3-turbo is faster and uses fewer resources.
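The caution about large-v3 can be condensed into a small helper. This is only an illustrative sketch (the function name is hypothetical, not part of the software); the 10 GB threshold and the large-v3-turbo substitution come from the guidance above:

```python
def safe_model_choice(requested: str, vram_gb: float) -> str:
    """Apply the large-v3 caution above to a requested model size."""
    if requested == "large-v3" and vram_gb < 10:
        # With under 10 GB of VRAM, large-v3 may crash or freeze;
        # large-v3-turbo has the same accuracy at a fraction of the cost,
        # so it is the recommended substitute in general.
        return "large-v3-turbo"
    return requested
```

For example, `safe_model_choice("large-v3", 8)` falls back to `"large-v3-turbo"`, while any other request passes through unchanged.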
Models whose names end in .en or start with distil can only be used for videos with English speech. Do not use them for videos in other languages.
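As a concrete illustration, a transcription call in this mode might look like the following sketch. It assumes the faster-whisper package is installed (`pip install faster-whisper`); the helper name, the size list, and the compute-type choices are illustrative assumptions, not the software's actual code:

```python
# Illustrative sketch of faster-whisper local recognition.
MODEL_SIZES = ("tiny", "base", "small", "medium",
               "large-v1", "large-v2", "large-v3", "large-v3-turbo")

def transcribe(audio_path: str, model_size: str = "medium",
               device: str = "cpu") -> list:
    """Return (start, end, text) tuples for each recognized segment."""
    if model_size not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {model_size}")
    from faster_whisper import WhisperModel  # imported lazily
    # int8 keeps memory use low on CPU; float16 is the usual choice on CUDA.
    compute_type = "float16" if device == "cuda" else "int8"
    model = WhisperModel(model_size, device=device, compute_type=compute_type)
    segments, _info = model.transcribe(audio_path)
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

For example, `transcribe("input.wav", "small")` (with a real audio file) yields timestamped segments suitable for subtitle generation; the first call downloads the chosen model.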
openai-whisper Local Recognition Mode
The models in this mode are basically the same as in faster-whisper, also ordered from small to large: tiny -> base -> small -> medium -> large-v1 -> large-v2 -> large-v3. The usage notes are also the same: tiny/base/small are small models, and large-v1/v2/v3 are large models.
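For comparison, the equivalent sketch with the openai-whisper package (`pip install openai-whisper`) is shorter; the function name and placeholder path are illustrative assumptions:

```python
def transcribe_openai(audio_path: str, model_size: str = "medium") -> str:
    """Return the full recognized text for one audio or video file."""
    if not audio_path:
        raise ValueError("audio_path is required")
    import whisper  # imported lazily
    model = whisper.load_model(model_size)  # downloaded on first use
    result = model.transcribe(audio_path)
    # result["segments"] additionally carries per-segment timestamps.
    return result["text"]
```

Called as `transcribe_openai("input.wav", "medium")` with a real file, this returns the recognized text in one string.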
Summary of Selection Methods
- Prefer the faster-whisper local mode. Switch to the openai-whisper local mode only if you want CUDA acceleration but keep hitting environment errors.
- Whichever mode you use, for videos with Chinese speech select at least the medium model (small at the absolute minimum); for videos with English speech, select at least small. If computer resources allow, large-v3-turbo is recommended.
- Models ending in .en or starting with distil can only be used for videos with English speech.
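The selection rules above can be sketched as two small functions. The language codes ("zh", "en") and function names are illustrative assumptions, not part of the software:

```python
def minimum_model(language: str, ample_resources: bool = False) -> str:
    """Smallest model worth using, per the selection guidelines above."""
    if ample_resources:
        return "large-v3-turbo"
    # Chinese speech needs at least medium (small at the very minimum);
    # English speech gets acceptable results from small upward.
    return "medium" if language == "zh" else "small"

def english_only(model_name: str) -> bool:
    """True for models restricted to English-speech videos."""
    return model_name.endswith(".en") or model_name.startswith("distil")
```

For example, `minimum_model("zh")` returns `"medium"`, and `english_only("distil-large-v3")` returns `True`.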