Video translation software works in four steps: it recognizes text from the speech in the video, translates that text into the target language, dubs the translated text, and finally embeds the dubbing and text into the video.
Speech recognition comes first, so its accuracy directly affects the subsequent translation and dubbing.
Faster Local Mode
Recommended. This mode uses a model converted from OpenAI's open-source Whisper model and, as the name suggests, recognizes speech faster without reducing accuracy.
After selecting faster mode, you can select the model to use on the right. The built-in default is the tiny model, which is the smallest and the least accurate.
tiny--base--small--medium--large
From tiny to large, the models grow in size and recognition accuracy increases accordingly.
For Chinese videos, at least the medium model is recommended. Models can be downloaded from https://github.com/jianchang512/stt/releases/0.0
Models with the .en suffix and models whose names start with distil can only be used for English videos.
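The English-only rule above can be expressed as a small check. This is a minimal sketch, not the program's own code: the function names and the example model names (such as distil-large-v2) are illustrative assumptions.

```python
def is_english_only(model_name: str) -> bool:
    """Return True for models that should only be used on English videos.

    Follows the rule stated above: names ending in ".en" or starting
    with "distil" are English-only.
    """
    name = model_name.lower()
    return name.endswith(".en") or name.startswith("distil")


def usable_models(models: list[str], source_language: str) -> list[str]:
    """Filter a list of model names down to those usable for a language.

    `source_language` is a language code such as "en" or "zh"
    (an assumed convention for this sketch).
    """
    if source_language == "en":
        return list(models)
    return [m for m in models if not is_english_only(m)]


if __name__ == "__main__":
    candidates = ["tiny", "tiny.en", "medium", "distil-large-v2"]
    print(usable_models(candidates, "zh"))
```

For a Chinese video, only the multilingual models remain in the list, and of those the medium model or larger is the recommended choice.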
To the right of the model selector there is also a drop-down box that shows Overall Recognition and offers Equal Division as the alternative. Select Overall Recognition unless you have special needs. If you need the audio divided into equal-length parts, for example so that each subtitle is 10 seconds long, select Equal Division and set the segment duration in seconds in the menu - Tools/Advanced Settings - Advanced Settings - VAD parameters section.
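To illustrate what equal division means, the sketch below splits a total duration into fixed-length segments. It only demonstrates the idea; the function name and the handling of the shorter final segment are assumptions, not the program's actual implementation.

```python
import math


def equal_segments(total_seconds: float, segment_seconds: float = 10.0) -> list[tuple[float, float]]:
    """Split [0, total_seconds] into consecutive (start, end) segments.

    Every segment is `segment_seconds` long except possibly the last,
    which is shortened to end at `total_seconds`.
    """
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    count = math.ceil(total_seconds / segment_seconds)
    return [
        (i * segment_seconds, min((i + 1) * segment_seconds, total_seconds))
        for i in range(count)
    ]


if __name__ == "__main__":
    for start, end in equal_segments(25.0, 10.0):
        print(f"{start:.1f}s - {end:.1f}s")
```

With a 25-second clip and a 10-second segment duration, this yields two full segments and one 5-second remainder.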
To speed up the task on Windows or Linux, if you have an NVIDIA graphics card you can install and configure the CUDA and cuDNN environments and then enable CUDA acceleration, which significantly improves execution speed.
View CUDA and cuDNN installation tutorial
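For readers scripting recognition themselves, the device choice described above might look like the following. This is a sketch under stated assumptions: it assumes the third-party faster-whisper package is installed, and the choose_device helper and the float16/int8 compute types are illustrative choices rather than the program's own settings.

```python
import sys


def choose_device(cuda_enabled: bool, platform_name: str = sys.platform) -> tuple[str, str]:
    """Return (device, compute_type) for loading a model.

    CUDA is only considered on Windows ("win32") and Linux ("linux"),
    matching the platforms named above; everything else uses the CPU.
    """
    if cuda_enabled and platform_name in ("win32", "linux"):
        return "cuda", "float16"
    return "cpu", "int8"


def load_model(model_name: str = "medium", cuda_enabled: bool = False):
    """Load a model with faster-whisper (assumed to be installed)."""
    # Imported here so the helper above stays usable without the package.
    from faster_whisper import WhisperModel

    device, compute_type = choose_device(cuda_enabled)
    return WhisperModel(model_name, device=device, compute_type=compute_type)
```

Loading with device "cuda" fails if the CUDA and cuDNN environments are not installed correctly, so it is worth testing on a short clip first.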
Auto-Detect Language
In v2.59 and later, the original language drop-down box includes an "Auto-Detect" option. If you don't know the spoken language, or it is not among the 24 supported languages, select "Auto-Detect" and the program will try to recognize the spoken language automatically.
That said, avoid this option where possible, especially when there is no clear speech within the first 30 seconds of the video: automatic detection uses only the first 30 seconds of audio to determine the language for the entire video. Also note that languages with similar pronunciation but different writing systems cannot be reliably distinguished and may be identified as either one; for example, a Chinese video may be identified as Simplified or Traditional Chinese at random.
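Because detection relies on the first 30 seconds, a quick pre-check of that window can indicate whether Auto-Detect is likely to be reliable. The sketch below is a rough illustration only: it assumes audio is available as a list of float samples in the range -1.0 to 1.0, and the amplitude threshold is a crude stand-in for real speech detection.

```python
def detection_window(samples: list[float], sample_rate: int = 16_000, window_seconds: int = 30) -> list[float]:
    """Return the samples covering the first `window_seconds` of audio."""
    return samples[: sample_rate * window_seconds]


def is_mostly_silent(window: list[float], amplitude_threshold: float = 0.01, min_active_ratio: float = 0.1) -> bool:
    """Return True if too few samples exceed the amplitude threshold.

    The threshold values are hypothetical defaults for this sketch;
    loud music or noise would still count as "active".
    """
    if not window:
        return True
    active = sum(1 for s in window if abs(s) >= amplitude_threshold)
    return active / len(window) < min_active_ratio


if __name__ == "__main__":
    # Toy signal at 10 samples per second: 30 s of silence, then 10 s of sound.
    samples = [0.0] * 300 + [0.5] * 100
    window = detection_window(samples, sample_rate=10)
    if is_mostly_silent(window):
        print("First 30 seconds are quiet; select the language manually.")
```

If the window is mostly silent, choosing the original language manually is the safer option.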