
Video translation software works in four steps: it recognizes text from the speech in the video, translates that text into the target language, dubs the translated text, and finally embeds the dubbed audio and the subtitles into the video.

Speech recognition comes first, so its accuracy directly affects the translation and dubbing that follow.

Faster Local Mode

Recommended. This mode uses models converted from OpenAI's open-source Whisper models and, as the name suggests, recognizes speech faster without reducing accuracy.


After selecting faster mode, you can choose the model to use from the selector on the right. The built-in default is the tiny model, which is the smallest and the least accurate.


The models are, in order of increasing size, tiny, base, small, medium, and large. The larger the model, the higher the recognition accuracy.

For Chinese videos, select at least the medium model. Models can be downloaded from https://github.com/jianchang512/stt/releases/0.0

Models with the .en suffix and models starting with distil can only be used for English videos.
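The model rules above can be sketched as a small check. This is only an illustration: `is_english_only` and `check_model` are hypothetical helpers, not part of the software, and the language codes `"en"` and `"zh"` are assumed for the example.

```python
from typing import List

# Model sizes in the order given above, smallest to largest.
MODEL_SIZES = ["tiny", "base", "small", "medium", "large"]


def is_english_only(model_name: str) -> bool:
    """Models ending in .en or starting with distil are for English videos only."""
    return model_name.endswith(".en") or model_name.startswith("distil")


def check_model(model_name: str, language: str) -> List[str]:
    """Return a list of warnings for a model/language pair (hypothetical helper)."""
    warnings = []
    if is_english_only(model_name) and language != "en":
        warnings.append(f"{model_name} can only be used for English videos")
    # Find the base size inside names such as "medium.en" or "distil-large-v2".
    size = next((s for s in MODEL_SIZES if s in model_name), None)
    if language == "zh" and size is not None:
        if MODEL_SIZES.index(size) < MODEL_SIZES.index("medium"):
            warnings.append("Chinese videos: at least the medium model is recommended")
    return warnings
```

For example, `check_model("tiny", "zh")` returns one warning about model size, while `check_model("medium", "zh")` returns an empty list.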

To the right of the model selector there is a drop-down box labeled Overall Recognition, which also offers Equal Division. Select Overall Recognition unless you have special needs. If you need the audio divided into equal-length parts, for example so that each subtitle is 10 seconds long, select Equal Division and set the segment duration in seconds under Tools/Advanced Settings - Advanced Settings - VAD parameters.
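To make equal division concrete, here is a minimal sketch of how an audio track could be cut into fixed-length segments. The function name `equal_division` and its parameters are assumptions for illustration; the software's actual implementation may differ.

```python
import math
from typing import List, Tuple


def equal_division(total_seconds: float, segment_seconds: float) -> List[Tuple[float, float]]:
    """Split [0, total_seconds] into consecutive (start, end) segments of equal length.

    The last segment is shorter when the total is not an exact multiple.
    """
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    count = math.ceil(total_seconds / segment_seconds)
    segments = []
    for i in range(count):
        start = float(i * segment_seconds)
        end = float(min((i + 1) * segment_seconds, total_seconds))
        segments.append((start, end))
    return segments
```

With a 25-second track and a 10-second segment duration, this yields the segments 0 to 10, 10 to 20, and 20 to 25 seconds.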

To speed up processing on Windows or Linux with an NVIDIA graphics card, install and configure CUDA and cuDNN, then enable CUDA acceleration. This significantly improves execution speed.
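As a rough sketch of the condition above, the following helper decides whether CUDA acceleration is worth trying. `choose_device` is a hypothetical function, and looking for the `nvidia-smi` tool is only a heuristic for an NVIDIA driver being present; it does not confirm that CUDA and cuDNN are installed correctly.

```python
import platform
import shutil
from typing import Optional


def choose_device(system: Optional[str] = None, has_nvidia: Optional[bool] = None) -> str:
    """Return "cuda" on Windows/Linux with an NVIDIA card, otherwise "cpu"."""
    if system is None:
        system = platform.system()  # "Windows", "Linux", "Darwin", ...
    if has_nvidia is None:
        # Heuristic only: nvidia-smi ships with the NVIDIA driver.
        has_nvidia = shutil.which("nvidia-smi") is not None
    if system in ("Windows", "Linux") and has_nvidia:
        return "cuda"
    return "cpu"
```

Calling `choose_device()` with no arguments inspects the current machine; passing explicit values, such as `choose_device("Linux", True)`, makes the rule easy to check.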


View CUDA and cuDNN installation tutorial

Auto-Detect Language

Since v2.59, the original-language drop-down box includes an "Auto-Detect" option. If you don't know the spoken language, or it is not among the 24 supported languages, select "Auto-Detect" and the program will try to identify it automatically.

Avoid this option when you can, especially when there is no clear speech in the first 30 seconds of the video: auto-detection uses only the first 30 seconds of audio to determine the language of the entire video. Also note that languages which sound alike but are written differently cannot be reliably told apart and may be identified as either one; for example, a Chinese video may be identified as Simplified or Traditional Chinese at random.
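The 30-second window can be illustrated with a short sketch. It assumes the mode is backed by the open-source faster-whisper package, which this page does not state explicitly; `detection_window` and `detect_language` are hypothetical helper names.

```python
DETECTION_WINDOW_SECONDS = 30  # only this much leading audio informs detection


def detection_window(samples, sample_rate):
    """Return the leading samples that language detection would look at."""
    return samples[: DETECTION_WINDOW_SECONDS * sample_rate]


def detect_language(audio_path, model_name="tiny"):
    """Detect the spoken language of a file (assumes faster-whisper is installed)."""
    # Imported lazily so the helper above works without the package.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_name)
    # language=None asks the model to detect the language itself.
    _segments, info = model.transcribe(audio_path, language=None)
    return info.language, info.language_probability
```

If the first 30 seconds contain only music or silence, the detected language and its probability say little about the rest of the video, which is why setting the language explicitly is safer.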