
The core principle of video translation software is: recognize text from the speech in the video, translate that text into the target language, dub the translated text, and finally embed the dubbed audio and subtitles into the video.

As you can see, the first step is to recognize text from the speech in the video, and the accuracy of that recognition directly determines the quality of the subsequent translation and dubbing.
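To make those four stages concrete, here is a minimal Python sketch. Only the recognition stage uses a real library (openai-whisper); translate_text, synthesize_speech, and mux_into_video are hypothetical placeholders standing in for whatever translation service, TTS engine, and ffmpeg muxing step you actually use:

```python
import whisper


def recognize_segments(video_path: str) -> list[dict]:
    """Stage 1: speech-to-text with openai-whisper.

    whisper relies on ffmpeg to read the input, so it can pull the
    audio track straight out of a video file.
    """
    model = whisper.load_model("base")
    result = model.transcribe(video_path)
    # Each segment carries start/end timestamps plus the recognized text.
    return result["segments"]


def translate_text(text: str, target_lang: str) -> str:
    """Stage 2 (placeholder): translate the recognized text."""
    raise NotImplementedError


def synthesize_speech(text: str, out_path: str) -> None:
    """Stage 3 (placeholder): dub the translated text with a TTS engine."""
    raise NotImplementedError


def mux_into_video(video_path: str, dub_path: str, srt_path: str) -> None:
    """Stage 4 (placeholder): embed the dubbed audio and subtitles,
    e.g. via an ffmpeg command."""
    raise NotImplementedError
```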

openai-whisper Local Mode

This mode uses the whisper model officially open-sourced by OpenAI. Compared to the faster-whisper mode, it is slower but equally accurate.


Model selection works the same way: from tiny to large-v3, the models consume progressively more computer resources and deliver progressively higher accuracy.

Note: Although the faster-whisper mode and the openai-whisper mode share most model names, the model files are not interchangeable. Please download models for the openai-whisper mode from https://github.com/jianchang512/stt/releases/0.0
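As a quick sketch of that size/accuracy trade-off, every model is loaded through the same openai-whisper API; only the name changes (the audio file path below is a placeholder):

```python
import whisper

# From smallest/fastest to largest/most accurate:
# tiny, base, small, medium, large-v3
model = whisper.load_model("tiny")        # low resource use, modest accuracy
# model = whisper.load_model("large-v3")  # heavy, but the most accurate

result = model.transcribe("speech.wav")   # placeholder path
print(result["text"])
```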

large-v3-turbo Model

OpenAI recently released large-v3-turbo, a model optimized from large-v3. Its recognition accuracy is similar to large-v3, while its size and resource consumption are greatly reduced, so it can be used as a substitute for large-v3.
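Outside this software, a sufficiently recent openai-whisper release can load the new model by name like any other, so switching from large-v3 is a one-line change (a minimal sketch; the audio path is a placeholder):

```python
import whisper

# Requires an openai-whisper release that knows the "large-v3-turbo"
# name; the checkpoint downloads automatically on first use.
model = whisper.load_model("large-v3-turbo")  # instead of "large-v3"
result = model.transcribe("speech.wav")
print(result["text"])
```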

How to use

  1. Upgrade the software to v2.67

  2. In the drop-down box after speech recognition, select openai-whisper local

  3. In the drop-down box after model, select large-v3-turbo

  4. Download the large-v3-turbo.pt file into the models folder in the software directory (a verification sketch follows this list)
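Once the file is in place, you can sanity-check it outside the GUI: whisper.load_model also accepts a direct path to a checkpoint file, so a sketch like the following (run from the software directory; the relative path is an assumption) confirms the download is usable:

```python
import whisper

# load_model accepts an official model name or a path to a downloaded
# checkpoint; here we point at the file placed in the models folder.
model = whisper.load_model("models/large-v3-turbo.pt")
print("Checkpoint loaded OK; multilingual:", model.is_multilingual)
```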