Video translation software typically comes with multiple built-in speech recognition channels to transcribe human speech in audio and video into subtitle files. While these channels perform reasonably well for English and Chinese, their performance tends to be less satisfactory for low-resource languages like Japanese, Korean, and Indonesian.
This comes down to training data. Large models developed outside China are trained predominantly on English material, so even their Chinese performance is not ideal; Chinese-developed models, in turn, train mainly on Chinese and English data, with Chinese making up the larger share.
This scarcity of training data for other languages leads to subpar recognition quality. Fortunately, Hugging Face (https://huggingface.co) hosts a vast collection of fine-tuned models, including ones specialized for low-resource languages that deliver impressive results.
This article will demonstrate how to leverage Hugging Face models within video translation software to enhance the recognition of low-resource languages, using Japanese as a case study.
1. Circumventing Network Restrictions
Due to network restrictions, direct access to https://huggingface.co may be unavailable in some regions. You will need to configure your network environment to ensure accessibility.
Once you gain access, you'll be greeted by the Hugging Face website's homepage.
2. Navigating to the Models Directory
Click on the "Automatic Speech Recognition" category in the left-hand navigation bar. This will display all the available speech recognition models on the right.
3. Identifying Models Compatible with faster-whisper
The Hugging Face website currently boasts 20,384 speech recognition models. However, not all of them are suitable for use with video translation software. Different models return data in varying formats, and video translation software primarily supports models of the faster-whisper type.
- Search for "faster-whisper" in the search bar.
The search results will generally consist of models that can be used in the video translation software. (The same search can also be run programmatically, as sketched below.)
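If you prefer querying the Hub from code rather than through the website, the huggingface_hub Python library exposes the same search. A minimal sketch, assuming a recent version of huggingface_hub is installed; the result limit is arbitrary:

```python
# Search the Hugging Face Hub for ASR models matching "faster-whisper",
# most-downloaded first, equivalent to typing it into the search bar.
from huggingface_hub import HfApi

api = HfApi()
models = api.list_models(
    search="faster-whisper",              # same text as the website search bar
    task="automatic-speech-recognition",  # restrict to speech recognition models
    sort="downloads",
    direction=-1,                         # descending: most downloaded first
    limit=20,                             # arbitrary cap for this example
)
for m in models:
    print(m.id)
```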
Of course, some models might be compatible with faster-whisper even if their names don't explicitly include "faster-whisper." How do you find these models?
- Search for the language name, such as "japanese," and then click to enter the model details page. Check if the model description mentions compatibility with faster-whisper.
If the model's name or description doesn't explicitly mention faster-whisper, it is most likely not compatible. Even when terms like "whisper" or "whisper-large" appear, such repositories are usually meant for the openai-whisper mode, which the video translation software does not currently support (whether it will be supported in the future is undecided). The underlying difference is the weight format: faster-whisper loads models converted to the CTranslate2 format, while plain whisper repositories ship the original PyTorch weights. A quick heuristic check based on this is sketched below.
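A faster-whisper-ready repository ships its converted weights in a file named model.bin, so listing the repository files is a reasonable first-pass test. A sketch using huggingface_hub; the two model IDs are examples, and you should still confirm compatibility on the model page:

```python
# Heuristic compatibility check: CTranslate2/faster-whisper repositories
# contain a "model.bin" file holding the converted weights.
from huggingface_hub import list_repo_files

def looks_faster_whisper_compatible(model_id: str) -> bool:
    files = list_repo_files(model_id)
    return "model.bin" in files

# Example model IDs (illustrative):
print(looks_faster_whisper_compatible("Systran/faster-whisper-large-v2"))  # True
print(looks_faster_whisper_compatible("openai/whisper-large-v2"))          # False
```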
4. Copying the Model ID to the Video Translation Software
After locating a suitable model, copy its Model ID and paste it into the video translation software under "Menu" -> "Tools" -> "Advanced Options" -> "faster and openai model list."
- Copy the Model ID.
- Paste it into the video translation software.
- Save the settings.
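A typo in a pasted Model ID is a common cause of download failures later on. If you want to double-check an ID before saving it, the Hub can be queried directly; a small sketch, where the ID shown is an example:

```python
# Verify that a Model ID exists on the Hub before pasting it into the settings.
from huggingface_hub import model_info
from huggingface_hub.utils import RepositoryNotFoundError

def id_exists(model_id: str) -> bool:
    try:
        model_info(model_id)
        return True
    except RepositoryNotFoundError:
        return False

print(id_exists("Systran/faster-whisper-large-v2"))  # True if spelled correctly
```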
5. Selecting the faster-whisper Mode
In the speech recognition channel selection, choose the newly added model. If it doesn't appear, restart the software.
Once you've selected the model and the desired pronunciation language, you can begin the recognition process.
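The video translation software performs the download and transcription for you; purely for illustration, loading such a model with the faster-whisper Python library directly looks roughly like this. The Model ID and audio filename are placeholders, and Japanese is used to match this article's example:

```python
# Download (on first use) and load a CTranslate2 model by its Hub ID,
# then transcribe a Japanese audio file into timestamped segments.
from faster_whisper import WhisperModel

model = WhisperModel(
    "Systran/faster-whisper-large-v2",  # example Model ID copied from the Hub
    device="cpu",                       # or "cuda" with a supported GPU
    compute_type="int8",                # lower memory use on CPU
)
segments, info = model.transcribe("audio.wav", language="ja")
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text}")
```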
Note: In regions where huggingface.co is blocked, a proxy must be configured; otherwise the connection will fail with an error. Try setting a global or system-wide proxy first. If errors persist, enter the proxy IP and port into the "Network Proxy" text box on the main interface.
For an explanation of network proxies, please refer to https://pyvideotrans.com/proxy
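If the software's proxy box alone is not enough, note that the underlying Hugging Face download client (built on the requests library) honors the standard proxy environment variables, so setting them before launching the software is another option. A sketch, where 127.0.0.1:7890 is a placeholder for your own proxy address and port:

```python
# Route Hub traffic through a local proxy by setting the standard
# environment variables before any download starts.
import os

os.environ["HTTP_PROXY"] = "http://127.0.0.1:7890"   # placeholder address/port
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:7890"

# Optional connectivity check: a 200 status means the proxy works.
import requests
print(requests.get("https://huggingface.co", timeout=10).status_code)
```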
The model is downloaded the first time it is used, and large models can run to several gigabytes, so depending on your network conditions the download may take a significant amount of time. As long as no red error messages appear, be patient and wait.