Video translation software typically comes with multiple built-in speech recognition channels to transcribe human speech in audio and video into subtitle files. While these channels perform reasonably well for English and Chinese, their performance tends to be less satisfactory for low-resource languages like Japanese, Korean, and Indonesian.
This comes down to training data. Large models developed outside China are trained predominantly on English material, so even their Chinese performance is not ideal; Chinese-developed models, in turn, train mainly on Chinese and English data, with Chinese making up the larger share.
This scarcity of training data for other languages leads to subpar recognition quality. Fortunately, Hugging Face (https://huggingface.co) hosts a vast collection of fine-tuned models, including ones specialized for low-resource languages that deliver impressive results.
This article will demonstrate how to leverage Hugging Face models within video translation software to enhance the recognition of low-resource languages, using Japanese as a case study.
1. Circumventing Network Restrictions
Due to network restrictions, direct access to https://huggingface.co may be unavailable in some regions. You will need to configure your network environment to ensure accessibility.
Once you gain access, you'll be greeted by the Hugging Face website's homepage.
2. Navigating to the Models Directory
Click on the "Automatic Speech Recognition" category in the left-hand navigation bar. This will display all the available speech recognition models on the right.
3. Identifying Models Compatible with faster-whisper
The Hugging Face website currently boasts 20,384 speech recognition models. However, not all of them are suitable for use with video translation software. Different models return data in varying formats, and video translation software primarily supports models of the faster-whisper type.
- Search for "faster-whisper" in the search bar.
The search results will generally consist of models that can be used in the video translation software. (The same search can also be run programmatically, as sketched below.)
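If you prefer querying the Hub from code rather than through the website, the huggingface_hub Python library exposes the same search. A minimal sketch, assuming a recent version of huggingface_hub is installed; the result limit is arbitrary:

```python
# Search the Hugging Face Hub for ASR models matching "faster-whisper",
# most-downloaded first, equivalent to typing it into the search bar.
from huggingface_hub import HfApi

api = HfApi()
models = api.list_models(
    search="faster-whisper",              # same text as the website search bar
    task="automatic-speech-recognition",  # restrict to speech recognition models
    sort="downloads",
    direction=-1,                         # descending: most downloaded first
    limit=20,                             # arbitrary cap for this example
)
for m in models:
    print(m.id)
```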
Of course, some models might be compatible with faster-whisper even if their names don't explicitly include "faster-whisper." How do you find these models?
- Search for the language name, such as "japanese," and then click to enter the model details page. Check if the model description mentions compatibility with faster-whisper.
If the model's name or description doesn't explicitly mention faster-whisper, it is most likely not compatible. Even when terms like "whisper" or "whisper-large" appear, such repositories are usually meant for the openai-whisper mode, which the video translation software does not currently support (whether it will be supported in the future is undecided). The underlying difference is the weight format: faster-whisper loads models converted to the CTranslate2 format, while plain whisper repositories ship the original PyTorch weights. A quick heuristic check based on this is sketched below.
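A faster-whisper-ready repository ships its converted weights in a file named model.bin, so listing the repository files is a reasonable first-pass test. A sketch using huggingface_hub; the two model IDs are examples, and you should still confirm compatibility on the model page:

```python
# Heuristic compatibility check: CTranslate2/faster-whisper repositories
# contain a "model.bin" file holding the converted weights.
from huggingface_hub import list_repo_files

def looks_faster_whisper_compatible(model_id: str) -> bool:
    files = list_repo_files(model_id)
    return "model.bin" in files

# Example model IDs (illustrative):
print(looks_faster_whisper_compatible("Systran/faster-whisper-large-v2"))  # True
print(looks_faster_whisper_compatible("openai/whisper-large-v2"))          # False
```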
4. Copying the Model ID to the Video Translation Software
After locating a suitable model, copy its Model ID and paste it into the video translation software under "Menu" -> "Tools" -> "Advanced Options" -> "faster and openai model list."
- Copy the Model ID.
- Paste it into the video translation software.
- Save the settings.
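A typo in a pasted Model ID is a common cause of download failures later on. If you want to double-check an ID before saving it, the Hub can be queried directly; a small sketch, where the ID shown is an example:

```python
# Verify that a Model ID exists on the Hub before pasting it into the settings.
from huggingface_hub import model_info
from huggingface_hub.utils import RepositoryNotFoundError

def id_exists(model_id: str) -> bool:
    try:
        model_info(model_id)
        return True
    except RepositoryNotFoundError:
        return False

print(id_exists("Systran/faster-whisper-large-v2"))  # True if spelled correctly
```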
5. Selecting the faster-whisper Mode
In the speech recognition channel selection, choose the newly added model. If it doesn't appear, restart the software.
Once you've selected the model and the desired pronunciation language, you can begin the recognition process.
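The video translation software performs the download and transcription for you; purely for illustration, loading such a model with the faster-whisper Python library directly looks roughly like this. The Model ID and audio filename are placeholders, and Japanese is used to match this article's example:

```python
# Download (on first use) and load a CTranslate2 model by its Hub ID,
# then transcribe a Japanese audio file into timestamped segments.
from faster_whisper import WhisperModel

model = WhisperModel(
    "Systran/faster-whisper-large-v2",  # example Model ID copied from the Hub
    device="cpu",                       # or "cuda" with a supported GPU
    compute_type="int8",                # lower memory use on CPU
)
segments, info = model.transcribe("audio.wav", language="ja")
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text}")
```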
Note: In regions where huggingface.co is blocked, a proxy must be configured; otherwise the connection will fail with an error. Try setting a global or system-wide proxy first. If errors persist, enter the proxy IP and port into the "Network Proxy" text box on the main interface.
For an explanation of network proxies, please refer to https://pyvideotrans.com/proxy
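If the software's proxy box alone is not enough, note that the underlying Hugging Face download client (built on the requests library) honors the standard proxy environment variables, so setting them before launching the software is another option. A sketch, where 127.0.0.1:7890 is a placeholder for your own proxy address and port:

```python
# Route Hub traffic through a local proxy by setting the standard
# environment variables before any download starts.
import os

os.environ["HTTP_PROXY"] = "http://127.0.0.1:7890"   # placeholder address/port
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:7890"

# Optional connectivity check: a 200 status means the proxy works.
import requests
print(requests.get("https://huggingface.co", timeout=10).status_code)
```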
The model is downloaded the first time it is used, and large models can run to several gigabytes, so depending on your network conditions the download may take a significant amount of time. As long as no red error messages appear, be patient and wait.