Using the Chinese Re-segmentation Feature
Whisper is currently the mainstream speech recognition model, but it still has notable shortcomings in Chinese recognition. Its Chinese results lag well behind its English results: it frequently outputs Traditional Chinese characters and omits punctuation marks, so the generated subtitles are poorly segmented. Even when sentences are re-split using the returned character-level timestamps, the results remain unsatisfactory if the audio or video has no clear silent gaps.
By comparison, Alibaba's FunASR series of models performs well in Chinese recognition, but its language support is limited: it only handles Chinese and cannot process other languages.
Therefore, v2.92 introduces Alibaba's Chinese punctuation restoration model. This model restores punctuation marks in Chinese recognition results, and sentences are then re-divided based on punctuation and silent intervals. Bundling the punctuation restoration model increases the software size by approximately 400MB.
Enabling Chinese Re-segmentation
The Alibaba Chinese punctuation model is automatically used to re-segment the results when all of the following conditions are met:
- The "Chinese Re-segmentation" option is checked in the main interface or the audio/video-to-subtitle interface;
- The language of the audio/video is Chinese;
- The speech recognition engine is "faster-whisper", "openai-whisper", or "deepgram.com";
- The segmentation mode is "Recognize as a whole".
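The conditions above can be read as a single boolean check. The following is only an illustrative sketch, not the software's actual code: the `TaskSettings` class, its field names, the `"zh"` language code, and the `"whole"` mode value are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Engines listed in the conditions above.
SUPPORTED_ENGINES = {"faster-whisper", "openai-whisper", "deepgram.com"}


@dataclass
class TaskSettings:
    """Hypothetical container for the relevant options (not the real settings object)."""
    resegment_checked: bool  # state of the "Chinese Re-segmentation" checkbox
    language: str            # assumed language code, e.g. "zh" for Chinese
    engine: str              # selected speech recognition engine
    split_mode: str          # assumed value "whole" for "Recognize as a whole"


def should_resegment(settings: TaskSettings) -> bool:
    """Return True only when every listed condition holds."""
    return (
        settings.resegment_checked
        and settings.language.lower().startswith("zh")
        and settings.engine in SUPPORTED_ENGINES
        and settings.split_mode == "whole"
    )
```

If any single condition fails, for example a different recognition engine is selected, the check returns False and the recognition results would be left as they are.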
Once these conditions are met, the system first restores punctuation marks after speech recognition completes, and then re-divides sentences based on punctuation marks and silent intervals to improve the accuracy and readability of the subtitles.
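To make the idea of re-dividing by punctuation and silent intervals concrete, here is a minimal sketch. It is not the software's implementation. It assumes the recognizer returned character-level timestamps and that a punctuation model has already attached a restored mark to each character where appropriate. The `Char` class, the `BREAK_MARKS` set, and the 0.5 second gap threshold are assumptions made for illustration.

```python
from dataclasses import dataclass

# Marks treated as sentence boundaries in this sketch (an assumption).
BREAK_MARKS = set("。！？；")


@dataclass
class Char:
    """One recognized character with timing (hypothetical structure)."""
    text: str
    start: float     # seconds
    end: float       # seconds
    punct: str = ""  # restored punctuation following this character, if any


def resegment(chars, max_gap=0.5):
    """Split characters into (start, end, text) subtitle segments.

    A segment ends after a character that carries a sentence-ending mark,
    or when the silence before the next character is at least max_gap seconds.
    """
    segments = []
    buffer = []
    for i, ch in enumerate(chars):
        buffer.append(ch)
        nxt = chars[i + 1] if i + 1 < len(chars) else None
        gap = nxt.start - ch.end if nxt is not None else 0.0
        if nxt is None or ch.punct in BREAK_MARKS or gap >= max_gap:
            text = "".join(c.text + c.punct for c in buffer)
            segments.append((buffer[0].start, buffer[-1].end, text))
            buffer = []
    return segments
```

In this sketch, punctuation gives a boundary even when speech is continuous, while the silence threshold acts as a fallback for stretches where no sentence-ending mark was restored.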