Why Are Recognized Subtitles Uneven and Fragmented - How to Optimize and Adjust?
In the video translation process, the subtitles generated automatically during the speech recognition stage are often unsatisfactory: either a subtitle is so long it nearly fills the screen, or it shows only two or three characters, leaving the output scattered and disjointed. Why does this happen?
Speech Recognition's Sentence Segmentation Standards
When human speech is converted into text subtitles, sentences are usually segmented at silent intervals. The minimum silence duration is typically set between 200 and 500 milliseconds. Suppose it is set to 250 milliseconds: whenever the program detects 250 milliseconds of continuous silence, it treats that point as the end of a sentence and generates one subtitle spanning from the previous end point to the current one.
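The silence-based segmentation described above can be sketched in a few lines. This is a simplified model, not the tool's actual internals: `frames` is a hypothetical list of per-frame voice-activity flags (True = speech detected), and a run of silence at or above the threshold closes the current subtitle.

```python
def segment_by_silence(frames, frame_ms=10, min_silence_ms=250):
    """Split per-frame voice-activity flags into subtitle segments.

    A run of silence >= min_silence_ms ends the current segment.
    Returns (start_ms, end_ms) pairs, one per subtitle.
    """
    min_silent_frames = min_silence_ms // frame_ms
    segments = []
    start = None        # first frame of the segment being built
    silent_run = 0      # consecutive silent frames seen so far
    for i, voiced in enumerate(frames):
        if voiced:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                # Silence long enough: close the segment before the gap.
                end = i - silent_run + 1
                segments.append((start * frame_ms, end * frame_ms))
                start, silent_run = None, 0
    if start is not None:  # speech running to the end of the audio
        segments.append((start * frame_ms, len(frames) * frame_ms))
    return segments

# 300 ms of speech, a 250 ms pause, then 200 ms of speech (10 ms frames):
frames = [True] * 30 + [False] * 25 + [True] * 20
print(segment_by_silence(frames))  # [(0, 300), (550, 750)]
```

Note that a pause even slightly shorter than the threshold produces no split at all, which is exactly the behavior the next section's failure modes come from.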
Factors Affecting Subtitle Quality
- Speaking Speed
If the speaking speed in the audio is very fast, with almost no pauses, or pauses shorter than 250 milliseconds, the segmented subtitles will be very long, possibly lasting for ten or even dozens of seconds, filling the screen when embedded in the video.
- Irregular Pauses
Conversely, if there are unnecessary pauses while speaking, for example, if there are several pauses in the middle of a coherent sentence, the segmented subtitles will be very fragmented, with only a few words displayed in each subtitle.
- Background Noise
Background noise or music can also interfere with the determination of silent intervals, leading to inaccurate recognition.
- Pronunciation Clarity
This one is obvious: even humans cannot understand unclear pronunciation.
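The first two failure modes can be made concrete with a toy calculation. Assuming a 250 ms threshold, phrases separated by pauses shorter than the threshold merge into one subtitle (the pause is swallowed), while pauses at or above it start a new one. The numbers below are illustrative, not taken from any real audio:

```python
def subtitle_lengths(phrase_ms, pause_ms, threshold_ms=250):
    """Given alternating phrase and pause durations (ms), merge phrases
    whose separating pause is shorter than threshold_ms and return the
    total length of each resulting subtitle."""
    lengths = [phrase_ms[0]]
    for pause, phrase in zip(pause_ms, phrase_ms[1:]):
        if pause >= threshold_ms:
            lengths.append(phrase)           # long pause: new subtitle
        else:
            lengths[-1] += pause + phrase    # short pause: phrases fuse
    return lengths

# Fast speaker: 100 ms pauses never reach the threshold, so six
# two-second phrases fuse into one 12.5-second wall of text.
print(subtitle_lengths([2000] * 6, [100] * 5))  # [12500]

# Hesitant speaker: 300 ms pauses split every short phrase apart,
# yielding six fragments of a few words each.
print(subtitle_lengths([400] * 6, [300] * 5))   # [400, 400, 400, 400, 400, 400]
```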
How to Address These Issues?
- Reduce Background Noise:
If the background noise is significant, you can separate human speech from background sound before recognition to remove interfering sounds and improve recognition accuracy.
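Real vocal separation uses a trained model; purely for intuition, the sketch below shows the simplest possible stand-in, a crude amplitude gate that zeroes low-level samples so background hiss no longer looks like speech to the silence detector. The sample values are made up:

```python
def noise_gate(samples, threshold=0.05):
    """Toy noise gate: zero any sample whose magnitude is below the
    threshold.  Low-level background noise is suppressed, so the
    silence detector sees clean gaps between speech bursts.
    (Real vocal separation uses a model, not a gate like this.)"""
    return [s if abs(s) >= threshold else 0.0 for s in samples]

mixed = [0.02, -0.03, 0.4, 0.5, -0.01, 0.3]  # hiss + speech peaks
print(noise_gate(mixed))  # [0.0, 0.0, 0.4, 0.5, 0.0, 0.3]
```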
- Use a Larger Speech Recognition Model:
If computer performance allows, use a larger model for recognition, such as large-v2 or large-v3-turbo.
- Adjust Silent Segment Duration:
By default, the software sets the minimum silence duration to 200 milliseconds; you can adjust this value to suit the specific audio or video. If the speech in the video is fast, lower it to 100 milliseconds; if there are many pauses, raise it to 300 or 500 milliseconds. To change it, open Tools/Options in the menu, select Advanced Options, and modify the minimum silent segment value in the faster/openai speech recognition section.
- Set Maximum Subtitle Duration:
You can set the maximum duration for subtitles, and subtitles exceeding this duration will be forcibly split. This setting is also in the Advanced Options.
As shown in the figure, subtitles exceeding 10 seconds will be re-segmented.
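A forced split by duration can be sketched as below. For simplicity this sketch cuts an over-long subtitle into equal-length chunks; the real program chooses better break points (silence and punctuation), so this is only the shape of the operation, not its actual logic:

```python
import math

def enforce_max_duration(start_ms, end_ms, max_ms=10_000):
    """Forcibly split a subtitle longer than max_ms into roughly equal
    chunks, mimicking the 'maximum subtitle duration' option (10 s in
    the figure).  Equal chunks are a simplification: the real splitter
    prefers silence and punctuation boundaries."""
    length = end_ms - start_ms
    if length <= max_ms:
        return [(start_ms, end_ms)]
    pieces = math.ceil(length / max_ms)
    step = length / pieces
    return [(round(start_ms + i * step), round(start_ms + (i + 1) * step))
            for i in range(pieces)]

# A 25-second subtitle becomes three chunks of ~8.3 s each:
print(enforce_max_duration(0, 25_000))  # [(0, 8333), (8333, 16667), (16667, 25000)]
```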
- Set the Maximum Number of Characters per Subtitle Line:
You can set the upper limit of characters per subtitle line, and subtitles exceeding this limit will automatically wrap or be split.
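Wrapping at a character limit is essentially what the standard library's `textwrap` does, so a minimal sketch needs almost no code. Note this breaks on spaces, which suits languages written with them; CJK subtitles would be cut by character count instead, and the 42/30 limits here are arbitrary examples:

```python
import textwrap

def wrap_subtitle(text, max_chars=42):
    """Wrap one subtitle at the character limit, breaking on spaces,
    the way the 'maximum characters per line' option wraps long lines."""
    return textwrap.wrap(text, width=max_chars)

line = "this subtitle line is far too long to fit comfortably on screen"
for part in wrap_subtitle(line, 30):
    print(part)
```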
- Enable Re-segmentation Function: After enabling this option, combined with the maximum subtitle duration and maximum characters-per-line settings above, the program will automatically re-segment sentences.
With the settings above (silence duration, maximum subtitle duration, maximum characters per line, and re-segmentation) in place, the program first generates subtitles based on silent intervals. When it encounters a subtitle that is too long or contains too many characters, it splits that subtitle by re-segmenting the sentence. For re-segmentation, the program uses the nltk natural language processing library, weighing silent-interval duration, punctuation, subtitle character count, and other factors before deciding where to split.
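The punctuation-aware part of re-segmentation can be illustrated with a small sketch. The real program uses nltk for sentence tokenization; to stay self-contained (nltk needs a data download), a plain regex stands in here, and the 40-character limit is an arbitrary example:

```python
import re

def resegment(text, max_chars=40):
    """Re-split an over-long subtitle, preferring sentence-ending
    punctuation as break points, then packing sentences greedily up to
    max_chars per piece.  (A simplified stand-in for nltk-based
    tokenization; silence timing is ignored in this sketch.)"""
    # Split after sentence-ending punctuation, keeping it attached.
    sentences = re.split(r'(?<=[.!?。！？])\s*', text)
    pieces, current = [], ""
    for s in filter(None, sentences):
        if current and len(current) + len(s) + 1 > max_chars:
            pieces.append(current)  # would overflow: start a new piece
            current = s
        else:
            current = (current + " " + s) if current else s
    if current:
        pieces.append(current)
    return pieces

long_line = "First point made. Second point follows. A third, longer point ends it."
print(resegment(long_line, 40))
```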