VAD Parameter Tuning in Speech Recognition
Subtitles generated during the speech recognition phase of video translation are sometimes very long, lasting tens of seconds or even minutes, and sometimes very short, lasting less than 1 second. Both problems can be reduced by adjusting the VAD parameters.
What is VAD
Silero VAD is an efficient voice activity detection (VAD) tool that identifies whether audio contains speech and separates speech from silence or noise. Silero VAD can be used with speech recognition libraries (such as Whisper) to detect and split out speech segments before or after recognition, thereby improving recognition performance.
In faster-whisper, VAD is used by default for speech analysis and segmentation. The following five parameters control how speech and silence are detected and how the audio is segmented. Detailed explanations and setting suggestions for each parameter follow:
threshold
Meaning: Represents the probability threshold for speech. Silero VAD outputs the speech probability of each audio segment. Probabilities higher than this value are considered speech (SPEECH), and probabilities lower than this value are considered silence or background noise.
Setting Suggestions: The default value is 0.5, which is applicable in most cases. However, for different datasets, you can adjust this value to distinguish more accurately between speech and noise. If noise is too often detected as speech (false positives), try increasing it to 0.6 or 0.7; if too many speech segments are being missed, decrease it to 0.3 or 0.4.
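As a rough illustration of the thresholding idea only, the sketch below labels a list of per-chunk speech probabilities. The probability values and the helper name label_chunks are made up for this example; it is not Silero VAD's actual code.

```python
def label_chunks(probs, threshold=0.5):
    """Return True for chunks treated as speech, False for silence/noise."""
    return [p > threshold for p in probs]

# Hypothetical per-chunk speech probabilities (not real model output).
probs = [0.05, 0.45, 0.62, 0.91, 0.55, 0.20]

default_labels = label_chunks(probs)        # threshold 0.5
strict_labels = label_chunks(probs, 0.7)    # higher threshold: fewer false positives
lenient_labels = label_chunks(probs, 0.3)   # lower threshold: fewer missed chunks
```

With the higher threshold only the most confident chunk survives, while the lower threshold also accepts the borderline 0.45 chunk.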
min_speech_duration_ms (Minimum Speech Duration, in milliseconds)
Meaning: If the length of a detected speech segment is less than this value, the speech segment will be discarded. The purpose is to remove some short non-speech sounds or noise.
Setting Suggestions: The default value is 250 milliseconds, which is suitable for most scenarios. If short bursts of noise are being misjudged as speech, you can increase this value, for example to 500 milliseconds; if short utterances are being dropped, decrease it.
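The effect of this filter can be sketched as follows. The segment times and the helper name drop_short_segments are hypothetical and serve only to show which segments survive at different settings.

```python
def drop_short_segments(segments, min_speech_duration_ms=250):
    """Keep only (start_ms, end_ms) segments that last at least the minimum."""
    return [(s, e) for s, e in segments if e - s >= min_speech_duration_ms]

# Hypothetical detected segments, in milliseconds.
segments = [(0, 120), (900, 2400), (3000, 3300), (5000, 5400)]

kept_default = drop_short_segments(segments)       # drops only the 120 ms blip
kept_strict = drop_short_segments(segments, 500)   # also drops the 300 ms and 400 ms segments
```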
max_speech_duration_s (Maximum Speech Duration, in seconds)
Meaning: The maximum length of a single speech segment. If the speech segment exceeds this duration, an attempt will be made to split it at silences longer than 100 milliseconds. If no silent position is found, it will be forcibly split before this duration to avoid overly long continuous segments.
Setting Suggestions: The default is infinity (no limit). If you need to process longer speech segments, you can keep the default value; but if you want to control the segment length, such as processing dialogues or segmented output, you can set it according to specific needs, such as 10 seconds or 30 seconds.
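The splitting rule described above can be sketched in simplified form. This is an illustration under stated assumptions, not the library's implementation: the helper name split_long_segment, the list of pauses, and the choice to force a split exactly at the limit are all assumptions made for the example.

```python
def split_long_segment(start_ms, end_ms, silences,
                       max_speech_duration_s=10.0, min_split_silence_ms=100):
    """Split one segment so that no piece exceeds max_speech_duration_s.

    silences: (silence_start_ms, silence_end_ms) pauses inside the segment.
    """
    max_ms = int(max_speech_duration_s * 1000)
    pieces = []
    cur = start_ms
    while end_ms - cur > max_ms:
        limit = cur + max_ms
        # Pauses longer than min_split_silence_ms lying fully before the limit.
        candidates = [(s, e) for s, e in silences
                      if s > cur and e <= limit and e - s > min_split_silence_ms]
        if candidates:
            s, e = max(candidates)        # latest suitable pause
            pieces.append((cur, s))
            cur = e
        else:
            pieces.append((cur, limit))   # no suitable pause: forced split
            cur = limit
    pieces.append((cur, end_ms))
    return pieces

# Hypothetical 25-second segment with three internal pauses.
pieces = split_long_segment(0, 25000, [(4000, 4300), (8500, 8700), (14000, 14050)])
```

Here the first cut lands on the 200 ms pause at 8.5 s, while the 50 ms pause at 14 s is too short to use, so the second cut is forced.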
min_silence_duration_ms (Minimum Silence Duration, in milliseconds)
Meaning: The silence time to wait after speech is detected. Only if the silence duration exceeds this value will the speech segment be split.
Setting Suggestions: The default value is 2000 milliseconds (2 seconds). If you want speech to be split at shorter pauses, which produces more and shorter segments, reduce this value, for example to 500 milliseconds; if you want fewer, longer segments, increase it.
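A minimal sketch of how this setting changes segmentation, assuming speech has already been detected as a list of (start_ms, end_ms) spans. The helper name merge_across_short_silences and the timings are hypothetical.

```python
def merge_across_short_silences(segments, min_silence_duration_ms=2000):
    """Join consecutive segments unless the gap between them exceeds the minimum silence."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] <= min_silence_duration_ms:
            merged[-1] = (merged[-1][0], end)   # pause too short: keep as one segment
        else:
            merged.append((start, end))         # pause long enough: start a new segment
    return merged

# Hypothetical speech spans separated by pauses of 600 ms and 2500 ms.
raw = [(0, 3000), (3600, 7000), (9500, 12000)]

with_default = merge_across_short_silences(raw)       # only the 2500 ms pause splits
with_500 = merge_across_short_silences(raw, 500)      # both pauses split
```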
speech_pad_ms (Speech Padding Time, in milliseconds)
Meaning: The padding time added before and after the detected speech segment to avoid cutting the speech segment too tightly, which may cut off some marginal speech.
Setting Suggestions: The default value is 400 milliseconds. If you find that the cut speech segment has missing parts, you can increase this value, such as 500 milliseconds or 800 milliseconds. Conversely, if the speech segment is too long or contains too many invalid parts, you can reduce this value.
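Padding can be pictured with the following sketch. The helper name pad_segments and the timings are hypothetical, and the sketch does not handle the case where padded neighbours would overlap.

```python
def pad_segments(segments, audio_length_ms, speech_pad_ms=400):
    """Extend each (start_ms, end_ms) segment on both sides, clamped to the audio bounds."""
    return [(max(0, s - speech_pad_ms), min(audio_length_ms, e + speech_pad_ms))
            for s, e in segments]

# Hypothetical segments in a 6-second clip.
padded = pad_segments([(200, 1500), (4000, 5800)], audio_length_ms=6000)
```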
The specific settings of these parameters need to be optimized based on the speech dataset and application scenarios you are using. Reasonable configuration can significantly improve the performance of VAD.
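For readers who call the faster-whisper Python library directly rather than through the interface, the five parameters can be passed as a dictionary. The sketch below assumes the faster-whisper package is installed; the values, the model size, and the helper name transcribe_with_vad are example choices, not recommendations from the library.

```python
# Example starting values only; tune them for your own audio.
vad_parameters = {
    "threshold": 0.5,
    "min_speech_duration_ms": 250,
    "max_speech_duration_s": 10,
    "min_silence_duration_ms": 500,
    "speech_pad_ms": 400,
}

def transcribe_with_vad(audio_path, model_size="small"):
    """Transcribe one file with the VAD filter enabled and the settings above."""
    from faster_whisper import WhisperModel  # requires: pip install faster-whisper

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(
        audio_path,
        vad_filter=True,                 # turn on Silero VAD filtering
        vad_parameters=vad_parameters,   # override the VAD defaults
    )
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

Calling transcribe_with_vad with the path to an audio file returns a list of (start, end, text) tuples, which makes it easy to compare segment lengths before and after changing a value.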
The above parameters can be modified in Menu--Tools/Options--Advanced Options--faster/openai. Alternatively, in the main interface, select faster-whisper local for speech recognition, then click the "Speech Recognition" text on the left, and text boxes for editing these parameters will be displayed below.
Summary:
threshold: Can be adjusted according to the dataset; the default value of 0.5 works well in most cases.
min_speech_duration_ms and min_silence_duration_ms: Determine the minimum length of a speech segment and how sensitive splitting is to silence. Fine-tune them according to the application scenario.
max_speech_duration_s: Prevents unreasonable growth of long speech segments, usually set according to specific applications.
speech_pad_ms: Adds a buffer to the speech segment to prevent the segment from being over-cut. The specific value selection depends on your audio data and the requirements for speech segmentation.
The cleaner the audio and the less background noise it contains, the better the recognition result. Even carefully tuned parameters cannot match the benefit of a clean recording.