
Difference Between Whole Recognition and Equal Segmentation

Whole Recognition:

This method provides the best speech recognition results but consumes the most system resources. For longer videos, especially when using the large-v3 model, it may cause crashes.

During recognition, the entire audio file is passed to the model, which internally uses VAD (Voice Activity Detection) for segmentation, recognition, and sentence breaking. By default, the silence threshold used for segmentation is 200ms and the maximum sentence length is 3s. These settings can be configured in Menu > Tools/Options > Advanced Options > VAD.
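The VAD behavior described above can be illustrated with a minimal sketch. This is hypothetical code, not the tool's actual implementation: it assumes the audio has already been reduced to per-frame speech/silence flags (10ms per frame), and the `MIN_SILENCE_MS` and `MAX_SENT_MS` constants mirror the 200ms and 3s defaults mentioned above.

```python
# Hypothetical sketch of VAD-style silence segmentation (assumed names/values,
# not the tool's real code). Input: a list of booleans, one per 10 ms frame,
# where True means "speech detected in this frame".

FRAME_MS = 10          # duration represented by one boolean frame
MIN_SILENCE_MS = 200   # default: a silence gap >= 200 ms triggers a split
MAX_SENT_MS = 3000     # default: maximum sentence length of 3 s

def split_on_silence(frames):
    """Return (start_ms, end_ms) speech segments, split at silences of at
    least MIN_SILENCE_MS and capped at MAX_SENT_MS per segment."""
    segments = []
    start = None     # start time of the current speech segment, in ms
    silence = 0      # length of the current run of silence, in ms
    for i, is_speech in enumerate(frames):
        t = i * FRAME_MS
        if is_speech:
            if start is None:
                start = t
            silence = 0
            # Cap the sentence length at MAX_SENT_MS.
            if t + FRAME_MS - start >= MAX_SENT_MS:
                segments.append((start, t + FRAME_MS))
                start = None
        elif start is not None:
            silence += FRAME_MS
            if silence >= MIN_SILENCE_MS:
                # Close the segment at the point where silence began.
                segments.append((start, t + FRAME_MS - silence))
                start = None
                silence = 0
    if start is not None:
        segments.append((start, len(frames) * FRAME_MS))
    return segments
```

For example, 300ms of speech, a 250ms pause, then 200ms of speech yields two segments, because the pause exceeds the 200ms threshold; a single unbroken 3.5s stretch of speech is cut at the 3s cap.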

Equal Segmentation:

As the name suggests, this method splits the audio file into segments of equal fixed length before passing them to the model. When using the OpenAI model, equal segmentation is enforced, regardless of whether "Whole Recognition" or "Pre-segmentation" is selected.

Each segment in equal segmentation is 10s long, with a silence segmentation interval of 500ms. These settings can be adjusted in Menu > Tools/Options > Advanced Options > VAD.

Note: Although the segment length is set to 10s, each subtitle segment is only approximately 10s long. The actual audio length may be shorter, because trailing silence is removed and the speech does not always fill the full window.
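The fixed-length splitting plus trailing-silence trimming described above can be sketched as follows. Again this is an illustrative assumption, not the tool's actual code: the audio is represented as per-frame speech flags (10ms per frame), and `SEG_MS` mirrors the 10s default.

```python
# Hypothetical sketch of equal segmentation (assumed names/values). Input is
# the same per-10 ms speech/silence boolean representation as before.

FRAME_MS = 10       # duration represented by one boolean frame
SEG_MS = 10_000     # default: nominal 10 s segment length

def equal_segments(frames):
    """Cut audio into nominal 10 s chunks, then drop trailing silence from
    each chunk, so actual segment lengths may be shorter than 10 s."""
    frames_per_seg = SEG_MS // FRAME_MS
    segments = []
    for start in range(0, len(frames), frames_per_seg):
        chunk = frames[start:start + frames_per_seg]
        # Trim trailing silence: walk back past the final non-speech frames.
        end = len(chunk)
        while end > 0 and not chunk[end - 1]:
            end -= 1
        if end > 0:  # skip chunks that contain no speech at all
            segments.append((start * FRAME_MS, (start + end) * FRAME_MS))
    return segments
```

So a chunk whose last 500ms are silent is emitted as a 9.5s segment rather than a full 10s one, matching the note above that segments are "around 10s" rather than exactly 10s.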