
Difference between Whole Recognition and Equal Segmentation

Whole Recognition:

This gives the best speech recognition results but consumes the most computer resources. If the video is large and you use the large-v3 model, the program may crash.

During recognition, the entire audio file is passed to the model, which uses VAD internally to segment the audio, recognize it, and break it into sentences. By default, the silence threshold for segmentation is 200ms and the maximum sentence length is 3s. You can configure both in Menu--Tools/Options--Advanced Options--VAD.
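
To make the two VAD settings concrete, here is a minimal sketch of how a silence threshold and a maximum sentence length could interact. It is not the model's actual VAD code: the function `group_speech`, its parameter names, and the input format (a list of detected speech intervals in milliseconds) are assumptions made for illustration only.

```python
def group_speech(intervals, min_silence_ms=200, max_sentence_ms=3000):
    """Group detected speech intervals into sentences (illustrative only).

    intervals: list of (start_ms, end_ms) speech spans, sorted by start time.
    Two neighbouring spans are joined when the silence between them is
    shorter than min_silence_ms and the joined sentence would not exceed
    max_sentence_ms. A single span longer than the cap is left as-is.
    """
    sentences = []
    for start, end in intervals:
        if sentences:
            prev_start, prev_end = sentences[-1]
            gap = start - prev_end
            if gap < min_silence_ms and end - prev_start <= max_sentence_ms:
                # Short pause and still within the cap: extend the sentence.
                sentences[-1] = (prev_start, end)
                continue
        # Long pause, or the cap would be exceeded: start a new sentence.
        sentences.append((start, end))
    return sentences


speech = [(0, 900), (1000, 1800), (2300, 4000), (4100, 5600)]
sentences = group_speech(speech)
print(sentences)
```

In this sketch the 500ms pause before 2300ms starts a new sentence because it exceeds the 200ms threshold, and the last span starts a new sentence because joining it would exceed the 3s cap.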

Equal Segmentation:

As the name suggests, this cuts the audio file into segments of equal length and passes them to the model one at a time. OpenAI models always use equal segmentation: when an OpenAI model is selected, "Equal Segmentation" is forced whether you choose "Whole Recognition" or "Pre-segmentation."

By default, equal segmentation uses 10s segments, and the silence interval used to break sentences is 500ms. You can configure both in Menu--Tools/Options--Advanced Options--VAD.
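
As a rough illustration of cutting audio into equal-length pieces, the sketch below computes the segment boundaries for a file of a given duration. The function name `equal_segments` and its parameters are hypothetical and are not taken from the application; the real implementation may also adjust boundaries at silences.

```python
def equal_segments(total_ms, segment_ms=10_000):
    """Return (start_ms, end_ms) boundaries for fixed-length segments.

    The last segment is shorter when total_ms is not a multiple of
    segment_ms. Illustrative only; names and defaults are assumptions.
    """
    if segment_ms <= 0:
        raise ValueError("segment_ms must be positive")
    return [
        (start, min(start + segment_ms, total_ms))
        for start in range(0, total_ms, segment_ms)
    ]


# A 25-second file cut into 10-second segments.
segments = equal_segments(25_000)
print(segments)
```

For a 25s file this yields two full 10s segments and one final 5s segment.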

Note: Although the segment length is set to 10s and each subtitle therefore generally spans about 10s, the voice-over for a segment is not necessarily 10s long. Its length is the actual duration of the speech, and the silence at the end of the voice-over is removed.