Difference Between Whole Recognition and Equal Division

Whole Recognition:

This method delivers the best speech recognition accuracy but is also the most resource-intensive. With a large video file and the large-v3 model, the high memory and compute demands can cause crashes.

During recognition, the entire audio file is fed to the model. The model then internally uses Voice Activity Detection (VAD) to segment and punctuate the audio. The default silence separation is 200ms, with a maximum statement length of 3 seconds. These settings can be configured in Menu -> Tools/Options -> Advanced Options -> VAD section.
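The grouping behavior described above can be sketched as follows. This is an illustrative approximation, not the tool's actual code: it assumes VAD output as a list of (start, end) speech chunks in milliseconds and merges them into statements using the default 200ms silence separation and 3-second maximum statement length.

```python
# Illustrative sketch of VAD-based statement grouping (not the tool's real API):
# speech chunks are merged until a silence gap of at least 200 ms occurs or the
# merged statement would exceed 3 seconds.
MIN_SILENCE_MS = 200      # default silence separation
MAX_STATEMENT_MS = 3000   # default maximum statement length

def group_statements(speech_chunks, min_silence=MIN_SILENCE_MS,
                     max_len=MAX_STATEMENT_MS):
    """Merge raw VAD speech chunks into subtitle-sized statements."""
    statements = []
    current = None
    for start, end in speech_chunks:
        if current is None:
            current = [start, end]
            continue
        gap = start - current[1]          # silence between chunks
        merged_len = end - current[0]     # length if we merged them
        # Split at a long-enough silence, or when merging would be too long.
        if gap >= min_silence or merged_len > max_len:
            statements.append(tuple(current))
            current = [start, end]
        else:
            current[1] = end
    if current is not None:
        statements.append(tuple(current))
    return statements

chunks = [(0, 900), (1000, 1900), (2400, 3000), (3100, 5600)]
print(group_statements(chunks))
# → [(0, 1900), (2400, 3000), (3100, 5600)]
```

The first two chunks merge (only a 100ms gap), the 500ms gap before the third forces a split, and the last chunk starts a new statement because merging would exceed 3 seconds.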

Equal Division:

As the name suggests, this method cuts the audio file into segments of equal length before passing them to the model, which makes it far less resource-intensive. Note that OpenAI models force this mode: "Equal Division" is used even if you selected "Whole Recognition" or "Pre-Segmentation".

With Equal Division, each segment is 10 seconds long, and the silent gap between statements is 500ms. These values are configurable in Menu -> Tools/Options -> Advanced Options -> VAD section.
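The splitting itself is simple fixed-length chunking. A minimal sketch, assuming the segment length is given in milliseconds (the function name is illustrative, not the tool's API):

```python
# Illustrative sketch of Equal Division: cut the file into fixed-length
# segments (default 10 s); the last segment may be shorter.
SEGMENT_MS = 10_000  # default segment length, configurable in the VAD options

def equal_division(total_ms, segment_ms=SEGMENT_MS):
    """Return (start, end) boundaries in ms covering the whole file."""
    return [(start, min(start + segment_ms, total_ms))
            for start in range(0, total_ms, segment_ms)]

print(equal_division(25_000))
# → [(0, 10000), (10000, 20000), (20000, 25000)]
```

Each resulting segment is then sent to the model independently, which is why peak memory stays low compared to Whole Recognition.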

Important: Although each segment is nominally 10 seconds, the actual duration of each subtitle may vary. A subtitle's timing covers only the actual speech in the audio, with trailing silence removed, so subtitles can be shorter than 10 seconds.
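The trailing-silence trimming can be illustrated like this. The amplitude threshold and function name are assumptions for the sake of the example, not values from the tool:

```python
# Illustrative sketch of trailing-silence trimming: a subtitle ends where
# speech ends, not at the nominal 10 s segment boundary. `samples` holds
# absolute amplitudes; the 0.01 threshold is an assumed value.

def trim_trailing_silence(samples, threshold=0.01):
    """Return the sample count up to the last value above the threshold."""
    for i in range(len(samples) - 1, -1, -1):
        if abs(samples[i]) > threshold:
            return i + 1
    return 0

# Six "speech" samples followed by near-silence: the subtitle covers only
# the first six samples, making it shorter than the full segment.
print(trim_trailing_silence([0.2, 0.5, 0.3, 0.4, 0.6, 0.2, 0.001, 0.0, 0.0]))
# → 6
```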