
The Difference Between Unified Recognition and Equal Segmentation

Unified Recognition:

This method yields the best recognition results but consumes the most system resources. With long videos or the large-v3 model, it may cause crashes.

During recognition, the entire audio file is passed to the model, which internally uses VAD for segmentation, recognition, and sentence breaking. By default, the audio is split at silences of 200ms or longer, and each sentence is capped at 3s. Both settings can be changed under Tools/Options → Advanced Options → VAD Settings.
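The doc doesn't name the VAD library, but if the tool builds on faster-whisper's built-in Silero VAD (an assumption, not confirmed here), the two defaults would map onto its options roughly as in this sketch; the model size, device, and file name are placeholders:

```python
from faster_whisper import WhisperModel

# Hypothetical mapping of the two defaults above onto faster-whisper's
# VAD options (library choice is an assumption, not confirmed by this doc).
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters=dict(
        min_silence_duration_ms=200,  # silence segmentation: split at >=200ms of silence
        max_speech_duration_s=3,      # maximum sentence length: cap each segment at 3s
    ),
)

for seg in segments:
    print(f"[{seg.start:6.2f} -> {seg.end:6.2f}] {seg.text}")
```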

Equal Segmentation:

As the name suggests, this method splits the audio file into fixed-length segments before passing them to the model. When the OpenAI model is used, equal segmentation is enforced regardless of whether you select "Unified Recognition" or "Pre-segmentation."

Each segment is 10s long, and cuts are made at silences of 500ms or longer. Both settings can be adjusted under Tools/Options → Advanced Options → VAD Settings.

Note: Although the segment length is set to 10s, not every subtitle will be exactly 10s long. The actual duration depends on how much speech each segment contains, and trailing silence is removed.
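A minimal sketch of this scheme using pydub (an assumption; the doc doesn't show how the tool implements it, and the file name and -40 dBFS silence threshold are placeholders): detect voiced ranges separated by at least 500ms of silence, pack them into chunks of at most 10s, and end each chunk at the last voiced range, which is why segments come out at most about 10s and trailing silence is dropped.

```python
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

MAX_LEN_MS = 10_000         # target segment length: 10s
MIN_SILENCE_MS = 500        # cut only at silences of 500ms or longer
SILENCE_THRESH_DBFS = -40   # assumed loudness threshold for "silence"

audio = AudioSegment.from_file("audio.wav")

# Voiced [start, end] ranges in ms, separated by >=500ms of silence.
voiced = detect_nonsilent(audio, min_silence_len=MIN_SILENCE_MS,
                          silence_thresh=SILENCE_THRESH_DBFS)

chunks = []
start = end = None
for seg_start, seg_end in voiced:
    if start is None:
        # A single voiced run longer than 10s is kept whole in this sketch;
        # the real tool may handle that case differently.
        start, end = seg_start, seg_end
    elif seg_end - start <= MAX_LEN_MS:
        end = seg_end  # still fits within the 10s budget
    else:
        # Cut at the last silence; trailing silence is not included.
        chunks.append(audio[start:end])
        start, end = seg_start, seg_end
if start is not None:
    chunks.append(audio[start:end])

for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i:03d}.wav", format="wav")
```

Because each chunk ends exactly where speech ends, the resulting segments are usually shorter than 10s, matching the note above.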