
Using LLMs to Re-segment Speech Recognition Results

To enhance the naturalness and accuracy of subtitle segmentation, pyVideoTrans, starting from v3.69, introduces an intelligent sentence segmentation feature based on LLMs (Large Language Models), aimed at optimizing your subtitle processing experience.

Background: Limitations of Traditional Segmentation

In v3.68 and earlier versions, we provided a "Re-segment" feature. After the initial speech recognition by faster-whisper, openai-whisper, or deepgram, it would call an Alibaba model to re-split and re-segment the generated subtitles.

Traditional method: Splitting and segmenting recognized subtitle results

However, the original "Re-segment" feature had some drawbacks:

  1. Inconvenient first-time use: Required downloading three large model files online from ModelScope.
  2. Suboptimal efficiency and results: Processing was slow, and the segmentation results were sometimes still unsatisfactory.

Although models like faster-whisper can output segmented results on their own, in practice the resulting sentences are often too long, too short, or awkwardly split.

Innovation: v3.69+ Introduces LLM-Powered Intelligent Segmentation

To address the above issues, starting from v3.69, we have upgraded the "Re-segment" feature to LLM Re-segmentation.

How it works: When you use faster-whisper (local), openai-whisper (local), or Deepgram.com for speech recognition, with the new LLM Re-segmentation feature enabled and the model, API Key (SK), and other settings correctly configured under Translation Settings -> OpenAI API & Compatible AI, the process runs as follows (a rough code sketch follows the list):

  1. pyVideoTrans will send the entire recognized text content, including word-level timestamps, to your configured LLM in one go.
  2. The LLM will intelligently segment the text according to the prompt instructions in the /videotrans/recharge-llm.txt file.
  3. After segmentation, the results will be reorganized into standard SRT subtitle format for subsequent translation or direct use.
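
The sketch below illustrates this flow in a simplified form. It is not pyVideoTrans's actual implementation: the word-timestamp JSON layout, the file names, and the expectation that the prompt makes the LLM return SRT-ready segments are all assumptions made for the example.

```python
# Illustrative sketch of the LLM re-segmentation flow (not pyVideoTrans's real code).
# Assumptions: words.json holds word-level timestamps from whisper/deepgram, and the
# prompt in /videotrans/recharge-llm.txt asks the model to return SRT-ready segments.
import json
from openai import OpenAI

client = OpenAI(api_key="YOUR_SK", base_url="https://api.openai.com/v1")

# e.g. [{"word": "Hello", "start": 0.12, "end": 0.48}, ...]
words = json.load(open("words.json", encoding="utf-8"))
prompt = open("videotrans/recharge-llm.txt", encoding="utf-8").read()

resp = client.chat.completions.create(
    model="gpt-4.1",   # the model configured in Translation Settings
    max_tokens=8192,   # must be large enough to hold the whole re-segmented result
    messages=[
        {"role": "system", "content": prompt},
        # the entire recognized text plus word-level timestamps, sent in one go
        {"role": "user", "content": json.dumps(words, ensure_ascii=False)},
    ],
)

# Assume the prompt makes the model return a JSON list of segments whose
# start/end fields are already SRT-formatted ("00:00:01,200" etc.).
segments = json.loads(resp.choices[0].message.content)
with open("output.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, 1):
        f.write(f"{i}\n{seg['start']} --> {seg['end']}\n{seg['text']}\n\n")
```

If the model's reply is cut off by the max output tokens limit, the returned result cannot be parsed, which is exactly the failure case described in the token-limit section below.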

Prerequisites for Enabling "LLM Re-segmentation"

To successfully enable and use this feature, please ensure the following conditions are met:

  1. Check to enable: In the software interface, select the LLM Re-segmentation option.

  2. Specify speech recognition model: The speech recognition engine must be one of the following three:

    • faster-whisper (local)
    • openai-whisper (local)
    • Deepgram.com
  3. Select voice splitting mode: Must be set to Process entire audio.

  4. Configure LLM API: In Menu -> Translation Settings -> OpenAI API & Compatible AI, correctly fill in your API Key (SK), select the model name, and set other related parameters (a quick way to verify this configuration is sketched right after this list).
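
If you are unsure whether the values entered in step 4 are valid, a quick standalone check like the one below can confirm that the endpoint, API Key (SK), and model name work together before you start a long recognition job. This uses the openai Python package; the base URL, key, and model name shown are placeholders, not pyVideoTrans defaults.

```python
# Standalone sanity check for the "OpenAI API & Compatible AI" settings.
# Replace the placeholders with the same base URL, SK, and model name you
# entered in pyVideoTrans; any reply at all means the configuration works.
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",                        # your API Key (SK)
    base_url="https://api.deepseek.com/v1",  # or any other OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="deepseek-chat",  # the model name selected in Translation Settings
    max_tokens=16,
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```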

Important Note: Token Length Limit

To reduce complexity, the current version sends the entire subtitle information recognized from the audio/video to the LLM for segmentation in a single pass, without using batch processing.

This means that if your audio/video file is too long, causing the segmented text to exceed the Max Output Tokens limit of the selected LLM model (many models default to 4096 tokens), the output will be truncated, leading to an error.

If LLM re-segmentation fails, the software will automatically fall back to using the segmentation results provided by faster-whisper/openai-whisper/deepgram itself.
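
If you want a rough idea in advance of whether a long recording will hit this limit, you can estimate the token count of the recognized text before running re-segmentation. The snippet below is only a heuristic (roughly 4 characters per token for English; CJK text is closer to 1-2 characters per token), not an exact tokenizer, and the file name is a placeholder.

```python
# Heuristic pre-check: will the recognized text fit within the model's
# max output tokens? ~4 characters per token is a rough English-only estimate.
def estimate_tokens(text: str) -> int:
    return len(text) // 4 + 1

recognized_text = open("recognized.txt", encoding="utf-8").read()
max_output_tokens = 4096  # the "Max output tokens" value set in pyVideoTrans

needed = estimate_tokens(recognized_text)
if needed > max_output_tokens:
    print(f"~{needed} tokens needed > {max_output_tokens} allowed: "
          "re-segmentation will likely be truncated; raise the limit or expect "
          "the whisper/deepgram segmentation to be used instead.")
else:
    print(f"~{needed} tokens needed: within the {max_output_tokens} limit.")
```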

How to Avoid Segmentation Failure Due to Exceeding Output Length?

You can raise the "Max output tokens" limit in pyVideoTrans to match the maximum output length actually supported by the LLM model you are using.

Note that this must be the max output tokens value, not the context length. Context length is usually very large (128k, 256k, 1M, etc.), while the max output tokens value is much smaller, typically 8k (8192) or 32k (32768).

In the Translation Settings -> OpenAI API & Compatible AI interface, find the Max output tokens setting, which defaults to 4096.

Default max output token is 4096, usually supported by all models

  • For example, if you are using the deepseek-chat (i.e., deepseek-v3) model, its supported maximum output length is 8k, i.e., 8 x 1024 = 8192 tokens. You can set this value to 8192.
  • OpenAI's gpt-4.1 series models support up to 32768 max output tokens. You can enter 32768.

Steps:

  1. Check the maximum output token count supported by your chosen LLM model (note, it must be max output tokens, not context tokens).
  2. Enter the queried value into pyVideoTrans's "Max output tokens" setting box (note: remove any commas from the number, as shown in the small example below).
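
For example, if the documentation lists the limit as "32,768", strip the comma before entering it; in code terms:

```python
# The settings box expects a plain integer, so strip any thousands separators
# copied from the model documentation.
value = int("32,768".replace(",", ""))
print(value)  # 32768
```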

Enter the maximum output token number supported by the selected model here

How to Check Max Output Tokens for Different Models?

1. OpenAI Series Models

You can view details for each model in the OpenAI official model documentation: https://platform.openai.com/docs/models

  • Click on the name of the model you plan to use to go to its details page.
  • Look for descriptions like "Max output tokens".

Click on a model you use to enter its details page

Find the maximum output tokens supported by this model, not the context window

2. Other OpenAI-Compatible Models

For other large model providers compatible with the OpenAI API, the maximum output tokens (not the context length) is usually listed in their official API documentation or model descriptions.

DeepSeek's max output token is 8k, i.e., 8192

Other providers are similar; just make sure the value you look up is the maximum output length (max output tokens), not the context length.

Important Reminder: Please make sure to find the maximum output tokens for the model you are using, not the context tokens, and enter it correctly into pyVideoTrans settings.

Remember to fill it in the correct place