Using LLMs to Re-segment Speech Recognition Results
To enhance the naturalness and accuracy of subtitle segmentation, pyVideoTrans, starting from v3.69, introduces an intelligent sentence segmentation feature based on LLMs (Large Language Models), aimed at optimizing your subtitle processing experience.
Background: Limitations of Traditional Segmentation
In v3.68 and earlier versions, we provided a "Re-segment" feature. After initial speech recognition by faster-whisper, openai-whisper, or deepgram, this feature would call an Alibaba model to re-split and segment the generated subtitles.
The original "Re-segment" feature also had some drawbacks:
- Inconvenient first-time use: Required downloading three large model files online from ModelScope.
- Suboptimal efficiency and results: Processing speed was slow, and the segmentation effect was sometimes still unsatisfactory.
Although models like faster-whisper can output segmented results themselves, in practical applications issues such as sentences being too long, too short, or awkwardly segmented often occur.
Innovation: v3.69+ Introduces LLM-Powered Intelligent Segmentation
To address the above issues, starting from v3.69, we have upgraded the "Re-segment" feature to LLM Re-segmentation.
How it works: when speech recognition is performed with faster-whisper (local), openai-whisper (local), or Deepgram.com, the LLM Re-segmentation option is enabled, and the model, API Key (SK), and other parameters are correctly configured under Translation Settings -> OpenAI API & Compatible AI, the process runs as follows:
- pyVideoTrans will send the entire recognized text content, including word-level timestamps, to your configured LLM in one go.
- The LLM will intelligently segment the text based on the prompt instructions in the /videotrans/recharge-llm.txt file.
- After segmentation, the results will be reorganized into standard SRT subtitle format for subsequent translation or direct use (a simplified sketch of this flow is shown below).
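The following is a minimal Python sketch of this flow, not the actual pyVideoTrans implementation: the resegment_words helper name and the assumed JSON reply shape are illustrative; only the OpenAI-compatible endpoint and the prompt file mentioned above come from the description.

```python
# Minimal sketch (not the pyVideoTrans source): send word-level timestamps to an
# OpenAI-compatible LLM and rebuild its reply into SRT text.
# Assumption: the LLM replies with a JSON list of {"start": sec, "end": sec, "text": "..."}.
import json
from openai import OpenAI  # pip install openai

def resegment_words(words, api_key, base_url, model, max_tokens=8192):
    """words: list of {"word": str, "start": float, "end": float} from the recognizer."""
    prompt = open("videotrans/recharge-llm.txt", encoding="utf-8").read()
    client = OpenAI(api_key=api_key, base_url=base_url)
    resp = client.chat.completions.create(
        model=model,
        max_tokens=max_tokens,  # must not exceed the model's max OUTPUT tokens
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": json.dumps(words, ensure_ascii=False)},
        ],
    )
    segments = json.loads(resp.choices[0].message.content)

    def srt_time(sec):  # seconds -> "HH:MM:SS,mmm"
        ms = int(round(sec * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    return "\n".join(
        f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n{seg['text']}\n"
        for i, seg in enumerate(segments, 1)
    )
```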
Prerequisites for Enabling "LLM Re-segmentation"
To successfully enable and use this feature, please ensure the following conditions are met:
- Check to enable: In the software interface, select the LLM Re-segmentation option.
- Specify the speech recognition model: The speech recognition engine must be one of the following three:
  - faster-whisper (local)
  - openai-whisper (local)
  - Deepgram.com
- Select the voice splitting mode: Must be set to Process entire audio.
- Configure the LLM API: In Menu -> Translation Settings -> OpenAI API & Compatible AI, correctly fill in your API Key (SK), select the model name, and set other related parameters.
Important Note: Token Length Limit
To reduce complexity, the current version sends the entire subtitle information recognized from the audio/video to the LLM for segmentation in a single pass, without using batch processing.
This means that if your audio/video file is too long, causing the segmented text to exceed the Max Output Tokens limit of the selected LLM model (many models default to 4096 tokens), the output will be truncated, leading to an error.
If LLM re-segmentation fails, the software will automatically fall back to the segmentation results produced by faster-whisper/openai-whisper/deepgram itself.
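A sketch of that fallback logic, reusing the resegment_words sketch above; segments_to_srt is likewise just an illustrative name, not a pyVideoTrans internal.

```python
# Sketch of the fallback described above (hypothetical helpers, not pyVideoTrans
# internals). A reply cut off by the output limit typically arrives as truncated
# JSON, so parsing it raises an error and the except branch takes over.
def build_subtitles(words, recognizer_segments, llm_config):
    try:
        return resegment_words(words, **llm_config)   # sketch shown earlier
    except Exception as err:                          # truncated output, network errors, ...
        print(f"LLM re-segmentation failed ({err}); using the recognizer's own segments")
        return segments_to_srt(recognizer_segments)   # hypothetical SRT converter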
How to Avoid Segmentation Failure Due to Exceeding Output Length?
You can adjust the maximum output token limit of the LLM model you are using according to its capabilities.
Note that it must be the max output tokens, not the context tokens. Context length is usually very large, such as 128k, 256k, or 1M, while the max output tokens value is much smaller, typically 8k (8192) or 32k (32768).
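To make the difference concrete, here is a small sketch with illustrative numbers only (roughly the shape of a deepseek-chat-style model; check your provider's documentation for real values): the prompt and the reply together must fit within the context window, while the reply alone can never exceed the max output tokens.

```python
# Illustrative numbers only; real limits come from your provider's documentation.
CONTEXT_WINDOW    = 64 * 1024   # prompt + reply together must fit in here
MAX_OUTPUT_TOKENS = 8 * 1024    # the reply alone can never exceed this

def request_fits(prompt_tokens: int, expected_reply_tokens: int) -> bool:
    return (prompt_tokens + expected_reply_tokens <= CONTEXT_WINDOW
            and expected_reply_tokens <= MAX_OUTPUT_TOKENS)

print(request_fits(20_000, 6_000))   # True: fits both limits
print(request_fits(20_000, 12_000))  # False: the reply exceeds the output limit
```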
In the Translation Settings -> OpenAI API & Compatible AI interface, find the Max output tokens setting, which defaults to 4096.
- For example, if you are using the deepseek-chat (i.e., deepseek-v3) model, its maximum output length is 8k, i.e., 8 x 1024 = 8192 tokens, so you can set this value to 8192.
- OpenAI's gpt-4.1 series models support up to 32768 max output tokens, so you can enter 32768.
Steps:
- Check the maximum output token count supported by your chosen LLM model (note: it must be max output tokens, not context tokens).
- Enter the queried value (note: remove commas from the number) into pyVideoTrans's "Max output tokens" setting box.
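A trivial sketch of that clean-up step, in case you copy a value such as "32,768" straight from a provider's documentation:

```python
# Values copied from provider docs often include a thousands separator;
# pyVideoTrans expects a plain integer such as 32768.
raw = "32,768"
max_output_tokens = int(raw.replace(",", ""))
print(max_output_tokens)  # 32768
```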
How to Check Max Output Tokens for Different Models?
1. OpenAI Series Models
You can view details for each model in the OpenAI official model documentation: https://platform.openai.com/docs/models
- Click on the name of the model you plan to use to go to its details page.
- Look for descriptions like "Max output tokens".
2. Other OpenAI-Compatible Models
For other large-model providers compatible with the OpenAI API, the maximum output tokens (not the context length) are usually listed in their official API documentation or model descriptions.
- DeepSeek (e.g., deepseek-chat or deepseek-reasoner): Consult their pricing or model description page, e.g.: https://platform.deepseek.com/api-docs/pricing
- SiliconFlow (硅基流动): Look it up in their model documentation: https://docs.siliconflow.cn/en/faqs/misc#2-about-max-tokens
- Alibaba Bailian (阿里百炼): Find parameter limits for specific models in the Model Square or model documentation: https://help.aliyun.com/en/model-studio/models
Others are similar; just make sure it's the maximum output length or maximum output token, not the context length.
Important Reminder: Please make sure to find the maximum output tokens for the model you are using, not the context tokens, and enter it correctly into pyVideoTrans settings.