
Adding subtitles to videos has become quite convenient thanks to automatic speech recognition (ASR). In particular, OpenAI's Whisper series of models excels at converting speech to text, making automatic subtitle generation practical.
However, extracting hard subtitles (subtitles embedded in the video frames) from videos still presents many challenges.
Videos are essentially composed of consecutive image frames. A common frame rate is 30fps (30 frames per second), meaning a 1-hour video contains 108,000 images. At higher frame rates such as 60fps the count doubles, and high-definition video makes each individual frame larger. Such a large volume of data poses a significant challenge to OCR processing.
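The frame-count arithmetic can be checked in a few lines. This is only an illustrative sketch; `frame_count` is a hypothetical helper, not part of the software. The 1fps figure corresponds to the one-frame-per-second sampling described under Technical Principles.

```python
def frame_count(duration_seconds, fps):
    """Number of frames in a clip of the given length at the given frame rate."""
    return int(duration_seconds * fps)


ONE_HOUR = 60 * 60  # seconds

print(frame_count(ONE_HOUR, 30))  # every frame of a 30fps video
print(frame_count(ONE_HOUR, 1))   # sampling one frame per second instead
```

Sampling one frame per second cuts the number of images sent to OCR by a factor of 30 for a 30fps source.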
Google's Gemini-2.0-flash model not only supports text generation but also video and image recognition and processing. It offers a generous daily free quota and can be used as an OCR tool.
Zhipu AI's glm-4v-flash model, from a China-based provider, is free and has strong image understanding capabilities, which also makes it suitable as an OCR tool. Although it currently supports only Chinese and English recognition, that is sufficient for most scenarios.
Hard Subtitle Extraction Software Based on Gemini and Zhipu AI
Download GVS Chinese-English Video Hard Subtitle Extraction Software (640MB)
Baidu Netdisk Download: https://pan.baidu.com/s/1SDKm5tWsr6dkajhsf8T5Ew?pwd=95i4
Github Download: https://github.com/jianchang512/stt/releases/download/0.0/GVS-v0.2-AI.7z
Software Usage Guide
Here are the steps to use the software:
Download and Extract: Download the software package, extract it, and double-click app.exe to run it.
Select Video: Click the button at the top of the software interface to select the video file from which you want to extract subtitles. Please ensure the video contains hard subtitles.
Select Subtitle Position: Choose the position of the subtitles in the video. The default is "Bottom," but you can also select "Top," "Middle," or "Entire" area.
Enter API Key:
You can enter the API key for Zhipu AI, which is free and directly accessible from mainland China.
Alternatively, you can enter the API key for Gemini AI, which offers 1500 free calls per day. However, using it from mainland China requires a VPN or proxy. You can enter multiple keys separated by half-width (English) commas.
You can register for free and obtain an API Key on the Zhipu AI platform: https://bigmodel.cn/usercenter/proj-mgmt/apikeys

For Gemini, you can get the API key here: https://aistudio.google.com/app/apikey
Select Model: Zhipu AI supports the free GLM-4V-FLASH model. Gemini AI supports the gemini-2.0-flash-exp and gemini-1.5-flash models.
If using Gemini AI, you need to enter the proxy IP and port, or enable the system proxy in your VPN software.
Start Extraction: Click the "Start" button. The progress and log information will be displayed in the text box at the bottom of the software window. After extraction is complete, an SRT subtitle file with the same name as the video will be generated in the same directory as the video file.
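Two details from the steps above, the comma-separated key list and the output file name, can be illustrated with a short sketch. The helper names `parse_keys` and `srt_path_for` are hypothetical, the keys are placeholders, and rotating through the keys in turn is an assumption about how multiple keys might be used, not a description of the software's internals.

```python
from itertools import cycle
from pathlib import Path


def parse_keys(raw):
    """Split a comma-separated key string, dropping blanks and stray spaces."""
    return [key.strip() for key in raw.split(",") if key.strip()]


def srt_path_for(video_path):
    """Return the SRT path next to the video, with the same base name."""
    return Path(video_path).with_suffix(".srt")


if __name__ == "__main__":
    # Placeholder keys; cycling through them is an assumed usage pattern.
    keys = cycle(parse_keys("key-one, key-two,,key-three"))
    print([next(keys) for _ in range(4)])
    print(srt_path_for("videos/lecture.mp4"))
```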

Technical Principles
The core steps for extracting hard subtitles using this software are as follows:
- Video Frame Splitting: First, use the FFmpeg tool to split the video into image frames at 1-second intervals. Choosing 1-second intervals instead of extracting every frame significantly reduces the number of images that need to be recognized. This also accounts for the fact that subtitles typically last for at least 1 second, and too many frames would increase the difficulty of deduplication.
- OCR Recognition: Send the split image frames to the AI model for OCR recognition to extract text from the images.
- Subtitle Deduplication: Since consecutive image frames may contain the same subtitle content, we use a sentence-transformers model to calculate the similarity between the currently recognized subtitle and the previous one. If the similarity exceeds 60%, the two subtitles are considered identical and only one is kept.
- Generate Subtitle File: Finally, the deduplicated subtitle texts are assembled according to their corresponding timestamps and saved as an SRT format subtitle file.
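The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the software's actual code: the FFmpeg arguments are one common way to save a frame per second, the OCR call is omitted entirely, and `difflib.SequenceMatcher` is swapped in for the sentence-transformers similarity so the example runs without downloading a model. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def ffmpeg_frame_command(video_path, out_dir):
    """Build (but do not run) an FFmpeg command saving one frame per second."""
    return ["ffmpeg", "-i", video_path, "-vf", "fps=1", f"{out_dir}/%05d.jpg"]


def similarity(a, b):
    """Stand-in for embedding similarity: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def merge_frames(frame_texts, threshold=0.6):
    """Merge per-second OCR results into (start, end, text) cues.

    frame_texts[i] is the text recognized in the frame for second i;
    an empty string means no subtitle was found in that frame.
    """
    cues = []
    for second, text in enumerate(frame_texts):
        text = text.strip()
        if not text:
            continue
        # Extend the previous cue only if it ended exactly where this frame
        # starts and the two texts are similar enough to count as duplicates.
        if cues and cues[-1][1] == second and similarity(cues[-1][2], text) > threshold:
            start, _, kept = cues[-1]
            cues[-1] = (start, second + 1, kept)
        else:
            cues.append((second, second + 1, text))
    return cues


def srt_timestamp(seconds):
    """Format whole seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},000"


def to_srt(cues):
    """Assemble numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Simulated OCR output for seconds 0..6, including small OCR variations.
    frames = ["", "Hello world", "Hello world!", "Hello wor1d", "", "Goodbye", "Goodbye"]
    print(" ".join(ffmpeg_frame_command("input.mp4", "frames")))
    print(to_srt(merge_frames(frames)))
```

In the sample data, the three "Hello world" variants collapse into a single cue spanning 00:00:01 to 00:00:04, and the blank frame at second 4 separates it from the "Goodbye" cue. Keeping the first recognized text for a merged cue is a design choice made here for simplicity; keeping the longest or most frequent variant would be equally plausible.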
