Automatic Speech Recognition (ASR) has made adding subtitles to videos convenient. OpenAI's Whisper models in particular excel at speech-to-text conversion, which makes automatic subtitle generation practical.
However, extracting existing hard subtitles (subtitles embedded in the video frame) from videos still poses several challenges.
A video is essentially a sequence of image frames. At a common frame rate of 30fps (30 frames per second), a 1-hour video contains 108,000 images, and higher frame rates push the count higher still. High-definition video does not add frames, but it does make each image larger. This volume of data is a severe test of OCR processing capacity.
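The frame-count arithmetic is easy to verify. The short Python sketch below compares processing every frame of a 30fps video with sampling one frame per second, the interval used later in this article. The function name `frame_count` is illustrative only.

```python
def frame_count(duration_seconds: int, fps: float) -> int:
    """Number of frames in a clip of the given duration at the given frame rate."""
    return int(duration_seconds * fps)


one_hour = 60 * 60  # seconds in one hour

full_rate = frame_count(one_hour, 30)  # every frame of a 30fps video
sampled = frame_count(one_hour, 1)     # one frame per second

print(full_rate)  # 108000
print(sampled)    # 3600
```

Sampling once per second cuts the OCR workload by a factor of 30 for a 30fps source.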
Google's Gemini-2.0-flash model supports not only text generation but also image and video understanding, and it comes with a generous free daily quota, so it can serve as an OCR tool.
Zhipu AI's glm-4v-flash model, from China, is free and has strong image understanding capabilities, so it can also be used as an OCR tool. It currently recognizes only Chinese and English, but that is sufficient for most scenarios.
We developed a hard subtitle extraction tool based on Gemini and Zhipu AI.
Download: GVS Chinese and English Video Hard Subtitle Extraction Software (640MB)
Baidu Netdisk Download: https://pan.baidu.com/s/1SDKm5tWsr6dkajhsf8T5Ew?pwd=95i4
Github Download: https://github.com/jianchang512/stt/releases/download/0.0/GVS-v0.2-AI.7z
Software Usage Guide
The following are the steps to use the software:
Download and Unzip: Download the software package, unzip it, and double-click app.exe to run it.
Select Video: Click the button at the top of the software interface and select the video file to extract subtitles from. Make sure the video contains hard subtitles.
Select Subtitle Position: Select the position of the subtitles in the video. The default is "Bottom". You can also select "Top", "Middle", or "All" areas.
Fill in API Key:
You can enter a Zhipu AI API key, which is free and can be used within China.
You can also enter a Gemini AI API key, which has a free quota of 1,500 calls per day but requires a proxy or VPN when used from China. Multiple keys can be entered, separated by half-width (English) commas.
A Zhipu AI API key can be obtained after free registration: https://bigmodel.cn/usercenter/proj-mgmt/apikeys
A Gemini API key can be obtained at https://aistudio.google.com/app/apikey
Select Model: Zhipu AI supports the free GLM-4V-FLASH model. Gemini AI supports the gemini-2.0-flash-exp and gemini-1.5-flash models.
If you use Gemini AI, you need to enter the proxy IP and port, or enable the system proxy in your VPN software.
Start Extraction: Click the "Start" button. The text box at the bottom of the software displays progress and log information. When extraction finishes, an SRT subtitle file with the same name as the video is generated in the directory where the video file is located.
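Two of the conventions in this guide, comma-separated API keys and an SRT file named after the video, can be illustrated with a few lines of Python. This is only a sketch: the helper names `parse_api_keys` and `srt_path_for` are hypothetical, the keys are placeholders, and rotating through the keys in round-robin order is an assumption about how multiple keys might be used, not a description of the software's internals.

```python
from itertools import cycle
from pathlib import Path


def parse_api_keys(raw: str) -> list[str]:
    """Split a comma-separated key string, dropping blanks and stray spaces."""
    return [key.strip() for key in raw.split(",") if key.strip()]


def srt_path_for(video_path: str) -> Path:
    """Path of an SRT file next to the video, with the same base name."""
    return Path(video_path).with_suffix(".srt")


keys = parse_api_keys("key-one, key-two,,key-three ")  # placeholder keys
key_rotation = cycle(keys)  # assumption: keys are used in round-robin order

print(keys)
print(next(key_rotation), next(key_rotation))
print(srt_path_for("videos/demo.mp4").name)
```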
Technical Principles
The software extracts hard subtitles in the following core steps:
- Video Frame Extraction: First, use the FFmpeg tool to split the video into image frames at 1-second intervals. Sampling once per second instead of extracting every frame greatly reduces the number of images that need to be recognized. In addition, since a subtitle usually stays on screen for at least 1 second, denser sampling would add little information while making deduplication harder.
- OCR Recognition: Send the extracted image frames to the AI model for OCR to extract the text in the images.
- Subtitle Deduplication: Consecutive image frames may contain the same subtitle. To avoid repetition, we use a sentence-transformers model to calculate the similarity between the currently recognized subtitle and the previous one. If the similarity exceeds 60%, the two are treated as the same subtitle and the duplicate is dropped.
- Generate Subtitle File: Finally, assemble the deduplicated subtitle text with the corresponding timestamps and save it as an SRT subtitle file.
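The frame extraction, deduplication, and SRT steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the software's actual code. To keep it runnable with the standard library alone, similarity is computed with `difflib.SequenceMatcher` as a stand-in for the sentence-transformers model; all function names are hypothetical; frame `i` is treated as covering second `i`; and merging only frames that are adjacent in time is a design choice made for this sketch. The FFmpeg command is only built, not executed, and the OCR step is replaced by a hard-coded list of sample results.

```python
from difflib import SequenceMatcher


def ffmpeg_frame_command(video: str, out_dir: str) -> list[str]:
    """Build an FFmpeg command that writes one JPEG per second of video."""
    return ["ffmpeg", "-i", video, "-vf", "fps=1", f"{out_dir}/%06d.jpg"]


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the article uses sentence-transformers."""
    return SequenceMatcher(None, a, b).ratio()


def merge_duplicates(texts, threshold=0.6, sim=similarity):
    """Collapse consecutive similar OCR results into (start, end, text) cues.

    texts[i] is the OCR text of the frame sampled at second i; an empty
    string means no subtitle was detected in that frame.
    """
    cues = []
    for second, text in enumerate(texts):
        text = text.strip()
        if not text:
            continue
        if cues:
            start, end, previous = cues[-1]
            # Extend the previous cue only if it is adjacent in time
            # and similar enough to count as the same subtitle.
            if end == second and sim(previous, text) > threshold:
                cues[-1] = (start, second + 1, previous)
                continue
        cues.append((second, second + 1, text))
    return cues


def srt_timestamp(seconds: int) -> str:
    """Format whole seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},000"


def to_srt(cues) -> str:
    """Render (start, end, text) cues as the contents of an SRT file."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


# Sample OCR output, one entry per second; "wor1d" mimics an OCR misread.
ocr_texts = ["Hello world", "Hello world", "Hello wor1d", "", "Goodbye", "Goodbye"]
cues = merge_duplicates(ocr_texts)
print(to_srt(cues))
```

With the sample input, the three "Hello world" variants collapse into one cue spanning seconds 0 to 3, and "Goodbye" becomes a second cue spanning seconds 4 to 6. To use sentence embeddings instead, pass a different function as the `sim` argument.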