

Adding subtitles to videos has become quite convenient thanks to Automatic Speech Recognition (ASR) technology. OpenAI's Whisper models in particular excel at speech-to-text conversion and have made automatic subtitle generation practical.

However, extracting existing hardcoded subtitles (subtitles embedded in the video frame) from videos still presents significant challenges.

A video is essentially a sequence of image frames. At a common frame rate of 30 fps (30 frames per second), a 1-hour video contains 108,000 images, and higher frame rates produce proportionally more. High-definition video does not add frames, but each frame is larger and costlier to process. Running OCR on this volume of data frame by frame is a severe challenge.
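To make the scale concrete, the short sketch below reproduces the frame-count arithmetic and compares it with sampling one frame per second, the approach described later under Technical Principles. The helper name `frame_count` is hypothetical and is used only for illustration.

```python
def frame_count(duration_seconds: int, fps: float) -> int:
    """Number of images obtained from a video of the given length and rate."""
    return round(duration_seconds * fps)


ONE_HOUR = 60 * 60  # seconds

every_frame = frame_count(ONE_HOUR, 30)  # every frame of a 30 fps video
one_per_second = frame_count(ONE_HOUR, 1)  # one sampled frame per second

print(every_frame)                     # 108000
print(one_per_second)                  # 3600
print(every_frame // one_per_second)   # 30 times fewer images to recognize
```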

Google's Gemini-2.0-flash model supports not only text generation but also image and video recognition. It also comes with a large free daily quota, which makes it practical to use as an OCR tool.

The glm-4v-flash model from Zhipu AI in China is free and has strong image understanding capabilities, which makes it a suitable OCR tool as well. It currently recognizes only Chinese and English, but that is sufficient for most scenarios.

We have developed a hardcoded subtitle extraction tool based on Gemini and Zhipu AI.

Download GVS - Chinese & English Hardcoded Video Subtitle Extractor (640MB)

Baidu Netdisk Download: https://pan.baidu.com/s/1SDKm5tWsr6dkajhsf8T5Ew?pwd=95i4

Github Download: https://github.com/jianchang512/stt/releases/download/0.0/GVS-v0.2-AI.7z

Software User Guide

Here are the steps to use the software:

  1. Download and Extract: Download the software package, extract it, and double-click app.exe to run.

  2. Select Video: Click the button at the top of the software interface and select the video file from which you want to extract subtitles. Please ensure that the video contains hardcoded subtitles.

  3. Select Subtitle Position: Select the position of the subtitles in the video. The default is "Bottom," but you can also choose "Top," "Middle," or "All" regions.

  4. Enter API Key:

    You can enter a Zhipu AI API key, which is free to use in China.

    You can also enter a Gemini API key, which has a free daily quota of 1,500 calls, although using it in China requires a proxy. To use multiple keys, separate them with half-width (ASCII) commas.

    You can register for free on the Zhipu AI platform and obtain an API Key: https://bigmodel.cn/usercenter/proj-mgmt/apikeys


    You can obtain a Gemini API key at this page: https://aistudio.google.com/app/apikey

  5. Select Model: Zhipu AI supports the free GLM-4V-FLASH model. Gemini AI supports the gemini-2.0-flash-exp and gemini-1.5-flash models.

  6. Set Proxy: If you are using Gemini AI, enter the proxy IP address and port, or enable the system proxy in your VPN software.

  7. Start Extraction: Click the "Start" button. The text box at the bottom of the software displays progress and log information. Once the extraction is complete, an SRT subtitle file with the same name as the video is generated in the directory where the video file is located.


Technical Principles

The core steps of this software for extracting hardcoded subtitles are as follows:

  1. Video Frame Extraction: First, FFmpeg samples the video at 1-second intervals and saves each sample as an image. Sampling once per second instead of extracting every frame greatly reduces the number of images to recognize. Since a subtitle typically stays on screen for at least 1 second, denser sampling would mostly produce duplicates and make deduplication harder.
  2. OCR Recognition: Each extracted image frame is sent to the AI model, which performs OCR and returns the text found in the image.
  3. Subtitle Deduplication: Since consecutive image frames may contain the same subtitle, we use a sentence-transformers model to calculate the similarity between the currently recognized subtitle and the previous one. If the similarity exceeds 60%, the two are considered the same subtitle and the duplicate is dropped.
  4. Generate Subtitle File: Finally, the deduplicated subtitle text is concatenated according to the corresponding timestamps and saved as an SRT subtitle file.
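
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool's actual source code. The function names (`ffmpeg_frame_command`, `merge_frames`, `to_srt`) are hypothetical. The OCR call is omitted and represented by a list of already recognized strings, one per sampled second. `difflib.SequenceMatcher` from the standard library stands in for the sentence-transformers similarity so that the example runs without downloading a model. The rule that only time-adjacent frames are merged is also an assumption.

```python
from __future__ import annotations

from difflib import SequenceMatcher
from pathlib import Path


def ffmpeg_frame_command(video: str, out_dir: str) -> list[str]:
    """Build an FFmpeg command that saves one frame per second as JPEG files."""
    pattern = str(Path(out_dir) / "%06d.jpg")
    return ["ffmpeg", "-i", video, "-vf", "fps=1", pattern]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1].

    Stand-in (assumption) for the embedding similarity that a
    sentence-transformers model would provide.
    """
    return SequenceMatcher(None, a, b).ratio()


def merge_frames(texts: list[str], threshold: float = 0.6) -> list[tuple[int, int, str]]:
    """Collapse per-second OCR results into (start_sec, end_sec, text) cues.

    texts[i] is the OCR text for the frame sampled at second i; an empty
    string means no subtitle was found in that frame.
    """
    cues: list[tuple[int, int, str]] = []
    for second, text in enumerate(texts):
        text = text.strip()
        if not text:
            continue
        if cues:
            start, end, previous = cues[-1]
            # Extend the previous cue only if it ends exactly where this
            # frame begins and the two texts are similar enough.
            if end == second and similarity(previous, text) > threshold:
                cues[-1] = (start, second + 1, previous)
                continue
        cues.append((second, second + 1, text))
    return cues


def srt_timestamp(seconds: int) -> str:
    """Format whole seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},000"


def to_srt(cues: list[tuple[int, int, str]]) -> str:
    """Render cues as SRT text: index, time range, subtitle, blank line."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Made-up OCR output for six sampled seconds, for illustration only.
    ocr_texts = ["Hello world", "Hello world", "Hello world!", "", "Goodbye", "Goodbye"]
    print(to_srt(merge_frames(ocr_texts)))
```

With the sample input, the three near-identical "Hello world" frames collapse into one cue covering seconds 0 to 3, and the two "Goodbye" frames into a second cue covering seconds 4 to 6. Keeping the first recognized text of each run is a design choice made here for simplicity; a real implementation might prefer the longest or most frequent variant.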