Adding subtitles to videos has become quite convenient with the help of Automatic Speech Recognition (ASR) technology. OpenAI's Whisper series models, in particular, excel in speech-to-text conversion, making automatic subtitle generation possible.

However, extracting existing hardcoded subtitles (subtitles embedded in the video frames) still presents several challenges.

A video is essentially a sequence of image frames. At a common frame rate of 30fps (30 frames per second), a 1-hour video contains 108,000 images, and higher frame rates produce even more. High-definition video does not add frames, but it does make each frame larger. This massive amount of data is a severe test of OCR processing capacity.
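To make the scale concrete, the frame count is simply the duration multiplied by the frame rate. The short sketch below reproduces the 108,000 figure; the function name `frame_count` is ours and is used only for illustration.

```python
def frame_count(duration_seconds: float, fps: float) -> int:
    """Number of frames in a video of the given length and frame rate."""
    return round(duration_seconds * fps)


one_hour = 60 * 60  # seconds

print(frame_count(one_hour, 30))  # 108000 frames at 30fps
print(frame_count(one_hour, 60))  # 216000 frames at 60fps
print(frame_count(one_hour, 1))   # 3600 frames when sampling one frame per second
```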

Google's Gemini-2.0-flash model handles image and video recognition in addition to text generation, and it comes with a generous free daily quota, so it can serve as an OCR tool.

Zhipu AI's glm-4v-flash model, from China, is free and has strong image understanding capabilities, so it can also be used as an OCR tool. Although it currently recognizes only Chinese and English, that is sufficient for most scenarios.

We have developed hardcoded subtitle extraction software based on Gemini and Zhipu AI.

Download GVS Chinese and English Video Hardcoded Subtitle Extraction Software (640MB)

Baidu Netdisk Download: https://pan.baidu.com/s/1SDKm5tWsr6dkajhsf8T5Ew?pwd=95i4

Github Download: https://github.com/jianchang512/stt/releases/download/0.0/GVS-v0.2-AI.7z

Software Usage Guide

Here are the steps for using the software:

  1. Download and Unzip: Download the software package, unzip it, and double-click app.exe to run it.

  2. Select Video: Click the button at the top of the software interface and select the video file from which you want to extract subtitles. Make sure the video actually contains hardcoded subtitles.

  3. Select Subtitle Position: Select the position of the subtitles in the video. The default is "Bottom," but you can also select "Top," "Middle," or "All" areas.

  4. Fill in API Key:

    You can enter a Zhipu AI API key, which is free to use in China.

    You can also enter a Gemini AI API key. It allows 1,500 free calls per day, but you need a proxy to use it in China. You can enter multiple keys separated by English (half-width) commas.

    You can register on the Zhipu AI platform for free and obtain an API key here: https://bigmodel.cn/usercenter/proj-mgmt/apikeys

    You can go to this page to get Gemini's API Key: https://aistudio.google.com/app/apikey

  5. Select Model: Zhipu AI supports the free GLM-4V-FLASH model. Gemini AI supports the gemini-2.0-flash-exp and gemini-1.5-flash models.

  6. Set Proxy: If you are using Gemini AI, fill in the proxy IP and port, or enable the system proxy in your VPN software.

  7. Start Extraction: Click the "Start" button. Progress and log information will be displayed in the text box at the bottom of the software window. When the extraction is complete, an SRT subtitle file with the same name as the video will be generated in the directory where the video file is located.

Technical Principles

The core steps of this software for extracting hardcoded subtitles are as follows:

  1. Video Frame Cutting: First, use the FFmpeg tool to cut the video into image frames at 1-second intervals. Sampling one frame per second instead of every frame greatly reduces the number of images that need to be recognized (3,600 instead of 108,000 for a 1-hour video at 30fps). Because a subtitle usually stays on screen for at least 1 second, this interval rarely misses a line, while denser sampling would mainly produce more duplicates to remove.
  2. OCR Recognition: Send the extracted image frames to the AI model for OCR recognition to extract the text in the images.
  3. Subtitle Deduplication: Consecutive image frames may contain the same subtitle, so we use a sentence-transformers model to calculate the similarity between the currently recognized subtitle and the previous one. If the similarity exceeds 60%, the two are treated as the same subtitle and only one is kept.
  4. Generate Subtitle File: Finally, the deduplicated subtitle text is combined with its corresponding timestamps and saved as an SRT format subtitle file.
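The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software's actual code: all function names are hypothetical, the OCR call is left out because it depends on the chosen provider's API, and the standard library's `difflib.SequenceMatcher` stands in for the sentence-transformers similarity so that the sketch runs without downloading a model. Merging only frames that are adjacent in time is also our assumption.

```python
from difflib import SequenceMatcher


def ffmpeg_command(video_path: str, out_dir: str) -> list[str]:
    """Build an FFmpeg command that saves one frame per second as JPEG files."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", "fps=1",          # keep one frame per second
        f"{out_dir}/%06d.jpg",   # 000001.jpg, 000002.jpg, ...
    ]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (stand-in for embedding similarity)."""
    return SequenceMatcher(None, a, b).ratio()


def merge_frames(frame_texts: list[str], threshold: float = 0.6):
    """Collapse per-second OCR results into (start, end, text) cues.

    frame_texts[i] is the OCR text for second i; "" means no subtitle found.
    """
    cues = []
    for second, text in enumerate(frame_texts):
        text = text.strip()
        if not text:
            continue
        # Extend the previous cue if it ends here and is similar enough.
        if cues and cues[-1][1] == second and similarity(cues[-1][2], text) > threshold:
            start, _, first_text = cues[-1]
            cues[-1] = (start, second + 1, first_text)
        else:
            cues.append((second, second + 1, text))
    return cues


def srt_timestamp(seconds: int) -> str:
    """Format whole seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},000"


def to_srt(cues) -> str:
    """Render cues as numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)


# Example: five seconds of (made-up) OCR output, including a slightly different reading.
frames = ["Hello world", "Hello world!", "", "How are you?", "How are you?"]
cues = merge_frames(frames)
print(cues)  # [(0, 2, 'Hello world'), (3, 5, 'How are you?')]
print(to_srt(cues))
```

Comparing by similarity rather than exact equality matters because OCR output for the same on-screen subtitle can vary slightly from frame to frame, as the "Hello world" and "Hello world!" pair shows.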