Gemini is a powerful multimodal AI model that can handle text, images, audio, and video. It can be used for free on the web with virtually no restrictions, although in regions where Google services are unavailable you must use a VPN.
Gemini is well-suited for speech-to-text tasks. It supports many languages, including some less common ones, and the recognition accuracy is quite good.
If you want Gemini to directly generate SRT subtitle files, you need to use specific prompts. Below is a prompt that you can copy and use directly to have Gemini transcribe and output SRT subtitles for you.
Speech-to-Text Prompt
You are a professional subtitle transcription assistant. Your task is to transcribe the file I provide into text and format the transcription result as a subtitle file in the standard SRT (SubRip) format. The specific requirements are as follows:
## Each subtitle block must be strictly output according to the following structure:
[Line number]
[Timecode line]
[Text line]
[Blank line]
**Explanation of the structure**
- [Line number] is the sequence number of the subtitle block, starting from 1 and incrementing, such as 1, 2, etc.
- [Timecode line] is the timestamp, in the format HH:MM:SS,FFF --> HH:MM:SS,FFF, indicating the start and end time of the subtitle (FFF represents 3-digit milliseconds, such as 000 to 999). If you cannot accurately calculate the time, you can reasonably estimate it based on the audio content, ensuring the time interval is logically reasonable.
- [Text line] is the transcribed text content.
- [Blank line] is the separator between subtitle blocks, ensuring there is a blank line after each subtitle block.
## Restrictions
When outputting, you must strictly adhere to the above format, do not omit any part, and do not add extra text or comments.
Keep the duration of each subtitle block between 3 and 15 seconds where possible, splitting blocks naturally according to speaking rate and meaning.
Now, please transcribe according to the file I provide and output the subtitle content in the above format.
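The model does not always follow the format exactly, so it can be worth checking the returned text before saving it as an .srt file. The following is a minimal sketch of such a check, not part of the prompt itself. The function name `check_srt` and the specific checks (sequence numbers, timecode format, end after start) are assumptions chosen for illustration.

```python
import re

# Timecode line as described in the prompt: HH:MM:SS,FFF --> HH:MM:SS,FFF
TIMECODE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)


def to_ms(h, m, s, ms):
    """Convert timecode fields (as strings) to integer milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def check_srt(text):
    """Return a list of human-readable problems found in SRT text (hypothetical helper)."""
    problems = []
    # Subtitle blocks are separated by one or more blank lines.
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    for expected, block in enumerate(blocks, start=1):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            problems.append(f"block {expected}: expected at least 3 lines")
            continue
        if lines[0].strip() != str(expected):
            problems.append(f"block {expected}: unexpected sequence number {lines[0]!r}")
        match = TIMECODE.match(lines[1].strip())
        if not match:
            problems.append(f"block {expected}: malformed timecode {lines[1]!r}")
            continue
        fields = match.groups()
        start, end = to_ms(*fields[:4]), to_ms(*fields[4:])
        if end <= start:
            problems.append(f"block {expected}: end time is not after start time")
    return problems
```

An empty list means the output passed these basic checks. It says nothing about whether the timestamps actually match the audio.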
How to Use
Using Gemini requires a VPN
- Open the Gemini website and log in:
https://aistudio.google.com/app
- Select the model on the right. Gemini 2.0 Flash is sufficient, but choosing a "Thinking" model will give better results.
- Enter the prompt and upload the file, as shown below:
The result after transcription is as follows, which looks pretty good:
Extensions
If you need translated subtitles, you can also ask Gemini in the prompt to translate the subtitles into the target language of your choice, or to output bilingual subtitles with the original and the translation together.
Shortcomings
Gemini's biggest shortcoming is that its timestamps are not very accurate. Hopefully, this will be improved in future versions.
For now, the only workaround is to use VAD (voice activity detection) to split the audio into sentence-level segments, transcribe each segment individually, and then assemble the results back into an SRT file, taking the timestamps from the segment boundaries. Doing this by hand is too inefficient.
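The assembly step of that workaround can be sketched as follows. This is only an illustration under stated assumptions: the VAD pass and the per-segment transcription are assumed to have already happened, and the input is assumed to be a list of `(start_ms, end_ms, text)` tuples. The names `format_timestamp` and `segments_to_srt` are hypothetical and are not taken from any particular tool.

```python
def format_timestamp(ms):
    """Format integer milliseconds as an SRT timestamp, HH:MM:SS,FFF."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def segments_to_srt(segments):
    """Build SRT text from (start_ms, end_ms, text) tuples.

    The timestamps come from the VAD segment boundaries rather than
    from the model, so they do not depend on the model's time estimates.
    Segments whose transcription is empty are skipped.
    """
    blocks = []
    index = 0
    for start_ms, end_ms, text in segments:
        text = text.strip()
        if not text:
            continue
        index += 1
        timecode = f"{format_timestamp(start_ms)} --> {format_timestamp(end_ms)}"
        blocks.append(f"{index}\n{timecode}\n{text}\n")
    # Each block already ends with a newline; joining with another
    # newline produces the blank separator line between blocks.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Example segments; in practice these would come from a VAD tool
    # plus one transcription request per segment.
    demo = [
        (500, 3200, "Hello, and welcome."),
        (3600, 8100, "Today we look at subtitles."),
    ]
    print(segments_to_srt(demo))
```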
Instead, it is recommended to use the Audio and Video to Subtitles feature in the free tool pyVideoTrans and select Gemini AI. The tool performs these steps automatically, so you only need to select the audio or video files you want to transcribe.
Download address: https://pyvideotrans.com