Gemini is a multimodal AI model that handles text, images, audio, and video. It can be used for free on the web with almost no restrictions, except that you may need a VPN to reach it, depending on your region.
Gemini is well-suited for speech-to-text tasks. It supports many languages, including some lesser-known ones, and the recognition accuracy is quite good.
If you want Gemini to directly generate SRT subtitle files, you need to use specific prompts. Below is a prompt that you can copy and use directly to have Gemini transcribe and output SRT subtitles.
Speech Transcription Prompt
You are a professional subtitle transcription assistant. Your task is to transcribe the file I provide into text and format the transcription result as a subtitle file that complies with the SRT (SubRip) format. The specific requirements are as follows:
## Each subtitle block must strictly follow the following structure for output:
[Line number]
[Time line]
[Text line]
[Blank line]
**Explanation of the structure**
- [Line number] is the sequence number of the subtitle block, starting from 1 and incrementing, such as 1, 2, etc.
- [Time line] is the timestamp, in the format HH:MM:SS,FFF --> HH:MM:SS,FFF, indicating the start and end time of the subtitle (FFF represents 3-digit milliseconds, such as 000 to 999). If you cannot precisely calculate the time, you can reasonably estimate based on the audio content, ensuring that the time interval is logically sound.
- [Text line] is the transcribed text content.
- [Blank line] is the separator between subtitle blocks, ensuring that there is a blank line after each subtitle block.
## Restrictions
When outputting, you must strictly adhere to the above format, do not omit any part, and do not add extra text or comments.
The duration of each subtitle should be controlled between 3-15 seconds as much as possible, depending on the speech rate and semantic natural segmentation.
Now, please transcribe based on the file I provide and output the subtitle content in the above format.
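(The prompt ends at the line above.) Because the model does not always follow the format exactly, it can help to check its reply before saving it as a .srt file. The following is a minimal sketch in Python, not part of the original workflow: the names `parse_srt`, `to_ms`, and `TIME_LINE` are made up for illustration, and the checks cover only the structure described in the prompt (sequence number, time line, text, blank-line separator).

```python
import re

# Matches a time line of the form HH:MM:SS,FFF --> HH:MM:SS,FFF
TIME_LINE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)


def to_ms(h, m, s, ms):
    """Convert timestamp fields (given as strings) to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(text):
    """Parse SRT text into a list of (index, start_ms, end_ms, text) tuples.

    Raises ValueError on the first block that breaks the expected structure.
    """
    cues = []
    # Subtitle blocks are separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", text.strip().replace("\r\n", "\n"))
    for expected, block in enumerate(blocks, start=1):
        lines = block.split("\n")
        if len(lines) < 3:
            raise ValueError(f"block {expected}: expected at least 3 lines")
        if lines[0].strip() != str(expected):
            raise ValueError(f"block {expected}: bad sequence number {lines[0]!r}")
        match = TIME_LINE.match(lines[1].strip())
        if not match:
            raise ValueError(f"block {expected}: bad time line {lines[1]!r}")
        fields = match.groups()
        start, end = to_ms(*fields[:4]), to_ms(*fields[4:])
        if end <= start:
            raise ValueError(f"block {expected}: end time is not after start time")
        cues.append((expected, start, end, "\n".join(lines[2:])))
    return cues
```

If `parse_srt` raises an error, the simplest fix is usually to ask Gemini to regenerate the output rather than to repair it by hand.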
How to use
You need to have access to a VPN to use Gemini.
- Open the Gemini website and log in: https://aistudio.google.com/app
- Select the model on the right. Gemini 2.0 Flash is sufficient. Of course, choosing a "Thinking" model with a thinking process will yield better results.
- Enter the prompt and upload the file, as shown below.
The result after transcription is as follows, which looks pretty good.
Extension
If you need translated subtitles, you can add an instruction to the prompt asking Gemini to translate them into the target language, or to output bilingual subtitles for comparison.
Shortcomings
Gemini's biggest shortcoming is that its timestamps are not very accurate. This may improve in future versions.
For now, the only workaround is to use VAD (voice activity detection) to split the audio into sentences before transcription, transcribe each segment individually, and then reassemble the results into an SRT file. Doing this by hand is far too inefficient.
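The reassembly step of that workaround can be sketched in Python as follows. This is an illustration under stated assumptions, not the method any particular tool uses: `segments` is assumed to be a list of (start, end) times in milliseconds produced by some VAD step, and `transcribe` is a hypothetical callable that returns the text for one segment (for example, by sending that slice of audio to Gemini). Since the timestamps come from the VAD segments, the model only has to supply the text.

```python
def format_timestamp(ms):
    """Format a millisecond count as HH:MM:SS,FFF."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def assemble_srt(segments, transcribe):
    """Build SRT text from VAD segments.

    segments:   list of (start_ms, end_ms) pairs from a VAD step (assumed input).
    transcribe: hypothetical callable (start_ms, end_ms) -> str returning the
                text spoken in that slice of the audio.
    """
    blocks = []
    for start_ms, end_ms in segments:
        text = transcribe(start_ms, end_ms).strip()
        if not text:
            continue  # skip segments where no speech was recognized
        index = len(blocks) + 1  # sequence numbers start at 1
        time_line = f"{format_timestamp(start_ms)} --> {format_timestamp(end_ms)}"
        blocks.append(f"{index}\n{time_line}\n{text}\n")
    # Joining with a newline leaves one blank line between blocks.
    return "\n".join(blocks)
```

Segments that come back empty are dropped so that the sequence numbers stay consecutive.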
It is recommended to use the Audio/Video to Subtitles function in the free tool pyVideoTrans and select Gemini AI. The tool performs these steps automatically; you only need to select the audio or video to be transcribed.
Download address: https://pyvideotans.com