Gemini is a powerful AI model that can process many types of content, including text, images, audio, and video. The web version is free to use with almost no restrictions, apart from the need for a VPN.
Gemini is well suited to speech-to-text conversion. It supports many languages, including some less common ones, and the recognition quality is quite good.
To have Gemini generate SRT subtitles directly, you need a specific prompt. The prompt below can be copied and used as-is to have Gemini transcribe a file and output SRT subtitles.
Speech-to-Text Prompt
You are a professional subtitle transcription assistant. Your task is to transcribe the files I provide into text and format the transcription results as SRT (SubRip) subtitles. The specific requirements are as follows:
## Each subtitle block must be output strictly according to the following structure:
[Line Number]
[Time Line]
[Text Line]
[Empty Line]
**Explanation of the structure**
- [Line Number] is the sequence number of the subtitle block, starting from 1 and incrementing, such as 1, 2, etc.
- [Time Line] is the timestamp in the format HH:MM:SS,FFF --> HH:MM:SS,FFF, indicating the start and end time of the subtitle (FFF is the three-digit millisecond value, from 000 to 999). If you cannot determine the time accurately, estimate it from the audio content so that the time intervals remain logically consistent.
- [Text Line] is the transcribed text content.
- [Empty Line] is the separator between subtitle blocks, ensuring that there is an empty line after each subtitle block.
## Restrictions
When outputting, you must strictly adhere to the above format, do not omit any part, and do not add extra text or comments.
The duration of each subtitle should be between 3 and 15 seconds, depending on the speech rate and natural semantic breaks.
Now, please transcribe according to the file I provide and output the subtitle content in the above format.
How to Use
Using Gemini requires a VPN
- Open the Gemini website and log in:
https://aistudio.google.com/app
- Select the model on the right. Gemini 2.0 Flash is fine, but choosing a model with "Thinking" for a thought process may yield better results.
- Enter the prompt and upload the file.
The transcription result looks pretty good.
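Before saving the output as an .srt file, it can be worth checking that Gemini actually followed the structure the prompt asks for. The snippet below is a minimal sketch of such a check, not part of Gemini or any subtitle tool: the function name `check_srt` and its parameters are made up for illustration, and the 3-15 second limits simply mirror the prompt above.

```python
import re

# Matches a time line such as "00:00:01,000 --> 00:00:04,500".
TIME_LINE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)


def to_ms(h, m, s, ms):
    """Convert the four timestamp fields to integer milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def check_srt(text, min_seconds=3, max_seconds=15):
    """Return a list of human-readable problems found in an SRT string.

    Hypothetical helper: the duration limits mirror the prompt, not any
    requirement of the SRT format itself.
    """
    problems = []
    # Subtitle blocks are separated by one or more empty lines.
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    if not blocks:
        return ["no subtitle blocks found"]
    for expected, block in enumerate(blocks, start=1):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            problems.append(f"block {expected}: fewer than 3 lines")
            continue
        if lines[0].strip() != str(expected):
            problems.append(f"block {expected}: sequence number is {lines[0]!r}")
        match = TIME_LINE.match(lines[1].strip())
        if not match:
            problems.append(f"block {expected}: malformed time line {lines[1]!r}")
            continue
        fields = match.groups()
        duration = to_ms(*fields[4:]) - to_ms(*fields[:4])
        if duration <= 0:
            problems.append(f"block {expected}: end is not after start")
        elif not (min_seconds * 1000 <= duration <= max_seconds * 1000):
            problems.append(
                f"block {expected}: duration {duration} ms is outside "
                f"{min_seconds}-{max_seconds} s"
            )
    return problems
```

An empty list means the text passed these checks. It says nothing about whether the timestamps match the audio, only that the blocks are well formed.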
Extensions
If you need translated subtitles, you can also ask Gemini to translate them into another language or to output bilingual subtitles.
Shortcomings
The biggest drawback of Gemini is that its timestamps are not very accurate. Hopefully this will improve in future versions.
For now, the only workaround is to segment the audio with VAD (voice activity detection) before transcription, transcribe each segment separately, and then reassemble the results into an SRT file. Doing this by hand is too inefficient.
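To give a rough idea of what the reassembly step involves, here is a minimal sketch in Python. It assumes the VAD and transcription steps have already produced a list of (start_seconds, end_seconds, text) tuples, so the timestamps come from the VAD boundaries rather than from Gemini. The function names are made up for illustration, and this is not necessarily how any particular tool implements it.

```python
def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,FFF (FFF = milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples.

    The tuples are assumed to come from a separate VAD + transcription
    step that is not shown here. Segments with empty text (for example,
    silence or music) are skipped, and blocks are numbered from 1.
    """
    blocks = []
    for start, end, text in segments:
        text = text.strip()
        if not text:
            continue
        index = len(blocks) + 1
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
    # Each block already ends with a newline; joining with another newline
    # produces the empty separator line between blocks.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [(0.0, 3.2, "Hello"), (3.2, 5.0, ""), (5.0, 9.75, "World")]
    print(segments_to_srt(demo))
```

Because the start and end times are taken from the audio segmentation, the resulting subtitles do not depend on the model's own time estimates.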
A more practical option is the free tool pyVideoTrans: use its Audio and Video to Subtitles function and select Gemini AI, and it will complete these steps automatically. You only need to select the audio or video files to transcribe.
Download address: https://pyvideotans.com