
Gemini is a powerful AI model capable of processing various types of content, including text, images, audio, and video. It is free to use on the web with few restrictions, although a VPN is needed to access it.

Gemini is well-suited for speech-to-text tasks, supporting a wide range of languages, including some less common ones, and delivering quite good recognition results.

If you want Gemini to directly generate SRT subtitle files, you need to use specific prompts. Below is a prompt that you can copy and use to have Gemini transcribe and output SRT subtitles for you.

Speech Transcription Prompt

You are a professional subtitle transcription assistant. Your task is to transcribe the file I provide into text and format the transcription as a standard SRT (SubRip) subtitle file. Specific requirements are as follows:

## Each subtitle block must strictly follow the structure below:

[Line Number]
[Time Line]
[Text Line]
[Blank Line]

**Explanation of the Structure**
- [Line Number] is the sequence number of the subtitle block, starting from 1 and incrementing, e.g., 1, 2, etc.
- [Time Line] is the timestamp, formatted as HH:MM:SS,FFF --> HH:MM:SS,FFF, indicating the start and end time of the subtitle (FFF represents 3-digit milliseconds, e.g., 000 to 999). If you cannot calculate the time precisely, estimate it reasonably based on the audio content to ensure logical time intervals.
- [Text Line] is the transcribed text content.
- [Blank Line] is the separator between subtitle blocks, ensuring there is a blank line after each subtitle block.

## Restrictions
When outputting, strictly adhere to the above format. Do not omit any part or add extra text or comments.
The duration of each subtitle block should ideally be between 3 and 15 seconds, segmented naturally based on speech speed and meaning.

Now, please transcribe the file I provide and output the subtitle content in the format described above.
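Because Gemini does not always follow the requested structure exactly, it can help to check the returned text before saving it as an .srt file. The following is a minimal Python sketch of such a check, assuming the output has been pasted into a string. The names `check_srt`, `to_ms`, and `TIME_LINE` are illustrative only and are not part of Gemini or pyVideoTrans.

```python
import re

# Matches an SRT time line such as "00:00:01,000 --> 00:00:04,500".
TIME_LINE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)


def to_ms(h, m, s, ms):
    """Convert hour/minute/second/millisecond strings to total milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def check_srt(text):
    """Return a list of problems found in SRT text; an empty list means it passed."""
    problems = []
    # Subtitle blocks are separated by one or more blank lines.
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    for expected, block in enumerate(blocks, start=1):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            problems.append(f"block {expected}: expected number, time line and text")
            continue
        if lines[0].strip() != str(expected):
            problems.append(f"block {expected}: sequence number is {lines[0].strip()!r}")
        match = TIME_LINE.match(lines[1].strip())
        if not match:
            problems.append(f"block {expected}: malformed time line {lines[1]!r}")
            continue
        start = to_ms(*match.groups()[:4])
        end = to_ms(*match.groups()[4:])
        if end <= start:
            problems.append(f"block {expected}: end time is not after start time")
    return problems
```

Calling `check_srt(output)` on the text returned by Gemini gives an empty list when every block has a sequence number, a well-formed time line, and at least one text line. It only checks the format; it cannot tell whether the timestamps match the audio.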

How to Use

Using Gemini requires a VPN

  1. Open the Gemini website and log in: https://aistudio.google.com/app
  2. On the right, select the model. Gemini 2.0 Flash is sufficient, but choosing the Thinking model with reasoning capabilities may yield better results.
  3. Enter the prompt and upload the file.

After transcription, Gemini returns the subtitles as SRT text, and the result is generally quite good.

Extensions

If you need to translate the subtitles, you can also include a request in the prompt to translate the subtitles into a specific language or output bilingual subtitles side by side.

Limitations

The biggest drawback of Gemini is the inaccuracy of timestamps. This issue may be resolved with future updates and optimizations.

Currently, to address this problem, you can use VAD (Voice Activity Detection) to segment the audio into sentences before transcription, transcribe each segment individually, and then reassemble the results into an SRT file. However, doing this manually is inefficient.
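The reassembly step can be sketched in Python as follows. This is a minimal illustration, assuming the VAD step has already produced start and end times in seconds for each segment and that each segment has been transcribed separately. The function names `format_timestamp` and `segments_to_srt` are hypothetical, and this is not how pyVideoTrans itself is implemented.

```python
def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm for an SRT time line."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples.

    The start and end times come from the VAD step, so they do not depend
    on timestamps estimated by the model.
    """
    blocks = []
    index = 1
    for start, end, text in sorted(segments, key=lambda seg: seg[0]):
        text = text.strip()
        if not text:
            continue  # skip segments where nothing was transcribed
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
        index += 1
    # Joining with a newline leaves one blank line between blocks.
    return "\n".join(blocks)
```

Since the timing comes from the audio segmentation and Gemini only supplies the text, the timestamps in the resulting file are as accurate as the VAD boundaries.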

It is recommended to use the Audio/Video to Subtitle feature in the free tool pyVideoTrans and select Gemini AI. This will automate the entire process, and you only need to choose the audio or video file to transcribe.

Download link: https://pyvideotans.com