
Gemini is a powerful AI model that can handle several content types, including text, images, audio, and video. It can be used for free on the web with virtually no limits; the only catch is that you need a VPN to reach it.

Gemini excels at speech-to-text conversion, supporting numerous languages, including some less common ones, with remarkably good recognition results.

If you want Gemini to directly generate SRT subtitle files, you'll need to use specific prompts. Below is a prompt you can copy and use to have Gemini transcribe your audio and output SRT subtitles.

Speech-to-Text Prompt

You are a professional subtitle transcription assistant. Your task is to transcribe the files I provide into text and format the transcription results as a standard SRT (SubRip) subtitle file. Specific requirements are as follows:

## Each subtitle block must strictly adhere to the following structure:

[Line Number]
[Timecode Line]
[Text Line]
[Blank Line]

**Explanation of the structure**
- [Line Number] is the sequence number of the subtitle block, starting from 1 and incrementing, such as 1, 2, etc.
- [Timecode Line] is the timestamp in the format HH:MM:SS,FFF --> HH:MM:SS,FFF, indicating the start and end time of the subtitle (FFF is the three-digit millisecond field, from 000 to 999). If you cannot calculate the time precisely, estimate it reasonably from the audio content, making sure the start and end times are consistent and in order.
- [Text Line] is the transcribed text content.
- [Blank Line] is the separator between subtitle blocks, ensuring each block is followed by a blank line.

## Restrictions
In your output, you must strictly adhere to the above format, without omitting any parts or adding extra text or comments.
Each subtitle should ideally last between 3 and 15 seconds, adjusted to the speech rate and natural sentence boundaries.

Now, please transcribe according to the files I provide and output the subtitle content in the above format.
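Because the prompt asks for a strict block structure, it can be worth checking Gemini's reply before saving it as an .srt file. The following is a minimal Python sketch and not part of the original workflow; the function name `check_srt` and the way the 3-15 second guideline is reported are illustrative assumptions.

```python
import re

# One SRT timecode line: HH:MM:SS,mmm --> HH:MM:SS,mmm
TIMECODE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)


def to_ms(h, m, s, ms):
    """Convert timecode fields (strings) to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def check_srt(text, min_s=3.0, max_s=15.0):
    """Return a list of human-readable problems found in SRT text.

    min_s / max_s mirror the 3-15 second guideline in the prompt. It is
    a soft target, so violations are only reported, never raised.
    """
    problems = []
    # Subtitle blocks are separated by one or more blank lines.
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    for expected, block in enumerate(blocks, start=1):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            problems.append(f"block {expected}: expected number, timecode, text")
            continue
        if lines[0].strip() != str(expected):
            problems.append(f"block {expected}: sequence number is {lines[0]!r}")
        match = TIMECODE.match(lines[1].strip())
        if not match:
            problems.append(f"block {expected}: bad timecode {lines[1]!r}")
            continue
        fields = match.groups()
        start, end = to_ms(*fields[:4]), to_ms(*fields[4:])
        if end <= start:
            problems.append(f"block {expected}: end is not after start")
        elif not (min_s * 1000 <= end - start <= max_s * 1000):
            seconds = (end - start) / 1000
            problems.append(
                f"block {expected}: duration {seconds:.1f}s outside {min_s}-{max_s}s"
            )
    return problems
```

An empty list means the reply follows the requested structure. A check like this only validates the format; it cannot tell whether the timestamps actually match the audio.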

How to Use

Using Gemini requires a VPN

  1. Open the Gemini website and log in: https://aistudio.google.com/app
  2. Select the model on the right. Gemini 2.0 Flash is sufficient, although a "Thinking" model, which works through a thought process before answering, might yield better results.
  3. Enter the prompt and upload the file.

The transcription result looks pretty good.

Extension

If you also need a translation, add a request to the prompt asking Gemini to translate the subtitles into your target language, or to output bilingual subtitles so you can compare the two.

Shortcomings

Gemini's biggest shortcoming is that its timestamps are inaccurate. Hopefully future versions will improve on this.

For now, the only way around this is to use voice activity detection (VAD) to split the audio into sentence-level segments, transcribe each segment individually, and then assemble the results back into an SRT file. Doing all of this by hand is far too slow.
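The final assembly step can be sketched in Python as follows. This is a minimal illustration, not how any particular tool implements it: the names `format_timestamp` and `build_srt` are hypothetical, and the VAD and per-segment transcription are assumed to have already produced `(start_seconds, end_seconds, text)` tuples.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples.

    The start/end values are assumed to come from the VAD step, so the
    timestamps do not depend on the model's own timing estimates.
    """
    blocks = []
    index = 1
    # Sort by start time in case segments were transcribed out of order.
    for start, end, text in sorted(segments, key=lambda seg: seg[0]):
        text = text.strip()
        if not text:
            continue  # skip segments where nothing was transcribed
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
        index += 1
    # Each block already ends with a newline; joining with "\n" adds the
    # blank separator line between blocks.
    return "\n".join(blocks)
```

Because the timing comes from the segment boundaries, the model only has to supply the text for each segment, which is the part it does well.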

It is recommended to use the free tool pyVideoTrans's "Audio/Video to Subtitles" function and select "Gemini AI," which will complete these steps automatically. You only need to select the audio or video to be transcribed.

Download address: https://pyvideotrans.com