Gemini is a powerful AI model capable of processing various types of content, including text, images, audio, and video. It is free to use on the web with few restrictions, although in regions where Google does not offer the service you will need a VPN to reach it.

Gemini is excellent for speech-to-text conversion. It supports a wide range of languages, including some less common ones, and offers impressive recognition accuracy.

If you want Gemini to directly generate SRT subtitle files, you need to use specific prompts. Below is a prompt you can copy and use to have Gemini transcribe and output SRT subtitles for you.

Speech Transcription Prompt

You are a professional subtitle transcription assistant. Your task is to transcribe the file I provide into text and format the transcription as a standard SRT (SubRip) subtitle file. Specific requirements are as follows:

## Each subtitle block must strictly follow the structure below:

[Line Number]
[Time Line]
[Text Line]
[Blank Line]

**Explanation of the Structure**
- [Line Number] is the sequence number of the subtitle block, starting from 1 and incrementing, e.g., 1, 2, etc.
- [Time Line] is the timestamp in the format HH:MM:SS,FFF --> HH:MM:SS,FFF, indicating the start and end time of the subtitle (FFF represents 3-digit milliseconds, e.g., 000 to 999). If you cannot calculate the time precisely, estimate it reasonably based on the audio content to ensure logical time intervals.
- [Text Line] is the transcribed text content.
- [Blank Line] is the separator between subtitle blocks. Ensure there is a blank line after each subtitle block.

## Restrictions
When outputting, strictly adhere to the above format. Do not omit any part, and do not add extra text or comments.
Try to keep the duration of each subtitle block between 3 and 15 seconds, splitting naturally based on speech speed and meaning.

Now, please transcribe the file I provide and output the subtitle content in the specified format.

How to Use

Note: in regions where Gemini is not available, you will need a VPN to follow these steps.

  1. Open Google AI Studio and log in: https://aistudio.google.com/app
  2. On the right, select the model. Gemini 2.0 Flash is sufficient, but a "Thinking" model that reasons before answering may yield better results.
  3. Enter the prompt and upload the audio or video file.

When transcription finishes, Gemini returns the subtitles as numbered SRT blocks in the format the prompt asks for, and the recognized text is usually quite good.
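Because the model is only following instructions in a prompt, it is worth checking that the returned text really matches the block structure before saving it as an .srt file. The following is a minimal sketch of such a check in Python. The function name `check_srt` and the error messages are illustrative assumptions, not part of Gemini or any subtitle library, and the check covers only the sequence numbers and time lines described in the prompt.

```python
import re

# One time line: HH:MM:SS,FFF --> HH:MM:SS,FFF (FFF = milliseconds)
TIME_LINE = re.compile(
    r"^(\d{2}):([0-5]\d):([0-5]\d),(\d{3}) --> "
    r"(\d{2}):([0-5]\d):([0-5]\d),(\d{3})$"
)


def to_ms(hours, minutes, seconds, millis):
    """Convert the four captured timestamp fields to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def check_srt(text):
    """Return a list of problems found; an empty list means the text looks well-formed.

    Hypothetical helper for illustration only.
    """
    problems = []
    # Subtitle blocks are separated by one or more blank lines.
    blocks = [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    for expected, block in enumerate(blocks, start=1):
        lines = block.splitlines()
        if len(lines) < 3:
            problems.append(f"block {expected}: expected number, time line and text")
            continue
        if lines[0].strip() != str(expected):
            problems.append(f"block {expected}: sequence number is {lines[0].strip()!r}")
        match = TIME_LINE.match(lines[1].strip())
        if not match:
            problems.append(f"block {expected}: bad time line {lines[1].strip()!r}")
            continue
        fields = match.groups()
        if to_ms(*fields[4:]) <= to_ms(*fields[:4]):
            problems.append(f"block {expected}: end time is not after start time")
    return problems
```

Paste the model's reply into a string and call `check_srt(reply)`. If the list is non-empty, asking Gemini to regenerate the output is usually simpler than repairing it by hand.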

Extensions

If you need to translate the subtitles, you can also include a request in the prompt to translate them into a specific language or output bilingual subtitles side by side.

Limitations

Gemini's biggest drawback is inaccurate timestamps: the start and end times it outputs are estimates and may not line up with the actual speech. This issue may be resolved with future updates and optimizations.

Currently, to address this problem, you can use VAD (Voice Activity Detection) to segment the audio into sentences before transcription, transcribe each segment individually, and then reassemble the results into an SRT file. However, doing this manually is inefficient.
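The reassembly step at the end of that workflow can be sketched in a few lines of Python. This sketch assumes the VAD step has already produced a start and end time in seconds for each segment, and that each segment has been transcribed separately. The names `format_timestamp` and `build_srt` are hypothetical helpers, and the VAD and transcription steps themselves are outside the sketch.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,FFF."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def build_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples.

    The times come from the VAD step, so they do not depend on the
    timestamps the model would otherwise have to estimate.
    """
    blocks = []
    index = 0
    for start, end, text in segments:
        text = text.strip()
        if not text:
            continue  # skip segments where nothing was transcribed
        index += 1
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
    # Each block ends with a newline; joining with another newline
    # leaves the required blank line between blocks.
    return "\n".join(blocks)
```

For example, `build_srt([(0.0, 2.5, "Hello"), (6.0, 8.0, "World")])` returns two numbered blocks whose time lines come directly from the segment boundaries.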

It is recommended to use the Audio/Video to Subtitle feature in the free tool pyVideoTrans and select Gemini AI. This will automate the entire process, and you only need to choose the audio or video file to transcribe.

Download link: https://pyvideotrans.com