Skip to content

Andrew Ng's "Three-Step Reflection Translation Method" is highly effective, as it improves translation quality by having the model self-review the translation results and propose enhancements. However, applying this method directly to SRT subtitle translation presents some challenges.

Special Requirements of SRT Subtitle Format

The SRT subtitle format has strict requirements:

  • First line: Subtitle number (numeric)
  • Second line: Two timestamps connected by --> in the format hours:minutes:seconds,3-digit milliseconds
  • Third line and beyond: Subtitle text content

Subtitles are separated by two blank lines.

Example:

1
00:00:01,950 --> 00:00:04,430
Several molecules have been discovered in the five-star system,

2
00:00:04,720 --> 00:00:06,780
We are still multiple universes away from third-type contact.

3
00:00:07,260 --> 00:00:09,880
Weibo has been carrying out filming missions for years now,

4
00:00:10,140 --> 00:00:12,920
Many previously difficult-to-capture photos have been transmitted recently.

Common Issues in SRT Translation

When using AI to translate SRT subtitles, the following problems may occur:

  • Format Errors:
    • Missing subtitle numbers or duplicate timestamps
    • Translating English punctuation in timestamps to Chinese punctuation
    • Merging adjacent subtitle lines into one, especially when the previous and next sentences form a complete grammatical sentence
  • Translation Quality Issues:
    • Translation errors often occur even with strict prompt restrictions.

Common Error Examples:

  • Subtitle text merging causing blank lines

image.png

  • Format confusion

image.png

  • Subtitle numbers being translated

image.png

  • Inconsistent number of original and translated subtitles

As mentioned above, when two consecutive subtitles form a single grammatical sentence, they are likely to be translated as one, resulting in fewer subtitle entries.

image.png

Format errors directly cause subsequent SRT-dependent processes to fail. Different models exhibit varying errors and error probabilities; generally, more intelligent models are more likely to return valid, compliant content, while locally deployed smaller models are almost unusable.

Despite this, given the translation quality improvement from the three-step reflection method, I attempted to use it. I ultimately chose gemini-1.5-flash for a small trial, mainly because it is intelligent enough, free, and has almost no restrictions aside from frequent usage limits.

Prompt Writing Strategy

Following Andrew Ng's three-step reflection workflow, write the prompts:

  • Step 1: Require the AI to perform a literal translation.
  • Step 2: Require the AI to evaluate the literal translation and provide optimization suggestions.
  • Step 3: Require the AI to perform a refined translation based on the optimization suggestions.

The difference is to strongly emphasize that the returned content must be in valid SRT format, although it may not always comply 100%.

Building a Simple API

One issue with the three-step reflection mode is the significantly higher token consumption—prompts become longer, and output results increase. Additionally, due to Gemini's rate limits, exceeding the frequency returns a 429 error, requiring a pause after each request.

I used Flask to build a backend API and Bootstrap 5 for a simple frontend single-page application. The overall interface is as follows:

image.png

Obviously, using Gemini in China requires a VPN.

  • Lines to translate simultaneously: Refers to the number of subtitle lines in one translation request. If too large, it may exceed token limits and cause errors; if too small, it's meaningless. Recommended range is 30–100, default is 50.
  • Pause seconds after translation: Prevents frequent requests from causing 429 errors. It pauses for 10 seconds after each request before initiating the next one.

Example Return Result:

<step1_initial_translation>
1
00:00:01,950 --> 00:00:04,430
Several molecules have been discovered in the five-star system,

2
00:00:04,720 --> 00:00:06,780
We are still multiple universes away from third-type contact.

3
00:00:07,260 --> 00:00:09,880
Weibo has been carrying out filming missions for years now,

4
00:00:10,140 --> 00:00:12,920
Many previously difficult-to-capture photos have been transmitted recently.

5
00:00:13,440 --> 00:00:17,500
In early June, astronomers published this photo in Nature,

6
00:00:18,040 --> 00:00:19,180
Outside the blue core,

7
00:00:19,360 --> 00:00:21,380
There\'s also this circle of orange light,

8
00:00:21,900 --> 00:00:23,740
This is a new drama-scale sweet donut,

9
00:00:24,380 --> 00:00:25,640
This is a portal.

10
00:00:26,280 --> 00:00:28,100
This is the generation ring of an alien civilization,

</step1_initial_translation>

<step2_reflection>
* **Line 1:** "Five-star system" is likely a mistranslation.  It probably refers to a five-member committee or group, not a star system. Clarify the context.
* **Line 2:** "Multiple universes" seems like an over-exaggeration.  Rephrase for clarity and accuracy.
* **Line 3:** "Weibo" should be explained as a Chinese social media platform.  "Filming missions" is unclear.  Does it mean "posting videos/images"?
* **Line 8:** "Drama-scale sweet donut" is a nonsensical literal translation. Figure out the intended meaning.
* **Line 9:** "Portal" seems out of context.  Verify the intended meaning.
* **Line 10:** "Generation ring" is likely a mistranslation.  Clarify the context.
* **Line 11:** "Organic polycyclic aromatic hydrocarbons" is overly technical for a general audience. Simplify if possible.
* **Line 12 and 14:** Use the correct formatting for the galaxy\'s name: SPT0418-47.
* **Line 15:** "It hasn\'t been shortened" is awkward. Remove or rephrase.
* **Line 28:**  The name of the organization and the resource should be translated more naturally and accurately.  Consider breaking this long line into two for better readability.
* **Line 29:**  "Cute plush dolls" may sound childish. Consider rephrasing as "animated characters" or similar.
* **Line 35:** "James Webb Space Telescope" should be used consistently throughout. Shortening to "Webb Telescope" after the first mention is acceptable.
* **Line 44:** "SPD048" is likely a typo.  It should be SPT0418-47 to be consistent.
* **Line 45-46:**  "Standard beautiful photo" is redundant.  Simplify to "beautiful photo".
* **Line 48:** "Grovitational Lenshin" is a typo. Correct to "Gravitational Lensing".
* **Line 50:** The sentence is incomplete. Finish the thought.

</step2_reflection>

<step3_refined_translation>
1
00:00:01,950 --> 00:00:04,430
Several molecules have been discovered in the five-member group's area of focus.

2
00:00:04,720 --> 00:00:06,780
We are still far from making contact with extraterrestrial life.

3
00:00:07,260 --> 00:00:09,880
The James Webb Space Telescope has been capturing images for a year now,

4
00:00:10,140 --> 00:00:12,920
and has recently transmitted many previously unseen photos.

5
00:00:13,440 --> 00:00:17,500
In early June, astronomers published this image in Nature.

6
00:00:18,040 --> 00:00:19,180
Outside the blue core,

7
00:00:19,360 --> 00:00:21,380
there's a ring of orange light.

8
00:00:21,900 --> 00:00:23,740
This is a large, ring-shaped structure.

9
00:00:24,380 --> 00:00:25,640
This is being investigated.

10
00:00:26,280 --> 00:00:28,100
This is thought to be a sign of an early galaxy.

</step3_refined_translation>

Extract the text within the <step3_refined_translation></step3_refined_translation> tags as the final translation result.

Simple Packaging for Local Trial

Directly download, extract, and double-click app.exe to automatically open the above UI interface in your browser. Enter the Gemini API key, fill in the proxy address, select the SRT subtitle file to translate, choose the target language, and test the results.

image.png

Q1: How does the reflection workflow differ from traditional machine translation?

A1: The reflection workflow introduces self-evaluation and optimization mechanisms, simulating the thought process of human translators, resulting in more accurate and natural translations.

Q2: How long does the reflection workflow take?

A2: Although the reflection workflow requires multiple AI processing steps, it typically only takes 10–20 seconds longer than traditional methods. Given the improvement in translation quality, this time investment is worthwhile.

Q3: Does the reflection workflow guarantee that the subtitle translation result is always valid SRT?

A3: No, issues like blank lines or inconsistent subtitle counts with the original may still occur. For example, if a subsequent subtitle has only 3–5 words and is grammatically continuous with the previous one, the translation might merge them into a single line.



Added a feature to the tool that supports uploading video or audio files simultaneously. Using Gemini, it can transcribe audio/video into subtitles and translate them in the same request.

Gemini's large model supports both text and audio/video inputs, allowing a single request to transcribe audio/video into subtitles and translate them.

For example, sending an English-spoken video to Gemini and specifying translation to Chinese will return Chinese subtitles.

image.png

image.png

1. Translate Subtitles Only

Paste SRT subtitle content into the left text box or click the "Upload SRT Subtitle" button to select a subtitle file from your local computer.

Then set the target language, and use the "Three-Step Reflection Translation" method to command Gemini to perform the translation task. The result will be output to the right text box. Click the "Download" button in the lower right corner to save it as an SRT file locally.

2. Transcribe Audio/Video into Subtitles

Click the "Upload Audio/Video to Transcribe into Subtitles" button on the left, select any audio or video file to upload. After uploading, submit it. Gemini will process it and return the subtitle content recognized from the speech in the audio/video, with good results.

If a target language is specified, Gemini will translate the recognized subtitles into the specified language before returning, completing both subtitle generation and translation in one step.

Download Link:

https://github.com/jianchang512/ai2srt/releases/download/v0.2/windows-ai2srt-0.2.7z