
Professor Andrew Ng's "Three-Step Reflection Translation Method" is very effective. It enhances translation quality by having the model self-examine the translation results and suggest improvements. However, directly applying this method to SRT format subtitle translation presents some challenges.

Special Requirements of SRT Subtitle Format

SRT format subtitles have strict formatting requirements:

  • First line: Line number (integer)
  • Second line: Two timestamps, connected by -->, in the format hours:minutes:seconds,milliseconds
  • Third line and onwards: Subtitle text content

Subtitle entries are separated by a single blank line.

Example:

1
00:00:01,950 --> 00:00:04,430
Several molecules have been discovered in the five-star system,

2
00:00:04,720 --> 00:00:06,780
We are still multiple universes away from third-type contact.

3
00:00:07,260 --> 00:00:09,880
Weibo has been carrying out filming missions for years now,

4
00:00:10,140 --> 00:00:12,920
Many previously difficult-to-capture photos have been transmitted recently.
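
The layout above is regular enough to check mechanically. The following is a minimal sketch, not the tool's actual code; the names `parse_srt`, `Cue`, and `TIMING_RE` are illustrative assumptions. It splits SRT text on blank lines and rejects any block that lacks a numeric first line, a well-formed timing line, or subtitle text.

```python
import re
from dataclasses import dataclass

# Matches "HH:MM:SS,mmm --> HH:MM:SS,mmm" using ASCII digits and punctuation only.
TIMING_RE = re.compile(
    r"^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}$"
)


@dataclass
class Cue:
    index: int   # first line: sequence number
    timing: str  # second line: start and end timestamps
    text: str    # third line onwards: subtitle text


def parse_srt(content: str) -> list[Cue]:
    """Split SRT text into cues; raise ValueError on a malformed block."""
    cues: list[Cue] = []
    normalized = content.replace("\r\n", "\n").strip()
    if not normalized:
        return cues
    # Entries are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = [line.strip() for line in block.split("\n")]
        if (
            len(lines) < 3
            or not re.fullmatch(r"[0-9]+", lines[0])
            or not TIMING_RE.match(lines[1])
        ):
            raise ValueError(f"malformed SRT block: {block!r}")
        cues.append(Cue(int(lines[0]), lines[1], "\n".join(lines[2:])))
    return cues
```

Because the pattern accepts only ASCII digits, colons, and commas, a timing line whose punctuation was changed during translation is rejected rather than silently accepted.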

Common Issues in SRT Translation

When using AI to translate SRT subtitles, the following issues may occur:

  • Formatting Errors:
    • Missing line numbers or duplicated timestamps
    • Replacing the ASCII punctuation in timestamps (such as : and ,) with full-width Chinese punctuation
    • Merging the text of two adjacent subtitles into one, especially when together they form a single grammatically complete sentence
  • Translation Quality Issues:
    • Translation errors often occur, even when using strict prompts.

Common Error Examples:

  • Subtitle text merging leading to blank lines

image.png

  • Garbled formatting

image.png

  • Line numbers being translated

image.png

  • Inconsistent number of original and translated subtitles

As mentioned above, if two consecutive subtitles are grammatically part of the same sentence, they are likely to be translated into a single subtitle, resulting in fewer subtitles in the output.

image.png

Formatting errors break any downstream step that relies on valid SRT. The kinds of errors and how often they occur vary by model. In general, more capable models are more likely to return valid, compliant output, while small locally deployed models are almost unusable.
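
One way to catch these problems before they reach later steps is to compare the cue headers of the original and the translated text. The following is a minimal sketch under the assumption that sequence numbers and timing lines must survive translation unchanged; `cue_headers` and `find_format_problems` are hypothetical helper names, not part of the tool.

```python
import re

# A cue header: a line holding only a sequence number, followed by a timing line.
HEADER_RE = re.compile(
    r"^([0-9]+)[ \t]*\r?\n"
    r"([0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> "
    r"[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})[ \t]*\r?$",
    re.MULTILINE,
)


def cue_headers(srt_text: str) -> list[tuple[str, str]]:
    """Return (sequence number, timing line) for every well-formed cue header."""
    return HEADER_RE.findall(srt_text)


def find_format_problems(original: str, translated: str) -> list[str]:
    """List the structural differences between original and translated SRT."""
    problems = []
    src, dst = cue_headers(original), cue_headers(translated)
    if len(src) != len(dst):
        problems.append(
            f"cue count differs: {len(src)} original vs {len(dst)} translated"
        )
    # Compare headers pairwise for as many cues as both sides have.
    for (src_no, src_time), (dst_no, dst_time) in zip(src, dst):
        if src_no != dst_no:
            problems.append(f"expected cue {src_no}, found {dst_no}")
        elif src_time != dst_time:
            problems.append(f"cue {src_no}: timing line was altered")
    return problems
```

An empty list means the structure survived. A non-empty list is a signal to retry the batch or fall back to the source text for it.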

Still, given the improvement in translation quality that the three-step reflection method offers, I decided to try it. I chose gemini-1.5-flash for a small trial, mainly because it is capable enough and free, with almost no restrictions other than rate limits.

Prompt Engineering Ideas

The prompt follows Andrew Ng's three-step reflection workflow:

  • Step 1: Ask the AI to translate literally.
  • Step 2: Ask the AI to evaluate the literal translation and provide suggestions for improvement.
  • Step 3: Re-translate based on the suggestions, focusing on meaning.

The difference from the original workflow is a stronger emphasis on the requirement that the returned content be valid SRT, although the model does not always comply.
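
To make the idea concrete, here is a hypothetical prompt template built around the three steps. The wording is an assumption written for illustration and is not the prompt the tool actually uses; only the three output tag names are taken from the sample response in this article. `PROMPT_TEMPLATE` and `build_prompt` are illustrative names.

```python
# Hypothetical prompt wording; only the three tag names come from the article.
PROMPT_TEMPLATE = """\
You are a professional subtitle translator.
Translate the SRT subtitles below into {target_language} in three steps.

Step 1: Translate literally.
Put the result inside <step1_initial_translation> tags.
Step 2: Review the literal translation and list suggestions for improvement.
Put them inside <step2_reflection> tags.
Step 3: Re-translate following the suggestions, focusing on meaning.
Put the result inside <step3_refined_translation> tags.

Format rules for steps 1 and 3:
- Return valid SRT.
- The output must contain the same number of subtitles as the input ({cue_count}).
- Copy every sequence number and timing line unchanged.
- Never merge, split, or drop subtitles, even when two of them form one sentence.

<srt>
{srt_text}
</srt>
"""


def build_prompt(srt_text: str, target_language: str, cue_count: int) -> str:
    """Fill the template for one batch of subtitles."""
    return PROMPT_TEMPLATE.format(
        target_language=target_language,
        cue_count=cue_count,
        srt_text=srt_text.strip(),
    )
```

Stating the expected subtitle count explicitly is one way to discourage merging, though a model may still ignore it.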

Setting Up a Simple API

One drawback of the three-step reflection workflow is significantly higher token consumption, since both the prompt and the output are longer. In addition, Gemini enforces rate limits and returns a 429 error when they are exceeded, so it is necessary to pause for a while after each request.

A backend API was built using Flask, and a simple single-page interface was created using Bootstrap 5. The overall interface is as follows:

image.png

Obviously, accessing Gemini within China requires a VPN.

  • "Translate Lines Simultaneously": The number of subtitle lines sent in a single translation request. If the value is too large, the request may exceed the token limit and fail. If it is too small, there is little point in batching. A value between 30 and 100 is recommended; the default is 50.
  • "Pause Seconds After Translation": Pauses for 10 seconds after each request so that frequent requests do not trigger a 429 error.
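
The two settings above can be sketched as a simple batching loop. This is an illustration under stated assumptions, not the tool's code: `translate_batch` stands for whatever function sends one request to the model, and `RateLimitError` is a hypothetical exception that function is assumed to raise on a 429 response.

```python
import time
from typing import Callable, Iterator


class RateLimitError(Exception):
    """Hypothetical: raised by translate_batch when the API answers with 429."""


def chunked(cues: list[str], size: int = 50) -> Iterator[list[str]]:
    """Yield consecutive batches of at most `size` subtitles."""
    for start in range(0, len(cues), size):
        yield cues[start:start + size]


def translate_all(
    cues: list[str],
    translate_batch: Callable[[list[str]], list[str]],
    batch_size: int = 50,
    pause_seconds: float = 10.0,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Translate batch by batch, pausing between requests and retrying on 429."""
    results: list[str] = []
    batches = list(chunked(cues, batch_size))
    for number, batch in enumerate(batches):
        for attempt in range(max_retries + 1):
            try:
                results.extend(translate_batch(batch))
                break
            except RateLimitError:
                if attempt == max_retries:
                    raise
                # Wait longer after each consecutive 429 before retrying.
                sleep(pause_seconds * (attempt + 1))
        if number < len(batches) - 1:
            sleep(pause_seconds)  # fixed pause before the next request
    return results
```

The sketch skips the pause after the final batch, since no further request follows. The `sleep` parameter exists only so the loop can be exercised without actually waiting.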

Example of the returned result:

<step1_initial_translation>
1
00:00:01,950 --> 00:00:04,430
Several molecules have been discovered in the five-star system,

2
00:00:04,720 --> 00:00:06,780
We are still multiple universes away from third-type contact.

3
00:00:07,260 --> 00:00:09,880
Weibo has been carrying out filming missions for years now,

4
00:00:10,140 --> 00:00:12,920
Many previously difficult-to-capture photos have been transmitted recently.

5
00:00:13,440 --> 00:00:17,500
In early June, astronomers published this photo in Nature,

6
00:00:18,040 --> 00:00:19,180
Outside the blue core,

7
00:00:19,360 --> 00:00:21,380
There's also this circle of orange light,

8
00:00:21,900 --> 00:00:23,740
This is a new drama-scale sweet donut,

9
00:00:24,380 --> 00:00:25,640
This is a portal.

10
00:00:26,280 --> 00:00:28,100
This is the generation ring of an alien civilization,

</step1_initial_translation>

<step2_reflection>
* **Line 1:** "Five-star system" is likely a mistranslation.  It probably refers to a five-member committee or group, not a star system. Clarify the context.
* **Line 2:** "Multiple universes" seems like an over-exaggeration.  Rephrase for clarity and accuracy.
* **Line 3:** "Weibo" should be explained as a Chinese social media platform.  "Filming missions" is unclear.  Does it mean "posting videos/images"?
* **Line 8:** "Drama-scale sweet donut" is a nonsensical literal translation. Figure out the intended meaning.
* **Line 9:** "Portal" seems out of context.  Verify the intended meaning.
* **Line 10:** "Generation ring" is likely a mistranslation.  Clarify the context.
* **Line 11:** "Organic polycyclic aromatic hydrocarbons" is overly technical for a general audience. Simplify if possible.
* **Line 12 and 14:** Use the correct formatting for the galaxy's name: SPT0418-47.
* **Line 15:** "It hasn't been shortened" is awkward. Remove or rephrase.
* **Line 28:**  The name of the organization and the resource should be translated more naturally and accurately.  Consider breaking this long line into two for better readability.
* **Line 29:**  "Cute plush dolls" may sound childish. Consider rephrasing as "animated characters" or similar.
* **Line 35:** "James Webb Space Telescope" should be used consistently throughout. Shortening to "Webb Telescope" after the first mention is acceptable.
* **Line 44:** "SPD048" is likely a typo.  It should be SPT0418-47 to be consistent.
* **Line 45-46:**  "Standard beautiful photo" is redundant.  Simplify to "beautiful photo".
* **Line 48:** "Grovitational Lenshin" is a typo. Correct to "Gravitational Lensing".
* **Line 50:** The sentence is incomplete. Finish the thought.

</step2_reflection>

<step3_refined_translation>
1
00:00:01,950 --> 00:00:04,430
Several molecules have been discovered in the five-member group's area of focus.

2
00:00:04,720 --> 00:00:06,780
We are still far from making contact with extraterrestrial life.

3
00:00:07,260 --> 00:00:09,880
The James Webb Space Telescope has been capturing images for a year now,

4
00:00:10,140 --> 00:00:12,920
and has recently transmitted many previously unseen photos.

5
00:00:13,440 --> 00:00:17,500
In early June, astronomers published this image in Nature.

6
00:00:18,040 --> 00:00:19,180
Outside the blue core,

7
00:00:19,360 --> 00:00:21,380
there's a ring of orange light.

8
00:00:21,900 --> 00:00:23,740
This is a large, ring-shaped structure.

9
00:00:24,380 --> 00:00:25,640
This is being investigated.

10
00:00:26,280 --> 00:00:28,100
This is thought to be a sign of an early galaxy.

</step3_refined_translation>

The text within the <step3_refined_translation></step3_refined_translation> tags is extracted as the translation result.
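
The extraction step can be done with a single regular expression. A minimal sketch follows; `extract_refined_translation` is an illustrative name, and raising an error when the tags are missing is an assumption about how a caller might want to handle a malformed response.

```python
import re

# Non-greedy match across newlines between the step 3 tags.
STEP3_RE = re.compile(
    r"<step3_refined_translation>(.*?)</step3_refined_translation>",
    re.DOTALL,
)


def extract_refined_translation(response_text: str) -> str:
    """Return the SRT text inside the step 3 tags, without surrounding whitespace."""
    match = STEP3_RE.search(response_text)
    if match is None:
        raise ValueError("response has no <step3_refined_translation> block")
    return match.group(1).strip()
```

Failing loudly when the block is absent lets the caller retry the request instead of saving an empty subtitle file.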

A simple packaged build is available, which you can download and try locally.

Download it, unzip it, and double-click app.exe to open the UI in your browser automatically. Enter your Gemini API key, fill in the proxy address, select the SRT subtitle file to translate, choose the target language, and try it out.

image.png

Q1: How is the reflection workflow different from traditional machine translation?

A1: The reflection workflow introduces a self-evaluation and optimization mechanism, simulating the thinking process of human translators, and can produce more accurate and natural translation results.

Q2: How long does it take to use the reflection workflow?

A2: Although the reflection workflow requires multiple AI processing steps, it usually only takes 10–20 seconds longer than traditional methods. Considering the improvement in translation quality, this time investment is worthwhile.

Q3: Can the reflection workflow guarantee that the subtitle translation results are always valid SRT?

A3: No. Blank lines and mismatches between the number of original and translated subtitles can still occur. For example, if the second of two subtitles has only 3-5 words and is grammatically a continuation of the previous one, the two are likely to be merged into a single subtitle in the translation.



The tool now also supports uploading video or audio files. With Gemini, the audio or video is transcribed into subtitles and translated in the same pass, and the translated result is returned.

Gemini natively accepts text as well as audio and video input, so a single request can both transcribe audio or video into subtitles and translate them.

For example, sending a video with English speech to Gemini and specifying Chinese as the target language returns Chinese subtitles.

image.png

image.png

1. Translate Subtitles Only

You can paste the SRT format subtitle content in the left text box, or directly click the "Upload SRT Subtitles" button to select the subtitle file from your local computer.

Then set the target language. The tool uses the "Three-Step Reflection Translation Method" to instruct Gemini to translate and shows the result in the text box on the right. Click the "Download Button" in the lower-right corner to save it as an SRT file on your computer.

2. Transcribe Audio and Video to Subtitles

Click the "Upload Audio/Video to Transcribe to Subtitles" button on the left, select any audio or video file, and submit it once the upload finishes. Gemini returns subtitles recognized from the speech in the file, and the results are reasonably good.

If a target language is also specified, Gemini translates the recognized text into that language before returning it. In other words, subtitle generation and subtitle translation are completed in one pass.

Download address:

https://github.com/jianchang512/ai2srt/releases/download/v0.2/windows-ai2srt-0.2.7z