Translating SRT Subtitles with the Three-Step Reflection Method
Andrew Ng's "Reflective Three-Step Translation Method" is highly effective. It enhances translation quality by allowing the model to self-review its results and suggest improvements. However, applying this method directly to SRT subtitle translation presents some challenges.
Specific Requirements of SRT Subtitle Format
The SRT format has strict formatting requirements:
- First line: the subtitle index (an integer)
- Second line: two timestamps joined by -->, each in the format hours:minutes:seconds,milliseconds
- Third line onward: the subtitle text
Entries are separated by a blank line (two consecutive newlines).
Example:
1
00:00:01,950 --> 00:00:04,430
Several molecules have been discovered in the five-star system,

2
00:00:04,720 --> 00:00:06,780
We are still multiple universes away from third-type contact.

3
00:00:07,260 --> 00:00:09,880
Weibo has been carrying out filming missions for years now,

4
00:00:10,140 --> 00:00:12,920
Many previously difficult-to-capture photos have been transmitted recently.
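To make these requirements concrete, here is a minimal Python sketch (my own illustration, not code from the tool described below) that splits SRT text into entries:

```python
import re

def parse_srt(text: str):
    """Split SRT text into (index, start, end, text_lines) tuples."""
    entries = []
    # Entries are separated by a blank line.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        index = int(lines[0].strip())
        start, end = (t.strip() for t in lines[1].split("-->"))
        entries.append((index, start, end, lines[2:]))
    return entries
```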
Common Problems in SRT Translation
When using AI to translate SRT subtitles, the following problems may occur:
- Format Errors:
- Missing line numbers or repeated timestamps
- Punctuation in timestamps converted into full-width Chinese punctuation
- Adjacent subtitle texts merged into one entry, especially when the two lines form a single grammatical sentence.
- Translation Quality Issues:
- Even with strict prompt restrictions, translation errors still occur frequently.
Examples of Common Errors:
- Subtitle Text Merging Leading to Blank Lines
- Format Confusion
- Line Numbers Being Translated
- Inconsistent Number of Original and Result Subtitles
As mentioned above, when two consecutive subtitles form a single grammatical sentence, they are likely to be merged into one line in the translation, so the result contains fewer subtitle entries than the original.
Format errors directly break any downstream process that relies on the SRT file. Different models make different errors with different frequencies; generally speaking, the more capable the model, the more likely it is to return valid, compliant content. Small models deployed locally are almost completely unusable for this.
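Because of this, it helps to validate whatever the model returns before any downstream step consumes it. A rough sketch of such a check (my own, not part of the tool):

```python
import re

TIMESTAMP_LINE = re.compile(
    r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$"
)

def validate_srt(original_count: int, translated_text: str) -> list[str]:
    """Return a list of format problems found in translated SRT text."""
    problems = []
    blocks = [b for b in re.split(r"\n\s*\n", translated_text.strip()) if b.strip()]
    if len(blocks) != original_count:
        problems.append(
            f"entry count mismatch: expected {original_count}, got {len(blocks)}"
        )
    for block in blocks:
        lines = block.splitlines()
        if len(lines) < 3 or not lines[0].strip().isdigit():
            problems.append(f"malformed entry: {lines[0] if lines else '(empty)'}")
        elif not TIMESTAMP_LINE.match(lines[1].strip()):
            problems.append(f"bad timestamp line: {lines[1]}")
    return problems
```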
However, given the improvement in translation quality that the three-step reflection method brings, I still wanted to try it. I ultimately chose gemini-1.5-flash for a small test, mainly because it is intelligent enough and free; apart from the rate limit, it is essentially unrestricted.
Ideas for Writing the Prompt
Write prompts according to Andrew Ng's three-step reflection workflow:
- Step 1: Ask the AI to produce a direct, literal translation.
- Step 2: Ask the AI to evaluate the literal translation and offer suggestions for improvement.
- Step 3: Re-translate idiomatically based on those suggestions.
The main difference is an added, strongly worded requirement that the returned content must be valid SRT, even though the model does not always comply 100%.
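For illustration, the prompt could be structured roughly like the template below. This is a paraphrase of the idea rather than the tool's exact wording; target_lang and srt_chunk are placeholder names:

```python
PROMPT_TEMPLATE = """You are a professional subtitle translator.
Translate the SRT subtitles below into {target_lang} in three steps.

Step 1 - Literal translation: translate each entry literally, keeping the
index, timestamp line, and entry count exactly as in the input. Wrap the
result in <step1_initial_translation></step1_initial_translation> tags.

Step 2 - Reflection: review the literal translation entry by entry and point
out mistranslations, awkward wording, and format problems. Wrap your notes
in <step2_reflection></step2_reflection> tags.

Step 3 - Refined translation: re-translate idiomatically based on your notes,
again as valid SRT with the same number of entries and unchanged index and
timestamp lines. Wrap the result in
<step3_refined_translation></step3_refined_translation> tags.

SRT subtitles:
{srt_chunk}
"""
```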
Building a Simple API
One drawback of the three-step reflection mode is that it consumes far more tokens: the prompt is longer, and so is the output. In addition, Gemini enforces a rate limit; exceeding it returns a 429 error, so the program needs to pause for a while after each request.
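One simple way to cope with the 429s is a small pause-and-retry wrapper. The sketch below assumes the SDK surfaces the 429 as google.api_core.exceptions.ResourceExhausted, which is how the underlying google-api-core library maps that status:

```python
import time

from google.api_core import exceptions

def generate_with_backoff(model, prompt, pause=10, retries=3):
    """Call Gemini and retry with increasing pauses when rate-limited."""
    for attempt in range(retries):
        try:
            return model.generate_content(prompt)
        except exceptions.ResourceExhausted:  # HTTP 429: rate limit hit
            time.sleep(pause * (attempt + 1))
    raise RuntimeError("rate limit persisted after retries")
```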
I used Flask to build the backend API and Bootstrap 5 to create a simple single-page front end. The interface exposes two main settings, described below.
Obviously, you need a proxy to use Gemini from mainland China.
- Simultaneous translation lines: the number of subtitle lines sent in a single translation request. Too large and the request may exceed the token limit and fail; too small and the chunking is pointless. A value between 30 and 100 is recommended; the default is 50.
- Pause seconds after translation: to avoid triggering a 429 error with overly frequent requests, the program pauses for 10 seconds after each request returns before sending the next one (see the sketch after this list).
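A minimal sketch of how those two settings might drive the backend, using Flask and the google-generativeai SDK (my own illustration; the route name and form fields are assumptions, not the tool's actual API):

```python
import re
import time

import google.generativeai as genai
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/translate", methods=["POST"])
def translate():
    srt_text = request.form["srt"]
    target_lang = request.form.get("target_lang", "Chinese")
    per_request = int(request.form.get("lines", 50))  # subtitle lines per request
    pause = int(request.form.get("pause", 10))        # seconds to wait after each request

    genai.configure(api_key=request.form["api_key"])
    model = genai.GenerativeModel("gemini-1.5-flash")

    blocks = re.split(r"\n\s*\n", srt_text.strip())
    results = []
    for i in range(0, len(blocks), per_request):
        chunk = "\n\n".join(blocks[i:i + per_request])
        # In the real tool this would be the full three-step prompt sketched above.
        prompt = f"Translate the following SRT subtitles into {target_lang}:\n\n{chunk}"
        results.append(model.generate_content(prompt).text)
        time.sleep(pause)  # avoid hitting the 429 rate limit

    return jsonify({"raw": "\n\n".join(results)})
```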
Example of a Returned Result
<step1_initial_translation>
1
00:00:01,950 --> 00:00:04,430
Several molecules have been discovered in the five-star system,

2
00:00:04,720 --> 00:00:06,780
We are still multiple universes away from third-type contact.

3
00:00:07,260 --> 00:00:09,880
Weibo has been carrying out filming missions for years now,

4
00:00:10,140 --> 00:00:12,920
Many previously difficult-to-capture photos have been transmitted recently.

5
00:00:13,440 --> 00:00:17,500
In early June, astronomers published this photo in Nature,

6
00:00:18,040 --> 00:00:19,180
Outside the blue core,

7
00:00:19,360 --> 00:00:21,380
There's also this circle of orange light,

8
00:00:21,900 --> 00:00:23,740
This is a new drama-scale sweet donut,

9
00:00:24,380 --> 00:00:25,640
This is a portal.

10
00:00:26,280 --> 00:00:28,100
This is the generation ring of an alien civilization,
</step1_initial_translation>
<step2_reflection>
* **Line 1:** "Five-star system" is likely a mistranslation. It probably refers to a five-member committee or group, not a star system. Clarify the context.
* **Line 2:** "Multiple universes" seems like an over-exaggeration. Rephrase for clarity and accuracy.
* **Line 3:** "Weibo" should be explained as a Chinese social media platform. "Filming missions" is unclear. Does it mean "posting videos/images"?
* **Line 8:** "Drama-scale sweet donut" is a nonsensical literal translation. Figure out the intended meaning.
* **Line 9:** "Portal" seems out of context. Verify the intended meaning.
* **Line 10:** "Generation ring" is likely a mistranslation. Clarify the context.
* **Line 11:** "Organic polycyclic aromatic hydrocarbons" is overly technical for a general audience. Simplify if possible.
* **Line 12 and 14:** Use the correct formatting for the galaxy's name: SPT0418-47.
* **Line 15:** "It hasn't been shortened" is awkward. Remove or rephrase.
* **Line 28:** The name of the organization and the resource should be translated more naturally and accurately. Consider breaking this long line into two for better readability.
* **Line 29:** "Cute plush dolls" may sound childish. Consider rephrasing as "animated characters" or similar.
* **Line 35:** "James Webb Space Telescope" should be used consistently throughout. Shortening to "Webb Telescope" after the first mention is acceptable.
* **Line 44:** "SPD048" is likely a typo. It should be SPT0418-47 to be consistent.
* **Line 45-46:** "Standard beautiful photo" is redundant. Simplify to "beautiful photo".
* **Line 48:** "Grovitational Lenshin" is a typo. Correct to "Gravitational Lensing".
* **Line 50:** The sentence is incomplete. Finish the thought.
</step2_reflection>
<step3_refined_translation>
1
00:00:01,950 --> 00:00:04,430
Several molecules have been discovered in the five-member group's area of focus.

2
00:00:04,720 --> 00:00:06,780
We are still far from making contact with extraterrestrial life.

3
00:00:07,260 --> 00:00:09,880
The James Webb Space Telescope has been capturing images for a year now,

4
00:00:10,140 --> 00:00:12,920
and has recently transmitted many previously unseen photos.

5
00:00:13,440 --> 00:00:17,500
In early June, astronomers published this image in Nature.

6
00:00:18,040 --> 00:00:19,180
Outside the blue core,

7
00:00:19,360 --> 00:00:21,380
there's a ring of orange light.

8
00:00:21,900 --> 00:00:23,740
This is a large, ring-shaped structure.

9
00:00:24,380 --> 00:00:25,640
This is being investigated.

10
00:00:26,280 --> 00:00:28,100
This is thought to be a sign of an early galaxy.
</step3_refined_translation>
The text inside the <step3_refined_translation></step3_refined_translation> tags is then extracted from the response; that is the final translation result.
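A simple regular expression is enough for that extraction, for example (a sketch):

```python
import re

def extract_refined(raw: str) -> str:
    """Pull the final SRT text out of the step 3 tags."""
    match = re.search(
        r"<step3_refined_translation>(.*?)</step3_refined_translation>",
        raw,
        re.DOTALL,
    )
    return match.group(1).strip() if match else ""
```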
Simple Package Available for Local Testing
Download the package, extract it, and double-click app.exe; the UI shown above opens automatically in your browser. Enter the API key you obtained from Gemini, fill in the proxy address, select the SRT subtitle file you want to translate, choose the target language, and see how the result turns out.
Q1: How does the reflection workflow differ from traditional machine translation?
A1: The reflection workflow introduces a self-assessment and optimization mechanism, simulating the thinking process of human translators, and can produce more accurate and natural translation results.
Q2: How long does it take to use the reflection workflow?
A2: Although the reflection workflow requires multiple AI processing steps, it usually only takes 10-20 seconds longer than traditional methods. Considering the improvement in translation quality, this time investment is worthwhile.
Q3: Can the reflection workflow guarantee that the subtitle translation result is a valid SRT?
A3: No. There may still be blank lines or a mismatch between the number of original and translated entries. For example, if the latter of two consecutive subtitles contains only 3-5 words and is grammatically a continuation of the previous sentence, the two are likely to be merged into one entry in the translation.
I have since added a feature to the little tool that supports uploading video or audio files as well. With Gemini's help, the audio or video is converted into subtitles, translation is performed in the same pass as the conversion, and the translated result is returned.
The Gemini model itself accepts both text and audio/video input, so a single request can transcribe audio or video into subtitles and translate them.
For example, if an English-language video is sent to Gemini with instructions to translate into Chinese, the returned result is Chinese subtitles.
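A rough sketch of that single-request flow with the google-generativeai SDK might look like this (the file name and prompt wording are illustrative):

```python
import time

import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")

# Upload the media file via the File API; video files need a moment
# of server-side processing before they can be used in a request.
media = genai.upload_file("talk.mp4")  # illustrative file name
while media.state.name == "PROCESSING":
    time.sleep(5)
    media = genai.get_file(media.name)

prompt = (
    "Transcribe the speech in this video into SRT subtitles, "
    "then translate the subtitle text into Chinese. "
    "Return only valid SRT content."
)
response = model.generate_content([media, prompt])
print(response.text)
```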
1. Translate Subtitles Only
You can paste SRT-formatted subtitle content into the left text box, or click the "Upload SRT Subtitle" button to select a subtitle file from your computer.
Then set the target language you want to translate into, and the tool uses the "Three-Step Reflection Translation Method" to instruct Gemini to perform the translation. The result appears in the right text box; click the download button in the lower-right corner to save it as an SRT file on your computer.
2. Transcribe Audio and Video into Subtitles
Click the "Upload Audio and Video to Transcribe to Subtitles" button on the left, select any audio or video file to upload. After the upload is complete, submit it. After processing, Gemini will return the subtitle content recognized based on the speech sounds in the audio and video. The effect is not bad.
If a target language is also specified, Gemini will translate the recognized subtitles into that language before returning them; that is, it completes both tasks, generating subtitles and translating them, in a single pass.
Download address:
https://github.com/jianchang512/ai2srt/releases/download/v0.2/windows-ai2srt-0.2.7z