
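Many people who are just starting to translate subtitles with AI ask some version of the same question:

"Isn't it just translating subtitles? I told the AI one thing: 'Translate this SRT into Japanese and keep the original format, for example...', ran it on a few hundred videos, and never saw it make a mistake. You write prompts that run to hundreds or even thousands of words. Aren't you overcomplicating a simple problem?"

To be fair, if you only use it occasionally, or translate a handful of short videos by hand in a web chat, a one-line instruction really does solve more than ninety percent of the problems.

The picture changes completely once the job moves from "occasional personal use" to "an automated production line serving a large number of users." This particular line also has to handle more than 30 languages with very different grammatical structures, and the translated subtitles are then fed to a TTS model for dubbing and aligned with the source text to produce bilingual subtitles. At that point, relying on a simple prompt is gambling.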

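An analogy:

A simple prompt is like cooking a bowl of instant noodles for yourself at home. The water boils, the noodles go in, the seasoning packet goes on top, and three minutes later you are done. You do not need an operating procedure; you go by feel.

A complex prompt is the process manual for an instant noodle production line. Flour ratios, frying temperature, package sealing, and quality inspection standards all have to be written down step by step. That is not because the factory's cooks are worse at making noodles than you are at home. It is because the line has to produce a hundred thousand packs a day, sell them in more than thirty countries, and guarantee that every pack tastes the same and is equally safe. And the person running the machine may be an ordinary worker on their first day, not a seasoned noodle cook like you.

An automated subtitle translation pipeline works the same way. The three sections below explain why it needs this "process manual."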


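1. The models used as the AI translation engine can differ enormously

When you chat with an AI in a web interface, you are usually talking to a top-tier large model such as GPT-4.x/5.x, Gemini-3, Claude 4.x, or DeepSeek-4. Their comprehension is very strong. Even if you casually say "keep the format unchanged," they can infer from context what you want and get it roughly right.

Real batch automation, however, is constrained by cost, response latency, privacy, and similar factors, so the model actually called underneath is often a small or mid-sized one, or even a locally deployed open-source model. Without strict constraints, these models make many frustrating "rookie mistakes":

  • Blurting out chatter: halfway through the translation the model suddenly adds "Here are your translated subtitles, I hope this helps. Would you like me to help you with xxx?" In a chat window this looks considerate. In an automated pipeline, that extra line breaks the SRT structure and crashes the downstream parser on the spot.
  • Adding unrequested formatting: the model habitually wraps its output in a Markdown code block (for example ```srt ...```). A human reader finds this tidy, but the program reading it throws an error because it does not recognize the extra markers.
  • Losing focus and dropping blocks: on long inputs or under highly concurrent requests, the model's attention drifts and it silently drops a subtitle block. Every subtitle after that point is then misaligned, and the bilingual subtitles pair each line with the wrong partner.

So the forceful wording in a complex prompt, such as "absolutely not allowed," "red line," and "verify before you output," is not there to bully the AI. It is there for backward compatibility: it draws an operational floor for models that are not very capable but have to be used anyway.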
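Even with a strict prompt, a pipeline usually still sanitizes what comes back. The following is a minimal sketch of such a defensive step, not the author's actual parser: it assumes the reply is wrapped in the `<TRANSLATE_TEXT>` tags that the prompt at the end of this article asks for, and the function names `extract_srt` and `parse_blocks` are hypothetical.

```python
import re

# A line that is nothing but a Markdown fence, e.g. ```srt or ```
FENCE_RE = re.compile(r"^\s*```[a-zA-Z]*\s*$")
TAG_RE = re.compile(r"<TRANSLATE_TEXT>(.*?)</TRANSLATE_TEXT>", re.DOTALL)
TIMESTAMP_RE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")


def extract_srt(raw: str) -> str:
    """Return only the SRT payload from a raw model reply."""
    match = TAG_RE.search(raw)
    # Prefer the tagged payload; fall back to the whole reply if tags are missing.
    body = match.group(1) if match else raw
    # Drop Markdown fence lines that some models add out of habit.
    lines = [ln for ln in body.splitlines() if not FENCE_RE.match(ln)]
    return "\n".join(lines).strip()


def parse_blocks(srt: str) -> list[tuple[str, str, str]]:
    """Split SRT text into (index, timestamp, text) tuples, skipping chatter."""
    blocks = []
    for chunk in re.split(r"\n\s*\n", srt.strip()):
        lines = [ln.strip() for ln in chunk.splitlines() if ln.strip()]
        # A real block starts with a numeric index followed by a timestamp line.
        if len(lines) >= 2 and lines[0].isdigit() and TIMESTAMP_RE.match(lines[1]):
            blocks.append((lines[0], lines[1], "\n".join(lines[2:])))
    return blocks
```

Chatter that is separated from the subtitles by a blank line is discarded because it does not look like a block. Chatter glued directly onto the last subtitle line would still slip through as subtitle text, which is one reason the prompt forbids it outright instead of relying on cleanup.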


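2. Grammatical differences across 30+ languages cannot be solved by "keep the original format"

When translating between English and Chinese, word order is not a prominent problem, because the sentence structures of the two languages have a lot in common. Once you extend to Japanese, Korean, German, Russian, Arabic, Thai, and others, the problems start.

A speech recognition model (Whisper, for example) splits subtitles according to the speaker's physical pauses: breaths, hesitations, shifts in tone. It does not care whether the grammar is complete.

Here is an example.

The English source is split into two blocks:

Block 1: I think I'm gonna
Block 2: go to the hospital right now.

Given a simple instruction to translate this into Japanese, the AI will instinctively read the two blocks as one sentence and output one complete, fluent Japanese sentence. The trouble is that Japanese places verbs at the end of the clause, so the word order gets rearranged, and the AI is quite likely to put the core action "go to the hospital" into the translation of Block 1.

But the speaker's voice saying those words is still in Block 2. While Block 1 is playing, he has not yet said "go to the hospital" at all. The result is a subtitle that spoils words that have not been spoken yet, and audio and picture are badly out of sync.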

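The same thing happens in the other direction. Suppose a Chinese source sentence is split into two blocks:

Block 1: 今天我们三个 (today, the three of us)
Block 2: 去哪里玩啊? (where shall we go for fun?)

Translated into English, this becomes "Where should the three of us go for fun today?" The model may merge it into a single block, or keep two blocks but move "today" into the second one, because that follows English grammar and reads more smoothly.

This is not the AI's fault. We simply never told it that the break points matter more than grammatical fluency.

That is why a complex prompt has to state the rule explicitly: no matter how awkward a translated fragment sounds, or how much it reads like half a sentence, the break points must align exactly with the source. The model may bridge the tone with an ellipsis, but it may not smuggle content from the next block into the current one. The four words "keep the original format" can never cover this logic.

There are also right-to-left scripts such as Arabic and Hebrew. Without constraints set in the prompt ahead of time, the digits in timestamps, punctuation marks, and directional control characters frequently end up flipped or scrambled, and the entire subtitle file becomes unusable.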


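3. Translation is not the end: TTS dubbing and bilingual alignment come next

First, TTS dubbing has a physical limit: time.

The translated text will be read aloud by an AI voice. For the same meaning, the number of syllables required can differ by a factor of two or more between languages. A short Chinese sentence often becomes noticeably longer once translated into English, Spanish, German, or Arabic.

If the prompt does not apply "density control" per language family, for example requiring no more than 3.5 spoken syllables per second for CJK languages and asking Western European languages to compress with contractions and short synonyms, the AI voice can only max out its speaking rate and fire off the lines like a machine gun. What the audience hears is audio that has drifted badly away from the picture.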
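A pipeline can also measure density after the fact and flag blocks that need a retry. The sketch below is illustrative only. It assumes that one CJK character is roughly one spoken syllable, which is a crude approximation (reasonable for Chinese, looser for Japanese), and the helper names are hypothetical. The 3.5 limit is the figure mentioned above.

```python
import re

TS_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an SRT time such as 00:00:02,600 to seconds."""
    h, m, s, ms = (int(g) for g in TS_RE.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def block_duration(timestamp_line: str) -> float:
    """Duration in seconds of a line like '00:00:02,600 --> 00:00:04,000'."""
    start, end = timestamp_line.split("-->")
    return to_seconds(end) - to_seconds(start)


def units_per_second(text: str, timestamp_line: str) -> float:
    """Rough density: letters and digits (punctuation excluded) per second."""
    duration = block_duration(timestamp_line)
    if duration <= 0:
        return float("inf")  # a zero-length block can never be spoken
    units = sum(1 for ch in text if ch.isalnum())
    return units / duration


def too_dense(text: str, timestamp_line: str, limit: float = 3.5) -> bool:
    """True if the text exceeds the assumed per-second budget for the block."""
    return units_per_second(text, timestamp_line) > limit
```

For a 1.4-second block, a seven-character Chinese line works out to about five units per second and would be flagged, while a three-character line stays under the limit. For alphabetic languages a character count is a poor proxy for syllables, so a different measure and threshold would be needed there.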

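Second, bilingual subtitles have one iron rule: the block counts must match strictly one to one.

However many subtitle blocks the source has, the translation must have exactly that many. No more, no fewer, and above all no merging.

Without an explicit constraint, though, the AI often does harm while trying to help. It decides that two short sentences read more smoothly together and merges them. When the system then pairs source and translation block by block according to index number, everything from the merge point onward is misaligned. One local error contaminates the whole file.

This is also why translation results so often contain blank blocks or come back with fewer subtitle blocks than the source.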
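Because a single merge shifts every later pairing, it is worth checking alignment mechanically before pairing anything. A minimal sketch, assuming both files have already been parsed into `(index, timestamp, text)` tuples; the function name `check_alignment` and that tuple shape are assumptions made for illustration:

```python
def check_alignment(source_blocks, translated_blocks):
    """Return a list of human-readable problems; an empty list means aligned."""
    problems = []
    if len(source_blocks) != len(translated_blocks):
        problems.append(
            f"block count mismatch: source has {len(source_blocks)}, "
            f"translation has {len(translated_blocks)}"
        )
    # Compare the blocks that do pair up, so the first divergence is visible.
    for src, dst in zip(source_blocks, translated_blocks):
        src_index, src_time, _ = src
        dst_index, dst_time, dst_text = dst
        if src_index != dst_index:
            problems.append(f"index changed: expected {src_index}, got {dst_index}")
        if src_time != dst_time:
            problems.append(f"timestamp changed in block {src_index}")
        if not dst_text.strip():
            problems.append(f"empty translation in block {src_index}")
    return problems
```

A batch that returns any problem can be retried or sent to a fallback model instead of being written to disk. The exact string comparison on the timestamp line would also catch timestamps altered by stray directional control characters in right-to-left output.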

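None of these requirements can be conveyed by a single "keep the original format," because they are not about format. They are hard constraints on physical time and on block count.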


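A complex prompt trades hope for certainty

In its own comfort zone, a simple prompt really is efficient and easy to use.

But when a system has to cope simultaneously with models of widely varying parameter scale, more than 30 languages with very different grammar, the fragmented sentence breaks produced by Whisper, the physical duration limits of TTS, and strict one-to-one block alignment for bilingual subtitles, betting correctness on "the AI will probably figure it out today" does not take the production line seriously.

A complex prompt is not "over-engineering." It is an engineering approach to confining a large model's free-wheeling behavior within a safe range. It looks tedious, but it has a single purpose:

To push the probability of error as low as possible in the most variable, least ideal real-world conditions.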

# ROLE
You are an expert "Multilingual Dubbing Script Adapter" and "SRT Formatter".
Your exact objective is to translate ONLY the SRT subtitles provided inside the `<INPUT>` tags into {lang}.
The output must be a clean, dubbing-ready SRT file where every subtitle block is a natural, spoken line in {lang}, strictly aligned one-to-one with the input blocks.

# CRITICAL PRINCIPLES

## 1. DUBBING-SAFE PACING & CONCISENESS (Universal Guidelines)
The translated text will be used for TTS voiceover. If the translation is too long, the audio will play too fast, causing audio-visual desync.
- **Aggressive Compression**: Prioritize core meaning using the shortest, most natural spoken expression in {lang}. Remove filler words, redundant modifiers, and simplify complex grammar.
- **Language-Specific Density Guidelines**:
  - **Alphabetic/Cyrillic scripts** (English, French, German, Russian, Spanish, Italian, Greek, Portuguese, Ukrainian, Hungarian, Czech, Polish, Dutch, Swedish, Norwegian, Romanian, Filipino, Indonesian, Malay, Turkish): Use contractions and short synonyms. The translated text must be speakable comfortably within the block's duration.
  - **CJK & Cantonese scripts** (Simplified Chinese, Traditional Chinese, Japanese, Korean, Cantonese): Keep character counts extremely low. Target 2.5–3.5 pronounced syllables per second of the block's duration. Prefer single-character or two-character words whenever possible.
  - **Abugida & Segmental scripts** (Thai, Vietnamese, Hindi, Bengali, Khmer): Avoid long compound words. For scripts without explicit word boundaries (e.g., Thai, Khmer), use very direct phrasing to prevent TTS engines from misinterpreting word breaks and dragging out the audio.
  - **RTL scripts** (Arabic, Hebrew, Persian, Urdu): Ensure high semantic density. Keep punctuation in correct logical positions so that bidirectional rendering does not corrupt timestamps or line orders.
- **Short block strategy**: If a block's duration is ≤1 second, use a single keyword or a 2–3 syllable phrase that captures the essential idea.

## 2. ABSOLUTE 1-TO-1 BLOCK MAPPING & "ZERO-SHIFT" RULE
The source subtitles often split a single sentence across multiple blocks due to speech pauses. You **MUST NOT** merge them, and you **MUST NOT** shift semantic elements between blocks to satisfy target grammar.
- **Local Semantic Equivalence**: Translate only the text physically present inside each individual block. If the source block contains an incomplete fragment, its {lang} translation must also remain an incomplete fragment.
- **No Word Shifting**: Do not move nouns, verbs, or key modifiers from Block 1 to Block 2 (or vice versa), even if the target language's natural word order (e.g., SVO vs. SOV) would normally require it. Each block must function as an isolated audio clip.
- **Ellipsis Bridging (`...`)**: If a block ends mid-thought or mid-clause, end the translation of that block with an ellipsis (`...`), and/or start the next block with an ellipsis (`...`). This maintains grammatical suspense and signals the TTS engine to keep a continuation tone rather than a falling end-of-sentence intonation.

## 3. SPOKEN REGISTER & LOCALIZATION
This script is for oral performance. Use the everyday, colloquial register of {lang} as heard in films and conversational media, not textbook or written language.
- Match the tone of the original (casual/formal), but always prioritize natural, spoken flow over a literal translation.
- Use contractions, informal sentence endings, and typical conversational fillers that fit {lang} (e.g., "gonna", "yeah", appropriate particles in Asian languages).

# ABSOLUTE FORMATTING RED LINES (SYSTEM-CRITICAL)

1. **STRICT 1-TO-1 BLOCK COUNT & COMPLIANCE**
   - Output block count MUST exactly equal Input block count.
   - **Self-Verification Protocol (Mandatory)**: Before generating your final response, silently count the blocks in your translation. If the count does not match the input, discard and rewrite.

2. **IMMUTABLE METADATA & STYLE TAGS**
   - Do NOT alter the Index Numbers (1, 2, 3…) or the Timestamps (`00:00:00,000 --> 00:00:00,000`).
   - Preserve non-verbal cues (e.g., `[laughter]`, `(gasp)`, `♪`) inside their respective blocks.
   - Preserve exactly one blank line between blocks (i.e., `\n\n`). No extra blank lines at the end.

3. **PURE OUTPUT – NO MARKDOWN FENCES**
   - Output ONLY the valid SRT content inside `<TRANSLATE_TEXT>` tags.
   - Do NOT wrap the output in markdown code blocks (such as ```srt or ```). The response must start directly with `<TRANSLATE_TEXT>` and end with `</TRANSLATE_TEXT>`.

4. **SILENT EXECUTION**
   - Do not output any conversational filler, explanations, or introductory text inside or outside the tags.

# EXAMPLE OF STRICT FRAGMENT MAPPING & CONDENSATION

*Source Input (sentence artificially split, contains a typo):*
1
00:00:01,000 --> 00:00:02,500
I think I'm gona

2
00:00:02,600 --> 00:00:04,000
go to the hospital right now.

*WRONG OUTPUT (merged, too long, semantic completion):*
1
00:00:01,000 --> 00:00:02,500
[Complete, merged translation of "I think I'm gonna go to the hospital right now" inside Block 1]
2
00:00:02,600 --> 00:00:04,000
[Empty, deleted, or repeated]

*CORRECT OUTPUT (Target: Spanish – concise, zero-shifted, ellipsis bridged, spoken style):*
<TRANSLATE_TEXT>
1
00:00:01,000 --> 00:00:02,500
Creo que voy a...

2
00:00:02,600 --> 00:00:04,000
...ir al hospital ya.
</TRANSLATE_TEXT>

*(Explanation: The break happens exactly where the original did — after "gona". Block 1 ends with an ellipsis and Block 2 starts with one. No semantic information like "ir al hospital" is moved into Block 1. "Right now" is condensed to "ya" to fit the 1.4-second limit of Block 2.)*

---

{GLOSSARY_DICT}

# ACTUAL TASK
Translate and adapt ONLY the following SRT batch into {lang}.
Ensure natural spoken flow in {lang}, strict conciseness for TTS dubbing, and ABSOLUTE 1-to-1 block mapping – never merge, never shift, and mirror the original fragmentation exactly.
Output the result inside `<TRANSLATE_TEXT>` tags.

<INPUT>
{batch_input}
</INPUT>
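The template above leaves three placeholders to be filled in per request: `{lang}`, `{GLOSSARY_DICT}`, and `{batch_input}`. One way to fill them is sketched below. The function name `render_prompt` is hypothetical, and the glossary is assumed to arrive as an already formatted string, since its format is not specified here.

```python
import re

# Only these three names are treated as placeholders; other braces are left alone.
PLACEHOLDER_RE = re.compile(r"\{(lang|GLOSSARY_DICT|batch_input)\}")


def render_prompt(template: str, lang: str, batch_input: str, glossary: str = "") -> str:
    """Fill the known placeholders in a single pass over the template."""
    values = {"lang": lang, "GLOSSARY_DICT": glossary, "batch_input": batch_input}
    # A single substitution pass means inserted values are never re-scanned,
    # so subtitle text that happens to contain "{lang}" is left untouched.
    return PLACEHOLDER_RE.sub(lambda m: values[m.group(1)], template)
```

Targeted substitution is used here instead of `str.format` because subtitle text or a later edit to the template may contain literal braces, which `str.format` would try to interpret.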