When AI Dubbing Meets Video: An Automated Engineering Practice for Audio-Visual Synchronization
Dubbing a video from one language into another has become increasingly common. Whether for knowledge sharing, film and television, or product introductions, good localized dubbing can significantly bridge the gap with the audience. But behind the scenes, a thorny issue persists: how to achieve audio-visual synchronization?
Linguistic differences are inherent. A 3-second Chinese dialogue might take 4.5 seconds to translate into English, or 5 seconds into German. Even in the same language, the duration of generated speech can vary significantly depending on the TTS (Text-to-Speech) engine, the speaker, or even the same speaker's emotional state.
This mismatch in duration directly leads to a disconnect between the audio and the speaker on screen. When the audience sees a person's mouth close while the voice continues, the sense of being pulled out of the experience is devastating.
Manually aligning every single line of dubbing can, of course, achieve perfection. But when faced with a video containing hundreds or thousands of subtitles, and potentially many more videos waiting in the queue, this becomes a tedious and time-consuming nightmare. We need an automated solution.
This article shares the exploration process of such an automated solution. It uses Python, leveraging the powerful ffmpeg and pydub libraries, to find an acceptable synchronization point between the dubbed audio and the original video. It doesn't aim for pixel-perfect alignment but seeks to build a robust, reliable, and automated engineering workflow. In most cases, this process can generate a video that sounds and looks natural enough.
The Core Idea: Finding a Balance Between Audio and Video
The root of the problem is the time difference. The trouble begins when the duration of the dubbed audio is longer than the duration of the corresponding video segment for the original subtitle. We need a way to "create" extra time out of thin air.
This challenge only arises when the dubbing is too long. If the dubbing is shorter than the video segment, at worst, the character finishes speaking early while their mouth is still moving. This is relatively acceptable visually and doesn't disrupt the subsequent timeline. But overly long dubbing will encroach on the playback time of the next line, causing audio overlap or misaligning the entire timeline. This is the core conflict we must resolve.
There are really only two options: either shorten the audio or extend the video.
Shortening the audio means speeding it up. Python's pydub library provides a speedup method, which is simple to use. But its drawback is also obvious: once the speed-up factor exceeds 1.5x, the audio pitch starts to distort, the speech becomes too fast, and it sounds strange. Beyond 2x, the dubbing essentially loses its ability to convey information.
Extending the video means slowing it down. ffmpeg's setpts filter is a powerful tool for this purpose. A single filter expression like setpts=2.0*PTS smoothly doubles the duration of a video clip. This buys us valuable time. But similarly, excessive slow-motion can make the characters' movements look sluggish and unnatural, as if they were trapped in a "slow-motion" sequence.
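For orientation, here is roughly what each adjustment looks like in isolation. This is only a minimal sketch: the file names and the 1.3 factor are placeholders, and the real pipeline drives both operations through helper functions rather than calling them directly.

```python
import subprocess
from pydub import AudioSegment

# Speed up a dubbed line by 1.3x with pydub (file names are placeholders).
audio = AudioSegment.from_file("line_0001.wav")
faster = audio.speedup(playback_speed=1.3)   # artifacts become audible past ~1.5x
faster.export("line_0001_fast.wav", format="wav")

# Stretch a video clip to 1.3x its duration with ffmpeg's setpts filter.
subprocess.run([
    "ffmpeg", "-y", "-i", "clip_0001.mp4",
    "-filter:v", "setpts=1.3*PTS",   # larger factor = slower playback
    "-an",                           # clips here are video-only; audio is merged separately
    "clip_0001_slow.mp4",
], check=True)
```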
A good automated strategy must find a balance between these two. Our initial idea was simple:
- If the time difference is small, say less than 1 second, let the audio bear the pressure. A slight speed-up is usually imperceptible to the human ear.
- If the time difference is significant, then the audio and video should share the burden. For example, the extra time could be split fifty-fifty. The audio is sped up a little, and the video is slowed down a little, keeping the distortion on both sides to a minimum.
This idea formed the cornerstone of our approach. But when we actually started writing the code, we discovered that the engineering implementation was far more complex than imagined.
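Before getting into those complications, the core decision rule by itself fits in a few lines. The sketch below only illustrates the balancing idea; the 1-second threshold mirrors the rule of thumb above, and everything else is simplified compared to the final class.

```python
def plan_adjustment(dubb_ms: int, source_ms: int) -> tuple[int, int]:
    """Return (target_audio_ms, target_video_ms) for one subtitle (illustrative sketch)."""
    over_ms = dubb_ms - source_ms
    if over_ms <= 0:
        # The dubbing already fits; leave audio and video untouched.
        return dubb_ms, source_ms
    if over_ms <= 1000:
        # Small overrun: a slight audio speed-up is usually imperceptible.
        return source_ms, source_ms
    # Large overrun: split the extra time fifty-fifty between audio and video.
    share = over_ms // 2
    return dubb_ms - share, source_ms + share
```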
First Attempt: A Fragile Loop and Intertwined Logic
The most intuitive way to write this is to iterate through each subtitle. Inside the loop, get the dubbing duration and compare it with the original duration. If the dubbing is too long, decide on the spot whether to speed up the audio or slow down the video, and then immediately execute the ffmpeg or pydub command.
This approach seems direct but hides enormous risks. It couples completely different types of operations—"decision-making," "file I/O," and "state updates"—all within one large loop.
This means that if any part of the loop fails, for instance, if a video clip fails to process due to a minor ffmpeg issue, the entire process could be interrupted. Even if it doesn't halt, state corruption could lead to unpredictable errors in subsequent iterations.
A more robust architecture must decouple the process and break it down into several independent, atomic stages.
- Preparation Stage: First, go through all subtitles completely with one goal: to collect information. Calculate and store each subtitle's original start/end times, original duration, dubbed audio duration, and the duration of the "silent gap" between it and the next subtitle.
- Decision Stage: Go through the list again, this time only performing calculations and making decisions. Based on our established balancing strategy, calculate the "target audio duration" and "target video duration" for each subtitle that needs adjustment. No files are modified at this stage.
- Execution Stage: With a clear "blueprint" in hand, now it's time to act. Based on the results from the decision stage, process all audio and video files in batches, or even in parallel. Audio speed-up and video processing can be executed separately.
- Merging Stage: Once all independent audio and video clips have been processed, the final step is to concatenate them in the correct order to generate the final file.
Giving each stage a single responsibility makes the code clearer and makes errors easier to handle and debug. This is the first step from "it works" to "it's reliable."
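As a taste of how simple each isolated stage can be, the merging stage usually boils down to ffmpeg's concat demuxer. A minimal sketch, with placeholder clip names (the real code builds the list from its cache folder):

```python
import subprocess

clips = ["00000_gap.mp4", "00000_sub.mp4", "00001_sub.mp4"]  # placeholder clip names

# The concat demuxer expects a text file listing the inputs in playback order.
with open("concat_list.txt", "w", encoding="utf-8") as f:
    for clip in clips:
        f.write(f"file '{clip}'\n")

# Stream-copy the clips into one file without re-encoding.
subprocess.run([
    "ffmpeg", "-y", "-f", "concat", "-safe", "0",
    "-i", "concat_list.txt", "-c", "copy", "merged.mp4",
], check=True)
```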
The Silent Enemy: Absorbing Gaps and Eliminating Errors
A video's timeline is continuous. Between subtitles, there are often a few seconds of "silent gaps" with no dialogue. These gaps are part of the video's narrative rhythm, and if handled poorly, the entire film will feel strange.
A natural idea is to treat these gaps as a special type of clip. After subtitle A ends, if there's a 2-second gap before subtitle B begins, we would cut out that 2-second video segment as well.
But this introduces a new problem: what if the gap is extremely short, say only 30 milliseconds?
ffmpeg's behavior is unstable when processing such extremely short clips. A video is composed of frames, and a single frame typically lasts between 16ms and 42ms (corresponding to 60 FPS down to 24 FPS). You can't ask ffmpeg to precisely cut a 30ms clip, as it might not even contain a single full frame. Forcing the operation would likely result in a command failure or a 0-byte empty file.
Our initial solution was to "discard" it. If a gap is too short, for example, less than 50 milliseconds, we would just ignore it. But we quickly rejected this idea. A long video might have hundreds or thousands of these tiny gaps. Dropping a frame or two each time would accumulate into a noticeable "stuttering" effect, making the video feel disjointed. This experience is unacceptable.
A better strategy is "absorption."
After processing a subtitle clip, we look ahead at the gap that follows it. If this gap is very short (below our set threshold of 50ms), we "absorb" this tiny gap, treating it as part of the current subtitle clip.
For example:
- Subtitle A: 00:10.000 -> 00:12.500
- A tiny 40ms gap
- Subtitle B: 00:12.540 -> 00:15.000
With the "absorption" strategy, when processing subtitle A, we notice that the following gap is only 40ms. So we extend our clipping end point from 12.500 directly to 12.540. This way, the 40ms gap is seamlessly merged into the end of clip A.
This approach has two major benefits:
- Prevents dropped frames: The video timeline remains continuous, with no content discarded.
- Provides extra space: Clip A's original duration increases from 2.5 seconds to 2.54 seconds. If this clip happens to need a video slow-down, this extra 40ms provides a valuable buffer, allowing us to slightly reduce the slow-down rate and make the visuals more natural.
The core of this strategy is to dynamically adjust the clipping end points and carefully maintain a record of the timeline's progression to ensure that absorbed gaps are not processed again later.
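Condensed to its essence, the look-ahead can be sketched like this, using the same start_time_source/end_time_source fields and the 50ms threshold that the class below uses:

```python
MIN_CLIP_DURATION_MS = 50  # gaps shorter than this are absorbed rather than cut separately

def clip_cut_range(subtitles: list[dict], i: int) -> tuple[int, int]:
    """Return the (start_ms, end_ms) cut range for subtitle i, absorbing a tiny trailing gap."""
    it = subtitles[i]
    start, end = it['start_time_source'], it['end_time_source']
    if i + 1 < len(subtitles):
        gap_after = subtitles[i + 1]['start_time_source'] - end
        if 0 < gap_after < MIN_CLIP_DURATION_MS:
            # Extend the cut so the tiny gap becomes part of this clip.
            end = subtitles[i + 1]['start_time_source']
    return start, end
```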
Designing for Failure: A Resilient Processing Pipeline
Real-world media files are far "dirtier" than we imagine. A video might have a slight codec error at a certain point, or an unreasonable slow-down parameter (e.g., applying a very high slow-down rate to an already short clip) could cause ffmpeg to fail. If our program crashes entirely because of one failed clip, that's an engineering failure.
We must design for failure. In the video processing execution stage, we introduce a try-check-fallback mechanism.
The process is as follows:
- Try: For a given clip, execute our calculated ffmpeg clipping command, which may include speed adjustment parameters.
- Check: Immediately after the command executes, check that the output file exists and has a size greater than 0.
- Fallback: If the check fails, a warning is logged. Then the program immediately calls ffmpeg again, but this time in safe mode: without any speed adjustment parameters, just clipping at the original speed.
This fallback mechanism ensures that even if our slow-down operation on a clip fails, we at least get an original clip with the correct duration. This preserves the integrity of the entire video timeline and prevents all subsequent clips from being misaligned.
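The shape of the mechanism is easy to show in isolation. The sketch below shells out to ffmpeg directly, whereas the real class goes through a tools.cut_from_video helper, but the try-check-fallback structure is the same:

```python
import subprocess
from pathlib import Path

def cut_with_fallback(source: str, ss: str, to: str, pts: float, out: str) -> bool:
    """Cut a clip, optionally slowed by `pts`; retry at original speed if the output is bad."""
    def _cut(apply_pts: bool) -> None:
        cmd = ["ffmpeg", "-y", "-i", source, "-ss", ss, "-to", to]
        if apply_pts and pts > 1.01:
            cmd += ["-filter:v", f"setpts={pts}*PTS"]
        cmd += ["-an", out]
        subprocess.run(cmd, check=False)  # don't raise; success is judged by the output file

    _cut(apply_pts=True)                                        # try
    if Path(out).exists() and Path(out).stat().st_size > 0:     # check
        return True
    _cut(apply_pts=False)                                       # fallback: plain cut
    return Path(out).exists() and Path(out).stat().st_size > 0
```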
The Final Architecture: A Flexible, Decoupled SpeedRate Class
After repeated iteration and optimization, we arrived at a relatively robust SpeedRate class. It encapsulates the entire complex synchronization process into a clear and reliable execution flow. Below, we'll look at how its key parts work together.
import os
import shutil
import time
from pathlib import Path
import concurrent.futures
from pydub import AudioSegment
from pydub.exceptions import CouldntDecodeError
from videotrans.configure import config
from videotrans.util import tools
class SpeedRate:
"""
通过音频加速和视频慢放来对齐翻译配音和原始视频时间轴。
V10 更新日志:
- 【策略优化】引入微小间隙“吸收”策略,替代原有的“丢弃”策略。
当一个字幕片段后的间隙小于阈值时,该间隙将被并入前一个字幕片段进行处理,
避免了“跳帧”现象,并为视频慢速提供了额外时长。
- 相应地调整了 video_pts 的计算逻辑,以适应动态变化的片段时长。
"""
    MIN_CLIP_DURATION_MS = 50  # minimum valid clip duration (milliseconds)
def __init__(self,
*,
queue_tts=None,
shoud_videorate=False,
shoud_audiorate=False,
uuid=None,
novoice_mp4=None,
raw_total_time=0,
noextname=None,
target_audio=None,
cache_folder=None
):
self.queue_tts = queue_tts
self.shoud_videorate = shoud_videorate
self.shoud_audiorate = shoud_audiorate
self.uuid = uuid
self.novoice_mp4_original = novoice_mp4
self.novoice_mp4 = novoice_mp4
self.raw_total_time = raw_total_time
self.noextname = noextname
self.target_audio = target_audio
self.cache_folder = cache_folder if cache_folder else Path(f'{config.TEMP_DIR}/{str(uuid if uuid else time.time())}').as_posix()
Path(self.cache_folder).mkdir(parents=True, exist_ok=True)
self.max_audio_speed_rate = max(1.0, float(config.settings.get('audio_rate', 5.0)))
self.max_video_pts_rate = max(1.0, float(config.settings.get('video_rate', 10.0)))
config.logger.info(f"SpeedRate initialized for '{self.noextname}'. AudioRate: {self.shoud_audiorate}, VideoRate: {self.shoud_videorate}")
config.logger.info(f"Config limits: MaxAudioSpeed={self.max_audio_speed_rate}, MaxVideoPTS={self.max_video_pts_rate}, MinClipDuration={self.MIN_CLIP_DURATION_MS}ms")
def run(self):
"""主执行函数"""
self._prepare_data()
self._calculate_adjustments()
self._execute_audio_speedup()
self._execute_video_processing()
merged_audio = self._recalculate_timeline_and_merge_audio()
if merged_audio:
self._finalize_audio(merged_audio)
return self.queue_tts
def _prepare_data(self):
"""第一步:准备和初始化数据。"""
tools.set_process(text="Preparing data...", uuid=self.uuid)
        # Phase 1: initialize the independent per-item fields
for it in self.queue_tts:
it['start_time_source'] = it['start_time']
it['end_time_source'] = it['end_time']
it['source_duration'] = it['end_time_source'] - it['start_time_source']
it['dubb_time'] = int(tools.get_audio_time(it['filename']) * 1000) if tools.vail_file(it['filename']) else 0
it['target_audio_duration'] = it['dubb_time']
it['target_video_duration'] = it['source_duration']
it['video_pts'] = 1.0
        # Phase 2: compute the silent gap after each subtitle
for i, it in enumerate(self.queue_tts):
if i < len(self.queue_tts) - 1:
next_item = self.queue_tts[i + 1]
it['silent_gap'] = next_item['start_time_source'] - it['end_time_source']
else:
it['silent_gap'] = self.raw_total_time - it['end_time_source']
it['silent_gap'] = max(0, it['silent_gap'])
def _calculate_adjustments(self):
"""第二步:计算调整方案。"""
tools.set_process(text="Calculating adjustments...", uuid=self.uuid)
for i, it in enumerate(self.queue_tts):
if it['dubb_time'] > it['source_duration'] and tools.vail_file(it['filename']):
try:
original_dubb_time = it['dubb_time']
_, new_dubb_length_ms = tools.remove_silence_from_file(
it['filename'], silence_threshold=-50.0, chunk_size=10, is_start=True)
it['dubb_time'] = new_dubb_length_ms
if original_dubb_time != it['dubb_time']:
config.logger.info(f"Removed silence from {Path(it['filename']).name}: duration reduced from {original_dubb_time}ms to {it['dubb_time']}ms.")
except Exception as e:
config.logger.warning(f"Could not remove silence from {it['filename']}: {e}")
            # After absorbing a tiny gap, the usable video duration may increase
effective_source_duration = it['source_duration']
if it.get('silent_gap', 0) < self.MIN_CLIP_DURATION_MS:
effective_source_duration += it['silent_gap']
if it['dubb_time'] <= effective_source_duration or effective_source_duration <= 0:
continue
dub_duration = it['dubb_time']
            # Use the effective duration for the calculations below
source_duration = effective_source_duration
silent_gap = it['silent_gap']
over_time = dub_duration - source_duration
            # The decision logic is now based on effective_source_duration
if self.shoud_audiorate and not self.shoud_videorate:
required_speed = dub_duration / source_duration
if required_speed <= 1.5:
it['target_audio_duration'] = source_duration
else:
                    # Note: after absorption the silent_gap is effectively 0; kept for logical completeness
available_time = source_duration + (silent_gap if silent_gap >= self.MIN_CLIP_DURATION_MS else 0)
duration_at_1_5x = dub_duration / 1.5
it['target_audio_duration'] = duration_at_1_5x if duration_at_1_5x <= available_time else available_time
elif not self.shoud_audiorate and self.shoud_videorate:
required_pts = dub_duration / source_duration
if required_pts <= 1.5:
it['target_video_duration'] = dub_duration
else:
available_time = source_duration + (silent_gap if silent_gap >= self.MIN_CLIP_DURATION_MS else 0)
duration_at_1_5x = source_duration * 1.5
it['target_video_duration'] = duration_at_1_5x if duration_at_1_5x <= available_time else available_time
elif self.shoud_audiorate and self.shoud_videorate:
if over_time <= 1000:
it['target_audio_duration'] = source_duration
else:
adjustment_share = over_time // 2
it['target_audio_duration'] = dub_duration - adjustment_share
it['target_video_duration'] = source_duration + adjustment_share
            # Safety checks and PTS calculation
if it['target_audio_duration'] < dub_duration:
speed_ratio = dub_duration / it['target_audio_duration']
if speed_ratio > self.max_audio_speed_rate: it['target_audio_duration'] = dub_duration / self.max_audio_speed_rate
if it['target_video_duration'] > source_duration:
pts_ratio = it['target_video_duration'] / source_duration
if pts_ratio > self.max_video_pts_rate: it['target_video_duration'] = source_duration * self.max_video_pts_rate
            # PTS must be computed from the source duration that is actually cut
it['video_pts'] = max(1.0, it['target_video_duration'] / source_duration)
def _process_single_audio(self, item):
"""处理单个音频文件的加速任务"""
input_file_path = item['filename']
target_duration_ms = int(item['target_duration_ms'])
try:
audio = AudioSegment.from_file(input_file_path)
current_duration_ms = len(audio)
if target_duration_ms <= 0 or current_duration_ms <= target_duration_ms: return input_file_path, current_duration_ms, ""
speedup_ratio = current_duration_ms / target_duration_ms
fast_audio = audio.speedup(playback_speed=speedup_ratio)
            config.logger.info(f'Audio speed-up applied: {speedup_ratio=}')
fast_audio.export(input_file_path, format=Path(input_file_path).suffix[1:])
item['ref']['dubb_time'] = len(fast_audio)
return input_file_path, len(fast_audio), ""
except Exception as e:
config.logger.error(f"Error processing audio {input_file_path}: {e}")
return input_file_path, None, str(e)
def _execute_audio_speedup(self):
"""第三步:执行音频加速。"""
if not self.shoud_audiorate: return
tasks = [
{"filename": it['filename'], "target_duration_ms": it['target_audio_duration'], "ref": it}
for it in self.queue_tts if it.get('dubb_time', 0) > it.get('target_audio_duration', 0) and tools.vail_file(it['filename'])
]
if not tasks: return
with concurrent.futures.ThreadPoolExecutor() as executor:
futures = [executor.submit(self._process_single_audio, task) for task in tasks]
for i, future in enumerate(concurrent.futures.as_completed(futures)):
if config.exit_soft: executor.shutdown(wait=False, cancel_futures=True); return
future.result()
tools.set_process(text=f"Audio processing: {i + 1}/{len(tasks)}", uuid=self.uuid)
def _execute_video_processing(self):
"""第四步:执行视频裁切(采用微小间隙吸收策略)。"""
if not self.shoud_videorate or not self.novoice_mp4_original:
return
video_tasks = []
processed_video_clips = []
last_end_time = 0
i = 0
while i < len(self.queue_tts):
it = self.queue_tts[i]
            # Handle the gap before the current subtitle clip
gap_before = it['start_time_source'] - last_end_time
if gap_before > self.MIN_CLIP_DURATION_MS:
clip_path = Path(f'{self.cache_folder}/{i:05d}_gap.mp4').as_posix()
video_tasks.append({"ss": tools.ms_to_time_string(ms=last_end_time), "to": tools.ms_to_time_string(ms=it['start_time_source']), "source": self.novoice_mp4_original, "pts": 1.0, "out": clip_path})
processed_video_clips.append(clip_path)
            # Determine the cut end point of the current subtitle clip
start_ss = it['start_time_source']
end_to = it['end_time_source']
            # V10 core logic: look ahead to decide whether to absorb the next gap
if i + 1 < len(self.queue_tts):
next_it = self.queue_tts[i+1]
gap_after = next_it['start_time_source'] - it['end_time_source']
if 0 < gap_after < self.MIN_CLIP_DURATION_MS:
                    end_to = next_it['start_time_source']  # extend the cut end point
config.logger.info(f"Absorbing small gap ({gap_after}ms) after segment {i} into the clip.")
current_clip_source_duration = end_to - start_ss
            # Only create a task when the clip is long enough to be valid
if current_clip_source_duration > self.MIN_CLIP_DURATION_MS:
clip_path = Path(f"{self.cache_folder}/{i:05d}_sub.mp4").as_posix()
                # If the clip needs slow-down, the PTS may have to be recomputed
pts_val = it.get('video_pts', 1.0)
if pts_val > 1.01:
                    # new PTS = target duration / new source duration
new_target_duration = it.get('target_video_duration', current_clip_source_duration)
pts_val = max(1.0, new_target_duration / current_clip_source_duration)
video_tasks.append({"ss": tools.ms_to_time_string(ms=start_ss), "to": tools.ms_to_time_string(ms=end_to), "source": self.novoice_mp4_original, "pts": pts_val, "out": clip_path})
processed_video_clips.append(clip_path)
last_end_time = end_to
i += 1
        # Handle the final gap at the end of the video
if (final_gap := self.raw_total_time - last_end_time) > self.MIN_CLIP_DURATION_MS:
clip_path = Path(f'{self.cache_folder}/zzzz_final_gap.mp4').as_posix()
video_tasks.append({"ss": tools.ms_to_time_string(ms=last_end_time), "to": "", "source": self.novoice_mp4_original, "pts": 1.0, "out": clip_path})
processed_video_clips.append(clip_path)
        # ... (the execution and merge logic below is unchanged from earlier versions) ...
for j, task in enumerate(video_tasks):
if config.exit_soft: return
tools.set_process(text=f"Video processing: {j + 1}/{len(video_tasks)}", uuid=self.uuid)
the_pts = task['pts'] if task.get('pts', 1.0) > 1.01 else ""
            config.logger.info(f'Video slow-down: {the_pts=}, output clip={task["out"]}')
tools.cut_from_video(ss=task['ss'], to=task['to'], source=task['source'], pts=the_pts, out=task['out'])
output_path = Path(task['out'])
if not output_path.exists() or output_path.stat().st_size == 0:
config.logger.warning(f"Segment {task['out']} failed to generate (PTS={task.get('pts', 1.0)}). Fallback to original speed.")
tools.cut_from_video(ss=task['ss'], to=task['to'], source=task['source'], pts="", out=task['out'])
if not output_path.exists() or output_path.stat().st_size == 0:
config.logger.error(f"FATAL: Fallback for {task['out']} also failed. Segment will be MISSING.")
valid_clips = [clip for clip in processed_video_clips if Path(clip).exists() and Path(clip).stat().st_size > 0]
if not valid_clips:
config.logger.warning("No valid video clips generated to merge. Skipping video merge.")
self.novoice_mp4 = self.novoice_mp4_original
return
concat_txt_path = Path(f'{self.cache_folder}/concat_list.txt').as_posix()
tools.create_concat_txt(valid_clips, concat_txt=concat_txt_path)
merged_video_path = Path(f'{self.cache_folder}/merged_{self.noextname}.mp4').as_posix()
tools.set_process(text="Merging video clips...", uuid=self.uuid)
tools.concat_multi_mp4(out=merged_video_path, concat_txt=concat_txt_path)
self.novoice_mp4 = merged_video_path
def _recalculate_timeline_and_merge_audio(self):
"""第五步:重新计算时间线并合并音频。"""
merged_audio = AudioSegment.empty()
video_was_processed = self.shoud_videorate and self.novoice_mp4_original and Path(self.novoice_mp4).name.startswith("merged_")
if video_was_processed:
config.logger.info("Building audio timeline based on processed video clips.")
current_timeline_ms = 0
try:
sorted_clips = sorted([f for f in os.listdir(self.cache_folder) if f.endswith(".mp4") and ("_sub" in f or "_gap" in f)])
except FileNotFoundError: return None
for clip_filename in sorted_clips:
clip_path = Path(f'{self.cache_folder}/{clip_filename}').as_posix()
try:
if not (Path(clip_path).exists() and Path(clip_path).stat().st_size > 0): continue
clip_duration = tools.get_video_duration(clip_path)
except Exception as e:
config.logger.warning(f"Could not get duration for clip {clip_path} (error: {e}). Skipping.")
continue
if "_sub" in clip_filename:
index = int(clip_filename.split('_')[0])
it = self.queue_tts[index]
it['start_time'] = current_timeline_ms
segment = AudioSegment.from_file(it['filename']) if tools.vail_file(it['filename']) else AudioSegment.silent(duration=clip_duration)
if len(segment) > clip_duration: segment = segment[:clip_duration]
elif len(segment) < clip_duration: segment += AudioSegment.silent(duration=clip_duration - len(segment))
merged_audio += segment
it['end_time'] = current_timeline_ms + clip_duration
it['startraw'], it['endraw'] = tools.ms_to_time_string(ms=it['start_time']), tools.ms_to_time_string(ms=it['end_time'])
else: # gap
merged_audio += AudioSegment.silent(duration=clip_duration)
current_timeline_ms += clip_duration
else:
            # Mode B stays unchanged: it does not touch the video, so gap absorption is irrelevant
config.logger.info("Building audio timeline based on original timings (video not processed).")
last_end_time = 0
for i, it in enumerate(self.queue_tts):
silence_duration = it['start_time_source'] - last_end_time
if silence_duration > 0: merged_audio += AudioSegment.silent(duration=silence_duration)
it['start_time'] = len(merged_audio)
dubb_time = int(tools.get_audio_time(it['filename']) * 1000) if tools.vail_file(it['filename']) else it['source_duration']
segment = AudioSegment.from_file(it['filename']) if tools.vail_file(it['filename']) else AudioSegment.silent(duration=dubb_time)
if len(segment) > dubb_time: segment = segment[:dubb_time]
elif len(segment) < dubb_time: segment += AudioSegment.silent(duration=dubb_time - len(segment))
merged_audio += segment
it['end_time'] = len(merged_audio)
last_end_time = it['end_time_source']
it['startraw'], it['endraw'] = tools.ms_to_time_string(ms=it['start_time']), tools.ms_to_time_string(ms=it['end_time'])
return merged_audio
def _export_audio(self, audio_segment, destination_path):
"""将Pydub音频段导出到指定路径,处理不同格式。"""
wavfile = Path(f'{self.cache_folder}/temp_{time.time_ns()}.wav').as_posix()
try:
audio_segment.export(wavfile, format="wav")
ext = Path(destination_path).suffix.lower()
if ext == '.wav':
shutil.copy2(wavfile, destination_path)
elif ext == '.m4a':
tools.wav2m4a(wavfile, destination_path)
else: # .mp3
tools.runffmpeg(["-y", "-i", wavfile, "-ar", "48000", "-b:a", "192k", destination_path])
finally:
if Path(wavfile).exists():
os.remove(wavfile)
def _finalize_audio(self, merged_audio):
"""第六步:导出并对齐最终音视频时长(仅在视频被处理时)。"""
tools.set_process(text="Exporting and finalizing audio...", uuid=self.uuid)
try:
self._export_audio(merged_audio, self.target_audio)
video_was_processed = self.shoud_videorate and self.novoice_mp4_original and Path(self.novoice_mp4).name.startswith("merged_")
if not video_was_processed:
config.logger.info("Skipping duration alignment as video was not processed.")
return
if not (tools.vail_file(self.novoice_mp4) and tools.vail_file(self.target_audio)):
config.logger.warning("Final video or audio file not found, skipping duration alignment.")
return
video_duration_ms = tools.get_video_duration(self.novoice_mp4)
audio_duration_ms = int(tools.get_audio_time(self.target_audio) * 1000)
padding_needed = video_duration_ms - audio_duration_ms
if padding_needed > 10:
config.logger.info(f"Audio is shorter than video by {padding_needed}ms. Padding with silence.")
final_audio_segment = AudioSegment.from_file(self.target_audio)
final_audio_segment += AudioSegment.silent(duration=padding_needed)
self._export_audio(final_audio_segment, self.target_audio)
elif padding_needed < -10:
config.logger.warning(f"Final audio is longer than video by {-padding_needed}ms. This may cause sync issues.")
except Exception as e:
config.logger.error(f"Failed to export or finalize audio: {e}")
raise RuntimeError(f"Failed to finalize audio: {e}")
config.logger.info("Final audio merged and aligned successfully.")
Code Breakdown
- __init__: Initializes all parameters and defines the crucial MIN_CLIP_DURATION_MS constant, which is the foundation for our tiny-clip handling strategy.
- _prepare_data: Employs a robust two-phase method to prepare data, completely resolving the potential KeyError that could arise from "looking ahead" in a single loop.
- _calculate_adjustments: The core of the decision-making. It first tries to reduce the subsequent processing load by trimming silent "padding" from the dubbed audio, then performs calculations based on our balancing strategy.
- _execute_audio_speedup: Utilizes multithreading to process all audio files that need speeding up in parallel, improving efficiency.
- _execute_video_processing: The most complex part of the entire flow and the best demonstration of the engineering practices involved. It implements the "absorption" strategy to ensure video continuity and includes a built-in "try-check-fallback" fault-tolerance mechanism, which is the cornerstone of the entire process's stability.
- _recalculate_timeline_and_merge_audio: Designed to be very flexible. It automatically detects whether the video was actually processed and chooses different modes to construct the final audio timeline. This design allows the class to handle complex audio-visual synchronization tasks as well as simpler audio-only concatenation jobs.
- _finalize_audio: The final "quality control" step. If the video was processed, it ensures that the final generated audio track has exactly the same duration as the video, an essential detail in a professional workflow.
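Finally, here is a hypothetical invocation to show how the pieces fit together. The keyword arguments match the class's signature, but every concrete value below is made up for illustration:

```python
queue_tts = [
    {"start_time": 10_000, "end_time": 12_500, "filename": "/tmp/dub/0001.mp3"},
    {"start_time": 12_540, "end_time": 15_000, "filename": "/tmp/dub/0002.mp3"},
    # ... one entry per subtitle line
]

rate = SpeedRate(
    queue_tts=queue_tts,
    shoud_audiorate=True,                   # allow audio speed-up
    shoud_videorate=True,                   # allow video slow-down
    uuid="demo-task",
    novoice_mp4="/tmp/demo/novoice.mp4",    # original video with the voice track removed
    raw_total_time=120_000,                 # total video duration in milliseconds
    noextname="demo",
    target_audio="/tmp/demo/target.m4a",
    cache_folder="/tmp/demo/cache",
)
queue_tts = rate.run()   # items now carry recalculated start_time/end_time/startraw/endraw
```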
Usable, But Far From Perfect
Audio-visual synchronization, especially across languages, is a field full of details and challenges. The automated solution proposed in this article is not the final word, nor can it completely replace the fine-tuning of a professional. Its value lies in building a "smart" and "resilient" automated process through a series of carefully designed engineering practices—logic decoupling, absorption strategy, and fault-tolerant fallbacks. It can handle the vast majority of scenarios and gracefully bypass the pitfalls that would crash a simpler script.
It is a product of finding a practical balance between "perfect results" and "engineering feasibility." For scenarios requiring high-volume, rapid processing of video dubbing, it provides a reliable starting point that can automate 80% of the work and generate a first draft of acceptable quality. The remaining 20% can be left for manual work to add the finishing touches.