This is the third post in the series, in which we turn the narrow path of audio-visual subtitle synchronization into a passable dirt road. In the previous two posts, we acted like mechanics, tightening screws here and there: if a segment's audio and video were off by a dozen seconds, we patched it; if a speed-adjusted segment sounded shrill, we switched algorithms and recalculated. Ultimately, the drift on a 23-minute video, once visually noticeable at over ten seconds, converged to around 200 ms, acceptable for an engineering prototype.

But there's a gap between "it runs" and "it's usable," and bridging it requires a thorough overhaul. This post isn't about showing off more tricks. Instead, I want to lay out the entire approach so you can clearly see:

  • What problem are we actually solving?
  • What "strategic routes" did we devise to tackle it?
  • What does the final implemented code look like, and why was it designed that way?

If you've read the first two posts, you can treat this one as a "design document + lessons learned." If you haven't, starting here is fine—all key information will be covered again.


The Core Problem: In a Nutshell, Mismatched Timing

When dubbing a Chinese video into English, or other languages like Russian or German, the most common issue is "different speaking rates." The same line of dialogue might take 3 seconds in Chinese but 4 seconds in English. The person on screen has closed their mouth, but the audio is still playing—this instantly breaks the audience's immersion.

We can only do two things:

  1. Speed up the audio (compress).
  2. Slow down the video (stretch).

Both have side effects:

  • Compressing too much makes the audio high-pitched and shrill.
  • Stretching too much makes the action look like slow-motion playback.

So, the problem becomes: how to combine "compression" and "stretching" to minimize these side effects.
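To see why combining helps, take the 3-second vs. 4-second line from above: compressing audio alone needs a 1.33x rate, while sharing the load means each side absorbs only about 1.15x. Here is a toy model of that trade-off; the even square-root split is my illustration, not the project's actual formula:

def split_rates(video_s: float, dubbed_s: float) -> tuple[float, float]:
    """Split the mismatch between audio speed-up and video slow-down.

    Taking the square root gives each side half the distortion; the real
    pipeline applies its own rules, so treat this as a sketch.
    """
    ratio = dubbed_s / video_s    # e.g. 4.0 s of English over 3.0 s of video
    share = ratio ** 0.5          # ~1.155 for the 3 s vs 4 s example
    return share, share           # audio plays ~1.15x faster, video ~1.15x longer

print(split_rates(3.0, 4.0))      # -> (1.1547..., 1.1547...); both tracks land on ~3.46 s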


Four Strategic Routes

We broke down the possible approaches into four "modes," implemented as separate branches in the code. You can switch between them with a single click based on the content type.

| Mode | Core Idea | Use Case | Notes |
| --- | --- | --- | --- |
| Shared Burden | Speed up audio and slow down video simultaneously; each side compromises, distributing the distortion | General dialogue, news reports | Recommended default |
| Video Concedes | Slow down video only, preserving audio quality at all costs | Music videos, high-quality narration | Max 10x slowdown |
| Audio Concedes | Speed up audio only, preserving video integrity at all costs | Dance, action scenes | No limit on the speed-up factor |
| Preserve Original | No speed changes; pure concatenation | When explicitly requested by the user | Pads the end with a freeze frame or silence |

All subsequent code is built around supporting these four strategies within a single pipeline.
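As described under "How to Use" below, the strategy is selected by two boolean switches. The mapping below is my reading of that design, a sketch rather than the project's literal code:

def select_mode(shoud_audiorate: bool, shoud_videorate: bool) -> str:
    """Map the two switches onto the four strategies (assumed mapping)."""
    if shoud_audiorate and shoud_videorate:
        return "shared_burden"        # both sides compromise
    if shoud_videorate:
        return "video_concedes"       # slow down video only
    if shoud_audiorate:
        return "audio_concedes"       # speed up audio only
    return "preserve_original"        # pure concatenation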


From Blueprint to Reality: Three Major Revisions

V1: Direct Concatenation — The Snowballing Error

The initial approach was simple:

  • Calculate the required duration for each segment,
  • Cut them out with FFmpeg,
  • And concatenate them one by one.

It worked fine for a 5-minute clip, but on a 23-minute video, the error snowballed to 13 seconds. Floating-point inaccuracies, frame rate rounding, and timebase differences all came into play.
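A toy illustration of the snowball (all numbers invented): at 25 fps every cut snaps to a 40 ms frame grid, and when segment lengths happen to round in the same direction, the errors add up instead of cancelling:

FPS = 25
FRAME_MS = 1000 / FPS                  # every cut snaps to a 40 ms grid

drift = 0.0
for wanted_ms in [3370.0] * 400:       # ~400 segments in a 23-minute video
    actual_ms = round(wanted_ms / FRAME_MS) * FRAME_MS   # frame-boundary rounding
    drift += actual_ms - wanted_ms

print(f"total drift: {drift:.0f} ms")  # -10 ms per segment becomes -4000 ms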

V2: The Theoretical Model — Smaller, but Unresolved, Error

We introduced a "dynamic time offset":

  • The start time of each segment no longer depended on the actual result of the previous one;
  • Instead, it was calculated from a formula to determine its "theoretical start time."

The error dropped from 13 seconds to 3, but it was still not good enough.
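The "theoretical start time" fits in one line: each segment starts where the formula says the previous ones should end, not where they actually ended. A sketch, with a field name of my choosing:

def theoretical_start_ms(segments: list[dict], i: int) -> int:
    """Start of segment i, derived purely from target durations (the V2 idea)."""
    return sum(seg["target_duration_ms"] for seg in segments[:i])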

V3: Prioritizing Physical Reality — Error Converges to 200 ms

We completely abandoned prediction and started "measuring" directly:

  • After generating each video segment, immediately measure its actual duration using ffprobe.
  • The audio is then stitched together strictly according to this "measured blueprint."

After this change, the 23-minute video's sync was stable within 200 ms for the first time. For a 2-hour video, the error can be kept within about 1 second, which is acceptable.
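Measuring a segment's real duration is a single ffprobe call; a minimal wrapper (the function name is mine) could look like this:

import subprocess

def measure_duration(path: str) -> float:
    """Ask ffprobe for the container duration, in seconds."""
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1",
         path],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())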


Core Process Breakdown

Let's walk through the main steps of the SpeedRate class.

Entry Point run(): The Initial Fork

  • If the user selects the "Preserve Original" mode, it directly calls _run_no_rate_change_mode(), a separate branch that doesn't interfere with the more complex logic.
  • Otherwise, it follows the full pipeline: Prepare Data → Calculate Adjustments → Process Audio → Process Video → Rebuild Audio → Export.

_prepare_data(): Laying the Foundation

  • Read the frame rate, calculate each subtitle's "original duration," and determine the "gaps between subtitles."
  • This data feeds every subsequent step, so we compute it upfront to avoid redundant work (see the sketch below).
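Using the queue_tts fields documented in "How to Use" (start_time and end_time are in milliseconds), the preparation step might look like this sketch; the derived key names are mine:

def prepare_data(queue: list[dict], video_end_ms: int) -> None:
    """Annotate each subtitle with its original duration and trailing gap."""
    for i, seg in enumerate(queue):
        next_start = queue[i + 1]["start_time"] if i + 1 < len(queue) else video_end_ms
        seg["orig_duration_ms"] = seg["end_time"] - seg["start_time"]
        seg["gap_after_ms"] = next_start - seg["end_time"]  # silence we may borrow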

_calculate_adjustments(): Making Decisions

Calculates the "theoretical target durations" based on the four modes. This step only involves calculations, not file manipulation.
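A sketch of what that per-segment decision might look like; the real code also borrows from the gaps computed earlier and enforces the 10x video cap, so this is illustrative only:

def theoretical_duration_ms(orig_ms: int, dubbed_ms: int, mode: str) -> int:
    """Target duration both tracks should converge to, per mode (sketch)."""
    if mode == "preserve_original" or dubbed_ms <= orig_ms:
        return orig_ms                # nothing to reconcile
    if mode == "audio_concedes":
        return orig_ms                # audio must squeeze into the original slot
    if mode == "video_concedes":
        return dubbed_ms              # video stretches to the dubbed length
    # shared_burden: meet in the middle, each side absorbing sqrt(ratio)
    return int(orig_ms * (dubbed_ms / orig_ms) ** 0.5)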

_execute_audio_speedup(): Modifying the Audio

  • Uses pydub.speedup to process the audio at the calculated rate.
  • After processing, it's "trimmed" again to keep the error under 10 ms (see the sketch below).
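With pydub, this step stays short. A minimal version, where target_ms comes from the previous step:

from pydub import AudioSegment

def speed_up_audio(path: str, target_ms: int) -> AudioSegment:
    """Compress one dubbed segment to roughly target_ms, then trim the rest."""
    seg = AudioSegment.from_file(path)
    if len(seg) > target_ms:          # len() of an AudioSegment is in milliseconds
        seg = seg.speedup(playback_speed=len(seg) / target_ms)
    if len(seg) > target_ms:          # speedup() is approximate
        seg = seg[:target_ms]         # trim the tail to get within ~10 ms
    return seg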

_execute_video_processing(): Modifying the Video

  • First, it cuts the entire video into small segments and encodes them into a uniform intermediate format to prevent screen tearing or artifacts during concatenation.
  • After each segment is cut, its "actual duration" is immediately measured and written back to a dictionary for the subsequent audio alignment (a sketch follows).
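One shape this can take; the codec settings below are placeholders for whatever "intermediate format" the project standardizes on, and measure_duration is the ffprobe wrapper sketched earlier:

import subprocess

def cut_segment(src: str, start_s: float, duration_s: float, dst: str) -> float:
    """Cut one segment, re-encode it uniformly, and return its measured length."""
    subprocess.run(
        ["ffmpeg", "-y",
         "-ss", f"{start_s:.3f}", "-t", f"{duration_s:.3f}", "-i", src,
         "-c:v", "libx264", "-pix_fmt", "yuv420p", "-r", "30",  # uniform format
         "-an", dst],                                           # source is already silent
        check=True, capture_output=True,
    )
    return measure_duration(dst)      # trust physical reality, not the request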

_recalculate_timeline_and_merge_audio(): Rebuilding Audio Based on Measured Durations

  • It no longer refers to the original subtitle durations, but only to the "actual video durations."
  • If the video runs longer, the audio is padded with silence; if the video runs shorter, the end of the audio is trimmed (see the sketch below).
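The rule fits in a few lines of pydub (a sketch; the real code also keeps a running timeline across segments):

from pydub import AudioSegment

def fit_audio_to_video(seg: AudioSegment, video_ms: int) -> AudioSegment:
    """Force one audio segment to match its measured video duration exactly."""
    if len(seg) < video_ms:
        return seg + AudioSegment.silent(duration=video_ms - len(seg))  # pad tail
    return seg[:video_ms]                                               # trim tail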

_finalize_files(): Final Alignment

  • If the total audio and video durations still don't match, it falls back to padding: silence for the audio track, or a freeze frame held on the last frame for the video (see the sketch below).
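For the video side, FFmpeg's tpad filter can hold the last frame; a sketch of that fallback (the audio side would pad with plain silence, as above):

import subprocess

def pad_video_with_freeze_frame(src: str, dst: str, extra_s: float) -> None:
    """Extend a video by holding its last frame for extra_s seconds."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-vf", f"tpad=stop_mode=clone:stop_duration={extra_s:.3f}",
         dst],
        check=True, capture_output=True,
    )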

Code Skeleton at a Glance

The following pseudocode outlines the main process for quick reference:

def run(self):
    if preserve_original_mode:                    # user chose "Preserve Original"
        self._run_no_rate_change_mode()           # pure concatenation branch
        return
    self._prepare_data()                          # durations, gaps, frame rate
    self._calculate_adjustments()                 # theoretical target durations
    self._execute_audio_speedup()                 # compress audio with pydub
    self._execute_video_processing()              # cut, encode, measure segments
    self._recalculate_timeline_and_merge_audio()  # rebuild audio on measured durations
    self._finalize_files()                        # final alignment and export

The actual implementation is spread across a dozen small functions, each doing one thing, with verb-based names like _cut, _concat, _export... When reading the code, you can just follow the call chain.


Pitfalls Encountered

  • Concatenation Artifacts: Directly concatenating video segments with inconsistent frame rates or color spaces (which can happen when using FFmpeg hardware acceleration) will cause artifacts. We solve this by standardizing all segments to an "intermediate format" before lossless concatenation.
  • Audio Resampling Noise: To align timing, we once tried resampling all dubbed segments to a uniform 44.1 kHz and then normalizing them. This introduced noticeable background noise that was difficult to eliminate completely. We eventually abandoned this, preferring to trim or pad with silence instead.
  • PTS Limit: FFmpeg's setpts filter becomes unreliable once the multiplier exceeds 10, and even when it works, the result plays like a slideshow, which is impractical. We therefore imposed a hard 10x limit, preferring to trim the audio further if necessary (see the sketch below).
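For reference, slowing video down with setpts is just a filter string, and the hard limit amounts to clamping the multiplier before building it (a sketch):

def slowdown_filter(factor: float, cap: float = 10.0) -> str:
    """Build a setpts filter string, clamped to the practical 10x limit."""
    factor = min(factor, cap)
    return f"setpts={factor}*PTS"     # e.g. "setpts=2.0*PTS" doubles the duration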

How to Use

Use the SpeedRate class like any other:

sr = SpeedRate(
    queue_tts=subtitle_queue,
    shoud_audiorate=True,
    shoud_videorate=True,
    novoice_mp4=path_to_silent_video,  # ffmpeg -i video.mp4 -an silent_video.mp4
    uuid=random_string,
    cache_folder=temp_directory,
)
sr.run()

Parameter Descriptions:

  • queue_tts: A list of dictionaries, one per subtitle (start_time and end_time are in milliseconds):

[
    {
        'line': 33,
        'start_time': 131170,
        'end_time': 132250,
        'startraw': '00:02:11,170',
        'endraw': '00:02:12,250',
        'time': '00:02:11,170 --> 00:02:12,250',
        'filename': 'path/to/dubbed/audio/segment'
    },
    ...
]
  • shoud_audiorate / shoud_videorate: Boolean switches that determine which strategy to use.
  • The remaining path-related parameters should be set as needed.

Conclusion

The greatest value of this solution isn't in its advanced algorithms, but in its "practicality":

  • It covers most content types with four distinct strategies.
  • It resolves floating-point errors using "measurement-based alignment."
  • It ensures stable concatenation using an "intermediate format."
  • It reduces maintenance difficulty with "short functions + clear naming."