
Thanks to the rapid advancements in AI technology, the once challenging task of video translation has become more accessible, although the results may not yet be perfect.

Video translation is more complex than text translation, but at its core, it still relies on text-based translation (while technologies that directly convert sound into another language exist, they are currently not mature enough for practical use).

The video translation workflow can be roughly divided into the following stages (a code sketch follows the list):

  1. Speech Recognition: Extracting speech from the video and converting it into text;

  2. Text Translation: Translating the extracted text into the target language;

  3. Speech Synthesis: Generating speech in the target language based on the translated text;

  4. Synchronization: Ensuring that the dubbed audio and subtitle files are synchronized with the video content;

  5. Embedding: Embedding the translated subtitles and dubbing into the video to generate a new video file.
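
To make the five stages concrete, here is a minimal Python sketch of the overall flow. Every helper name in it (recognize, translate_subtitles, synthesize, align, embed) is a placeholder for the corresponding stage discussed below, not a real library call:

    # Hypothetical top-level pipeline; each placeholder is expanded
    # in its own section below.
    def translate_video(video_path: str, target_lang: str) -> str:
        subs = recognize(video_path)                          # 1. speech -> timestamped text
        translated = translate_subtitles(subs, target_lang)   # 2. text translation
        audio = synthesize(translated)                        # 3. dubbing via TTS
        aligned = align(translated, audio, video_path)        # 4. synchronization
        return embed(video_path, translated, aligned)         # 5. mux subtitles + dubbing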

Detailed Discussion of Each Stage:

Speech Recognition

The goal of this step is to accurately convert the speech content in the video into text, with timestamps attached. There are currently several ways to achieve this, including using OpenAI's Whisper model, Alibaba's FunASR series of models, or calling an online speech recognition API such as Baidu's.

When selecting a Whisper model, you can choose anything from tiny up to large-v3 depending on your needs: the larger the model, the higher the recognition accuracy, but the more compute and time it requires.
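
As a minimal sketch, here is how recognition might look with the open-source whisper package (pip install openai-whisper); the model size and the audio file name are illustrative:

    import whisper

    # "tiny" through "large-v3" trade speed for accuracy.
    model = whisper.load_model("small")

    # Transcribe the audio track extracted from the video. Each segment
    # carries start/end timestamps in seconds plus the recognized text,
    # which is exactly what a subtitle file needs.
    result = model.transcribe("audio.wav")
    for seg in result["segments"]:
        print(f'{seg["start"]:.2f} --> {seg["end"]:.2f} {seg["text"]}')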

Text Translation

Once the text is obtained, translation can be performed. Note that subtitle translation differs from ordinary text translation: the translated lines must stay matched to their timestamps.

When using traditional translation engines (such as Baidu Translate or Tencent Translate), only the subtitle text lines should be sent for translation; line numbers and timestamp lines should be kept out of the request, both to stay within character limits and to avoid corrupting the subtitle format.

Ideally, the translated subtitles should have the same number of lines as the original subtitles, with no blank lines.
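
Here is a sketch of that approach for SRT subtitles, assuming a hypothetical translate_lines() callback that wraps whichever engine you use; only the text leaves the program, while numbering and timestamps are kept locally and reattached afterwards:

    def translate_srt(srt_text, translate_lines):
        # An SRT file is a series of blocks separated by blank lines:
        # line 0 is the index, line 1 the timestamps, the rest is text.
        blocks = [b.splitlines() for b in srt_text.strip().split("\n\n")]
        texts = ["\n".join(b[2:]) for b in blocks]    # text lines only
        translated = translate_lines(texts)           # one result per block
        return "\n\n".join(
            "\n".join(block[:2]) + "\n" + text
            for block, text in zip(blocks, translated)
        )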

In practice, however, translation engines, and AI translation in particular, intelligently merge lines based on context: when a line contains only a few characters or one or two words and is semantically connected to the previous sentence, the engine is likely to fold it into that sentence.

Although this makes the translation more fluent and elegant, it also means the translated subtitles no longer match the original ones line for line, leaving blank lines behind.
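
One pragmatic countermeasure, sketched below, is to detect the blank lines after translation and re-translate just those lines in isolation; translate_one() stands in for a single-line call to whatever engine is in use:

    def fill_blank_lines(originals, translated, translate_one):
        fixed = []
        for src, dst in zip(originals, translated):
            # A blank result usually means the engine folded this line
            # into the previous one; translating it alone restores the
            # one-to-one match with the original subtitles.
            fixed.append(dst if dst.strip() else translate_one(src))
        return fixed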

Speech Synthesis (Dubbing)

After translation, dubbing can be generated based on the translated subtitles.

Currently, Microsoft's EdgeTTS is a free and virtually unlimited dubbing channel. By sending subtitles to EdgeTTS line by line, you obtain one dubbed audio clip per line, and the clips are then merged into a complete audio track.
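
A minimal sketch using the edge-tts Python package (pip install edge-tts); the voice name is one of many Microsoft offers, chosen here purely as an example:

    import asyncio
    import edge_tts

    async def dub_line(text, out_path, voice="en-US-AriaNeural"):
        # One request per subtitle line; the resulting clips are
        # concatenated into the full dubbing track afterwards.
        await edge_tts.Communicate(text, voice).save(out_path)

    asyncio.run(dub_line("Hello, world.", "line_0001.mp3"))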

Synchronization and Alignment Adjustment

Ensuring synchronization between subtitles, audio, and video is the biggest challenge in video translation.

Differences in pronunciation duration between languages are inevitable: a sentence that takes three seconds to say in the original language may take five in the dubbed one, and this causes synchronization problems. Strategies for solving this include speeding up audio playback, extending the length of video clips, and exploiting the silent gaps between subtitles, alone or in combination.

If no adjustment is made and everything is embedded according to the original timestamps, the subtitles will regularly disappear while the voice is still speaking, or the person on screen will have finished talking and closed their mouth while the audio keeps playing.

To solve this problem, there are two relatively simple methods:

  1. Speed up audio playback so that each dubbed clip fits within its subtitle's time interval. The disadvantage is that the speaking speed fluctuates from line to line, which makes for a poor listening experience.

  2. Slow down the video clip covered by the subtitle, that is, extend it until its length matches the new dubbing length, which also achieves synchronization. The disadvantage is that the picture appears to stutter or play in slow motion.

Both methods can also be combined: the audio is accelerated while the video clip is extended, so that the audio does not speed up too much and the video does not stretch too far.

Depending on the video, you can also exploit the silent gap between two subtitles: first let the dubbed audio spill over into that gap with no acceleration at all. If it finishes within the gap, no speed-up is needed, which sounds noticeably better. The trade-off is that the person on screen may have stopped talking while the audio is still playing.
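
Here is a minimal sketch of that decision logic for a single subtitle, with all durations in seconds; the 1.5x cap on audio speed-up is an arbitrary comfort limit, not a fixed rule:

    def plan_sync(slot, gap_after, audio):
        # slot      - time the subtitle occupies on screen
        # gap_after - silent gap before the next subtitle
        # audio     - duration of the dubbed clip
        if audio <= slot + gap_after:
            return 1.0, 1.0                 # fits using the gap: leave everything alone
        tempo = min(audio / slot, 1.5)      # speed the audio up, capped at 1.5x
        stretch = (audio / tempo) / slot    # stretch the video to absorb the rest
        return tempo, stretch               # (audio tempo, video stretch factor)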

Synthesis and Output

After completing the above steps, embed the translated subtitles and the dubbing into the original video. This can be done easily with a tool such as ffmpeg; writing out the new video file completes the translation process.

ffmpeg -y -i input.mp4 -i dubbing.m4a -map 0:v:0 -map 1:a:0 -c:v libx264 -c:a aac -vf subtitles=subtitles.srt out.mp4
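
The command above burns the subtitles into the picture (hard subtitles), which is why the video must be re-encoded with libx264. If you prefer switchable soft subtitles instead, mp4 containers accept the mov_text subtitle codec, and the video stream can then be copied without re-encoding:

ffmpeg -y -i input.mp4 -i dubbing.m4a -i subtitles.srt -map 0:v -map 1:a -map 2:s -c:v copy -c:a aac -c:s mov_text out.mp4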

A Hard Problem: Multi-Speaker Recognition

Speaker role recognition, that is, synthesizing a different voice for each character in the video, requires speaker diarization and a speaker count specified in advance. It is barely workable for ordinary one- or two-person dialogues, but for most videos the number of speakers cannot be determined ahead of time, and the synthesized result is poor in any case, so this part has been set aside for now.

Summary

The above is only a simple outline of the process. In practice, achieving good translation results requires attention to many details: pre-processing the original input formats (mov/mp4/avi/mkv), splitting the video into an audio track and a silent video track, separating the human voice from background noise, batching translation requests to speed up subtitle translation, re-splitting subtitles when blank lines appear, generating and embedding dual-language subtitles, and so on.
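
For instance, the audio/video split mentioned above is a pair of ffmpeg one-liners; the 16 kHz mono WAV format is what Whisper-style recognizers expect, and the file names are illustrative:

ffmpeg -y -i input.mp4 -vn -ar 16000 -ac 1 audio.wav

ffmpeg -y -i input.mp4 -an -c:v copy silent.mp4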

Through this series of steps, the video translation task can be completed, converting video content into the target language. There may still be technical challenges along the way, but as the technology continues to improve, the quality and efficiency of video translation can be expected to rise further.