Speaker-Identified Subtitles with ModelScope cam++ and Paraformer: A Full Troubleshooting and Implementation Guide
If you work with multimedia content, you know that adding subtitles to videos is a basic operation. But if you want to go a step further and automatically label each line with the speaker's name, things quickly get harder.
This article documents my complete hands-on experience. Starting from a simple idea, I used ModelScope's open-source models to build and debug a tool that automatically identifies speakers and generates SRT subtitles. The pitfalls I encountered, the problem-solving approaches, and the reflections on technical boundaries during this journey may be more valuable than the final code itself.
The Initial Blueprint: Two Major Models, Each with Its Own Role
The goal was clear: input a multi-person dialogue audio file and output SRT subtitles with speaker labels like `[spk_0]` and `[spk_1]`.
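For reference, the target output is a standard SRT file in which each cue's text is prefixed with a speaker tag. A made-up example (timestamps and text invented purely for illustration):

```
1
00:00:00,000 --> 00:00:02,500
[spk_0] Hello, and welcome to the show.

2
00:00:02,500 --> 00:00:05,100
[spk_1] Thanks, glad to be here.
```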
To achieve this goal, a single model wasn't enough. We needed a combination of approaches:
Speaker Diarization
- Task: Identify "who spoke when." It acts like a detective, scanning the entire audio and dividing it into segments spoken by different people. It doesn't care what they said.
- Selected Model: `iic/speech_campplus_speaker-diarization_common`, a seasoned and powerful model in the field of speaker recognition.
- Model Address: https://www.modelscope.cn/models/iic/speech_campplus_speaker-diarization_common
Automatic Speech Recognition (ASR)
- Task: Figure out "what was said." It's responsible for converting speech signals into time-stamped text.
- Selected Model: `iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch`, the Paraformer model from the FunASR ecosystem, known for its accuracy and efficiency in Chinese speech recognition.
- Model Address: https://www.modelscope.cn/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch
Result Fusion
- Task: This is the "glue" code that we wrote ourselves. It takes the results from the previous two steps and, like a director, accurately assigns each line of dialogue (text + time) to the corresponding actor (speaker), ultimately generating the speaker-labeled SRT subtitles we want.
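The core of this fusion step is a simple maximum-overlap assignment. Here is a minimal sketch of the idea with invented numbers; the full script later in this article applies the same rule to real model outputs:

```python
# Assign each ASR sentence to the diarization segment it overlaps with the most.
# All numbers below are invented for illustration.

def overlap(a_start, a_end, b_start, b_end):
    """Length (in seconds) of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

sentence = {"start": 1.2, "end": 3.4}                     # one ASR sentence, in seconds
segments = [(0.0, 2.0, "spk_0"), (2.0, 5.0, "spk_1")]     # diarization segments

best = max(segments, key=lambda s: overlap(sentence["start"], sentence["end"], s[0], s[1]))
print(best[2])  # -> spk_1 (1.4 s of overlap vs. 0.8 s for spk_0)
```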
The blueprint was beautiful, but the challenges only surfaced when the actual construction began.
The Debugging Journey: Navigating Through "Unexpected" and "Surprising" Moments
The First Hurdle: The "Guessing Game" of API Parameters
Rolling up our sleeves and getting to work, the initial code repeatedly hit walls, with `TypeError` becoming a common occurrence. The error log, like an impatient guide, repeatedly warned: "Incorrect parameter name!" Through repeated attempts and comparisons with the documentation, it became clear that even within the ModelScope ecosystem, different models have different "tastes":
- The `diarization_pipeline` for speaker diarization requires the audio parameter to be named `audio`.
- The `asr_model.generate` call for speech recognition only recognizes `input`.
A small difference in parameter names became the first step of a long journey.
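To make the difference concrete, here is a minimal sketch of the two calls side by side. The keyword names are simply the ones that finally made the `TypeError`s go away in my runs, so treat them as observed behaviour to verify rather than documented API:

```python
from funasr import AutoModel
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# ModelScope's diarization pipeline wants the file under the `audio` keyword...
diarization_pipeline = pipeline(
    task=Tasks.speaker_diarization,
    model='iic/speech_campplus_speaker-diarization_common',
)
diar_result = diarization_pipeline(audio='he.wav')

# ...while FunASR's AutoModel.generate only accepts it as `input`.
asr_model = AutoModel(model="iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch")
asr_result = asr_model.generate(input='he.wav')
```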
The Second Hurdle: The "Mysterious Blind Box" of Model Output
I had assumed that the speaker diarization model would politely return a file path in a standard format (such as RTTM). However, the reality was that it directly provided a list in memory. To make matters worse, the format of this list changed several times during my experiments, from the initial "list of lists" to the final, extremely simple `[[start_sec, end_sec, speaker_id]]` format, which was truly baffling.
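For illustration, this is the kind of structure I eventually had to handle. The numbers here are invented, but the shape matches what the pipeline returned in my runs:

```python
# Invented example of the in-memory list the diarization pipeline returned:
# each entry is [start_seconds, end_seconds, speaker_id].
diarization_output = [
    [0.00, 2.48, 0],
    [2.48, 5.12, 1],
    [5.12, 7.90, 0],
]

for start_sec, end_sec, speaker_id in diarization_output:
    print(f"spk_{speaker_id}: {start_sec:.2f}s -> {end_sec:.2f}s")
```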
This reminded us that we cannot write code based on assumptions. We must print and understand the real output of each step of the model.
The Third Hurdle: How to Elegantly "Segment" Sentences?
The ASR model gave us an entire block of text with punctuation and a list of timestamps accurate to the "character" level, but it didn't provide readily available results with sentences already segmented. This was another challenge.
- Initial Attempt: Clumsy but Effective. I hand-wrote a `reconstruct_sentences_from_asr` function that used regular expressions to roughly split sentences on periods, question marks, and other punctuation, then accumulated timestamps based on the number of characters in each split sentence. It worked, but it felt clumsy rather than "AI-like".
- Final Optimization: Discovering the "Hidden Switch". After some deeper research, I found that the FunASR model already integrates Voice Activity Detection (VAD), which is naturally suited to sentence segmentation. Simply passing a `sentence_timestamp=True` parameter when calling `model.generate` returns a field called `sentence_info` containing perfectly segmented, time-stamped sentences in one step (see the sketch below).
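A minimal sketch of the "hidden switch" in action, assuming the same Paraformer model as above. The `sentence_info` field names (`text`, `start`, `end`, in milliseconds) are what I observed in my own runs, so print the raw result yourself to confirm:

```python
from funasr import AutoModel

asr_model = AutoModel(
    model="iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    vad_model="fsmn-vad",
    punc_model="ct-punc-c",
)
res = asr_model.generate(input="he.wav", sentence_timestamp=True)

# Each entry in sentence_info is one segmented sentence with millisecond timestamps.
for sent in res[0]["sentence_info"]:
    print(sent["start"], sent["end"], sent["text"])
```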
The Final Confusion: Why is the Result Wrong When the Code is Right?
When all the code logic had been straightened out, I used an audio recording containing clear male and female dialogues for testing, but the result showed that all the speech was spoken by the same person. This was the most puzzling moment: the code logic was flawless, so why was the result so far off the mark?
As shown in the figure above, the first sentence is spoken by a male voice and the second by a female voice, but they were not successfully distinguished.
I cross-validated using the official sample audio and finally confirmed a harsh fact: my code logic was completely correct, but my test audio was simply too difficult for the `cam++` model. Even though I explicitly told the model there were two speakers via the `oracle_num=2` parameter, it still failed to tell them apart.
The Gap Between the Final Code and Reality
The code below incorporates all of the optimizations above. It is logically clear and makes the most of the models' own capabilities.
import os
import re
from funasr import AutoModel
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
import soundfile

# --- 1. Read Audio and Initialize Models ---
audio_file = 'he.wav'

try:
    data, sample_rate = soundfile.read(audio_file)
    if sample_rate != 16000:
        print(f"Warning: The audio sample rate is {sample_rate}Hz. For best results, it is recommended to use audio with a 16kHz sample rate.")
except Exception as e:
    print(f"Error: Failed to read audio file {audio_file}. Please ensure the file exists and is in the correct format. Error message: {e}")
    exit()

# === Speaker Diarization Model ===
print("Initializing speaker diarization model (cam++)...")
diarization_pipeline = pipeline(
    task=Tasks.speaker_diarization,
    model='iic/speech_campplus_speaker-diarization_common',
    model_revision='v1.0.0'
)

# === Speech Recognition Model ===
print("Initializing speech recognition model (paraformer-zh)...")
asr_model = AutoModel(
    model="iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    vad_model="fsmn-vad",
    punc_model="ct-punc-c"
)

# --- 2. Execute Model Pipeline ---
print(f"Start processing audio file: {audio_file}")

print("Start performing speaker diarization...")
# If you know the number of speakers in advance, passing it via oracle_num can improve accuracy
num_speakers = 2
diarization_result = diarization_pipeline(audio_file, oracle_num=num_speakers)
diarization_output = diarization_result['text']
print("Speaker diarization completed.")
print(f"--- Original output of speaker diarization model ---\n{diarization_output}\n---------------------------------")

print("Start performing speech recognition...")
# Use the model's built-in VAD for intelligent sentence segmentation to directly obtain the sentence list
res = asr_model.generate(input=audio_file, sentence_timestamp=True)
print("Speech recognition completed.")

# --- 3. Merge and Process ---
def parse_diarization_result(diarization_segments):
    """Parse the [[start, end, id]] format list returned by the speaker diarization model."""
    speaker_segments = []
    if not isinstance(diarization_segments, list):
        return []
    for segment in diarization_segments:
        if isinstance(segment, list) and len(segment) == 3:
            try:
                start_sec, end_sec = float(segment[0]), float(segment[1])
                speaker_id = f"spk_{segment[2]}"
                speaker_segments.append({'speaker': speaker_id, 'start': start_sec, 'end': end_sec})
            except (ValueError, TypeError) as e:
                print(f"Warning: Skipping incorrectly formatted diarization segment: {segment}. Error: {e}")
    return speaker_segments

def merge_results(asr_sentences, speaker_segments):
    """Merge ASR results and speaker diarization results."""
    merged_sentences = []
    if not speaker_segments:
        # If speaker diarization fails, all sentences are marked as unknown
        for sentence in asr_sentences:
            sentence['speaker'] = "spk_unknown"
            merged_sentences.append(sentence)
        return merged_sentences
    for sentence in asr_sentences:
        sentence_start_sec, sentence_end_sec = sentence['start'] / 1000.0, sentence['end'] / 1000.0
        found_speaker, best_overlap = "spk_unknown", 0
        # Find the speaker segment with the longest time overlap with the current sentence
        for seg in speaker_segments:
            overlap_start = max(sentence_start_sec, seg['start'])
            overlap_end = min(sentence_end_sec, seg['end'])
            overlap_duration = max(0, overlap_end - overlap_start)
            if overlap_duration > best_overlap:
                best_overlap = overlap_duration
                found_speaker = seg['speaker']
        sentence['speaker'] = found_speaker
        merged_sentences.append(sentence)
    return merged_sentences

def format_time(milliseconds):
    """Convert milliseconds to SRT time format (HH:MM:SS,ms)."""
    seconds = milliseconds / 1000.0
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds - int(seconds)) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(sentences):
    """Convert the merged results to SRT format with speaker IDs."""
    srt_content = ""
    for i, sentence in enumerate(sentences):
        if 'start' not in sentence or 'end' not in sentence:
            continue
        start_time = format_time(sentence['start'])
        end_time = format_time(sentence['end'])
        speaker_id = sentence.get('speaker', 'spk_unknown')
        text = sentence.get('text', '')
        srt_content += f"{i + 1}\n{start_time} --> {end_time}\n[{speaker_id}] {text}\n\n"
    return srt_content

# --- 4. Generate Final SRT Subtitles ---
speaker_info = parse_diarization_result(diarization_output)

sentence_list = []
if res and 'sentence_info' in res[0]:
    sentence_list = res[0]['sentence_info']
else:
    print("Error or warning: Failed to get 'sentence_info' from ASR results.")

final_sentences = merge_results(sentence_list, speaker_info)
srt_output = to_srt(final_sentences)

print("\n--- Generated SRT subtitle content ---")
if srt_output:
    print(srt_output)
    output_srt_file = 'output_with_speakers.srt'
    with open(output_srt_file, 'w', encoding='utf-8') as f:
        f.write(srt_output)
    print(f"SRT subtitle file with speaker identification saved to: {output_srt_file}")
else:
    print("Failed to generate SRT content.")
Environment Configuration Tips
- Install Dependencies: `pip install -U modelscope funasr addict`
- Version Compatibility Issues: If you encounter errors after installation, try downgrading the `numpy` and `datasets` packages; this usually resolves the common compatibility problems (a quick version check is sketched below): `pip install --force-reinstall numpy==1.26.4 datasets==3.0.0`
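If you are unsure which versions actually ended up installed (for example on Colab, where preinstalled packages can silently win), a quick sanity check like this can save time before blaming the models:

```python
from importlib.metadata import version

# Print the installed versions of the packages this walkthrough depends on.
for pkg in ("modelscope", "funasr", "numpy", "datasets"):
    print(pkg, version(pkg))
```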
Execution Result on Google Colab
How Far Are Open-Source Speaker Diarization Models from Production Environments?
This practice demonstrates that it is technically possible to build a "hand-crafted" speaker recognition pipeline. However, this solution also has its obvious limitations:
- The Diarization Model Is the Bottleneck: This cannot be overemphasized. The `cam++` model is the weak link in the entire workflow. If it can't "hear" that there are several people in your audio, no amount of good code can save it. Its performance suffers in complex scenarios with background noise, accents, and variations in speaking rate.
- Afraid of "Talking Over" and Interruptions: The merging logic is "winner-takes-all": each sentence is assigned entirely to the speaker with the longest overlap. It cannot handle situations where two people speak at the same time or the dialogue is heavily interleaved.
So, looking at the industry, what are other players doing?
| Solution Type | Representative Tools/Services | Advantages | Disadvantages | One-Sentence Comment |
|---|---|---|---|---|
| Open-Source Integrated Pipeline | WhisperX, Pyannote.audio | Top-notch performance, active community; represents the state of the art in academia and open source. | Complex configuration, high resource consumption, not beginner-friendly. | For tinkerers chasing maximum performance; beginners should stay away. |
| Commercial Cloud Service API | Google STT, AWS Transcribe, iFlytek | Simple to use, stable and reliable; little need to worry about the underlying details. | Pay-as-you-go costs add up, and data must be uploaded to the cloud. | The "pay-to-win" choice, suited to rapid development and teams with budget. |
| This Project's Solution | ModelScope + FunASR | Free and open-source, flexible and controllable; models can be freely combined and modified. | Requires troubleshooting and integration yourself; current accuracy is unstable. | Good for learning, experimentation, and scenarios with modest accuracy requirements. |
My initial goal was to test the effectiveness of this solution and, if it proved good enough, integrate it into my other project, pyVideoTrans. But based on the actual test results, its performance, especially on long recordings, is far from production-ready. And the complexity of deploying and integrating top-tier open-source solutions like Pyannote.audio is, frankly, daunting.
Therefore, this exploration has come to a temporary end. Although the final product is not satisfactory, the whole process has been a valuable learning experience.