
Speaker-Identified Subtitles with ModelScope cam++ and Paraformer: A Full Troubleshooting and Implementation Guide

If you work with multimedia content, you know that adding subtitles to a video is a routine operation. But if you want to go a step further and automatically label each sentence with the speaker's name, things get considerably harder.

This article documents my complete hands-on experience. Starting from a simple idea, I used ModelScope's open-source models to build and debug a tool that automatically identifies speakers and generates SRT subtitles. The pitfalls I encountered, the problem-solving approaches, and the reflections on technical boundaries during this journey may be more valuable than the final code itself.

The Initial Blueprint: Two Major Models, Each with Its Own Role

The goal was clear: Input a multi-person dialogue audio and output SRT subtitles with speaker labels like [spk_0], [spk_1].

To achieve this goal, a single model wasn't enough. We needed a combination of approaches:

  1. Speaker Diarization

    • Task: Identify "who spoke when." It acts like a detective, scanning the entire audio and dividing it into segments spoken by different people. It doesn't care what they said.
    • Selected Model: iic/speech_campplus_speaker-diarization_common, a seasoned and powerful model in the field of speaker recognition.
    • Model Address: https://www.modelscope.cn/models/iic/speech_campplus_speaker-diarization_common
  2. Automatic Speech Recognition (ASR)

    • Task: Work out "what was said and when." It transcribes the speech into punctuated text with timestamps, but it has no idea who is speaking.
    • Selected Model: iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch (Paraformer), loaded through FunASR's AutoModel as shown in the code below.

  3. Result Fusion

    • Task: This is the "glue" code that we wrote ourselves. It takes the results from the previous two steps and, like a director, accurately assigns each line of dialogue (text + time) to the corresponding actor (speaker), ultimately generating the speaker-labeled SRT subtitles we want.
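
As a toy illustration of the fusion idea (hypothetical data, not real model output), a sentence is simply handed to whichever speaker segment overlaps it the most:

```python
# Toy example of the fusion step: hypothetical diarization segments and one
# ASR sentence; the sentence goes to the speaker with the largest overlap.
segments = [(0.0, 4.2, 'spk_0'), (4.2, 9.0, 'spk_1')]   # (start_s, end_s, id)
sentence = {'text': 'Hello there.', 'start': 3.8, 'end': 6.5}

def overlap(a_start, a_end, b_start, b_end):
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

speaker = max(segments, key=lambda s: overlap(sentence['start'], sentence['end'], s[0], s[1]))[2]
print(f"[{speaker}] {sentence['text']}")   # -> [spk_1] Hello there.
```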

The blueprint was beautiful, but the challenges only surfaced when the actual construction began.

The Debugging Journey: Navigating Through "Unexpected" and "Surprising" Moments

The First Hurdle: The "Guessing Game" of API Parameters

Once I rolled up my sleeves and got to work, the initial code hit wall after wall, with TypeError becoming a regular occurrence. The error log, like an impatient guide, kept warning: "Incorrect parameter name!" Through repeated attempts and comparisons with the documentation, it became clear that even within the ModelScope ecosystem, different models have different "tastes":

  • The diarization_pipeline for speaker diarization expects the audio to be passed as audio.
  • The asr_model.generate call for speech recognition only accepts input.

A small difference in parameter names became the first step of a long journey.
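
For the record, the two working call shapes, simplified from the full script below (and assuming the parameter names behave exactly as described above), look like this:

```python
# The diarization pipeline wants the audio passed as `audio`,
# while FunASR's generate() only accepts `input`.
diarization_result = diarization_pipeline(audio=audio_file, oracle_num=2)
asr_result = asr_model.generate(input=audio_file, sentence_timestamp=True)
```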

The Second Hurdle: The "Mysterious Blind Box" of Model Output

I had assumed that the speaker diarization model would politely return a file path in a standard format (such as RTTM). In reality, it hands back a list directly in memory, and to make matters worse, the shape I saw during debugging changed several times, from the initial nested "list of lists" to the final, bare-bones [[start_sec, end_sec, speaker_ID]] format, which was truly baffling.

This reminded us that we cannot write code based on assumptions. We must print and understand the real output of each step of the model.
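
In practice, that meant running a quick inspection script before writing any parsing logic. A minimal sketch (the exact shape may vary across model revisions, which is precisely why printing it first matters):

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Build the diarization pipeline and dump its raw output, so the parsing
# code is written against what it actually returns rather than assumptions.
diar = pipeline(
    task=Tasks.speaker_diarization,
    model='iic/speech_campplus_speaker-diarization_common',
    model_revision='v1.0.0',
)
result = diar('he.wav', oracle_num=2)  # same test file as the full script

print(type(result))  # the full script below indexes result['text'],
print(result)        # which holds [[start_sec, end_sec, speaker_ID], ...]
```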

The Third Hurdle: How to Elegantly "Segment" Sentences?

The ASR model gave us an entire block of punctuated text plus a list of timestamps accurate down to the character level, but no ready-made, sentence-by-sentence segmentation. That was the next challenge.

  • Initial Attempt: Clumsy but Effective. I hand-wrote a reconstruct_sentences_from_asr function that used regular expressions to roughly split the text on periods, question marks, and other sentence-ending punctuation, then accumulated the character-level timestamps according to the length of each split sentence (a simplified sketch of this idea appears after this list). It worked, but it felt awkward and not very "AI-like".

  • Final Optimization: Discovering the "Hidden Switch". After digging deeper, I found that the FunASR model already integrates Voice Activity Detection (VAD), which is naturally suited to sentence segmentation. Simply adding sentence_timestamp=True to the model.generate call yields a field called sentence_info, containing perfectly segmented, time-stamped sentences in one step.
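
For reference, the clumsy approach boiled down to something like the sketch below. It is a simplified reconstruction, not the original reconstruct_sentences_from_asr, and it assumes the ASR output provides the full punctuated text plus one [start_ms, end_ms] pair per character, as Paraformer's character-level timestamps do:

```python
import re

def split_sentences_by_punct(text, char_timestamps):
    """Roughly split the ASR text on sentence-ending punctuation and give each
    piece a time span by consuming character timestamps in order. Punctuation
    inserted by the punctuation model may not carry its own timestamp, so the
    alignment is only approximate - one reason this approach felt clumsy."""
    pieces = [p for p in re.split(r'(?<=[。！？!?.])', text) if p.strip()]
    sentences, cursor = [], 0
    for piece in pieces:
        span = char_timestamps[cursor:cursor + len(piece)]
        cursor += len(piece)
        if span:
            sentences.append({
                'text': piece,
                'start': span[0][0],   # ms of the piece's first character
                'end': span[-1][1],    # ms of its last character
            })
    return sentences
```

The sentence_timestamp=True route in the final script makes all of this bookkeeping unnecessary.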

The Final Confusion: Why is the Result Wrong When the Code is Right?

When all the code logic had finally been straightened out, I tested with a recording containing clearly distinguishable male and female voices, yet the output attributed every line to the same speaker. This was the most puzzling moment: the code logic seemed flawless, so why was the result so far off the mark?

In that test clip, the first sentence is spoken by a male voice and the second by a female voice, but the two were not distinguished.

I cross-validated with the official sample audio and finally confirmed a harsh fact: my code logic was completely correct, but my test audio was simply too hard for the cam++ model. Even when I explicitly told the model there were two speakers via the oracle_num=2 parameter, it still failed to tell them apart.

The Gap Between the Final Code and Reality

The code below incorporates all of the optimizations above. It is logically clear and leans on the models' built-in capabilities as much as possible.

```python
import os
import re
from funasr import AutoModel
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
import soundfile


audio_file = 'he.wav' 

try:
    data, sample_rate = soundfile.read(audio_file)
    if sample_rate != 16000:
        print(f"Warning: The audio sample rate is {sample_rate}Hz. For best results, it is recommended to use audio with a 16kHz sample rate.")
except Exception as e:
    print(f"Error: Failed to read audio file {audio_file}. Please ensure the file exists and is in the correct format. Error message: {e}")
    exit()

# === Speaker Diarization Model ===
print("Initializing speaker diarization model (cam++)...")
diarization_pipeline = pipeline(
    task=Tasks.speaker_diarization,
    model='iic/speech_campplus_speaker-diarization_common',
    model_revision='v1.0.0'
)

# === Speech Recognition Model ===
print("Initializing speech recognition model (paraformer-zh)...")
asr_model = AutoModel(model="iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
                      vad_model="fsmn-vad",
                      punc_model="ct-punc-c")


# --- 2. Execute Model Pipeline ---

print(f"Start processing audio file: {audio_file}")
print("Start performing speaker diarization...")

# If the number of speakers is known in advance, passing it via oracle_num can improve accuracy
num_speakers = 2
diarization_result = diarization_pipeline(audio_file, oracle_num=num_speakers) 
diarization_output = diarization_result['text'] 
print(f"Speaker diarization completed.")
print(f"--- Original output of speaker diarization model ---\n{diarization_output}\n---------------------------------")


print("Start performing speech recognition...")
# Use the model's built-in VAD for intelligent sentence segmentation to directly obtain the sentence list
res = asr_model.generate(input=audio_file, sentence_timestamp=True)
print("Speech recognition completed.")


# --- 3. Merge and Process ---

def parse_diarization_result(diarization_segments):
    """Parse the [[start, end, id]] format list returned by the speaker diarization model."""
    speaker_segments = []
    if not isinstance(diarization_segments, list): return []
    for segment in diarization_segments:
        if isinstance(segment, list) and len(segment) == 3:
            try:
                start_sec, end_sec = float(segment[0]), float(segment[1])
                speaker_id = f"spk_{segment[2]}"
                speaker_segments.append({'speaker': speaker_id, 'start': start_sec, 'end': end_sec})
            except (ValueError, TypeError) as e:
                print(f"Warning: Skipping incorrectly formatted separation segment: {segment}. Error: {e}")
    return speaker_segments

def merge_results(asr_sentences, speaker_segments):
    """Merge ASR results and speaker diarization results"""
    merged_sentences = []
    if not speaker_segments:
        # If speaker diarization fails, all sentences are marked as unknown
        for sentence in asr_sentences:
            sentence['speaker'] = "spk_unknown"
            merged_sentences.append(sentence)
        return merged_sentences
        
    for sentence in asr_sentences:
        sentence_start_sec, sentence_end_sec = sentence['start'] / 1000.0, sentence['end'] / 1000.0
        found_speaker, best_overlap = "spk_unknown", 0
        
        # Find the speaker segment with the longest time overlap with the current sentence
        for seg in speaker_segments:
            overlap_start = max(sentence_start_sec, seg['start'])
            overlap_end = min(sentence_end_sec, seg['end'])
            overlap_duration = max(0, overlap_end - overlap_start)
            
            if overlap_duration > best_overlap:
                best_overlap = overlap_duration
                found_speaker = seg['speaker']
                
        sentence['speaker'] = found_speaker
        merged_sentences.append(sentence)
    return merged_sentences

def format_time(milliseconds):
    """Convert milliseconds to SRT time format (HH:MM:SS,ms)"""
    seconds = milliseconds / 1000.0
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds - int(seconds)) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(sentences):
    """Convert the merged results to SRT format with speaker ID"""
    srt_content = ""
    for i, sentence in enumerate(sentences):
        if 'start' not in sentence or 'end' not in sentence: continue
        start_time = format_time(sentence['start'])
        end_time = format_time(sentence['end'])
        speaker_id = sentence.get('speaker', 'spk_unknown')
        text = sentence.get('text', '')
        srt_content += f"{i + 1}\n{start_time} --> {end_time}\n[{speaker_id}] {text}\n\n"
    return srt_content

# --- 4. Generate Final SRT Subtitles ---
speaker_info = parse_diarization_result(diarization_output)

sentence_list = []
if res and 'sentence_info' in res[0]:
    sentence_list = res[0]['sentence_info']
else:
    print("Error or warning: Failed to get 'sentence_info' from ASR results.")

final_sentences = merge_results(sentence_list, speaker_info)
srt_output = to_srt(final_sentences)

print("\n--- Generated SRT subtitle content ---")
if srt_output:
    print(srt_output)
    output_srt_file = 'output_with_speakers.srt'
    with open(output_srt_file, 'w', encoding='utf-8') as f:
        f.write(srt_output)
    print(f"SRT subtitle file with speaker identification saved to: {output_srt_file}")
else:
    print("Failed to generate SRT content.")

Environment Configuration Tips

  • Install Dependencies: pip install -U modelscope funasr addict
  • Version Compatibility: If you run into errors after installation, pinning older numpy and datasets versions usually resolves the common incompatibilities: pip install --force-reinstall numpy==1.26.4 datasets==3.0.0

Execution Result on Google Colab

How Far Are Open-Source Speaker Diarization Models from Production Environments?

This practice demonstrates that it is technically possible to build a "hand-crafted" speaker recognition pipeline. However, this solution also has its obvious limitations:

  1. The Separation Model is the Bottleneck: This cannot be overemphasized. The cam++ model is the weak link in the entire workflow. If it can't "hear" several people in your audio, no amount of good code can save it. Its performance is unsatisfactory when dealing with complex scenarios such as background noise, accents, and variations in speech speed.

  2. Afraid of "Crosstalk" and "Interruptions": Our merging logic is "winner takes all": each sentence is assigned entirely to the speaker with the longest overlapping segment. It cannot handle two people speaking at the same time or cutting into each other's lines (a sketch of one way to at least flag such cases follows this list).
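
To make the second point concrete, here is a hedged sketch (not part of the script above) of how the merging step could at least flag sentences where a runner-up speaker overlaps almost as much as the winner, instead of silently assigning everything to one label:

```python
def assign_with_crosstalk_flag(sentence, speaker_segments, ratio=0.6):
    """Pick the speaker with the longest overlap, as merge_results does, but
    also flag the sentence as possible crosstalk when the second-best
    speaker's overlap reaches `ratio` of the winner's. Times are in seconds."""
    totals = {}
    for seg in speaker_segments:
        start = max(sentence['start'], seg['start'])
        end = min(sentence['end'], seg['end'])
        totals[seg['speaker']] = totals.get(seg['speaker'], 0.0) + max(0.0, end - start)

    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return 'spk_unknown', False
    winner, best = ranked[0]
    crosstalk = len(ranked) > 1 and ranked[1][1] >= ratio * best
    return winner, crosstalk
```

A flagged sentence could then be rendered with both labels or queued for manual review; this does not fix diarization errors, it only keeps them from being hidden.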

So, looking at the industry, what are other players doing?

| Solution Type | Representative Tools/Services | Advantages | Disadvantages | One-Sentence Comment |
| --- | --- | --- | --- | --- |
| Open-Source Integrated Pipeline | WhisperX, Pyannote.audio | Top-notch performance, active community, representing the highest level of academia and open source. | Complex configuration, high resource consumption, not beginner-friendly. | Suitable for tech experts who like to tinker and pursue ultimate performance; novice users should run away. |
| Commercial Cloud Service API | Google STT, AWS Transcribe, iFlytek | Simple and easy to use, stable and reliable, basically no need to worry about underlying details. | Pay-as-you-go, high cost, data must be uploaded to the cloud. | A "cash-powered" choice, suitable for rapid development and companies that don't mind spending money. |
| This Project's Solution | ModelScope + FunASR | Free and open source, flexible and controllable, models can be freely combined and modified. | Need to troubleshoot and integrate yourself, current performance is unstable. | Suitable for learning, experimentation, and scenarios where performance requirements are not high. |

My initial goal was to test how well this solution works and, if it proved good enough, to integrate it into my other project pyVideoTrans. But based on the actual test results, its accuracy, especially on long recordings, falls well short of production standards. And the complexity of deploying and integrating top-tier open-source solutions like Pyannote.audio is, frankly, daunting.

Therefore, this exploration has come to a temporary end. Although the final product is not satisfactory, the whole process has been a valuable learning experience.