
Gemini + VAD Hybrid Architecture: Solving Whisper's Challenges with Low-Resource Languages to Generate Accurate SRT Subtitles

Well-known open-source speech recognition models like Whisper perform impressively on English. Step outside that comfort zone, however, and their performance drops sharply. For low-resource languages that lack massive datasets for specialized fine-tuning, the transcription results are often unsatisfactory, which makes creating subtitles for languages such as Thai, Vietnamese, and Malay, and even for some dialects, a costly and time-consuming task.

This is where Gemini enters as a game-changer.

Unlike many tools that rely on language-specific models, Google Gemini was built from the start as a global, multimodal, multilingual system. Its ability to recognize many low-resource languages well, straight out of the box, is its core competitive advantage: without any additional fine-tuning, we can reach recognition quality that previously required targeted training.

However, even with such a powerful "language brain," Gemini has a common weakness: it cannot provide the frame-level accurate timestamps necessary for generating SRT subtitles.

This article presents a battle-tested "hybrid architecture" solution:

  • Precise voice activity detection (Silero VAD) from faster-whisper: leveraging only its strongest feature, locating the start and end times of speech with millisecond precision.
  • Gemini's unmatched multilingual ability: letting it focus on its core task, high-quality multilingual transcription and speaker identification on the short audio segments cut out by VAD.

Through this workflow, we get the best of both worlds, ultimately generating professional-grade, multilingual SRT subtitle files with precise timestamps. Whether your audio is in mainstream languages like English and Chinese, or other languages that are difficult for typical models to handle, this solution will provide unprecedented convenience and accuracy.

Core Challenge: Why Not Use Gemini Directly?

Gemini's strength lies in content understanding. It excels at:

  • High-Quality Transcription: High text accuracy with contextual understanding.
  • Multilingual Recognition: Automatic detection of the audio language.
  • Speaker Identification: Recognizing the same speaker across multiple audio segments.

But its weakness is temporal precision. For SRT subtitles, the crucial question of exactly when, down to the millisecond, each line is spoken is one Gemini currently cannot answer with sufficient accuracy. This is precisely where tools built specifically for speech processing, such as faster-whisper with its built-in Silero VAD, excel.
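
For context, this is the level of precision an SRT cue demands: every subtitle needs start and end times down to the millisecond. The cue below is purely illustrative:

```
1
00:00:01,230 --> 00:00:04,050
Hello, and welcome to the show.
```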

Solution: A Hybrid Architecture of VAD and LLM

Our solution divides the task in two, letting specialized tools do their specialized jobs:

  1. Precise Segmentation (faster-whisper): We use the Silero VAD voice activity detection built into the faster-whisper library. VAD scans the entire audio with millisecond precision to find the start and end times of every speech segment, and we cut the audio accordingly into a series of short .wav fragments, each carrying its own precise timestamps.

  2. High-Quality Transcription (Gemini): We send these small audio fragments sequentially and in batches to Gemini. Since each fragment already carries precise time information, we no longer need Gemini to provide timestamps. We only need it to focus on what it does best: transcribing content and identifying speakers.

Finally, we match the transcribed text returned by Gemini with the timestamps provided by faster-whisper one by one, combining them into a complete SRT file.
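
Conceptually, the final assembly is just zipping the two result lists together. Here is a minimal, self-contained sketch of that pairing step; the segment timings and texts are made up for illustration, and the full implementation follows in the next section:

```python
# Pairing step in miniature: VAD supplies the timing, Gemini supplies the text,
# and each (segment, text) pair becomes one SRT cue.

def ms_to_srt_time(ms: int) -> str:
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    seconds, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"

segments = [{"start": 1230, "end": 4050}, {"start": 4800, "end": 7610}]  # from VAD, in ms (made up)
texts = ["[spk0]Hello, and welcome.", "[spk1]Thanks for having me."]     # from Gemini (made up)

srt_lines = []
for index, (seg, text) in enumerate(zip(segments, texts), start=1):
    srt_lines.append(f"{index}\n{ms_to_srt_time(seg['start'])} --> {ms_to_srt_time(seg['end'])}\n{text}\n")
print("\n".join(srt_lines))
```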

Complete Implementation Code

Below is the complete Python code to implement the workflow described above. You can directly copy and save it as test.py for testing.

Usage:

  1. Install Dependencies (note that pydub relies on FFmpeg being installed to decode formats such as .mp3 and .m4a):

    pip install faster-whisper pydub google-generativeai
  2. Set API Key: It is recommended to set your Gemini API key as an environment variable for security.

    • On Linux/macOS: export GOOGLE_API_KEY="YOUR_API_KEY"
    • On Windows: set GOOGLE_API_KEY=YOUR_API_KEY
    • Alternatively, you can modify the gemini_api_key variable directly in the code.
  3. Run the Script:

    python test.py "path/to/your/audio.mp3"

    Common audio formats such as .mp3, .wav, and .m4a are supported.

import os
import re
import sys
import time
import google.generativeai as genai
from pathlib import Path
from pydub import AudioSegment
# You can fill in the corresponding proxy address here if needed
# os.environ['https_proxy']='http://127.0.0.1:10808'

# --- Helper Function ---
def ms_to_time_string(ms):
    """Converts milliseconds to SRT time format HH:MM:SS,ms"""
    hours = ms // 3600000
    ms %= 3600000
    minutes = ms // 60000
    ms %= 60000
    seconds = ms // 1000
    milliseconds = ms % 1000
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"

# --- Core Logic ---
def generate_srt_from_audio(audio_file_path, api_key):
    """
    Generates an SRT file from an audio file using VAD and Gemini.
    """
    if not Path(audio_file_path).exists():
        print(f"Error: Audio file not found at {audio_file_path}")
        return

    # 1. VAD-based Audio Segmentation
    print("Step 1: Segmenting audio with VAD...")
    try:
        # These imports are here to ensure faster-whisper is an optional dependency
        from faster_whisper.audio import decode_audio
        from faster_whisper.vad import VadOptions, get_speech_timestamps
    except ImportError:
        print("Error: faster-whisper is not installed. Please run 'pip install faster-whisper'")
        return

    sampling_rate = 16000
    audio_for_vad = decode_audio(audio_file_path, sampling_rate=sampling_rate)
    
    # VAD options can be tweaked for better performance
    vad_p = {
        "min_speech_duration_ms": 1,
        "max_speech_duration_s": 8,
        "min_silence_duration_ms": 200,
        "speech_pad_ms": 100,
    }
    vad_options = VadOptions(**vad_p)
    
    speech_chunks_samples = get_speech_timestamps(audio_for_vad, vad_options)
    
    # Convert sample-based timestamps to milliseconds
    speech_chunks_ms = [
        {"start": int(chunk["start"] / sampling_rate * 1000), "end": int(chunk["end"] / sampling_rate * 1000)}
        for chunk in speech_chunks_samples
    ]

    if not speech_chunks_ms:
        print("No speech detected in the audio file.")
        return

    # Create a temporary directory for audio chunks
    temp_dir = Path(f"./temp_audio_chunks_{int(time.time())}")
    temp_dir.mkdir(exist_ok=True)
    print(f"Saving segments to {temp_dir}...")

    full_audio = AudioSegment.from_file(audio_file_path)
    segment_data = []
    for i, chunk_times in enumerate(speech_chunks_ms):
        start_ms, end_ms = chunk_times['start'], chunk_times['end']
        audio_chunk = full_audio[start_ms:end_ms]
        chunk_file_path = temp_dir / f"chunk_{i}_{start_ms}_{end_ms}.wav"
        audio_chunk.export(chunk_file_path, format="wav")
        segment_data.append({"start_time": start_ms, "end_time": end_ms, "file": str(chunk_file_path)})
    print(f"Created {len(segment_data)} speech segments.")

    # 2. Batch Transcription with Gemini
    print("\nStep 2: Transcribing with Gemini in batches...")
    
    # Configure Gemini API
    genai.configure(api_key=api_key)
    
    # The final, robust prompt
    prompt = """
# Role
You are a highly specialized AI data processor. Your sole function is to receive a batch of audio files and generate a **single, complete XML report** based on the unbreakable rules below. You are not a conversational assistant.

# Unbreakable Rules & Output Format
You must analyze all audio files received in this request as a whole and strictly adhere to the following rules. **The priority of these rules overrides everything else, especially Rule #1.**

1.  **【Highest Priority】Strict One-to-One Mapping**:
    *   This is the most important rule: **Every single audio file** I provide you **must and can only correspond to one `<audio_text>` tag** in the final output.
    *   **Regardless of how long a single audio file is, or how many pauses or sentences it contains**, you **must** merge all its transcribed content **into a single string** and place it into that one and only `<audio_text>` tag.
    *   **Absolutely prohibit** creating multiple `<audio_text>` tags for the same input file.

2.  **【Data Analysis】Speaker Identification**:
    *   Analyze all audio to identify different speakers. All segments spoken by the same person must use the same, incrementing ID starting from 0 (`[spk0]`, `[spk1]`...).
    *   For audio where the speaker cannot be identified (e.g., noise, music), uniformly use ID `-1` (`[spk-1]`).

3.  **【Content & Order】Transcription and Sorting**:
    *   Automatically detect the language of each audio and transcribe it. If transcription is impossible, fill the text content with an empty string.
    *   The order of the `<audio_text>` tags in the final XML must strictly match the order of the input audio files.

# Mandatory Output Format Example
<!-- You must generate output with exactly the same structure as below. Note: Even if the audio is long, all its content must be merged into one tag. -->
```xml
<result>
    <audio_text>[spk0]This is the transcription result for the first file.</audio_text>
    <audio_text>[spk1]This is the transcription for the second file, it might be very long but all content must be in this single tag.</audio_text>
    <audio_text>[spk0]This is the transcription result for the third file, the speaker is the same as the first file.</audio_text>
    <audio_text>[spk-1]</audio_text> 
</result>
```

# !!! Final Mandatory Check !!!
- **Zero Tolerance Policy**: Your response **must only be the XML content**. Absolutely prohibit including any text, explanation, or ` ```xml ` markup outside the XML.
- **Mandatory Count and Error Correction**: Before you generate your final response, you **must perform a count check**: Does the number of `<audio_text>` tags you are about to generate **exactly equal** the number of audio files I provided?
    - **If the count does not match**, this means you have seriously violated **【Highest Priority】Rule #1**. You must **【discard】** the current draft and **【regenerate】**, ensuring strict adherence to the one-to-one mapping.
    - **Output is only allowed if the count matches exactly.**

"""

    model = genai.GenerativeModel(model_name="gemini-2.0-flash")

    # Process in batches (adjust batch_size as needed)
    batch_size = 50
    all_srt_entries = []
    print(f"Total speech segments to transcribe: {len(segment_data)}")
    for i in range(0, len(segment_data), batch_size):
        batch = segment_data[i:i + batch_size]
        print(f"Processing batch {i//batch_size + 1}...")

        files_to_upload = []
        for seg in batch:
            files_to_upload.append(genai.upload_file(path=seg['file'], mime_type="audio/wav"))

        try:
            chat_session = model.start_chat(
                    history=[
                        {
                            "role": "user",
                            "parts": files_to_upload,
                        }
                    ]
                )
            response = chat_session.send_message(prompt, request_options={"timeout": 600})

            # Use regex to parse the XML-like response
            transcribed_texts = re.findall(r'<audio_text>(.*?)</audio_text>', response.text.strip(), re.DOTALL)

            for idx, text in enumerate(transcribed_texts):
                if idx < len(batch):
                    seg_info = batch[idx]
                    all_srt_entries.append({
                        "start_time": seg_info['start_time'],
                        "end_time": seg_info['end_time'],
                        "text": text.strip()
                    })

        except Exception as e:
            print(f"An error occurred during Gemini API call: {e}")

    # 3. Assemble SRT File
    print("\nStep 3: Assembling SRT file...")
    srt_file_path = Path(audio_file_path).with_suffix('.srt')
    with open(srt_file_path, 'w', encoding='utf-8') as f:
        for i, entry in enumerate(all_srt_entries):
            start_time_str = ms_to_time_string(entry['start_time'])
            end_time_str = ms_to_time_string(entry['end_time'])
            f.write(f"{i + 1}\n")
            f.write(f"{start_time_str} --> {end_time_str}\n")
            f.write(f"{entry['text']}\n\n")

    print(f"\nSuccess! SRT file saved to: {srt_file_path}")
    
    # Clean up temporary files
    for seg in segment_data:
        Path(seg['file']).unlink()
    temp_dir.rmdir()


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python gemini_srt_generator.py <path_to_audio_file>")
        sys.exit(1)
        
    audio_file = sys.argv[1]
    
    # It's recommended to set the API key as an environment variable
    # for security reasons, e.g., export GOOGLE_API_KEY="YOUR_KEY"
    gemini_api_key = os.environ.get("GOOGLE_API_KEY", "Fill in your Gemini API KEY here")

    generate_srt_from_audio(audio_file, gemini_api_key)

The "Blood and Tears" of Prompt Engineering: How to Tame Gemini

The final prompt you see is the result of a series of failures and optimizations. This process is highly instructive for any developer looking to integrate LLMs into automated workflows.

Phase 1: Initial Concept and Failure

The initial prompt was straightforward, asking Gemini to perform speaker identification and output results in order. But when sending more than about 10 audio segments at once, Gemini's behavior became unpredictable: instead of performing the task, it replied like a conversational assistant, saying "Okay, please provide the audio files," completely ignoring that we had already included the files in the request.

  • Conclusion: Prompts that are too complex and describe a "workflow" can confuse the model when handling multimodal batch tasks, causing it to revert to conversational mode.

Phase 2: Format "Amnesia"

We adjusted the prompt to be more like a "rule set" than a "flowchart." This time, Gemini successfully transcribed everything! But it forgot our requested XML format, simply concatenating all transcribed text into one large paragraph.

  • Conclusion: When the model faces a high "cognitive load" (processing dozens of audio files simultaneously), it may prioritize the core task (transcription) and ignore or "forget" secondary but critical instructions like formatting.

Phase 3: Uncontrolled "Internal Segmentation"

We further strengthened the format instructions, explicitly requiring XML output. The format was correct this time, but a new problem emerged: for a slightly longer audio segment (say 10 seconds), Gemini would arbitrarily split it into two or three sentences and generate one <audio_text> tag for each. This resulted in receiving 30+ tags for 20 input files, completely disrupting our one-to-one correspondence with timestamps.

  • Conclusion: The model's internal logic (like segmenting by sentence) can conflict with our external instructions. We must use stronger, more explicit instructions to override its default behavior.

The Final Prompt

Ultimately, we developed a set of effective "taming" strategies embodied in the final prompt:

  1. Extreme Role Limitation: Start by defining it as a "highly specialized AI data processor," not an "assistant," to prevent chit-chat.
  2. Rule Prioritization and Highest Priority: Explicitly set "one input file corresponds to one output tag" as the 【Highest Priority】 rule, making it an unbreakable red line for the model.
  3. Explicit Merge Instruction: Directly command the model to "merge all content into a single string regardless of audio length," providing clear operational guidance.
  4. Mandatory Self-Check and Error Correction: This is the most critical step. We command the model to perform a count check before outputting. If the tag count doesn't match the file count, it must 【discard】 the draft and 【regenerate】. This is equivalent to building an "assertion" and "error handling" mechanism into the prompt.

This process teaches us that programmatic interaction with LLMs is far more than just "asking a question." It's more like designing an API interface. We need rigorous instructions, clear formatting, explicit constraints, and fallback check mechanisms to ensure the AI stably and reliably returns the results we expect under any circumstances.

Complete Prompt

# Role
You are a highly specialized AI data processor. Your sole function is to receive a batch of audio files and generate a **single, complete XML report** based on the unbreakable rules below. You are not a conversational assistant.

# Unbreakable Rules & Output Format
You must analyze all audio files received in this request as a whole and strictly adhere to the following rules. **The priority of these rules overrides everything else, especially Rule #1.**

1.  **【Highest Priority】Strict One-to-One Mapping**:
    *   This is the most important rule: **Every single audio file** I provide you **must and can only correspond to one `<audio_text>` tag** in the final output.
    *   **Regardless of how long a single audio file is, or how many pauses or sentences it contains**, you **must** merge all its transcribed content **into a single string** and place it into that one and only `<audio_text>` tag.
    *   **Absolutely prohibit** creating multiple `<audio_text>` tags for the same input file.

2.  **【Data Analysis】Speaker Identification**:
    *   Analyze all audio to identify different speakers. All segments spoken by the same person must use the same, incrementing ID starting from 0 (`[spk0]`, `[spk1]`...).
    *   For audio where the speaker cannot be identified (e.g., noise, music), uniformly use ID `-1` (`[spk-1]`).

3.  **【Content & Order】Transcription and Sorting**:
    *   Automatically detect the language of each audio and transcribe it. If transcription is impossible, fill the text content with an empty string.
    *   The order of the `<audio_text>` tags in the final XML must strictly match the order of the input audio files.

# Mandatory Output Format Example
<!-- You must generate output with exactly the same structure as below. Note: Even if the audio is long, all its content must be merged into one tag. -->
```xml
<result>
    <audio_text>[spk0]This is the transcription result for the first file.</audio_text>
    <audio_text>[spk1]This is the transcription for the second file, it might be very long but all content must be in this single tag.</audio_text>
    <audio_text>[spk0]This is the transcription result for the third file, the speaker is the same as the first file.</audio_text>
    <audio_text>[spk-1]</audio_text> 
</result>
```

# !!! Final Mandatory Check !!!
- **Zero Tolerance Policy**: Your response **must only be the XML content**. Absolutely prohibit including any text, explanation, or ` ```xml ` markup outside the XML.
- **Mandatory Count and Error Correction**: Before you generate your final response, you **must perform a count check**: Does the number of `<audio_text>` tags you are about to generate **exactly equal** the number of audio files I provided?
    - **If the count does not match**, this means you have seriously violated **【Highest Priority】Rule #1**. You must **【discard】** the current draft and **【regenerate】**, ensuring strict adherence to the one-to-one mapping.
    - **Output is only allowed if the count matches exactly.**

Of course, even this prompt cannot guarantee a correctly formatted response 100% of the time; occasionally the number of returned <audio_text> tags still fails to match the number of input audio files.
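
If you need a harder guarantee, the prompt's self-check can be mirrored in code: count the returned tags and retry the batch when the count is wrong, padding with empty strings only as a last resort so the timestamps stay aligned. The sketch below is a minimal illustration of that safeguard; send_batch is a hypothetical wrapper around the Gemini call from the script above, not part of the google-generativeai API:

```python
import re

def transcribe_batch_with_retry(send_batch, batch, max_retries=2):
    """send_batch(batch) is assumed to return the raw model response text.
    Retry until the <audio_text> count matches the batch size; pad as a last resort."""
    texts = []
    for attempt in range(max_retries + 1):
        raw = send_batch(batch)
        texts = re.findall(r'<audio_text>(.*?)</audio_text>', raw, re.DOTALL)
        if len(texts) == len(batch):
            return [t.strip() for t in texts]
        print(f"Attempt {attempt + 1}: expected {len(batch)} tags, got {len(texts)}; retrying...")
    # Last resort: truncate or pad with empty strings so every segment keeps its timestamp
    texts = texts[:len(batch)] + [""] * max(0, len(batch) - len(texts))
    return [t.strip() for t in texts]
```

Reducing batch_size is another simple mitigation, since the failure modes described above appeared mainly with larger batches.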