
Parakeet-API: Build a Private English Speech-to-Text Service Faster and More Accurate Than Whisper

In today's AI applications, speech-to-text (STT) has become a foundational capability. OpenAI's Whisper model is known for its multilingual support and high accuracy, but in specific scenarios, are there better options? The answer is yes.

If you need a solution that is faster and more accurate in English recognition and can be deployed completely privately, NVIDIA's Parakeet model is your best choice.

This article will detail how to use the Parakeet-TDT-0.6B model to build a high-performance service compatible with the OpenAI API. All code is open source for you to deploy and use directly.

Open Source Project Address: https://github.com/jianchang512/parakeet-api

Why Choose Parakeet Over Whisper?

Choosing the right technology requires weighing the pros and cons. Parakeet is not intended to replace Whisper, but it offers a better solution in a specific niche.

Advantages: Faster Speed and Higher English Accuracy

  1. Focus and Optimization: The Parakeet model is deeply optimized for English speech recognition. Compared to Whisper's large multilingual models, Parakeet's architecture is more lightweight and focuses on a single language, so on English audio it typically achieves a lower word error rate and faster processing than Whisper.
  2. Excellent Timestamps: The model generates highly accurate word-level and segment-level timestamps, which are crucial for producing high-quality SRT subtitles or for downstream audio analysis (illustrated in the sketch below).
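
To show what those timestamps enable, here is a minimal, illustrative Python sketch that turns a list of segments (each with a start time, end time, and text, roughly the shape of timestamped ASR output) into SRT blocks. The segments variable and its field names are assumptions for the example, not the project's actual data structure.

python
# Illustrative only: assumes `segments` is a list of dicts with
# "start"/"end" in seconds and "text" (this shape is an assumption).
def to_srt_time(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm string SRT expects."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Build an SRT document from (start, end, text) segments."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n{seg['text']}\n"
        )
    return "\n".join(blocks)

# Example with two segments:
print(segments_to_srt([
    {"start": 0.0, "end": 2.4, "text": "Hello there."},
    {"start": 2.4, "end": 5.1, "text": "Welcome to the demo."},
]))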

Disadvantages: English Only Support

Currently, the core parakeet-tdt-0.6b-v2 model only supports English speech recognition. If your business needs to handle multiple languages, Whisper is still the more suitable choice.

Project Architecture and Tech Stack

The service is built on the following toolchain to deliver an efficient, stable transcription pipeline:

  • Core Model: NVIDIA parakeet-tdt-0.6b-v2
  • Web Framework: Flask
  • Production Server: Waitress (multi-threaded)
  • Format Conversion: FFmpeg
  • API Specification: OpenAI v1/audio/transcriptions compatible

Quick Start: Installation and Environment Configuration

Before diving into the code, let's set up the runtime environment.

Step 1: Install System Dependencies (FFmpeg)

FFmpeg is key to handling audio and video format conversions. Make sure it is installed on your system.

  • Ubuntu/Debian: sudo apt update && sudo apt install ffmpeg
  • macOS (Homebrew): brew install ffmpeg
  • Windows: Download from the official website and add the bin directory to the system PATH.
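
In practice, the service needs to normalize arbitrary audio and video uploads into a format the model accepts (typically 16 kHz mono WAV) before transcription. The snippet below is a minimal sketch of driving FFmpeg from Python for that step; the function and file names are placeholders, not the project's actual conversion code.

python
# Minimal sketch: convert any audio/video file to 16 kHz mono WAV with FFmpeg.
# File names are placeholders; error handling is kept deliberately minimal.
import subprocess

def convert_to_wav(src: str, dst: str = "converted.wav") -> str:
    subprocess.run(
        [
            "ffmpeg", "-y",   # overwrite the output file if it exists
            "-i", src,        # input audio or video file
            "-vn",            # drop any video stream
            "-ac", "1",       # mono
            "-ar", "16000",   # 16 kHz sample rate
            dst,
        ],
        check=True,
    )
    return dst

# convert_to_wav("my_meeting.mp4")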

Step 2: Configure the Python Environment and Install Dependencies

It is recommended to use a virtual environment.

bash
# Create and activate the virtual environment
python3 -m venv venv
source venv/bin/activate # (Linux/macOS)
# venv\Scripts\activate # (Windows)

# Install all necessary libraries
pip install numpy waitress flask typing_extensions torch "nemo_toolkit[asr]"
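
As a quick sanity check that the toolkit installed correctly, you can try loading Parakeet directly in Python. This follows NeMo's documented from_pretrained/transcribe usage; the audio path is a placeholder, and the first call downloads the model weights.

python
# Sanity check: load Parakeet through NeMo and transcribe a short WAV file.
# "sample_english_audio.wav" is a placeholder path.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)
output = asr_model.transcribe(["sample_english_audio.wav"])
print(output[0].text)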

Step 3: Key Performance Optimization - Configure CUDA

To maximize the model's performance, it is highly recommended to run it in an environment equipped with an NVIDIA GPU and to configure CUDA correctly.

If CUDA is not configured, you may see the following log when you first start the service:

The warning message [NeMo W] ... Cuda graphs with while loops are disabled... Reason: CUDA is not available means that NeMo failed to find an available GPU and has automatically fallen back to CPU mode.

  • Impact: The service can run, but the transcription speed will be very slow.
  • Solution:
    1. Make sure you have installed the NVIDIA graphics card driver.
    2. Install the CUDA Toolkit compatible with the driver.
    3. Install a CUDA-enabled version of PyTorch. This is the easiest step to make mistakes on. Please visit the PyTorch website to get the correct installation command for your CUDA version, for example:
      bash
      # Example: for CUDA 12.4
      pip uninstall -y torch
      pip install torch --index-url https://download.pytorch.org/whl/cu124

Once CUDA is configured correctly, this warning disappears and the GPU typically brings a severalfold, sometimes tens-of-times, speedup over CPU-only transcription.
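
Before starting the service, a quick way to confirm that your PyTorch build actually sees the GPU (and that NeMo will therefore not fall back to CPU) is a short check like this:

python
# Verify that the installed PyTorch build can see the GPU.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("PyTorch CUDA build:", torch.version.cuda)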

Ready-to-Use Web Interaction Interface

For a quick hands-on experience, the project ships with a simple, clean front-end page. Just open the service address (e.g., http://127.0.0.1:5092) in your browser to access all of the features:

  • Drag and Drop Upload: Supports dragging and dropping or clicking to select audio and video files.
  • Real-time Status: Clearly displays upload, processing, completion, and error states.
  • Result Preview: The transcribed SRT subtitles will be displayed directly in the text box.
  • One-click Download: The generated SRT subtitle file can be downloaded to the local computer.

The interface is built with plain JavaScript and CSS and has no external library dependencies, so it loads quickly and feels responsive.

Core Implementation: Compatibility and Performance

  • Model Pre-loading: When the service starts, the NeMo model is loaded into memory, avoiding the huge latency caused by repeatedly loading the model for each request.
  • Seamless Switching of OpenAI SDK: By pointing the base_url of the OpenAI client to our local service, any existing program using the OpenAI SDK can switch to our private API at almost zero cost.
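
To make these two points concrete, here is a simplified sketch of how a pre-loaded NeMo model can sit behind a Flask route served by Waitress. This is an illustration, not the project's actual implementation: the plain-text response, the missing FFmpeg normalization and SRT formatting, and the thread count are all simplifications.

python
# Simplified sketch (not the project's actual code): load the model once at
# startup and expose an OpenAI-compatible transcription route via Waitress.
import tempfile

import nemo.collections.asr as nemo_asr
from flask import Flask, request
from waitress import serve

app = Flask(__name__)

# Pre-load the model once so individual requests do not pay the loading cost.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)

@app.route("/v1/audio/transcriptions", methods=["POST"])
def transcriptions():
    # The OpenAI SDK sends the audio as a multipart "file" field.
    upload = request.files["file"]
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        upload.save(tmp.name)
        path = tmp.name
    # The real service would first normalize the upload to 16 kHz WAV with
    # FFmpeg and build SRT from the timestamps; here we return plain text.
    output = asr_model.transcribe([path])
    return output[0].text, 200, {"Content-Type": "text/plain"}

if __name__ == "__main__":
    # Waitress serves the Flask app with multiple worker threads.
    serve(app, host="0.0.0.0", port=5092, threads=4)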

Client call example (client_test.py):

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5092/v1", # Point to the local service
    api_key="your-dummy-key"             # Any virtual key
)

with open("my_english_audio.mp3", "rb") as audio_file:
    srt_content = client.audio.transcriptions.create(
        model="parakeet",
        file=audio_file,
        response_format="srt"
    )
    print(srt_content)