
🧠 Using WhisperX API in pyVideoTrans

WhisperX is a powerful speech recognition model that also supports speaker diarization. However, the official release only provides a command-line tool, which isn't very beginner-friendly, and it offers no API.

So I created an enhanced version: whisperx-api! Building on the original model, it adds:

  • Local Web UI: open your browser, upload a file, and transcribe with one click.
  • OpenAI-Compatible API: a drop-in replacement for the original Whisper API that integrates directly into projects.
  • Speaker Diarization: automatically identifies and labels different speakers.
  • One-Click Start: uses the uv tool to automate environment setup.

It is used as a speech recognition backend in the pyVideoTrans project.


Prerequisites

You will need the following two tools:

  • uv: an ultra-fast Python package manager that sets up the environment with a single command.
  • FFmpeg: a powerful audio/video processing tool used here for format conversion.
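Before launching, you can confirm both tools are reachable from the command line. The snippet below is a minimal sketch using only Python's standard library (the `missing_tools` helper is an illustration, not part of whisperx-api):

```python
import shutil


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on the system PATH."""
    return [t for t in tools if shutil.which(t) is None]


if __name__ == "__main__":
    missing = missing_tools(["uv", "ffmpeg"])
    if missing:
        print("Please install or add to PATH:", ", ".join(missing))
    else:
        print("uv and FFmpeg are both available.")
```

If FFmpeg is listed as missing even though you installed it, it usually means its folder was never added to the system PATH.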

🚀 Get Started in 3 Steps with One-Click Launch!

✅ Step 1: Download Project Code

Visit the project homepage: https://github.com/jianchang512/whisperx-api

Click the green "Code" button → "Download ZIP" to download the archive and unzip it. Then, enter the folder containing the app.py and index.html files.

Click the folder's address bar in File Explorer, clear it, type cmd, and press Enter to open a terminal window in that folder.


✅ Step 2: Get the Access Pass for the "Speaker Diarization" Model (Skip if you don't need speaker features)

Downloading the speaker diarization model requires accepting its license agreement. You therefore need to "sign the agreement" on the Hugging Face website and obtain an access token first. Note: this step requires a network connection that can reach Hugging Face.

① Register and Log in to Hugging Face

Visit: https://huggingface.co/ Create a free account and log in.

② Create an Access Token

Visit: https://huggingface.co/settings/tokens Click "New token" → select the Read permission → create, then copy the token string starting with hf_.


③ Agree to Model Use Agreements (Must Check!)

Visit the following two model pages in turn and accept the usage agreements:

On each page, fill in the two text boxes shown, then click the button to submit.

④ Save the Token

Return to your whisperx-api project folder and create a new file named token.txt. Paste in the token starting with hf_ that you just copied, then save.
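As a quick sanity check, you can verify that token.txt contains something shaped like a Hugging Face token (it should be a single string starting with hf_). This helper is only an illustration, not part of whisperx-api:

```python
from pathlib import Path


def looks_like_hf_token(text: str) -> bool:
    """True if `text` is a single hf_-prefixed string with no spaces."""
    token = text.strip()
    return token.startswith("hf_") and len(token) > 3 and " " not in token


if __name__ == "__main__":
    token_file = Path("token.txt")
    if token_file.exists() and looks_like_hf_token(token_file.read_text()):
        print("token.txt looks OK.")
    else:
        print("token.txt is missing or does not contain an hf_ token.")
```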


✅ Step 3: One-Click Start!

Ensure your cmd terminal is still in the folder containing app.py, then run the startup command (for all future uses, this is the only command you need):

```bash
uv run app.py
```

On first run, installing modules and dependencies can take quite a while; please be patient.

When you see the server's startup output in the terminal, the launch has succeeded.

The browser will automatically open this address: http://127.0.0.1:9092. Enter this address into the pyVideoTrans -- Menu -- Voice Recognition Settings -- WhisperX API Window -- API Address text box.

You will also see a clean web interface in the browser 👇. If you don't need to use it in the browser, you can close this page, but if you need API calls, do not close the cmd terminal.


💡 User Guide

Now you can use it in two ways 👇


🌐 Method 1: Web Operation Interface

  1. Upload File: click or drag audio/video files into the dotted box.

  2. Settings

    • Language: Select the corresponding language if known; otherwise, select "Auto Detect".
    • Model: Larger models are more accurate but slower. large-v3-turbo is recommended.
    • Prompt: you can add names, terms, etc. to improve recognition accuracy, e.g., OpenAI, WhisperX, PyTorch.
  3. Start Transcription: click "Submit Transcription" and wait for processing to complete.

  4. View & Download: results are displayed below. You can edit them directly and click "Download SRT File" to save.


⚙️ Method 2: API Call

Enter the address http://127.0.0.1:9092 into the API Address text box located at: pyVideoTrans -- Menu -- Voice Recognition Settings -- WhisperX API Window.
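Because the server exposes an OpenAI-compatible API, you can also call it from your own code instead of through pyVideoTrans. The sketch below assumes the standard OpenAI transcription endpoint path (`/v1/audio/transcriptions`) is served on port 9092 and that field names follow the OpenAI Whisper API; adjust if whisperx-api differs:

```python
import requests

# Assumed endpoint path, following the OpenAI Whisper API convention.
API_URL = "http://127.0.0.1:9092/v1/audio/transcriptions"


def build_request(audio, filename, model="large-v3-turbo"):
    """Assemble the multipart form for an OpenAI-style transcription call.

    `audio` is a file-like object opened in binary mode.
    """
    files = {"file": (filename, audio, "application/octet-stream")}
    data = {"model": model, "response_format": "srt"}
    return files, data


def transcribe(audio_path):
    """POST an audio file to the running whisperx-api server; returns SRT text."""
    with open(audio_path, "rb") as f:
        files, data = build_request(f, audio_path)
        resp = requests.post(API_URL, files=files, data=data)
    resp.raise_for_status()
    return resp.text


# Usage (with the server running):
#   print(transcribe("example.wav"))
```

Keep the cmd terminal running while making API calls; the request will fail with a connection error otherwise.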


❓ FAQ

Q: "FFmpeg not found" error during startup? A: FFmpeg is not installed or not added to the system PATH. Please re-check the installation steps in the "Prerequisites" section.


Q: Clicked the transcribe button but nothing happens? A: The first run downloads models, so please be patient. If an error occurs, check the terminal logs; usually the cause is a VPN/proxy that is off or unstable.


Q: Why are there no [Speaker1], [Speaker2] tags? A:

  • Tags won't appear if only one person is speaking in the audio.
  • Your Hugging Face token may be configured incorrectly, or the model agreements were not accepted. Please re-check Step 2.

Q: Processing speed is too slow? A: In CPU mode it is indeed slow; users with NVIDIA graphics cards will see speeds dozens of times faster.
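To check whether a CUDA-capable GPU is actually visible to PyTorch (which WhisperX runs on), a small probe like the one below can help. It simply returns False when torch is not installed, so it is safe to run anywhere:

```python
def gpu_available() -> bool:
    """True if PyTorch is installed and can see at least one CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    print("CUDA GPU available:", gpu_available())
```

If this prints False on a machine with an NVIDIA card, the usual cause is a CPU-only PyTorch build or a missing/outdated GPU driver.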