A Beginner-Friendly Guide to whisper.cpp, the Zero-Dependency, High-Efficiency Speech-to-Text Tool (with Batch Processing Script)

In speech transcription, OpenAI's Whisper is one of the best-known open models, with support for dozens of languages.

Many excellent derivative projects have grown up around Whisper, such as faster-whisper, WhisperX (which adds speaker diarization), and the subject of this article, whisper.cpp.

Project Link: https://github.com/ggml-org/whisper.cpp

Why Choose whisper.cpp?

whisper.cpp is a C++ implementation of the Whisper model, and its core advantages are significant:

  • Virtually Zero Dependencies: This is its biggest selling point. Unlike other projects that require installing a heap of Python libraries via pip, whisper.cpp has almost no dependencies. You no longer have to worry about falling into the frustrating "rabbit hole" of environment configuration, module installation errors, and version conflicts.
  • Excellent Performance: The native performance advantage of C++ makes its transcription speed incredibly fast, with relatively low resource consumption.
  • Cross-Platform & Hardware Support: It runs on Windows, macOS, and Linux. Better yet, it supports a range of hardware acceleration backends: NVIDIA GPUs through CUDA, AMD and other GPUs through Vulkan, Apple hardware through Metal and Core ML, and even some dedicated NPUs, all of which can be leveraged for further speed boosts.
  • Simple Deployment: You can download the official pre-compiled executable, unzip it, and use it immediately. The entire program package is very small.

Of course, it has one aspect that might not be friendly to newcomers: it's purely command-line operated.

But don't worry, this article is here to solve that problem. It will guide you step-by-step through the entire process, from downloading to usage, and provide a custom batch processing script specifically for Windows users. All you need to do is copy and paste, and you'll be able to transcribe all audio/video files in a folder with a double-click.


Step 1: Preparation

Before we start transcribing, we need to get the necessary program and files ready.

1. Download the whisper.cpp Main Program

First, you need to download the corresponding pre-compiled version based on your operating system and hardware.

Download Link: https://github.com/ggml-org/whisper.cpp/releases/tag/v1.8.2

How to choose?

  • Windows (without NVIDIA GPU): Download whisper-bin-x64.zip.
  • Windows (with NVIDIA GPU): Choose based on your installed CUDA version. For example, if you have CUDA 12.x installed, download whisper-cublas-12.4.0-bin-x64.zip; for CUDA 11.x, download whisper-cublas-11.8.0-bin-x64.zip. Using GPU acceleration can significantly increase transcription speed.
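  • macOS Users: Note that whisper-v1.8.2-xcframework.zip is a library bundle for embedding whisper.cpp in Apple apps rather than a ready-to-run whisper-cli executable. To follow this guide on a Mac, build whisper-cli from source as described in the project README. The batch script later in this article is Windows-only.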
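If you are unsure which CUDA build applies to your Windows machine, the two commands below are one way to check. This is a sketch that assumes the NVIDIA driver is installed (for nvidia-smi) and, for the second command, that the CUDA Toolkit is installed and on your PATH (for nvcc).

```bat
rem Shows the GPU, the driver version, and the highest CUDA version the driver supports
nvidia-smi

rem Shows the version of the CUDA Toolkit itself, if one is installed
nvcc --version
```

If nvcc is not recognized, the CUDA Toolkit is probably not installed or not on your PATH; in that case the CPU build (whisper-bin-x64.zip) is the safer starting point.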

2. Unzip and Organize the File Structure

After downloading, unzip the archive. It is strongly recommended to extract it to a short path that contains no spaces or non-ASCII characters (such as Chinese), for example D:\whispercpp, to avoid unnecessary problems.

After unzipping, go into the folder and make sure you can see a file named whisper-cli.exe (we'll refer to this folder as the whisper.cpp root directory from now on).

For better organization, we'll manually create two new subfolders inside this directory:

  • models: To store the transcription model files we'll download later.
  • wavs: To store all the audio or video files you want to transcribe.
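The same two folders can also be created from a command prompt. A minimal sketch, assuming the archive was extracted to D:\whispercpp as suggested above (adjust the path to your own):

```bat
rem Assumes the archive was extracted to D:\whispercpp (adjust to your own path)
cd /d D:\whispercpp
mkdir models
mkdir wavs

rem List the folder to confirm whisper-cli.exe, models and wavs are all present
dir /b
```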

3. Install FFmpeg (for Format Conversion)

whisper.cpp expects standard 16-bit .wav audio as input. If your files are in other audio formats such as mp3 or m4a, or in video formats such as mp4 or mkv, you might get an error or an empty result.

This is where a powerful tool comes in handy—FFmpeg. It can easily convert various file formats into the one whisper.cpp requires.
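As a rough illustration, a single file can be converted by hand with a command like the one below. It uses the same settings as the batch script later in this article (16 kHz, mono, 16-bit PCM); input.mp4 and output.wav are placeholder names, and the command assumes FFmpeg is already installed.

```bat
rem input.mp4 and output.wav are placeholder names; replace them with your own files
ffmpeg -i input.mp4 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```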

  • Check if it's installed: Open a command prompt (cmd), type ffmpeg, and press Enter. If you see version information, FFmpeg is installed and its location is on your PATH environment variable. If you instead see a message like "'ffmpeg' is not recognized as an internal or external command", you'll need to download it manually and add its bin folder to PATH.
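Two slightly more targeted checks are sketched below; both rely only on standard Windows and FFmpeg commands.

```bat
rem Prints the full path of ffmpeg.exe if it can be found on PATH
where ffmpeg

rem Prints only the version banner instead of the full usage text
ffmpeg -version
```

If where reports that it could not find any files for the given pattern, FFmpeg is either not installed or not on PATH.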

4. Download Model Files

Generally, the larger the model, the higher the accuracy, but it also requires more computing resources and is slower.

Model Download Link: https://huggingface.co/ggerganov/whisper.cpp/tree/main

It's recommended to choose from the following models (tiny/base models are too small to be practical, and other quantized models are not necessary for general use):

| Model Name | File Size | Recommended Use Case |
|---|---|---|
| ggml-small.bin | ~488 MB | For lower-spec computers, when speed is a priority and minor errors are acceptable. |
| ggml-medium.bin | ~1.5 GB | A balanced choice. |
| ggml-large-v2.bin | ~3.0 GB | For pursuing high accuracy on a decent computer. |
| ggml-large-v3.bin | ~3.1 GB | Highest accuracy. Currently the best-performing model, recommended for users seeking ultimate quality. |
| ggml-large-v3-turbo.bin | ~1.6 GB | An optimized version of large-v3. Faster than v2/v3 with comparable accuracy, suitable for most users. |

Choose and download the model you need, then place the .bin file into the models folder we created earlier.
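If you prefer the command line to the browser, a model can also be fetched with curl. This is only a sketch: it assumes curl.exe is available (it ships with recent Windows 10 and Windows 11 builds) and that the file is served from the repository's standard Hugging Face "resolve/main" download path.

```bat
rem Run from the whisper.cpp root directory; assumes curl.exe is available
rem -L follows redirects, -o chooses where the downloaded file is saved
curl -L -o models\ggml-large-v3-turbo.bin https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

rem Confirm the file arrived and compare its size with the table above
dir models
```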


Step 2: Start Transcribing

With all preparations done, we can now begin the actual transcription.

1. Run it Manually to Understand the Basic Workflow

To understand how whisper.cpp works, let's run it once manually.

  1. Open Command Prompt: In File Explorer, navigate to the folder containing whisper-cli.exe, type cmd in the address bar, and press Enter. This will open a command prompt directly in that directory.

  2. Prepare the Command: The basic command format for whisper.cpp is: whisper-cli [options] -f [path_to_your_audio_file] -m [path_to_your_model_file]

    Let's construct a specific example. Assume:

    • The file to be transcribed, 60.wav, is on the Desktop with the full path C:\Users\c1\Desktop\60.wav.
    • We are using the ggml-small.bin model.
    • We want to auto-detect the language and output an SRT subtitle file.

    The final command would be: whisper-cli -np -f "C:\Users\c1\Desktop\60.wav" -l auto -m models/ggml-small.bin -osrt

    Command Parameter Explanation:

| Parameter | Function | Description |
|---|---|---|
| whisper-cli | Main program | This is the core executable for transcription. |
| -np | No prints | Suppresses everything except the transcription results. This makes the output cleaner when running in a script. |
| -f "File Path" | File | Specifies the input file to transcribe. If the path contains spaces, be sure to enclose it in double quotes. |
| -l auto | Language | Specifies the language. auto means auto-detect. You can also specify it explicitly, like zh (Chinese) or en (English). |
| -m models/... | Model | Specifies the path to the model file to use. |
| -osrt | Output SRT | Indicates that the output should be a .srt subtitle file. |

  3. Execute the Command: Paste the command above into the cmd window and press Enter.

    The program will start running. After it finishes, you will find a newly generated 60.wav.srt subtitle file in the source file's directory (the Desktop in this case).

    whisper.cpp actually offers many more advanced parameters for finer control. If you're interested, you can explore the official documentation. For everyday use, knowing the ones above is sufficient.

    More command parameters: https://github.com/ggml-org/whisper.cpp/tree/master/examples/cli
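As a small taste of those extra options, here is a hedged variation on the example above. It reuses the same sample file and model; the choice of -l zh and the output name are illustrative assumptions, not requirements.

```bat
rem Show every option supported by your build
whisper-cli -h

rem Force Chinese instead of auto-detection, and write both .srt and .txt
rem -of sets the output path WITHOUT an extension, giving 60.srt and 60.txt
whisper-cli -np -f "C:\Users\c1\Desktop\60.wav" -l zh -m models/ggml-small.bin -osrt -otxt -of "C:\Users\c1\Desktop\60"
```

Setting the output name with -of avoids the default 60.wav.srt naming, and it is the same option the batch script in the next section uses to place subtitles in the srt folder.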

2. Use a BAT Script for One-Click Batch Processing

Typing commands manually is inefficient and prone to errors. To solve this, I've prepared a Windows batch (.bat) script that automates the following tasks:

  • Iterates through all files in the wavs folder.
  • Uses FFmpeg to automatically convert non-.wav audio/video files into temporary .wav files.
  • Calls whisper.cpp to perform the transcription.
  • Saves the generated .srt subtitle files into a dedicated srt folder.
  • Automatically cleans up temporary files.

Steps:

  1. Create the Script File: In the same directory as whisper-cli.exe, right-click, select "New" -> "Text Document", and name it run.txt.

  2. Paste the Code: Double-click to open run.txt, then copy and paste all the code below into it.

    bat
    @echo off
    rem Set the model you want to use here
    set model_name=ggml-large-v3-turbo.bin
    
    setlocal enabledelayedexpansion
    
    rem Change directory to the script's location
    cd /d "%~dp0"
    
    rem Create srt directory if it doesn't exist
    if not exist "srt\" mkdir "srt"
    
    
    set "audio_ext=.wav" 
    
    rem Check if the model file exists
    if not exist "models/%model_name%" (
        echo.
        echo =================================================================
        echo [ERROR] Model file not found: models/%model_name%
        echo.
        echo Please make sure you have downloaded the model and placed it in the 'models' folder.
        echo If your model has a different name, please edit the "set model_name=..." line in this script.
        echo.
        echo Model download page: https://huggingface.co/ggerganov/whisper.cpp/tree/main  
        echo =================================================================
        echo.
        pause
        exit /b 1
    )
    
    
    echo ===============================================
    echo  Starting batch transcription for all files in "%~dp0wavs"
    echo  Results will be saved in the "%~dp0srt" folder.
    echo  Current model: %model_name%
    echo ===============================================
    echo.
    
    rem Loop through all files (*.*) in the wavs folder
    FOR %%F IN (wavs\*.*) DO (
        rem Store values WITHOUT embedded quotes; quote them where they are used
        set "ext=%%~xF"
        if not defined ext set "ext=.none"
        set "sttwav=%%F"
        set "temp_wav="
        echo.
        echo ----------------- Processing file: "%%~nxF" -----------------

        rem If the format is not wav, use ffmpeg to convert it
        echo !audio_ext! | findstr /i /c:"!ext!" >nul

        if !errorlevel! neq 0 (
            set "temp_wav=wavs\temp_%%~nxF.wav"

            echo [-] "%%F" is not a standard WAV file. Converting with FFmpeg...

            ffmpeg -y -i "%%F" -ac 1 -ar 16000 -c:a pcm_s16le "!temp_wav!" >nul 2>&1

            if !errorlevel! equ 0 (
                echo [+] FFmpeg conversion successful!
                set "sttwav=!temp_wav!"
            ) else (
                echo [X] ERROR: FFmpeg failed to convert "%%~nxF". Skipping file.
                set "sttwav="
            )
        )

        if defined sttwav (
            echo [-] Calling whisper.cpp for transcription...

            set "temp_log=temp_%%~nxF.log"
            whisper-cli.exe -np -f "!sttwav!" -l auto -m "models/%model_name%" -osrt -of "srt\%%~nF" > "!temp_log!" 2>&1
            findstr /i /c:"error:" "!temp_log!" >nul
            if !errorlevel! equ 0 (
                rem Keep the log on failure so the details can be inspected
                echo [X] ERROR: Transcription failed for "%%~nxF". See "!temp_log!" for details.
            ) else (
                echo [OK] SUCCESS: "%%~nxF" transcribed. Subtitle saved as "srt\%%~nF.srt"
                del "!temp_log!"
            )
        )

        rem Always remove the temporary WAV, even if conversion or transcription failed
        if defined temp_wav if exist "!temp_wav!" del "!temp_wav!"
        echo --------------------------------------------------------------------
    )
    
    echo.
    echo ===============================================
    echo  All files have been processed!
    echo  Results are saved in the "%~dp0srt" folder.
    echo ===============================================
    echo.
    
    
    pause
  3. Change the File Extension: Save and close run.txt. Then, rename the file's extension from .txt to .bat. The final filename should be run.bat.

    Important: If your computer doesn't show file extensions, you need to enable "File name extensions" in the Folder Options. Otherwise, you might mistakenly name it run.bat.txt.

    If you double-click the file and it still opens in Notepad, you haven't successfully changed the extension. Make sure there is no .txt at the end of the filename.

  4. Configure the Model: Right-click on run.bat and select "Edit". You'll see this line near the top of the code: set model_name=ggml-large-v3-turbo.bin

    You can change this to the filename of any model you have downloaded and want to use. For example, if you want to use the large-v3 model, change it to: set model_name=ggml-large-v3.bin

    Save the file after making your change.

  5. Start Batch Transcription:

    • Copy or move all the audio and video files you want to transcribe into the wavs folder.
    • Double-click the run.bat script to run it.
    • A black command prompt window will pop up, and the script will automatically start processing the files. Just wait for it to finish.

When the window displays "All files have been processed!", you can close it. Now, check the srt folder. All your subtitle files should be there!


Step 3: Script Code Explanation (Optional Reading)

For those who want to understand how the script works, here's a simple explanation of the core commands:

  • @echo off: Hides the commands being executed, making the window output cleaner.
  • set model_name=...: Defines a variable model_name so you can easily change the model at the top of the script.
  • cd /d "%~dp0": Changes the current directory to the script's directory, ensuring subsequent commands run in the correct path.
  • if not exist "srt\" mkdir "srt": Checks if an srt folder exists. If not, it creates one.
  • if not exist "models/%model_name%": Checks if the specified model file exists. If not, it displays an error with the model download link and exits.
  • FOR %%F IN (wavs\*.*) DO (...): This is a loop that iterates through every file in the wavs folder. The filename is assigned to the variable %%F, and the code block after DO is executed for each file.
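  • if !errorlevel! neq 0 (...): This block checks the file format and calls FFmpeg. errorlevel is the exit code of the previous command; 0 means success, and non-zero means failure. The small trick on the preceding line is to echo the string .wav and pipe it into findstr, which searches it for the current file's extension. If findstr finds no match (a non-zero exit code), the file is not a .wav, so the script calls ffmpeg to convert it. The !errorlevel! form is used instead of %errorlevel% because setlocal enabledelayedexpansion lets variables inside the loop body be expanded at run time, once per file, rather than once when the whole loop is parsed.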
  • ffmpeg -y -i "%%F" -ac 1 -ar 16000 ...: This is the FFmpeg conversion command. The parameters mean:
    • -y: Overwrite the output file if it already exists.
    • -i "%%F": Specify the input file.
    • -ac 1: Set audio channels to 1 (mono).
    • -ar 16000: Set the audio sample rate to 16000 Hz.
    • -c:a pcm_s16le: Set the audio codec to 16-bit PCM. Together, these settings (16 kHz, mono, 16-bit PCM) produce the WAV format that whisper.cpp expects as input.
  • whisper-cli.exe ...: Calls the core transcription program, similar to when we ran it manually.
  • del !temp_wav!: After transcription, deletes the temporary .wav file created by FFmpeg to keep the directory clean.
  • pause: Pauses the script at the end, waiting for the user to press any key before closing the window. This allows us to see the final output.
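The script uses !variable! rather than %variable% throughout the loop. The difference is easiest to see in a tiny standalone demo. The sketch below is not part of the transcription script; the variable name count and the file name demo.bat are purely illustrative.

```bat
@echo off
setlocal enabledelayedexpansion

rem %count% is expanded once, when the whole FOR block is parsed.
rem !count! is expanded each time the echo line actually runs.
set "count=0"
for %%N in (a b c) do (
    set /a count+=1
    echo percent: %count%  delayed: !count!
)
```

Saved as demo.bat and run, this prints "percent: 0" on all three lines while the "delayed:" value counts 1, 2, 3. This is why the batch script reads !errorlevel!, !sttwav! and !temp_wav! inside its loop.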

That's it! You have now successfully set up a local, efficient, free, and zero-dependency speech-to-text workflow. With the powerful performance of whisper.cpp and the provided batch script, tasks like transcribing meeting notes, creating video subtitles, or converting study materials will become incredibly simple.