Have you ever faced this problem?

Many speech-to-text tools work great for English, but their performance is often disappointing for Eastern languages like Chinese dialects (Cantonese, Sichuanese, etc.), Vietnamese, or Filipino.

Here's some good news!

The Dataocean AI team has developed and open-sourced the Dolphin project, a speech transcription model specifically optimized for Eastern languages, enabling more accurate recognition.

To make this powerful tool accessible even to non-technical users, I've created a user-friendly interface and an all-in-one package.



Key Features: Simple and Efficient

  • Focused on Eastern Languages: Specially optimized to support various Eastern languages and dialects.
  • Easy to Use: Just upload your audio/video, select the language, and click a button.
  • Flexible Output: Generates SRT subtitle files by default, with support for TXT or JSON formats as well.

How to Use (GUI Version)

Follow these steps to get started easily:

    1. Launch the Tool
        • After running the program, it will automatically open a web interface in your browser, usually at http://127.0.0.1:5080. If it doesn't open automatically, just enter this address in your browser manually.
    2. Upload an Audio or Video File
        • Click the "Select File" button on the interface and pick the audio or video file you want to transcribe.
        • Various formats are supported: mp3, mp4, mpeg, mpga, m4a, wav, webm, aac, flac, mov, mkv, avi, etc.
    3. Select the Language
        • In the "Language" dropdown menu, pick the language spoken in your file (e.g., Mandarin Chinese, Sichuanese, Cantonese, etc.).
        • Not sure which language it is? No problem. Select "Auto Detect" and let the tool figure it out.
    4. Select the Output Format
        • It will generate an SRT subtitle file by default.
        • You can also choose to output TXT (plain text) or JSON (structured data), depending on your needs.
    5. Start Transcription
        • Click the "Start Transcription" button.
        • The tool automatically performs a series of steps in the background (roughly sketched in the code after this list):
            • Converts your file into the WAV audio format, which is suitable for processing.
            • Splits the audio into smaller chunks to improve processing speed and accuracy.
            • Uses the Dolphin model to recognize the speech in each chunk.
            • Finally, assembles the recognition results into your chosen format (e.g., SRT).
    6. Get the Results
        • Once the transcription is complete, the results are displayed directly on the interface.
        • You can copy the text or click the download button to save the results as a file, which is convenient for video editing and other uses.
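
For the curious, the background steps above look roughly like the sketch below. This is only an illustration of the idea, not the package's actual code; it assumes ffmpeg is available on your PATH, and the sample rate and chunk length shown are my assumptions, not the tool's real settings.

import subprocess

def prepare_audio(src_path: str, chunk_seconds: int = 30) -> None:
    # Step 1: convert the input (audio or video) to 16 kHz mono WAV with ffmpeg.
    # (16 kHz mono is a common input format for speech models; the package's
    # actual settings may differ.)
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path, "-ar", "16000", "-ac", "1", "input.wav"],
        check=True,
    )
    # Step 2: split the WAV into fixed-length chunks (chunk_000.wav, chunk_001.wav, ...)
    # that are then transcribed one by one and reassembled into SRT/TXT/JSON.
    subprocess.run(
        ["ffmpeg", "-y", "-i", "input.wav", "-f", "segment",
         "-segment_time", str(chunk_seconds), "-c", "copy", "chunk_%03d.wav"],
        check=True,
    )

prepare_audio("/path/to/your/video.mp4")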

For Developers: How to Use the API

If you're a developer looking to integrate this functionality into your own application, the all-in-one package also provides an API.

  • Endpoint: /v1/audio/transcriptions
  • Method: POST
  • Content-Type: multipart/form-data (Note: This is not application/json because you are uploading a file).
  • Parameters:
    • file: (required) The audio/video file itself.
    • language: (optional) The target language code (see table below). Leave blank for auto-detection.
    • response_format: (optional) The response format. Supports "srt", "json", "txt". Defaults to "srt".
  • Response:
    • Success: Returns the transcribed text in the specified format (SRT, JSON, or TXT).
    • Failure: Returns a JSON object containing error information.
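
Examples with curl and the official openai Python library follow below; if you just want a quick test with plain Python, a minimal sketch using the requests library (my own illustration, not part of the package) looks like this:

import requests

# Upload the file as multipart/form-data (not JSON) to the local transcription API.
with open("/path/to/your/audio.mp3", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:5080/v1/audio/transcriptions",
        files={"file": f},                                      # required: the audio/video file
        data={"language": "zh-CN", "response_format": "srt"},   # optional fields
        timeout=600,                                            # long files can take a while
    )

resp.raise_for_status()
print(resp.text)  # SRT text (or JSON/TXT, depending on response_format)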

Supported Language Codes

Language Code   Language Name
zh-CN           Chinese (Mandarin)
zh-TW           Chinese (Taiwan)
zh-WU           Chinese (Wu)
zh-SICHUAN      Chinese (Sichuanese)
zh-SHANXI       Chinese (Shanxi)
zh-ANHUI        Chinese (Anhui)
zh-TIANJIN      Chinese (Tianjin)
zh-NINGXIA      Chinese (Ningxia)
zh-SHAANXI      Chinese (Shaanxi)
zh-HEBEI        Chinese (Hebei)
zh-SHANDONG     Chinese (Shandong)
zh-GUANGDONG    Chinese (Guangdong)
zh-SHANGHAI     Chinese (Shanghainese)
zh-HUBEI        Chinese (Hubei)
zh-LIAONING     Chinese (Liaoning)
zh-GANSU        Chinese (Gansu)
zh-FUJIAN       Chinese (Fujian)
zh-HUNAN        Chinese (Hunan)
zh-HENAN        Chinese (Henan)
zh-YUNNAN       Chinese (Yunnan)
zh-MINNAN       Chinese (Minnan)
zh-WENZHOU      Chinese (Wenzhou)
ja-JP           Japanese
th-TH           Thai
ru-RU           Russian
ko-KR           Korean
id-ID           Indonesian
vi-VN           Vietnamese
ct-NULL         Cantonese (Unknown)
ct-HK           Cantonese (Hong Kong)
ct-GZ           Cantonese (Guangdong)
hi-IN           Hindi
ur-IN           Urdu (India)
ur-PK           Urdu
ms-MY           Malay
uz-UZ           Uzbek
ar-MA           Arabic (Morocco)
ar-GLA          Arabic
ar-SA           Arabic (Saudi Arabia)
ar-EG           Arabic (Egypt)
ar-KW           Arabic (Kuwait)
ar-LY           Arabic (Libya)
ar-JO           Arabic (Jordan)
ar-AE           Arabic (UAE)
ar-LVT          Arabic (Levantine)
fa-IR           Persian
bn-BD           Bengali
ta-SG           Tamil (Singapore)
ta-LK           Tamil (Sri Lanka)
ta-IN           Tamil (India)
ta-MY           Tamil (Malaysia)
te-IN           Telugu
ug-NULL         Uyghur
ug-CN           Uyghur
gu-IN           Gujarati
my-MM           Burmese
tl-PH           Tagalog
kk-KZ           Kazakh
or-IN           Odia
ne-NP           Nepali
mn-MN           Mongolian
km-KH           Khmer
jv-ID           Javanese
lo-LA           Lao
si-LK           Sinhala
fil-PH          Filipino
ps-AF           Pashto
pa-IN           Punjabi
kab-NULL        Kabyle
ba-NULL         Bashkir
ks-IN           Kashmiri
tg-TJ           Tajik
su-ID           Sundanese
mr-IN           Marathi
ky-KG           Kyrgyz
az-AZ           Azerbaijani

API Call Example (using curl)

curl -X POST http://127.0.0.1:5080/v1/audio/transcriptions \
  -F "file=@/path/to/your/audio.mp3" \
  -F "language=zh-CN" \
  -F "response_format=srt"

API Call Example (using the Python openai library, which can conveniently call any API compatible with the OpenAI API format)

from openai import OpenAI

# Point the client at the local service; the API key is not checked here, so any string works
client = OpenAI(base_url='http://127.0.0.1:5080/v1', api_key='any_string_will_do')

audio_file_path = "your_audio.wav" # Replace with your file path

with open(audio_file_path, 'rb') as file_handle:
    # Make the transcription request
    transcript = client.audio.transcriptions.create(
        file=(audio_file_path, file_handle), # Pass the filename and file content
        model='base', # Model name, fixed as 'base' here or adjust as needed
        language='zh-CN', # Specify the language
        response_format="srt" # Specify the response format
    )
    # Print the transcription result (SRT format text)
    print(transcript)

Example Response (SRT Format)

1
00:00:00,000 --> 00:00:02,500
Hello, this is a test audio.

2
00:00:02,500 --> 00:00:05,000
I hope the transcription result is accurate.
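
If you want to post-process the SRT result programmatically (for example, to load the segments into your own editing workflow), a small parser like the sketch below is enough for well-formed SRT like the example above. It is my own helper, not something shipped with the package.

import re

def parse_srt(srt_text: str) -> list[dict]:
    # Each SRT block is: index line, "start --> end" line, then one or more text lines,
    # with blocks separated by blank lines.
    segments = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        start, end = (t.strip() for t in lines[1].split("-->"))
        segments.append({"start": start, "end": end, "text": " ".join(lines[2:])})
    return segments

example = "1\n00:00:00,000 --> 00:00:02,500\nHello, this is a test audio."
print(parse_srt(example))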

Want it Faster? Enable GPU Acceleration (Optional)

  • Why use a GPU? If you have a compatible NVIDIA graphics card and the environment is properly configured, using a GPU can significantly increase transcription speed, especially for long audio files.
  • How to Enable:
      1. Prerequisites: Ensure your computer has the correct NVIDIA graphics card drivers and a CUDA 12.x environment installed.
      2. Install Support: In the all-in-one package folder, find and double-click the Install GPU Support.bat (安装GPU支持.bat) file. It will automatically complete the necessary setup.
  • Note: The default all-in-one package does not include GPU support to keep the file size smaller.
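
As far as I can tell the model runs on PyTorch (treat that as an assumption), so after running the GPU setup you can quickly check whether the graphics card is actually visible:

import torch

# Should print True and your card's name if GPU support was installed correctly.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))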

A Few Tips

    1. File Size and Duration: It's recommended to keep individual files reasonably small (e.g., under 1 GB) and no longer than about 1 hour. Very large files can take a long time to process.
    2. Audio Quality: The clearer the audio and the less background noise, the better the transcription results. Try to use high-quality audio sources.
    3. Internet Connection for First Use: The first time you transcribe a given language, the program needs an internet connection to download some required data for that language. It's a good idea to transcribe each of your commonly used languages once (even with a very short test clip); after that, you can use the tool offline. A warm-up sketch follows below.
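
A simple way to handle tip 3 is to run a short warm-up while you are still online. The sketch below is a hypothetical helper that loops over the language codes you care about and transcribes a tiny test clip through the local API; the language list and file name are placeholders you would replace.

import requests

LANGUAGES = ["zh-CN", "ct-HK", "ja-JP"]   # replace with the codes you actually use
TEST_CLIP = "short_test.wav"              # any very short audio file

for lang in LANGUAGES:
    # One successful transcription per language downloads that language's data,
    # after which the tool can be used offline.
    with open(TEST_CLIP, "rb") as f:
        resp = requests.post(
            "http://127.0.0.1:5080/v1/audio/transcriptions",
            files={"file": f},
            data={"language": lang, "response_format": "txt"},
        )
    print(lang, "->", "ok" if resp.ok else f"failed ({resp.status_code})")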