
Have you ever faced this frustration?

Many speech-to-text tools work well with English but perform poorly with Eastern languages like Chinese dialects (Cantonese, Sichuanese, etc.), Vietnamese, Filipino, and others.

Here's some good news.

The Dataocean AI team has developed and open-sourced Dolphin, a speech transcription model optimized specifically for Eastern languages and designed to recognize them more accurately than general-purpose tools.

To make this powerful tool accessible even for non-technical users, I've created a user-friendly interface and a one-click integration package.



Key Features: Simple and Efficient

  • Focus on Eastern Languages: Specially optimized to support various Eastern languages and dialects.
  • Easy to Use: Just upload your audio/video, select the language, and click a button.
  • Flexible Output: Defaults to generating SRT subtitle files, with support for TXT text or JSON format.

How to Use? (Graphical Interface Version)

Follow these simple steps to get started:

    1. Launch the Tool
       • After you run the program, it automatically opens a web interface in your browser, usually at http://127.0.0.1:5080. If it doesn't open automatically, enter this address manually.
    2. Upload Audio or Video File
       • Click the "Select File" button on the interface and locate the audio or video file you want to transcribe.
       • Supports multiple formats: mp3, mp4, mpeg, mpga, m4a, wav, webm, aac, flac, mov, mkv, avi, etc.
    3. Select Language
       • In the "Language Selection" dropdown, find the language of your file (e.g., Mandarin Chinese, Sichuanese, Cantonese, etc.).
       • Not sure what language it is? No problem, select "Auto Detect" to let the tool figure it out.
    4. Select Output Format
       • By default, it generates an SRT subtitle file.
       • You can also choose to output TXT (plain text) or JSON (structured data) as needed.
    5. Start Transcription
       • Click the "Start Transcription" button.
       • The tool then runs the following steps in the background:
         • Convert your file to WAV audio format suitable for processing.
         • Split the audio into small segments to improve processing speed and accuracy.
         • Use the Dolphin model to recognize speech in each segment.
         • Finally, organize the recognition results into your chosen format (e.g., SRT).
    6. Get Results
       • Once transcription is complete, the results are displayed directly on the interface.
       • You can copy the text or click the download button to save the result as a file for use in video editing or other applications.
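
The last background step, organizing recognized segments into SRT, can be sketched as follows. This is only an illustration and not the tool's actual code: the `Segment` class and the `srt_timestamp` and `to_srt` functions are hypothetical names, and the sketch assumes each recognized segment carries a start time and an end time in seconds.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical container for one recognized audio segment."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # recognized text


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Join segments into SRT text: index, time range, text, blank line."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{time_range}\n{seg.text.strip()}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    demo = [
        Segment(0.0, 2.5, "Hello, this is a test audio."),
        Segment(2.5, 5.0, "Hope the transcription result is accurate."),
    ]
    print(to_srt(demo))
```

Each SRT entry is a running index, a time range, and the text, with a blank line between entries, which is why the segments only need start, end, and text.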

For Developers: API Usage

If you're a developer and want to integrate this functionality into your own program, the integration package also provides an HTTP API.

  • Endpoint: /v1/audio/transcriptions
  • Method: POST
  • Content-Type: multipart/form-data (Note: Not application/json, as files need to be uploaded)
  • Request Parameters:
    • file: (Required) The audio/video file itself.
    • language: (Optional) Target language code (see table below). Leave empty for auto-detection.
    • response_format: (Optional) Return format, supports "srt", "json", "txt". Defaults to "srt".
  • Response:
    • Success: Returns the transcribed text in the specified format (SRT, JSON, or TXT).
    • Failure: Returns a JSON object containing error information.
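
If you would rather not add a dependency, the request described above can be sketched with Python's standard library alone. The helper names `encode_multipart` and `transcribe` are hypothetical, and the sketch assumes the service listens on the default address 127.0.0.1:5080 and accepts a generic `application/octet-stream` content type for the uploaded file; adjust both if your setup differs.

```python
import urllib.request
import uuid
from pathlib import Path

# Assumption: the service runs locally on its default port.
API_URL = "http://127.0.0.1:5080/v1/audio/transcriptions"
ALLOWED_FORMATS = {"srt", "json", "txt"}


def encode_multipart(fields, filename, file_bytes):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode("utf-8"),
        ]
    lines += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode("utf-8"),
        b"Content-Type: application/octet-stream",
        b"",
        file_bytes,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


def transcribe(path, language=None, response_format="srt"):
    """POST a file to the transcription endpoint and return the response text."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"response_format must be one of {sorted(ALLOWED_FORMATS)}")
    fields = {"response_format": response_format}
    if language:  # omit the field entirely to request auto-detection
        fields["language"] = language
    file_path = Path(path)
    body, content_type = encode_multipart(fields, file_path.name, file_path.read_bytes())
    request = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the service running, `transcribe("your_audio.mp3", language="zh-CN")` would return the SRT text as a string.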

Supported Language Codes

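Language Code  Language Name
zh-CN          Chinese (Mandarin)
zh-TW          Chinese (Taiwan)
zh-WU          Chinese (Wu)
zh-SICHUAN     Chinese (Sichuan dialect)
zh-SHANXI      Chinese (Shanxi dialect)
zh-ANHUI       Chinese (Anhui dialect)
zh-TIANJIN     Chinese (Tianjin dialect)
zh-NINGXIA     Chinese (Ningxia dialect)
zh-SHAANXI     Chinese (Shaanxi dialect)
zh-HEBEI       Chinese (Hebei dialect)
zh-SHANDONG    Chinese (Shandong dialect)
zh-GUANGDONG   Chinese (Guangdong dialect)
zh-SHANGHAI    Chinese (Shanghai dialect)
zh-HUBEI       Chinese (Hubei dialect)
zh-LIAONING    Chinese (Liaoning dialect)
zh-GANSU       Chinese (Gansu dialect)
zh-FUJIAN      Chinese (Fujian dialect)
zh-HUNAN       Chinese (Hunan dialect)
zh-HENAN       Chinese (Henan dialect)
zh-YUNNAN      Chinese (Yunnan dialect)
zh-MINNAN      Chinese (Minnan)
zh-WENZHOU     Chinese (Wenzhou dialect)
ja-JP          Japanese
th-TH          Thai
ru-RU          Russian
ko-KR          Korean
id-ID          Indonesian
vi-VN          Vietnamese
ct-NULL        Cantonese (unknown)
ct-HK          Cantonese (Hong Kong)
ct-GZ          Cantonese (Guangdong)
hi-IN          Hindi
ur-IN          Urdu (India)
ur-PK          Urdu
ms-MY          Malay
uz-UZ          Uzbek
ar-MA          Arabic (Morocco)
ar-GLA         Arabic
ar-SA          Arabic (Saudi Arabia)
ar-EG          Arabic (Egypt)
ar-KW          Arabic (Kuwait)
ar-LY          Arabic (Libya)
ar-JO          Arabic (Jordan)
ar-AE          Arabic (UAE)
ar-LVT         Arabic (Levant)
fa-IR          Persian
bn-BD          Bengali
ta-SG          Tamil (Singapore)
ta-LK          Tamil (Sri Lanka)
ta-IN          Tamil (India)
ta-MY          Tamil (Malaysia)
te-IN          Telugu
ug-NULL        Uyghur
ug-CN          Uyghur
gu-IN          Gujarati
my-MM          Burmese
tl-PH          Tagalog
kk-KZ          Kazakh
or-IN          Odia
ne-NP          Nepali
mn-MN          Mongolian
km-KH          Khmer
jv-ID          Javanese
lo-LA          Lao
si-LK          Sinhala
fil-PH         Filipino
ps-AF          Pashto
pa-IN          Punjabi
kab-NULL       Kabyle
ba-NULL        Bashkir
ks-IN          Kashmiri
tg-TJ          Tajik
su-ID          Sundanese
mr-IN          Marathi
ky-KG          Kyrgyz
az-AZ          Azerbaijani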

API Call Example (Using curl)

curl -X POST http://127.0.0.1:5080/v1/audio/transcriptions \
  -F "file=@/your/path/your_audio.mp3" \
  -F "language=zh-CN" \
  -F "response_format=srt"

API Call Example (Using Python openai Library)
(This library can conveniently call interfaces compatible with the OpenAI API format)

from openai import OpenAI

# Configure the client to point to the local service address
client = OpenAI(base_url='http://127.0.0.1:5080/v1', api_key='any string will do') # api_key is not critical in this scenario

audio_file_path = "your_audio.wav" # Replace with your file path

with open(audio_file_path, 'rb') as file_handle:
    # Initiate the transcription request
    transcript = client.audio.transcriptions.create(
        file=(audio_file_path, file_handle), # Pass filename and file content
        model='base', # Model name, fixed as 'base' or adjust based on actual situation
        language='zh-CN', # Specify language
        response_format="srt" # Specify return format
    )
    # Print the transcription result (SRT format text)
    print(transcript)

Response Example (SRT Format)

1
00:00:00,000 --> 00:00:02,500
Hello, this is a test audio.

2
00:00:02,500 --> 00:00:05,000
Hope the transcription result is accurate.
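
To use an SRT response in your own program, you can split it back into timed entries. The parser below is a minimal sketch with hypothetical function names (`to_seconds`, `parse_srt`); it assumes well-formed SRT like the example above and skips any block with fewer than three lines.

```python
import re

# Matches SRT timestamps such as 00:00:02,500
TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp (HH:MM:SS,mmm) to seconds."""
    match = TIME_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, secs, ms = map(int, match.groups())
    return hours * 3600 + minutes * 60 + secs + ms / 1000


def parse_srt(text: str):
    """Parse SRT text into a list of dicts with index, start, end, and text."""
    cues = []
    # Entries are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # incomplete entry, skip it
        start, end = lines[1].split(" --> ")
        cues.append({
            "index": int(lines[0]),
            "start": to_seconds(start),
            "end": to_seconds(end),
            "text": "\n".join(lines[2:]),
        })
    return cues
```

Passing the response above to `parse_srt` yields two entries, the first running from 0.0 to 2.5 seconds.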

Want It Faster? Enable GPU Acceleration (Optional)

  • Why Use GPU? If you have a suitable NVIDIA graphics card and the environment configured, using a GPU can significantly speed up transcription, especially for long audio files.
  • How to Enable?
      1. Prerequisite: Ensure your computer has the correct NVIDIA graphics drivers and CUDA 12.x environment installed.
      2. Install Support: In the integration package folder, find and double-click the Install GPU Support.bat file; it will automatically complete the relevant setup.
  • Note: The default integration package does not include GPU support to keep the file size small.
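
Before running the installer, you may want to confirm that an NVIDIA driver is actually present. The sketch below is one rough way to do that, assuming the driver's `nvidia-smi` utility is on your PATH; the function name is hypothetical. It only detects a working driver and does not verify the CUDA 12.x requirement.

```python
import shutil
import subprocess


def nvidia_driver_available() -> bool:
    """Return True if nvidia-smi runs and lists at least one GPU."""
    executable = shutil.which("nvidia-smi")
    if executable is None:
        return False  # utility not found, so no usable driver on PATH
    try:
        # "nvidia-smi -L" prints one line per detected GPU.
        result = subprocess.run(
            [executable, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    if nvidia_driver_available():
        print("NVIDIA GPU detected.")
    else:
        print("No NVIDIA GPU detected; transcription will run on the CPU.")
```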

A Few Tips

    1. File Size and Duration: It's recommended that a single file not be too large (e.g., not exceeding 1GB), and the duration is best kept within 1 hour. Very large files may process very slowly.
    2. Audio Quality: The clearer the audio and the less background noise, the better the transcription results. Try to use high-quality audio sources.
    3. Internet Required for First Use: The first time you transcribe a particular language, the program needs to download some required data for that language. It's recommended to successfully transcribe all commonly used languages at least once (even with a very short test audio), after which it can be used offline.
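
The size and duration recommendations can be checked before uploading. The following sketch uses only the standard library, so it can read the duration of WAV files only; for other formats it checks size alone. The function name `check_file` and the limit constants are hypothetical and simply mirror the 1GB and 1 hour figures above.

```python
import os
import wave

MAX_BYTES = 1024 ** 3    # 1GB, the suggested size limit
MAX_SECONDS = 60 * 60    # 1 hour, the suggested duration limit


def check_file(path, max_bytes=MAX_BYTES, max_seconds=MAX_SECONDS):
    """Return a list of warnings for a file that exceeds the suggested limits."""
    path = os.fspath(path)
    warnings = []
    size = os.path.getsize(path)
    if size > max_bytes:
        warnings.append(f"file is {size} bytes, above the {max_bytes}-byte limit")
    if path.lower().endswith(".wav"):
        try:
            with wave.open(path, "rb") as wav:
                duration = wav.getnframes() / wav.getframerate()
        except wave.Error:
            duration = None  # not a WAV variant the wave module can read
        if duration is not None and duration > max_seconds:
            warnings.append(
                f"audio lasts {duration:.0f} s, above the {max_seconds} s limit"
            )
    return warnings
```

An empty list means the file is within both limits; a long recording can be split into shorter pieces before transcription.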