Have you ever faced this frustration?
Many speech-to-text tools work well with English but perform poorly with Eastern languages like Chinese dialects (Cantonese, Sichuanese, etc.), Vietnamese, Filipino, and others.
Here's some good news.
The Dataocean AI team has developed and open-sourced the Dolphin project, a speech transcription model optimized specifically for Eastern languages, providing more accurate recognition.
To make this powerful tool accessible even for non-technical users, I've created a user-friendly interface and a one-click integration package.
Download Links
- Baidu Netdisk: https://pan.baidu.com/s/1ODhqN-GiaHoGdU-ml3kCUQ?pwd=i2ui
- GitHub repository: https://github.com/jianchang512/speech2text-df
Key Features: Simple and Efficient
- Focus on Eastern Languages: Specially optimized to support various Eastern languages and dialects.
- Easy to Use: Just upload your audio/video, select the language, and click a button.
- Flexible Output: Defaults to generating SRT subtitle files, with support for TXT text or JSON format.
How to Use? (Graphical Interface Version)
Follow these simple steps to get started:
- Launch the Tool
  - After running the program, it will automatically open a web interface in your browser, usually at http://127.0.0.1:5080. If it doesn't open automatically, just enter this address manually.
- Upload Audio or Video File
  - Click the "Select File" button on the interface and locate the audio or video file you want to transcribe.
  - Supports multiple formats: mp3, mp4, mpeg, mpga, m4a, wav, webm, aac, flac, mov, mkv, avi, etc.
- Select Language
  - In the "Language Selection" dropdown, find the language of your file (e.g., Mandarin Chinese, Sichuanese, Cantonese, etc.).
  - Not sure what language it is? No problem, select "Auto Detect" to let the tool figure it out.
- Select Output Format
  - By default, it generates an SRT subtitle file.
  - You can also choose to output TXT (plain text) or JSON (structured data) as needed.
- Start Transcription
  - Click the "Start Transcription" button.
  - The tool will automatically perform a series of processes in the background:
    - Convert your file to WAV audio format suitable for processing.
    - Split the audio into small segments to improve processing speed and accuracy.
    - Use the Dolphin model to recognize speech in each segment.
    - Finally, organize the recognition results into your chosen format (e.g., SRT).
- Get Results
  - Once transcription is complete, the results will be displayed directly on the interface.
  - You can directly copy the text or click the download button to save the result as a file for use in video editing or other applications.
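To make the background steps a little more concrete, here is a minimal Python sketch of two of them: cutting a recording into short windows and assembling per-segment results into SRT. It is not the tool's actual code. The function names, the fixed 30-second window, and the (start, end, text) segment shape are all assumptions made for illustration; the real package may split audio differently (for example, at pauses).

```python
from typing import List, Tuple


def plan_chunks(duration: float, chunk_seconds: float = 30.0) -> List[Tuple[float, float]]:
    """Cut [0, duration) into consecutive fixed-length windows.

    Fixed-length windows are an assumption; the tool may split on silence instead.
    """
    if duration <= 0 or chunk_seconds <= 0:
        return []
    chunks = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        chunks.append((start, end))
        start = end
    return chunks


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm, the form used in SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Tuple[float, float, str]]) -> str:
    """Join (start, end, text) segments into numbered SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}"
        )
    return "\n\n".join(cues) + "\n"


if __name__ == "__main__":
    # Hypothetical recognition results for two segments.
    demo = [
        (0.0, 2.5, "Hello, this is a test audio."),
        (2.5, 5.0, "Hope the transcription result is accurate."),
    ]
    print(to_srt(demo))
```

In a real pipeline, the text for each window would come from the Dolphin model; here it is hard-coded so the formatting step can be tried on its own.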
For Developers: API Usage
If you're a developer and want to integrate this functionality into your own program, the integration package also provides an API.
- Endpoint: /v1/audio/transcriptions
- Method: POST
- Content-Type: multipart/form-data (Note: Not application/json, as files need to be uploaded)
- Request Parameters:
  - file: (Required) The audio/video file itself.
  - language: (Optional) Target language code (see table below). Leave empty for auto-detection.
  - response_format: (Optional) Return format, supports "srt", "json", "txt". Defaults to "srt".
- Response:
  - Success: Returns the transcribed text in the specified format (SRT, JSON, or TXT).
  - Failure: Returns a JSON object containing error information.
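As a rough illustration of the request described above, the following Python sketch builds the form fields and posts a file with the third-party requests library. The helper names are hypothetical. Two details are assumptions rather than documented behavior: that omitting the language field is equivalent to leaving it empty, and that a failed request comes back with a non-2xx HTTP status.

```python
from typing import Dict, Optional

ALLOWED_FORMATS = ("srt", "json", "txt")


def build_form_fields(language: Optional[str] = None,
                      response_format: str = "srt") -> Dict[str, str]:
    """Build the non-file form fields for /v1/audio/transcriptions."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"response_format must be one of {ALLOWED_FORMATS}")
    fields = {"response_format": response_format}
    if language:
        # Assumption: leaving the field out triggers auto-detection,
        # just like leaving it empty.
        fields["language"] = language
    return fields


def transcribe(path: str,
               language: Optional[str] = None,
               response_format: str = "srt",
               base_url: str = "http://127.0.0.1:5080") -> str:
    """Upload one file as multipart/form-data and return the response body."""
    import requests  # third-party: pip install requests

    fields = build_form_fields(language, response_format)
    with open(path, "rb") as handle:
        response = requests.post(
            f"{base_url}/v1/audio/transcriptions",
            files={"file": handle},  # requests sets the multipart Content-Type
            data=fields,
            timeout=3600,  # long audio can take a while
        )
    if not response.ok:
        # Assumption: errors arrive with a non-2xx status and a JSON body.
        raise RuntimeError(
            f"Transcription failed ({response.status_code}): {response.text}"
        )
    return response.text
```

With the service running locally, `transcribe("your_audio.mp3", language="zh-CN")` would return the SRT text as a string.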
Supported Language Codes
| Language Code | Language Name |
|---|---|
| zh-CN | Chinese (Mandarin) |
| zh-TW | Chinese (Taiwan) |
| zh-WU | Chinese (Wu) |
| zh-SICHUAN | Chinese (Sichuan dialect) |
| zh-SHANXI | Chinese (Shanxi dialect) |
| zh-ANHUI | Chinese (Anhui dialect) |
| zh-TIANJIN | Chinese (Tianjin dialect) |
| zh-NINGXIA | Chinese (Ningxia dialect) |
| zh-SHAANXI | Chinese (Shaanxi dialect) |
| zh-HEBEI | Chinese (Hebei dialect) |
| zh-SHANDONG | Chinese (Shandong dialect) |
| zh-GUANGDONG | Chinese (Guangdong dialect) |
| zh-SHANGHAI | Chinese (Shanghai dialect) |
| zh-HUBEI | Chinese (Hubei dialect) |
| zh-LIAONING | Chinese (Liaoning dialect) |
| zh-GANSU | Chinese (Gansu dialect) |
| zh-FUJIAN | Chinese (Fujian dialect) |
| zh-HUNAN | Chinese (Hunan dialect) |
| zh-HENAN | Chinese (Henan dialect) |
| zh-YUNNAN | Chinese (Yunnan dialect) |
| zh-MINNAN | Chinese (Minnan / Southern Min) |
| zh-WENZHOU | Chinese (Wenzhou dialect) |
| ja-JP | Japanese |
| th-TH | Thai |
| ru-RU | Russian |
| ko-KR | Korean |
| id-ID | Indonesian |
| vi-VN | Vietnamese |
| ct-NULL | Cantonese (region unknown) |
| ct-HK | Cantonese (Hong Kong) |
| ct-GZ | Cantonese (Guangdong) |
| hi-IN | Hindi |
| ur-IN | Urdu (India) |
| ur-PK | Urdu |
| ms-MY | Malay |
| uz-UZ | Uzbek |
| ar-MA | Arabic (Morocco) |
| ar-GLA | Arabic |
| ar-SA | Arabic (Saudi Arabia) |
| ar-EG | Arabic (Egypt) |
| ar-KW | Arabic (Kuwait) |
| ar-LY | Arabic (Libya) |
| ar-JO | Arabic (Jordan) |
| ar-AE | Arabic (UAE) |
| ar-LVT | Arabic (Levantine) |
| fa-IR | Persian |
| bn-BD | Bengali |
| ta-SG | Tamil (Singapore) |
| ta-LK | Tamil (Sri Lanka) |
| ta-IN | Tamil (India) |
| ta-MY | Tamil (Malaysia) |
| te-IN | Telugu |
| ug-NULL | Uyghur |
| ug-CN | Uyghur |
| gu-IN | Gujarati |
| my-MM | Burmese |
| tl-PH | Tagalog |
| kk-KZ | Kazakh |
| or-IN | Odia |
| ne-NP | Nepali |
| mn-MN | Mongolian |
| km-KH | Khmer |
| jv-ID | Javanese |
| lo-LA | Lao |
| si-LK | Sinhala |
| fil-PH | Filipino |
| ps-AF | Pashto |
| pa-IN | Punjabi |
| kab-NULL | Kabyle |
| ba-NULL | Bashkir |
| ks-IN | Kashmiri |
| tg-TJ | Tajik |
| su-ID | Sundanese |
| mr-IN | Marathi |
| ky-KG | Kyrgyz |
| az-AZ | Azerbaijani |
API Call Example (Using curl)

    curl -X POST http://127.0.0.1:5080/v1/audio/transcriptions \
      -F "file=@/your/path/your_audio.mp3" \
      -F "language=zh-CN" \
      -F "response_format=srt"

API Call Example (Using Python openai Library)
(This library can conveniently call interfaces compatible with the OpenAI API format)

    from openai import OpenAI

    # Configure the client to point to the local service address.
    # The api_key value is not critical in this scenario.
    client = OpenAI(base_url="http://127.0.0.1:5080/v1", api_key="any string will do")

    audio_file_path = "your_audio.wav"  # Replace with your file path

    with open(audio_file_path, "rb") as file_handle:
        # Initiate the transcription request
        transcript = client.audio.transcriptions.create(
            file=(audio_file_path, file_handle),  # Pass filename and file content
            model="base",  # Model name, fixed as 'base' or adjust based on actual situation
            language="zh-CN",  # Specify language
            response_format="srt",  # Specify return format
        )

    # Print the transcription result (SRT format text)
    print(transcript)

Response Example (SRT Format)

    1
    00:00:00,000 --> 00:00:02,500
    Hello, this is a test audio.

    2
    00:00:02,500 --> 00:00:05,000
    Hope the transcription result is accurate.

Want It Faster? Enable GPU Acceleration (Optional)
- Why Use GPU? If you have a suitable NVIDIA graphics card and the environment configured, using a GPU can significantly speed up transcription, especially for long audio files.
- How to Enable?
  - Prerequisite: Ensure your computer has the correct NVIDIA graphics drivers and CUDA 12.x environment installed.
  - Install Support: In the integration package folder, find and double-click the Install GPU Support.bat file; it will automatically complete the relevant setup.
- Note: The default integration package does not include GPU support to keep the file size small.
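Before running the GPU installer, it can help to check what the NVIDIA driver reports. The sketch below is one possible check, not part of the integration package: it runs nvidia-smi (if present) and reads the "CUDA Version" shown in its header. The function names are hypothetical. Note that nvidia-smi reports the highest CUDA version the driver supports, which is not the same as having a CUDA toolkit installed, so treat the result as a hint only.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_cuda_version(smi_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'CUDA Version: X.Y' text in nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_cuda_version() -> Optional[Tuple[int, int]]:
    """Run nvidia-smi and return the CUDA version it reports, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on PATH
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=15
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return parse_cuda_version(result.stdout)


if __name__ == "__main__":
    version = driver_cuda_version()
    if version is None:
        print("No usable NVIDIA driver found; stay with the CPU version.")
    elif version[0] >= 12:
        print(f"Driver supports CUDA {version[0]}.{version[1]}; GPU support may work.")
    else:
        print(f"Driver only supports CUDA {version[0]}.{version[1]}; update the driver first.")
```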
A Few Tips
- File Size and Duration: Keep each file to roughly 1GB or less and about 1 hour or shorter. Very large files may process very slowly.
- Audio Quality: The clearer the audio and the less background noise, the better the transcription results. Try to use high-quality audio sources.
- Internet Required for First Use: The first time you transcribe a particular language, the program needs to download some required data for that language. While you are online, transcribe each language you use regularly at least once (a very short test clip is enough); after that, those languages can be used offline.
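The size and format tips can be turned into a small pre-upload check. The following Python sketch is only an illustration: the function names are hypothetical, the 1GB threshold is taken from the recommendation above rather than from any limit enforced by the tool, and the format list is the one given earlier in this article (which ends with "etc.", so an unlisted format is flagged as a warning, not an error). Duration is not checked here.

```python
import os
from typing import List

# Formats listed in this article; other formats may also work.
LISTED_EXTENSIONS = {
    "mp3", "mp4", "mpeg", "mpga", "m4a", "wav",
    "webm", "aac", "flac", "mov", "mkv", "avi",
}
RECOMMENDED_MAX_BYTES = 1024 ** 3  # about 1GB, a recommendation and not a hard limit


def check_upload(filename: str, size_bytes: int) -> List[str]:
    """Return human-readable warnings for a file about to be transcribed."""
    warnings = []
    extension = os.path.splitext(filename)[1].lower().lstrip(".")
    if extension not in LISTED_EXTENSIONS:
        warnings.append(f"'.{extension}' is not in the listed formats; it may not be supported.")
    if size_bytes > RECOMMENDED_MAX_BYTES:
        warnings.append("File is larger than about 1GB; processing may be very slow.")
    return warnings


def check_upload_path(path: str) -> List[str]:
    """Convenience wrapper that reads the size from disk."""
    return check_upload(os.path.basename(path), os.path.getsize(path))
```

An empty list means the file fits the recommendations; anything else is worth a second look before starting a long transcription.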
