Have you ever faced this problem?
Many speech-to-text tools work great for English, but their performance is often disappointing for Eastern languages like Chinese dialects (Cantonese, Sichuanese, etc.), Vietnamese, or Filipino.
Here's some good news!
The Dataocean AI team has developed and open-sourced the Dolphin project, a speech transcription model specifically optimized for Eastern languages, enabling more accurate recognition.
To make this powerful tool accessible even to non-technical users, I've created a user-friendly interface and an all-in-one package.
Download Links
- • Baidu Netdisk: https://pan.baidu.com/s/1ODhqN-GiaHoGdU-ml3kCUQ?pwd=i2ui
- • GitHub: https://github.com/jianchang512/speech2text-df
Key Features: Simple and Efficient
- • Focused on Eastern Languages: Specially optimized to support various Eastern languages and dialects.
- • Easy to Use: Just upload your audio/video, select the language, and click a button.
- • Flexible Output: Generates SRT subtitle files by default, with support for TXT or JSON formats as well.
How to Use (GUI Version)
Follow these steps to get started easily:
- Launch the Tool
- • After running the program, it will automatically open a web interface in your browser, usually at http://127.0.0.1:5080. If it doesn't open automatically, just enter this address manually.
- Upload an Audio or Video File
- • Click the "Select File" button on the interface to find the audio or video file you want to transcribe.
- • Supports various formats: mp3, mp4, mpeg, mpga, m4a, wav, webm, aac, flac, mov, mkv, avi, etc.
- Select the Language
- • In the "Language" dropdown menu, find the corresponding language of your file (e.g., Mandarin Chinese, Sichuanese, Cantonese, etc.).
- • Not sure which language it is? No problem. Select "Auto Detect" and let the tool figure it out.
- Select Output Format
- • It will generate an SRT subtitle file by default.
- • You can also choose to output as TXT (plain text) or JSON (structured data) based on your needs.
- Start Transcription
- • Click the "Start Transcription" button.
- • The tool will automatically perform a series of processes in the background:
- • Converts your file into the WAV audio format, which is suitable for processing.
- • Splits the audio into smaller chunks to improve processing speed and accuracy.
- • Uses the Dolphin model to recognize the speech in each chunk.
- • Finally, it assembles the recognition results into your chosen format (e.g., SRT).
- Get the Results
- • Once the transcription is complete, the results will be displayed directly on the interface.
- • You can copy the text directly or click the download button to save the results as a file, making it convenient for video editing or other uses.
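The last background step under "Start Transcription", assembling per-chunk results into SRT, can be sketched roughly as follows. This is only an illustration, not the tool's actual code: it assumes each recognized chunk arrives as a (start_seconds, end_seconds, text) tuple, and the function names srt_timestamp and assemble_srt are hypothetical.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def assemble_srt(segments):
    """Join (start, end, text) tuples into SRT text, numbering cues from 1."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)


# Assumed chunk results; real timings would come from the audio splitter.
segments = [
    (0.0, 2.5, "Hello, this is a test audio."),
    (2.5, 5.0, "I hope the transcription result is accurate."),
]
print(assemble_srt(segments))
```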
For Developers: How to Use the API
If you're a developer looking to integrate this functionality into your own application, the all-in-one package also provides an API.
- • Endpoint: /v1/audio/transcriptions
- • Method: POST
- • Content-Type: multipart/form-data (Note: This is not application/json because you are uploading a file).
- • Parameters:
- • file: (required) The audio/video file itself.
- • language: (optional) The target language code (see table below). Leave blank for auto-detection.
- • response_format: (optional) The response format. Supports "srt", "json", "txt". Defaults to "srt".
- • Response:
- • Success: Returns the transcribed text in the specified format (SRT, JSON, or TXT).
- • Failure: Returns a JSON object containing error information.
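To make the multipart/form-data requirement concrete, here is a minimal sketch that builds such a request with only the Python standard library. The helper names build_multipart and transcribe are hypothetical, and the sketch assumes that omitting the language field is treated the same as leaving it blank (auto-detection); it has not been checked against the service itself.

```python
import mimetypes
import os
import urllib.request
import uuid

# Default local address mentioned in this article.
API_URL = "http://127.0.0.1:5080/v1/audio/transcriptions"
ALLOWED_FORMATS = {"srt", "json", "txt"}


def build_multipart(file_name, file_bytes, language=None, response_format="srt"):
    """Return (body, content_type) for a multipart/form-data request."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")

    boundary = uuid.uuid4().hex
    fields = {"response_format": response_format}
    if language:  # assumption: leaving the field out means auto-detection
        fields["language"] = language

    parts = []
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append((header + value + "\r\n").encode("utf-8"))

    # The file part carries a filename and its own Content-Type header.
    base_name = os.path.basename(file_name)
    mime = mimetypes.guess_type(base_name)[0] or "application/octet-stream"
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{base_name}"\r\n'
        f'Content-Type: {mime}\r\n\r\n'
    )
    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, language=None, response_format="srt"):
    """POST a file to the local service and return the response body as text."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart(
            path, handle.read(), language, response_format
        )
    request = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the service running, transcribe("your_audio.wav", language="zh-CN") would send the same three fields described above.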
Supported Language Codes
Language Code | Language Name |
---|---|
zh-CN | Chinese (Mandarin) |
zh-TW | Chinese (Taiwan) |
zh-WU | Chinese (Wu) |
zh-SICHUAN | Chinese (Sichuanese) |
zh-SHANXI | Chinese (Shanxi) |
zh-ANHUI | Chinese (Anhui) |
zh-TIANJIN | Chinese (Tianjin) |
zh-NINGXIA | Chinese (Ningxia) |
zh-SHAANXI | Chinese (Shaanxi) |
zh-HEBEI | Chinese (Hebei) |
zh-SHANDONG | Chinese (Shandong) |
zh-GUANGDONG | Chinese (Guangdong) |
zh-SHANGHAI | Chinese (Shanghainese) |
zh-HUBEI | Chinese (Hubei) |
zh-LIAONING | Chinese (Liaoning) |
zh-GANSU | Chinese (Gansu) |
zh-FUJIAN | Chinese (Fujian) |
zh-HUNAN | Chinese (Hunan) |
zh-HENAN | Chinese (Henan) |
zh-YUNNAN | Chinese (Yunnan) |
zh-MINNAN | Chinese (Minnan) |
zh-WENZHOU | Chinese (Wenzhou) |
ja-JP | Japanese |
th-TH | Thai |
ru-RU | Russian |
ko-KR | Korean |
id-ID | Indonesian |
vi-VN | Vietnamese |
ct-NULL | Cantonese (Unknown) |
ct-HK | Cantonese (Hong Kong) |
ct-GZ | Cantonese (Guangdong) |
hi-IN | Hindi |
ur-IN | Urdu (India) |
ur-PK | Urdu |
ms-MY | Malay |
uz-UZ | Uzbek |
ar-MA | Arabic (Morocco) |
ar-GLA | Arabic |
ar-SA | Arabic (Saudi Arabia) |
ar-EG | Arabic (Egypt) |
ar-KW | Arabic (Kuwait) |
ar-LY | Arabic (Libya) |
ar-JO | Arabic (Jordan) |
ar-AE | Arabic (UAE) |
ar-LVT | Arabic (Levantine) |
fa-IR | Persian |
bn-BD | Bengali |
ta-SG | Tamil (Singapore) |
ta-LK | Tamil (Sri Lanka) |
ta-IN | Tamil (India) |
ta-MY | Tamil (Malaysia) |
te-IN | Telugu |
ug-NULL | Uyghur |
ug-CN | Uyghur |
gu-IN | Gujarati |
my-MM | Burmese |
tl-PH | Tagalog |
kk-KZ | Kazakh |
or-IN | Odia |
ne-NP | Nepali |
mn-MN | Mongolian |
km-KH | Khmer |
jv-ID | Javanese |
lo-LA | Lao |
si-LK | Sinhala |
fil-PH | Filipino |
ps-AF | Pashto |
pa-IN | Punjabi |
kab-NULL | Kabyle |
ba-NULL | Bashkir |
ks-IN | Kashmiri |
tg-TJ | Tajik |
su-ID | Sundanese |
mr-IN | Marathi |
ky-KG | Kyrgyz |
az-AZ | Azerbaijani |
API Call Example (using curl)
curl -X POST http://127.0.0.1:5080/v1/audio/transcriptions \
-F "file=@/path/to/your/audio.mp3" \
-F "language=zh-CN" \
-F "response_format=srt"
API Call Example (using the Python openai library)
(This library conveniently allows calling APIs that are compatible with the OpenAI API format.)
from openai import OpenAI
# Configure the client to point to the local service address
client = OpenAI(base_url='http://127.0.0.1:5080/v1', api_key='any_string_will_do')  # api_key is not important in this context
audio_file_path = "your_audio.wav"  # Replace with your file path
with open(audio_file_path, 'rb') as file_handle:
    # Make the transcription request
    transcript = client.audio.transcriptions.create(
        file=(audio_file_path, file_handle),  # Pass the filename and file content
        model='base',  # Model name, fixed as 'base' here or adjust as needed
        language='zh-CN',  # Specify the language
        response_format="srt"  # Specify the response format
    )
# Print the transcription result (SRT format text)
print(transcript)
Example Response (SRT Format)
1
00:00:00,000 --> 00:00:02,500
Hello, this is a test audio.

2
00:00:02,500 --> 00:00:05,000
I hope the transcription result is accurate.
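If you want to post-process the SRT output in your own code, a small parser along these lines may help. It is a sketch under the assumption that cues are separated by a blank line, as in standard SRT files; the names to_seconds and parse_srt are hypothetical and not part of the tool.

```python
import re

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(stamp):
    """Convert an SRT timestamp such as 00:00:02,500 to seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, secs, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + secs + millis / 1000


def parse_srt(text):
    """Return a list of (index, start_seconds, end_seconds, text) tuples."""
    cues = []
    # Cues are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue  # skip fragments that cannot be a full cue
        start, end = lines[1].split("-->")
        cues.append(
            (int(lines[0]), to_seconds(start), to_seconds(end), "\n".join(lines[2:]))
        )
    return cues


sample = """1
00:00:00,000 --> 00:00:02,500
Hello, this is a test audio.

2
00:00:02,500 --> 00:00:05,000
I hope the transcription result is accurate.
"""
for cue in parse_srt(sample):
    print(cue)
```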
Want it Faster? Enable GPU Acceleration (Optional)
- • Why use a GPU? If you have a compatible NVIDIA graphics card and the environment is properly configured, using a GPU can significantly increase transcription speed, especially for long audio files.
- • How to Enable:
- Prerequisites: Ensure your computer has the correct NVIDIA graphics card drivers and a CUDA 12.x environment installed.
- Install Support: In the all-in-one package folder, find and double-click the Install GPU Support.bat (安装GPU支持.bat) file. It will automatically complete the necessary setup.
- • Note: The default all-in-one package does not include GPU support to keep the file size smaller.
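Before installing GPU support, you may want a quick check that an NVIDIA driver is actually visible on the machine. The sketch below is one possible way to do that, assuming the nvidia-smi utility that ships with NVIDIA drivers is on your PATH. It does not verify the CUDA 12.x requirement, and the function name nvidia_gpu_visible is hypothetical.

```python
import shutil
import subprocess


def nvidia_gpu_visible():
    """Return True if nvidia-smi is found and lists at least one GPU."""
    executable = shutil.which("nvidia-smi")
    if executable is None:
        return False  # driver tools not installed or not on PATH
    try:
        # "nvidia-smi -L" prints one line per detected GPU.
        result = subprocess.run(
            [executable, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


if nvidia_gpu_visible():
    print("An NVIDIA GPU was detected.")
else:
    print("No NVIDIA GPU detected; the CPU version will be used.")
```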
A Few Tips
- File Size and Duration: Keep individual files reasonably small (e.g., under 1GB) and no longer than about 1 hour. Very large files may be processed very slowly.
- Audio Quality: The clearer the audio and the less background noise, the better the transcription results. Try to use high-quality audio sources.
- Internet Connection for First Use: The first time you transcribe a specific language, the program needs an internet connection to download some required data for that language. It's a good idea to transcribe each of your commonly used languages once while online (even a very short test audio will do). After that, you can use the tool offline.
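The file size and duration tip can be turned into a small pre-flight check before uploading. The following is a sketch only: the 1 GB and 1 hour limits are the rough guidelines from the tips above rather than hard limits of the tool, the extension set copies the formats listed earlier (that list ends with "etc.", so it is not exhaustive), duration is only measured for WAV files, and the function name preflight is hypothetical.

```python
import os
import wave

# Formats listed in this article; the tool may accept others as well.
SUPPORTED_EXTENSIONS = {
    "mp3", "mp4", "mpeg", "mpga", "m4a", "wav",
    "webm", "aac", "flac", "mov", "mkv", "avi",
}
MAX_BYTES = 1024 ** 3  # about 1 GB, the suggested upper bound
MAX_SECONDS = 3600     # 1 hour, the suggested upper bound


def preflight(path):
    """Return a list of warning strings; an empty list means no concerns."""
    warnings = []
    extension = os.path.splitext(path)[1].lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        warnings.append(f"'.{extension}' is not among the listed formats")
    if os.path.getsize(path) > MAX_BYTES:
        warnings.append("file is larger than 1 GB and may be slow to process")
    if extension == "wav":
        # The standard library can only read durations of WAV files.
        try:
            with wave.open(path, "rb") as wav_file:
                duration = wav_file.getnframes() / wav_file.getframerate()
        except (wave.Error, EOFError):
            duration = None  # unreadable header; skip the duration check
        if duration is not None and duration > MAX_SECONDS:
            warnings.append("audio is longer than 1 hour")
    return warnings
```

Calling preflight("your_audio.wav") returns an empty list when none of the guidelines are exceeded.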