Gemini 2.5 Introduces Multi-Speaker Text-to-Speech (TTS), Available for Free
Google's Gemini 2.5 has a new and incredibly useful feature: multi-speaker text-to-speech! You can use it for free on Google AI Studio. This capability is powered by the gemini-2.5-flash-preview-tts and gemini-2.5-pro-preview-tts models.
Important Notes:
- Internet Access (with a VPN if needed): Google AI services are not reachable from every network. If yours blocks them, you will need to configure access yourself; this is a prerequisite for using any overseas AI tool.
- Google Account: You need a free Google account. If you don't have one yet, you can register on the Google website. Usually, a local phone number is sufficient for registration.
I. Accessing the Gemini Text-to-Speech Webpage
You can open Gemini's text-to-speech feature in either of two ways:
- Direct Access: Open the link https://aistudio.google.com/generate-speech in your browser.
- Via the AI Studio Homepage: If you are already logged into Google AI Studio, you can also open the speech generation feature from its entry on the homepage, as shown in the image below.
If the page fails to open, or if you see a message like "Not supported in your current region" (which often happens when using a Hong Kong network node), try switching your network proxy node to another country or region (such as the United States or Singapore).
Once the page opens, you will see the following speech generation interface:
II. Interface Overview and Mode Switching
Don't worry: although the interface is in English, it is easy to use, and we will walk through it step by step below.
Gemini's speech generation tool automatically detects the language of the text you enter. It currently supports 24 languages; Chinese is not listed in the documentation, but it works in practice.
By default, the page opens in Multi-speaker audio mode:
If you only need a single voice, click Single-speaker audio on the right side of the interface to switch to single-speaker mode. Its interface is simpler:
III. Multi-Speaker Text-to-Speech Practical Steps
We will focus on the more feature-rich multi-speaker text-to-speech, which currently supports only 2 speakers.
1. Prepare and Paste Text for Speech
In the Raw structure text box on the left side of the interface, type or paste the text you want spoken. Key points:
- Line Breaks: Keep each line reasonably short, and break sentences at natural pauses.
- Specify Speaker: Start each line with the format SpeakerX: (using a half-width English colon ":", not the full-width Chinese "：") to specify which character reads that line. For example:
  Speaker1: Today is a beautiful day, the sun is shining and the breeze is gentle.
  Speaker2: Yes, how about we go for a walk in the park?
- Gemini will assign different voices to lines marked with different speakers. Currently, up to two speakers are supported (for example, you can define "Speaker1" and "Speaker2").
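The formatting rules above are easy to check before you click Run. Below is a minimal Python sketch of such a check. It is not part of AI Studio: the name check_script is hypothetical, the two-speaker cap comes from this article, and the 200-character line limit is an arbitrary stand-in for "not too long".

```python
import re

# The two-speaker cap comes from this article; the 200-character limit is a
# hypothetical stand-in for "keep each line reasonably short".
MAX_SPEAKERS = 2
MAX_LINE_CHARS = 200

# A valid line looks like "Speaker1: some text" with a half-width colon.
LINE_PATTERN = re.compile(r"^(?P<speaker>[^:：]+):\s*(?P<text>.+)$")


def check_script(script: str) -> tuple[list[str], list[str]]:
    """Return (speaker labels in order of first appearance, problems found)."""
    speakers: list[str] = []
    problems: list[str] = []
    for number, line in enumerate(script.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        match = LINE_PATTERN.match(line)
        if match is None:
            if "：" in line:
                problems.append(f"line {number}: full-width colon '：' used")
            else:
                problems.append(f"line {number}: expected 'SpeakerX: text'")
            continue
        speaker = match.group("speaker")
        if speaker not in speakers:
            speakers.append(speaker)
        if len(match.group("text")) > MAX_LINE_CHARS:
            problems.append(f"line {number}: over {MAX_LINE_CHARS} characters")
    if len(speakers) > MAX_SPEAKERS:
        problems.append(
            f"{len(speakers)} speakers found; at most {MAX_SPEAKERS} supported"
        )
    return speakers, problems


if __name__ == "__main__":
    example = (
        "Speaker1: Today is a beautiful day, the sun is shining and the breeze is gentle.\n"
        "Speaker2: Yes, how about we go for a walk in the park?"
    )
    print(check_script(example))  # (['Speaker1', 'Speaker2'], [])
```

Run on the two example lines above, it reports both speaker labels and no problems. A full-width colon, a missing label, or a third speaker would each be listed as a problem.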
2. Configure Speaker Roles (Voice settings)
In the Voice settings area on the right side of the interface, you need to configure each speaker:
Set Speaker Name (Name): As shown in the figure below, the name entered in the Name input box must be exactly the same as the speaker identifier at the beginning of each line in the text on the left (such as "Speaker1", "Speaker2"). Case, numbers, and even spaces must match.
Select Voice: In the Voice drop-down menu below Name, choose a voice for the speaker. Click the play button next to each voice to preview it, then pick the one you like.
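Because the Name field must match the script's labels exactly, a mismatch such as "speaker 2" versus "Speaker2" is easy to miss. The following Python sketch compares the two. The helper names are hypothetical, and it assumes each label is the text before the first half-width colon on a line.

```python
def script_labels(script: str) -> set[str]:
    """Collect the label before the first half-width colon on each line."""
    return {
        line.split(":", 1)[0]
        for line in script.splitlines()
        if ":" in line
    }


def _normalize(label: str) -> str:
    # Remove all whitespace and ignore case, to flag labels that *almost* match.
    return "".join(label.split()).lower()


def check_voice_names(script: str, names: list[str]) -> dict[str, list]:
    """Compare script labels with the Name fields from Voice settings."""
    labels = script_labels(script)
    configured = set(names)
    missing = sorted(labels - configured)  # lines whose speaker has no Name entry
    unused = sorted(configured - labels)  # Name entries that no line uses
    near_misses = sorted(
        (label, name)
        for label in missing
        for name in unused
        if _normalize(label) == _normalize(name)
    )
    return {"missing": missing, "unused": unused, "near_misses": near_misses}


if __name__ == "__main__":
    script = "Speaker1: Hello!\nSpeaker2: Hi, how are you?"
    print(check_voice_names(script, ["Speaker1", "speaker 2"]))
    # {'missing': ['Speaker2'], 'unused': ['speaker 2'], 'near_misses': [('Speaker2', 'speaker 2')]}
```

The near_misses list points directly at Name entries that differ from a script label only in case or spacing, which is exactly the kind of mismatch described above.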
3. (Optional) Set Voice Style (Style instructions)
If you want the voiceover to carry a specific emotion or tone (e.g., happy, angry, sad), enter a style prompt in the Style instructions text box. The instructions apply to the whole script and affect the delivery of all speakers.
Tip: The preview area on the right mirrors the editor on the left in real time, and you can edit, delete, or add lines directly in it, which is very convenient.
4. Generate and Download Voiceover
After completing the settings above, click the blue Run button in the lower right corner of the interface. Gemini will process your text and generate the speech. If all goes well, an audio player appears below after a short wait, and you can play the result directly in the browser. Once you are satisfied, click the download button to save the audio to your computer.
IV. Possible Problems and Solutions
Gemini currently enforces fairly strict limits on API call frequency. When you process many lines of text, particularly in two-speaker mode and particularly with Chinese text, generation may fail with an error message similar to the one below:
If you encounter this problem, you can try the following methods:
- Switch to Single-Speaker Mode: If multiple speakers are not essential, switching to Single-speaker audio mode usually improves the success rate.
- Try Again Later: The simplest fix is to wait a few minutes (or longer) and then try again.
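The rate limits mentioned above also apply when the same models are called through the API rather than the web page. There, "try again later" is commonly automated as a retry with exponential backoff. Below is a minimal Python sketch, assuming call is any zero-argument function that sends one generation request and raises an exception on failure. The name retry_with_backoff and its default delays are hypothetical choices, not values published by Google.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), waiting longer after each failure before trying again.

    The base wait doubles after every failed attempt (2 s, 4 s, 8 s, ...), is
    capped at max_delay, and gets up to 50% random jitter added.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            # Assumption: every exception counts as retryable; real code should
            # retry only rate-limit errors and re-raise everything else at once.
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

Passing sleep as a parameter keeps the function easy to test without real waits, and the random jitter keeps several clients from retrying at the same instant. In the web interface, the equivalent is simply waiting and clicking Run again.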