Question 1

How does Google Text-to-Speech compare to Amazon Polly?

Accepted Answer

Google Text-to-Speech offers 380+ voices across 75+ languages, compared to Amazon Polly's smaller voice library. Google's Chirp 3: HD and Gemini-TTS models produce more natural-sounding speech with emotional range, while Polly provides per-word timestamps that Google lacks. Google's Neural2 tier and Polly's neural voices both cost $16.00/1M characters, but Google's free tier renews monthly (1M to 4M characters depending on the model), whereas Polly's free tier is limited to the first 12 months. The main deciding factor is ecosystem: Polly integrates with AWS services, while Google TTS integrates with Dialogflow, Vertex AI, and Cloud Run.

Question 2

How many languages and voices does Google Text-to-Speech support?

Accepted Answer

Google Text-to-Speech supports 380+ voices across 75+ languages and regional variants. These span multiple voice tiers: Standard, WaveNet, Neural2, Studio, Polyglot (Preview), Chirp 3: HD (30 distinct styles), and the newest Gemini-TTS models, which are available in 75+ locales. Languages include English, Mandarin, Hindi, Spanish, Arabic, Russian, Japanese, Portuguese, French, German, Korean, and dozens more. The full list is available through the voices.list API method (GET /v1/voices) or Google's documentation.
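
To make the voice list easier to work with, here is a minimal sketch that groups a voices-list response by language code. It assumes the REST response shape of a `voices` array whose items carry `name` and `languageCodes`. The sample payload and the helper name `group_voices_by_language` are illustrative only, and no network call is made.

```python
from collections import defaultdict

# REST endpoint for listing voices (requires credentials; not called here).
VOICES_URL = "https://texttospeech.googleapis.com/v1/voices"

# Illustrative sample in the shape of a voices-list response.
SAMPLE_RESPONSE = {
    "voices": [
        {"name": "en-US-Wavenet-A", "languageCodes": ["en-US"]},
        {"name": "en-US-Neural2-C", "languageCodes": ["en-US"]},
        {"name": "ja-JP-Standard-A", "languageCodes": ["ja-JP"]},
    ]
}


def group_voices_by_language(response: dict) -> dict:
    """Map each language code to a sorted list of voice names."""
    grouped = defaultdict(list)
    for voice in response.get("voices", []):
        for code in voice.get("languageCodes", []):
            grouped[code].append(voice["name"])
    return {code: sorted(names) for code, names in sorted(grouped.items())}


if __name__ == "__main__":
    for code, names in group_voices_by_language(SAMPLE_RESPONSE).items():
        print(code, len(names), names)
```

In a real project the same function could be fed the JSON returned by an authenticated GET request to `VOICES_URL`.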

Question 3

What is the difference between Google Text-to-Speech WaveNet and Chirp 3: HD voices?

Accepted Answer

WaveNet voices are Google's legacy neural TTS tier, priced at $4.00/1M characters with a 4 million character monthly free tier. They produce clear, natural speech but lack conversational spontaneity. Chirp 3: HD voices, at $30.00/1M characters with a 1 million character free tier, use Google's AudioML research to add human disfluencies, emotional range, and more accurate intonation. Chirp 3: HD also supports low-latency streaming for real-time agent conversations, but does not support SSML input or pitch/rate audio parameters.
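
The price difference is easier to judge with numbers. The sketch below estimates a monthly bill from the rates and free-tier allowances quoted in this answer. The `PRICING` table and the `monthly_cost` helper are hypothetical names for illustration, and the calculation assumes the free tier is simply subtracted from monthly usage before billing.

```python
# Rates and free tiers as quoted in the answer above (USD per 1M characters).
PRICING = {
    "WaveNet": {"usd_per_million": 4.00, "free_chars": 4_000_000},
    "Chirp 3: HD": {"usd_per_million": 30.00, "free_chars": 1_000_000},
}


def monthly_cost(tier: str, chars: int) -> float:
    """Estimate the monthly charge in USD for `chars` characters on `tier`."""
    plan = PRICING[tier]
    # Assumption: only characters beyond the free allowance are billed.
    billable = max(0, chars - plan["free_chars"])
    return round(billable * plan["usd_per_million"] / 1_000_000, 2)


if __name__ == "__main__":
    for tier in PRICING:
        print(tier, monthly_cost(tier, 10_000_000))
```

Under these assumptions, 10 million characters in a month comes to $24.00 on WaveNet and $270.00 on Chirp 3: HD.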

Question 4

Does Google Text-to-Speech store or use customer data for training?

Accepted Answer

By default, no. Google processes Text-to-Speech requests in memory and does not store customer text input or generated audio for model training. Google only uses data from customers who have explicitly opted in to the data logging program, which provides discounted pricing in exchange. Customers who have not opted in retain full ownership of their content, and Google does not share it with third parties except as necessary to deliver the service.

Question 5

Can Google Text-to-Speech create a custom voice from my own recordings?

Accepted Answer

Yes. Google's Instant Custom Voice feature creates a personalized voice model from as little as 10 seconds of reference audio. It is available in 30+ locales and designed for use cases like audiobooks, podcasts, video games, and branded voice interfaces. Instant Custom Voice is priced at $60.00/1M characters with no monthly free tier. You access it through the Media Studio in the Google Cloud Console or via the API.

Question 6

How does Google Text-to-Speech integrate with Dialogflow for voicebots?

Accepted Answer

Google Text-to-Speech is natively integrated with Dialogflow CX and Dialogflow ES. When building a conversational agent in Dialogflow, you can select any Text-to-Speech voice model to generate dynamic spoken responses instead of playing pre-recorded audio. The integration also works with third-party contact center platforms like Genesys Cloud, Avaya, and Cisco. For end-to-end voice interfaces, Text-to-Speech pairs with Google Cloud Speech-to-Text to handle both directions of the conversation.

Question 7

What audio output formats does Google Text-to-Speech support?

Accepted Answer

Google Text-to-Speech generates audio in MP3, Linear16 (WAV), OGG Opus, and several additional encodings including A-Law and Mu-Law for telephony use cases. Per-request configuration options include pitch tuning (up to 20 semitones in either direction), speaking rate adjustment (0.25x to 4x normal speed), and volume gain control (up to +16 dB). Audio profiles let you optimize output for specific playback devices like headphones, phone lines, or smart speakers.
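
As a rough illustration of these options, the following sketch builds the `audioConfig` portion of a REST synthesis request and rejects values outside the limits quoted above. The field names follow the REST `AudioConfig` message; the `build_audio_config` helper is a hypothetical name, and the limits are the ones stated in this answer rather than values checked against the live API.

```python
from typing import Optional

# Encodings mentioned in the answer, written as REST enum values.
SUPPORTED_ENCODINGS = {"MP3", "LINEAR16", "OGG_OPUS", "ALAW", "MULAW"}


def build_audio_config(
    encoding: str = "MP3",
    pitch: float = 0.0,
    speaking_rate: float = 1.0,
    volume_gain_db: float = 0.0,
    effects_profile: Optional[str] = None,
) -> dict:
    """Return an audioConfig dict, validating against the quoted limits."""
    if encoding not in SUPPORTED_ENCODINGS:
        raise ValueError(f"unsupported encoding: {encoding}")
    if not -20.0 <= pitch <= 20.0:
        raise ValueError("pitch must be within 20 semitones of the default")
    if not 0.25 <= speaking_rate <= 4.0:
        raise ValueError("speaking rate must be between 0.25 and 4.0")
    if volume_gain_db > 16.0:
        raise ValueError("volume gain must not exceed +16 dB")

    config = {
        "audioEncoding": encoding,
        "pitch": pitch,
        "speakingRate": speaking_rate,
        "volumeGainDb": volume_gain_db,
    }
    if effects_profile:
        # Audio profile that tunes output for a class of playback device.
        config["effectsProfileId"] = [effects_profile]
    return config


if __name__ == "__main__":
    print(build_audio_config("MULAW", effects_profile="telephony-class-application"))
```

Validating on the client side keeps malformed requests from reaching the API, though the service remains the final authority on accepted ranges.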

Question 8

What is Gemini-TTS in Google Text-to-Speech and how does it differ from older models?

Accepted Answer

Gemini-TTS is Google's newest Text-to-Speech technology, built on the Gemini 2.5 Flash and Gemini 2.5 Pro models. Unlike older models, where you configure voice parameters manually, Gemini-TTS accepts natural-language text prompts to control style, accent, pace, tone, and emotional expression. It also supports multispeaker synthesis for narratives with multiple characters. Pricing is token-based: Gemini 2.5 Flash costs $0.50/1M input tokens plus $10.00/1M audio tokens, and Gemini 2.5 Pro costs double those rates. Generated audio is billed at 25 audio tokens per second of output.
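
Token-based pricing is less intuitive than per-character pricing, so here is a small worked sketch using only the figures quoted in this answer. The `RATES` table and the `estimate_cost` helper are hypothetical names, and the input token count is something you would need to measure for your own prompts.

```python
AUDIO_TOKENS_PER_SECOND = 25  # as quoted in the answer above

# USD per 1M tokens; Pro is stated to be double the Flash rates.
RATES = {
    "flash": {"input": 0.50, "audio": 10.00},
    "pro": {"input": 1.00, "audio": 20.00},
}


def estimate_cost(model: str, input_tokens: int, audio_seconds: float) -> float:
    """Estimate the USD cost of one request for the given model."""
    rate = RATES[model]
    audio_tokens = audio_seconds * AUDIO_TOKENS_PER_SECOND
    input_cost = input_tokens * rate["input"] / 1_000_000
    audio_cost = audio_tokens * rate["audio"] / 1_000_000
    return round(input_cost + audio_cost, 6)


if __name__ == "__main__":
    # One minute of audio from a 1,000-token prompt.
    print(estimate_cost("flash", 1_000, 60))
    print(estimate_cost("pro", 1_000, 60))
```

With these figures, one minute of audio is 1,500 audio tokens, so the audio portion dominates the bill: about $0.015 on Flash versus $0.0005 for a 1,000-token prompt.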

Google Text-to-Speech — Independent Software Review
