Transform voice into text and vice versa with advanced AI capabilities.
Grade: A — Score: 100/100
Free (F0): $0/month
Pay-as-You-Go: Usage-based (no minimum commitment)
Commitment Tiers (Standard): Monthly commitment (volume discounts)
Commitment Tiers (Connected Container): Monthly commitment (~5% discount vs Standard)
Commitment Tiers (Disconnected Container): Annual commitment (for air-gapped/offline deployments)
Consider switching to Google Cloud Speech-to-Text: it offers similar functionality, with competitive pricing and integration options.
Azure Speech-to-Text pricing depends on the transcription mode. Real-time standard transcription costs $1 per audio hour, fast transcription costs $0.36 per hour, and batch transcription costs $0.18 per hour. Custom models (trained on your domain-specific vocabulary) cost slightly more: $1.20 per hour for real-time and $0.225 per hour for batch, plus $0.0538 per model per hour for endpoint hosting and $10 per compute hour for custom model training. Enhanced add-on features — continuous language identification, diarization (speaker separation), and pronunciation assessment — add $0.30 per hour per feature for real-time use, but are included at no extra charge for batch. All usage is billed in one-second increments.
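The rates above can be turned into a rough pay-as-you-go estimator. This is only a sketch: the function name `stt_cost` and its parameters are hypothetical, partial seconds are assumed to round up, and add-ons in fast mode are rejected because the text does not say how they are priced there.

```python
import math

# Pay-as-you-go rates in USD per audio hour, copied from the paragraph above.
STT_RATES = {
    ("standard", "realtime"): 1.00,
    ("standard", "fast"): 0.36,
    ("standard", "batch"): 0.18,
    ("custom", "realtime"): 1.20,
    ("custom", "batch"): 0.225,
}
REALTIME_ADDON_RATE = 0.30  # USD per hour, per enhanced feature (real-time only)


def stt_cost(audio_seconds, model="standard", mode="realtime", addons=0):
    """Estimate transcription cost in USD, billed in one-second increments."""
    # Assumption: a partial second is rounded up to a whole billed second.
    billed_seconds = math.ceil(audio_seconds)
    rate = STT_RATES[(model, mode)]
    if addons:
        if mode == "realtime":
            rate += REALTIME_ADDON_RATE * addons
        elif mode == "fast":
            # The source does not state add-on pricing for fast transcription.
            raise ValueError("add-on pricing for fast mode is not specified")
        # Batch mode: add-ons are included at no extra charge.
    return billed_seconds / 3600 * rate


if __name__ == "__main__":
    # 90 minutes of real-time audio with diarization and language identification.
    print(f"${stt_cost(90 * 60, addons=2):.2f}")
    # The same audio transcribed in batch, where add-ons are free.
    print(f"${stt_cost(90 * 60, mode='batch', addons=2):.2f}")
```

Endpoint hosting and training charges for custom models are separate line items and are not included in this estimate.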
Yes, a free tier is available. The Free (F0) tier provides 5 audio hours per month for speech-to-text (shared between Standard and Custom; batch is not supported), 0.5 million characters per month for neural text-to-speech, 5 audio hours per month for speech translation, and 10,000 transactions per month for speaker recognition (verification, identification, and voice profile storage). You also get 1 custom speech model hosted free per month (unused models are automatically decommissioned after 7 days). The free tier is enough for prototyping and light testing but not for production workloads.
Azure Speech's standard real-time STT at $1 per hour ($0.0167/minute) is comparable to Google Cloud Speech-to-Text at $0.016/minute. Azure's batch transcription at $0.18 per hour ($0.003/minute) is highly competitive. With commitment tiers, Azure drops to $0.50 per hour for 50,000 hours/month. Deepgram offers rates as low as $0.0043/minute for pre-recorded audio, making it cheaper for pure transcription. Azure's advantage is the breadth of its integrated platform — STT, TTS, translation, Voice Live API, avatars, and speaker recognition all under one API surface with 100+ compliance certifications. Competitors like Deepgram or Soniox may be cheaper per minute but require separate services for translation, TTS, or custom voice.
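The per-minute figures quoted above follow from dividing each hourly rate by 60. A small sketch that puts the quoted rates on the same per-minute basis (the helper `per_minute` and the dictionary labels are illustrative names; the Google and Deepgram figures are simply the ones cited in the paragraph):

```python
def per_minute(rate_per_hour):
    """Convert a USD-per-hour rate to USD per minute."""
    return rate_per_hour / 60


# All values in USD per audio minute.
rates_per_minute = {
    "Azure real-time (pay-as-you-go)": per_minute(1.00),
    "Azure real-time (50,000-hour commitment)": per_minute(0.50),
    "Azure batch": per_minute(0.18),
    "Google Cloud Speech-to-Text": 0.016,
    "Deepgram (pre-recorded, lowest quoted)": 0.0043,
}

if __name__ == "__main__":
    # List the options from cheapest to most expensive.
    for name, rate in sorted(rates_per_minute.items(), key=lambda kv: kv[1]):
        print(f"{name}: ${rate:.4f}/minute")
```

On these numbers alone, Azure batch is the lowest per-minute rate in the list, but it is not a like-for-like comparison with real-time services.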
Voice Live API enables end-to-end voice for AI agents by combining speech-to-text, LLM reasoning, and text-to-speech in a single pipeline. It is now integrated with Foundry Agent Service. Pricing is per million tokens across three tiers based on LLM size: Voice Live Pro (GPT-Realtime, GPT-4o, GPT-4.1) starts at $4.40/M tokens for text input and $17/M tokens for audio-standard input; Voice Live Standard (GPT-4o-Mini-Realtime, GPT-4o Mini, GPT-4.1 Mini) starts at $0.66/M tokens for text and $15/M tokens for audio; Voice Live Lite (GPT-4.1 Nano, Phi models) starts at $0.11/M tokens for text and $15/M tokens for audio. Voice Live BYO (bring your own model) provides audio processing only at $12.50/M tokens input and $30/M tokens output.
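As a rough illustration of per-token billing, the sketch below estimates input-side cost for the three managed tiers and full audio cost for BYO. It assumes the audio figures quoted for the Standard and Lite tiers are input rates, and it leaves out output-token charges for the managed tiers because the text does not give them. Function and dictionary names are hypothetical.

```python
# "Starts at" input rates in USD per million tokens, from the paragraph above.
VOICE_LIVE_INPUT_RATES = {
    "pro": {"text": 4.40, "audio": 17.00},
    "standard": {"text": 0.66, "audio": 15.00},  # assumption: audio figure is an input rate
    "lite": {"text": 0.11, "audio": 15.00},      # assumption: audio figure is an input rate
}
BYO_INPUT_RATE = 12.50   # USD per million audio tokens in
BYO_OUTPUT_RATE = 30.00  # USD per million audio tokens out


def voice_live_input_cost(tier, text_tokens, audio_tokens):
    """Input-side cost in USD for a managed tier (output tokens not included)."""
    rates = VOICE_LIVE_INPUT_RATES[tier]
    return (text_tokens * rates["text"] + audio_tokens * rates["audio"]) / 1_000_000


def voice_live_byo_cost(input_tokens, output_tokens):
    """Audio processing cost in USD for the bring-your-own-model option."""
    return (input_tokens * BYO_INPUT_RATE + output_tokens * BYO_OUTPUT_RATE) / 1_000_000


if __name__ == "__main__":
    print(f"Pro input:  ${voice_live_input_cost('pro', 200_000, 1_000_000):.2f}")
    print(f"Lite input: ${voice_live_input_cost('lite', 200_000, 1_000_000):.2f}")
    print(f"BYO audio:  ${voice_live_byo_cost(1_000_000, 500_000):.2f}")
```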
Azure TTS offers prebuilt neural voices at $15 per 1M characters, and neural HD voices (higher fidelity) at $22 per 1M characters. Custom professional voice — where you train a unique branded voice — costs $24 per 1M characters for synthesis ($48 for neural HD), $52 per compute hour for voice model training (capped at $936 per training session), and $4.04 per model per hour for endpoint hosting. Personal voice (limited access, requires Microsoft approval) costs $24 per 1M characters with free voice creation and $600 per 1,000 voice profiles per month for storage. Commitment tiers reduce prebuilt neural TTS to as low as $7.50 per 1M characters at the 2,000M character tier ($15,000/month).
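The custom professional voice charges combine a capped training fee, hourly hosting, and per-character synthesis. A minimal sketch of that arithmetic, with hypothetical function names and an assumed 730 hosting hours for an average month:

```python
TRAINING_RATE = 52.0     # USD per compute hour
TRAINING_CAP = 936.0     # USD maximum per training session
HOSTING_RATE = 4.04      # USD per model per hour
SYNTHESIS_RATE = 24.0    # USD per 1M characters (custom professional voice)
SYNTHESIS_RATE_HD = 48.0 # USD per 1M characters (custom neural HD)


def training_cost(compute_hours):
    """Cost of one training session; the cap is reached at 18 compute hours."""
    return min(TRAINING_RATE * compute_hours, TRAINING_CAP)


def custom_voice_monthly_cost(characters, hosting_hours=730, hd=False):
    """Synthesis plus endpoint hosting for one model over one month.

    Assumption: 730 hours approximates an always-on endpoint for a month.
    """
    rate = SYNTHESIS_RATE_HD if hd else SYNTHESIS_RATE
    return characters / 1_000_000 * rate + hosting_hours * HOSTING_RATE


if __name__ == "__main__":
    print(f"Training, 30 compute hours: ${training_cost(30):.2f}")
    print(f"Month of 5M characters:     ${custom_voice_monthly_cost(5_000_000):.2f}")
```

At low volumes the always-on hosting fee dominates the bill, so the per-character rate matters less than how long the endpoint stays deployed.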
Commitment tiers are monthly volume commitments that reduce per-unit costs by 20–50% compared to pay-as-you-go. For STT Standard, the tiers are $1,600/month for 2,000 hours ($0.80/hour effective, with $0.80/hour overage), $6,500/month for 10,000 hours ($0.65/hour effective), and $25,000/month for 50,000 hours ($0.50/hour effective, a 50% saving over the $1/hour pay-as-you-go rate). For TTS Neural, tiers start at $960/month for 80M characters ($12/1M effective) and go up to $15,000/month for 2,000M characters ($7.50/1M effective). Commitment tiers are worth considering when you consistently process 2,000+ hours of audio or 80M+ TTS characters per month. The entry tiers already beat pay-as-you-go above roughly 1,600 hours or 64M characters, since that usage would cost the same $1,600 or $960 at the $1/hour and $15/1M rates. The risk is that unused committed hours or characters are not refunded.
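To see where a commitment starts to pay off, the STT Standard tiers can be compared against pay-as-you-go for a given monthly volume. This sketch assumes overage on every tier is billed at that tier's effective hourly rate; the text only states this for the 2,000-hour tier. The names `tier_cost` and `cheapest_option` are hypothetical.

```python
PAYG_RATE = 1.00  # USD per audio hour, pay-as-you-go

# (included hours, monthly fee in USD) for STT Standard commitment tiers.
STT_TIERS = [(2_000, 1_600.0), (10_000, 6_500.0), (50_000, 25_000.0)]


def tier_cost(hours_used, included_hours, monthly_fee):
    """Monthly cost on a commitment tier, including any overage."""
    # Assumption: overage is billed at the tier's effective rate.
    overage_rate = monthly_fee / included_hours
    overage_hours = max(0, hours_used - included_hours)
    return monthly_fee + overage_hours * overage_rate


def cheapest_option(hours_used):
    """Return (label, monthly cost) for the cheapest way to buy hours_used."""
    options = [("pay-as-you-go", hours_used * PAYG_RATE)]
    for included, fee in STT_TIERS:
        options.append((f"{included}-hour tier", tier_cost(hours_used, included, fee)))
    # min() keeps the first entry on ties, so pay-as-you-go wins a tie.
    return min(options, key=lambda option: option[1])


if __name__ == "__main__":
    for hours in (1_000, 1_600, 3_000, 9_000, 60_000):
        label, cost = cheapest_option(hours)
        print(f"{hours:>6} hours/month -> {label} (${cost:,.2f})")
```

Because unused hours are not refunded, the fee is paid in full even in a slow month, which is why the comparison should use a volume you expect to hit consistently rather than a peak.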
Yes. Azure Speech supports three deployment models beyond the cloud. Connected containers run on your infrastructure while maintaining a metered connection to Azure for billing, at roughly a 5% discount over standard commitment pricing (for example, STT at $1,520/month for 2,000 hours versus $1,600 standard). Disconnected containers operate fully offline for air-gapped environments and are priced annually: STT Standard starts at $74,100/year for 120,000 hours, and TTS Neural starts at $47,424/year for 4.8B characters. Disconnected containers require sign-up for access. Embedded Speech provides on-device STT and TTS for scenarios where cloud connectivity is intermittent, using lighter models that run directly on the device.
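The container prices above are easier to compare once reduced to effective unit rates. A short sketch of that arithmetic, using only the figures quoted in the paragraph (the helper name `effective_rate` is illustrative):

```python
def effective_rate(price, units):
    """Price per unit for a fixed-price commitment."""
    return price / units


# STT Standard, 2,000 hours per month.
standard_stt = effective_rate(1_600, 2_000)    # USD per hour
connected_stt = effective_rate(1_520, 2_000)   # USD per hour
connected_discount = 1 - connected_stt / standard_stt

# Disconnected containers, annual commitments.
disconnected_stt = effective_rate(74_100, 120_000)  # USD per hour
disconnected_tts = effective_rate(47_424, 4_800)    # USD per 1M characters (4.8B = 4,800M)

if __name__ == "__main__":
    print(f"Connected STT:    ${connected_stt:.2f}/hour "
          f"({connected_discount:.0%} below standard commitment)")
    print(f"Disconnected STT: ${disconnected_stt:.4f}/hour")
    print(f"Disconnected TTS: ${disconnected_tts:.2f} per 1M characters")
```

The disconnected rates only apply if the full annual volume is actually used, since the commitment is paid up front for the year.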
Azure Speech supports 100+ languages for speech-to-text, with an ever-growing list for speech translation. Real-time speech translation costs $2.50 per audio hour (includes 1 audio input/output and up to 2 text translation languages — additional languages are billed via Azure Translator). Video Translation converts video input at $5 per hour with output at $15 per hour (standard voice) or $20 per hour (personal voice). Live Interpreter enables real-time multilingual communication at $1 per audio input hour plus $10 per 1M characters for text output, $1.50 per hour for standard voice output, or $2 per hour for custom voice output. SDKs are available in C#, C++, Java, Python, and JavaScript.
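The translation prices combine an input charge with one or more output charges. A hedged sketch of how a bill might be estimated from the figures above; the function names are hypothetical, and Video Translation is assumed to bill input and output over the same duration:

```python
def video_translation_cost(video_hours, personal_voice=False):
    """Video Translation: $5/hour input plus $15 or $20/hour output."""
    # Assumption: output duration equals input duration.
    output_rate = 20.0 if personal_voice else 15.0
    return video_hours * (5.0 + output_rate)


def live_interpreter_cost(input_hours, text_chars=0,
                          standard_voice_hours=0.0, custom_voice_hours=0.0):
    """Live Interpreter: audio input plus whichever outputs are used."""
    return (
        input_hours * 1.0                 # audio input, USD per hour
        + text_chars / 1_000_000 * 10.0   # text output, USD per 1M characters
        + standard_voice_hours * 1.5      # standard voice output, USD per hour
        + custom_voice_hours * 2.0        # custom voice output, USD per hour
    )


if __name__ == "__main__":
    print(f"2-hour video, standard voice: ${video_translation_cost(2):.2f}")
    print(f"10-hour interpreted event:    "
          f"${live_interpreter_cost(10, text_chars=500_000, standard_voice_hours=10):.2f}")
```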