Streaming Speech-to-Text

Real-time transcription your notetaker, agents, and captions can depend on

Universal-3 Pro Streaming

import assemblyai as aai
from assemblyai.streaming.v3 import (
    StreamingClient,
    StreamingClientOptions,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
)

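# Placeholder: replace with your AssemblyAI API key.
API_KEY = "<YOUR_API_KEY>"

def on_turn(self, event: TurnEvent):
    # Print each turn's transcript and whether the turn has ended.
    print(f"{event.transcript} ({event.end_of_turn})")

# Create the client for the streaming host.
client = StreamingClient(
    StreamingClientOptions(
        api_key=API_KEY,
        api_host="streaming.assemblyai.com",
    )
)

# Register the handler before connecting.
client.on(StreamingEvents.Turn, on_turn)

# Open the session with the chosen model and audio sample rate.
client.connect(
    StreamingParameters(
        speech_model="u3-rt-pro",
        sample_rate=16000,
    )
)

try:
    # Stream microphone audio at the same sample rate as the session.
    client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
finally:
    # Always close the session, even if streaming raises.
    client.disconnect(terminate=True)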
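
Because each Turn event carries both the transcript text and an end-of-turn flag, a handler can keep only the finished turns instead of printing everything. The sketch below shows one way this might look. `TurnBuffer` is a hypothetical helper, not part of the SDK, and it relies only on the two attributes the snippet above already uses (`transcript` and `end_of_turn`). The events here are stand-ins so the example runs without a network connection.

```python
from types import SimpleNamespace


class TurnBuffer:
    """Hypothetical helper: collects the text of completed turns."""

    def __init__(self):
        self.turns = []

    def on_turn(self, client, event):
        # Assumption: the event exposes .transcript and .end_of_turn,
        # as used by on_turn() in the snippet above.
        text = event.transcript.strip()
        if event.end_of_turn and text:
            self.turns.append(text)

    def text(self):
        # Join completed turns into one running transcript.
        return "\n".join(self.turns)


# Stand-in events for a dry run; real events come from the streaming client.
buffer = TurnBuffer()
buffer.on_turn(None, SimpleNamespace(transcript="hello", end_of_turn=False))
buffer.on_turn(None, SimpleNamespace(transcript="Hello there.", end_of_turn=True))
print(buffer.text())
```

With a live client, the bound method could presumably be registered the same way as `on_turn` above, for example `client.on(StreamingEvents.Turn, buffer.on_turn)`. Treat that wiring as an assumption to verify against the SDK.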

Models

Pick the model that fits your workload

Real-time transcription fast enough for voice agents, accurate enough for production.

Universal-3 Pro Streaming

The most accurate, controllable model on the market.

$0.45 / hr

Best for: voice agents and real-time assistants

English Only

~150ms P50 latency, finals-only output

Highest accuracy on names, numbers, and technical terms

Universal Streaming

Production-grade streaming accuracy at the lowest price.

$0.15 / hr

Best for: live captioning, meetings, agent assist

English Only

Partial and final transcripts

Production-grade accuracy at the lowest price

Universal Streaming Multilingual

Global contact centers and multilingual voice experiences.

$0.15 / hr

EN, ES, FR, DE, IT, PT

Code-switching

Speaker diarization

Keyterm prompting

Whisper Streaming

Open-source Whisper model enhanced with AssemblyAI's reliable infrastructure and unlimited scale. Supports 99+ languages at an accessible price point.

$0.30 / hr

99+ languages

Unlimited concurrency

HIPAA BAA available

Add-on

Medical Mode

Clinical-grade streaming accuracy

Clinical-grade accuracy
Medical terminology recognition
Noise-resilient transcription
HIPAA-ready infrastructure

Compare features

Model                Best for                         Price       Languages
U-3 Pro Streaming    Voice agents, AI scribes         $0.45 / hr  EN, ES, FR, DE, IT, PT
Universal Streaming  AI notetakers, call centers      $0.15 / hr  English
Univ. Multilingual   Global contact centers           $0.15 / hr  EN, ES, FR, DE, IT, PT
Whisper Streaming    Long-tail language support       $0.30 / hr  99+ languages

Natural language prompting

Up to ~1,500 words

—

Keyterm prompting

Up to ~100 words

—

Code-switching

—

Speaker diarization

10+ speakers

—

Medical terminology

Medical mode add-on

—

HIPAA BAA

On request

Unlimited concurrency

Use cases

Built for every voice workflow

Real-time transcription powers every application where you stream audio.

Voice Agents

Production-grade transcription for voice agents, available standalone or through our Voice Agent API. Drops into Pipecat, LiveKit, or your own stack.

AI Notetaker

Real-time meeting transcription your notetaker can depend on.

Conversation Intelligence and Call Analytics

Stream live transcripts from every sales call, support ticket, and customer interaction.

AI Scribes

Capture patient-provider conversations in real time with clinical-grade accuracy.

Agent Assist

Real-time transcription that surfaces the right answer to agents mid-call.

Dictation and Voice Input

Power voice input in productivity tools, mobile apps, and accessibility workflows.

Playground

We’re not playing around, but you can

Put our Voice AI models to the test in our no-code playground.

Common questions

What is streaming speech-to-text? Streaming speech-to-text transcribes live audio as it's spoken. You send audio over a secure WebSocket to the API, which returns transcripts within a few hundred milliseconds (~300 ms P50).
Can I run unlimited concurrent streams? Yes. AssemblyAI scales automatically with no hard caps on concurrent streams. You can run as many simultaneous transcription sessions as your application requires.
How do I get started? Create a free account, connect to the WebSocket endpoint, set your sample rate, stream audio chunks, and handle transcript events. Our SDKs for Python and JavaScript reduce this to a few lines of code.
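
To make the "stream audio chunks" step concrete, here is a minimal sketch of slicing raw audio into fixed-duration chunks before sending them. It assumes 16-bit mono PCM at 16 kHz and a 50 ms chunk length. Those values and the `pcm_chunks` name are illustrative assumptions, not documented requirements, so check the API reference for the formats and chunk sizes it accepts.

```python
def pcm_chunks(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 50,
               bytes_per_sample: int = 2):
    """Yield fixed-duration slices of raw mono PCM audio."""
    # Bytes per chunk = samples per chunk * bytes per sample.
    chunk_size = (sample_rate * chunk_ms // 1000) * bytes_per_sample
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_size):
        # The final slice may be shorter than chunk_size.
        yield pcm[start:start + chunk_size]


# One second of silence at 16 kHz, 16-bit mono is 32,000 bytes.
one_second = bytes(16000 * 2)
chunks = list(pcm_chunks(one_second))
print(len(chunks), len(chunks[0]))
```

Each yielded chunk would then be written to the open connection in order. Smaller chunks lower the delay before audio reaches the service, at the cost of more messages.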
How much does streaming speech-to-text cost? Universal Streaming is $0.15 per hour. Universal-3 Pro Streaming is $0.45 per hour. Whisper Streaming is $0.30 per hour for 99+ language coverage. Billing is based on total session duration.
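
Since billing follows total session duration, a rough cost estimate is simply hours multiplied by the hourly rate. The sketch below assumes billing is linear in duration, with no minimum charge or rounding increment. The page does not state this, so treat the result as an approximation. `estimate_cost` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_HALF_UP


def estimate_cost(session_seconds: int, rate_per_hour: str) -> Decimal:
    """Estimate streaming cost in dollars for a total session duration."""
    # Assumption: cost scales linearly with duration (no minimums or rounding steps).
    hours = Decimal(session_seconds) / Decimal(3600)
    cost = hours * Decimal(rate_per_hour)
    # Keep four decimal places so short sessions do not round to zero.
    return cost.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)


# 90 minutes of Universal Streaming at the $0.15 per hour rate.
print(estimate_cost(90 * 60, "0.15"))
```

Passing the rate as a string keeps the arithmetic in exact decimals rather than binary floats, which matters once many short sessions are summed.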
What features are included? Immutable transcripts, intelligent endpointing, word-level timestamps, keyterm prompting, speaker diarization, and unlimited concurrent streams.
Which languages are supported? English is supported by default. Universal Streaming Multilingual adds Spanish, French, German, Italian, and Portuguese. Whisper Streaming covers 99+ languages.