Real-time transcription in Python with AssemblyAI's streaming SDK
This Python tutorial covers ~300ms latency transcription, intelligent endpointing, and immutable transcripts perfect for voice agents



You can have live transcription running in Python in about fifteen lines of code. Open a connection, register an event handler, stream your microphone, done—transcripts stream in as you speak, and each turn finalizes about 300ms after you stop talking. The catch isn't getting it working; it's getting it working correctly in production without leaking sessions or fighting partial results that won't stabilize.
This is a hands-on tutorial for the v3 streaming SDK—the StreamingClient pattern that's current as of 2026. If you've used the old RealtimeTranscriber class, forget it; that API is gone. We'll cover setup, the four events you handle, a complete working example, the advanced knobs that matter, and—critically—how to avoid the surprise bill that catches almost everyone the first time.
Let's build it.
Setup
Install the SDK with the extras you'll need for microphone capture:
pip install "assemblyai[extras]"
Grab your API key from the dashboard and set it as an environment variable so it never lands in your source:
export ASSEMBLYAI_API_KEY="your_key_here"
That's the whole setup. The SDK handles the WebSocket connection to wss://streaming.assemblyai.com/v3/ws, audio framing, and reconnection logic. You bring audio and event handlers. If you'd rather follow the canonical reference alongside this, the streaming getting-started guide tracks the same flow.
The four events you handle
Real-time streaming speech-to-text is event-driven. The client emits four event types, and you register a callback for each one that matters to you.
- Begin — the session opened. You get a session ID here; log it.
- Turn — the meat. Fires repeatedly as the model recognizes speech, carrying partial and finalized transcripts.
- Termination — the session closed. You get the total audio duration processed.
- Error — something went wrong on the connection.
Here's each handler:
from assemblyai.streaming.v3 import (
    StreamingClient,
    StreamingClientOptions,
    StreamingParameters,
    StreamingEvents,
    BeginEvent,
    TurnEvent,
    TerminationEvent,
    StreamingError,
)
def on_begin(client: StreamingClient, event: BeginEvent):
    print(f"Session started: {event.id}")
def on_turn(client: StreamingClient, event: TurnEvent):
    print(event.transcript, end="\r")
    if event.end_of_turn:
        print(event.transcript)  # finalized line
def on_terminated(client: StreamingClient, event: TerminationEvent):
    print(f"Session ended. {event.audio_duration_seconds}s of audio processed.")
def on_error(client: StreamingClient, error: StreamingError):
    print(f"Error: {error}")
The on_turn handler is where the interesting logic lives. Each TurnEvent carries the current transcript, which updates as more audio arrives. When event.end_of_turn is true, the model has decided the speaker finished a thought—that's your cue to commit the line.
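If you want the committed lines somewhere other than stdout, one option is a small buffer object whose method has the same (client, event) shape as the handlers above. This is a sketch, not SDK code: TranscriptBuffer is a hypothetical helper, and the stand-in events only mimic the two TurnEvent fields this tutorial uses (transcript and end_of_turn).

```python
from types import SimpleNamespace


class TranscriptBuffer:
    """Hypothetical helper: keeps the live partial apart from committed lines."""

    def __init__(self):
        self.partial = ""  # latest in-progress text for the current turn
        self.lines = []    # finalized turns, in order

    def on_turn(self, client, event):
        # Reads only the two fields used in this tutorial.
        if event.end_of_turn:
            if event.transcript:
                self.lines.append(event.transcript)
            self.partial = ""
        else:
            self.partial = event.transcript


# Dry run with stand-in events; a live session would deliver TurnEvent objects.
buffer = TranscriptBuffer()
for text, final in [("hello", False), ("hello world", False), ("Hello world.", True)]:
    buffer.on_turn(None, SimpleNamespace(transcript=text, end_of_turn=final))
```

Registering it would presumably look like client.on(StreamingEvents.Turn, buffer.on_turn), assuming the SDK accepts a bound method as a callback the same way it accepts a plain function.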
Running the client: a complete example
Here's an end-to-end script that transcribes your microphone until you hit Ctrl+C:
import os
import assemblyai as aai
from assemblyai.streaming.v3 import (
    StreamingClient,
    StreamingClientOptions,
    StreamingParameters,
    StreamingEvents,
    BeginEvent,
    TurnEvent,
    TerminationEvent,
    StreamingError,
)
# Read the key exported during setup so it never lands in source.
API_KEY = os.environ["ASSEMBLYAI_API_KEY"]
def on_begin(client, event: BeginEvent):
    print(f"Session started: {event.id}")
def on_turn(client, event: TurnEvent):
    print(event.transcript, end="\r")  # running partial, overwritten in place
    if event.end_of_turn:
        print(event.transcript)  # finalized line
def on_terminated(client, event: TerminationEvent):
    print(f"\nProcessed {event.audio_duration_seconds}s of audio.")
def on_error(client, error: StreamingError):
    print(f"Error: {error}")
client = StreamingClient(StreamingClientOptions(api_key=API_KEY))
client.on(StreamingEvents.Begin, on_begin)
client.on(StreamingEvents.Turn, on_turn)
client.on(StreamingEvents.Termination, on_terminated)
client.on(StreamingEvents.Error, on_error)
client.connect(
    StreamingParameters(
        speech_model="universal-3-5-pro",
        sample_rate=16000,
    )
)
try:
    client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
finally:
    client.disconnect(terminate=True)  # always close the session
A few things worth calling out. The speech_model="universal-3-5-pro" parameter selects Universal-3.5 Pro Realtime, AssemblyAI's flagship streaming model and the recommended default for voice agents. It delivers flagship accuracy across 18 languages with mid-sentence code-switching and—uniquely—can take your agent's side of the conversation as input to sharpen the next transcript (more on that below). It runs at $0.45/hr, the same as the previous generation, so upgrading is a one-line change. The sample_rate=16000 has to match your audio source: the model expects 16kHz mono PCM, and a mismatch produces garbage, not an error. Formatting—punctuation and casing on finalized turns—is automatic on this model; there's no flag to toggle.
Notice the finally block. That client.disconnect(terminate=True) is not optional—skip it and you'll learn why the hard way. More on that next.
The session-termination warning you can't skip
This is the section I'd tattoo on the back of every developer's hand if I could.
Streaming is billed per session duration—the entire time the connection is open, not just when audio is flowing. And here's the trap: if you don't explicitly terminate a session, it stays open. If your script crashes, your laptop sleeps, or you forget the disconnect call, the session keeps running. It will auto-close after three hours—and bill you for the full three hours, audio or not.
These are "zombie sessions," and they're the single most common cause of surprise streaming bills. I've seen a forgotten loop rack up hours of phantom usage overnight.
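To put a number on it, here's the arithmetic as a tiny sketch. It uses the $0.45/hr rate quoted in this article and assumes billing scales linearly with open-connection time; the actual billing granularity is on the pricing page, not something this snippet knows.

```python
def session_cost(open_seconds: float, rate_per_hour: float) -> float:
    """Cost of one session billed on open-connection time (assumed linear)."""
    return open_seconds / 3600 * rate_per_hour


RATE = 0.45  # $/hr for Universal-3.5 Pro Realtime, as quoted in this article

five_minute_call = session_cost(5 * 60, RATE)
zombie = session_cost(3 * 3600, RATE)  # left open until the three-hour auto-close

print(f"5-minute call:  ${five_minute_call:.4f}")
print(f"zombie session: ${zombie:.2f}")
```

One zombie is small change. A retry loop that opens a fresh session every few minutes and never closes any of them is where the overnight bill comes from.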
The fix is simple and non-negotiable: always close your session.
# Explicit termination — do this every time
client.disconnect(terminate=True)
Wrap your streaming logic so termination runs no matter what:
try:
    client.stream(audio_source)
except KeyboardInterrupt:
    pass
finally:
    client.disconnect(terminate=True)  # runs on crash, Ctrl+C, or normal exit
If you're working at the raw WebSocket level instead of the SDK, send a terminate message before closing the socket. The principle is identical: every session you open, you close. Check the pricing page so you know exactly what an open session costs per hour—it makes the discipline a lot easier to internalize.
Do this and the surprise bills disappear. It really is that mechanical.
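If you'd rather not retype the try/finally everywhere, one way to package it is a context manager. This is a sketch: terminating is a hypothetical helper name, and the only thing it assumes about the client is the disconnect(terminate=True) call shown above. FakeClient exists purely so the snippet runs without a network connection.

```python
from contextlib import contextmanager


@contextmanager
def terminating(client):
    """Yield the client, then always call disconnect(terminate=True)."""
    try:
        yield client
    finally:
        client.disconnect(terminate=True)


class FakeClient:
    """Stand-in so the sketch runs offline."""

    def __init__(self):
        self.terminated = False

    def disconnect(self, terminate=False):
        self.terminated = terminate


fake = FakeClient()
try:
    with terminating(fake):
        raise RuntimeError("simulated crash mid-stream")
except RuntimeError:
    pass  # the session was still closed on the way out
```

With a real client the body of the with block would be the client.stream(...) call. Like any finally clause, this covers exceptions and Ctrl+C, not a killed process or a sleeping laptop.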
Advanced configuration
Once the basics work, these parameters are where you tune for your specific use case. The sections below cover the ones you'll reach for most.
Intelligent endpointing
Endpointing is how the model decides a speaker has finished. Universal-3.5 Pro Realtime uses punctuation-based turn detection: a turn ends when the model emits terminal punctuation (. ? !), and if no punctuation lands within max_turn_silence, the turn closes on silence anyway. The mode preset sets sensible defaults for the parameters below, and you can override any of them. For a fast-paced voice agent you want short silences and quick turns; for medical dictation you want patience so you don't chop a sentence in half.
client.connect(
    StreamingParameters(
        speech_model="universal-3-5-pro",
        sample_rate=16000,
        mode="balanced",
        min_turn_silence=160,
        max_turn_silence=2400,
    )
)
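To build intuition for how those two conditions interact, here's a toy model of the rule as described above. It is an illustration only, not the service's actual logic: turn_should_end is a hypothetical function, the silence values are assumed to be milliseconds, and min_turn_silence is deliberately left out because its exact role isn't spelled out here.

```python
TERMINAL_PUNCTUATION = (".", "?", "!")


def turn_should_end(text: str, silence_ms: int, max_turn_silence: int = 2400) -> bool:
    """Toy model: terminal punctuation ends a turn, otherwise long silence does."""
    if text.rstrip().endswith(TERMINAL_PUNCTUATION):
        return True  # the transcript ends in terminal punctuation
    return silence_ms >= max_turn_silence  # fallback: close on silence


print(turn_should_end("What's your email?", silence_ms=0))
print(turn_should_end("my email is", silence_ms=500))
print(turn_should_end("my email is", silence_ms=2400))
```

The middle case is the one that matters for tuning: a trailing fragment with no punctuation stays open until the silence budget runs out, which is why a large max_turn_silence feels patient and a small one feels snappy.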
Agent context
Universal-3.5 Pro Realtime is the first streaming model that can take your agent's side of the conversation as input. When your voice agent asks "What's your email address?", pass that question via agent_context and the model knows an email is coming—so it transcribes user@assemblyai.com instead of "user at assemblyai dot com." The user side comes along for free: prior finalized turns are carried forward automatically, so you only manage the agent half.
client.connect(
    StreamingParameters(
        speech_model="universal-3-5-pro",
        sample_rate=16000,
        agent_context="What's your email address?",
    )
)
Seed agent_context at connection time with your agent's opening line, then send an UpdateConfiguration message with a fresh agent_context after each agent reply. AssemblyAI measured a 10.2% word error rate reduction from agent context across 20,000 real conversations.
Speaker labels
For multi-speaker audio, turn on real-time speaker diarization with speaker_labels=True. Each turn then carries a speaker_label, so a live meeting transcript can tag lines to Speaker A and Speaker B as the call happens, for up to 10 speakers. Pass max_speakers if you know the count in advance. When the stream ends, the model re-clusters voices with full-conversation context and sends back any corrected labels—so your saved transcript gets async-quality attribution.
Voice focus
Noisy environment? Set voice_focus to isolate the primary speaker before audio reaches the model—near-field for headsets and handsets, far-field for conference rooms, kiosks, and drive-thru speakers. Optionally tune voice_focus_threshold (a float from 0.0 to 1.0) to control how aggressively background audio is suppressed.
Temporary token auth for client-side apps
Never ship your API key to a browser or mobile app. Instead, mint a short-lived token on your server and hand that to the client:
token = client.create_temporary_token(expires_in_seconds=60)
The client uses that token to connect directly, the token expires in a minute, and your real key stays on the server. This is the right pattern for any streaming transcription running in untrusted environments.
Performance optimization
A few habits keep latency low and the experience smooth.
Render partials, reconcile on finals. Don't wait for end_of_turn to show anything. Print the running event.transcript immediately for responsiveness, then commit the formatted line when the turn finalizes. Showing only finals makes a fast model feel laggy.
Send correctly sized frames. The SDK's MicrophoneStream already chunks audio sensibly. If you're feeding custom audio, aim for 50–250ms frames. Bigger chunks add latency; tiny ones add overhead.
Pick the right region and model. Latency is end to end—inference plus network. If your users are far from the inference region, that round trip shows up as lag no matter how fast the model is. And match the model to the job: latency-critical conversational work wants Universal-3.5 Pro Realtime, while high-volume, cost-sensitive captioning can ride the value Universal-Streaming tiers starting at $0.15/hr (see assemblyai.com/pricing).
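For the frame-sizing tip, the conversion from milliseconds to bytes is simple enough to keep next to your audio code. This sketch assumes 16-bit (2-byte) mono samples, which is the usual meaning of "PCM" in this context but is an assumption here; frame_bytes is a hypothetical helper name.

```python
def frame_bytes(frame_ms: int, sample_rate: int = 16000, bytes_per_sample: int = 2) -> int:
    """Bytes in one mono PCM frame of the given duration."""
    return sample_rate * bytes_per_sample * frame_ms // 1000


for ms in (50, 100, 250):
    print(f"{ms} ms -> {frame_bytes(ms)} bytes")
```

Under that assumption the 50–250ms window works out to 1,600–8,000 bytes per frame at 16kHz, and a 3,200-byte read is a 100ms frame.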
Custom audio sources
MicrophoneStream is convenient, but client.stream() accepts any iterable that yields PCM audio chunks. That means you can pipe in audio from a file, a telephony bridge, or another service:
def file_chunks(path, chunk_size=3200):
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
client.stream(file_chunks("call.pcm"))
The only hard requirement is that the audio is 16kHz mono PCM matching the sample_rate you set. If your source is a different rate—say a Twilio stream at 8kHz—resample it before streaming. A mismatch won't error; it'll just transcribe nonsense, which is a maddening bug to chase.
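Since a mismatch fails silently, it's worth making your own code fail loudly instead. When the source is a WAV file rather than raw PCM, the standard-library wave module can check the header before any audio goes out. This is a sketch: wav_chunks is a hypothetical helper, and the 16-bit check is an assumption about what "PCM" means here.

```python
import wave


def wav_chunks(path, expected_rate=16000, frame_ms=100):
    """Yield raw PCM chunks from a WAV file after checking its format.

    This is a generator, so the checks run when iteration starts.
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:
            raise ValueError(f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != expected_rate:
            raise ValueError(f"expected {expected_rate} Hz, got {wav.getframerate()} Hz")
        frames_per_chunk = expected_rate * frame_ms // 1000
        while chunk := wav.readframes(frames_per_chunk):
            yield chunk
```

It would slot in the same way as file_chunks, for example client.stream(wav_chunks("call.wav")), where call.wav is a placeholder file name. An 8kHz recording raises ValueError up front instead of producing a page of nonsense.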
This flexibility is what lets the same Python code back everything from a desktop dictation tool to a contact-center pipeline. There's a good survey of live transcription tools if you want to see how the ecosystem fits together.
Troubleshooting
The bugs I see most, and their fixes:
- Garbled transcripts. Almost always a sample-rate mismatch. Confirm your audio is 16kHz mono PCM and that sample_rate matches.
- Connection refused or 401. Bad or missing API key, or an expired temporary token. Check ASSEMBLYAI_API_KEY and your token's expires_in_seconds.
- Partials never finalize. Your endpointing is too patient. Lower min_turn_silence or max_turn_silence.
- Speakers get cut off. The opposite problem—raise min_turn_silence.
- Session seems to hang open. You didn't call disconnect(terminate=True). See the warning section; this one costs money.
Production deployment
Moving from a working script to production comes down to a handful of disciplines.
Always terminate, defensively. Use try/finally so sessions close on crashes and exceptions, not just clean exits. Add a watchdog that closes any session running longer than you'd ever expect. The 3-hour auto-close is a backstop, not a billing strategy.
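A watchdog can be as small as a timer thread. The sketch below is one possible shape, with assumptions worth checking against the SDK before you rely on it: SessionWatchdog is a hypothetical class, and it assumes disconnect(terminate=True) is safe to call from a second thread and causes a blocked client.stream() to return.

```python
import threading


class SessionWatchdog:
    """Hypothetical guard: force-closes a session that outlives max_seconds."""

    def __init__(self, client, max_seconds: float):
        self.fired = threading.Event()  # set once the watchdog has closed the session
        self._client = client
        self._timer = threading.Timer(max_seconds, self._expire)
        self._timer.daemon = True  # never keep the process alive just for the timer

    def _expire(self):
        self._client.disconnect(terminate=True)
        self.fired.set()

    def __enter__(self):
        self._timer.start()
        return self

    def __exit__(self, *exc_info):
        self._timer.cancel()  # normal exit: the caller's own finally closes the session
        return False
```

Usage would wrap the streaming call, for example with SessionWatchdog(client, max_seconds=1800): around the try/finally block, where 1800 is just an illustrative ceiling. Pick a limit comfortably above your longest legitimate session.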
Authenticate clients with temporary tokens. Server mints, client connects. Never the API key in untrusted code.
Handle reconnection. Networks drop. Catch the error event, and reconnect with backoff rather than hammering the endpoint. Buffer a little audio client-side so a brief drop doesn't lose words.
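"Reconnect with backoff" usually means exponential delays with jitter, so a fleet of clients doesn't retry in lockstep. Here's a generic sketch, independent of the SDK: connect is whatever callable your app uses to open a session, and catching ConnectionError is an assumption about how failures surface in your code, not a statement about what the SDK raises.

```python
import random
import time


def connect_with_backoff(connect, attempts=6, base=0.5, cap=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping with jittered exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        if attempt:
            # Full jitter: a random delay up to the capped exponential ceiling.
            ceiling = min(cap, base * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
        try:
            return connect()
        except ConnectionError as error:  # assumption: failures surface as ConnectionError
            last_error = error
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

The sleep parameter is injectable so you can test the retry logic without waiting. In a real app, the error handler would signal the main loop to call this rather than reconnecting from inside the callback.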
Monitor session duration. Since billing is per session, log every session's open and close time. An alert on long-running sessions catches zombies before they cost you. If you're routing transcripts into an LLM downstream, the LLM Gateway gives you a single place to manage those calls.
Match the model to scale. Prototype on Universal-3.5 Pro Realtime, measure your real latency and word error rate on your own audio, then decide whether a value tier holds up. Speaking of accuracy—don't trust a single WER number; here's why word error rate is broken as a standalone metric.
Frequently asked questions
What is Universal-3.5 Pro Realtime?
Universal-3.5 Pro Realtime is AssemblyAI's flagship streaming speech-to-text model—generally available and the recommended default for real-time voice agents. You select it with speech_model="universal-3-5-pro" over a WebSocket connection to wss://streaming.assemblyai.com/v3/ws. It transcribes 18 languages with mid-sentence code-switching, runs at $0.45/hr, and is the first streaming model that can take your agent's question as input to sharpen the next transcript.
How do I transcribe audio in real time in Python with AssemblyAI?
Install the SDK with pip install "assemblyai[extras]", create a StreamingClient with your API key, register a Turn event handler, and call client.stream() with an audio source like MicrophoneStream. The client opens a WebSocket, returns partial transcripts as audio arrives, and finalizes each turn when the speaker pauses. Always call client.disconnect(terminate=True) when you're done so the session closes and stops billing.
What's the best real-time speech-to-text model for voice agents?
AssemblyAI's Universal-3.5 Pro Realtime is purpose-built for voice agents, with flagship accuracy on the messy audio agents actually face—telephony, accents, background noise, and rapid turn-taking. It adds agent context (the model sees your agent's question), punctuation-based turn detection, real-time speaker diarization, and keyterm prompting, across 18 languages. For high-volume, cost-sensitive workloads, the value Universal-Streaming tiers start at $0.15/hr.
How accurate is Universal-3.5 Pro Realtime compared to other streaming models?
On a Pipecat benchmark of real agent conversations, Universal-3.5 Pro Realtime posted a 6.99% word error rate—lower (better) than Google Chirp3 (9.04%), ElevenLabs Scribe v2 (9.76%), and Deepgram Flux (15.58%). Passing the agent's question via agent context cut word error rate by a further 10.2% across 20,000 conversations. Accuracy on your own audio is what matters most, so benchmark on a representative sample before you commit.
How does AssemblyAI detect end-of-turn, and how do I handle partial results?
Universal-3.5 Pro Realtime uses punctuation-based turn detection: a turn ends when the model emits terminal punctuation (. ? !), and if no punctuation lands within max_turn_silence, it ends on silence anyway. Each TurnEvent carries the running transcript—render it immediately for responsiveness, then commit the line when event.end_of_turn becomes true. Tune turn timing with the mode preset plus min_turn_silence and max_turn_silence.
How much does AssemblyAI's real-time streaming cost, and how do I avoid surprise charges?
Universal-3.5 Pro Realtime is $0.45/hr and the value Universal-Streaming tiers start at $0.15/hr, but the catch is that streaming is billed per session duration: the whole time the WebSocket is open, not just when audio flows. An unclosed session auto-closes after three hours and bills the full duration, which is the main cause of surprise "zombie session" charges. Always call client.disconnect(terminate=True) inside a try/finally block so sessions close on crashes and clean exits alike.





