March 24, 2026

Real-time transcription in Python with Universal-3 Pro Streaming

Real time transcription Python tutorial: stream mic audio via WebSockets for sub-second speech-to-text with punctuation, setup steps, and runnable code.

Kelsey Foster

Growth

Universal-3 Pro Streaming

Streaming Speech-to-Text

Reviewed by

Table of contents

[Visible on live site]

This tutorial shows you how to build a real-time speech-to-text application in Python that transcribes speech as you speak, delivering results in under 300 milliseconds. You'll create a streaming transcription system that processes live microphone input and displays formatted text with proper punctuation and timing information.

You'll use AssemblyAI's Universal-3 Pro Streaming model through WebSocket connections, the Python SDK for audio processing, and PyAudio for microphone capture. The tutorial covers setting up event handlers, configuring turn detection parameters, and implementing advanced features like dynamic keyterms prompting and mid-stream configuration updates for production voice applications.

What is real-time transcription?

Real-time transcription is the process of converting speech-to-text as you speak, not after you’re done talking. This means you get transcripts in under 300 milliseconds while the conversation continues.

Think of it like live TV captions—words appear on screen as people talk.

Your Python application sends small chunks of audio (20–100 milliseconds each) through a WebSocket connection and receives text back immediately.

Here's what makes real-time transcription different from regular transcription:

WebSocket connections: Keep an open channel between your app and transcription service instead of sending complete files
Streaming sessions: Process audio continuously until you close the connection
Turn objects: Data structures that contain transcript text, timing info, and metadata about each speaking turn
Voice activity detection: Automatically identifies when someone starts and stops speaking

When to use streaming transcription

You need real-time transcription when your application requires immediate responses to speech. Voice agents can't wait two seconds for transcripts without breaking conversation flow. Live captioning needs instant text to stay synchronized with speakers.

The best use cases include voice assistants, customer service bots, and meeting transcription where people interact with the text as it appears. If you're transcribing pre-recorded content like podcasts or videos, batch transcription works better since speed doesn't matter.

Before you start

You'll need Python 3.7 or higher and an AssemblyAI API key to access Universal-3 Pro Streaming. Sign up for a free account to get your API key—it takes less than a minute.

Universal-3 Pro Streaming costs $0.45 per hour of audio processed. Free accounts can start up to 5 new sessions per minute, which works fine for development. Paid accounts start with 100 new sessions per minute and unlimited concurrent connections.

Basic requirements:

Python 3.7+: With pip package manager for installing libraries
Microphone access: For capturing audio input from your system
Internet connection: WebSocket connections need stable connectivity
4GB RAM minimum: For smooth real-time audio processing

Create an account to access your API key.

You'll use an API key to authenticate the Python SDK in this tutorial. Create an account to access Universal-3 Pro Streaming.

Get API key

Set up your Python environment

Create a separate folder for your transcription project to keep packages organized. Virtual environments prevent conflicts between different Python projects on your system.

mkdir realtime-transcription
cd realtime-transcription
python -m venv venv

Activate your virtual environment before installing packages:

# On macOS/Linux
source venv/bin/activate

# On Windows
.\venv\Scripts\activate

Install the required packages for real-time transcription:

pip install assemblyai pyaudio python-dotenv

assemblyai: Python SDK for Universal-3 Pro Streaming API
pyaudio: Handles microphone input and audio streaming
python-dotenv: Manages environment variables securely

Your system needs audio processing libraries that vary by operating system:

Operating System	Installation Command	Package Manager
macOS	`brew install portaudio`	Homebrew
Ubuntu/Debian	`sudo apt install portaudio19-dev`	APT
Windows	Download wheel from PyPI	Direct install

Create a .env file in your project folder to store your API key:

ASSEMBLYAI_API_KEY=your_api_key_here

Add .env to .gitignore so you don't accidentally commit your credentials:

echo ".env" >> .gitignore
echo "venv/" >> .gitignore

Implement real-time transcription with Universal-3 Pro

Universal-3 Pro processes audio through WebSocket connections that the Python SDK manages automatically. You focus on sending audio and handling the transcripts that come back.

Create a file called transcribe.py and add these imports:

import os
from dotenv import load_dotenv
import assemblyai as aai
from assemblyai.streaming.v3 import (
    StreamingClient, StreamingClientOptions, StreamingEvents,
StreamingParameters
)

load_dotenv()
aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')

Understand Universal-3 Pro transcription responses

Universal-3 Pro returns transcripts as Turn objects that represent complete speaking segments. Each Turn contains formatted text with punctuation, word timing, and metadata about the transcription state.

The model uses punctuation to detect when someone finishes speaking. This means you get properly formatted sentences instead of run-on text.

Property	Type	Description	Voice Agent Use Case
`turn_order`	Integer	Sequential turn counter	Track conversation flow
`transcript`	String	Formatted text with punctuation	Display final text
`utterance`	String	Complete utterance with partial words	Generate responses early
`words`	Array	Individual words with timing data	Sync with audio playback
`end_of_turn`	Boolean	Shows if turn is complete	Trigger response generation

The end_of_turn property tells you when someone finished a complete thought. When this is true, the transcript is final and won't change. When it's false, more text might come for this speaking turn.

Key differences in Universal-3 Pro:

Partial turns: Appear when there's silence but no punctuation—they won't be revised
Final turns: Include punctuation and proper formatting
Immutable transcripts: Once sent, partial transcripts never change

Set up event handlers

Event handlers are functions that run when specific things happen during transcription. You need to define these before creating your client so they're ready when events occur.

def on_begin(client, event):
    print(f"Session started - ID: {event.id}")
    print("Speak into your microphone...\n")
def on_turn(client, event):
    if event.end_of_turn:
        print(f"✓ {event.transcript}")
    else:        print(f"... {event.transcript}", end='\r')
def on_terminated(client, event):
    print(f"\nSession ended - {event.audio_duration_seconds}s
processed")
def on_error(client, error):
    print(f"Error: {error}")

The on_turn handler shows you both partial and final transcripts. Voice agents typically wait for end_of_turn: true before generating responses since these transcripts include proper punctuation and formatting.

Create and run the streaming client

The StreamingClient handles WebSocket connections and audio processing automatically. You configure it with your preferences and let it manage the technical details.

def start_transcription():
    options = StreamingClientOptions(
        api_key=aai.settings.api_key,
        api_host="streaming.assemblyai.com"
    )
    client = StreamingClient(options)
    
    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Termination, on_terminated)
    client.on(StreamingEvents.Error, on_error)
    
    params = StreamingParameters(
        speech_model="u3-rt-pro",
        sample_rate=16000,
        format_turns=True
    )
    
    try:
        client.connect(params)
        mic_stream = aai.extras.MicrophoneStream(sample_rate=16000)
        client.stream(mic_stream)
    except KeyboardInterrupt:
        print("\nStopping transcription...")
    finally:
        client.disconnect(terminate=True)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Termination, on_terminated)
    client.on(StreamingEvents.Error, on_error)
    
    # Configure transcription settings
    params = StreamingParameters(
        speech_model="u3-rt-pro",
        sample_rate=16000,
        format_turns=True
    )
    
    try:
        # Connect to Universal-3 Pro
        client.connect(params)
        
        # Start streaming from microphone
        mic_stream = 
aai.extras.MicrophoneStream(sample_rate=16000)
        client.stream(mic_stream)
        
    except KeyboardInterrupt:
        print("\nStopping transcription...")
    finally:
        client.disconnect(terminate=True)

if __name__ == "__main__":
    start_transcription()

The speech_model="u3-rt-pro" parameter selects Universal-3 Pro for fast voice agent performance. Setting sample_rate=16000 matches your microphone input for accurate transcription.

Complete working example

Here's everything combined into a working transcription application:

import os
from dotenv import load_dotenv
import assemblyai as aai
from assemblyai.streaming.v3 import (
    StreamingClient, StreamingClientOptions, StreamingEvents, StreamingParameters
)

load_dotenv()
aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')

def on_begin(client, event):
    print(f"Session started - ID: {event.id}")
    print("Speak into your microphone...\n")

def on_turn(client, event):
    if event.end_of_turn:
        print(f"✓ {event.transcript}")
    else:
        print(f"... {event.transcript}", end='\r')

def on_terminated(client, event):
    print(f"\nSession ended - {event.audio_duration_seconds:.1f}s
processed")

def on_error(client, error):
    print(f"Error: {error}")

def start_transcription():
    options = StreamingClientOptions(
        api_key=aai.settings.api_key,
        api_host="streaming.assemblyai.com"
    )
    client = StreamingClient(options)
    
    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Termination, on_terminated)
    client.on(StreamingEvents.Error, on_error)
    
    params = StreamingParameters(
        speech_model="u3-rt-pro",
        sample_rate=16000,
        format_turns=True,
        min_turn_silence=100,
        max_turn_silence=1200
    )
    
    try:
        client.connect(params)
        client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
    except KeyboardInterrupt:
        print("\n\nStopping...")
    finally:
        client.disconnect(terminate=True)

if __name__ == "__main__":
    start_transcription()
aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')

def on_begin(client, event):
    print(f"Session started - ID: {event.id}")
    print("Speak into your microphone...\n")

def on_turn(client, event):
    if event.end_of_turn:
        print(f"✓ {event.transcript}")
    else:
        print(f"... {event.transcript}", end='\r')

def on_terminated(client, event):
    duration = event.audio_duration_seconds
    print(f"\nSession ended - {duration:.1f} seconds processed")

def on_error(client, error):
    print(f"Error: {error}")

def start_transcription():
    options = StreamingClientOptions(
        api_key=aai.settings.api_key,
        api_host="streaming.assemblyai.com"
    )
    
    client = StreamingClient(options)
    
    client.on(StreamingEvents.Begin, on_begin)
    client.on(StreamingEvents.Turn, on_turn)
    client.on(StreamingEvents.Termination, on_terminated)
    client.on(StreamingEvents.Error, on_error)
    
    params = StreamingParameters(
        speech_model="u3-rt-pro",
        sample_rate=16000,
        format_turns=True,
        min_turn_silence=100,
        max_turn_silence=1200
    )
    
    try:
        client.connect(params)
        
        print("Initializing microphone...")
        mic_stream =
aai.extras.MicrophoneStream(sample_rate=16000)
        
        print("Streaming to Universal-3 Pro...")
        client.stream(mic_stream)
        
    except KeyboardInterrupt:
        print("\n\nStopping...")
    finally:
        client.disconnect(terminate=True)

if __name__ == "__main__":
    print("Real-time Transcription with Universal-3 Pro")
    print("-" * 45)
    start_transcription()

Run your script to start transcribing:

python transcribe.py

You'll see partial transcripts that update in real-time, followed by final transcripts with punctuation.

Experiment with real-time transcription in your browser

Experiment with Universal-3 Pro Streaming without writing code. See partial and final turns update live as you speak.

Open playground

Configure advanced options

Production applications need specific configurations to balance speed, accuracy, and user experience. Universal-3 Pro provides parameters that control how it detects when someone stops speaking and formats the text.

Parameter	Default	Range	Use Case	Performance Impact
`min_turn_silence`	100ms	50-500ms	Quick responses	Lower = faster, may split words
`max_turn_silence`	1200ms	500-5000ms	Long pauses	Higher = better for slow speakers
`format_turns`	true	true/false	Raw vs formatted text	Not applicable to u3-rt-pro model
`keyterms_prompt`	empty	Custom terms	Domain-specific words	Improves accuracy for specific terms

Configure turn detection

Universal-3 Pro uses punctuation to detect when someone finishes speaking, not confidence scores like other systems. When silence reaches min_turn_silence (default 100ms), it transcribes the audio and looks for periods, question marks, or exclamation points.

If it finds punctuation, you get a final transcript with end_of_turn: true. Without punctuation, you get a partial transcript and the system keeps listening until max_turn_silence (default 1200ms) forces an end.

params = StreamingParameters(
    speech_model="u3-rt-pro",
    sample_rate=16000,
    min_turn_silence=150,  # Wait a bit longer for better
accuracy
    max_turn_silence=2000  # Allow longer pauses for thinking
)

Tuning silence parameters:

Too low min_turn_silence (under 100ms): Splits phone numbers and email addresses
Too high min_turn_silence (over 300ms): Makes the system feel slow
Too low max_turn_silence: Cuts off slow speakers
Too high max_turn_silence: Creates awkward pauses in conversations

You can also force turns to end based on external signals like button presses:

# End turn immediately when user presses stop
def on_button_press():
    client.send({"type": "ForceEndpoint"})

Update configuration mid-stream with keyterms

The UpdateConfiguration message lets you change settings without reconnecting. This works great for voice agents that handle different conversation stages requiring different vocabulary.

A medical intake bot needs different keyterms when collecting patient information versus discussing symptoms:

def update_keyterms_for_stage(client, stage):
    stage_keyterms = {
        "identification": ["account number", "member ID", "date
of birth", "policy number"],
        "medical": ["metformin", "hypertension", "MRI scan",
"blood pressure"],
        "payment": ["credit card", "expiration date", "billing
address"],
    }
    keyterms = stage_keyterms.get(stage, [])
    
    client.send({
        "type": "UpdateConfiguration",
        "keyterms_prompt": keyterms or None
    })

Clear keyterms when they're no longer relevant to prevent false matches:

# Remove medical terms after symptom discussion
config = {
    "type": "UpdateConfiguration", 
    "keyterms_prompt": None
}
client.send(config)

You can adjust other settings mid-stream for specific interactions:

Increase max_turn_silence when expecting long credit card numbers
Decrease min_turn_silence for rapid-fire Q&A sessions
Update prompt text to guide transcription for new topics

Control text formatting

Universal-3 Pro applies punctuation and capitalization as an integral part of its punctuation-based turn detection system. The formatting is built into the model's processing and cannot be disabled.

The format_turns parameter is not applicable to the u3-rt-pro model:

# format_turns parameter has no effect on u3-rt-pro model
params = StreamingParameters(
    speech_model="u3-rt-pro",
    sample_rate=16000,
    format_turns=True  # Not applicable to u3-rt-pro
)

Voice agents receive formatted text with proper punctuation automatically. This enables accurate intent analysis and natural language processing without additional formatting steps.

Use temporary authentication tokens

Never put API keys directly in client-side code. Temporary tokens provide secure authentication without exposing your credentials. Generate tokens on your server and pass them to client applications.

Server-side token generation:

import os
from flask import Flask, jsonify
import assemblyai as aai
from assemblyai.streaming.v3 import StreamingClient,
StreamingClientOptions

app = Flask(__name__)
aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')

@app.route('/get-token')
def get_token():
    options = StreamingClientOptions(
        api_key=aai.settings.api_key,
        api_host="streaming.assemblyai.com"
    )
    client = StreamingClient(options)
    token = client.create_temporary_token(
        expires_in_seconds=600,
        max_session_duration_seconds=1800
    )
    return jsonify({'token': token})

Client-side usage with the token:

import requests
from assemblyai.streaming.v3 import StreamingClient,
StreamingClientOptions

response = requests.get('https://your-server.com/get-token')
token = response.json()['token']

options = StreamingClientOptions(token=token,
api_host="streaming.assemblyai.com")
client = StreamingClient(options)

Real-time transcription use cases

Voice agents for customer service use Universal-3 Pro to handle account questions, payment processing, and technical support. The fast response times enable natural conversation flow that feels helpful rather than robotic.

Live captioning transforms meetings, conferences, and educational sessions by making content accessible in real-time. Universities report higher engagement when providing live captions for remote learning.

Common applications include:

Meeting transcription: Automatic note-taking with speaker attribution
Voice assistants: Hands-free control for smart devices and applications
Customer service bots: Automated support that escalates complex issues to humans
Healthcare documentation: Voice-powered clinical note-taking during patient visits
IVR modernization: Natural speech instead of touch-tone menu navigation

Healthcare applications benefit from keyterms prompting for medical terminology accuracy. Doctors speak naturally during examinations while the system captures clinical notes with correct medication names and diagnostic codes.

Final words

Real-time transcription with Universal-3 Pro Streaming gives you sub-300ms response times through punctuation-based turn detection and WebSocket connections. The Python SDK handles the complex audio processing while you focus on building your voice application features.

Universal-3 Pro's streaming transcription model delivers industry-leading accuracy with features like dynamic keyterms prompting and mid-stream configuration updates. AssemblyAI's Voice AI infrastructure scales automatically from development to production, handling millions of hours of audio without outages or performance issues.

Explore more ways to create real-time voice applications.

Ship sub-300ms streaming transcripts using the Python SDK and Universal-3 Pro Streaming. Create your account and start building today.

Start building

FAQ

What audio sample rates work best with Universal-3 Pro Streaming?

Use 16kHz sample rate with mono linear16 PCM encoding for optimal accuracy and performance. Higher sample rates don't improve transcription quality but increase bandwidth usage unnecessarily.

How do partial transcripts differ from final transcripts in Universal-3 Pro?

Partial transcripts have end_of_turn: false and appear when silence reaches min_turn_silence without terminal punctuation. Final transcripts have end_of_turn: true with complete punctuation—these never change once emitted.

Can I stream audio from files instead of a microphone?

Yes, replace MicrophoneStream with a custom function that reads audio files in chunks. Add delays between chunks to maintain real-time streaming pace and avoid overwhelming the API.

When should I disable text formatting in Universal-3 Pro?

The format_turns parameter is not applicable to Universal-3 Pro (u3-rt-pro model) as formatting is integral to its punctuation-based turn detection system. The model always returns formatted text with proper punctuation.

How do I handle WebSocket connection errors in production?

Implement retry logic with exponential backoff and connection health checks. Monitor for network interruptions and automatically reconnect with the same session parameters when connections drop.

Real-time transcription in Python with Universal-3 Pro Streaming

What is real-time transcription?

When to use streaming transcription

Before you start

Set up your Python environment

Implement real-time transcription with Universal-3 Pro

Understand Universal-3 Pro transcription responses

Set up event handlers

Create and run the streaming client

Complete working example

Configure advanced options

Configure turn detection

Update configuration mid-stream with keyterms

Control text formatting

Use temporary authentication tokens

Real-time transcription use cases

Final words

FAQ

What audio sample rates work best with Universal-3 Pro Streaming?

How do partial transcripts differ from final transcripts in Universal-3 Pro?

Can I stream audio from files instead of a microphone?

When should I disable text formatting in Universal-3 Pro?

How do I handle WebSocket connection errors in production?

6 best orchestration tools to build AI voice agents in 2026

Build a real-time AI voice bot using Python, AssemblyAI, and ElevenLabs

Streaming speaker diarization: How to identify who's speaking in real time

Using the Voice Agent API alongside an existing voice stack

Supervised Machine Learning For Beginners

Why Language Models Became Large Language Models And The Hurdles In Developing LLM-based Applications

AssemblyAI Named a Leader in G2’s Spring 2026 Voice Recognition Report

Getting Started with Hugging Face's Gradio

Real-time transcription in Python with Universal-3 Pro Streaming

What is real-time transcription?

When to use streaming transcription

Before you start

Set up your Python environment

Implement real-time transcription with Universal-3 Pro

Understand Universal-3 Pro transcription responses

Set up event handlers

Create and run the streaming client

Complete working example

Configure advanced options

Configure turn detection

Update configuration mid-stream with keyterms

Control text formatting

Use temporary authentication tokens

Real-time transcription use cases

Final words

FAQ

What audio sample rates work best with Universal-3 Pro Streaming?

How do partial transcripts differ from final transcripts in Universal-3 Pro?

Can I stream audio from files instead of a microphone?

When should I disable text formatting in Universal-3 Pro?

How do I handle WebSocket connection errors in production?

Related posts

6 best orchestration tools to build AI voice agents in 2026

Build a real-time AI voice bot using Python, AssemblyAI, and ElevenLabs

Streaming speaker diarization: How to identify who's speaking in real time

Using the Voice Agent API alongside an existing voice stack

Supervised Machine Learning For Beginners

Why Language Models Became Large Language Models And The Hurdles In Developing LLM-based Applications

AssemblyAI Named a Leader in G2’s Spring 2026 Voice Recognition Report

Getting Started with Hugging Face's Gradio