Self-Hosted Streaming

The AssemblyAI Self-Hosted Streaming Solution provides a secure, low-latency real-time transcription service that can be deployed within your own infrastructure. Audio, transcripts, and PII never leave your network — only license validation and usage metadata are transmitted back to AssemblyAI.

Self-hosted streaming requires an upfront commercial commitment of $20,000. Contact our sales team to discuss your needs and learn more about our self-hosted offering.

The deployment instructions, Compose files, nginx configuration, and example clients are maintained in the public streaming-self-hosting-stack repository. This page covers what self-hosted streaming is, what you need to run it, and how the stack is shaped. Go to the repo for the actual setup steps.

What you can self-host

Self-hosted streaming ships as two separate stacks. Each stack serves one model family, runs from its own Docker Compose file, and uses its own GPU. You pick the stack that matches the model you want to serve — they are not designed to run side by side.

Stack	Model(s) served	Compose file	Best for
Universal Streaming	Universal Streaming English + Multilingual	`docker-compose.yml`	English and multilingual transcription workloads, telephony, captioning
Universal-3.5 Pro Streaming	Universal-3.5 Pro	`docker-compose.u3pro.yml`	Voice agents — short utterances, low end-of-turn latency, continuous partials

For product capabilities and accuracy details, see the Universal Streaming and Universal-3.5 Pro Streaming overview pages.

Core principle

Complete data isolation. Audio, transcripts, and PII stay inside your infrastructure. The only outbound traffic is license validation and (for usage-based contracts) usage metadata to https://usage-tracker.assemblyai.com.

System requirements

Hardware

Universal Streaming. NVIDIA T4 or newer per ASR container. We recommend at least 4 CPUs and 16 GB of RAM per ASR container.
Universal-3.5 Pro Streaming. NVIDIA L4, A10, A100, L40S, H100, or equivalent with at least 24 GB VRAM. The container also bundles ~14 GB of model weights, so plan disk space accordingly. T4 GPUs are not sufficient for Universal-3.5 Pro Streaming.
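As a rough illustration of the hardware figures above, the following sketch turns them into a small sizing helper. The names `Sizing`, `universal_streaming_sizing`, and `u3pro_gpu_ok` are hypothetical and not part of the stack. The numbers cover ASR containers only; the gateway, license proxy, and load balancer are not included, so treat the result as a lower bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sizing:
    gpus: int    # T4-or-newer GPUs
    cpus: int    # recommended CPU count
    ram_gb: int  # recommended RAM in GB


def universal_streaming_sizing(asr_containers: int) -> Sizing:
    """Recommended minimums for N Universal Streaming ASR containers.

    Uses the per-container figures from this page: one T4-or-newer GPU,
    at least 4 CPUs, and at least 16 GB of RAM per ASR container.
    """
    if asr_containers < 1:
        raise ValueError("need at least one ASR container")
    return Sizing(
        gpus=asr_containers,
        cpus=4 * asr_containers,
        ram_gb=16 * asr_containers,
    )


def u3pro_gpu_ok(vram_gb: float) -> bool:
    """Universal-3.5 Pro Streaming needs a GPU with at least 24 GB VRAM."""
    return vram_gb >= 24


# The Universal Streaming stack runs two ASR backends (English + multilingual).
print(universal_streaming_sizing(2))
# A T4 has 16 GB of VRAM, which is below the 24 GB minimum.
print(u3pro_gpu_ok(16))
```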

Software

Operating system. Linux
Container runtime. Docker and Docker Compose (v2 — the docker compose command, not docker-compose)
NVIDIA Container Toolkit. Required for Docker to access the GPU
AWS credentials. AssemblyAI provisions a scoped AWS access key for your team so you can pull container images from our private ECR registry

Architecture

Both stacks share the same gateway, load balancer, and license proxy — they only differ in the ASR backend.

Shared services (both stacks)

streaming-api — Gateway WebSocket service that clients connect to. Handles session lifecycle, audio framing, and routing to the ASR backend.
license-and-usage-proxy — Validates the license file at startup and reports usage metadata (for usage-based contracts).
streaming-asr-lb — nginx:alpine load balancer that routes ASR gRPC requests to the right backend based on the X-Model-Version header.

Universal Streaming stack

Adds two ASR backends:

streaming-asr-english — English speech recognition.
streaming-asr-multilang — Multilingual speech recognition.

Universal-3.5 Pro Streaming stack

Adds a single ASR backend:

streaming-asr-u3pro — Universal-3.5 Pro speech recognition. Available as of v0.6.0.

Connection flow

WebSocket client ──→ streaming-api:8080
                          │
                          ├─ License validation ──→ license-and-usage-proxy:8080
                          │                              │
                          ├─ Usage reporting ────────────┴──→ https://usage-tracker.assemblyai.com
                          │                                  (usage-based billing only)
                          │
                          └─ ASR (gRPC) ──────────→ streaming-asr-lb:80
                                                       │
                                                       └─ Header-based routing (X-Model-Version):
                                                          ├── en-default → streaming-asr-english:50051 (gRPC)
                                                          ├── ml-default → streaming-asr-multilang:50051 (gRPC)
                                                          └── u3-pro     → streaming-asr-u3pro:50051 (gRPC)

The load balancer only forwards to backends that are actually deployed in the running stack: the Universal Streaming stack serves en-default and ml-default, and the Universal-3.5 Pro Streaming stack serves u3-pro.
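The header-based routing in the connection flow can be modeled as a lookup table. This is only an illustrative Python model of the behavior described here, not the actual nginx configuration (which lives in the repo). The stack labels `universal-streaming` and `universal-3.5-pro` and the function `resolve_backend` are hypothetical names chosen for the example.

```python
from typing import Optional

# X-Model-Version value -> backend address, transcribed from the connection flow.
ROUTES = {
    "en-default": "streaming-asr-english:50051",
    "ml-default": "streaming-asr-multilang:50051",
    "u3-pro": "streaming-asr-u3pro:50051",
}

# Header values each stack has a deployed backend for.
# The stack labels are illustrative, not identifiers used by the stack.
DEPLOYED = {
    "universal-streaming": {"en-default", "ml-default"},
    "universal-3.5-pro": {"u3-pro"},
}


def resolve_backend(stack: str, model_version: str) -> Optional[str]:
    """Return the backend for an X-Model-Version value, or None when the
    running stack has no backend deployed for it."""
    if model_version not in DEPLOYED[stack]:
        return None
    return ROUTES[model_version]


print(resolve_backend("universal-streaming", "en-default"))
print(resolve_backend("universal-streaming", "u3-pro"))
```

The second call returns None because the Universal Streaming stack does not deploy streaming-asr-u3pro.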

Getting started

Follow the upstream repo’s README for the actual setup steps. At a high level:

Get credentials and a license file from your AssemblyAI representative — an AWS access key scoped to ECR, and a license.jwt file. The same license file works for both stacks.
Install Docker, Docker Compose, and the NVIDIA Container Toolkit. See the README’s setup section for verification commands.
Authenticate to ECR with the provided AWS credentials.
Pick a stack and configure .env with the image references from the repo’s .env.example.
Start the stack with docker compose up -d (Universal Streaming) or docker compose -f docker-compose.u3pro.yml up -d (Universal-3.5 Pro Streaming).

Universal Streaming ASR containers take roughly 2 minutes to become ready, at which point they log "Ready to serve!". The Universal-3.5 Pro Streaming ASR container takes roughly 5 minutes and logs "U3Pro ASR Server ready!". Health checks may report unhealthy during startup; that is expected.
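Because health checks can report unhealthy during startup, one option is to wait for the readiness log lines instead. The sketch below assumes you supply a `read_logs` callable that returns the current log text of one ASR container. `is_ready`, `wait_until_ready`, and the stack labels are hypothetical names, and the timeout values are arbitrary defaults rather than recommendations from the repo.

```python
import time
from typing import Callable

# Readiness log lines quoted on this page, keyed by an illustrative stack label.
READY_MARKERS = {
    "universal-streaming": "Ready to serve!",
    "universal-3.5-pro": "U3Pro ASR Server ready!",
}


def is_ready(stack: str, log_text: str) -> bool:
    """True once the container's log contains the stack's readiness line."""
    return READY_MARKERS[stack] in log_text


def wait_until_ready(
    stack: str,
    read_logs: Callable[[], str],
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll read_logs() until the readiness line appears or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready(stack, read_logs()):
            return True
        sleep(poll_s)
    return False
```

In practice `read_logs` could wrap a call to docker compose logs for a single ASR service. In the Universal Streaming stack, check streaming-asr-english and streaming-asr-multilang separately, since each one logs its own readiness line.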

Running a test client

The repo ships an example Python client under streaming_example that streams a pre-recorded WAV file to the WebSocket endpoint. It supports all three speech models via the --speech-model flag:

universal-streaming-english — Universal Streaming, English
universal-streaming-multilingual — Universal Streaming, multilingual
universal-3-5-pro — Universal-3.5 Pro Streaming

The client routes to the correct ASR backend automatically via the X-Model-Version header. Make sure the value you pass matches a backend deployed in the stack you started.
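A small preflight check can catch a mismatch between the --speech-model value and the stack you started before the client connects. The mapping below is taken from the stack table and the model list on this page. The function `check_speech_model` is a hypothetical helper and is not part of the example client.

```python
# --speech-model value -> Compose file of the stack that serves it.
SPEECH_MODEL_COMPOSE_FILE = {
    "universal-streaming-english": "docker-compose.yml",
    "universal-streaming-multilingual": "docker-compose.yml",
    "universal-3-5-pro": "docker-compose.u3pro.yml",
}


def check_speech_model(speech_model: str, running_compose_file: str) -> None:
    """Raise ValueError if the running stack does not serve speech_model."""
    required = SPEECH_MODEL_COMPOSE_FILE.get(speech_model)
    if required is None:
        raise ValueError(f"unknown speech model: {speech_model!r}")
    if required != running_compose_file:
        raise ValueError(
            f"{speech_model!r} is served by {required}, "
            f"but the running stack was started from {running_compose_file}"
        )


# Passes silently: the Universal Streaming stack serves the English model.
check_speech_model("universal-streaming-english", "docker-compose.yml")
```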

Switching between stacks

The two stacks bind the same ports (streaming-api on 8080, plus the ASR load balancer's gRPC port), so they cannot run simultaneously. To switch, stop the running stack, update .env for the new one, and start it:

# Stop the running stack first.
docker compose down                                # if Universal Streaming is running
docker compose -f docker-compose.u3pro.yml down    # if Universal-3.5 Pro Streaming is running

# Update .env for the new stack (image vars differ), then start.
docker compose up -d                               # Universal Streaming
docker compose -f docker-compose.u3pro.yml up -d   # Universal-3.5 Pro Streaming
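If you script the switch, the command sequence can be derived from the Compose file names. The sketch below only builds the argument lists and does not run them; `compose_cmd` and `switch_commands` are hypothetical helper names. Updating .env remains a manual step between the two commands.

```python
from typing import List

DEFAULT_COMPOSE_FILE = "docker-compose.yml"  # Universal Streaming
U3PRO_COMPOSE_FILE = "docker-compose.u3pro.yml"  # Universal-3.5 Pro Streaming


def compose_cmd(compose_file: str, *args: str) -> List[str]:
    """Build a docker compose command, adding -f only for a non-default file."""
    cmd = ["docker", "compose"]
    if compose_file != DEFAULT_COMPOSE_FILE:
        cmd += ["-f", compose_file]
    return cmd + list(args)


def switch_commands(current: str, target: str) -> List[List[str]]:
    """Commands to stop the current stack and start the target stack.

    Update .env for the target stack after running the first command
    and before running the second.
    """
    return [
        compose_cmd(current, "down"),
        compose_cmd(target, "up", "-d"),
    ]


for cmd in switch_commands(DEFAULT_COMPOSE_FILE, U3PRO_COMPOSE_FILE):
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run once .env has been updated for the target stack.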

Production deployment

Per-service deployment strategy, resource sizing, autoscaling thresholds, health-check tuning, and the license-and-usage-proxy /v1/status endpoint reference all live in the repo’s Production Deployment Recommendations section.

Release notes and changelog

Release notes for the self-hosted stack — including per-version model improvements, API additions, and breaking changes — live in the repo’s README changelog. Tagged releases are visible on the Releases page.

Support

For deployment questions, image access, license issues, or to report bugs, contact your AssemblyAI representative.
