Self-Hosted Streaming
The AssemblyAI Self-Hosted Streaming Solution provides secure, low-latency real-time transcription that can be deployed within your own infrastructure. This early access version is intended for design partners to evaluate and provide feedback on our self-hosted offering.
Universal-3 Pro Streaming is now available for self-hosting
Available as of v0.6.0. See the Universal-3 Pro Streaming images section below for the deployment image references, or the upstream docker-compose.u3pro.yml for the full service definition.
Self-hosted streaming requires a $20,000 upfront commitment. Contact our sales team to discuss your specific needs and to learn more about our self-hosted offerings.
Getting the latest instructions
The most up-to-date deployment instructions, configuration files, and example scripts are maintained in our private GitHub repository:
https://github.com/AssemblyAI/streaming-self-hosting-stack
Design partners are encouraged to provide their GitHub username to gain access to the repository. Please contact the AssemblyAI team directly to request access.
Core principle
- Complete data isolation: No audio data, transcript data, or personally identifiable information (PII) is ever sent to AssemblyAI servers. Only usage metadata and licensing information are transmitted.
System requirements
Hardware requirements
- GPU (Universal Streaming): NVIDIA GPU support required (any NVIDIA GPU model will work, T4 or newer recommended).
- GPU (Universal-3 Pro Streaming): Requires NVIDIA L4 / A10 / A100 / L40S / H100 or equivalent with at least 24 GB VRAM. T4 GPUs are not sufficient for U3 Pro. See the v0.6.0 changelog entry for details.
Software requirements
- Operating System: Linux
- Container Runtime: Docker and Docker Compose required
- AWS Account: Required for pulling container images from our ECR registry
Architecture
Self-hosted streaming ships as two separate stacks. Both share the same gateway, load balancer, and license proxy — they differ only in the ASR backend. Run one stack at a time.
Shared services (both stacks)
- API Service (streaming-api) - Gateway API service handling WebSocket connections
- License and Usage Proxy (license-and-usage-proxy) - License validation and usage reporting service
- ASR Load Balancer (streaming-asr-lb) - Standard nginx:alpine container with header-based routing between ASR services
Universal Streaming stack (docker-compose.yml)
Adds two ASR backends to the shared services above:
- English ASR Service (streaming-asr-english) - English speech recognition model service
- Multilingual ASR Service (streaming-asr-multilang) - Multilingual speech recognition model service
Universal-3 Pro Streaming stack (docker-compose.u3pro.yml)
Adds a single ASR backend to the shared services above:
- U3 Pro ASR Service (streaming-asr-u3pro) - Universal-3 Pro speech recognition model service, available as of v0.6.0. Targeted at voice agent scenarios. See the v0.6.0 changelog entry for capability details.
Connection flow
Both stacks route ASR requests through the same streaming-asr-lb nginx load balancer using header-based routing on X-Model-Version. The difference is which ASR backends are deployed.
Universal Streaming (docker-compose.yml)
Universal-3 Pro Streaming (docker-compose.u3pro.yml)
Prerequisites
- AssemblyAI license: Valid for the streaming self-hosted product.
- Docker & Docker Compose: Ensure Docker and Docker Compose are installed.
- GPU Support: NVIDIA Container Toolkit for GPU-enabled services.
- AWS Access: Valid AWS credentials to pull images from ECR.
Setup and deployment
1. Docker runtime with GPU support
1.1 Verify NVIDIA drivers are installed:
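One way to script this check is sketched below in Python, assuming nvidia-smi is on the PATH. The helper names (parse_gpu_csv, query_gpus) are illustrative and not part of the AssemblyAI stack; the --query-gpu and --format options are standard nvidia-smi flags.

```python
import subprocess


def parse_gpu_csv(text: str) -> list[tuple[str, int]]:
    """Parse output of
    `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits`.

    Each line looks like "Tesla T4, 15360" (memory in MiB).
    """
    gpus = []
    for line in text.strip().splitlines():
        name, mem_mib = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mem_mib.strip())))
    return gpus


def query_gpus() -> list[tuple[str, int]]:
    """Run nvidia-smi and return (name, memory_mib) per GPU.

    Raises FileNotFoundError if nvidia-smi is not installed and
    CalledProcessError if it exits with an error.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_csv(out)
```

Calling query_gpus() on the host should return at least one entry. The memory figure reported by the driver can be slightly below the nominal VRAM size of the card, so treat it as informational rather than as a strict threshold.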
1.2 Install NVIDIA Container Toolkit:
Follow the NVIDIA Container Toolkit installation guide to set up GPU support for Docker.
1.3 Verify the Docker runtime has GPU access:
2. AWS ECR authentication
AWS ECR Access: We will manually provision AWS account credentials for your team to pull container images from our private Amazon ECR registry.
3. Configure container images
Universal Streaming and Universal-3 Pro Streaming ship as two separate stacks with their own Compose files. Pick the stack that matches the model you want to serve — they are not designed to be merged into a single Compose project. Configure the .env file for the stack you plan to run.
Universal Streaming (English + Multilingual) images
Use the reference .env.example file to create a .env file with container image references:
For ease of reference in this doc, the current image references are below:
Universal-3 Pro Streaming images
To run Universal-3 Pro Streaming, use the separate docker-compose.u3pro.yml file from the upstream repo with the following image references:
The U3 Pro stack does not require the STREAMING_ASR_ENGLISH_IMAGE or STREAMING_ASR_MULTILANG_IMAGE images.
4. Have the license file ready
License File Generation: We will manually provision a .jwt license file for your team to authenticate the container. The same license file is used for both the Universal Streaming and Universal-3 Pro Streaming stacks.
Ensure you have your AssemblyAI license file in the current working directory as license.jwt, or modify the LICENSE_FILE_PATH environment variable in the relevant Compose file (docker-compose.yml for Universal Streaming, docker-compose.u3pro.yml for Universal-3 Pro Streaming) to point to your license file location.
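A small pre-flight check can catch a missing or malformed license file before the containers start. The sketch below is illustrative only: it assumes LICENSE_FILE_PATH is read from the environment with ./license.jwt as the fallback, and that the license is a signed compact JWT (three dot-separated segments). It does not verify the signature or contents; the helper names are hypothetical.

```python
from pathlib import Path


def resolve_license_path(env: dict[str, str], cwd: Path) -> Path:
    """Return LICENSE_FILE_PATH if set, otherwise <cwd>/license.jwt."""
    configured = env.get("LICENSE_FILE_PATH")
    return Path(configured) if configured else cwd / "license.jwt"


def check_license_file(path: Path) -> list[str]:
    """Return a list of problems; an empty list means the file looks plausible."""
    if not path.is_file():
        return [f"license file not found: {path}"]
    token = path.read_text().strip()
    # Assumption: a signed compact JWT has exactly three non-empty segments.
    if token.count(".") != 2 or not all(token.split(".")):
        return [f"{path} does not look like a compact JWT"]
    return []
```

For example, check_license_file(resolve_license_path(dict(os.environ), Path.cwd())) (with os imported) returns an empty list when the file is in place.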
5. Start services
Start the stack you configured in step 3. Both stacks share the same streaming-api, load balancer, and license proxy — they differ only in the ASR backend. Stop one stack with docker compose down before starting the other.
Universal Streaming (English + Multilingual)
Universal-3 Pro Streaming
The ASR service containers include built-in model weights — no separate model download required. The Universal Streaming ASR services log "Ready to serve!" when ready (typically ~2 minutes). The U3 Pro ASR service logs "U3Pro ASR Server ready!" when ready (typically ~5 minutes).
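Readiness can be detected by scanning a service's logs for the ready line quoted above. The following is a minimal sketch, not part of the shipped tooling: the function names and the "universal"/"u3pro" keys are hypothetical, and read_logs stands for whatever returns the current log text of one ASR service (for example, a wrapper around docker compose logs).

```python
import time
from typing import Callable, Iterable

# Ready lines as documented for each stack.
READY_MARKERS = {
    "universal": "Ready to serve!",
    "u3pro": "U3Pro ASR Server ready!",
}


def is_ready(log_lines: Iterable[str], stack: str) -> bool:
    """True if any log line contains the ready marker for the given stack."""
    marker = READY_MARKERS[stack]
    return any(marker in line for line in log_lines)


def wait_until_ready(
    read_logs: Callable[[], str],
    stack: str,
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll read_logs() until the ready marker appears or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready(read_logs().splitlines(), stack):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

For the Universal Streaming stack, run the check once per ASR service, since the English and multilingual containers each log their own ready line.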
Configuration
The inline docker-compose.yml and nginx_streaming_asr.conf shown in this section are for the Universal Streaming stack. For Universal-3 Pro Streaming, see the upstream repo for docker-compose.u3pro.yml and the corresponding nginx routing.
Docker Compose configuration
The docker-compose.yml file defines the service architecture:
Nginx configuration
The ASR load balancer in nginx_streaming_asr.conf uses header-based routing to direct requests to the appropriate model service based on the X-Model-Version header:
Usage reporting configuration
The license-and-usage-proxy service supports two billing modes based on your AssemblyAI license:
Flat billing mode
If your license is configured for flat billing, usage tracking is disabled. No additional configuration is required.
Usage-based billing mode
If your license is configured for usage-based billing, the proxy will automatically report usage data to AssemblyAI’s usage tracker service. You must configure the USAGE_TRACKING_API_KEY environment variable in the docker-compose.yml for the license-and-usage-proxy service.
Important Notes:
- For the API key, any key retrieved from the AssemblyAI dashboard can be used.
- At startup, the proxy validates connectivity by registering with AssemblyAI’s https://usage-tracker.assemblyai.com.
- If connectivity validation fails, the proxy will shut down.
- Usage data is batched and reported every few seconds.
- The proxy automatically retries failed requests several times.
Critical Behavior: If https://usage-tracker.assemblyai.com becomes unreachable and all retry attempts fail (after 5-60 minutes), the license-and-usage-proxy service will terminate itself. This is a fail-safe mechanism to ensure usage data integrity. Your service orchestrator should be configured to automatically replace the container with a new one.
Monitoring Recommendations:
- Monitor the proxy’s logs for warnings about failed usage reporting attempts.
- Set up alerts for proxy restarts, which may indicate persistent connectivity issues.
- If the in-memory usage queue size exceeds 1000 items, the proxy will log a warning suggesting upscaling.
Service endpoints
- WebSocket: ws://localhost:8080
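Clients connect to this endpoint and must send the Authorization: self-hosted header on every connection. The sketch below only assembles those two pieces; the helper name is hypothetical, and any query parameters your session needs are not covered here.

```python
def connection_params(host: str = "localhost", port: int = 8080) -> tuple[str, dict[str, str]]:
    """Return the WebSocket URL and the headers required by the gateway."""
    url = f"ws://{host}:{port}"
    headers = {"Authorization": "self-hosted"}
    return url, headers
```

Pass the returned URL and headers to the WebSocket client of your choice. Use the server's IP address as host when connecting from another machine.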
Running the streaming example
A Python example script is provided to demonstrate how to stream a pre-recorded audio file to the self-hosted stack.
The example script below targets the Universal Streaming stack (--language en|multi, routing to streaming-asr-english or streaming-asr-multilang). For Universal-3 Pro Streaming, use the example script in the upstream repo’s streaming_example directory, which supports --speech-model u3-rt-pro for U3 Pro routing.
Note: You can initiate a session as soon as the relevant ASR containers are healthy. Universal Streaming containers (streaming-asr-english, streaming-asr-multilang) log "Ready to serve!" when ready; the U3 Pro container (streaming-asr-u3pro) logs "U3Pro ASR Server ready!" when ready.
Setup
Change to the streaming_example directory:
Create a fresh Python virtual environment and activate it:
Install the required packages:
Python script
Save this as example_with_prerecorded_audio_file.py:
Usage
The example script (example_with_prerecorded_audio_file.py) requires a PCM 16-bit WAV file (mono channel, 16kHz sample rate).
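Before streaming a file, it can help to confirm that it matches the required format. This is a small sketch using Python's standard wave module; the function name is illustrative and not part of the example script.

```python
import wave


def wav_problems(path_or_file) -> list[str]:
    """Return reasons the WAV file does not match mono, 16-bit PCM, 16 kHz."""
    problems = []
    with wave.open(path_or_file, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected 1 channel, found {wav.getnchannels()}")
        if wav.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append(f"expected 16-bit samples, found {wav.getsampwidth() * 8}-bit")
        if wav.getframerate() != 16000:
            problems.append(f"expected 16000 Hz, found {wav.getframerate()} Hz")
    return problems
```

An empty list means the file has the expected channel count, sample width, and sample rate. Files that are not PCM WAV at all cause wave.open to raise wave.Error.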
Note on language parameter:
- Use "en" or omit the --language parameter for English transcription (routes to English ASR service)
- Use "multi" or any non-English language code for multilingual transcription (routes to multilingual ASR service)
Basic usage:
Example with multilingual transcription:
Command-line arguments:
View help:
Live microphone streaming example
This example demonstrates real-time microphone transcription using a remote self-hosted deployment. This is useful for testing your self-hosted instance from a local machine.
The script below routes by language (en → English ASR, anything else → Multilingual ASR) and only targets the Universal Streaming stack. For Universal-3 Pro Streaming, adapt the script to use the U3 Pro routing (speech_model=u3-rt-pro) or use the upstream repo’s example as a reference.
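The routing rule for the Universal Streaming stack can be summarized in a few lines. This sketch mirrors the documented behavior for illustration only; it is not the gateway's actual code, and the function name is hypothetical.

```python
from typing import Optional


def asr_service_for_language(language: Optional[str] = None) -> str:
    """Map a --language value to the ASR service that handles it."""
    # "en" or an omitted language goes to the English service.
    if language is None or language == "en":
        return "streaming-asr-english"
    # "multi" and every other language code go to the multilingual service.
    return "streaming-asr-multilang"
```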
Setup
Install the required packages:
Python script
Save this as live_microphone_streaming.py:
Usage
Basic usage (English):
Multilingual transcription:
Specific language (e.g., Spanish):
Note:
- Make sure to replace SERVER_IP in the script with your actual server IP address
- If testing locally on the same machine as the server, use localhost or 127.0.0.1
- The Authorization: self-hosted header is required for all connections
- Language routing: "en" routes to English ASR service, any other code (including "multi") routes to multilingual ASR service
Updating services
Model updates
To update to a new model version:
- Pull the new container images from ECR
- Update your .env file with the new image references
- Restart the services using Docker Compose
For the Universal Streaming stack:
For the Universal-3 Pro Streaming stack, pass the U3 Pro Compose file:
Monitoring and debugging
View service logs
Check service status
Troubleshooting
Debug commands
For the Universal Streaming stack:
For the Universal-3 Pro Streaming stack, pass the U3 Pro Compose file and use the U3 Pro service name:
Common issues
- GPU not detected: Verify NVIDIA Container Toolkit is properly installed and Docker has GPU access.
- Services not starting: Check logs for specific error messages using docker compose logs -f [service-name].
- Connection refused: Ensure all services are healthy by checking docker compose ps and reviewing health check status.
Production Deployment Recommendations
streaming-api service
- Deployment Strategy: We recommend doing Blue/Green deployments to avoid disrupting ongoing sessions. Once you fully shift the traffic to the new color, wait at least 3 hours (the max session duration) before shutting down the old color to ensure no sessions get disrupted.
- Resource Allocation: We recommend allocating 1 CPU per container with at least 2GB of RAM for better hardware utilization. For example, it’s better to have 4 containers with 1 CPU and 2GB RAM each rather than 1 container with 4 CPU and 8GB RAM.
- Autoscaling: We recommend setting up autoscaling based on the number of active sessions. A container with 1 CPU can generally handle around 32 concurrent sessions.
- Monitoring: Always monitor the logs during deployment to catch any potential issues early.
- Dependencies: For successful startup, the service depends on the license-and-usage-proxy service being up and running.
- Configuration: You can enable features like TLS encryption and structured logging via environment variables.
- Health Checks: Use the healthcheck command provided in the docker-compose.yml to monitor container health.
- Usage Reporting Behavior: After each session completes, the streaming-api reports usage to the license-and-usage-proxy with automatic retries on failure. Monitor the logs for any messages at warning level or higher.
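The Blue/Green drain window described above amounts to simple date arithmetic. A minimal sketch, assuming you record the time at which traffic was fully shifted to the new color; the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Maximum session duration stated in the deployment guidance.
MAX_SESSION_DURATION = timedelta(hours=3)


def earliest_safe_shutdown(traffic_shifted_at: datetime) -> datetime:
    """Earliest time the old color can be stopped without cutting sessions."""
    return traffic_shifted_at + MAX_SESSION_DURATION


def can_shut_down_old_color(traffic_shifted_at: datetime,
                            now: Optional[datetime] = None) -> bool:
    """True once the full drain window has elapsed."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= earliest_safe_shutdown(traffic_shifted_at)
```

Use timezone-aware datetimes for both arguments so the comparison is well defined.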
license-and-usage-proxy service
- Deployment Strategy: Do gradual rollouts to ensure stability. Consider implementing monitoring and alerting for service restarts.
- Resource Allocation: We recommend allocating 1 CPU per container with at least 2GB of RAM for better hardware utilization. For example, it’s better to have 4 containers with 1 CPU and 2GB RAM each rather than 1 container with 4 CPU and 8GB RAM.
- Monitoring: Always monitor logs during deployment to catch any potential issues early. You can set up an alert based on the responses of the /v1/status endpoint to alert you on any license issues. For usage-based billing, also monitor for usage reporting warnings and service restarts.
- Dependencies:
  - For successful startup, the service depends on a valid license being mounted on the container filesystem. To mount it, set the LICENSE_FILE_PATH environment variable to point to the license file path on the host machine.
  - For usage-based billing, the service also requires connectivity to https://usage-tracker.assemblyai.com at startup. If connectivity validation fails, the container will terminate. Ensure the USAGE_TRACKING_API_KEY environment variable is properly configured.
- Health Checks: Use the healthcheck command provided in the docker-compose.yml to monitor container health.
- Usage Reporting Resilience:
  - Network connectivity to the https://usage-tracker.assemblyai.com endpoint must be reliable for production deployments with usage-based billing.
  - Run at least a few containers behind a load balancer to ensure high availability.
License Status Endpoint
The /v1/status endpoint provides real-time information about the license validation state:
Endpoint: GET /v1/status
Response Schema:
State Descriptions:
- Ready: Initial state when the service starts, before any license validation has occurred.
- Connected: Last license validation check was successful.
- TrustBased: Last license validation check failed, but the request was within the trust window grace period, so services will remain operational.
- Failed: Last license validation check failed and the trust window has expired. streaming-api containers will shut down and stop serving requests.
Fields:
- state: Current license validation state.
- last_successful_checkin: ISO 8601 timestamp of the last successful license validation (null if never successful).
- trust_expiration: ISO 8601 timestamp when the trust window expires (null if no successful validation yet).
Recommended Alerts:
- Alert when state transitions to TrustBased (indicates license validation issues).
- Critical alert when state is Failed (services will shut down).
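The recommended alerts can be derived from successive polls of /v1/status. The sketch below assumes the response body is a JSON object containing the state field described above; the function name and the "warning"/"critical" labels are illustrative, and fetching the endpoint is left to your monitoring tool.

```python
import json
from typing import Optional

KNOWN_STATES = {"Ready", "Connected", "TrustBased", "Failed"}


def alert_level(previous_state: Optional[str], status_body: str) -> Optional[str]:
    """Return "critical", "warning", or None for one /v1/status response."""
    state = json.loads(status_body)["state"]
    if state not in KNOWN_STATES:
        raise ValueError(f"unexpected license state: {state!r}")
    if state == "Failed":
        return "critical"
    # Warn only on the transition into TrustBased, not on every poll.
    if state == "TrustBased" and previous_state != "TrustBased":
        return "warning"
    return None
```

Keep the state from the previous poll and pass it as previous_state so that a long TrustBased period raises a single warning.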
streaming-asr-english and streaming-asr-multilang services
- Deployment Strategy: Do gradual rollouts to ensure stability. Both Blue/Green and rolling deployments are good strategies, as the streaming-api can reconnect to a new streaming-asr container with minimal state loss if a persistent connection gets disrupted.
- Hardware Requirements: The services can run on NVIDIA T4 or newer GPUs. We recommend allocating at least 4 CPU and 16GB of RAM per container.
- Autoscaling: You can set up autoscaling based on the number of active sessions. A container with recommended hardware can generally handle up to 48 concurrent sessions.
- Monitoring: Always monitor logs during deployment to catch any potential issues early.
- Health Checks: Use the healthcheck command provided in the docker-compose.yml to monitor container health.
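The per-container session figures in this section translate into a rough capacity estimate. The sketch below is a back-of-the-envelope calculation only: it assumes the guideline numbers (about 32 sessions per 1-CPU streaming-api container, up to 48 per ASR container on recommended hardware) hold for your workload, and the function names are hypothetical.

```python
SESSIONS_PER_API_CONTAINER = 32  # 1-CPU streaming-api container
SESSIONS_PER_ASR_CONTAINER = 48  # ASR container on recommended hardware


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without floating point."""
    return -(-a // b)


def api_containers(peak_sessions: int) -> int:
    """Minimum streaming-api containers for the given peak concurrency."""
    return max(1, ceil_div(peak_sessions, SESSIONS_PER_API_CONTAINER))


def asr_containers(peak_sessions: int) -> int:
    """Minimum ASR containers for one model at the given peak concurrency."""
    return max(1, ceil_div(peak_sessions, SESSIONS_PER_ASR_CONTAINER))
```

Compute the ASR count separately for English and multilingual traffic, and add headroom on top of these minimums, since 48 is an upper bound rather than a target.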
streaming-asr-u3pro service
- Hardware Requirements: Universal-3 Pro Streaming requires NVIDIA L4 / A10 / A100 / L40S / H100 or equivalent with at least 24 GB VRAM. T4 GPUs are not sufficient. See the v0.6.0 changelog entry for details.
- For all other operational guidance (deployment strategy, autoscaling, monitoring, health checks), see the streaming-self-hosting-stack repo.
Changelog
v0.6.0
Universal-3 Pro Streaming — new self-hosted stack
This release introduces the Universal-3 Pro Streaming self-hosted stack via a separate docker-compose.u3pro.yml file. U3 Pro is targeted at voice agent scenarios and delivers significant improvements over the universal English model on complex entities, short utterances, and end-of-turn (EOT) latency.
Highlights of U3 Pro behavior delivered with this release:
- 22% reduction in voice agent hallucinations
- 10% reduction in voice agent WER
- 29% reduction in voice agent short-utterance error rate
- 5% reduction in medical WER
- Continuous partials during long turns — partials are emitted incrementally instead of being delayed; turns now stitch up to 60s instead of hard-cutting at 16s/32s.
- 750ms early partial of detected speech for snappier voice agent UX.
Hardware: NVIDIA L4 / A10 / A100 / L40S / H100 (24 GB+ VRAM).
Streaming API — new features
- continuous_partials query parameter - clients can opt into continuous partials during long turns.
- Structured logging - both the U3 Pro ASR server and the universal ASR server now honor USE_STRUCTURED_LOGGING, matching the streaming-api behavior.
Other improvements
- Various logging and metrics improvements across the streaming-api and ASR services.
- Bug fixes and stability improvements.
v0.5.0
English ASR model
A new English model is released, which produces already-formatted outputs directly and delivers large quality gains on digits, telephony, medical, and CI segments:
- 34% improvement on digit sequence error rate (DSER)
- 17% improvement on telephony WER
- 12% average improvement on medical WER
- 10% average improvement on CI segments WER
- ~2.4% absolute F1 score improvement on keyterms prompting
- Significantly improved timestamp accuracy — resolves overlapping and zero-duration word issues.
Multilingual ASR model
- ~70% absolute improvement in timestamp accuracy — fixes overlapping words and zero-duration word bugs.
Streaming API — new features
- Error and Warning WebSocket message types — Dedicated message types that let clients distinguish actionable errors from non-fatal warnings without relying on close codes.
- Configuration echoed in SessionBegins - The SessionBegins message now includes the resolved session configuration so clients can verify applied settings.
- Explicit speech-model selection - Clients explicitly select the speech model at session start.
Streaming API — fixes and improvements
- More specific WebSocket close codes for session termination scenarios, making client-side error handling more precise.
- Improved word_finalized events - All word finalizations are emitted (not only the last word of a turn).
Other improvements
- Various logging, metrics, and observability improvements across the streaming-api and ASR services.
- Bug fixes and stability improvements.
v0.4.0
English ASR Model
Major improvements to short utterance handling and hallucination reduction:
- 100% reduction in hallucinations
- 12.8% improvement on short utterances - Better performance for voice agent use cases
- 7.39% improvement on digit sequence error rate
- 1.75% improvement on proper nouns
- 0.46% improvement on CI segments
- 0.39% improvement on accented speech
Multilingual ASR Model
- Context biasing support - Customers can now use context biasing (model-based biasing) with the multilingual model
Other Improvements
- Increased concurrent session handling per container, leading to reduced deployment costs
- Improved observability for the license-and-usage-proxy service
- Various bug fixes and stability improvements
Current limitations
As a design partner, please be aware of these current limitations:
- Manual credential provisioning (no self-service dashboard yet)
- Docker Compose deployment example only (production orchestration templates coming later)
Design partner support
What we provide
- Docker Compose configuration file
- Manual credential provisioning
- Direct engineering support for deployment
- Regular model updates
What we need from you
- Feedback on deployment experience
- Performance metrics in your environment
- Feature requests and prioritization input
- Use case validation
AWS deployment guide
This section provides step-by-step instructions for deploying the self-hosted streaming solution on AWS EC2, designed for users who may not be familiar with AWS infrastructure.
AWS prerequisites
Before you begin, ensure you have:
- An AWS account with billing enabled
- AWS CLI installed and configured on your local machine
- Basic familiarity with SSH and command-line operations
EC2 instance setup
1. Request GPU quota increase
By default, AWS accounts have limited or zero quota for GPU instances. You’ll need to request an increase:
- Navigate to the AWS Service Quotas console
- Search for “EC2”
- Find “Running On-Demand G and VT instances” (for g4dn, g5, or similar GPU instances)
- Click “Request quota increase”
- Request at least 4 vCPUs (minimum for a g4dn.xlarge instance)
- Provide a use case description: “Self-hosted AI transcription service requiring GPU acceleration”
- Submit the request
Note: Quota requests typically take 24-48 hours to process. Plan accordingly.
2. Choose the right instance type
Recommended instance types based on your needs:
Recommendation:
- For Universal Streaming, start with g4dn.xlarge for evaluation, then scale to g4dn.2xlarge or g5 instances for production workloads.
- For Universal-3 Pro Streaming, you need a GPU with at least 24 GB VRAM. Use g5.xlarge at minimum (A10G, 24 GB) and g5.2xlarge for production. T4-based instances (g4dn) are not sufficient for U3 Pro. See the v0.6.0 changelog entry for the full hardware requirements.
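The recommendation above can be captured as a small lookup. This is an illustrative sketch only: the "universal"/"u3pro" and "evaluation"/"production" keys are hypothetical labels, and the VRAM check uses the nominal card size (for example 16 GB for a T4, 24 GB for an A10G).

```python
# Starting points taken from the recommendation text; larger sizes also work.
RECOMMENDED_INSTANCE = {
    "universal": {"evaluation": "g4dn.xlarge", "production": "g4dn.2xlarge"},
    "u3pro": {"evaluation": "g5.xlarge", "production": "g5.2xlarge"},
}

U3PRO_MIN_VRAM_GB = 24


def recommended_instance(stack: str, purpose: str) -> str:
    """Return the suggested EC2 instance type for a stack and purpose."""
    return RECOMMENDED_INSTANCE[stack][purpose]


def meets_u3pro_vram(nominal_vram_gb: int) -> bool:
    """True if a GPU's nominal VRAM satisfies the U3 Pro minimum."""
    return nominal_vram_gb >= U3PRO_MIN_VRAM_GB
```

For Universal Streaming in production, g5 instances are an equally valid choice; the table lists only one option per cell to keep the lookup simple.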
3. Launch EC2 instance with recommended AMI
3.1 Navigate to the EC2 console and click “Launch Instance”
3.2 Configure instance settings:
- Name: assemblyai-self-hosted-streaming
- AMI: Search for and select “AWS Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04)”
  - AMI ID format: ami-xxxxxxxxx (varies by region)
  - This AMI includes pre-installed NVIDIA drivers, CUDA toolkit, and Docker with GPU support
- Instance type: Select the instance type that matches your stack from the recommendation table above. For Universal Streaming, g4dn.xlarge is a reasonable starter; for Universal-3 Pro Streaming, use g5.xlarge or larger.
- Key pair: Create a new key pair or select an existing one
  - If creating new: Download the .pem file and save it securely
  - Set permissions: chmod 400 your-key.pem
3.3 Configure storage:
- Root volume: Increase to at least 100 GB gp3 (model weights and containers require significant space)
- The default 8 GB is insufficient
3.4 Configure security group (Network settings):
Create a new security group with the following inbound rules:
Security recommendations:
- For production: Restrict Source to your specific IP addresses or VPC CIDR ranges
- For development/testing: You can use 0.0.0.0/0 but understand this allows public access
- Consider using AWS VPN or Direct Connect for enhanced security
- Enable AWS CloudTrail for audit logging
3.5 Launch the instance and wait for it to reach “Running” state
4. Connect to your EC2 instance
5. Verify GPU and Docker setup
Once connected, verify the pre-installed components:
Important: This setup uses Docker Compose v2, which uses the command docker compose (space, no hyphen) instead of the older docker-compose (hyphen). All commands in this guide use the v2 syntax.
6. Configure AWS credentials on the instance
Set up AWS credentials to pull container images from ECR:
You’ll be prompted to enter:
- AWS Access Key ID
- AWS Secret Access Key
- Default region: us-west-2
- Default output format: json
7. Deploy the self-hosted streaming solution
Follow the standard deployment instructions from the “Setup and deployment” section above. Common setup steps first, then pick the stack you want to deploy.
Common setup (both stacks):
Universal Streaming (English + Multilingual)
Universal-3 Pro Streaming
Important startup notes:
- Universal Streaming ASR services (streaming-asr-english, streaming-asr-multilang) take approximately 2-3 minutes to fully initialize and log "Ready to serve!" when ready.
- The U3 Pro ASR service (streaming-asr-u3pro) takes approximately 5 minutes to fully initialize and logs "U3Pro ASR Server ready!" when ready.
- Health checks may show “unhealthy” during startup; this is normal.
- Wait until the relevant ASR service(s) show their ready log line before attempting to use the API.
8. Test the deployment
From your local machine, test the connection using the live microphone example (see the Live microphone streaming example section above).
Important: Replace SERVER_IP in the example script with your EC2 instance’s public IP address, which you can find in the EC2 console under your instance details.
AWS cost optimization tips
- Use Spot Instances: Save up to 70% for non-critical workloads (may be interrupted)
- Stop instances when not in use: GPU instances are expensive; stop them during off-hours
- Use CloudWatch alarms: Set up billing alerts to avoid unexpected costs
- Consider Reserved Instances: Save up to 60% with 1 or 3-year commitments for production workloads
- Right-size your instance: Monitor GPU utilization and downgrade if consistently underutilized
Security best practices
- Enable AWS Systems Manager Session Manager for SSH-less access
- Use IAM roles instead of hardcoded credentials where possible
- Enable VPC Flow Logs for network monitoring
- Regular security updates: sudo apt update && sudo apt upgrade -y
- Use AWS Secrets Manager to store sensitive configuration
- Enable EBS encryption for data at rest
- Configure CloudWatch Logs for centralized logging
- Implement least privilege access with security groups and NACLs
Troubleshooting AWS-specific issues
Issue: “InsufficientInstanceCapacity” error when launching
- Solution: Try a different availability zone within your region or a different instance type
Issue: Quota request denied or pending
- Solution: Contact AWS Support through the console with your use case details
Issue: Cannot connect to EC2 instance
- Solution: Verify security group allows SSH (port 22) from your IP
- Solution: Check that you’re using the correct key pair and username (ubuntu for Ubuntu AMIs)
Issue: Docker containers fail to start with GPU errors
- Solution: Verify NVIDIA Container Toolkit is properly configured
- Solution: Check that the instance type has GPU resources
Issue: Services show “unhealthy” status
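- Solution: Universal Streaming ASR services take 2-3 minutes to fully initialize (about 5 minutes for the U3 Pro ASR service). Wait for the “Ready to serve!” log message, or “U3Pro ASR Server ready!” on the U3 Pro stack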
- Solution: Health checks may fail during startup - this is normal and will resolve once services are ready
Issue: Connection refused when testing from local machine
- Solution: Ensure you’re using the instance’s public IP address, not the private IP
- Solution: Verify security group allows inbound traffic on port 8080 from your IP
- Solution: Check that services are fully started with docker compose logs -f
Issue: “Authorization” header missing error
- Solution: All WebSocket connections must include the header Authorization: self-hosted
Issue: Need to transfer files to EC2 instance (e.g., audio files)
- Solution: Use SCP from your local machine:
Issue: High costs
- Solution: Stop the instance when not in use
- Solution: Review CloudWatch metrics to ensure you’re using the right instance size