Self-Hosted Streaming
The AssemblyAI Self-Hosted Streaming Solution provides secure, low-latency, real-time transcription deployed entirely within your own infrastructure. This early access version is intended for design partners to evaluate and give feedback on our self-hosted offering.
Getting the latest instructions
The most up-to-date deployment instructions, configuration files, and example scripts are maintained in our private GitHub repository:
https://github.com/AssemblyAI/streaming-self-hosting-stack
Design partners are encouraged to provide their GitHub username to gain access to the repository. Please contact the AssemblyAI team directly to request access.
Core principle
- Complete data isolation: No audio data, transcript data, or personally identifiable information (PII) will ever be sent to AssemblyAI servers. Only usage metadata and licensing information is transmitted.
System requirements
Hardware requirements
- GPU: NVIDIA GPU support required (any NVIDIA GPU model will work, T4 or newer recommended)
Software requirements
- Operating System: Linux
- Container Runtime: Docker and Docker Compose required
- AWS Account: Required for pulling container images from our ECR registry
Architecture
The streaming solution consists of three AssemblyAI Docker images plus a standard nginx container:
- API Service (`streaming-api`) - Gateway API service handling WebSocket connections
- English ASR Service (`streaming-asr-english`) - English speech recognition model service
- Multilingual ASR Service (`streaming-asr-multilang`) - Multilingual speech recognition model service
- ASR Load Balancer (`streaming-asr-lb`) - Standard `nginx:alpine` container with header-based routing between ASR services
Connection flow
Prerequisites
- Active enterprise contract with AssemblyAI
- AWS account for container registry access
- Linux environment with Docker and Docker Compose installed
- NVIDIA Container Toolkit for GPU support
Setup and deployment
1. Docker runtime with GPU support
1.1 Verify NVIDIA drivers are installed:
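A quick check is the NVIDIA system management tool, which should list your GPU and driver version:

```shell
nvidia-smi
```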
1.2 Install NVIDIA Container Toolkit:
Follow the NVIDIA Container Toolkit installation guide to set up GPU support for Docker.
1.3 Verify the Docker runtime has GPU access:
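One way to verify is to run `nvidia-smi` inside a CUDA base container (the image tag below is an example; any CUDA base image your driver supports will do):

```shell
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```

If this prints the same GPU table as the host-side `nvidia-smi`, Docker has GPU access.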
2. Obtain credentials
AWS ECR Access: We will manually provision AWS account credentials for your team to pull container images from our private Amazon ECR registry.
3. AWS ECR authentication
Authenticate with AWS ECR using provided credentials:
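The standard ECR login flow looks like the following; the region and registry URL are examples, so substitute the values AssemblyAI provides:

```shell
aws ecr get-login-password --region us-west-2 \
  | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-west-2.amazonaws.com
```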
4. Configure container images
Create a .env file with container image references:
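A sketch of what the `.env` file might contain; the variable names and image URIs here are placeholders, so use the exact references from the repository and the registry details AssemblyAI provides:

```shell
# .env -- placeholder image references
STREAMING_API_IMAGE=<account-id>.dkr.ecr.us-west-2.amazonaws.com/streaming-api:<tag>
STREAMING_ASR_ENGLISH_IMAGE=<account-id>.dkr.ecr.us-west-2.amazonaws.com/streaming-asr-english:<tag>
STREAMING_ASR_MULTILANG_IMAGE=<account-id>.dkr.ecr.us-west-2.amazonaws.com/streaming-asr-multilang:<tag>
```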
5. Deploy with Docker Compose
Start all services:
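```shell
docker compose up -d

# Watch startup; the ASR services log "Ready to serve!" when ready
docker compose logs -f
```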
The ASR service containers include built-in model weights - no separate model download required.
Configuration
Docker Compose configuration
The docker-compose.yml file defines the service architecture:
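The actual file is provided in the repository; the sketch below only illustrates its general shape (ports, volumes, and GPU reservation syntax here are assumptions, not the shipped values):

```yaml
# Hypothetical sketch of docker-compose.yml
services:
  streaming-api:
    image: ${STREAMING_API_IMAGE}
    ports:
      - "8080:8080"
    depends_on:
      - streaming-asr-lb
  streaming-asr-lb:
    image: nginx:alpine
    volumes:
      - ./nginx_streaming_asr.conf:/etc/nginx/conf.d/default.conf:ro
  streaming-asr-english:
    image: ${STREAMING_ASR_ENGLISH_IMAGE}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  streaming-asr-multilang:
    image: ${STREAMING_ASR_MULTILANG_IMAGE}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```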
Nginx configuration
The ASR load balancer uses header-based routing to direct requests to the appropriate model service based on the X-Model-Version header:
nginx_streaming_asr.conf
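The shipped configuration is in the repository; as a rough illustration of header-based routing (upstream names and ports below are assumptions, not the shipped values):

```nginx
# Hypothetical sketch of nginx_streaming_asr.conf
upstream asr_english {
    server streaming-asr-english:8081;
}
upstream asr_multilang {
    server streaming-asr-multilang:8081;
}

# Pick a backend based on the X-Model-Version header; default to English
map $http_x_model_version $asr_backend {
    default   asr_english;
    multilang asr_multilang;
}

server {
    listen 80;
    location / {
        proxy_pass http://$asr_backend;
        # WebSocket upgrade headers
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```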
Service endpoints
- WebSocket: `ws://localhost:8080`
Running the streaming example
A Python example script is provided to demonstrate how to stream a pre-recorded audio file to the self-hosted stack.
Note: You can initiate a session as soon as the streaming-asr-english and streaming-asr-multilang containers are healthy, which happens after they output a "Ready to serve!" log line.
Setup
Change to the streaming_example directory:
Create a fresh Python virtual environment and activate it:
Install the required packages:
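The three steps above, as shell commands (the requirements file name is an assumption; install whatever the repository lists):

```shell
cd streaming_example
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```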
Python script
Save this as example_with_prerecorded_audio_file.py:
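The official script is in the private repository. As a rough, hypothetical sketch of what such a client looks like (the language query parameter, chunk pacing, and response format below are assumptions, not the shipped script):

```python
# Hypothetical sketch of example_with_prerecorded_audio_file.py
import argparse
import asyncio
import sys
import wave

SAMPLE_RATE = 16000
CHUNK_MS = 50  # assumed audio chunk duration


def read_pcm_chunks(path, chunk_ms=CHUNK_MS):
    """Yield raw PCM16 frames from a mono 16 kHz WAV file."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getframerate() != SAMPLE_RATE:
            raise ValueError("expected mono 16 kHz PCM 16-bit WAV")
        frames_per_chunk = SAMPLE_RATE * chunk_ms // 1000
        while True:
            data = wf.readframes(frames_per_chunk)
            if not data:
                break
            yield data


async def stream(path, language, url):
    import websockets  # pip install websockets

    headers = {"Authorization": "self-hosted"}  # required by the stack
    # Note: the header keyword is `additional_headers` in recent websockets
    # releases and `extra_headers` in older ones.
    async with websockets.connect(f"{url}?language={language}",
                                  additional_headers=headers) as ws:
        async def receive():
            async for message in ws:
                print(message)  # transcript payloads

        recv_task = asyncio.create_task(receive())
        for chunk in read_pcm_chunks(path):
            await ws.send(chunk)
            await asyncio.sleep(CHUNK_MS / 1000)  # pace roughly in real time
        await asyncio.sleep(2)  # allow final transcripts to arrive
        recv_task.cancel()


if __name__ == "__main__" and len(sys.argv) > 1:
    p = argparse.ArgumentParser()
    p.add_argument("audio_file", help="mono 16 kHz PCM 16-bit WAV file")
    p.add_argument("--language", default="en")
    p.add_argument("--url", default="ws://localhost:8080")
    args = p.parse_args()
    asyncio.run(stream(args.audio_file, args.language, args.url))
```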
Usage
The example script (example_with_prerecorded_audio_file.py) requires a PCM 16-bit WAV file (mono channel, 16kHz sample rate).
Note on language parameter:
- Use `"en"` or omit the `--language` parameter for English transcription (routes to the English ASR service)
- Use `"multi"` or any non-English language code for multilingual transcription (routes to the multilingual ASR service)
Basic usage:
Example with multilingual transcription:
Command-line arguments:
View help:
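Assuming the flag names described above, the invocations might look like:

```shell
# Basic usage (English)
python example_with_prerecorded_audio_file.py path/to/audio.wav

# Multilingual transcription
python example_with_prerecorded_audio_file.py path/to/audio.wav --language multi

# View all command-line arguments
python example_with_prerecorded_audio_file.py --help
```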
Live microphone streaming example
This example demonstrates real-time microphone transcription using a remote self-hosted deployment. This is useful for testing your self-hosted instance from a local machine.
Setup
Install the required packages:
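The exact package list is in the repository; a microphone example typically needs a WebSocket client plus an audio capture library, for example:

```shell
pip install websockets pyaudio
```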
Python script
Save this as live_microphone_streaming.py:
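The official script is in the private repository. As a rough, hypothetical sketch (the language query parameter, chunk size, and choice of PyAudio are assumptions, not the shipped script):

```python
# Hypothetical sketch of live_microphone_streaming.py
import asyncio
import sys

SERVER_IP = "YOUR_SERVER_IP"  # replace with your server's public IP
PORT = 8080
SAMPLE_RATE = 16000
CHUNK_MS = 50  # assumed audio chunk duration


def build_ws_url(server_ip, language):
    # How the language is conveyed is an assumption (query parameter here)
    return f"ws://{server_ip}:{PORT}?language={language}"


async def stream_microphone(language="en"):
    import pyaudio      # pip install pyaudio
    import websockets   # pip install websockets

    frames_per_chunk = SAMPLE_RATE * CHUNK_MS // 1000
    pa = pyaudio.PyAudio()
    mic = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                  input=True, frames_per_buffer=frames_per_chunk)
    headers = {"Authorization": "self-hosted"}  # required by the stack
    async with websockets.connect(build_ws_url(SERVER_IP, language),
                                  additional_headers=headers) as ws:
        async def receive():
            async for message in ws:
                print(message)  # transcript payloads

        recv_task = asyncio.create_task(receive())
        try:
            while True:  # stream until interrupted (Ctrl+C)
                chunk = mic.read(frames_per_chunk, exception_on_overflow=False)
                await ws.send(chunk)
        finally:
            recv_task.cancel()
            mic.close()
            pa.terminate()


if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. python live_microphone_streaming.py en
    asyncio.run(stream_microphone(sys.argv[1]))
```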
Usage
Basic usage (English):
Multilingual transcription:
Specific language (e.g., Spanish):
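Assuming a script that takes the language code as an argument, the three invocations above might look like:

```shell
python live_microphone_streaming.py en      # English
python live_microphone_streaming.py multi   # multilingual
python live_microphone_streaming.py es      # specific language, e.g. Spanish
```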
Note:
- Make sure to replace `SERVER_IP` in the script with your actual server IP address
- If testing locally on the same machine as the server, use `localhost` or `127.0.0.1`
- The `Authorization: self-hosted` header is required for all connections
- Language routing: `"en"` routes to the English ASR service; any other code (including `"multi"`) routes to the multilingual ASR service
Updating services
Model updates
To update to a new model version:
- Pull the new container images from ECR
- Update your `.env` file with the new image references
- Restart the services using Docker Compose
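With Docker Compose, pulling and restarting is:

```shell
docker compose pull
docker compose up -d
```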
Monitoring and debugging
View service logs
Check service status
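The corresponding Docker Compose commands for the two tasks above:

```shell
# Follow logs for all services (or append a service name, e.g. streaming-asr-english)
docker compose logs -f

# Show container state and health-check status
docker compose ps
```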
Troubleshooting
Debug commands
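A few generic debugging commands (container names are placeholders):

```shell
# Confirm a container can see the GPU
docker exec -it <asr-container> nvidia-smi

# Inspect the last log lines of a failing service
docker compose logs --tail 100 streaming-asr-english

# Check that the WebSocket port is listening on the host
ss -tlnp | grep 8080
```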
Common issues
- GPU not detected: Verify NVIDIA Container Toolkit is properly installed and Docker has GPU access.
- Services not starting: Check logs for specific error messages using `docker compose logs -f [service-name]`.
- Connection refused: Ensure all services are healthy by checking `docker compose ps` and reviewing health check status.
Current limitations
As a design partner, please be aware of these current limitations:
- Text formatting is not included (coming in future streaming model release)
- Manual credential provisioning (no self-service dashboard yet)
- Docker Compose deployment example only (production orchestration templates coming later)
Design partner support
What we provide
- Docker Compose configuration file
- Manual credential provisioning
- Direct engineering support for deployment
- Regular model updates
What we need from you
- Feedback on deployment experience
- Performance metrics in your environment
- Feature requests and prioritization input
- Use case validation
AWS deployment guide
This section provides step-by-step instructions for deploying the self-hosted streaming solution on AWS EC2, designed for users who may not be familiar with AWS infrastructure.
AWS prerequisites
Before you begin, ensure you have:
- An AWS account with billing enabled
- AWS CLI installed and configured on your local machine
- Basic familiarity with SSH and command-line operations
EC2 instance setup
1. Request GPU quota increase
By default, AWS accounts have limited or zero quota for GPU instances. You’ll need to request an increase:
- Navigate to the AWS Service Quotas console
- Search for “EC2”
- Find “Running On-Demand G and VT instances” (for g4dn, g5, or similar GPU instances)
- Click “Request quota increase”
- Request at least 4 vCPUs (minimum for a g4dn.xlarge instance)
- Provide a use case description: “Self-hosted AI transcription service requiring GPU acceleration”
- Submit the request
Note: Quota requests typically take 24-48 hours to process. Plan accordingly.
2. Choose the right instance type
Recommended instance types based on your needs:
Recommendation: Start with g4dn.xlarge for evaluation, then scale to g4dn.2xlarge or g5 instances for production workloads.
3. Launch EC2 instance with recommended AMI
3.1 Navigate to the EC2 console and click “Launch Instance”
3.2 Configure instance settings:
- Name: `assemblyai-self-hosted-streaming`
- AMI: Search for and select “AWS Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04)”
  - AMI ID format: `ami-xxxxxxxxx` (varies by region)
  - This AMI includes pre-installed NVIDIA drivers, CUDA toolkit, and Docker with GPU support
- Instance type: Select `g4dn.xlarge` (or your chosen instance type)
- Key pair: Create a new key pair or select an existing one
  - If creating new: Download the `.pem` file and save it securely
  - Set permissions: `chmod 400 your-key.pem`
3.3 Configure storage:
- Root volume: Increase to at least 100 GB gp3 (model weights and containers require significant space)
- The default 8 GB is insufficient
3.4 Configure security group (Network settings):
Create a new security group with the following inbound rules (at minimum, SSH on port 22 and the WebSocket service on port 8080, with Source restricted to your IP):
Security recommendations:
- For production: Restrict Source to your specific IP addresses or VPC CIDR ranges
- For development/testing: You can use `0.0.0.0/0`, but understand this allows public access
- Consider using AWS VPN or Direct Connect for enhanced security
- Enable AWS CloudTrail for audit logging
3.5 Launch the instance and wait for it to reach “Running” state
4. Connect to your EC2 instance
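Connect over SSH using the key pair from step 3.2 and the default Ubuntu username:

```shell
ssh -i your-key.pem ubuntu@<instance-public-ip>
```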
5. Verify GPU and Docker setup
Once connected, verify the pre-installed components:
Important: This setup uses Docker Compose v2, which uses the command docker compose (space, no hyphen) instead of the older docker-compose (hyphen). All commands in this guide use the v2 syntax.
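The checks might look like this (the CUDA image tag is an example; any CUDA base image your driver supports will do):

```shell
nvidia-smi                 # GPU and driver visible on the host
docker --version
docker compose version     # Compose v2: "docker compose", no hyphen
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```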
6. Configure AWS credentials on the instance
Set up AWS credentials to pull container images from ECR:
You’ll be prompted to enter:
- AWS Access Key ID
- AWS Secret Access Key
- Default region: `us-west-2`
- Default output format: `json`
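These prompts come from the AWS CLI configuration command:

```shell
aws configure
```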
7. Deploy the self-hosted streaming solution
Follow the standard deployment instructions from the “Setup and deployment” section above:
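In short (registry URL is a placeholder; use the values AssemblyAI provides):

```shell
aws ecr get-login-password --region us-west-2 \
  | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-west-2.amazonaws.com
docker compose up -d
docker compose logs -f
```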
Important startup notes:
- The ASR services (`streaming-asr-english` and `streaming-asr-multilang`) take approximately 2-3 minutes to fully initialize
- You’ll see `"Ready to serve!"` in the logs when each ASR service is ready
- Health checks may show “unhealthy” during startup - this is normal
- Wait until both ASR services show `"Ready to serve!"` before attempting to use the API
8. Test the deployment
From your local machine, test the connection using the live microphone example (see the Live microphone streaming example section above).
Important: Replace SERVER_IP in the example script with your EC2 instance’s public IP address, which you can find in the EC2 console under your instance details.
AWS cost optimization tips
- Use Spot Instances: Save up to 70% for non-critical workloads (may be interrupted)
- Stop instances when not in use: GPU instances are expensive; stop them during off-hours
- Use CloudWatch alarms: Set up billing alerts to avoid unexpected costs
- Consider Reserved Instances: Save up to 60% with 1 or 3-year commitments for production workloads
- Right-size your instance: Monitor GPU utilization and downgrade if consistently underutilized
Security best practices
- Enable AWS Systems Manager Session Manager for SSH-less access
- Use IAM roles instead of hardcoded credentials where possible
- Enable VPC Flow Logs for network monitoring
- Regular security updates: `sudo apt update && sudo apt upgrade -y`
- Use AWS Secrets Manager to store sensitive configuration
- Enable EBS encryption for data at rest
- Configure CloudWatch Logs for centralized logging
- Implement least privilege access with security groups and NACLs
Troubleshooting AWS-specific issues
Issue: “InsufficientInstanceCapacity” error when launching
- Solution: Try a different availability zone within your region or a different instance type
Issue: Quota request denied or pending
- Solution: Contact AWS Support through the console with your use case details
Issue: Cannot connect to EC2 instance
- Solution: Verify security group allows SSH (port 22) from your IP
- Solution: Check that you’re using the correct key pair and username (`ubuntu` for Ubuntu AMIs)
Issue: Docker containers fail to start with GPU errors
- Solution: Verify NVIDIA Container Toolkit is properly configured
- Solution: Check that the instance type has GPU resources
Issue: Services show “unhealthy” status
- Solution: ASR services take 2-3 minutes to fully initialize - wait for “Ready to serve!” log messages
- Solution: Health checks may fail during startup - this is normal and will resolve once services are ready
Issue: Connection refused when testing from local machine
- Solution: Ensure you’re using the instance’s public IP address, not the private IP
- Solution: Verify security group allows inbound traffic on port 8080 from your IP
- Solution: Check that services are fully started with `docker compose logs -f`
Issue: “Authorization” header missing error
- Solution: All WebSocket connections must include the header `Authorization: self-hosted`
Issue: Need to transfer files to EC2 instance (e.g., audio files)
- Solution: Use SCP from your local machine:
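For example, to copy a local audio file to the instance's home directory:

```shell
scp -i your-key.pem audio.wav ubuntu@<instance-public-ip>:~/
```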
Issue: High costs
- Solution: Stop the instance when not in use
- Solution: Review CloudWatch metrics to ensure you’re using the right instance size