Self-Hosted Streaming
The AssemblyAI Self-Hosted Streaming Solution provides secure, low-latency, real-time transcription deployed entirely within your own infrastructure. This early access version is intended for design partners to evaluate and give feedback on our self-hosted offering.
Getting the latest instructions
The most up-to-date deployment instructions, configuration files, and example scripts are maintained in our private GitHub repository:
https://github.com/AssemblyAI/streaming-self-hosting-stack
Design partners are encouraged to provide their GitHub username to gain access to the repository. Please contact the AssemblyAI team directly to request access.
Core principle
- Complete data isolation: No audio data, transcript data, or personally identifiable information (PII) will ever be sent to AssemblyAI servers. Only usage metadata and licensing information is transmitted.
System requirements
Hardware requirements
- GPU: NVIDIA GPU support required (any NVIDIA GPU model will work, T4 or newer recommended)
Software requirements
- Operating System: Linux
- Container Runtime: Docker and Docker Compose required
- AWS Account: Required for pulling container images from our ECR registry
Architecture
The streaming solution consists of three AssemblyAI Docker images plus a standard nginx container:
- API Service (`streaming-api`) - Gateway API service handling WebSocket connections
- English ASR Service (`streaming-asr-english`) - English speech recognition model service
- Multilingual ASR Service (`streaming-asr-multilang`) - Multilingual speech recognition model service
- ASR Load Balancer (`streaming-asr-lb`) - Standard `nginx:alpine` container with header-based routing between ASR services
Connection flow
Prerequisites
- Active enterprise contract with AssemblyAI
- AWS account for container registry access
- Linux environment with Docker and Docker Compose installed
- NVIDIA Container Toolkit for GPU support
Setup and deployment
1. Docker runtime with GPU support
1.1 Verify NVIDIA drivers are installed:
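A quick check is the NVIDIA system management tool, which should list your GPU and driver version:

```shell
nvidia-smi
```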
1.2 Install NVIDIA Container Toolkit:
Follow the NVIDIA Container Toolkit installation guide to set up GPU support for Docker.
1.3 Verify the Docker runtime has GPU access:
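One way to verify is to run `nvidia-smi` inside a CUDA base container (the image tag below is an example; any CUDA base image your driver supports will do):

```shell
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```

If this prints the same GPU table as the host-side `nvidia-smi`, Docker has GPU access.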
2. Obtain credentials
AWS ECR Access: We will manually provision AWS account credentials for your team to pull container images from our private Amazon ECR registry.
3. AWS ECR authentication
Authenticate with AWS ECR using provided credentials:
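The standard ECR login flow looks like the following; the region and registry URL are examples, so substitute the values AssemblyAI provides:

```shell
aws ecr get-login-password --region us-west-2 \
  | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-west-2.amazonaws.com
```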
4. Configure container images
Create a .env file with container image references:
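A sketch of what the `.env` file might contain; the variable names and image URIs here are placeholders, so use the exact references from the repository and the registry details AssemblyAI provides:

```shell
# .env -- placeholder image references
STREAMING_API_IMAGE=<account-id>.dkr.ecr.us-west-2.amazonaws.com/streaming-api:<tag>
STREAMING_ASR_ENGLISH_IMAGE=<account-id>.dkr.ecr.us-west-2.amazonaws.com/streaming-asr-english:<tag>
STREAMING_ASR_MULTILANG_IMAGE=<account-id>.dkr.ecr.us-west-2.amazonaws.com/streaming-asr-multilang:<tag>
```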
5. Deploy with Docker Compose
Start all services:
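```shell
docker compose up -d

# Watch startup; the ASR services log "Ready to serve!" when ready
docker compose logs -f
```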
The ASR service containers include built-in model weights - no separate model download required.
Configuration
Docker Compose configuration
The docker-compose.yml file defines the service architecture:
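The actual file is provided in the repository; the sketch below only illustrates its general shape (ports, volumes, and GPU reservation syntax here are assumptions, not the shipped values):

```yaml
# Hypothetical sketch of docker-compose.yml
services:
  streaming-api:
    image: ${STREAMING_API_IMAGE}
    ports:
      - "8080:8080"
    depends_on:
      - streaming-asr-lb
  streaming-asr-lb:
    image: nginx:alpine
    volumes:
      - ./nginx_streaming_asr.conf:/etc/nginx/conf.d/default.conf:ro
  streaming-asr-english:
    image: ${STREAMING_ASR_ENGLISH_IMAGE}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  streaming-asr-multilang:
    image: ${STREAMING_ASR_MULTILANG_IMAGE}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```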
Nginx configuration
The ASR load balancer uses header-based routing to direct requests to the appropriate model service based on the X-Model-Version header:
nginx_streaming_asr.conf
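The shipped configuration is in the repository; as a rough illustration of header-based routing (upstream names and ports below are assumptions, not the shipped values):

```nginx
# Hypothetical sketch of nginx_streaming_asr.conf
upstream asr_english {
    server streaming-asr-english:8081;
}
upstream asr_multilang {
    server streaming-asr-multilang:8081;
}

# Pick a backend based on the X-Model-Version header; default to English
map $http_x_model_version $asr_backend {
    default   asr_english;
    multilang asr_multilang;
}

server {
    listen 80;
    location / {
        proxy_pass http://$asr_backend;
        # WebSocket upgrade headers
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```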
Service endpoints
- WebSocket: `ws://localhost:8080`
Running the streaming example
A Python example script is provided to demonstrate how to stream a pre-recorded audio file to the self-hosted stack.
Note: You can initiate a session as soon as the streaming-asr-english and streaming-asr-multilang containers are healthy, which happens after they output a "Ready to serve!" log line.
Setup
Change to the streaming_example directory:
Create a fresh Python virtual environment and activate it:
Install the required packages:
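The three steps above, as shell commands (the requirements file name is an assumption; install whatever the repository lists):

```shell
cd streaming_example
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```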
Python script
Save this as example_with_prerecorded_audio_file.py:
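The official script is in the private repository. As a rough, hypothetical sketch of what such a client looks like (the language query parameter, chunk pacing, and response format below are assumptions, not the shipped script):

```python
# Hypothetical sketch of example_with_prerecorded_audio_file.py
import argparse
import asyncio
import sys
import wave

SAMPLE_RATE = 16000
CHUNK_MS = 50  # assumed audio chunk duration


def read_pcm_chunks(path, chunk_ms=CHUNK_MS):
    """Yield raw PCM16 frames from a mono 16 kHz WAV file."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getframerate() != SAMPLE_RATE:
            raise ValueError("expected mono 16 kHz PCM 16-bit WAV")
        frames_per_chunk = SAMPLE_RATE * chunk_ms // 1000
        while True:
            data = wf.readframes(frames_per_chunk)
            if not data:
                break
            yield data


async def stream(path, language, url):
    import websockets  # pip install websockets

    headers = {"Authorization": "self-hosted"}  # required by the stack
    # Note: the header keyword is `additional_headers` in recent websockets
    # releases and `extra_headers` in older ones.
    async with websockets.connect(f"{url}?language={language}",
                                  additional_headers=headers) as ws:
        async def receive():
            async for message in ws:
                print(message)  # transcript payloads

        recv_task = asyncio.create_task(receive())
        for chunk in read_pcm_chunks(path):
            await ws.send(chunk)
            await asyncio.sleep(CHUNK_MS / 1000)  # pace roughly in real time
        await asyncio.sleep(2)  # allow final transcripts to arrive
        recv_task.cancel()


if __name__ == "__main__" and len(sys.argv) > 1:
    p = argparse.ArgumentParser()
    p.add_argument("audio_file", help="mono 16 kHz PCM 16-bit WAV file")
    p.add_argument("--language", default="en")
    p.add_argument("--url", default="ws://localhost:8080")
    args = p.parse_args()
    asyncio.run(stream(args.audio_file, args.language, args.url))
```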
Usage
The example script (example_with_prerecorded_audio_file.py) requires a PCM 16-bit WAV file (mono channel, 16kHz sample rate).
Note on language parameter:
- Use `"en"` or omit the `--language` parameter for English transcription (routes to the English ASR service)
- Use `"multi"` or any non-English language code for multilingual transcription (routes to the multilingual ASR service)
Basic usage:
Example with multilingual transcription:
Command-line arguments:
View help:
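Assuming the flag names described above, the invocations might look like:

```shell
# Basic usage (English)
python example_with_prerecorded_audio_file.py path/to/audio.wav

# Multilingual transcription
python example_with_prerecorded_audio_file.py path/to/audio.wav --language multi

# View all command-line arguments
python example_with_prerecorded_audio_file.py --help
```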
Live microphone streaming example
This example demonstrates real-time microphone transcription using a remote self-hosted deployment. This is useful for testing your self-hosted instance from a local machine.
Setup
Install the required packages:
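The exact package list is in the repository; a microphone example typically needs a WebSocket client plus an audio capture library, for example:

```shell
pip install websockets pyaudio
```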
Python script
Save this as live_microphone_streaming.py:
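The official script is in the private repository. As a rough, hypothetical sketch (the language query parameter, chunk size, and choice of PyAudio are assumptions, not the shipped script):

```python
# Hypothetical sketch of live_microphone_streaming.py
import asyncio
import sys

SERVER_IP = "YOUR_SERVER_IP"  # replace with your server's public IP
PORT = 8080
SAMPLE_RATE = 16000
CHUNK_MS = 50  # assumed audio chunk duration


def build_ws_url(server_ip, language):
    # How the language is conveyed is an assumption (query parameter here)
    return f"ws://{server_ip}:{PORT}?language={language}"


async def stream_microphone(language="en"):
    import pyaudio      # pip install pyaudio
    import websockets   # pip install websockets

    frames_per_chunk = SAMPLE_RATE * CHUNK_MS // 1000
    pa = pyaudio.PyAudio()
    mic = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                  input=True, frames_per_buffer=frames_per_chunk)
    headers = {"Authorization": "self-hosted"}  # required by the stack
    async with websockets.connect(build_ws_url(SERVER_IP, language),
                                  additional_headers=headers) as ws:
        async def receive():
            async for message in ws:
                print(message)  # transcript payloads

        recv_task = asyncio.create_task(receive())
        try:
            while True:  # stream until interrupted (Ctrl+C)
                chunk = mic.read(frames_per_chunk, exception_on_overflow=False)
                await ws.send(chunk)
        finally:
            recv_task.cancel()
            mic.close()
            pa.terminate()


if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. python live_microphone_streaming.py en
    asyncio.run(stream_microphone(sys.argv[1]))
```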
Usage
Basic usage (English):
Multilingual transcription:
Specific language (e.g., Spanish):
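Assuming a script that takes the language code as an argument, the three invocations above might look like:

```shell
python live_microphone_streaming.py en      # English
python live_microphone_streaming.py multi   # multilingual
python live_microphone_streaming.py es      # specific language, e.g. Spanish
```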
Note:
- Make sure to replace `SERVER_IP` in the script with your actual server IP address
- If testing locally on the same machine as the server, use `localhost` or `127.0.0.1`
- The `Authorization: self-hosted` header is required for all connections
- Language routing: `"en"` routes to the English ASR service; any other code (including `"multi"`) routes to the multilingual ASR service
Updating services
Model updates
To update to a new model version:
- Pull the new container images from ECR
- Update your `.env` file with the new image references
- Restart the services using Docker Compose
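With Docker Compose, pulling and restarting is:

```shell
docker compose pull
docker compose up -d
```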
Monitoring and debugging
View service logs
Check service status
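The corresponding Docker Compose commands for the two tasks above:

```shell
# Follow logs for all services (or append a service name, e.g. streaming-asr-english)
docker compose logs -f

# Show container state and health-check status
docker compose ps
```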
Troubleshooting
Debug commands
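A few generic debugging commands (container names are placeholders):

```shell
# Confirm a container can see the GPU
docker exec -it <asr-container> nvidia-smi

# Inspect the last log lines of a failing service
docker compose logs --tail 100 streaming-asr-english

# Check that the WebSocket port is listening on the host
ss -tlnp | grep 8080
```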
Common issues
- GPU not detected: Verify NVIDIA Container Toolkit is properly installed and Docker has GPU access.
- Services not starting: Check logs for specific error messages using `docker compose logs -f [service-name]`.
- Connection refused: Ensure all services are healthy by checking `docker compose ps` and reviewing health check status.
Current limitations
As a design partner, please be aware of these current limitations:
- Text formatting is not included (coming in future streaming model release)
- Manual credential provisioning (no self-service dashboard yet)
- Docker Compose deployment example only (production orchestration templates coming later)
Design partner support
What we provide
- Docker Compose configuration file
- Manual credential provisioning
- Direct engineering support for deployment
- Regular model updates
What we need from you
- Feedback on deployment experience
- Performance metrics in your environment
- Feature requests and prioritization input
- Use case validation
AWS deployment guide
This section provides step-by-step instructions for deploying the self-hosted streaming solution on AWS EC2, designed for users who may not be familiar with AWS infrastructure.
AWS prerequisites
Before you begin, ensure you have:
- An AWS account with billing enabled
- AWS CLI installed and configured on your local machine
- Basic familiarity with SSH and command-line operations
EC2 instance setup
1. Request GPU quota increase
By default, AWS accounts have limited or zero quota for GPU instances. You’ll need to request an increase:
- Navigate to the AWS Service Quotas console
- Search for “EC2”
- Find “Running On-Demand G and VT instances” (for g4dn, g5, or similar GPU instances)
- Click “Request quota increase”
- Request at least 4 vCPUs (minimum for a g4dn.xlarge instance)
- Provide a use case description: “Self-hosted AI transcription service requiring GPU acceleration”
- Submit the request
Note: Quota requests typically take 24-48 hours to process. Plan accordingly.
2. Choose the right instance type
Recommended instance types based on your needs:
Recommendation: Start with g4dn.xlarge for evaluation, then scale to g4dn.2xlarge or g5 instances for production workloads.
3. Launch EC2 instance with recommended AMI
3.1 Navigate to the EC2 console and click “Launch Instance”
3.2 Configure instance settings:
- Name: `assemblyai-self-hosted-streaming`
- AMI: Search for and select “AWS Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04)”
  - AMI ID format: `ami-xxxxxxxxx` (varies by region)
  - This AMI includes pre-installed NVIDIA drivers, CUDA toolkit, and Docker with GPU support
- Instance type: Select `g4dn.xlarge` (or your chosen instance type)
- Key pair: Create a new key pair or select an existing one
  - If creating new: Download the `.pem` file and save it securely
  - Set permissions: `chmod 400 your-key.pem`
3.3 Configure storage:
- Root volume: Increase to at least 100 GB gp3 (model weights and containers require significant space)
- The default 8 GB is insufficient
3.4 Configure security group (Network settings):
Create a new security group with the following inbound rules (at minimum, SSH on port 22 and the WebSocket service on port 8080, with Source restricted to your IP):
Security recommendations:
- For production: Restrict Source to your specific IP addresses or VPC CIDR ranges
- For development/testing: You can use `0.0.0.0/0`, but understand this allows public access
- Consider using AWS VPN or Direct Connect for enhanced security
- Enable AWS CloudTrail for audit logging
3.5 Launch the instance and wait for it to reach “Running” state
4. Connect to your EC2 instance
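Connect over SSH using the key pair from step 3.2 and the default Ubuntu username:

```shell
ssh -i your-key.pem ubuntu@<instance-public-ip>
```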
5. Verify GPU and Docker setup
Once connected, verify the pre-installed components:
Important: This setup uses Docker Compose v2, which uses the command docker compose (space, no hyphen) instead of the older docker-compose (hyphen). All commands in this guide use the v2 syntax.
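The checks might look like this (the CUDA image tag is an example; any CUDA base image your driver supports will do):

```shell
nvidia-smi                 # GPU and driver visible on the host
docker --version
docker compose version     # Compose v2: "docker compose", no hyphen
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```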
6. Configure AWS credentials on the instance
Set up AWS credentials to pull container images from ECR:
You’ll be prompted to enter:
- AWS Access Key ID
- AWS Secret Access Key
- Default region: `us-west-2`
- Default output format: `json`
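These prompts come from the AWS CLI configuration command:

```shell
aws configure
```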
7. Deploy the self-hosted streaming solution
Follow the standard deployment instructions from the “Setup and deployment” section above:
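In short (registry URL is a placeholder; use the values AssemblyAI provides):

```shell
aws ecr get-login-password --region us-west-2 \
  | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-west-2.amazonaws.com
docker compose up -d
docker compose logs -f
```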
Important startup notes:
- The ASR services (`streaming-asr-english` and `streaming-asr-multilang`) take approximately 2-3 minutes to fully initialize
- You’ll see `"Ready to serve!"` in the logs when each ASR service is ready
- Health checks may show “unhealthy” during startup - this is normal
- Wait until both ASR services show `"Ready to serve!"` before attempting to use the API
8. Test the deployment
From your local machine, test the connection using the live microphone example (see the Live microphone streaming example section above).
Important: Replace SERVER_IP in the example script with your EC2 instance’s public IP address, which you can find in the EC2 console under your instance details.
AWS cost optimization tips
- Use Spot Instances: Save up to 70% for non-critical workloads (may be interrupted)
- Stop instances when not in use: GPU instances are expensive; stop them during off-hours
- Use CloudWatch alarms: Set up billing alerts to avoid unexpected costs
- Consider Reserved Instances: Save up to 60% with 1 or 3-year commitments for production workloads
- Right-size your instance: Monitor GPU utilization and downgrade if consistently underutilized
Security best practices
- Enable AWS Systems Manager Session Manager for SSH-less access
- Use IAM roles instead of hardcoded credentials where possible
- Enable VPC Flow Logs for network monitoring
- Regular security updates: `sudo apt update && sudo apt upgrade -y`
- Use AWS Secrets Manager to store sensitive configuration
- Enable EBS encryption for data at rest
- Configure CloudWatch Logs for centralized logging
- Implement least privilege access with security groups and NACLs
Troubleshooting AWS-specific issues
Issue: “InsufficientInstanceCapacity” error when launching
- Solution: Try a different availability zone within your region or a different instance type
Issue: Quota request denied or pending
- Solution: Contact AWS Support through the console with your use case details
Issue: Cannot connect to EC2 instance
- Solution: Verify security group allows SSH (port 22) from your IP
- Solution: Check that you’re using the correct key pair and username (`ubuntu` for Ubuntu AMIs)
Issue: Docker containers fail to start with GPU errors
- Solution: Verify NVIDIA Container Toolkit is properly configured
- Solution: Check that the instance type has GPU resources
Issue: Services show “unhealthy” status
- Solution: ASR services take 2-3 minutes to fully initialize - wait for “Ready to serve!” log messages
- Solution: Health checks may fail during startup - this is normal and will resolve once services are ready
Issue: Connection refused when testing from local machine
- Solution: Ensure you’re using the instance’s public IP address, not the private IP
- Solution: Verify security group allows inbound traffic on port 8080 from your IP
- Solution: Check that services are fully started with `docker compose logs -f`
Issue: “Authorization” header missing error
- Solution: All WebSocket connections must include the header `Authorization: self-hosted`
Issue: Need to transfer files to EC2 instance (e.g., audio files)
- Solution: Use SCP from your local machine:
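For example, to copy a local audio file to the instance's home directory:

```shell
scp -i your-key.pem audio.wav ubuntu@<instance-public-ip>:~/
```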
Issue: High costs
- Solution: Stop the instance when not in use
- Solution: Review CloudWatch metrics to ensure you’re using the right instance size