Self-Hosted Streaming

The AssemblyAI Self-Hosted Streaming Solution provides secure, low-latency real-time transcription that you deploy within your own infrastructure. This early access version lets design partners evaluate the self-hosted offering and provide feedback on it.

Universal-3 Pro Streaming is now available for self-hosting

Available as of v0.6.0. See the Universal-3 Pro Streaming images section below for the deployment image references, or the upstream docker-compose.u3pro.yml for the full service definition.

Self-hosted streaming requires a $20,000 upfront commitment. Contact our sales team to discuss your specific needs and to learn more about our self-hosted offerings.

Getting the latest instructions

The most up-to-date deployment instructions, configuration files, and example scripts are maintained in our private GitHub repository:

https://github.com/AssemblyAI/streaming-self-hosting-stack

Design partners are encouraged to provide their GitHub username to gain access to the repository. Please contact the AssemblyAI team directly to request access.

Core principle

  • Complete data isolation: No audio data, transcript data, or personally identifiable information (PII) is ever sent to AssemblyAI servers. Only usage metadata and licensing information are transmitted.

System requirements

Hardware requirements

  • GPU (Universal Streaming): An NVIDIA GPU is required. Any NVIDIA GPU model will work; T4 or newer is recommended.
  • GPU (Universal-3 Pro Streaming): Requires NVIDIA L4 / A10 / A100 / L40S / H100 or equivalent with at least 24 GB VRAM. T4 GPUs are not sufficient for U3 Pro. See the v0.6.0 changelog entry for details.

Software requirements

  • Operating System: Linux
  • Container Runtime: Docker and Docker Compose required
  • AWS Account: Required for pulling container images from our ECR registry

Architecture

Self-hosted streaming ships as two separate stacks. Both share the same gateway, load balancer, and license proxy — they differ only in the ASR backend. Run one stack at a time.

Shared services (both stacks)

  1. API Service (streaming-api) - Gateway API service handling WebSocket connections
  2. License and Usage Proxy (license-and-usage-proxy) - License validation and usage reporting service
  3. ASR Load Balancer (streaming-asr-lb) - Standard nginx:alpine container with header-based routing between ASR services

Universal Streaming stack (docker-compose.yml)

Adds two ASR backends to the shared services above:

  • English ASR Service (streaming-asr-english) - English speech recognition model service
  • Multilingual ASR Service (streaming-asr-multilang) - Multilingual speech recognition model service

Universal-3 Pro Streaming stack (docker-compose.u3pro.yml)

Adds a single ASR backend to the shared services above:

  • U3 Pro ASR Service (streaming-asr-u3pro) - Universal-3 Pro speech recognition model service, available as of v0.6.0. Targeted at voice agent scenarios. See the v0.6.0 changelog entry for capability details.

Connection flow

Both stacks route ASR requests through the same streaming-asr-lb nginx load balancer using header-based routing on X-Model-Version. The difference is which ASR backends are deployed.

Universal Streaming (docker-compose.yml)

WebSocket client → streaming-api:8080 (WebSocket)
  ├─ Usage reporting ───────→ license-and-usage-proxy:8080 [if usage-based billing] ────→ https://usage-tracker.assemblyai.com
  │                           │
  ├─ License validation ──────┘
  └─ ASR requests ───────→ streaming-asr-lb:80 → Header-based routing (X-Model-Version):
       ├── en-default → streaming-asr-english:50051 (gRPC)
       └── ml-default → streaming-asr-multilang:50051 (gRPC)

Universal-3 Pro Streaming (docker-compose.u3pro.yml)

WebSocket client → streaming-api:8080 (WebSocket)
  ├─ Usage reporting ───────→ license-and-usage-proxy:8080 [if usage-based billing] ────→ https://usage-tracker.assemblyai.com
  │                           │
  ├─ License validation ──────┘
  └─ ASR requests ───────→ streaming-asr-lb:80 → Header-based routing (X-Model-Version):
       └── u3-pro → streaming-asr-u3pro:50051 (gRPC)

Prerequisites

  1. AssemblyAI license: Valid for the streaming self-hosted product.
  2. Docker & Docker Compose: Ensure Docker and Docker Compose are installed.
  3. GPU Support: NVIDIA Container Toolkit for GPU-enabled services.
  4. AWS Access: Valid AWS credentials to pull images from ECR.

Setup and deployment

1. Docker runtime with GPU support

1.1 Verify NVIDIA drivers are installed:

nvidia-smi

1.2 Install NVIDIA Container Toolkit:

Follow the NVIDIA Container Toolkit installation guide to set up GPU support for Docker.

1.3 Verify the Docker runtime has GPU access:

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

2. AWS ECR authentication

AWS ECR Access: We will manually provision AWS account credentials for your team to pull container images from our private Amazon ECR registry.

# Login to ECR to pull container images
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 344839248844.dkr.ecr.us-west-2.amazonaws.com

3. Configure container images

Universal Streaming and Universal-3 Pro Streaming ship as two separate stacks with their own Compose files. Pick the stack that matches the model you want to serve — they are not designed to be merged into a single Compose project. Configure the .env file for the stack you plan to run.

Universal Streaming (English + Multilingual) images

Use the reference .env.example file to create a .env file with container image references:

STREAMING_API_IMAGE=CUSTOM_IMAGE_URI
STREAMING_ASR_ENGLISH_IMAGE=CUSTOM_IMAGE_URI
STREAMING_ASR_MULTILANG_IMAGE=CUSTOM_IMAGE_URI
LICENSE_AND_USAGE_PROXY_IMAGE=CUSTOM_IMAGE_URI

For ease of reference in this doc, the current image references are below:

STREAMING_API_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-api:release-v0.6.0
STREAMING_ASR_ENGLISH_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-asr-english:release-v0.6.0
STREAMING_ASR_MULTILANG_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-asr-multilang:release-v0.6.0
LICENSE_AND_USAGE_PROXY_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-license-and-usage-proxy:release-v0.6.0
USAGE_TRACKING_API_KEY=YOUR_USAGE_TRACKING_API_KEY

Universal-3 Pro Streaming images

To run Universal-3 Pro Streaming, use the separate docker-compose.u3pro.yml file from the upstream repo with the following image references:

STREAMING_API_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-api:release-v0.6.0
LICENSE_AND_USAGE_PROXY_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-license-and-usage-proxy:release-v0.6.0
STREAMING_ASR_U3PRO_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-asr-u3-pro:release-v0.6.0
USAGE_TRACKING_API_KEY=YOUR_USAGE_TRACKING_API_KEY

The U3 Pro stack does not require the STREAMING_ASR_ENGLISH_IMAGE or STREAMING_ASR_MULTILANG_IMAGE images.
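
A quick pre-flight check can catch a missing image reference before you start a stack. The sketch below is a hypothetical helper, not part of the upstream repo: it parses .env-style text and reports which of the variables listed above are absent or still set to the CUSTOM_IMAGE_URI placeholder. The function names and the stack labels "universal" and "u3pro" are assumptions made for illustration. USAGE_TRACKING_API_KEY is deliberately not treated as required, since it only applies to usage-based billing.

```python
# Hypothetical pre-flight check; names are illustrative, not part of the upstream repo.
REQUIRED_VARS = {
    "universal": [
        "STREAMING_API_IMAGE",
        "STREAMING_ASR_ENGLISH_IMAGE",
        "STREAMING_ASR_MULTILANG_IMAGE",
        "LICENSE_AND_USAGE_PROXY_IMAGE",
    ],
    "u3pro": [
        "STREAMING_API_IMAGE",
        "LICENSE_AND_USAGE_PROXY_IMAGE",
        "STREAMING_ASR_U3PRO_IMAGE",
    ],
}


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_vars(env_text: str, stack: str) -> list:
    """Return required variables that are absent, empty, or still the placeholder."""
    env = parse_env(env_text)
    return [
        name
        for name in REQUIRED_VARS[stack]
        if env.get(name, "") in ("", "CUSTOM_IMAGE_URI")
    ]
```

For example, calling missing_vars(open(".env").read(), "u3pro") before starting the U3 Pro stack returns an empty list when all three image variables are set.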

4. Have the license file ready

License File Generation: We will manually provision a .jwt license file for your team to authenticate the container. The same license file is used for both the Universal Streaming and Universal-3 Pro Streaming stacks.

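Place your AssemblyAI license file in the current working directory as license.jwt. To keep it elsewhere, edit the license volume mount for the license-and-usage-proxy service in the relevant Compose file (docker-compose.yml for Universal Streaming, docker-compose.u3pro.yml for Universal-3 Pro Streaming) so that the host side points to your license file. In the docker-compose.yml shown in the Configuration section, LICENSE_FILE_PATH is the path inside the container (/var/aai_license.jwt), and it must match the container side of that mount.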

5. Start services

Start the stack you configured in step 3. Both stacks share the same streaming-api, load balancer, and license proxy — they differ only in the ASR backend. Stop one stack with docker compose down before starting the other.

Universal Streaming (English + Multilingual)

# Start all services
docker compose up -d

# View logs
docker compose logs -f

# Check service status
docker compose ps

# Stop services before switching stacks
docker compose down

Universal-3 Pro Streaming

# Start all services
docker compose -f docker-compose.u3pro.yml up -d

# View logs
docker compose -f docker-compose.u3pro.yml logs -f

# Check service status
docker compose -f docker-compose.u3pro.yml ps

# Stop services before switching stacks
docker compose -f docker-compose.u3pro.yml down

The ASR service containers include built-in model weights — no separate model download required. The Universal Streaming ASR services log "Ready to serve!" when ready (typically ~2 minutes). The U3 Pro ASR service logs "U3Pro ASR Server ready!" when ready (typically ~5 minutes).
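
If you script your deployment, one way to wait for readiness is to scan the container logs for the messages quoted above. The following is a minimal sketch under that assumption; the helper name is hypothetical, and the only facts it relies on are the two log strings and the service names from this page.

```python
# Hypothetical helper; the marker strings are the readiness log lines quoted above.
READY_MARKERS = {
    "streaming-asr-english": "Ready to serve!",
    "streaming-asr-multilang": "Ready to serve!",
    "streaming-asr-u3pro": "U3Pro ASR Server ready!",
}


def is_ready(service: str, log_text: str) -> bool:
    """Return True once the service's readiness marker appears in its log output."""
    return READY_MARKERS[service] in log_text
```

The log text could come from capturing the output of docker compose logs for the service in question (adding -f docker-compose.u3pro.yml for the U3 Pro stack) and passing it to is_ready in a polling loop.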

Configuration

The inline docker-compose.yml and nginx_streaming_asr.conf shown in this section are for the Universal Streaming stack. For Universal-3 Pro Streaming, see the upstream repo for docker-compose.u3pro.yml and the corresponding nginx routing.

Docker Compose configuration

The docker-compose.yml file defines the service architecture:

services:
  streaming-api:
    image: ${STREAMING_API_IMAGE}
    ports:
      - "8080:8080"
    environment:
      - AAI_WSS_PORT=8080
      - AAI_LOG_LEVEL=INFO
      - AAI_USE_STRUCTURED_LOGGING=False
      - AAI_ASR_ENDPOINT=streaming-asr-lb:80
      - AAI_USE_SECURE_CHANNEL_TO_ASR_SERVICE=False
      - AAI_LICENSE_AND_USAGE_PROXY_ENDPOINT=http://license-and-usage-proxy:8080
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/v3/health"]
      interval: 10s
      timeout: 2s
      retries: 2
      start_period: 5s
    depends_on:
      - streaming-asr-lb
      - license-and-usage-proxy
    networks:
      - streaming-network

  streaming-asr-lb:
    image: nginx:alpine
    ports:
      - "8081:80"
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:80/nginx_health"]
      interval: 10s
      timeout: 2s
      retries: 2
      start_period: 10s
    volumes:
      - ./nginx_streaming_asr.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - streaming-asr-english
      - streaming-asr-multilang
    networks:
      - streaming-network

  streaming-asr-english:
    image: ${STREAMING_ASR_ENGLISH_IMAGE}
    ports:
      - "50051:50051"
    environment:
      - SERVER_PORT=50051
      - LOGGING_LEVEL=INFO
      - USE_STRUCTURED_LOGGING=False
      - MAX_OPEN_STREAMS=48
    healthcheck:
      test: ["CMD", "grpc_health_probe", "-addr=:50051"]
      interval: 10s
      timeout: 2s
      retries: 5
      start_period: 120s
    networks:
      - streaming-network
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: ["gpu"]

  streaming-asr-multilang:
    image: ${STREAMING_ASR_MULTILANG_IMAGE}
    ports:
      - "50052:50051"
    environment:
      - SERVER_PORT=50051
      - LOGGING_LEVEL=INFO
      - USE_STRUCTURED_LOGGING=False
      - MAX_OPEN_STREAMS=48
    healthcheck:
      test: ["CMD", "grpc_health_probe", "-addr=:50051"]
      interval: 10s
      timeout: 2s
      retries: 5
      start_period: 120s
    networks:
      - streaming-network
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: ["gpu"]

  license-and-usage-proxy:
    image: ${LICENSE_AND_USAGE_PROXY_IMAGE}
    ports:
      - "8082:8080"
    environment:
      - HTTP_PORT=8080
      - LOGGING_LEVEL=INFO
      - USE_STRUCTURED_LOGGING=False
      - LICENSE_FILE_PATH=/var/aai_license.jwt
      - USAGE_TRACKING_API_KEY=${USAGE_TRACKING_API_KEY} # Set if license is for usage-based billing
    volumes:
      - ./license.jwt:/var/aai_license.jwt:ro
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:8080/health"]
      interval: 10s
      timeout: 2s
      retries: 2
      start_period: 10s
    networks:
      - streaming-network

networks:
  streaming-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16
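
The HTTP health endpoints in this Compose file can also be polled from the host, for example in a deployment script. The sketch below is illustrative only: the host ports (8080, 8081, 8082) and paths come from the port mappings and healthcheck commands above, while the function names, retry count, and delay are assumptions. The ASR containers use a gRPC health probe and are not covered here.

```python
import time
import urllib.request
from typing import Callable, Dict

# Host ports and paths taken from the port mappings and healthchecks in the Compose file.
HEALTH_URLS = {
    "streaming-api": "http://localhost:8080/v3/health",
    "streaming-asr-lb": "http://localhost:8081/nginx_health",
    "license-and-usage-proxy": "http://localhost:8082/health",
}


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200; any error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


def wait_until_healthy(
    urls: Dict[str, str],
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 30,      # assumed value, tune for your environment
    delay_s: float = 5.0,    # assumed value, tune for your environment
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bool]:
    """Probe each service until all are healthy or the attempts run out."""
    status = {name: False for name in urls}
    for attempt in range(attempts):
        for name, url in urls.items():
            if not status[name]:
                status[name] = probe(url)
        if all(status.values()):
            break
        if attempt < attempts - 1:
            sleep(delay_s)
    return status
```

Calling wait_until_healthy(HEALTH_URLS) returns a dictionary of service names to booleans, so a script can report which service failed to come up.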

Nginx configuration

The ASR load balancer in nginx_streaming_asr.conf uses header-based routing to direct requests to the appropriate model service based on the X-Model-Version header:

events { worker_connections 1024; }

http {
    access_log /dev/stdout;
    error_log /dev/stderr info;

    upstream streaming_asr_english { server streaming-asr-english:50051; }
    upstream streaming_asr_multilang { server streaming-asr-multilang:50051; }

    map $http_x_model_version $asr_backend {
        default streaming_asr_english;
        en-default streaming_asr_english;
        ml-default streaming_asr_multilang;
    }

    keepalive_timeout 10h;
    client_header_timeout 10h;
    send_timeout 10h;

    server {
        listen 80;
        http2 on;
        client_max_body_size 0;

        # ---- Health endpoint (NGINX itself) ----
        location = /nginx_health {
            access_log off;
            default_type text/plain;
            return 200 "OK\n";
        }

        location / {
            grpc_pass grpc://$asr_backend;
            grpc_connect_timeout 75s;
            grpc_read_timeout 10h;
            grpc_send_timeout 10h;
            grpc_socket_keepalive on;
        }
    }
}
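
To make the routing rule concrete, the map block above can be mirrored in a few lines of Python. This is only an illustration of the lookup nginx performs, not code that runs anywhere in the stack; the function name is hypothetical. Note that nginx matches string values in a map without regard to case, which the sketch imitates by lowercasing the header value.

```python
# Mirrors the `map $http_x_model_version $asr_backend` block in nginx_streaming_asr.conf.
ROUTES = {
    "en-default": "streaming-asr-english:50051",
    "ml-default": "streaming-asr-multilang:50051",
}
DEFAULT_BACKEND = "streaming-asr-english:50051"  # the map's `default` entry


def resolve_backend(model_version):
    """Return the upstream nginx would select for a given X-Model-Version value."""
    # nginx compares map string keys case-insensitively, so normalize the value first.
    key = (model_version or "").lower()
    return ROUTES.get(key, DEFAULT_BACKEND)
```

A missing or unrecognized header therefore falls through to the English ASR service, the same as en-default.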

Usage reporting configuration

The license-and-usage-proxy service supports two billing modes based on your AssemblyAI license:

Flat billing mode

If your license is configured for flat billing, usage tracking is disabled. No additional configuration is required.

Usage-based billing mode

If your license is configured for usage-based billing, the proxy will automatically report usage data to AssemblyAI’s usage tracker service. You must configure the following environment variable in the docker-compose.yml for the license-and-usage-proxy service:

environment:
  - USAGE_TRACKING_API_KEY=<your-api-key>

Important Notes:

  • For the API key, any key retrieved from the AssemblyAI dashboard can be used.
  • At startup, the proxy validates connectivity by registering with AssemblyAI’s https://usage-tracker.assemblyai.com.
  • If connectivity validation fails, the proxy will shut down.
  • Usage data is batched and reported every few seconds.
  • The proxy automatically retries failed requests several times.

Critical Behavior: If https://usage-tracker.assemblyai.com becomes unreachable and all retry attempts fail (after 5-60 minutes), the license-and-usage-proxy service will terminate itself. This is a fail-safe mechanism to ensure usage data integrity. Your service orchestrator should be configured to automatically replace the container with a new one.

Monitoring Recommendations:

  • Monitor the proxy’s logs for warnings about failed usage reporting attempts.
  • Set up alerts for proxy restarts, which may indicate persistent connectivity issues.
  • If the in-memory usage queue size exceeds 1000 items, the proxy will log a warning suggesting upscaling.

Service endpoints

  • WebSocket: ws://localhost:8080

Running the streaming example

A Python example script is provided to demonstrate how to stream a pre-recorded audio file to the self-hosted stack.

The example script below targets the Universal Streaming stack (--language en|multi, routing to streaming-asr-english or streaming-asr-multilang). For Universal-3 Pro Streaming, use the example script in the upstream repo’s streaming_example directory, which supports --speech-model u3-rt-pro for U3 Pro routing.

Note: You can initiate a session as soon as the relevant ASR containers are healthy. Universal Streaming containers (streaming-asr-english, streaming-asr-multilang) log "Ready to serve!" when ready; the U3 Pro container (streaming-asr-u3pro) logs "U3Pro ASR Server ready!" when ready.

Setup

Change to the streaming_example directory:

cd streaming_example

Create a fresh Python virtual environment and activate it:

python -m venv streaming_venv
source streaming_venv/bin/activate

Install the required packages:

pip install -r requirements.txt

Python script

Save this as example_with_prerecorded_audio_file.py:

"""
Example script for streaming audio to AssemblyAI's self-hosted streaming transcription API.
This is a minimal reference implementation for demonstration purposes only.
For production use cases, best practices, and the complete API specification, please visit https://www.assemblyai.com/docs
"""

import argparse
import json
import logging
import math
import os
import time
import wave
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional
from urllib.parse import urlencode

from websockets.sync.client import ClientConnection, connect

LOGGER = logging.getLogger(__name__)


@dataclass(frozen=True)
class AudioChunk:
    data: bytes
    duration_ms: int


def _validate_and_get_pcm16_raw_bytes(
    wav_file_path: str, expected_sample_rate: int
) -> bytes:
    """
    Validate that the WAV file is PCM16 encoded with the expected sample rate and extract raw audio data.

    :param wav_file_path: Path to the WAV file.
    :param expected_sample_rate: Expected sample rate (e.g., 16000).
    :return: Raw audio content as bytes.
    :raises ValueError: If the file is not PCM16 or doesn't match expected sample rate.
    """
    with wave.open(wav_file_path, "rb") as wav_file:
        # Check if it's PCM16
        if wav_file.getsampwidth() != 2:
            raise ValueError(
                f"Audio file must be 16-bit PCM. Found sample width: {wav_file.getsampwidth() * 8}-bit"
            )

        if wav_file.getcomptype() != "NONE":
            raise ValueError(
                f"Audio file must be uncompressed PCM. Found compression type: {wav_file.getcomptype()}"
            )

        # Check sample rate
        actual_sample_rate = wav_file.getframerate()
        if actual_sample_rate != expected_sample_rate:
            raise ValueError(
                f"Audio file must have sample rate of {expected_sample_rate} Hz. "
                f"Found: {actual_sample_rate} Hz"
            )

        # Check if mono
        if wav_file.getnchannels() != 1:
            raise ValueError(
                f"Audio file must be mono (1 channel). Found: {wav_file.getnchannels()} channels"
            )

        raw_audio = wav_file.readframes(wav_file.getnframes())

    return raw_audio


def _get_chunks_from_file(
    filepath: str,
    sample_rate: int,
    chunk_size_ms: int,
) -> List[AudioChunk]:
    """
    Read a PCM16 WAV file and split it into chunks.

    :param filepath: Path to the PCM16 WAV file.
    :param sample_rate: Expected sample rate of the audio file.
    :param chunk_size_ms: Duration of each chunk in milliseconds.
    :return: List of AudioChunk objects.
    :raises ValueError: If the file is not in the correct format.
    """
    chunks = []
    audio_bytes: bytes = _validate_and_get_pcm16_raw_bytes(filepath, sample_rate)

    read_bytes = 0
    while read_bytes < len(audio_bytes):
        frame_size = 2  # 16-bit PCM (2 bytes per sample)
        chunk_bytes_len = int(sample_rate * chunk_size_ms * frame_size // 1000)
        data = audio_bytes[read_bytes : read_bytes + chunk_bytes_len]
        read_bytes += len(data)
        actual_chunk_ms = math.ceil(len(data) * 1000 / (sample_rate * frame_size))
        chunks.append(AudioChunk(data=data, duration_ms=actual_chunk_ms))

    return chunks


def _write_to_ws(ws: ClientConnection, audio_chunks: List[AudioChunk]) -> None:
    """
    Write audio chunks to the WebSocket connection.

    :param ws: WebSocket connection.
    :param audio_chunks: List of audio chunks to send.
    """
    try:
        for chunk in audio_chunks:
            # Sleep for the chunk duration to send chunks with realtime rate
            time.sleep(chunk.duration_ms / 1000)
            ws.send(chunk.data)
        ws.send('{"type": "Terminate"}')
    except Exception as e:
        LOGGER.error(
            f"Exception occurred while writing to websocket: {e}", exc_info=True
        )
        ws.close()
        raise


def _read_from_ws(ws: ClientConnection) -> None:
    """
    Read and process messages from the WebSocket connection.

    :param ws: WebSocket connection.
    """
    try:
        for message in ws:
            data = json.loads(message)
            if "type" not in data:
                raise Exception(f"Unknown message received: {data}")
            elif data["type"] == "Turn":
                if data["words"]:
                    text = " ".join([word["text"] for word in data["words"]])
                    audio_start = data["words"][0]["start"]
                    audio_end = data["words"][-1]["end"]
                    end_of_turn = "True " if data["end_of_turn"] else "False"
                    LOGGER.info(
                        f"{timedelta(milliseconds=audio_start)}-"
                        f"{timedelta(milliseconds=audio_end)}, end-of-turn: {end_of_turn}: {text}",
                    )
            elif data["type"] == "Begin":
                expires_at = datetime.fromtimestamp(int(data["expires_at"]))
                LOGGER.info(
                    f"Session started. Session id: {data['id']}, expires at: {expires_at}",
                )
            elif data["type"] == "Termination":
                LOGGER.info(
                    f"Session completed with session duration: {data['session_duration_seconds']} sec.",
                )
            else:
                LOGGER.error(f"Unknown message type: {data}")
    except Exception as e:
        LOGGER.error(
            f"Exception occurred while reading from the websocket: {e}", exc_info=True
        )
        ws.close()
        raise


def run_session(
    api_endpoint: str,
    audio_chunks: List[AudioChunk],
    sample_rate: int,
    keyterms_prompt: Optional[List[str]] = None,
    language: Optional[str] = None,
) -> None:
    """
    Run a WebSocket session to stream audio and receive transcriptions.

    :param api_endpoint: WebSocket endpoint URL.
    :param audio_chunks: List of audio chunks to send.
    :param sample_rate: Sample rate of the audio.
    :param keyterms_prompt: Optional list of key terms for the transcription.
    :param language: Optional language code for transcription.
    """
    try:
        params = {
            "sample_rate": sample_rate,
        }
        if keyterms_prompt:
            params["keyterms"] = json.dumps(keyterms_prompt)
        if language:
            params["language"] = language

        endpoint_str = f"{api_endpoint}?{urlencode(params)}"
        headers = {"Authorization": "self-hosted"}
        LOGGER.info(f"Endpoint: {endpoint_str}")
        with ThreadPoolExecutor(max_workers=2) as executor:
            with connect(endpoint_str, additional_headers=headers) as websocket:
                write_future = executor.submit(
                    _write_to_ws,
                    websocket,
                    audio_chunks,
                )
                read_future = executor.submit(
                    _read_from_ws,
                    websocket,
                )
                write_future.result()
                read_future.result()
    except Exception as e:
        LOGGER.error(
            f"Exception occurred: {e}",
            exc_info=True,
        )
        raise


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
        description="Stream audio to AssemblyAI self-hosted real-time transcription service",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Basic usage with default endpoint
  python example_with_prerecorded_audio_file.py --audio-file example_audio_file.wav

  # Specify custom endpoint and language
  python example_with_prerecorded_audio_file.py --audio-file example_audio_file.wav --endpoint ws://localhost:8080 --language multi

Note: Audio file must be PCM 16-bit WAV format, mono channel, 16kHz sample rate.
        """,
    )
    parser.add_argument(
        "--audio-file",
        type=str,
        default=os.path.dirname(__file__) + os.path.sep + "example_audio_file.wav",
        help="Path to the audio file to transcribe (must be PCM 16-bit WAV, mono, 16kHz)",
    )
    parser.add_argument(
        "--endpoint",
        type=str,
        default="ws://localhost:8080",
        help="WebSocket endpoint URL (default: ws://localhost:8080)",
    )
    parser.add_argument(
        "--language",
        type=str,
        default="",
        help="Language code for transcription (e.g., 'multi')",
    )
    return parser.parse_args()


if __name__ == "__main__":
    try:
        args = parse_args()
        logging.basicConfig(level=logging.INFO, format="%(message)s")
        sample_rate = 16_000

        audio_chunks = _get_chunks_from_file(
            args.audio_file,
            sample_rate=sample_rate,
            chunk_size_ms=100,
        )
        run_session(
            api_endpoint=args.endpoint,
            audio_chunks=audio_chunks,
            sample_rate=sample_rate,
            language=args.language if args.language else None,
        )
    except KeyboardInterrupt:
        LOGGER.info("Interrupted by user, exiting.")
        exit(0)
    except ValueError as e:
        LOGGER.error(f"Audio file validation error: {e}")
        exit(1)

Usage

The example script (example_with_prerecorded_audio_file.py) requires a PCM 16-bit WAV file (mono channel, 16kHz sample rate).
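
If you do not have a file in that format at hand, the standard-library wave module can produce one. The sketch below writes a short sine tone as a mono, 16-bit PCM, 16 kHz WAV file; the function name and default values are assumptions for illustration. A tone contains no speech, so it will not yield a meaningful transcript. It is only useful for checking that the file passes the script's format validation and that the connection works end to end.

```python
import math
import struct
import wave


def write_test_tone(
    path: str,
    seconds: float = 1.0,
    freq_hz: float = 440.0,
    sample_rate: int = 16_000,
) -> None:
    """Write a mono, 16-bit PCM WAV file containing a sine tone."""
    n_frames = int(seconds * sample_rate)
    amplitude = 0.3 * 32767  # stay well below the 16-bit maximum to avoid clipping
    samples = (
        int(amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n_frames)
    )
    # "<h" packs each sample as a little-endian signed 16-bit integer.
    frames = b"".join(struct.pack("<h", s) for s in samples)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)  # 16 kHz by default
        wav_file.writeframes(frames)
```

For real recordings in another format, convert them to mono 16-bit PCM at 16 kHz with your audio tool of choice before running the example.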

Note on language parameter:

  • Use "en" or omit the --language parameter for English transcription (routes to English ASR service)
  • Use "multi" or any non-English language code for multilingual transcription (routes to multilingual ASR service)

Basic usage:

python example_with_prerecorded_audio_file.py --audio-file example_audio_file.wav

Example with multilingual transcription:

python example_with_prerecorded_audio_file.py \
  --audio-file example_audio_file.wav \
  --endpoint ws://localhost:8080 \
  --language multi

Command-line arguments:

Argument | Description | Default
--audio-file | Path to the audio file to transcribe (must be PCM 16-bit WAV, mono, 16kHz) | example_audio_file.wav
--endpoint | WebSocket endpoint URL | ws://localhost:8080
--language | Language code for transcription. Use "en" or omit the argument for English. Use "multi" for multilingual | "en"

View help:

python example_with_prerecorded_audio_file.py --help

Live microphone streaming example

This example demonstrates real-time microphone transcription using a remote self-hosted deployment. This is useful for testing your self-hosted instance from a local machine.

The script below routes by language (en → English ASR, anything else → Multilingual ASR) and only targets the Universal Streaming stack. For Universal-3 Pro Streaming, adapt the script to use the U3 Pro routing (speech_model=u3-rt-pro) or use the upstream repo’s example as a reference.

Setup

Install the required packages:

pip install websockets pyaudio

Python script

Save this as live_microphone_streaming.py:

import asyncio
import json

import pyaudio
# The asyncio client below takes `additional_headers`; it requires websockets >= 13.
from websockets.asyncio.client import connect

# Replace with your server's IP address or use 'localhost' for local testing
SERVER_IP = "your.server.ip.address"

SAMPLE_RATE = 16000
# PyAudio counts frames (samples), not bytes: 100 ms at 16 kHz is 1600 frames.
FRAMES_PER_CHUNK = 1600


async def stream_audio(language="en"):
    # Build WebSocket URL with query parameters
    params = f"sample_rate={SAMPLE_RATE}&language={language}"
    ws_url = f"ws://{SERVER_IP}:8080/v3/ws?{params}"

    # Add authorization header (required for self-hosted)
    headers = {"Authorization": "self-hosted"}

    print(f"Connecting to {ws_url}...")

    async with connect(ws_url, additional_headers=headers) as ws:
        print("Connected! Starting to stream audio...")

        # Set up audio stream from microphone
        p = pyaudio.PyAudio()
        stream = p.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=SAMPLE_RATE,
            input=True,
            frames_per_buffer=FRAMES_PER_CHUNK,  # 100ms chunks at 16kHz
        )

        print(f"\n🎤 Listening with language={language}... speak into your microphone!")
        print("Press Ctrl+C to stop\n")

        # Function to send audio
        async def send_audio():
            try:
                while True:
                    data = stream.read(FRAMES_PER_CHUNK, exception_on_overflow=False)
                    await ws.send(data)
                    await asyncio.sleep(0.1)  # 100ms chunks
            except KeyboardInterrupt:
                await ws.send(json.dumps({"type": "Terminate"}))
                print("\nStopping...")
            finally:
                stream.stop_stream()
                stream.close()
                p.terminate()

        # Function to receive transcripts
        async def receive_transcripts():
            try:
                async for message in ws:
                    data = json.loads(message)

                    if data.get("type") == "Begin":
                        print(f"✅ Session started! ID: {data.get('id')}")

                    elif data.get("type") == "Turn":
                        if data.get("words"):
                            text = " ".join([word["text"] for word in data["words"]])
                            end_of_turn = "[FINAL]" if data.get("end_of_turn") else ""
                            print(f"📝 {text} {end_of_turn}")

                    elif data.get("type") == "Termination":
                        print(f"✅ Session completed. Duration: {data.get('session_duration_seconds')}s")
                        break

            except Exception as e:
                print(f"Error receiving: {e}")

        # Run both tasks concurrently
        await asyncio.gather(send_audio(), receive_transcripts())


if __name__ == "__main__":
    import sys

    # Usage: python live_microphone_streaming.py [language]
    # Examples:
    #   python live_microphone_streaming.py en     # English
    #   python live_microphone_streaming.py multi  # Multilingual with auto-detect
    #   python live_microphone_streaming.py es     # Spanish
    language = sys.argv[1] if len(sys.argv) > 1 else "en"

    try:
        asyncio.run(stream_audio(language))
    except KeyboardInterrupt:
        print("\nStopped by user")

Usage

Basic usage (English):

python live_microphone_streaming.py

Multilingual transcription:

python live_microphone_streaming.py multi

Specific language (e.g., Spanish):

python live_microphone_streaming.py es

Note:

  • Make sure to replace SERVER_IP in the script with your actual server IP address
  • If testing locally on the same machine as the server, use localhost or 127.0.0.1
  • The Authorization: self-hosted header is required for all connections
  • Language routing: "en" routes to English ASR service, any other code (including "multi") routes to multilingual ASR service
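
When adapting either example, it helps to keep frames and bytes apart. PyAudio's frames_per_buffer and stream.read() count frames (one sample per channel), whereas the pre-recorded example slices raw bytes. The arithmetic for 16-bit mono audio can be sketched as follows; the helper names are hypothetical.

```python
def frames_per_chunk(sample_rate: int, chunk_ms: int) -> int:
    """Number of audio frames (samples per channel) in a chunk of chunk_ms milliseconds."""
    return sample_rate * chunk_ms // 1000


def bytes_per_chunk(
    sample_rate: int,
    chunk_ms: int,
    bytes_per_sample: int = 2,  # 16-bit PCM
    channels: int = 1,          # mono
) -> int:
    """Size in bytes of a chunk of chunk_ms milliseconds."""
    return frames_per_chunk(sample_rate, chunk_ms) * bytes_per_sample * channels
```

At 16 kHz, a 100 ms chunk is 1600 frames, which for 16-bit mono audio is 3200 bytes.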

Updating services

Model updates

To update to a new model version:

  1. Pull the new container images from ECR
  2. Update your .env file with the new image references
  3. Restart the services using Docker Compose

For the Universal Streaming stack:

docker compose down
docker compose up -d

For the Universal-3 Pro Streaming stack, pass the U3 Pro Compose file:

docker compose -f docker-compose.u3pro.yml down
docker compose -f docker-compose.u3pro.yml up -d

Monitoring and debugging

View service logs

# All services
docker compose logs -f

# Specific service
docker compose logs -f streaming-api
Check service status

# Container status
docker compose ps

# Resource usage
docker stats

Troubleshooting

Debug commands

For the Universal Streaming stack:

# Check nginx configuration
docker compose exec streaming-asr-lb nginx -t

# Restart specific service
docker compose restart streaming-api
docker compose restart streaming-asr-english
docker compose restart streaming-asr-multilang

For the Universal-3 Pro Streaming stack, pass the U3 Pro Compose file and use the U3 Pro service name:

# Check nginx configuration
docker compose -f docker-compose.u3pro.yml exec streaming-asr-lb nginx -t

# Restart specific service
docker compose -f docker-compose.u3pro.yml restart streaming-api
docker compose -f docker-compose.u3pro.yml restart streaming-asr-u3pro

Common issues

  • GPU not detected: Verify NVIDIA Container Toolkit is properly installed and Docker has GPU access.

  • Services not starting: Check logs for specific error messages using docker compose logs -f [service-name].

  • Connection refused: Ensure all services are healthy by checking docker compose ps and reviewing health check status.

Production Deployment Recommendations

streaming-api service

  • Deployment Strategy: We recommend doing Blue/Green deployments to avoid disrupting ongoing sessions. Once you fully shift the traffic to the new color, wait at least 3 hours (the max session duration) before shutting down the old color to ensure no sessions get disrupted.
  • Resource Allocation: We recommend allocating 1 CPU per container with at least 2GB of RAM for better hardware utilization. For example, it’s better to have 4 containers with 1 CPU and 2GB RAM each rather than 1 container with 4 CPU and 8GB RAM.
  • Autoscaling: We recommend setting up autoscaling based on the number of active sessions. A container with 1 CPU can generally handle around 32 concurrent sessions.
  • Monitoring: Always monitor the logs during deployment to catch any potential issues early.
  • Dependencies: For successful startup, the service depends on the license-and-usage-proxy service being up and running.
  • Configuration: You can enable features like TLS encryption and structured logging via environment variables.
  • Health Checks: Use the healthcheck command provided in the docker-compose.yml to monitor container health.
  • Usage Reporting Behavior: After each session completes, the streaming-api reports usage to the license-and-usage-proxy with automatic retries on failure. Monitor the logs for any messages at warning level or higher.
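
The autoscaling guidance above translates into simple arithmetic. The sketch below estimates how many 1-CPU streaming-api containers a given number of active sessions calls for, using the figure of roughly 32 concurrent sessions per container. The function name and the optional headroom parameter are assumptions added for illustration; measure your own workload before relying on the numbers.

```python
import math

# Approximate capacity of a 1-CPU streaming-api container, from the guidance above.
SESSIONS_PER_CONTAINER = 32


def containers_needed(
    active_sessions: int,
    sessions_per_container: int = SESSIONS_PER_CONTAINER,
    headroom: float = 0.0,  # hypothetical safety margin, e.g. 0.25 for 25% spare capacity
) -> int:
    """Estimate the container count for a session load, never going below one."""
    target = active_sessions * (1 + headroom)
    return max(1, math.ceil(target / sessions_per_container))
```

For example, 100 active sessions with no headroom works out to 4 containers, each with 1 CPU and 2GB of RAM per the resource allocation recommendation.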

license-and-usage-proxy service

  • Deployment Strategy: Do gradual rollouts to ensure stability. Consider implementing monitoring and alerting for service restarts.
  • Resource Allocation: We recommend allocating 1 CPU per container with at least 2GB of RAM for better hardware utilization. For example, it’s better to have 4 containers with 1 CPU and 2GB RAM each rather than 1 container with 4 CPU and 8GB RAM.
  • Monitoring: Always monitor logs during deployment to catch potential issues early. Set up an alert on the responses of the /v1/status endpoint to catch license issues. For usage-based billing, also monitor for usage reporting warnings and service restarts.
  • Dependencies:
    • For successful startup, the service requires a valid license file mounted into the container filesystem. To mount it, set the LICENSE_FILE_PATH environment variable to the license file path on the host machine.
    • For usage-based billing, the service also requires connectivity to https://usage-tracker.assemblyai.com at startup. If connectivity validation fails, the container will terminate. Ensure the USAGE_TRACKING_API_KEY environment variable is properly configured.
  • Health Checks: Use the healthcheck command provided in the docker-compose.yml to monitor container health.
  • Usage Reporting Resilience:
    • Network connectivity to the https://usage-tracker.assemblyai.com endpoint must be reliable for production deployments with usage-based billing.
    • Run multiple containers behind a load balancer for high availability.

License Status Endpoint

The /v1/status endpoint provides real-time information about the license validation state:

Endpoint: GET /v1/status

Response Schema:

{
  "state": "Ready | Connected | TrustBased | Failed",
  "last_successful_checkin": "2025-01-01T12:00:00.000000Z",
  "trust_expiration": "2025-01-05T12:00:00.000000Z"
}

State Descriptions:

  • Ready: Initial state when the service starts before any license validation has occurred.
  • Connected: Last license validation check was successful.
  • TrustBased: Last license validation check failed, but the request was within the trust window grace period, so services will remain operational.
  • Failed: Last license validation check failed and the trust window has expired. streaming-api containers will shut down and stop serving requests.

Fields:

  • state: Current license validation state.
  • last_successful_checkin: ISO 8601 timestamp of the last successful license validation (null if never successful).
  • trust_expiration: ISO 8601 timestamp when the trust window expires (null if no successful validation yet).

Recommended Alerts:

  • Alert when state transitions to TrustBased (indicates license validation issues).
  • Critical alert when state is Failed (services will shut down).
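
The recommended alerts can be derived directly from the /v1/status response. The sketch below is one possible approach, not an official client: the severity names, the helper functions, and the URL you pass on the command line are assumptions, and only the response fields and states come from the schema above.

```python
import json
import sys
import urllib.request
from datetime import datetime, timezone
from typing import Optional

# Mapping from license state to an alert severity (severity names are illustrative).
SEVERITY_BY_STATE = {
    "Ready": "ok",
    "Connected": "ok",
    "TrustBased": "warning",
    "Failed": "critical",
}


def alert_severity(state: str) -> str:
    """Translate a license state into an alert severity."""
    return SEVERITY_BY_STATE.get(state, "unknown")


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp with a trailing Z; null stays None."""
    if value is None:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def trust_seconds_remaining(status: dict, now: datetime) -> Optional[float]:
    """Seconds until the trust window expires, or None if it is not set."""
    expiration = parse_timestamp(status.get("trust_expiration"))
    if expiration is None:
        return None
    return (expiration - now).total_seconds()


def fetch_status(url: str, timeout: float = 5.0) -> dict:
    """GET the status endpoint and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Pass the full URL of your proxy's /v1/status endpoint as the first argument.
    status = fetch_status(sys.argv[1])
    remaining = trust_seconds_remaining(status, datetime.now(timezone.utc))
    print(alert_severity(status["state"]), remaining)
```

Reporting the remaining trust window alongside the severity lets an on-call engineer see how long services will stay operational while in the TrustBased state.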

streaming-asr-english and streaming-asr-multilang services

  • Deployment Strategy: Do gradual rollouts to ensure stability. Both Blue/Green and rolling deployments are good strategies, as the streaming-api can reconnect to a new streaming-asr container if a persistent connection gets disrupted with minimal state loss.
  • Hardware Requirements: The services can run on NVIDIA T4 or newer GPUs. We recommend allocating at least 4 CPU and 16GB of RAM per container.
  • Autoscaling: You can set up autoscaling based on the number of active sessions. A container with recommended hardware can generally handle up to 48 concurrent sessions.
  • Monitoring: Always monitor logs during deployment to catch any potential issues early.
  • Health Checks: Use the healthcheck command provided in the docker-compose.yml to monitor container health.
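
The per-container session figures in this section translate into a simple capacity calculation. The sketch below assumes the figures quoted in this guide (around 32 sessions per 1-CPU streaming-api container, up to 48 per ASR container on the recommended hardware); the headroom parameter is an illustrative addition, and actual capacity should be confirmed with measurements in your environment.

```python
import math

# Figures quoted in this guide; treat them as starting points, not guarantees.
SESSIONS_PER_API_CONTAINER = 32
SESSIONS_PER_ASR_CONTAINER = 48


def containers_needed(active_sessions: int,
                      sessions_per_container: int,
                      headroom: float = 0.0) -> int:
    """Containers required for a session count, with optional spare capacity.

    headroom=0.25 provisions for 25% more sessions than currently active.
    """
    if sessions_per_container <= 0:
        raise ValueError("sessions_per_container must be positive")
    target = max(0, active_sessions) * (1.0 + headroom)
    return max(1, math.ceil(target / sessions_per_container))


if __name__ == "__main__":
    sessions = 100
    print("streaming-api containers:", containers_needed(sessions, SESSIONS_PER_API_CONTAINER))
    print("streaming-asr containers:", containers_needed(sessions, SESSIONS_PER_ASR_CONTAINER))
```

Because the 48-session figure is stated as an upper bound, scaling the ASR tier with some headroom is likely the safer choice.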

streaming-asr-u3pro service

  • Hardware Requirements: Universal-3 Pro Streaming requires NVIDIA L4 / A10 / A100 / L40S / H100 or equivalent with at least 24 GB VRAM. T4 GPUs are not sufficient. See the v0.6.0 changelog entry for details.
  • For all other operational guidance (deployment strategy, autoscaling, monitoring, health checks), see the streaming-self-hosting-stack repo.

Changelog

v0.6.0

Universal-3 Pro Streaming — new self-hosted stack

This release introduces the Universal-3 Pro Streaming self-hosted stack via a separate docker-compose.u3pro.yml file. U3 Pro is targeted at voice agent scenarios and delivers significant improvements over the universal English model on complex entities, short utterances, and end-of-turn (EOT) latency.

Highlights of U3 Pro behavior delivered with this release:

  • 22% reduction in voice agent hallucinations
  • 10% reduction in voice agent WER
  • 29% reduction in voice agent short-utterance error rate
  • 5% reduction in medical WER
  • Continuous partials during long turns — partials are emitted incrementally instead of being delayed; turns now stitch up to 60s instead of hard-cutting at 16s/32s.
  • 750ms early partial of detected speech for snappier voice agent UX.

Hardware: NVIDIA L4 / A10 / A100 / L40S / H100 (24 GB+ VRAM).

Streaming API — new features

  • continuous_partials query parameter — clients can opt into continuous partials during long turns.
  • Structured logging — both the U3 Pro ASR server and the universal ASR server now honor USE_STRUCTURED_LOGGING, matching the streaming-api behavior.

Other improvements

  • Various logging and metrics improvements across the streaming-api and ASR services.
  • Bug fixes and stability improvements.

v0.5.0

English ASR model

This release ships a new English model that produces formatted output directly and delivers large quality gains on digits, telephony, medical, and CI segments:

  • 34% improvement on digit sequence error rate (DSER)
  • 17% improvement on telephony WER
  • 12% average improvement on medical WER
  • 10% average improvement on CI segments WER
  • ~2.4% absolute F1 score improvement on keyterms prompting
  • Significantly improved timestamp accuracy — resolves overlapping and zero-duration word issues.

Multilingual ASR model

  • ~70% absolute improvement in timestamp accuracy — fixes overlapping words and zero-duration word bugs.

Streaming API — new features

  • Error and Warning WebSocket message types — Dedicated message types that let clients distinguish actionable errors from non-fatal warnings without relying on close codes.
  • Configuration echoed in SessionBegins — The SessionBegins message now includes the resolved session configuration so clients can verify applied settings.
  • Explicit speech-model selection — Clients explicitly select the speech model at session start.

Streaming API — fixes and improvements

  • More specific WebSocket close codes for session termination scenarios, making client-side error handling more precise.
  • Improved word_finalized events — All word finalizations are emitted (not only the last word of a turn).

Other improvements

  • Various logging, metrics, and observability improvements across the streaming-api and ASR services.
  • Bug fixes and stability improvements.

v0.4.0

English ASR Model

Major improvements to short utterance handling and hallucination reduction:

  • 100% reduction in hallucinations
  • 12.8% improvement on short utterances - Better performance for voice agent use cases
  • 7.39% improvement on digit sequence error rate
  • 1.75% improvement on proper nouns
  • 0.46% improvement on CI segments
  • 0.39% improvement on accented speech

Multilingual ASR Model

  • Context biasing support - Customers can now use context biasing (model-based biasing) with the multilingual model

Other Improvements

  • Increased concurrent session handling per container, leading to reduced deployment costs
  • Improved observability for the license-and-usage-proxy service
  • Various bug fixes and stability improvements

Current limitations

As a design partner, please be aware of these current limitations:

  • Manual credential provisioning (no self-service dashboard yet)
  • Docker Compose deployment example only (production orchestration templates coming later)

Design partner support

What we provide

  • Docker Compose configuration file
  • Manual credential provisioning
  • Direct engineering support for deployment
  • Regular model updates

What we need from you

  • Feedback on deployment experience
  • Performance metrics in your environment
  • Feature requests and prioritization input
  • Use case validation

AWS deployment guide

This section provides step-by-step instructions for deploying the self-hosted streaming solution on AWS EC2, designed for users who may not be familiar with AWS infrastructure.

AWS prerequisites

Before you begin, ensure you have:

  • An AWS account with billing enabled
  • AWS CLI installed and configured on your local machine
  • Basic familiarity with SSH and command-line operations

EC2 instance setup

1. Request GPU quota increase

By default, AWS accounts have limited or zero quota for GPU instances. You’ll need to request an increase:

  1. Navigate to the AWS Service Quotas console
  2. Search for “EC2”
  3. Find “Running On-Demand G and VT instances” (for g4dn, g5, or similar GPU instances)
  4. Click “Request quota increase”
  5. Request at least 4 vCPUs (minimum for a g4dn.xlarge instance)
  6. Provide a use case description: “Self-hosted AI transcription service requiring GPU acceleration”
  7. Submit the request

Note: Quota requests typically take 24-48 hours to process. Plan accordingly.

2. Choose the right instance type

Recommended instance types based on your needs:

| Instance Type | vCPUs | GPU | Memory | Use Case | Approximate Cost/Hour |
|---|---|---|---|---|---|
| g4dn.xlarge | 4 | 1x T4 (16GB) | 16 GB | Universal Streaming: Development/Testing | ~$0.526 |
| g4dn.2xlarge | 8 | 1x T4 (16GB) | 32 GB | Universal Streaming: Light Production | ~$0.752 |
| g5.xlarge | 4 | 1x A10G (24GB) | 16 GB | Universal Streaming: Production / U3 Pro minimum | ~$1.006 |
| g5.2xlarge | 8 | 1x A10G (24GB) | 32 GB | U3 Pro: Production | ~$1.212 |

Recommendation:

  • For Universal Streaming, start with g4dn.xlarge for evaluation, then scale to g4dn.2xlarge or g5 instances for production workloads.
  • For Universal-3 Pro Streaming, you need a GPU with at least 24 GB VRAM. Use g5.xlarge at minimum (A10G, 24 GB) and g5.2xlarge for production. T4-based instances (g4dn) are not sufficient for U3 Pro. See the v0.6.0 changelog entry for the full hardware requirements.
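
To compare options, the instance table can be encoded as data. This is a rough sketch: the prices and VRAM sizes are copied from the approximate figures in this guide, actual AWS pricing varies by region and over time, and the helper names are illustrative.

```python
# Approximate on-demand prices (USD/hour) and GPU memory from the table in this guide.
HOURLY_COST_USD = {
    "g4dn.xlarge": 0.526,
    "g4dn.2xlarge": 0.752,
    "g5.xlarge": 1.006,
    "g5.2xlarge": 1.212,
}
GPU_VRAM_GB = {
    "g4dn.xlarge": 16,
    "g4dn.2xlarge": 16,
    "g5.xlarge": 24,
    "g5.2xlarge": 24,
}

# Universal-3 Pro Streaming requires at least 24 GB of VRAM.
U3PRO_MIN_VRAM_GB = 24


def supports_u3pro(instance_type: str) -> bool:
    """True if the instance's GPU meets the U3 Pro VRAM requirement."""
    return GPU_VRAM_GB[instance_type] >= U3PRO_MIN_VRAM_GB


def estimated_cost(instance_type: str, hours: float) -> float:
    """Estimated on-demand cost in USD for running the instance for `hours`."""
    return HOURLY_COST_USD[instance_type] * hours


if __name__ == "__main__":
    for name in HOURLY_COST_USD:
        # 730 hours approximates one month of continuous use.
        print(f"{name}: U3 Pro capable={supports_u3pro(name)}, "
              f"~${estimated_cost(name, 730):.2f}/month")
```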

3. Launch the EC2 instance

3.1 Navigate to the EC2 console and click “Launch Instance”

3.2 Configure instance settings:

  • Name: assemblyai-self-hosted-streaming
  • AMI: Search for and select “AWS Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04)”
    • AMI ID format: ami-xxxxxxxxx (varies by region)
    • This AMI includes pre-installed NVIDIA drivers, CUDA toolkit, and Docker with GPU support
  • Instance type: Select the instance type that matches your stack from the recommendation table above. For Universal Streaming, g4dn.xlarge is a reasonable starter; for Universal-3 Pro Streaming, use g5.xlarge or larger.
  • Key pair: Create a new key pair or select an existing one
    • If creating new: Download the .pem file and save it securely
    • Set permissions: chmod 400 your-key.pem

3.3 Configure storage:

  • Root volume: Increase to at least 100 GB gp3 (model weights and containers require significant space)
  • The default 8 GB is insufficient

3.4 Configure security group (Network settings):

Create a new security group with the following inbound rules:

| Type | Protocol | Port Range | Source | Description |
|---|---|---|---|---|
| SSH | TCP | 22 | Your IP or 0.0.0.0/0 | SSH access for management |
| Custom TCP | TCP | 8080 | Your IP or 0.0.0.0/0 | WebSocket endpoint |
| Custom TCP | TCP | 8081 | Your IP or 0.0.0.0/0 | Health check endpoint (optional) |

Security recommendations:

  • For production: Restrict Source to your specific IP addresses or VPC CIDR ranges
  • For development/testing: You can use 0.0.0.0/0 but understand this allows public access
  • Consider using AWS VPN or Direct Connect for enhanced security
  • Enable AWS CloudTrail for audit logging
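
Before moving to production, it can help to audit the inbound rules for sources that are open to the world. The sketch below works on a hand-written description of the rules; the InboundRule type and the example CIDR are hypothetical, and nothing here calls the AWS API.

```python
import ipaddress
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InboundRule:
    port: int
    source_cidr: str
    description: str


def is_open_to_world(rule: InboundRule) -> bool:
    """True if the source covers every address (0.0.0.0/0 or ::/0)."""
    network = ipaddress.ip_network(rule.source_cidr, strict=False)
    return network.prefixlen == 0


def audit(rules: List[InboundRule]) -> List[str]:
    """Return one warning per rule that allows public access."""
    return [
        f"Port {rule.port} ({rule.description}) is open to {rule.source_cidr}"
        for rule in rules
        if is_open_to_world(rule)
    ]


if __name__ == "__main__":
    # 203.0.113.7 is a documentation-range address used as a stand-in for your IP.
    rules = [
        InboundRule(22, "203.0.113.7/32", "SSH access for management"),
        InboundRule(8080, "0.0.0.0/0", "WebSocket endpoint"),
        InboundRule(8081, "0.0.0.0/0", "Health check endpoint"),
    ]
    for warning in audit(rules):
        print(warning)
```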

3.5 Launch the instance and wait for it to reach “Running” state

4. Connect to your EC2 instance

$# Replace with your instance's public IP and key file
$ssh -i your-key.pem ubuntu@<EC2_PUBLIC_IP>

5. Verify GPU and Docker setup

Once connected, verify the pre-installed components:

$# Verify NVIDIA drivers
$nvidia-smi
$
$# Verify Docker
$docker --version
$
$# Verify Docker Compose (v2 syntax)
$docker compose version
$
$# If the above fails, you may need to install Docker Compose v2
$# Remove old version if present
$sudo apt-get remove docker-compose
$
$# Install Docker Compose v2 (plugin)
$sudo apt-get update
$sudo apt-get install docker-compose-plugin
$
$# Verify installation
$docker compose version
$
$# Verify GPU access in Docker
$docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

Important: This setup uses Docker Compose v2, which uses the command docker compose (space, no hyphen) instead of the older docker-compose (hyphen). All commands in this guide use the v2 syntax.
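
The verification commands above can be wrapped in a small preflight script. This is a minimal sketch that only reports whether each command exits successfully; the script and its names are illustrative, and it skips the containerized GPU check because that pulls an image.

```python
import shutil
import subprocess
from typing import Dict, List

# Commands taken from the verification steps in this guide.
CHECKS: Dict[str, List[str]] = {
    "nvidia-smi": ["nvidia-smi"],
    "docker": ["docker", "--version"],
    "docker compose": ["docker", "compose", "version"],
}


def run_check(command: List[str]) -> bool:
    """True if the executable exists on PATH and exits with status 0."""
    if shutil.which(command[0]) is None:
        return False
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=30, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def preflight() -> Dict[str, bool]:
    """Run every check and return a name -> passed mapping."""
    return {name: run_check(command) for name, command in CHECKS.items()}


if __name__ == "__main__":
    for name, passed in preflight().items():
        print(f"{name}: {'OK' if passed else 'MISSING OR FAILING'}")
```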

6. Configure AWS credentials on the instance

Set up AWS credentials to pull container images from ECR:

$# Install AWS CLI if not already installed
$sudo apt-get update
$sudo apt-get install -y awscli
$
$# Configure AWS credentials (use the credentials provided by AssemblyAI)
$aws configure

You’ll be prompted to enter:

  • AWS Access Key ID
  • AWS Secret Access Key
  • Default region: us-west-2
  • Default output format: json

7. Deploy the self-hosted streaming solution

Follow the standard deployment instructions from the “Setup and deployment” section above: complete the common setup steps first, then deploy the stack you want to run.

Common setup (both stacks):

$# Authenticate with ECR
$aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 344839248844.dkr.ecr.us-west-2.amazonaws.com
$
$# Create project directory
$mkdir -p ~/assemblyai-streaming
$cd ~/assemblyai-streaming

Universal Streaming (English + Multilingual)

$# Create .env file with image references
$cat > .env << 'EOF'
$STREAMING_API_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-api:release-v0.6.0
$STREAMING_ASR_ENGLISH_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-asr-english:release-v0.6.0
$STREAMING_ASR_MULTILANG_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-asr-multilang:release-v0.6.0
$LICENSE_AND_USAGE_PROXY_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-license-and-usage-proxy:release-v0.6.0
$USAGE_TRACKING_API_KEY=<your_usage_tracking_api_key_here>
$EOF
$
$# Create docker-compose.yml file
$# Copy the complete docker-compose.yml content from the Configuration section above and save it
$# Or download it from the GitHub repository
$
$# Create nginx configuration file
$# Copy the nginx_streaming_asr.conf content from the Configuration section above and save it
$# Or download it from the GitHub repository
$
$# Start services
$docker compose up -d
$
$# Monitor logs (services may take 2-3 minutes to fully start)
$docker compose logs -f

Universal-3 Pro Streaming

$# Create .env file with image references
$cat > .env << 'EOF'
$STREAMING_API_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-api:release-v0.6.0
$LICENSE_AND_USAGE_PROXY_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-license-and-usage-proxy:release-v0.6.0
$STREAMING_ASR_U3PRO_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-asr-u3-pro:release-v0.6.0
$USAGE_TRACKING_API_KEY=<your_usage_tracking_api_key_here>
$EOF
$
$# Create docker-compose.u3pro.yml file
$# Copy the docker-compose.u3pro.yml content from the upstream repo and save it
$
$# Create nginx configuration file
$# Copy the nginx_streaming_asr.conf content from the Configuration section above and save it
$# Or download it from the GitHub repository
$
$# Start services
$docker compose -f docker-compose.u3pro.yml up -d
$
$# Monitor logs (services may take ~5 minutes to fully start)
$docker compose -f docker-compose.u3pro.yml logs -f
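
A missing variable or a leftover placeholder in the .env file is an easy mistake to make before running docker compose up. The sketch below checks for the variables used in the two .env examples above; the stack labels ("universal", "u3pro") and the helper names are illustrative, and the parser handles only simple KEY=VALUE lines.

```python
import sys
from typing import Dict, List

# Variables taken from the .env examples in this guide.
REQUIRED_KEYS: Dict[str, List[str]] = {
    "universal": [
        "STREAMING_API_IMAGE",
        "STREAMING_ASR_ENGLISH_IMAGE",
        "STREAMING_ASR_MULTILANG_IMAGE",
        "LICENSE_AND_USAGE_PROXY_IMAGE",
        "USAGE_TRACKING_API_KEY",
    ],
    "u3pro": [
        "STREAMING_API_IMAGE",
        "LICENSE_AND_USAGE_PROXY_IMAGE",
        "STREAMING_ASR_U3PRO_IMAGE",
        "USAGE_TRACKING_API_KEY",
    ],
}


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(env: Dict[str, str], stack: str) -> List[str]:
    """Required keys that are absent, empty, or still a <placeholder>."""
    return [
        key
        for key in REQUIRED_KEYS[stack]
        if not env.get(key) or env[key].startswith("<")
    ]


if __name__ == "__main__":
    # Usage: python check_env.py universal|u3pro [path-to-.env]
    stack = sys.argv[1]
    path = sys.argv[2] if len(sys.argv) > 2 else ".env"
    with open(path, encoding="utf-8") as handle:
        problems = missing_keys(parse_env(handle.read()), stack)
    print("OK" if not problems else "Missing or unset: " + ", ".join(problems))
```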

Important startup notes:

  • Universal Streaming ASR services (streaming-asr-english, streaming-asr-multilang) take approximately 2-3 minutes to fully initialize and log "Ready to serve!" when ready.
  • The U3 Pro ASR service (streaming-asr-u3pro) takes approximately 5 minutes to fully initialize and logs "U3Pro ASR Server ready!" when ready.
  • Health checks may show “unhealthy” during startup — this is normal.
  • Wait until the relevant ASR service(s) show their ready log line before attempting to use the API.
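
Waiting for the ready log lines can be automated by scanning the compose log output. The sketch below assumes each log line is prefixed with a container name that contains the service name, followed by a "|" separator, which is how docker compose logs typically formats output; the helper name is illustrative.

```python
import sys
from typing import Iterable, Set

# Ready markers quoted in the startup notes above.
READY_MARKERS = {
    "streaming-asr-english": "Ready to serve!",
    "streaming-asr-multilang": "Ready to serve!",
    "streaming-asr-u3pro": "U3Pro ASR Server ready!",
}


def ready_services(log_lines: Iterable[str], services: Iterable[str]) -> Set[str]:
    """Return the subset of `services` whose ready marker appears in the logs."""
    wanted = set(services)
    ready: Set[str] = set()
    for line in log_lines:
        prefix, separator, message = line.partition("|")
        if not separator:
            continue  # not a prefixed compose log line
        for service in wanted - ready:
            if service in prefix and READY_MARKERS[service] in message:
                ready.add(service)
    return ready


if __name__ == "__main__":
    # Usage: docker compose logs --no-color | python check_ready.py streaming-asr-english ...
    expected = sys.argv[1:]
    found = ready_services(sys.stdin, expected)
    for service in expected:
        print(f"{service}: {'ready' if service in found else 'not ready yet'}")
```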

8. Test the deployment

From your local machine, test the connection using the live microphone example (see the Live microphone streaming example section above).

Important: Replace SERVER_IP in the example script with your EC2 instance’s public IP address, which you can find in the EC2 console under your instance details.
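
Before running the microphone example, a quick reachability check can rule out security group or startup problems. The sketch below only confirms that a TCP connection to the port succeeds; it does not perform a WebSocket handshake or send the Authorization header, and the default port is taken from the security group table in this guide.

```python
import socket
import sys


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Usage: python check_port.py <EC2_PUBLIC_IP> [port]
    host = sys.argv[1]
    port = int(sys.argv[2]) if len(sys.argv) > 2 else 8080
    state = "reachable" if port_is_open(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```

If the port is not reachable, the security group rules and the service logs are the first things to check.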

AWS cost optimization tips

  • Use Spot Instances: Save up to 70% for non-critical workloads (may be interrupted)
  • Stop instances when not in use: GPU instances are expensive; stop them during off-hours
  • Use CloudWatch alarms: Set up billing alerts to avoid unexpected costs
  • Consider Reserved Instances: Save up to 60% with 1 or 3-year commitments for production workloads
  • Right-size your instance: Monitor GPU utilization and downgrade if consistently underutilized

Security best practices

  1. Enable AWS Systems Manager Session Manager for SSH-less access
  2. Use IAM roles instead of hardcoded credentials where possible
  3. Enable VPC Flow Logs for network monitoring
  4. Regular security updates: sudo apt update && sudo apt upgrade -y
  5. Use AWS Secrets Manager to store sensitive configuration
  6. Enable EBS encryption for data at rest
  7. Configure CloudWatch Logs for centralized logging
  8. Implement least privilege access with security groups and NACLs

Troubleshooting AWS-specific issues

Issue: “InsufficientInstanceCapacity” error when launching

  • Solution: Try a different availability zone within your region or a different instance type

Issue: Quota request denied or pending

  • Solution: Contact AWS Support through the console with your use case details

Issue: Cannot connect to EC2 instance

  • Solution: Verify security group allows SSH (port 22) from your IP
  • Solution: Check that you’re using the correct key pair and username (ubuntu for Ubuntu AMIs)

Issue: Docker containers fail to start with GPU errors

  • Solution: Verify NVIDIA Container Toolkit is properly configured
  • Solution: Check that the instance type has GPU resources

Issue: Services show “unhealthy” status

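  • Solution: ASR services need time to initialize. Universal Streaming ASR services take approximately 2-3 minutes and log “Ready to serve!”; the U3 Pro ASR service takes approximately 5 minutes and logs “U3Pro ASR Server ready!”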
  • Solution: Health checks may fail during startup - this is normal and will resolve once services are ready

Issue: Connection refused when testing from local machine

  • Solution: Ensure you’re using the instance’s public IP address, not the private IP
  • Solution: Verify security group allows inbound traffic on port 8080 from your IP
  • Solution: Check that services are fully started with docker compose logs -f

Issue: “Authorization” header missing error

  • Solution: All WebSocket connections must include the header Authorization: self-hosted

Issue: Need to transfer files to EC2 instance (e.g., audio files)

  • Solution: Use SCP from your local machine:
    $scp -i your-key.pem local-file.wav ubuntu@<EC2_PUBLIC_IP>:~/destination/

Issue: High costs

  • Solution: Stop the instance when not in use
  • Solution: Review CloudWatch metrics to ensure you’re using the right instance size