# Node.js voice agent with AssemblyAI Universal-3 Pro Streaming
Build a real-time voice agent in Node.js using the AssemblyAI Universal-3 Pro Streaming model (u3-rt-pro) for speech-to-text — no Python required, no heavy framework dependencies.
Two modes in one repo:

- Terminal agent (`src/agent.js`) — mic input via `mic`, plays TTS audio in your terminal
- Browser server (`src/server.js`) — Node.js WebSocket server with a browser UI using `getUserMedia`
## Why AssemblyAI Universal-3 Pro for Node.js?
AssemblyAI's neural turn detection eliminates the need for a separate VAD library. The model uses both acoustic and linguistic signals to detect when a speaker has finished — not just when they've gone silent.
## Quick start

```bash
git clone https://github.com/kelseyefoster/voice-agent-nodejs-assemblyai
cd voice-agent-nodejs-assemblyai
npm install
cp .env.example .env
# Edit .env with your API keys
```
### Terminal agent

```bash
npm start
# Speak into your mic — Ctrl+C to quit
```
### Browser agent

```bash
npm run server
# Open http://localhost:3000
```

## AssemblyAI WebSocket URL
```javascript
const AAI_WS_URL =
  `wss://streaming.assemblyai.com/v3/ws` +
  `?speech_model=u3-rt-pro` +
  `&encoding=pcm_s16le` +
  `&sample_rate=16000` +
  `&end_of_turn_confidence_threshold=0.4` +
  `&min_end_of_turn_silence_when_confident=300` +
  `&max_turn_silence=1500` +
  `&token=${ASSEMBLYAI_API_KEY}`;
```
## Turn detection
AssemblyAI v3 uses three event types. Handle them like this:

```javascript
ws.on("message", async (data) => {
  const msg = JSON.parse(data.toString());

  if (msg.type === "Begin") {
    console.log(`Session: ${msg.id}`);
  }

  // Partial transcript — overwrite the current terminal line in place
  if (msg.type === "Turn" && !msg.end_of_turn) {
    process.stdout.write(`\r${msg.transcript}`);
  }

  // Finalized turn — generate and speak a reply
  if (msg.type === "Turn" && msg.end_of_turn) {
    const reply = await generateResponse(msg.transcript);
    await speak(reply);
  }
});
```
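The handler above assumes two helpers, `generateResponse` and `speak`. Here is a minimal sketch of `generateResponse` using Node 18+'s built-in `fetch` against the OpenAI Chat Completions endpoint (the repo's `openai` SDK is the more ergonomic equivalent). The model name, system prompt, and history handling are illustrative assumptions; `speak` (TTS played via `afplay`/`aplay` per the requirements below) is left out.

```javascript
// Per-session conversation history; the system prompt is illustrative.
const history = [
  { role: "system", content: "You are a concise voice assistant." },
];

// Sketch: one LLM round-trip per finalized turn. Uses Node 18+'s built-in
// fetch; the model name is an assumption, not something the repo pins.
async function generateResponse(userText) {
  history.push({ role: "user", content: userText });
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "gpt-4o-mini", messages: history }),
  });
  const json = await res.json();
  const reply = json.choices[0].message.content;
  history.push({ role: "assistant", content: reply }); // keep context for the next turn
  return reply;
}
```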
## Sending audio
Browser (`getUserMedia` + `ScriptProcessor`):

```javascript
processor.onaudioprocess = (e) => {
  const float32 = e.inputBuffer.getChannelData(0);
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    // clamp, then scale [-1, 1] floats to signed 16-bit PCM
    int16[i] = Math.max(-32768, Math.min(32767, Math.round(float32[i] * 32767)));
  }
  ws.send(int16.buffer);
};
```
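One caveat: browsers usually capture at 44.1 or 48 kHz, while the session above is opened with `sample_rate=16000`. You can request `new AudioContext({ sampleRate: 16000 })`, but where that isn't honored, downsampling before sending is needed. A naive decimating sketch (the function name is mine; a real resampler would low-pass filter first to reduce aliasing):

```javascript
// Naive decimation from the capture rate (often 44100 or 48000 Hz in
// browsers) down to the 16 kHz the session expects. Picks the nearest
// input sample; no anti-aliasing filter is applied.
function downsampleTo16k(float32, inputRate) {
  const ratio = inputRate / 16000;
  const outLength = Math.floor(float32.length / ratio);
  const int16 = new Int16Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const sample = float32[Math.floor(i * ratio)];
    // same clamp-and-scale as the loop above
    int16[i] = Math.max(-32768, Math.min(32767, Math.round(sample * 32767)));
  }
  return int16;
}
```

Usage inside the handler would then be `ws.send(downsampleTo16k(float32, audioContext.sampleRate).buffer);`.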
Terminal (`mic` package):

```javascript
const micStream = micInstance.getAudioStream();
micStream.on("data", (chunk) => {
  aaiWs.send(chunk); // raw PCM s16le bytes
});
```

## Tuning turn detection
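The three turn-detection query parameters in the connection URL control the latency/accuracy trade-off: `end_of_turn_confidence_threshold` is how confident the model must be that the speaker is done, `min_end_of_turn_silence_when_confident` is the silence (in ms) still required once it is confident, and `max_turn_silence` is the hard silence cutoff (in ms) that ends a turn regardless of confidence. A small helper for experimenting with presets; the `default` row mirrors the URL above, the other values are illustrative starting points, and the helper name is mine:

```javascript
// Presets for the three turn-detection params. "default" mirrors the
// connection URL above; "snappy" and "patient" are illustrative values,
// not recommendations from AssemblyAI.
const TURN_PRESETS = {
  snappy:  { end_of_turn_confidence_threshold: 0.3, min_end_of_turn_silence_when_confident: 160, max_turn_silence: 1000 },
  default: { end_of_turn_confidence_threshold: 0.4, min_end_of_turn_silence_when_confident: 300, max_turn_silence: 1500 },
  patient: { end_of_turn_confidence_threshold: 0.7, min_end_of_turn_silence_when_confident: 800, max_turn_silence: 3000 },
};

// Build the streaming URL for a given preset. URLSearchParams coerces the
// numeric values to strings and percent-escapes the token.
function buildStreamingUrl(apiKey, preset = "default") {
  const params = new URLSearchParams({
    speech_model: "u3-rt-pro",
    encoding: "pcm_s16le",
    sample_rate: "16000",
    ...TURN_PRESETS[preset],
    token: apiKey,
  });
  return `wss://streaming.assemblyai.com/v3/ws?${params}`;
}
```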
## Keyterm prompting (mid-session)

Inject domain-specific vocabulary after the session starts without restarting:

```javascript
ws.send(JSON.stringify({
  type: "UpdateConfiguration",
  keyterms: ["AssemblyAI", "Universal-3", "your-product-name"],
}));
```

## Requirements
- Node.js 18+
- `npm install` installs: `ws`, `openai`, `elevenlabs`, `mic`, `dotenv`
- macOS: `afplay` (built-in) for audio playback
- Linux: `aplay` or `mpg123` for audio playback
## Deploy to Railway, Render, or Fly.io

```bash
# Set environment variables in the platform dashboard, then:
npm run server
```
The browser server is stateless per-connection — each WebSocket session has its own AssemblyAI connection and conversation history.
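That isolation can be sketched as a closure that owns each session's state. The handler shape and field names below are mine, not necessarily the repo's:

```javascript
// Sketch of per-connection state: everything mutable lives in the handler's
// closure, so two browser tabs can never share transcripts or history.
// `browserSocket` would come from a ws.WebSocketServer "connection" event.
function onBrowserConnection(browserSocket) {
  const session = {
    history: [],     // this session's conversation turns
    aaiSocket: null, // this session's AssemblyAI WebSocket, opened on demand
  };
  browserSocket.on("close", () => {
    if (session.aaiSocket) session.aaiSocket.close(); // tear down with the tab
  });
  return session; // returned only so the shape is easy to inspect
}
```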