Speech recognition in the browser using Web Speech API
Learn how to set up speech recognition in your browser using the Web Speech API and JavaScript.



The Web Speech API lets developers add speech recognition to web apps with just a few lines of JavaScript. It’s a fast way to prototype voice features directly in the browser—no server setup, no API keys, no billing. And with 87.5% of teams now actively building voice agents, demand for browser-based voice features is accelerating.
In this tutorial, you’ll build a working speech-to-text web app using the Web Speech API. We’ll walk through every line, from HTML to JavaScript to CSS. Then we’ll cover the API’s real-world limitations and show you how to move to production with AssemblyAI’s Streaming Speech-to-Text API and Voice Agent API.
Before we set up the app, let’s learn about the Web Speech API and how it works.
What is the Web Speech API?
The Web Speech API is a browser-based JavaScript interface that provides speech recognition and speech synthesis capabilities directly in web applications. It converts spoken words to text and text to speech without requiring external libraries.
The API has two main interfaces:
SpeechRecognition: Captures microphone input and sends audio to a cloud service (typically Google’s servers in Chrome) for transcription. Returns real-time transcription results to the browser:
// Set up a SpeechRecognition object
const recognition = new SpeechRecognition();
// Start and stop recording
recognition.start();
recognition.stop();
// Handle the result in a callback
recognition.addEventListener("result", onResult);
SpeechSynthesis: Takes text and converts it into spoken words using the voices available to the browser. If you want to explore this side of the API, see our guide to JavaScript text-to-speech. The exact voices and languages depend on the user’s device and operating system. Voices installed on the device are synthesized locally and need no internet connection, although some browsers also offer network-backed voices.
The Web Speech API abstracts these complex processes, so developers can integrate voice features without needing specialized infrastructure or machine learning expertise.
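For the synthesis side, a minimal sketch might look like the following. The speak function and its options are hypothetical names chosen for illustration; only speechSynthesis.speak() and SpeechSynthesisUtterance come from the Web Speech API itself. The last two parameters exist only so the function can be exercised outside a browser.

```javascript
// Hypothetical helper: speak a string with the SpeechSynthesis interface.
// `synth` and `Utterance` default to the browser globals.
function speak(
  text,
  { lang = "en-US", rate = 1 } = {},
  synth = globalThis.speechSynthesis,
  Utterance = globalThis.SpeechSynthesisUtterance
) {
  if (!synth || !Utterance) {
    return false; // Speech synthesis is not available in this environment
  }
  const utterance = new Utterance(text);
  utterance.lang = lang; // BCP 47 language tag
  utterance.rate = rate; // 1 is the default speaking rate
  synth.speak(utterance); // Queues the utterance for playback
  return true;
}
```

In a browser, calling speak("Hello from the Web Speech API") from a click handler is usually enough to hear the default voice.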
Browser support and compatibility
Before you start building, understand the Web Speech API’s biggest limitation: inconsistent browser support.
Safari’s support is a significant improvement over previous years. Since Safari 14.1 on macOS and Safari 14.5 on iOS/iPadOS, the SpeechRecognition interface is available via the webkitSpeechRecognition prefix. This means your Web Speech API code can now work across Chrome, Edge, Safari, and Opera—covering the vast majority of desktop and mobile users.
Firefox remains the holdout. Speech recognition is implemented but disabled by default behind the dom.webspeech.recognition.enable flag in about:config. Most Firefox users won’t have it turned on.
The JavaScript code in this tutorial already handles this with a fallback check: window.SpeechRecognition || window.webkitSpeechRecognition.
This expression checks for the standard API first, then falls back to the webkit-prefixed version used by Chromium-based browsers and Safari.
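If you want to reuse the same check in several places, it can be wrapped in a small function. This is a sketch: getSpeechRecognition is a hypothetical helper name, and the scope parameter is there only so the check can run against a stand-in object as well as window.

```javascript
// Hypothetical helper: return the SpeechRecognition constructor, or null
// when neither the standard nor the webkit-prefixed version exists.
function getSpeechRecognition(scope = globalThis) {
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

// Usage in a page:
// const Recognition = getSpeechRecognition(window);
// if (Recognition) { const recognition = new Recognition(); }
```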
Prerequisites
Let’s walk through each step of setting up the Web Speech API on a website. By the end, you’ll have a fully functional speech recognition web app.
To follow along with this guide, you need:
- A basic understanding of HTML, JavaScript, and CSS.
- A modern browser (Chrome, Edge, Safari 14.1+, or Opera) that supports the Web Speech API.
The full code is also available on GitHub here.
Step 1: Set up the project structure
First, create a folder for your project, and inside it, add three files:
- index.html: To define the structure of your web page.
- speech-api.js: To handle speech recognition using JavaScript.
- style.css: To style the web page.
Step 2: Write the HTML file
We’ll start by writing the HTML code that will display the speech recognition UI. The page should contain a button for starting and stopping the recording, and a section for displaying the transcription results.
Add the following code to index.html:
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Web Speech API example</title>
    <link rel="stylesheet" href="./style.css" />
  </head>
  <body>
    <h1>Web Speech API example</h1>
    <p>Click the button and start speaking</p>
    <button id="recording-button">Start recording</button>
    <div id="transcription-result"></div>
    <p id="error-message" hidden aria-hidden="true">
      Button was removed<br>Your browser doesn't support Speech Recognition
    </p>
    <script src="speech-api.js"></script>
  </body>
</html>
This HTML creates a button for triggering speech recognition and a div for displaying results. The error message appears when browsers don’t support the API.
At the bottom of the body, we include a script that points to the speech-api.js file with the Web Speech API logic.
Step 3: Implement speech recognition API logic
Now, let’s write the JavaScript to handle speech recognition. Create the speech-api.js file and add the following code:
window.addEventListener("DOMContentLoaded", () => {
  const recordingButton = document.getElementById("recording-button");
  const transcriptionResult = document.getElementById("transcription-result");
  let isRecording = false;

  const SpeechRecognition =
    window.SpeechRecognition || window.webkitSpeechRecognition;

  if (typeof SpeechRecognition !== "undefined") {
    const recognition = new SpeechRecognition();
    recognition.continuous = true;
    recognition.interimResults = true;

    const onResult = (event) => {
      transcriptionResult.textContent = "";
      for (const result of event.results) {
        const text = document.createTextNode(result[0].transcript);
        const p = document.createElement("p");
        p.appendChild(text);
        if (result.isFinal) {
          p.classList.add("final");
        }
        transcriptionResult.appendChild(p);
      }
    };

    const onClick = (event) => {
      if (isRecording) {
        recognition.stop();
        recordingButton.textContent = "Start recording";
      } else {
        recognition.start();
        recordingButton.textContent = "Stop recording";
      }
      isRecording = !isRecording;
    };

    recognition.addEventListener("result", onResult);
    recordingButton.addEventListener("click", onClick);
  } else {
    recordingButton.remove();
    const message = document.getElementById("error-message");
    message.removeAttribute("hidden");
    message.setAttribute("aria-hidden", "false");
  }
});
Code breakdown
1. Browser support check: Detects SpeechRecognition availability (including the webkitSpeechRecognition prefix for Safari) and shows an error if unsupported.
2. API configuration:
- continuous: true — Listens until manually stopped
- interimResults: true — Shows real-time transcription as you speak
3. Result handling: Updates the DOM with transcribed text and applies .final class to completed results.
4. Button control: Toggles recording state between start/stop.
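The result handling in step 3 can also be expressed without touching the DOM, which makes it easier to reuse or test. The sketch below is an illustration, not part of the tutorial's app: collectTranscripts is a hypothetical helper that walks event.results and separates finalized text from interim text, relying only on the isFinal flag and the transcript of the first alternative.

```javascript
// Hypothetical helper: split a SpeechRecognitionResultList (or any
// array-like with the same shape) into final and interim text.
function collectTranscripts(results) {
  let finalText = "";
  let interimText = "";
  for (let i = 0; i < results.length; i++) {
    const result = results[i];
    const transcript = result[0].transcript; // Top-ranked alternative
    if (result.isFinal) {
      finalText += transcript;
    } else {
      interimText += transcript;
    }
  }
  return { finalText, interimText };
}

// Usage inside the result listener:
// const { finalText, interimText } = collectTranscripts(event.results);
```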
Step 4: Style the web page
Next, let’s add some styles to make the page visually appealing. Create the style.css file and add the following:
html,
body {
  font-family: Arial, sans-serif;
  text-align: center;
}

#transcription-result {
  font-size: 18px;
  color: #5e5e5e;
}

#transcription-result .final {
  color: #000;
}

#error-message {
  color: #ff0000;
}

button {
  font-size: 20px;
  font-weight: 200;
  color: #fff;
  background: #2f2ff2;
  width: 220px;
  border-radius: 20px;
  margin-top: 2em;
  margin-bottom: 2em;
  padding: 1em;
  cursor: pointer;
}

button:hover,
button:focus {
  background: #2f70f2;
}
This CSS ensures the button is easily clickable and the transcription result is clearly visible. The .final class makes completed transcription results appear in black. Every time the recognizer finalizes a result, typically at the end of a sentence, the interim gray text changes to black.
Step 5: Test the web app
Once everything is in place, open the index.html file in a supported browser (Chrome, Edge, Safari 14.1+, or Opera). You should see a button labeled “Start recording”. When you click it, the browser will prompt you to grant permission to use the microphone.
After you allow access, the app will start transcribing spoken words into text and display them on the screen. The transcription results will continue to appear until you click the button again to stop recording.
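One caveat while testing: the browser may end recognition on its own, for example after a long silence, and the click handler in Step 3 would then still show "Stop recording". A possible way to keep the button in sync is to listen for the start and end events that SpeechRecognition fires. This is a sketch, and trackRecordingState is a hypothetical helper name.

```javascript
// Hypothetical helper: report recording state from the recognizer's own
// "start" and "end" events instead of assuming every click succeeded.
function trackRecordingState(recognition, onChange) {
  recognition.addEventListener("start", () => onChange(true));
  recognition.addEventListener("end", () => onChange(false));
}

// Usage with the variables from Step 3:
// trackRecordingState(recognition, (recording) => {
//   isRecording = recording;
//   recordingButton.textContent = recording
//     ? "Stop recording"
//     : "Start recording";
// });
```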
Error handling and troubleshooting
Production applications require comprehensive error handling. The Web Speech API fails in predictable scenarios.
Common errors include:
- not-allowed: The user denied microphone permission.
- no-speech: No speech was detected after starting recognition.
- network: A network error occurred, since Chrome relies on Google’s servers.
- service-not-allowed: The browser or device has disabled speech recognition services.
You can catch these with the onerror event handler:
recognition.onerror = (event) => {
console.error(`Speech recognition error detected: ${event.error}`);
// You could add logic here to inform the user or attempt to restart.
};
For a robust application, you should provide clear feedback to the user for each error type and consider implementing logic to automatically restart the recognition service on recoverable errors like network.
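As one way to structure that feedback, the sketch below maps the error codes listed above to a user-facing message and a flag saying whether a retry seems reasonable. The describeRecognitionError name, the message wording, and the choice of which errors count as recoverable are all assumptions made for illustration.

```javascript
// Assumed mapping from SpeechRecognition error codes to user feedback.
const RECOGNITION_ERRORS = new Map([
  ["not-allowed", { message: "Microphone access was denied.", retry: false }],
  ["service-not-allowed", { message: "Speech recognition is disabled on this browser or device.", retry: false }],
  ["no-speech", { message: "No speech was detected. Try speaking again.", retry: true }],
  ["network", { message: "A network error interrupted recognition.", retry: true }],
]);

// Hypothetical helper: look up feedback for an error code, with a
// generic non-retryable fallback for codes not listed above.
function describeRecognitionError(code) {
  return (
    RECOGNITION_ERRORS.get(code) || {
      message: `Speech recognition error: ${code}`,
      retry: false,
    }
  );
}

// Usage:
// recognition.addEventListener("error", (event) => {
//   const { message, retry } = describeRecognitionError(event.error);
//   console.warn(message);
//   if (retry) { /* restart once recognition has fully ended */ }
// });
```

Restarting is safest from the end event, since calling start() while a session is still active throws an error.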
Performance considerations and limitations
The Web Speech API works for prototypes, but it has significant limitations for production use when compared with a dedicated speech-to-text API like AssemblyAI.
According to AssemblyAI’s voice agent report, 52.5% of teams building voice agents cite accuracy as their top challenge. The Web Speech API’s inconsistent accuracy across accents, noise environments, and technical vocabulary makes it unsuitable for any application where transcription quality directly impacts your user experience.
When to use a dedicated speech-to-text API
The Web Speech API is great for quick prototypes and simple features. But if you’re building a product where transcription quality and reliability directly impact your customer experience, you need a dedicated speech-to-text AI solution.
AssemblyAI’s Universal-3 Pro Streaming model is designed for production applications that require high accuracy in real time, delivering transcripts within 300ms over WebSockets. It supports natural language prompting, real-time speaker diarization, keyterm prompting, and code-switching across 6 languages (English, Spanish, French, German, Italian, and Portuguese).
For broader language support, Universal Streaming Multilingual covers the same 6 languages at $0.15/hr, while Whisper Streaming supports 99+ languages at $0.30/hr. For a comparison of real-time speech recognition APIs, see our detailed benchmark.
All streaming models connect via the v3 WebSocket API at wss://streaming.assemblyai.com/v3/ws.
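Connection settings are passed as query parameters on that URL. The sketch below builds it with the standard URL class; buildStreamingUrl is a hypothetical helper, and sample_rate and speech_model are the two parameters used in this article's example, with the example's values as defaults.

```javascript
// Hypothetical helper: build the streaming WebSocket URL with its
// query parameters.
function buildStreamingUrl({ sampleRate = 16000, speechModel = "u3-rt-pro" } = {}) {
  const url = new URL("wss://streaming.assemblyai.com/v3/ws");
  url.searchParams.set("sample_rate", String(sampleRate));
  url.searchParams.set("speech_model", speechModel);
  return url.toString();
}

// Usage:
// const endpoint = buildStreamingUrl({ sampleRate: 16000 });
```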
Streaming speech-to-text code example
Here’s how to connect to AssemblyAI’s Streaming Speech-to-Text API from a Node.js environment using a raw WebSocket. This is the production-grade alternative to the Web Speech API:
const WebSocket = require("ws");
const mic = require("mic");
const querystring = require("querystring");

const YOUR_API_KEY = "YOUR-API-KEY";
const CONNECTION_PARAMS = {
  sample_rate: 16000,
  speech_model: "u3-rt-pro",
};
const API_ENDPOINT = `wss://streaming.assemblyai.com/v3/ws?${querystring.stringify(CONNECTION_PARAMS)}`;

const micInstance = mic({ rate: "16000", channels: "1", bitwidth: "16" });
const micInputStream = micInstance.getAudioStream();

const ws = new WebSocket(API_ENDPOINT, {
  headers: { Authorization: YOUR_API_KEY },
});

ws.on("open", () => {
  console.log("Connected to AssemblyAI Streaming API");
  micInstance.start();
  micInputStream.on("data", (data) => {
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(data);
    }
  });
});

ws.on("message", (message) => {
  const data = JSON.parse(message);
  if (data.type === "Turn") {
    if (data.end_of_turn) {
      console.log(`Final: ${data.transcript}`);
    } else {
      process.stdout.write(`\rPartial: ${data.transcript}`);
    }
  }
});

ws.on("close", () => console.log("Disconnected"));

process.on("SIGINT", () => {
  ws.send(JSON.stringify({ type: "Terminate" }));
  micInstance.stop();
});
This gives you sub-300ms latency, the highest real-time accuracy available, and none of the browser dependency issues—making it one of the best tools for live transcription. You get:
- Highest accuracy: Universal-3 Pro Streaming delivers the lowest word error rate on real-world audio, including names, account numbers, and technical terms.
- Reliability at scale: 99.9% uptime SLA with unlimited concurrent streams and autoscaling included.
- Advanced features: Speaker diarization, keyterm prompting, PII redaction, and profanity filtering—all available per session.
Sign up for a free API key to get started.
Building a voice agent in the browser with AssemblyAI
If you’re reading this tutorial because you want to build a voice assistant, voice chatbot, or interactive AI voice agent that runs in the browser, the Web Speech API won’t get you there. It handles speech-to-text only—there’s no LLM reasoning, no voice generation, no turn detection, and no interruption handling.
AssemblyAI’s Voice Agent API is a single WebSocket that handles the full voice pipeline: speech understanding (powered by Universal-3 Pro Streaming), LLM reasoning, voice generation, turn detection, and intelligent interruption handling. Stream audio in, get audio back. No separate STT, LLM, and TTS providers to manage.
Here’s why JavaScript and browser developers are choosing it:
- One WebSocket, one JSON protocol: Connect to wss://agents.assemblyai.com/v1/ws, send and receive JSON messages. No SDK required.
- $4.50/hr flat rate: Covers speech-to-text, LLM, and text-to-speech. No per-token billing, no separate invoices.
- Works from the browser: AssemblyAI provides a complete browser integration guide with AudioWorklet-based mic capture, echo cancellation, and token-based auth so your API key never touches the client.
- ~1 second end-to-end latency: Fast enough that conversations feel natural.
- 6 languages: English, Spanish, French, German, Italian, and Portuguese.
Browser voice agent quickstart
Here’s a simplified example of connecting to the Voice Agent API from the browser. In production, you’d generate a temporary token on your server (see the browser integration docs):
// 1. Get a temporary token from your server
const { token } = await fetch("/api/voice-token").then((r) => r.json());

// 2. Connect to the Voice Agent API
const wsUrl = new URL("wss://agents.assemblyai.com/v1/ws");
wsUrl.searchParams.set("token", token);
const ws = new WebSocket(wsUrl);

// 3. Configure your agent on connection
ws.addEventListener("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      system_prompt: "You are a helpful voice assistant.",
      greeting: "Hi there! How can I help you today?",
      output: { voice: "ivy" },
    },
  }));
});

// 4. Handle messages — transcripts, audio, errors
ws.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data);
  switch (msg.type) {
    case "session.ready":
      console.log("Voice agent ready — start speaking");
      break;
    case "transcript.user":
      console.log("You said:", msg.text);
      break;
    case "transcript.agent":
      console.log("Agent said:", msg.text);
      break;
    case "reply.audio":
      // Decode base64 PCM16 and play through AudioContext
      break;
  }
});
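The reply.audio case above leaves the decoding step as a comment. A minimal sketch of that step follows, assuming the audio arrives as a base64 string of little-endian 16-bit PCM samples. The function name is hypothetical, and the message field that carries the audio is not shown in this article, so the string is taken as a plain argument here.

```javascript
// Hypothetical helper: convert base64-encoded PCM16 (little-endian,
// an assumption) into Float32 samples in the range [-1, 1), the format
// that Web Audio buffers use.
function base64Pcm16ToFloat32(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  const view = new DataView(bytes.buffer);
  const samples = new Float32Array(Math.floor(bytes.length / 2));
  for (let i = 0; i < samples.length; i++) {
    samples[i] = view.getInt16(i * 2, true) / 32768; // true = little-endian
  }
  return samples;
}
```

The resulting Float32Array can be copied into an AudioBuffer with copyToChannel() and played through an AudioBufferSourceNode. The sample rate to use depends on the session's output format, which is covered in the Voice Agent API docs.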
AssemblyAI also provides a single-file HTML quickstart (voice-agent.html) that you can serve locally with npx serve . and start talking to a voice agent immediately. Check the Voice Agent API quickstart for the full working example.
Most developers go from API key to working voice agent the same afternoon. The API uses standard JSON over WebSocket—no framework to learn, no complex state machine to manage.
Frequently asked questions
Which browsers support the Web Speech API?
Chrome, Edge, and Opera have full support for the SpeechRecognition interface. Safari supports it from version 14.1+ on macOS and 14.5+ on iOS/iPadOS using the webkitSpeechRecognition prefix. Firefox has an implementation behind a flag (dom.webspeech.recognition.enable in about:config) but it’s disabled by default.
Can the Web Speech API work offline?
No. Speech recognition requires internet connectivity to send audio to cloud servers for processing (Google’s servers in Chrome’s case). Only SpeechSynthesis (text-to-speech) works offline using the browser’s local voices.
How does the Web Speech API compare to a professional API like AssemblyAI?
The Web Speech API is free but limited to basic transcription with no uptime guarantees, no data processing agreement, and inconsistent accuracy. AssemblyAI’s Universal-3 Pro Streaming model delivers 94.07% word accuracy (#1 on the Hugging Face Open ASR Leaderboard), a 99.9% uptime SLA, and advanced features like speaker diarization, entity detection, PII redaction, and summarization. The Web Speech API is best for prototypes; AssemblyAI is built for production.
How do I get started with a speech-to-text API for audio transcription?
Sign up for a free AssemblyAI account—no credit card required. Grab your API key from the dashboard, then connect to the Streaming Speech-to-Text API via WebSocket at wss://streaming.assemblyai.com/v3/ws with the speech_model parameter set to u3-rt-pro. AssemblyAI provides official SDKs for Python, JavaScript/Node.js, and Ruby, plus raw WebSocket examples if you prefer to work without an SDK.
What’s the difference between a transcription API and device speech recognition?
Device speech recognition (like the Web Speech API) runs through the browser and depends on the browser vendor’s cloud service. You have no control over the model, no SLA, no data processing guarantees, and limited features. A transcription API like AssemblyAI gives you direct access to dedicated Voice AI models with consistent accuracy, configurable features (diarization, PII redaction, language selection), enterprise security certifications (SOC 2, ISO 27001, HIPAA BAA), and a 99.9% uptime SLA. The API works in any environment—browsers, servers, mobile apps, or embedded devices.
What SDKs and programming languages are supported by speech recognition APIs?
AssemblyAI maintains official SDKs for Python and JavaScript/Node.js, as well as Ruby. For any other language—Go, Java, .NET, and beyond—you connect directly via the REST API or WebSocket; the API uses standard JSON messages, so no SDK is required. The Voice Agent API works the same way: one WebSocket, JSON in and out.
Can I build a voice agent in the browser?
Yes. AssemblyAI’s Voice Agent API lets you build a complete voice agent that runs in the browser. Connect to a single WebSocket at wss://agents.assemblyai.com/v1/ws, stream microphone audio in, and receive the agent’s spoken responses back—all over one connection. The API handles speech understanding, LLM reasoning, voice generation, turn detection, and interruption handling. It costs $4.50/hr flat and supports English, Spanish, French, German, Italian, and Portuguese. Most developers ship a working prototype the same day.
How do I use a WebSocket connection for live speech recognition?
For the Web Speech API, the browser manages the connection to the vendor’s speech service (Google’s servers in Chrome’s case) automatically, so you just call recognition.start(). For production use, AssemblyAI’s Streaming Speech-to-Text API uses an explicit WebSocket connection to wss://streaming.assemblyai.com/v3/ws. You send raw audio bytes over the connection and receive JSON transcription results in real time. Include the speech_model parameter (e.g., u3-rt-pro for Universal-3 Pro Streaming) and sample_rate in your connection URL. Send a { "type": "Terminate" } message when you’re done to cleanly close the session.


