What is the best speech-to-text API for building an AI meeting notetaker?

AssemblyAI's Universal-3.5 Pro is the leading choice for meeting notetaker products — sub-10% word error rate on conversational audio, built-in speaker diarization that handles 2 to 10+ participants, and Auto Chapters that segment meetings by topic automatically. Universal-3.5 Pro Realtime ($0.45/hr per stream, unlimited concurrency) handles real-time transcription with ~150ms P50 latency for live meeting assistants. Pair it with the LLM Gateway (25+ models across Claude, GPT, and Gemini through one unified API) to extract action items, decisions, and follow-ups. For a fully managed conversational agent that can answer questions about the meeting in real time, use the Voice Agent API at $4.50/hr per session.

How does AssemblyAI compare to Deepgram for AI notetakers?

AssemblyAI is purpose-built for notetaker products in ways pure-STT providers like Deepgram are not. Universal-3.5 Pro Realtime handles transcription with ~150ms P50 latency and speaker diarization for 2 to 10+ participants, but the bigger differentiator is the layer above STT: Auto Chapters and Summarization are first-class API flags (not features you have to build on top of raw transcripts), and the LLM Gateway gives you 25+ models — Claude, GPT, Gemini, and more — through one unified API for action-item extraction, Q&A over the transcript, and custom meeting briefs. Notetaker customers like Grain and Supernormal pick AssemblyAI for the speech-foundation layer — multi-speaker diarization that holds up across long calls, plus Speech Understanding features that ship as flags rather than scratch builds — so their teams can focus on the workspace UX rather than wiring transcripts together with a separate LLM provider.

How does speaker diarization work in multi-speaker meetings?

AssemblyAI's streaming and pre-recorded APIs both support speaker diarization out of the box — enable `speaker_labels: true` and each transcript turn carries a `speaker_label` field (A, B, C, ...) while each word in the words array carries a per-word `speaker` field for mid-turn speaker change detection. The model handles 2 to 10+ participants and accuracy improves over the course of a session as the model accumulates embedding context. For meetings where participants are on separate audio channels (e.g., Zoom recordings with per-participant tracks), use multichannel transcription instead of diarization — it gives you cleaner separation. Pair diarization with role-based identification downstream (e.g., "PM", "Engineering") by post-processing the speaker labels against context cues in the transcript.
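As a rough illustration of the post-processing described above, the sketch below collapses per-word speaker labels into turns and then maps labels to display names. The word shape (a `text` key alongside the per-word `speaker` field) and the helper names `words_to_turns` and `apply_roles` are assumptions made for this example, not part of any SDK.

```python
from itertools import groupby

# Hypothetical words array: only the per-word "speaker" field is described
# in the text above; the "text" key is an assumption for illustration.
words = [
    {"text": "Let's", "speaker": "A"},
    {"text": "lock", "speaker": "A"},
    {"text": "the", "speaker": "A"},
    {"text": "date.", "speaker": "A"},
    {"text": "That", "speaker": "B"},
    {"text": "works.", "speaker": "B"},
    {"text": "Great.", "speaker": "A"},
]

def words_to_turns(words):
    """Collapse consecutive words from the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turns.append({"speaker": speaker, "text": " ".join(w["text"] for w in group)})
    return turns

def apply_roles(turns, roles):
    """Swap raw labels for display names; unknown labels become 'Speaker X'."""
    return [
        {**t, "speaker": roles.get(t["speaker"], f"Speaker {t['speaker']}")}
        for t in turns
    ]

turns = words_to_turns(words)
named = apply_roles(turns, {"A": "Sarah (PM)"})
for t in named:
    print(f"{t['speaker']}: {t['text']}")
```

In a real product the role mapping would likely come from calendar invites or from context cues in the transcript rather than a hard-coded dictionary.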

How do I extract meeting highlights, action items, and decisions from a transcript automatically?

AssemblyAI's LLM Gateway gives you unified access to 25+ models (Claude, GPT, Gemini, and more) for post-processing meeting transcripts. Pass the diarized transcript with a prompt template that extracts meeting highlights, action items (assignee, deadline, commitment), key decisions, open questions, and follow-ups. For real-time extraction, run the LLM step on every Nth turn during the meeting; for higher-quality post-meeting briefs, run it on the complete transcript after the meeting ends. AssemblyAI also offers Auto Chapters as a single API flag: it automatically segments meetings into timestamped sections with headlines and summaries, with no LLM prompt required.
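One way to structure the prompt template and the "every Nth turn" trigger is sketched below. It only builds the prompt string and decides when to run extraction; the call to the LLM itself is left out because the request format depends on the provider. The template wording, the turn shape, and the names `build_prompt` and `should_extract` are illustrative assumptions.

```python
# Hypothetical prompt template covering the categories listed above.
EXTRACTION_PROMPT = """You are a meeting notetaker. From the transcript below, extract:
- highlights
- action items (assignee, deadline, commitment)
- key decisions
- open questions
- follow-ups
Return JSON with one key per category.

Transcript:
{transcript}
"""

def render_transcript(turns):
    """Render diarized turns as 'Speaker: text' lines."""
    return "\n".join(f"{t['speaker']}: {t['text']}" for t in turns)

def build_prompt(turns):
    """Fill the template with the diarized transcript."""
    return EXTRACTION_PROMPT.format(transcript=render_transcript(turns))

def should_extract(turn_count, every_n=10):
    """True on every Nth completed turn (N, 2N, 3N, ...)."""
    return turn_count > 0 and turn_count % every_n == 0

turns = [
    {"speaker": "Sarah", "text": "Let's lock the launch date."},
    {"speaker": "Mike", "text": "That works if we get two more engineers."},
]
prompt = build_prompt(turns)
```

A smaller `every_n` gives fresher live notes at the cost of more LLM calls; the post-meeting pass would call `build_prompt` once with the full list of turns.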

Should I use the Voice Agent API or Streaming STT for my meeting product?

It depends on whether your assistant needs to talk back. The Voice Agent API ($4.50/hr all-in) is the right pick when you want an interactive assistant — one that can answer "who said what about the launch date?" mid-meeting, or summarize key decisions on demand. One WebSocket handles STT, LLM, and TTS so you can ship a working agent in an afternoon. Universal-3.5 Pro Realtime ($0.45/hr) is the right pick for passive meeting notetakers — you bring your own LLM (or use AssemblyAI's LLM Gateway) for note generation and your own workspace integrations. Both share the same Universal-3.5 Pro speech foundation, so transcription quality and diarization are identical.
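To make the price difference concrete, here is a small arithmetic sketch using the two hourly rates quoted above. The 1,000-hour monthly volume is an arbitrary example, and the Realtime figure covers transcription only: any LLM or TTS spend on the bring-your-own path is not included because this page does not state it.

```python
VOICE_AGENT_PER_HOUR = 4.50   # all-in rate quoted above
REALTIME_STT_PER_HOUR = 0.45  # transcription-only rate quoted above

def monthly_cost(rate_per_hour, meeting_hours):
    """Cost in dollars for a given number of meeting hours."""
    return round(rate_per_hour * meeting_hours, 2)

hours = 1_000  # example volume, not a figure from this page
agent_cost = monthly_cost(VOICE_AGENT_PER_HOUR, hours)
stt_cost = monthly_cost(REALTIME_STT_PER_HOUR, hours)

print(f"Voice Agent API: ${agent_cost:,.2f}")
print(f"Realtime STT:    ${stt_cost:,.2f} (before LLM costs)")
```

At this volume the gap is $4,050 per month, which is the budget the bring-your-own path leaves for an LLM before the two options cost the same.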

What conferencing platforms does AssemblyAI integrate with for meeting transcription?

AssemblyAI works with every major conferencing platform via a few different integration patterns. For real-time transcription, bridge from Zoom, Microsoft Teams, Google Meet, or Webex using a meeting bot (Recall.ai is a popular bot-as-a-service that works natively with AssemblyAI) or the platform's media stream APIs. For post-meeting transcription, send the platform's recording (mp4, m4a, or wav) to the pre-recorded API. LiveKit and Pipecat both have first-party AssemblyAI plugins (`livekit-plugins-assemblyai`) for cascading voice agent pipelines. The Voice Agent API also works natively with LiveKit and Pipecat via direct WebSocket — no proprietary SDK required.
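For the post-meeting path, the request body might be assembled roughly as follows. This is a sketch under assumptions: the field names (`audio_url`, `speaker_labels`, `multichannel`, `auto_chapters`) mirror the flags named on this page and should be checked against the current API reference, and `build_transcript_request` is a hypothetical helper. It builds the body only and sends nothing.

```python
import json

def build_transcript_request(audio_url, separate_tracks=False):
    """Build a JSON-serializable body for a pre-recorded transcription request.

    Field names are assumptions taken from the flags mentioned in this page;
    verify them against the API reference before use.
    """
    body = {"audio_url": audio_url, "auto_chapters": True}
    if separate_tracks:
        # Per-participant tracks: rely on channel separation
        body["multichannel"] = True
    else:
        # Single mixed track: rely on diarization
        body["speaker_labels"] = True
    return body

mixed = build_transcript_request("https://example.com/meeting.mp4")
tracks = build_transcript_request("https://example.com/meeting.wav", separate_tracks=True)
print(json.dumps(mixed, indent=2))
```

Keeping the multichannel and diarization branches mutually exclusive follows the guidance above to use multichannel instead of diarization when per-participant tracks are available.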

Solutions

Voice agents for meeting intelligence & AI notetakers

Build AI meeting assistants that transcribe every conversation, identify speakers, and generate structured notes with action items, key decisions, and chapter summaries. Powered by the fastest, most accurate speech-to-text with built-in Speech Understanding.

Meeting notes — auto-generated

Q3 planning sync · 4 speakers · 42 min

Key decisions

Launch date moved to Sept 15. Budget approved for 2 additional engineers. Partner integration deprioritized to Q4.

Action items

Sarah: draft revised timeline by Friday. Mike: open 2 eng reqs in Greenhouse. Priya: update stakeholder deck.

Chapters

0:00 Status update · 8:12 Timeline discussion · 22:40 Resourcing · 35:15 Next steps

The problem

Meetings end and the context disappears

Teams spend 31 hours per month in meetings, yet most decisions, action items, and commitments vanish within 24 hours. Manual note-taking splits attention, shared docs are incomplete, and recordings sit unwatched. AI meeting assistants — built on accurate speech-to-text, speaker diarization, and LLM-powered summarization — capture everything so your team can stay present during the conversation and act on the outcomes after.

Built for meeting transcription performance

Latency ~150ms

Median (P50) streaming latency for real-time meeting assistants.

Speakers 10+

Speakers identified across multi-participant meetings with built-in diarization.

Uptime 99.9%

SLA with SOC 2 Type 2 certification.

Scale 40TB+

Audio processed daily in production.

Two ways to build

Pick the API that fits your meeting product

Ship an interactive meeting assistant in an afternoon, or drop industry-leading STT and Speech Understanding into your notetaker product.

Recommended

Voice Agent API

Our proprietary voice stack via one WebSocket. Build interactive meeting assistants that transcribe, summarize, and answer questions about the conversation in real time.

Best for

Interactive meeting assistants with voice Q&A
Teams shipping fast — working assistant in an afternoon
Real-time transcription with speaker diarization
Claude Code compatible — paste the docs and build anything

$4.50/hr — speech, LLM, and voice all included

Free tier available · No credit card required

Bring Your Own Stack

Universal-3.5 Pro Realtime STT API

The STT and Speech Understanding layer for your meeting product. Transcription, diarization, chapters, action items, and LLM Gateway for custom summaries.

Best for

AI notetaker products with custom UX and workflows
Auto Chapters, action items, and topic detection built in
LLM Gateway for custom summaries and Q&A over transcripts
PII redaction before notes hit your workspace or CRM
High-scale deployments where margin and full control matter

$0.45/hr — transcription only, unlimited streams

No concurrency caps · Autoscaling included

Your meeting intelligence pipeline

Capture meeting audio

Voice Agent API: single WebSocket for real-time. Or ingest recordings from Zoom, Teams, Meet, or any conferencing platform via bot or API.

Transcribe with speaker diarization

Speaker labels identify who said what. ~150ms P50 streaming latency. Keyterm boosting for your product names, team members, and project codenames.

Extract structure and insights

Auto Chapters segment by topic. Summarization generates notes. LLM Gateway extracts action items, decisions, and follow-ups across 25+ models (Claude, GPT, Gemini).

Distribute to your workspace

Push notes, action items, and summaries to Slack, Notion, Google Docs, or your CRM via webhook. Searchable transcript archive for async review.
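The final distribution step could look something like the sketch below, which turns extracted notes into a single text message. The `{"text": ...}` shape matches what Slack incoming webhooks accept; other destinations will need their own payloads. The input shapes (owner and task tuples, chapter start times in milliseconds) and the helper names are assumptions for illustration.

```python
import json

def fmt_ts(ms):
    """Format milliseconds as m:ss, e.g. 492000 becomes 8:12."""
    minutes, seconds = divmod(ms // 1000, 60)
    return f"{minutes}:{seconds:02d}"

def notes_to_message(title, decisions, action_items, chapters):
    """Render notes as one text payload for a chat webhook."""
    lines = [f"*{title}*", "", "Key decisions"]
    lines += [f"- {d}" for d in decisions]
    lines += ["", "Action items"]
    lines += [f"- {owner}: {task}" for owner, task in action_items]
    lines += ["", "Chapters"]
    lines += [f"- {fmt_ts(start_ms)} {headline}" for start_ms, headline in chapters]
    return {"text": "\n".join(lines)}

payload = notes_to_message(
    title="Q3 planning sync",
    decisions=["Launch date moved to Sept 15."],
    action_items=[("Sarah", "draft revised timeline by Friday")],
    chapters=[(0, "Status update"), (492_000, "Timeline discussion")],
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be sent as the JSON body of a POST to the webhook URL with whichever HTTP client the product already uses.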

Meeting transcript

Sarah (PM)

"Let's lock the launch date — I'm proposing September 15."

Mike (Eng)

"That works if we get two more engineers. The API layer needs another sprint."

Priya (Design)

"I'll have the updated flows to eng by end of week."

Quickstart

Get a working assistant in minutes

Voice Agent API — recommended

# Voice Agent API: real-time meeting assistant
import asyncio
import json

import websockets

API_KEY = "YOUR_API_KEY"

def handle(event):
    # Placeholder dispatcher: replace with real handling per event type
    # (transcript.user, reply.audio, tool.call, ...)
    print(event.get("type"))

async def run_agent():
    async with websockets.connect(
        "wss://agents.assemblyai.com/v1/ws",
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        # Configure the session: prompt, keyterm boosting, and output voice
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": (
                    "You are a meeting assistant. Listen to the "
                    "conversation and answer questions about what was discussed. "
                    "Track action items, decisions, and follow-ups."
                ),
                "input": {"keyterms": ["Q3 launch", "Project Atlas", "Greenhouse"]},
                "output": {"voice": "ivy"},
            },
        }))
        # Stream meeting audio in, get transcript + answers back
        async for msg in ws:
            handle(json.loads(msg))

if __name__ == "__main__":
    asyncio.run(run_agent())

Universal-3.5 Pro Realtime + LiveKit — BYO stack

# LiveKit + AssemblyAI STT in a real-time meeting notetaker pipeline
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, cli
from livekit.plugins import assemblyai, cartesia, openai, silero

class MeetingAssistant(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a meeting assistant. Track action items, "
                "key decisions, and generate structured notes."
            ),
        )

async def entrypoint(ctx: JobContext):
    # One session wires STT, LLM, TTS, and VAD into a cascading pipeline
    session = AgentSession(
        stt=assemblyai.STT(
            model="u3-rt-pro",
            speaker_labels=True,  # diarize multi-speaker meetings
            keyterms_prompt=["Q3 launch", "Project Atlas", "Greenhouse"],
        ),
        llm=openai.LLM(model="gpt-4o"),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(),
    )
    await session.start(room=ctx.room, agent=MeetingAssistant())

if __name__ == "__main__":
    # Start the worker; entrypoint runs for each job dispatched to it
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

Auto Chapters & summaries

Automatically segment meetings by topic with timestamped chapter headings. Generate paragraph, bullet, or headline summaries with a single API flag.

Speaker diarization

Identify who said what across meetings with 2 to 10+ participants — essential for attribution and action item assignment.

LLM Gateway

Ask questions about your transcripts, extract custom fields, generate meeting briefs, or build searchable archives. 25+ models across Claude, GPT, and Gemini through one unified API.

Teams shipping meeting products on AssemblyAI

12% increase in customer satisfaction

Grain's AI notetaker platform automates meeting transcription and conversation analysis for 31,000+ customers — built on AssemblyAI's speech foundation.

Build your meeting assistant today

Free tier, no credit card. From raw meeting audio to structured notes in an afternoon.
