Best Practices for Building Meeting Notetakers
Introduction
Building a robust meeting notetaker requires careful consideration of accuracy, latency, speaker identification, and real-time capabilities. This guide addresses common questions and provides practical solutions for both post-call and live meeting transcription scenarios.
Why AssemblyAI for Meeting Notetakers?
AssemblyAI stands out as the premier choice for meeting notetakers with several key advantages:
Industry-Leading Accuracy with Pre-recorded Audio
- 93.3%+ transcription accuracy ensures reliable meeting documentation
- 2.9% speaker diarization error rate for precise “who said what” attribution
- Speech Understanding integration for intelligent post-processing and insights
- Keyterms prompting lets you provide meeting context to improve transcription accuracy
Real-Time Streaming Advantages
As meeting notetakers evolve toward real-time capabilities, AssemblyAI’s Universal-Streaming model offers significant benefits:
- Ultra-low latency (~300ms) enables live transcription without delays
- Format turns feature provides structured, readable output in real-time
- Keyterms prompting lets you provide meeting context to improve transcription accuracy
End-to-End Voice AI Platform
Unlike fragmented solutions, AssemblyAI provides a unified API for:
- Transcription with speaker diarization
- Automatic language detection and code switching
- Boosting accuracy via meeting context with keyterms prompt
- Speech Understanding tasks like speaker identification, translation, and transcript styling
- Post-processing with custom prompting - from summarization to fully custom workflows
- Real-time streaming and batch processing of pre-recorded audio in a single platform
When Should I Use Pre-recorded vs Streaming for Meeting Notetakers?
Understanding when to use pre-recorded versus streaming speech-to-text is critical for building the right meeting notetaker.
Pre-recorded (Async) Speech-to-text
- Post-call analysis - The meeting has already happened and you have the full recording
- Highest accuracy needed - Pre-recorded models have higher accuracy (93.3%+)
- Speaker diarization is critical - Async has 2.9% speaker error rate
- Broad language support - Need any of 99+ languages
- Advanced features required - Summarization, sentiment analysis, entity detection, PII redaction, speaker identification
- Batch processing - Processing multiple recordings at once
- Quality over speed - Can wait seconds/minutes for perfect results
Best for: Zoom/Teams/Meet recording uploads, compliance, documentation, post-call summaries, searchable archives
Streaming (Real-time) Speech-to-text
- Live meetings - Transcribing as the meeting happens
- Real-time captions - Displaying subtitles/captions to participants during calls
- Immediate feedback - Need transcription within ~300ms
- Interactive features - Live note-taking, real-time keyword detection, action item alerts
- No recording available - Processing live audio only
Best for: Live captions, real-time note-taking apps, accessibility features, live keyword alerts
Hybrid Approach (Recommended)
Many successful meeting notetakers use both pre-recorded and streaming speech-to-text:
- Streaming during the call - Provide live captions and real-time notes to participants
- Async after the call - Generate high-quality transcript with speaker labels, summary, and insights
This gives users immediate value during meetings while providing comprehensive documentation afterward.
Example workflow:
- User joins meeting → Start streaming for live captions
- Meeting ends → Upload recording to pre-recorded API for final transcript with speaker names
- Generate meeting summary, action items, and searchable archive from pre-recorded transcript
What Languages and Features Are Available for a Meeting Notetaker?
Pre-Recorded Meetings
For post-call analysis, AssemblyAI supports:
Languages:
- 99 languages supported
- Automatic Language Detection to transcribe in the dominant language of the recording
- Code Switching to capture speech that moves between languages mid-conversation
Core Features:
- Speaker diarization (1-10 speakers by default, expandable to any min/max)
- Multichannel audio support (each channel = one speaker)
- Automatic formatting, punctuation, and capitalization
- Keyterms prompting for boosting domain-specific terms
Speech Understanding Models:
- Summarization for meeting recaps
- Sentiment analysis for meeting tone assessment
- Entity detection for extracting key information
- Speaker identification to map generic labels to actual names/roles
- Translation between 99+ languages
Real-Time Streaming
For live meeting transcription:
Languages:
- English-only model (default)
- Multilingual model supporting English, Spanish, French, German, Portuguese, and Italian
Streaming-Specific Features:
- Partial and final transcripts for responsive UI
- Format turns for structured, readable output
- Keyterms prompt for contextual accuracy
- End-of-utterance detection for natural speech boundaries
Coming Soon (Public Roadmap)
- Enhanced accuracy for English, Spanish, French, German, Portuguese, and Italian post-call transcription
- Domain-specific models out of the box (e.g., Medical)
- Speaker Diarization improvements (especially on shorter files)
- Speaker diarization for streaming
How Can I Get Started Building a Post-Call Meeting Notetaker?
Here’s a complete example implementing async transcription with all essential features:
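Below is a minimal sketch using the AssemblyAI Python SDK. The API key, file path, and enabled features are placeholders to adapt; some feature combinations carry restrictions, so consult the API reference for your use case.

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder: use your own key

# Core meeting-notetaker configuration; enable only the features you need.
config = aai.TranscriptionConfig(
    speaker_labels=True,        # speaker diarization ("who said what")
    summarization=True,         # abstractive meeting recap
    summary_model=aai.SummarizationModel.informative,
    summary_type=aai.SummarizationType.bullets,
    entity_detection=True,      # people, organizations, products, dates
    sentiment_analysis=True,    # per-utterance sentiment
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("meeting_recording.mp3", config=config)

if transcript.status == aai.TranscriptStatus.error:
    raise RuntimeError(transcript.error)

# Speaker-labeled transcript
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")

# Meeting recap
print(transcript.summary)
```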
How Can I Get Started Building a During-Call Live Meeting Notetaker?
Here’s a complete example for real-time streaming transcription with meeting-optimized settings:
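A minimal sketch using the Python SDK's Universal-Streaming (v3) client with microphone input. The turn-detection parameter names and the specific values shown follow the streaming docs at the time of writing and are illustrative; verify them against the current API reference.

```python
import assemblyai as aai
from assemblyai.streaming.v3 import (
    StreamingClient,
    StreamingClientOptions,
    StreamingEvents,
    StreamingParameters,
    TurnEvent,
)

def on_turn(client: StreamingClient, event: TurnEvent):
    # Persist formatted, final turns as live notes; partial turns can drive captions.
    if event.end_of_turn and event.turn_is_formatted:
        print(event.transcript)

client = StreamingClient(StreamingClientOptions(api_key="YOUR_API_KEY"))
client.on(StreamingEvents.Turn, on_turn)

client.connect(
    StreamingParameters(
        sample_rate=16000,
        format_turns=True,                           # readable, punctuated output
        end_of_turn_confidence_threshold=0.8,        # be conservative about closing turns
        min_end_of_turn_silence_when_confident=560,  # ms of silence before a confident turn end
        max_turn_silence=2400,                       # ms hard cap for long pauses
    )
)

try:
    # Requires the SDK extras: pip install "assemblyai[extras]"
    client.stream(aai.extras.MicrophoneStream(sample_rate=16000))
finally:
    client.disconnect(terminate=True)
```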
These settings wait longer before ending turns to accommodate natural conversation pauses and ensure readable, formatted text for display. You can tweak them to get the best results for your notetaker.
How Do I Handle Multichannel Meeting Audio?
Many meeting platforms (Zoom, Teams, Google Meet) can record each participant on separate audio channels. This dramatically improves speaker identification accuracy.
For Pre-recorded Meetings
When to use multichannel:
- Zoom local recordings with “Record separate audio file for each participant” enabled
- Professional podcast recordings with individual microphones
- Conference systems with dedicated channels per participant
- Phone calls with caller and callee on separate channels
Benefits:
- Perfect speaker separation - No diarization errors
- No speaker confusion or overlap issues
- Faster processing time - Diarization not needed
- Higher accuracy - Model processes clean single-speaker audio
How to enable in meeting platforms:
- Zoom: Settings → Recording → Advanced → “Record a separate audio file for each participant”
- Teams: Requires third-party recording solutions like Recall.ai
- Google Meet: Requires third-party recording solutions like Recall.ai
For Streaming Meetings
For real-time multichannel audio, create separate streaming sessions per channel:
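As a sketch, assuming your recorder exposes one PCM16 audio iterator per participant (the channel_sources mapping below is hypothetical), you can run one streaming session per channel and tag each turn with the participant's name:

```python
import threading
from assemblyai.streaming.v3 import (
    StreamingClient,
    StreamingClientOptions,
    StreamingEvents,
    StreamingParameters,
)

def stream_channel(speaker_name, audio_chunks):
    """Run one streaming session for one participant's audio channel."""
    client = StreamingClient(StreamingClientOptions(api_key="YOUR_API_KEY"))

    def on_turn(_client, event):
        if event.end_of_turn:
            print(f"{speaker_name}: {event.transcript}")

    client.on(StreamingEvents.Turn, on_turn)
    client.connect(StreamingParameters(sample_rate=16000, format_turns=True))
    client.stream(audio_chunks)  # any iterable of raw PCM16 chunks
    client.disconnect(terminate=True)

channel_sources = {}  # hypothetical: {"Alice": alice_chunks, "Bob": bob_chunks} from your recorder
for name, chunks in channel_sources.items():
    threading.Thread(target=stream_channel, args=(name, chunks), daemon=True).start()
```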
See our multichannel streaming guide for complete implementation details.
How Should I Handle Pre-recorded Transcription in Production?
Choose the right approach based on your application’s needs:
Option 1: Async/Await (Simple Cases)
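A minimal blocking sketch (the file name and key are placeholders): transcribe() submits the recording and waits until the result is ready.

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Blocks the calling thread until the transcript is complete.
transcript = aai.Transcriber().transcribe(
    "meeting_recording.mp3",
    config=aai.TranscriptionConfig(speaker_labels=True),
)
print(transcript.text)
```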
Pros:
- Simple, straightforward code
- Good for low volume applications
- Easy to understand and debug
Cons:
- Ties up resources while waiting
- Not suitable for high volume
- Cannot process multiple files simultaneously
Best for: Personal projects, prototypes, low-traffic applications
Option 2: Webhook Callbacks (Production Recommended)
Webhook handler example:
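A sketch of both halves, assuming a Flask app and a publicly reachable webhook URL (the endpoint path and hostname are placeholders): submit the job with a webhook_url, then fetch the finished transcript by ID when AssemblyAI calls back.

```python
import assemblyai as aai
from flask import Flask, request

aai.settings.api_key = "YOUR_API_KEY"

# 1) Submit without blocking; AssemblyAI POSTs to the webhook when processing finishes.
aai.Transcriber().submit(
    "meeting_recording.mp3",
    config=aai.TranscriptionConfig(
        speaker_labels=True,
        webhook_url="https://your-app.example.com/assemblyai-webhook",  # placeholder URL
    ),
)

# 2) Webhook handler: the callback body contains the transcript ID and status.
app = Flask(__name__)

@app.route("/assemblyai-webhook", methods=["POST"])
def assemblyai_webhook():
    payload = request.get_json()
    if payload.get("status") == "completed":
        transcript = aai.Transcript.get_by_id(payload["transcript_id"])
        # Store transcript.text / transcript.utterances, kick off summaries, etc.
    return "", 200
```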
Pros:
- Non-blocking - submit and forget
- Scales to high volume
- Process multiple files in parallel
- Automatic retry on failures
- Get notified when complete
Best for: Production apps, user-uploaded recordings, batch processing, SaaS products
Option 3: Polling (Custom Workflows)
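A sketch using the REST API directly with the requests library (the key and audio URL are placeholders), which keeps the polling loop fully under your control:

```python
import time
import requests

headers = {"authorization": "YOUR_API_KEY"}

# Submit the job, then poll its status on your own schedule.
job = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=headers,
    json={"audio_url": "https://example.com/meeting_recording.mp3", "speaker_labels": True},
).json()

while True:
    result = requests.get(
        f"https://api.assemblyai.com/v2/transcript/{job['id']}",
        headers=headers,
    ).json()
    if result["status"] == "completed":
        print(result["text"])
        break
    if result["status"] == "error":
        raise RuntimeError(result["error"])
    time.sleep(3)  # back off between polls; surface progress to users here
```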
Pros:
- Full control over retry logic
- Can show progress to users
- Good for background jobs
- Works without webhook infrastructure
Cons:
- Must implement your own polling loop
- Ties up resources while polling
- More complex than webhooks
Best for: Background job processors, CLIs with progress bars, custom retry logic
Comparison Table
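A summary of the trade-offs described above, derived from the pros and cons listed for each option:

| | Async/Await | Webhooks | Polling |
| --- | --- | --- | --- |
| Complexity | Low | Medium | Medium |
| Blocks while waiting | Yes | No | Only the polling loop |
| Scales to high volume | No | Yes | With a job queue |
| Extra infrastructure | None | Public webhook endpoint | Custom polling logic |
| Best for | Prototypes, low traffic | Production apps, SaaS, batch | Background jobs, CLIs with progress |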
How Do I Identify Speakers in My Recording?
Speaker diarization tells you when speakers change (“Speaker A”, “Speaker B”), but Speaker Identification tells you who they are by name or role.
Why Use Speaker Identification?
Instead of:
You get:
How It Works
Speaker Identification uses AssemblyAI’s Speech Understanding API to map generic speaker labels to actual names or roles that you provide:
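The exact request shape for the dedicated Speaker Identification feature is covered in the linked documentation. As an illustration of the same idea, here is a hedged sketch that maps diarized labels to known participant names using LeMUR custom prompting (a different AssemblyAI capability); the participant names are hypothetical:

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Speaker identification builds on diarization, so enable speaker_labels first.
transcript = aai.Transcriber().transcribe(
    "meeting_recording.mp3",
    config=aai.TranscriptionConfig(speaker_labels=True),
)

known_participants = ["John Smith", "Priya Patel", "Alex Chen"]  # hypothetical attendee list

prompt = (
    "Map each generic speaker label (Speaker A, Speaker B, ...) to one of these "
    f"participants: {', '.join(known_participants)}. "
    'Return a JSON object such as {"A": "John Smith"} based on what each speaker says.'
)
result = transcript.lemur.task(prompt, final_model=aai.LemurModel.claude3_5_sonnet)
print(result.response)
```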
Identifying by Role Instead of Name
For customer service, sales calls, or scenarios where you don’t know names:
Common role combinations:
["Agent", "Customer"]
- Customer service calls["Support", "Customer"]
- Technical support["Interviewer", "Interviewee"]
- Interviews["Host", "Guest"]
- Podcasts["Doctor", "Patient"]
- Medical consultations (with HIPAA compliance)
How to Get Speaker Names
For platform recordings:
- Zoom: Extract participant names from Zoom API or meeting JSON
- Teams: Get attendees from Microsoft Graph API
- Google Meet: Use Google Calendar API to get participants
Example with Zoom:
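A sketch assuming you already have a Zoom OAuth access token and the meeting ID; it calls Zoom's past-meeting participants endpoint, and the response field names should be verified against Zoom's documentation:

```python
import requests

ZOOM_ACCESS_TOKEN = "YOUR_ZOOM_TOKEN"  # placeholder
MEETING_ID = "123456789"               # placeholder

resp = requests.get(
    f"https://api.zoom.us/v2/past_meetings/{MEETING_ID}/participants",
    headers={"Authorization": f"Bearer {ZOOM_ACCESS_TOKEN}"},
)
participant_names = [p["name"] for p in resp.json()["participants"]]

# Feed these names into Speaker Identification (or a prompt like the one above)
# so "Speaker A/B/C" can be mapped to real attendees.
print(participant_names)
```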
How Speaker Identification Works
Speaker Identification Requirements:
- Speaker diarization must be enabled - Cannot identify speakers without diarization first
- Requires sufficient audio per speaker - Each speaker needs enough speech for accurate matching
- Works best with distinct voices - Similar voices may be confused
- US English only - Currently only supports US region audio
- Post-processing step - Adds additional processing time after transcription
Accuracy depends on:
- Audio quality (clear, minimal background noise)
- Voice distinctiveness (different genders, accents, tones)
- Amount of speech per speaker (more = better)
- Number of speakers (fewer = more accurate)
Alternative: Add Identification Later
You can also add speaker identification to an existing transcript:
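A sketch, again using LeMUR prompting as the illustration (the transcript ID and names are placeholders); per the linked docs, the dedicated Speaker Identification request can also be run against an existing transcript:

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Re-load a transcript created earlier (with speaker_labels enabled).
transcript = aai.Transcript.get_by_id("EXISTING_TRANSCRIPT_ID")  # placeholder ID

# Try a new name mapping without re-transcribing the audio.
result = transcript.lemur.task(
    "Relabel Speaker A and Speaker B using these names: Dana Lee, Marcus Webb. "
    "Return the transcript with names in place of the speaker letters."
)
print(result.response)
```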
This approach is useful when:
- You get speaker names after the transcription completes
- You want to try different name mappings
- Building iterative workflows where users confirm speaker identities
For complete API details, see our Speaker Identification documentation.
How Do I Translate Between Languages in Meetings?
AssemblyAI supports translation between 99+ languages, enabling you to transcribe meetings in one language and translate to another.
When to Use Translation
Common use cases:
- Transcribe Spanish meeting → Translate to English for documentation
- Transcribe multilingual meeting → Translate all to common language
- Create translated meeting notes for international teams
- Provide translated summaries for stakeholders
Basic Translation
Translation with Code Switching
For meetings where participants switch between languages:
Supported Language Pairs
AssemblyAI supports translation between 99+ languages, including:
Popular combinations:
- Spanish ↔ English
- French ↔ English
- German ↔ English
- Mandarin ↔ English
- Japanese ↔ English
- Portuguese ↔ English
- And all combinations between supported languages
Translation Response Format
Real-World Example: International Team Meeting
For complete language support and translation details, see our Translation documentation.
What Workflows Can I Build for My AI Meeting Notetaker?
Use these Speech Understanding and Guardrails features to transform raw transcripts into actionable insights.
Summarization
summarization: true
What it does: Generates an abstractive recap of the conversation (not verbatim).
Output: summary string (bullets/paragraph format).
Great for: Meeting notes, call recaps, executive summaries.
Notes: Condenses and rephrases; minor details may be omitted by design.
Example:
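A minimal sketch (the file and key are placeholders) enabling summarization and reading the summary field:

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    summarization=True,
    summary_model=aai.SummarizationModel.informative,  # other options: conversational, catchy
    summary_type=aai.SummarizationType.bullets,        # other options: paragraph, headline, gist
)
transcript = aai.Transcriber().transcribe("meeting_recording.mp3", config=config)
print(transcript.summary)
```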
Sentiment Analysis
sentiment_analysis: true
What it does: Scores per-utterance sentiment (positive / neutral / negative).
Output: Array of { text, sentiment, confidence, start, end }.
Great for: Customer satisfaction tracking, coaching, churn prediction.
Notes: Segment-level (not global mood); sarcasm and very short utterances are harder to classify.
Example:
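A minimal sketch reading the per-utterance sentiment results:

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(sentiment_analysis=True)
transcript = aai.Transcriber().transcribe("meeting_recording.mp3", config=config)

for result in transcript.sentiment_analysis:
    # Each result carries text, sentiment, confidence, and start/end timestamps (ms).
    print(f"[{result.start}-{result.end}] {result.sentiment}: {result.text}")
```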
Entity Detection
entity_detection: true
What it does: Extracts named entities (people, organizations, locations, products, etc.).
Output: Array of { entity_type, text, start, end, confidence }.
Great for: Auto-tagging topics, tracking competitors mentioned, CRM enrichment.
Notes: Operates on post-redaction text if PII redaction is enabled.
Example:
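A minimal sketch listing detected entities and their types:

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(entity_detection=True)
transcript = aai.Transcriber().transcribe("meeting_recording.mp3", config=config)

for entity in transcript.entities:
    print(f"{entity.entity_type}: {entity.text}")
```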
Redact PII Text
redact_pii: true
What it does: Scans transcript for personally identifiable information and replaces matches per policy.
Output: text with replacements; original word timings preserved.
Great for: GDPR/CCPA compliance, safe sharing, SOC2 requirements.
Notes: Runs before downstream features; they see the redacted text.
Recommended policies for meetings:
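As a sketch, one common policy set for meetings with hash substitution; which policies you actually need depends on your compliance requirements:

```python
import assemblyai as aai

config = aai.TranscriptionConfig(
    redact_pii=True,
    redact_pii_policies=[
        aai.PIIRedactionPolicy.person_name,
        aai.PIIRedactionPolicy.email_address,
        aai.PIIRedactionPolicy.phone_number,
        aai.PIIRedactionPolicy.credit_card_number,
        aai.PIIRedactionPolicy.us_social_security_number,
    ],
    redact_pii_sub=aai.PIISubstitutionPolicy.hash,  # stable tokens instead of entity names
)
```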
Why hash substitution?
- Stable across the file (same value → same token)
- Maintains sentence structure for LLM processing
- Prevents reconstruction of original data
Redact PII Audio
redact_pii_audio: true
What it does: Produces a second audio file where redacted portions are bleeped/silenced.
Output: redacted_audio_url in the transcript response.
Great for: External sharing, training materials, demos.
Notes: Original audio is untouched; bleeped sections may sound choppy.
Complete Example
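A combined sketch of the workflow features above in one request (placeholders as before); some feature combinations carry restrictions, so trim the config to what your notetaker actually needs:

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    speaker_labels=True,
    summarization=True,
    summary_type=aai.SummarizationType.bullets,
    sentiment_analysis=True,
    entity_detection=True,
    redact_pii=True,
    redact_pii_policies=[
        aai.PIIRedactionPolicy.person_name,
        aai.PIIRedactionPolicy.email_address,
        aai.PIIRedactionPolicy.phone_number,
    ],
    redact_pii_sub=aai.PIISubstitutionPolicy.hash,
    redact_pii_audio=True,  # also produce a bleeped copy of the audio
)

transcript = aai.Transcriber().transcribe("meeting_recording.mp3", config=config)

print(transcript.summary)
for entity in transcript.entities:
    print(f"{entity.entity_type}: {entity.text}")
```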
How Do I Improve the Accuracy of My Notetaker?
Best practices:
- Include participant names for better speaker recognition
- Add company-specific jargon and acronyms
- Include product names and technical terms
- Keep individual terms under 50 characters
- Maximum 100 terms per request
Using Keyterms Prompt for Async Transcription
Keyterms prompting improves recognition accuracy for domain-specific vocabulary by up to 21%:
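A sketch assuming a recent SDK version where the Slam-1 speech model accepts a keyterms_prompt list (verify the model and parameter names against the current Keyterms documentation); the terms below are illustrative:

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    speech_model=aai.SpeechModel.slam_1,  # model that supports keyterms prompting
    keyterms_prompt=[
        "Priya Patel",      # participant names
        "Project Atlas",    # internal product names (illustrative)
        "Kubernetes",       # technical terms
        "OKR",              # company acronyms
    ],
    speaker_labels=True,
)
transcript = aai.Transcriber().transcribe("meeting_recording.mp3", config=config)
print(transcript.text)
```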