Best Practices for Building Meeting Notetakers

Introduction

Building a robust meeting notetaker requires careful consideration of accuracy, latency, speaker identification, and real-time capabilities. This guide addresses common questions and provides practical solutions for both post-call and live meeting transcription scenarios.

Why AssemblyAI for Meeting Notetakers?

AssemblyAI stands out as the premier choice for meeting notetakers with several key advantages:

Industry-Leading Accuracy with Pre-recorded Audio

  • 93.3%+ transcription accuracy ensures reliable meeting documentation
  • 2.9% speaker diarization error rate for precise “who said what” attribution
  • Speech Understanding integration for intelligent post-processing and insights
  • Keyterms prompting lets you provide meeting context that improves transcription accuracy

Real-Time Streaming Advantages

As meeting notetakers evolve toward real-time capabilities, AssemblyAI’s Universal-Streaming model offers significant benefits:

  • Ultra-low latency (~300ms) enables live transcription without delays
  • Format turns feature provides structured, readable output in real-time
  • Keyterms prompting lets you provide meeting context that improves transcription accuracy

End-to-End Voice AI Platform

Unlike fragmented solutions, AssemblyAI provides a unified API for:

  • Transcription with speaker diarization
  • Automatic language detection and code switching
  • Boosting accuracy via meeting context with keyterms prompt
  • Speech Understanding tasks like speaker identification, translation, and transcript styling
  • Post-processing workflows with custom prompting - from summarization to fully custom pipelines
  • Real-time streaming and batch processing of pre-recorded audio in a single platform

When Should I Use Pre-recorded vs Streaming for Meeting Notetakers?

Understanding when to use pre-recorded versus streaming speech-to-text is critical for building the right meeting notetaker.

Pre-recorded (Async) Speech-to-text

Post-call analysis - the meeting has already happened and you have the full recording

  • Highest accuracy needed - Pre-recorded models deliver the highest accuracy (93.3%+)
  • Speaker diarization is critical - Async transcription has a 2.9% speaker diarization error rate
  • Broad language support - Need any of 99+ languages
  • Advanced features required - Summarization, sentiment analysis, entity detection, PII redaction, speaker identification
  • Batch processing - Processing multiple recordings at once
  • Quality over speed - Can wait seconds/minutes for perfect results

Best for: Zoom/Teams/Meet recording uploads, compliance, documentation, post-call summaries, searchable archives

Streaming (Real-time) Speech-to-text

Live meetings - Transcribing as the meeting happens

  • Real-time captions - Displaying subtitles/captions to participants during calls
  • Immediate feedback - Need transcription within ~300ms
  • Interactive features - Live note-taking, real-time keyword detection, action item alerts
  • No recording available - Processing live audio only

Best for: Live captions, real-time note-taking apps, accessibility features, live keyword alerts

Many successful meeting notetakers use both pre-recorded and streaming speech-to-text:

  1. Streaming during the call - Provide live captions and real-time notes to participants
  2. Async after the call - Generate high-quality transcript with speaker labels, summary, and insights

This gives users immediate value during meetings while providing comprehensive documentation afterward.

Example workflow (a code sketch follows the list below):

  • User joins meeting → Start streaming for live captions
  • Meeting ends → Upload recording to pre-recorded API for final transcript with speaker names
  • Generate meeting summary, action items, and searchable archive from pre-recorded transcript
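A minimal sketch of that hybrid flow, assuming the AssemblyAI Python SDK for the post-call step. The streaming half is covered in the live example later in this guide; the helper name and recording path here are placeholders:

import assemblyai as aai

aai.settings.api_key = "your_api_key"

def finalize_meeting(recording_path: str) -> dict:
    """Post-call step: re-process the recording for speakers, summary, and archive."""
    config = aai.TranscriptionConfig(
        speaker_labels=True,  # "who said what"
        summarization=True,   # meeting recap
    )
    transcript = aai.Transcriber().transcribe(recording_path, config=config)
    return {
        "full_text": transcript.text,
        "utterances": transcript.utterances,
        "summary": transcript.summary,
    }

# During the call, a streaming session (see the live example later in this guide)
# shows captions; once the platform delivers the recording, run the async pass:
archive = finalize_meeting("path/to/meeting_recording.mp3")  # placeholder path
print(archive["summary"])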

What Languages and Features Are Available for a Meeting Notetaker?

Pre-Recorded Meetings

For post-call analysis, AssemblyAI supports:

Languages:

  • 99 languages supported
  • Automatic Language Detection to identify the dominant language in the recording
  • Code Switching to preserve mid-meeting switches between languages

Core Features:

  • Speaker diarization (1-10 speakers by default, with configurable min/max speaker counts)
  • Multichannel audio support (each channel = one speaker)
  • Automatic formatting, punctuation, and capitalization
  • Keyterms prompting for boosting domain-specific terms

Speech Understanding Models:

  • Summarization for meeting recaps
  • Sentiment analysis for meeting tone assessment
  • Entity detection for extracting key information
  • Speaker identification to map generic labels to actual names/roles
  • Translation between 99+ languages

Real-Time Streaming

For live meeting transcription:

Languages:

  • English-only model (default)
  • Multilingual model supporting English, Spanish, French, German, Portuguese, and Italian

Streaming-Specific Features:

  • Partial and final transcripts for responsive UI
  • Format turns for structured, readable output
  • Keyterms prompt for contextual accuracy
  • End-of-utterance detection for natural speech boundaries

Coming Soon (Public Roadmap)

  • Enhanced accuracy for English, Spanish, French, German, Portuguese, and Italian post-call transcription
  • Domain-specific models out of the box (e.g., medical)
  • Speaker Diarization improvements (especially on shorter files)
  • Speaker diarization for streaming

How Can I Get Started Building a Post-Call Meeting Notetaker?

Here’s a complete example implementing async transcription with all essential features:

import assemblyai as aai
import asyncio
from typing import Dict, List
from assemblyai.types import (
    SpeakerOptions,
    LanguageDetectionOptions,
    PIIRedactionPolicy,
    PIISubstitutionPolicy,
)

# Configure API key
aai.settings.api_key = "your_api_key_here"

async def transcribe_meeting_async(audio_source: str) -> Dict:
    """
    Asynchronously transcribe a meeting recording with full features

    Args:
        audio_source: Either a local file path or publicly accessible URL
    """
    # Configure comprehensive meeting analysis
    config = aai.TranscriptionConfig(
        # Speaker diarization
        speaker_labels=True,
        speakers_expected=None,  # Use if you know exact number from Zoom/Meet/Teams
        speaker_options=SpeakerOptions(
            min_speakers_expected=2,
            max_speakers_expected=20
        ),  # Use if you know the min/max range
        multichannel=False,  # Set to True if audio has separate channel per speaker

        # Language detection
        language_detection=True,  # Auto-detect the most used language
        language_detection_options=LanguageDetectionOptions(
            code_switching=True,  # Preserve language switches
            code_switching_confidence_threshold=0.5,
        ),

        # Punctuation and formatting
        punctuate=True,
        format_text=True,

        # Boost accuracy of meeting-specific vocabulary
        keyterms_prompt=["quarterly", "KPI", "roadmap", "deliverables"],

        # Speech Understanding - commonly used models
        summarization=True,
        sentiment_analysis=True,
        entity_detection=True,
        redact_pii=True,
        redact_pii_policies=[
            PIIRedactionPolicy.person_name,
            PIIRedactionPolicy.organization,
            PIIRedactionPolicy.occupation,
        ],
        redact_pii_sub=PIISubstitutionPolicy.hash,
        redact_pii_audio=True
    )

    # Create transcriber
    transcriber = aai.Transcriber()

    try:
        # Submit transcription job
        transcript = await asyncio.to_thread(
            transcriber.transcribe,
            audio_source,
            config=config
        )

        # Check status
        if transcript.status == aai.TranscriptStatus.error:
            raise Exception(f"Transcription failed: {transcript.error}")

        # Process speaker-labeled utterances
        print("\n=== SPEAKER-LABELED TRANSCRIPT ===\n")

        for utterance in transcript.utterances:
            # Format timestamp (utterance timestamps are in milliseconds)
            start_time = utterance.start / 1000  # Convert to seconds
            end_time = utterance.end / 1000

            # Print formatted utterance
            print(f"[{start_time:.1f}s - {end_time:.1f}s] Speaker {utterance.speaker}:")
            print(f"  {utterance.text}")
            print(f"  Confidence: {utterance.confidence:.2%}\n")

        # Print summary data
        print("\n=== MEETING SUMMARY ===\n")
        print({
            "id": transcript.id,
            "status": transcript.status,
            "duration": transcript.audio_duration,
            "speaker_count": len(set(u.speaker for u in transcript.utterances)),
            "word_count": len(transcript.words) if transcript.words else 0,
            "detected_language": transcript.language_code if hasattr(transcript, 'language_code') else None,
            "summary": transcript.summary,
        })

        return {
            "transcript": transcript,
            "utterances": transcript.utterances,
            "summary": transcript.summary,
        }

    except Exception as e:
        print(f"Error during transcription: {e}")
        raise

async def main():
    """
    Example usage with error handling
    """
    # Use either local file OR URL (not both)
    audio_source = "https://assembly.ai/wildfires.mp3"  # Or "path/to/recording.mp3"

    try:
        result = await transcribe_meeting_async(audio_source)

        # Additional processing
        print(f"\nTotal speakers identified: {len(set(u.speaker for u in result['utterances']))}")
        print(f"Meeting duration: {result['transcript'].audio_duration} seconds")

    except Exception as e:
        print(f"Failed to process meeting: {e}")

if __name__ == "__main__":
    asyncio.run(main())

How Can I Get Started Building a During-Call Live Meeting Notetaker?

Here’s a complete example for real-time streaming transcription with meeting-optimized settings:

# pip install pyaudio websocket-client
import pyaudio
import websocket
import json
import threading
import time
from urllib.parse import urlencode
from datetime import datetime

# --- Configuration ---
YOUR_API_KEY = "your_api_key"

# Keyterms to improve recognition accuracy
KEYTERMS = [
    "Alice Johnson",
    "Bob Smith",
    "Carol Davis",
    "quarterly review",
    "action items",
    "follow up",
    "deadline",
    "budget"
]

# MEETING NOTETAKER CONFIGURATION (different from voice agents!)
CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "format_turns": True,  # ALWAYS TRUE for meetings - users need readable text

    # Meeting-optimized turn detection (wait longer than voice agents)
    "end_of_turn_confidence_threshold": 0.6,  # Higher than voice agents (0.4)
    "min_end_of_turn_silence_when_confident": 560,  # Wait longer for natural pauses (voice agents use 160ms)
    "max_turn_silence": 2000,  # Allow thinking pauses (voice agents use 1280ms)

    # Keyterms for accuracy
    "keyterms_prompt": json.dumps(KEYTERMS)
}

API_ENDPOINT_BASE_URL = "wss://streaming.assemblyai.com/v3/ws"
API_ENDPOINT = f"{API_ENDPOINT_BASE_URL}?{urlencode(CONNECTION_PARAMS)}"

# Audio Configuration
FRAMES_PER_BUFFER = 800  # 50ms of audio
SAMPLE_RATE = CONNECTION_PARAMS["sample_rate"]
CHANNELS = 1
FORMAT = pyaudio.paInt16

# Global variables
audio = None
stream = None
ws_app = None
audio_thread = None
stop_event = threading.Event()
transcript_buffer = []


def on_open(ws):
    """Called when the WebSocket connection is established."""
    print("=" * 80)
    print(f"[{datetime.now().strftime('%H:%M:%S')}] Meeting transcription started")
    print(f"Connected to: {API_ENDPOINT_BASE_URL}")
    print(f"Keyterms configured: {', '.join(KEYTERMS)}")
    print("=" * 80)
    print("\nSpeak into your microphone. Press Ctrl+C to stop.\n")

    def stream_audio():
        """Stream audio from microphone to WebSocket"""
        global stream
        while not stop_event.is_set():
            try:
                audio_data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
                ws.send(audio_data, websocket.ABNF.OPCODE_BINARY)
            except Exception as e:
                if not stop_event.is_set():
                    print(f"Error streaming audio: {e}")
                break

    global audio_thread
    audio_thread = threading.Thread(target=stream_audio)
    audio_thread.daemon = True
    audio_thread.start()


def on_message(ws, message):
    """Handle incoming messages from AssemblyAI"""
    try:
        data = json.loads(message)
        msg_type = data.get("type")

        # Uncomment to see full JSON for debugging:
        # print("=" * 80)
        # print(json.dumps(data, indent=2, ensure_ascii=False))
        # print("=" * 80)
        # print()

        if msg_type == "Begin":
            session_id = data.get("id", "N/A")
            print(f"[SESSION] Started - ID: {session_id}\n")

        elif msg_type == "Turn":
            end_of_turn = data.get("end_of_turn", False)
            turn_is_formatted = data.get("turn_is_formatted", False)
            transcript = data.get("transcript", "")
            turn_order = data.get("turn_order", 0)
            end_of_turn_confidence = data.get("end_of_turn_confidence", 0.0)

            # FOR MEETING NOTETAKERS: Show partials for responsive UI
            if not end_of_turn and transcript:
                print(f"\r[LIVE] {transcript}", end="", flush=True)

            # FOR MEETING NOTETAKERS: Use formatted finals for readable display
            # (Unlike voice agents which should use utterance for speed)
            if end_of_turn and turn_is_formatted and transcript:
                timestamp = datetime.now().strftime('%H:%M:%S')
                print(f"\n[{timestamp}] {transcript}")
                print(f"  Turn: {turn_order} | Confidence: {end_of_turn_confidence:.2%}")

                # Detect action items
                transcript_lower = transcript.lower()
                if any(term in transcript_lower for term in ["action item", "follow up", "deadline", "assigned to", "todo"]):
                    print("  ⚠️ ACTION ITEM DETECTED!")

                # Store final transcript
                transcript_buffer.append({
                    "timestamp": timestamp,
                    "text": transcript,
                    "turn_order": turn_order,
                    "confidence": end_of_turn_confidence,
                    "type": "final"
                })
                print()

        elif msg_type == "Termination":
            audio_duration = data.get("audio_duration_seconds", 0)
            print(f"\n[SESSION] Terminated - Duration: {audio_duration}s")
            save_transcript()

        elif msg_type == "Error":
            error_msg = data.get("error", "Unknown error")
            print(f"\n[ERROR] {error_msg}")

    except json.JSONDecodeError as e:
        print(f"Error decoding message: {e}")
    except Exception as e:
        print(f"Error handling message: {e}")


def on_error(ws, error):
    """Called when a WebSocket error occurs."""
    print(f"\n[WEBSOCKET ERROR] {error}")
    stop_event.set()


def on_close(ws, close_status_code, close_msg):
    """Called when the WebSocket connection is closed."""
    print(f"\n[WEBSOCKET] Disconnected - Status: {close_status_code}, Message: {close_msg}")

    global stream, audio
    stop_event.set()

    # Clean up audio stream
    if stream:
        if stream.is_active():
            stream.stop_stream()
        stream.close()
        stream = None
    if audio:
        audio.terminate()
        audio = None
    if audio_thread and audio_thread.is_alive():
        audio_thread.join(timeout=1.0)


def save_transcript():
    """Save the transcript to a file"""
    if not transcript_buffer:
        print("No transcript to save.")
        return

    filename = f"meeting_transcript_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt"

    with open(filename, "w", encoding="utf-8") as f:
        f.write("Meeting Transcript\n")
        f.write(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write(f"Keyterms: {', '.join(KEYTERMS)}\n")
        f.write("=" * 80 + "\n\n")

        for entry in transcript_buffer:
            f.write(f"[{entry['timestamp']}] {entry['text']}\n")
            f.write(f"Confidence: {entry['confidence']:.2%}\n\n")

    print(f"Transcript saved to: {filename}")


def run():
    """Main function to run the streaming transcription"""
    global audio, stream, ws_app

    # Initialize PyAudio
    audio = pyaudio.PyAudio()

    # Open microphone stream
    try:
        stream = audio.open(
            input=True,
            frames_per_buffer=FRAMES_PER_BUFFER,
            channels=CHANNELS,
            format=FORMAT,
            rate=SAMPLE_RATE,
        )
        print("Microphone stream opened successfully.")
    except Exception as e:
        print(f"Error opening microphone stream: {e}")
        if audio:
            audio.terminate()
        return

    # Create WebSocketApp
    ws_app = websocket.WebSocketApp(
        API_ENDPOINT,
        header={"Authorization": YOUR_API_KEY},
        on_open=on_open,
        on_message=on_message,
        on_error=on_error,
        on_close=on_close,
    )

    # Run WebSocketApp in a separate thread
    ws_thread = threading.Thread(target=ws_app.run_forever)
    ws_thread.daemon = True
    ws_thread.start()

    try:
        # Keep main thread alive until interrupted
        while ws_thread.is_alive():
            time.sleep(0.1)
    except KeyboardInterrupt:
        print("\n\nCtrl+C received. Stopping transcription...")
        stop_event.set()

        # Send termination message to the server
        if ws_app and ws_app.sock and ws_app.sock.connected:
            try:
                terminate_message = {"type": "Terminate"}
                ws_app.send(json.dumps(terminate_message))
                time.sleep(1)
            except Exception as e:
                print(f"Error sending termination message: {e}")

        if ws_app:
            ws_app.close()

        ws_thread.join(timeout=2.0)

    finally:
        # Final cleanup
        if stream and stream.is_active():
            stream.stop_stream()
        if stream:
            stream.close()
        if audio:
            audio.terminate()
        print("Cleanup complete. Exiting.")


if __name__ == "__main__":
    run()

These meeting-optimized settings wait longer before ending turns to accommodate natural conversation pauses, and they return formatted text that is readable on screen. Tweak them to get the best results for your notetaker.

How Do I Handle Multichannel Meeting Audio?

Many meeting platforms (Zoom, Teams, Google Meet) can record each participant on separate audio channels. This dramatically improves speaker identification accuracy.

For Pre-recorded Meetings

config = aai.TranscriptionConfig(
    multichannel=True,  # Enable when each speaker is on a different channel
    speaker_labels=False,  # Disable - channels already separate speakers
    # Other settings...
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(audio_file, config=config)

# Access per-channel transcripts (each utterance carries the channel it came from)
for utterance in transcript.utterances:
    print(f"[Channel {utterance.channel}] {utterance.text}")

When to use multichannel:

  • Zoom local recordings with “Record separate audio file for each participant” enabled
  • Professional podcast recordings with individual microphones
  • Conference systems with dedicated channels per participant
  • Phone calls with caller and callee on separate channels

Benefits:

  • Perfect speaker separation - No diarization errors
  • No speaker confusion or overlap issues
  • Faster processing time - Diarization not needed
  • Higher accuracy - Model processes clean single-speaker audio

How to enable in meeting platforms:

  • Zoom: Settings → Recording → Advanced → “Record a separate audio file for each participant”
  • Teams: Requires third-party recording solutions like Recall.ai
  • Google Meet: Requires third-party recording solutions like Recall.ai

For Streaming Meetings

For real-time multichannel audio, create separate streaming sessions per channel:

import asyncio
import json
import websockets
from urllib.parse import urlencode

API_KEY = "your_api_key"

class ChannelTranscriber:
    def __init__(self, channel_id: int, speaker_name: str):
        self.channel_id = channel_id
        self.speaker_name = speaker_name
        self.connection_params = {
            "sample_rate": 16000,
            "format_turns": True,
        }

    async def transcribe_channel(self, audio_stream):
        """Transcribe a single audio channel"""
        url = f"wss://streaming.assemblyai.com/v3/ws?{urlencode(self.connection_params)}"

        async with websockets.connect(url, extra_headers={"Authorization": API_KEY}) as ws:
            # Send audio from this channel only
            async for audio_chunk in audio_stream:
                await ws.send(audio_chunk)

            # Receive transcripts
            async for message in ws:
                data = json.loads(message)
                if data.get("type") == "Turn" and data.get("turn_is_formatted"):
                    print(f"{self.speaker_name}: {data['transcript']}")

# Create a transcriber for each channel
async def transcribe_multichannel_meeting(channel_audio_streams):
    transcribers = [
        ChannelTranscriber(0, "Alice"),
        ChannelTranscriber(1, "Bob"),
    ]

    # Run all channels concurrently
    await asyncio.gather(*[
        t.transcribe_channel(stream)
        for t, stream in zip(transcribers, channel_audio_streams)
    ])

See our multichannel streaming guide for complete implementation details.

How Should I Handle Pre-recorded Transcription in Production?

Choose the right approach based on your application’s needs:

Option 1: Async/Await (Simple Cases)

# Simple blocking call
transcript = await asyncio.to_thread(transcriber.transcribe, audio_url, config=config)

Pros:

  • Simple, straightforward code
  • Good for low volume applications
  • Easy to understand and debug

Cons:

  • Ties up resources while waiting
  • Not suitable for high volume
  • Cannot process multiple files simultaneously

Best for: Personal projects, prototypes, low-traffic applications

Option 2: Webhooks (Recommended for Production)

config = aai.TranscriptionConfig(
    webhook_url="https://your-app.com/webhooks/assemblyai",
    webhook_auth_header_name="X-Webhook-Secret",
    webhook_auth_header_value="your_secret_here",
    speaker_labels=True,
    summarization=True,
    # ... other config
)

# Submit job and return immediately (non-blocking)
transcript = transcriber.submit(audio_url, config=config)
print(f"Job submitted: {transcript.id}")
# Your app can continue processing other requests

# Your webhook receives results when ready (typically 15-30% of audio duration)

Webhook handler example:

from flask import Flask, request, jsonify
import assemblyai as aai

aai.settings.api_key = "your_api_key"

app = Flask(__name__)

@app.route("/webhooks/assemblyai", methods=["POST"])
def assemblyai_webhook():
    # Verify webhook authenticity
    if request.headers.get("X-Webhook-Secret") != "your_secret_here":
        return jsonify({"error": "Unauthorized"}), 401

    # The webhook payload contains the transcript ID and status
    data = request.json
    transcript_id = data["transcript_id"]
    status = data["status"]

    if status == "completed":
        # Fetch the full transcript, then process it
        transcript = aai.Transcript.get_by_id(transcript_id)
        process_completed_meeting(transcript)
    elif status == "error":
        # Handle error
        log_transcription_error(transcript_id, data.get("error"))

    return jsonify({"received": True}), 200

def process_completed_meeting(transcript):
    """Process completed meeting transcript"""
    # Extract data
    utterances = transcript.utterances
    summary = transcript.summary

    # Store in database
    save_to_database(transcript)

    # Notify user
    send_notification(transcript.id)

Pros:

  • Non-blocking - submit and forget
  • Scales to high volume
  • Process multiple files in parallel
  • Automatic retry on failures
  • Get notified when complete

Best for: Production apps, user-uploaded recordings, batch processing, SaaS products

Option 3: Polling (Custom Workflows)

# Submit job
transcript = transcriber.submit(audio_url, config=config)
print(f"Submitted: {transcript.id}")

# Poll for completion with progress tracking
while transcript.status not in [aai.TranscriptStatus.completed, aai.TranscriptStatus.error]:
    await asyncio.sleep(5)
    transcript = aai.Transcript.get_by_id(transcript.id)

    # Optional: Show progress
    print(f"Status: {transcript.status}...")

if transcript.status == aai.TranscriptStatus.completed:
    process_transcript(transcript)
else:
    print(f"Error: {transcript.error}")

Pros:

  • Full control over retry logic
  • Can show progress to users
  • Good for background jobs
  • Works without webhook infrastructure

Cons:

  • Must implement your own polling loop
  • Ties up resources while polling
  • More complex than webhooks

Best for: Background job processors, CLIs with progress bars, custom retry logic

Comparison Table

Method       | Blocking | Scalability | Complexity | Best For
Async/Await  | Yes      | Low         | Low        | Prototypes, low volume
Webhooks     | No       | High        | Medium     | Production, high volume
Polling      | Partial  | Medium      | Medium     | Background jobs, progress UI

How Do I Identify Speakers in My Recording?

Speaker diarization tells you when speakers change (“Speaker A”, “Speaker B”), but Speaker Identification tells you who they are by name or role.

Why Use Speaker Identification?

Instead of:

Speaker A: Let's review the Q3 numbers.
Speaker B: Revenue was up 15% this quarter.
Speaker A: Excellent work on that launch.

You get:

Sarah Chen: Let's review the Q3 numbers.
Michael Rodriguez: Revenue was up 15% this quarter.
Sarah Chen: Excellent work on that launch.

How It Works

Speaker Identification uses AssemblyAI’s Speech Understanding API to map generic speaker labels to actual names or roles that you provide:

import assemblyai as aai

aai.settings.api_key = "your_api_key"

# Step 1: Transcribe with speaker diarization
config = aai.TranscriptionConfig(
    speaker_labels=True,  # Must enable speaker diarization first
    speech_understanding={
        "request": {
            "speaker_identification": {
                "speaker_type": "name",  # or "role"
                "known_values": ["Sarah Chen", "Michael Rodriguez", "Alex Kim"]
            }
        }
    }
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("meeting_recording.mp3", config=config)

# Access results with identified speakers
for utterance in transcript.utterances:
    print(f"{utterance.speaker}: {utterance.text}")

Identifying by Role Instead of Name

For customer service, sales calls, or scenarios where you don’t know names:

config = aai.TranscriptionConfig(
    speaker_labels=True,
    speech_understanding={
        "request": {
            "speaker_identification": {
                "speaker_type": "role",
                "known_values": ["Agent", "Customer"]  # or ["Interviewer", "Interviewee"]
            }
        }
    }
)

Common role combinations:

  • ["Agent", "Customer"] - Customer service calls
  • ["Support", "Customer"] - Technical support
  • ["Interviewer", "Interviewee"] - Interviews
  • ["Host", "Guest"] - Podcasts
  • ["Doctor", "Patient"] - Medical consultations (with HIPAA compliance)

How to Get Speaker Names

For platform recordings:

  1. Zoom: Extract participant names from Zoom API or meeting JSON
  2. Teams: Get attendees from Microsoft Graph API
  3. Google Meet: Use Google Calendar API to get participants

Example with Zoom:

# Get participant names from Zoom meeting
zoom_participants = get_zoom_meeting_participants(meeting_id)
speaker_names = [p["name"] for p in zoom_participants]

# Use in speaker identification
config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=len(speaker_names),  # Hint: exact number of speakers
    speech_understanding={
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "known_values": speaker_names
            }
        }
    }
)
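The get_zoom_meeting_participants helper above is not part of the AssemblyAI SDK. One possible sketch, assuming a Zoom OAuth access token and a meeting that has already ended (Zoom's past-meeting participants endpoint is used here illustratively, and the token parameter is an addition not shown in the call above):

import requests

def get_zoom_meeting_participants(meeting_id: str, zoom_access_token: str) -> list:
    """Illustrative helper: fetch participant names for a finished Zoom meeting."""
    response = requests.get(
        f"https://api.zoom.us/v2/past_meetings/{meeting_id}/participants",
        headers={"Authorization": f"Bearer {zoom_access_token}"},
        params={"page_size": 300},  # adjust paging for very large meetings
    )
    response.raise_for_status()
    # Each participant entry includes a "name" field, as used above
    return response.json().get("participants", [])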

Speaker Identification Requirements and Accuracy

Requirements:

  1. Speaker diarization must be enabled - Cannot identify speakers without diarization first
  2. Requires sufficient audio per speaker - Each speaker needs enough speech for accurate matching
  3. Works best with distinct voices - Similar voices may be confused
  4. US English only - Currently only supports US region audio
  5. Post-processing step - Adds additional processing time after transcription

Accuracy depends on:

  • Audio quality (clear, minimal background noise)
  • Voice distinctiveness (different genders, accents, tones)
  • Amount of speech per speaker (more = better)
  • Number of speakers (fewer = more accurate)

Alternative: Add Identification Later

You can also add speaker identification to an existing transcript:

# First, transcribe with speaker diarization
transcript = transcriber.transcribe(audio_url, config=aai.TranscriptionConfig(speaker_labels=True))

# Later, add speaker identification
transcript_dict = transcript.json_response

# Add speaker identification
transcript_dict["speech_understanding"] = {
    "request": {
        "speaker_identification": {
            "speaker_type": "name",
            "known_values": ["Sarah Chen", "Michael Rodriguez"]
        }
    }
}

# Send to Speech Understanding API
import requests
result = requests.post(
    "https://llm-gateway.assemblyai.com/v1/understanding",
    headers={"Authorization": aai.settings.api_key},
    json=transcript_dict
).json()

# Access identified speakers
for utterance in result["utterances"]:
    print(f"{utterance['speaker']}: {utterance['text']}")

This approach is useful when:

  • You get speaker names after the transcription completes
  • You want to try different name mappings
  • Building iterative workflows where users confirm speaker identities

For complete API details, see our Speaker Identification documentation.

How Do I Translate Between Languages in Meetings?

AssemblyAI supports translation between 99+ languages, enabling you to transcribe meetings in one language and translate to another.

When to Use Translation

Common use cases:

  • Transcribe Spanish meeting → Translate to English for documentation
  • Transcribe multilingual meeting → Translate all to common language
  • Create translated meeting notes for international teams
  • Provide translated summaries for stakeholders

Basic Translation

import assemblyai as aai

aai.settings.api_key = "your_api_key"

config = aai.TranscriptionConfig(
    language_code="es",  # Spanish audio
    translation_language_code="en",  # Translate to English
    speaker_labels=True,  # Maintain speaker labels
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("spanish_meeting.mp3", config=config)

# Access original Spanish transcript
print("Original (Spanish):")
print(transcript.text)

# Access English translation
print("\nTranslation (English):")
print(transcript.translation)

Translation with Code Switching

For meetings where participants switch between languages:

config = aai.TranscriptionConfig(
    language_detection=True,  # Auto-detect languages
    language_detection_options=aai.LanguageDetectionOptions(
        code_switching=True,  # Preserve language switches
        code_switching_confidence_threshold=0.5
    ),
    translation_language_code="en",  # Translate everything to English
)

transcript = transcriber.transcribe("multilingual_meeting.mp3", config=config)

# Original transcript preserves language switches
for utterance in transcript.utterances:
    detected_lang = utterance.language_code if hasattr(utterance, 'language_code') else "unknown"
    print(f"[{detected_lang}] {utterance.speaker}: {utterance.text}")

# Translation combines everything in English
print(f"\nEnglish translation:\n{transcript.translation}")

Supported Language Pairs

AssemblyAI supports translation between 99+ languages, including:

Popular combinations:

  • Spanish ↔ English
  • French ↔ English
  • German ↔ English
  • Mandarin ↔ English
  • Japanese ↔ English
  • Portuguese ↔ English
  • And all combinations between supported languages

Translation Response Format

{
  "text": "Original transcript in source language",
  "translation": "Translated transcript in target language",
  "language_code": "es",  # Detected/specified source language
  "utterances": [
    {
      "speaker": "A",
      "text": "Hola, ¿cómo estás?",  # Original
      "translation": "Hello, how are you?",  # Translated
      "start": 0,
      "end": 1500
    }
  ]
}
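If you need the per-utterance translations rather than the combined text, you can read them from the raw response. A small sketch, assuming the response shape shown above and the Python SDK's json_response attribute:

# Read per-utterance translations from the raw API response
raw = transcript.json_response  # dict mirroring the response format above

for u in raw.get("utterances", []):
    print(f"Speaker {u.get('speaker')}: {u.get('text')}")
    print(f"  → {u.get('translation')}")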

Real-World Example: International Team Meeting

import asyncio
import assemblyai as aai

async def process_international_meeting(audio_url: str, target_language: str = "en"):
    """
    Process multilingual meeting with translation and identification
    """
    config = aai.TranscriptionConfig(
        # Handle multiple languages
        language_detection=True,
        language_detection_options=aai.LanguageDetectionOptions(
            code_switching=True,
            code_switching_confidence_threshold=0.5
        ),

        # Translate to common language
        translation_language_code=target_language,

        # Speaker identification
        speaker_labels=True,

        # Generate summary in target language
        summarization=True,
    )

    transcriber = aai.Transcriber()
    transcript = await asyncio.to_thread(
        transcriber.transcribe,
        audio_url,
        config=config
    )

    return {
        "original_transcript": transcript.text,
        "translated_transcript": transcript.translation,
        "summary": transcript.summary,  # Already in target language
        "speakers": list(set(u.speaker for u in transcript.utterances)),
        "languages_detected": list(set(
            u.language_code for u in transcript.utterances
            if hasattr(u, 'language_code')
        ))
    }

For complete language support and translation details, see our Translation documentation.

What Workflows Can I Build for My AI Meeting Notetaker?

Use these Speech Understanding and Guardrails features to transform raw transcripts into actionable insights.

Summarization

summarization: true

What it does: Generates an abstractive recap of the conversation (not verbatim).
Output: summary string (bullets/paragraph format).
Great for: Meeting notes, call recaps, executive summaries.
Notes: Condenses and rephrases; minor details may be omitted by design.

Example:

config = aai.TranscriptionConfig(
    summarization=True,
    summary_type="bullets",  # or "paragraph"
    summary_model="informative",  # or "conversational"
)

Sentiment Analysis

sentiment_analysis: true

What it does: Scores per-utterance sentiment (positive / neutral / negative).
Output: Array of { text, sentiment, confidence, start, end }.
Great for: Customer satisfaction tracking, coaching, churn prediction.
Notes: Segment-level (not global mood); sarcasm and very short utterances are harder to classify.

Example:

for utterance in transcript.sentiment_analysis_results:
    if utterance.sentiment == "NEGATIVE":
        print(f"Negative sentiment detected: {utterance.text}")

Entity Detection

entity_detection: true

What it does: Extracts named entities (people, organizations, locations, products, etc.).
Output: Array of { entity_type, text, start, end, confidence }.
Great for: Auto-tagging topics, tracking competitors mentioned, CRM enrichment.
Notes: Operates on post-redaction text if PII redaction is enabled.

Example:

# Extract all organizations mentioned
organizations = [
    entity.text for entity in transcript.entities
    if entity.entity_type == "organization"
]
print(f"Companies mentioned: {', '.join(organizations)}")

Redact PII Text

redact_pii: true

What it does: Scans transcript for personally identifiable information and replaces matches per policy.
Output: text with replacements; original word timings are preserved.
Great for: GDPR/CCPA compliance, safe sharing, SOC2 requirements.
Notes: Runs before downstream features; they see the redacted text.

Recommended policies for meetings:

config = aai.TranscriptionConfig(
    redact_pii=True,
    redact_pii_policies=[
        PIIRedactionPolicy.person_name,  # Remove names
        PIIRedactionPolicy.email_address,  # Remove emails
        PIIRedactionPolicy.phone_number,  # Remove phone numbers
        PIIRedactionPolicy.organization,  # Remove company names
    ],
    redact_pii_sub=PIISubstitutionPolicy.hash,  # Stable hash tokens
)

Why hash substitution?

  • Stable across the file (same value → same token)
  • Maintains sentence structure for LLM processing
  • Prevents reconstruction of original data

Redact PII Audio

redact_pii_audio: true

What it does: Produces a second audio file where redacted portions are bleeped/silenced.
Output: redacted_audio_url in the transcript response.
Great for: External sharing, training materials, demos.
Notes: Original audio is untouched; bleeped sections may sound choppy.

Complete Example

config = aai.TranscriptionConfig(
    # Core transcription
    speaker_labels=True,

    # Speech Understanding
    summarization=True,
    sentiment_analysis=True,
    entity_detection=True,

    # PII protection
    redact_pii=True,
    redact_pii_policies=[
        PIIRedactionPolicy.person_name,
        PIIRedactionPolicy.email_address,
        PIIRedactionPolicy.phone_number,
    ],
    redact_pii_sub=PIISubstitutionPolicy.hash,
    redact_pii_audio=True,
)

transcript = transcriber.transcribe(audio_url, config=config)

# Access all features
meeting_insights = {
    "summary": transcript.summary,
    "sentiment_trend": analyze_sentiment_trend(transcript.sentiment_analysis_results),
    "entities": extract_entities(transcript.entities),
    "safe_transcript": transcript.text,  # PII redacted
    "safe_audio": transcript.redacted_audio_url,  # PII bleeped
}
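analyze_sentiment_trend and extract_entities above are placeholders rather than SDK functions; a minimal sketch of what they might look like:

from collections import Counter

def analyze_sentiment_trend(sentiment_results) -> dict:
    """Placeholder helper: share of positive/neutral/negative utterances."""
    counts = Counter(str(r.sentiment) for r in sentiment_results or [])
    total = sum(counts.values()) or 1
    return {label: count / total for label, count in counts.items()}

def extract_entities(entities) -> dict:
    """Placeholder helper: group detected entity text by entity type."""
    grouped = {}
    for entity in entities or []:
        grouped.setdefault(str(entity.entity_type), []).append(entity.text)
    return grouped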

How Do I Improve the Accuracy of My Notetaker?

The biggest lever is the keyterms prompt. Best practices when building your keyterms list:

  • Include participant names for better speaker recognition
  • Add company-specific jargon and acronyms
  • Include product names and technical terms
  • Keep individual terms under 50 characters
  • Maximum 100 terms per request

Using Keyterms Prompt for Async Transcription

Keyterms prompting improves recognition accuracy for domain-specific vocabulary by up to 21%:

# Define domain-specific vocabulary
company_terms = [
    "AssemblyAI",
    "Universal-Streaming",
    "Speech Understanding",
    "diarization"
]

participant_names = [
    "Dylan Fox",
    "Sarah Chen",
    "Michael Rodriguez"
]

technical_terms = [
    "API endpoint",
    "WebSocket",
    "latency metrics",
    "TTFT"
]

# Configure with keyterms prompt
config = aai.TranscriptionConfig(
    keyterms_prompt=company_terms + participant_names + technical_terms,
    speaker_labels=True,
    # ... other settings
)

Using Keyterms Prompt for Streaming

# Streaming with contextual keyterms
keyterms = [
    # Participant names
    "Alice Johnson",
    "Bob Smith",

    # Meeting-specific vocabulary
    "Q4 objectives",
    "revenue targets",
    "customer acquisition",

    # Technical terms
    "API integration",
    "cloud migration"
]

CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "format_turns": True,
    "keyterms_prompt": json.dumps(keyterms),  # JSON-encode for URL params
}

How Do I Process the Response from the API?

Processing Async Responses

def process_async_transcript(transcript):
    """
    Extract and process all relevant data from async transcript
    """
    # Basic transcript data
    meeting_data = {
        "id": transcript.id,
        "duration": transcript.audio_duration,
        "confidence": transcript.confidence,
        "full_text": transcript.text
    }

    # Process speaker utterances
    speakers = {}
    for utterance in transcript.utterances:
        speaker = utterance.speaker

        if speaker not in speakers:
            speakers[speaker] = {
                "utterances": [],
                "total_speaking_time": 0,
                "word_count": 0
            }

        speakers[speaker]["utterances"].append({
            "text": utterance.text,
            "start": utterance.start,
            "end": utterance.end,
            "confidence": utterance.confidence
        })

        # Calculate speaking time (utterance timestamps are in milliseconds)
        speakers[speaker]["total_speaking_time"] += (utterance.end - utterance.start) / 1000
        speakers[speaker]["word_count"] += len(utterance.text.split())

    meeting_data["speakers"] = speakers

    # Extract summary
    if transcript.summary:
        meeting_data["summary"] = transcript.summary

    # Calculate meeting statistics (audio_duration is already in seconds)
    total_duration = transcript.audio_duration
    meeting_data["statistics"] = {
        "total_speakers": len(speakers),
        "total_words": sum(s["word_count"] for s in speakers.values()),
        "average_confidence": transcript.confidence,
        "speaking_distribution": {
            speaker: {
                "percentage": (data["total_speaking_time"] / total_duration) * 100,
                "minutes": data["total_speaking_time"] / 60
            }
            for speaker, data in speakers.items()
        }
    }

    return meeting_data

# Example usage
result = process_async_transcript(transcript)
print(f"Meeting had {result['statistics']['total_speakers']} speakers")
print(f"Speaker A spoke for {result['statistics']['speaking_distribution']['A']['minutes']:.1f} minutes")

Processing Streaming Responses

import json
from datetime import datetime

import websockets

class StreamingResponseProcessor:
    def __init__(self):
        self.partial_buffer = ""
        self.final_transcripts = []
        self.turn_metadata = []

    def process_message(self, message: dict):
        """
        Process real-time streaming messages
        """
        msg_type = message.get("type")

        if msg_type == "Begin":
            return {
                "event": "session_started",
                "session_id": message.get("id"),
                "expires_at": message.get("expires_at")
            }

        elif msg_type == "Turn":
            return self.process_turn(message)

        elif msg_type == "Termination":
            return {
                "event": "session_ended",
                "audio_duration": message.get("audio_duration_seconds"),
                "session_duration": message.get("session_duration_seconds")
            }

        # Unknown message types are ignored
        return {"event": "ignored"}

    def process_turn(self, data: dict):
        """Process turn messages"""
        is_final = data.get("end_of_turn")
        is_formatted = data.get("turn_is_formatted")
        transcript = data.get("transcript", "")
        turn_order = data.get("turn_order")

        response = {
            "event": "ignored",  # default when a turn is neither a partial nor a formatted final
            "turn_order": turn_order,
            "is_final": is_final,
            "is_formatted": is_formatted,
            "confidence": data.get("end_of_turn_confidence", 0)
        }

        # Handle partials (for live display)
        if not is_final and transcript:
            self.partial_buffer = transcript
            response["event"] = "partial"
            response["text"] = transcript

        # Handle finals (for storage)
        elif is_final and is_formatted:
            final_transcript = {
                "turn_order": turn_order,
                "text": transcript,
                "confidence": data.get("end_of_turn_confidence"),
                "timestamp": datetime.now().isoformat()
            }
            self.final_transcripts.append(final_transcript)
            response["event"] = "final"
            response["text"] = transcript

            # Clear partial buffer
            self.partial_buffer = ""

        return response

    def get_full_transcript(self):
        """
        Combine all final transcripts into complete meeting transcript
        """
        return {
            "full_text": " ".join(t["text"] for t in self.final_transcripts),
            "transcripts": self.final_transcripts,
            "total_turns": len(self.final_transcripts)
        }

# Example usage
processor = StreamingResponseProcessor()

async with websockets.connect(API_ENDPOINT, extra_headers=headers) as ws:
    async for message in ws:
        data = json.loads(message)
        result = processor.process_message(data)

        if result["event"] == "partial":
            # Update UI with live transcript
            update_live_caption(result["text"])

        elif result["event"] == "final":
            # Save final transcript
            save_transcript_segment(result)

# Get complete transcript when done
full_transcript = processor.get_full_transcript()

Additional Resources