Evaluate Streaming transcription accuracy with WER
Learn how to evaluate the accuracy of your AssemblyAI streaming transcripts using Word Error Rate (WER), the industry-standard metric for measuring speech-to-text performance. This guide walks you through setting up a complete benchmarking workflow to measure how well your streaming implementation performs against a reference transcript.
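WER counts the word-level substitutions (S), deletions (D), and insertions (I) needed to turn the hypothesis into the reference, divided by the number of words in the reference (N): WER = (S + D + I) / N. As a quick illustration with toy strings (not taken from this guide's audio), jiwer computes the metric directly:

```python
import jiwer

reference = "the quick brown fox"
hypothesis = "the quick brown box jumps"  # 1 substitution + 1 insertion over 4 reference words

print(jiwer.wer(reference, hypothesis))  # (1 + 0 + 1) / 4 = 0.5
```

A score of 0.5 means half of the reference words would need to change; the workflow below reports the same metric as a percentage.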
Quickstart
```python
# pip install assemblyai whisper-normalizer jiwer
import jiwer
from whisper_normalizer.english import EnglishTextNormalizer
from whisper_normalizer.basic import BasicTextNormalizer

import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TerminationEvent,
    TurnEvent
)
from typing import Type

# Global variable to collect assembly transcripts
assembly_streaming_transcript = ""

def on_begin(self: Type[StreamingClient], event: BeginEvent):
    "This function is called when the connection has been established."

    print("Session ID:", event.id)

def on_turn(self: Type[StreamingClient], event: TurnEvent):
    "This function is called when a new transcript has been received."
    global assembly_streaming_transcript

    if event.end_of_turn and event.turn_is_formatted:
        assembly_streaming_transcript += event.transcript + " "

        print(event.transcript, end="\r\n")

def on_terminated(self: Type[StreamingClient], event: TerminationEvent):
    "This function is called when the session has been terminated."

    print(
        f"Session terminated: {event.audio_duration_seconds} seconds of audio processed"
    )

def on_error(self: Type[StreamingClient], error: StreamingError):
    "This function is called when an error occurs."

    print(f"Error occurred: {error}")


# Create the streaming client
client = StreamingClient(
    StreamingClientOptions(
        api_key="YOUR-API-KEY"
    )
)

client.on(StreamingEvents.Begin, on_begin)
client.on(StreamingEvents.Turn, on_turn)
client.on(StreamingEvents.Termination, on_terminated)
client.on(StreamingEvents.Error, on_error)

def stream_file(filepath: str, sample_rate: int):
    """Stream an audio file to the client in 100ms chunks."""
    import time
    import wave

    chunk_duration = 0.1

    with wave.open(filepath, 'rb') as wav_file:
        if wav_file.getnchannels() != 1:
            raise ValueError("Only mono audio is supported")

        file_sample_rate = wav_file.getframerate()
        if file_sample_rate != sample_rate:
            print(f"Warning: File sample rate ({file_sample_rate}) doesn't match expected rate ({sample_rate})")

        frames_per_chunk = int(file_sample_rate * chunk_duration)

        while True:
            frames = wav_file.readframes(frames_per_chunk)

            if not frames:
                break

            yield frames

            # time.sleep(chunk_duration)

file_stream = stream_file(
    filepath="audio.wav",
    sample_rate=48000,
)

client.connect(
    StreamingParameters(
        sample_rate=48000,
        format_turns=True,
    )
)

try:
    client.stream(file_stream)
finally:
    client.disconnect(terminate=True)

# Evaluate collected transcripts
reference_transcript = "AssemblyAI is a deep learning company that builds powerful APIs to help you transcribe and understand audio. The most common use case for the API is to automatically convert prerecorded audio and video files as well as real time audio streams into text transcriptions. Our APIs convert audio and video into text using powerful deep learning models that we research and develop end to end in house. Millions of podcasts, zoom recordings, phone calls or video files are being transcribed with Assembly AI every single day. But where Assembly AI really excels is with helping you understand your data. So let's say we transcribe Joe Biden's State of the Union using Assembly AI's API. With our Auto Chapters feature, you can generate time coded summaries of the key moments of your audio file. For example, with the State of the Union address we get chapter summaries like this. Auto Chapters automatically segments your audio or video files into chapters and provides a summary for each of these chapters. With Sentiment Analysis, we can classify what's being spoken in your audio files as either positive, negative or neutral. So for example, in the State of the Union address we see that this sentence was classified as positive, whereas this sentence was classified as negative. Content Safety Detection can flag sensitive content as it is spoken like hate speech, profanity, violence or weapons. For example, in Biden's State of the Union address, content safety detection flagged parts of his speech as being about weapons. This feature is especially useful for automatic content moderation and brand safety use cases. With Auto Highlights, you can automatically identify important words and phrases that are being spoken in your data owned by the State of the Union address. AssemblyAI's API detected these words and phrases as being important. Lastly, with entity detection you can identify entities that are spoken in your audio like organization names or person names. In Biden's speech, these were the entities that were detected. This is just a preview of the most popular features of AssemblyAI's API. If you want a full list of features, go check out our API documentation linked in the description below. And if you ever need some support, our team of developers is here to help. Everyday developers are using these features to build really exciting applications. From meeting summarizers to brand safety or contextual targeting platforms to full blown conversational intelligence tools. We can't wait to see what you build with AssemblyAI."

# Initialize normalizers
normalizer = EnglishTextNormalizer()
# For Spanish and other languages
# normalizer = BasicTextNormalizer()

def calculate_wer(reference, hypothesis, language='en'):
    # Normalize both texts
    normalized_reference = normalizer(reference)
    print("Reference: " + reference)
    print("Normalized Reference: " + normalized_reference + "\n")

    normalized_hypothesis = normalizer(hypothesis)
    print("Hypothesis: " + hypothesis)
    print("Normalized Hypothesis: " + normalized_hypothesis + "\n")

    # Calculate WER
    wer = jiwer.wer(normalized_reference, normalized_hypothesis)

    return wer * 100  # Return as percentage

wer_score = calculate_wer(reference_transcript, assembly_streaming_transcript.strip())
print(f"Final WER: {wer_score:.2f}%")
```
Step-by-step implementation
- Install the required dependencies
```bash
pip install assemblyai whisper-normalizer jiwer
```
- Import the necessary libraries
```python
# pip install assemblyai whisper-normalizer jiwer
import jiwer
from whisper_normalizer.english import EnglishTextNormalizer
from whisper_normalizer.basic import BasicTextNormalizer

import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TerminationEvent,
    TurnEvent
)
from typing import Type
```
- Set up transcript collection. Create a global variable to store streaming transcripts. Your streaming session will append to this variable as it processes audio, and you'll use it for WER analysis.
```python
# Global variable to collect assembly transcripts
assembly_streaming_transcript = ""
```
- Configure streaming audio processing
Stream your audio file to the AssemblyAI endpoint. The `on_turn` function captures formatted transcripts and appends them to your collection variable.
```python
def on_begin(self: Type[StreamingClient], event: BeginEvent):
    "This function is called when the connection has been established."

    print("Session ID:", event.id)

def on_turn(self: Type[StreamingClient], event: TurnEvent):
    "This function is called when a new transcript has been received."
    global assembly_streaming_transcript

    if event.end_of_turn and event.turn_is_formatted:
        assembly_streaming_transcript += event.transcript + " "

        print(event.transcript, end="\r\n")

def on_terminated(self: Type[StreamingClient], event: TerminationEvent):
    "This function is called when the session has been terminated."

    print(
        f"Session terminated: {event.audio_duration_seconds} seconds of audio processed"
    )

def on_error(self: Type[StreamingClient], error: StreamingError):
    "This function is called when an error occurs."

    print(f"Error occurred: {error}")


# Create the streaming client
client = StreamingClient(
    StreamingClientOptions(
        api_key="YOUR-API-KEY"
    )
)

client.on(StreamingEvents.Begin, on_begin)
client.on(StreamingEvents.Turn, on_turn)
client.on(StreamingEvents.Termination, on_terminated)
client.on(StreamingEvents.Error, on_error)

def stream_file(filepath: str, sample_rate: int):
    """Stream an audio file to the client in 100ms chunks."""
    import time
    import wave

    chunk_duration = 0.1

    with wave.open(filepath, 'rb') as wav_file:
        if wav_file.getnchannels() != 1:
            raise ValueError("Only mono audio is supported")

        file_sample_rate = wav_file.getframerate()
        if file_sample_rate != sample_rate:
            print(f"Warning: File sample rate ({file_sample_rate}) doesn't match expected rate ({sample_rate})")

        frames_per_chunk = int(file_sample_rate * chunk_duration)

        while True:
            frames = wav_file.readframes(frames_per_chunk)

            if not frames:
                break

            yield frames

            # time.sleep(chunk_duration)

file_stream = stream_file(
    filepath="audio.wav",
    sample_rate=48000,
)

client.connect(
    StreamingParameters(
        sample_rate=48000,
        format_turns=True,
    )
)

try:
    client.stream(file_stream)
finally:
    client.disconnect(terminate=True)
```
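The example above hard-codes 48000 Hz. If you're not sure of a file's rate, a small helper like the one below (a convenience sketch, not part of the original guide; `get_wav_sample_rate` is a name introduced here) can read it from the WAV header so the value passed to `stream_file` and `StreamingParameters` always matches the file:

```python
# Convenience sketch: derive the sample rate from the WAV header instead of
# hard-coding it, so the streaming parameters always match the file being sent.
import wave

def get_wav_sample_rate(filepath: str) -> int:
    with wave.open(filepath, "rb") as wav_file:
        return wav_file.getframerate()

rate = get_wav_sample_rate("audio.wav")
file_stream = stream_file(filepath="audio.wav", sample_rate=rate)

client.connect(
    StreamingParameters(
        sample_rate=rate,
        format_turns=True,
    )
)
```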
- Prepare your reference transcript. Define the ground-truth transcript for comparison. This serves as your accuracy benchmark for the WER calculation.
Pro tip: Create a high-quality reference transcript by first transcribing your audio file with AssemblyAI’s Slam-1 model, then manually reviewing and correcting any errors to achieve 100% accuracy.
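As a rough sketch of that drafting step (assuming the version of the Python SDK you have installed exposes Slam-1 as `aai.SpeechModel.slam_1`), you could generate a first draft with the prerecorded API and then correct it by hand:

```python
# Sketch only: draft a ground-truth transcript with the prerecorded API,
# then review and correct it manually before using it as the reference.
import assemblyai as aai

aai.settings.api_key = "YOUR-API-KEY"

config = aai.TranscriptionConfig(speech_model=aai.SpeechModel.slam_1)
transcript = aai.Transcriber(config=config).transcribe("audio.wav")

print(transcript.text)  # paste into reference_transcript after manual review
```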
```python
# Evaluate collected transcripts
reference_transcript = "AssemblyAI is a deep learning company that builds powerful APIs to help you transcribe and understand audio. The most common use case for the API is to automatically convert prerecorded audio and video files as well as real time audio streams into text transcriptions. Our APIs convert audio and video into text using powerful deep learning models that we research and develop end to end in house. Millions of podcasts, zoom recordings, phone calls or video files are being transcribed with Assembly AI every single day. But where Assembly AI really excels is with helping you understand your data. So let's say we transcribe Joe Biden's State of the Union using Assembly AI's API. With our Auto Chapters feature, you can generate time coded summaries of the key moments of your audio file. For example, with the State of the Union address we get chapter summaries like this. Auto Chapters automatically segments your audio or video files into chapters and provides a summary for each of these chapters. With Sentiment Analysis, we can classify what's being spoken in your audio files as either positive, negative or neutral. So for example, in the State of the Union address we see that this sentence was classified as positive, whereas this sentence was classified as negative. Content Safety Detection can flag sensitive content as it is spoken like hate speech, profanity, violence or weapons. For example, in Biden's State of the Union address, content safety detection flagged parts of his speech as being about weapons. This feature is especially useful for automatic content moderation and brand safety use cases. With Auto Highlights, you can automatically identify important words and phrases that are being spoken in your data owned by the State of the Union address. AssemblyAI's API detected these words and phrases as being important. Lastly, with entity detection you can identify entities that are spoken in your audio like organization names or person names. In Biden's speech, these were the entities that were detected. This is just a preview of the most popular features of AssemblyAI's API. If you want a full list of features, go check out our API documentation linked in the description below. And if you ever need some support, our team of developers is here to help. Everyday developers are using these features to build really exciting applications. From meeting summarizers to brand safety or contextual targeting platforms to full blown conversational intelligence tools. We can't wait to see what you build with AssemblyAI."
```
- Initialize text normalization. Set up the normalizer and create your WER calculation function to ensure consistent text formatting before comparison.
```python
# Initialize normalizers
normalizer = EnglishTextNormalizer()
# For Spanish and other languages
# normalizer = BasicTextNormalizer()

def calculate_wer(reference, hypothesis, language='en'):
    # Normalize both texts
    normalized_reference = normalizer(reference)
    print("Reference: " + reference)
    print("Normalized Reference: " + normalized_reference + "\n")

    normalized_hypothesis = normalizer(hypothesis)
    print("Hypothesis: " + hypothesis)
    print("Normalized Hypothesis: " + normalized_hypothesis + "\n")

    # Calculate WER
    wer = jiwer.wer(normalized_reference, normalized_hypothesis)

    return wer * 100  # Return as percentage
```
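A single WER number hides where the errors come from. As an optional diagnostic (not part of the original workflow, and assuming jiwer 3.x, where `process_words` and `visualize_alignment` are available), you can break the score into substitutions, deletions, and insertions and print the word-level alignment:

```python
# Optional diagnostic: per-category error counts plus a word-level alignment.
def wer_breakdown(reference, hypothesis):
    output = jiwer.process_words(normalizer(reference), normalizer(hypothesis))

    print(f"Substitutions: {output.substitutions}")
    print(f"Deletions:     {output.deletions}")
    print(f"Insertions:    {output.insertions}")
    print(f"Hits:          {output.hits}")
    print(jiwer.visualize_alignment(output))

    return output.wer * 100  # same percentage scale as calculate_wer
```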
- Calculate your WER score. Run the final calculation to measure transcription accuracy.
```python
wer_score = calculate_wer(reference_transcript, assembly_streaming_transcript.strip())
print(f"Final WER: {wer_score:.2f}%")
```
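For extra context around the WER figure, jiwer also provides a character error rate. The snippet below (an optional extra, not part of the original workflow) reports CER plus the unnormalized WER, which shows how much the normalization step affects the score:

```python
# Optional extras: character error rate and the raw (unnormalized) WER.
hypothesis = assembly_streaming_transcript.strip()

cer_score = jiwer.cer(normalizer(reference_transcript), normalizer(hypothesis)) * 100
raw_wer_score = jiwer.wer(reference_transcript, hypothesis) * 100

print(f"CER: {cer_score:.2f}%")
print(f"WER without normalization: {raw_wer_score:.2f}%")
```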