Product managers and developers at telephony companies increasingly leverage automatic speech recognition (ASR) to power core features in their products and platforms, including:
- Interactive Voice Response (IVR)
- Virtual Voicemail
- Call Transcription
- Call Tracking
- Coaching Enablement
- Conversational Intelligence
Speech-to-Text accuracy is critical for telephony platforms to build the high-quality features that users and customers love.
In this report, we look at 5 earnings calls from various companies (shown in more detail below) and review how accurately AssemblyAI, AWS Transcribe, and Google Speech-to-Text are able to automatically transcribe these recordings.
In addition to reviewing Speech Recognition accuracy, our research team reviewed the results of AssemblyAI's unique Personally Identifiable Information (PII) Redaction, Topic Detection, Keyword Detection, and Content Safety features on these call recordings. This report is meant to serve as a point of reference for comparing the best Automatic Speech Recognition and Conversational Intelligence solutions on the market for telephony platforms.
Speech Recognition Accuracy
We included earnings call recordings from 5 major companies: Twilio, Facebook, Apple, Microsoft, and MongoDB. The recordings were chosen at random, with the intention of providing a healthy sample of audio transcription performance.
Here is more detail about our dataset:
How we calculate accuracy
- First, we transcribe the files in our dataset automatically through APIs (AssemblyAI, Google, and AWS).
- Second, we have human transcriptionists transcribe the files in our dataset—to approximately 100% accuracy.
- Finally, we compare the API's transcription with our human transcription to calculate Word Error Rate (WER)—more below.
Below, we outline the accuracy score that each transcription API achieved on each audio file. Each result is hyperlinked to a diff of the human transcript versus each API's automatically generated transcript. This helps to highlight the key differences between the human transcripts and the automated transcripts.
Word Error Rate (WER) is the industry standard for measuring the accuracy of an Automatic Speech Recognition system. For each file in our dataset, WER compares the automatically generated transcript to the human transcript, counting the number of insertions, deletions, and substitutions made by the automatic system (Google, AWS, etc.).
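As a concrete sketch, WER can be computed with a word-level edit distance. The example sentences below are illustrative only; production benchmarks typically use a tested library such as jiwer.

```python
# A minimal WER sketch using a word-level edit distance (dynamic
# programming). Illustrative only.

def word_error_rate(truth: str, prediction: str) -> float:
    """WER = (substitutions + insertions + deletions) / words in truth."""
    ref = truth.split()
    hyp = prediction.split()
    # dp[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                   dp[i - 1][j],      # deletion
                                   dp[i][j - 1])      # insertion
    return dp[len(ref)][len(hyp)] / len(ref)

# One deletion ("the") and one insertion ("to") against a 5-word truth:
print(word_error_rate("the revenue grew ten percent",
                      "revenue grew to ten percent"))  # 0.4
```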
Before calculating the WER for a particular file, both the truth (human transcriptions) and the automated transcriptions (predictions) must be normalized into the same format. To perform the most accurate comparison, all punctuation and casing is removed, and numbers are converted to the same format.
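The normalization step above can be sketched as follows; this is a hedged illustration, and the exact rules used in the benchmark may differ.

```python
import re

# Hedged sketch of a normalization pass: lowercase everything, replace
# punctuation with spaces, and map a handful of number words to digits
# so "Ten" and "10" compare as equal. The benchmark's actual rules may
# be more extensive (e.g., full number-to-digit conversion).
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3",
                "four": "4", "five": "5", "six": "6", "seven": "7",
                "eight": "8", "nine": "9", "ten": "10"}

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation -> spaces
    words = [NUMBER_WORDS.get(word, word) for word in text.split()]
    return " ".join(words)

print(normalize("Revenue grew Ten percent, year-over-year."))
# -> revenue grew 10 percent year over year
```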
Personally Identifiable Information (PII) Redaction
Phone call recordings and transcripts often contain sensitive customer information like credit card numbers, addresses, and phone numbers. Other sensitive information, like birth dates and medical details, can also end up in recordings and transcripts. AssemblyAI offers PII Detection and Redaction for both transcripts and audio files run through our API. You can learn more about how this works in the AssemblyAI Documentation.
The full list of information our PII Detection and Redaction feature can detect can be found below:
With PII Detection and Redaction enabled, we ran the above earnings calls through our API to generate a redacted transcript for each recording. Below is an excerpt from one of the recordings, followed by links to the full transcriptions.
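A request with redaction enabled might look like the sketch below. The parameter names follow AssemblyAI's public documentation at the time of writing, and the policy list shown is illustrative rather than exhaustive; check the current API reference before relying on it.

```python
import json

# Hedged sketch of a transcription request body with PII Redaction
# enabled. The audio URL is a placeholder.
def build_redaction_request(audio_url: str) -> dict:
    return {
        "audio_url": audio_url,
        "redact_pii": True,
        "redact_pii_policies": [
            "credit_card_number",
            "phone_number",
            "date_of_birth",
        ],
    }

payload = build_redaction_request("https://example.com/earnings-call.mp3")
print(json.dumps(payload, indent=2))
```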
Topic and Keyword Recognition
Our Topic Detection feature uses the IAB Taxonomy, and can classify transcription texts with up to 698 possible topics. For example, a Tesla earnings call would be classified with the "Automobiles>SelfDrivingCars" topic, among others.
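To illustrate how topic results can be consumed, the sketch below ranks detected labels from a hypothetical response summary. The labels and scores are invented for illustration; real responses map each detected IAB label to a relevance score between 0 and 1.

```python
# Hedged sketch: ranking topics from a hypothetical response summary.
# The summary dict below is invented for illustration only.
summary = {
    "Automobiles>SelfDrivingCars": 0.92,
    "Business>CompanyEarnings": 0.88,
    "Technology&Computing": 0.41,
}

def top_topics(topic_summary: dict, threshold: float = 0.5) -> list:
    """Return topic labels above a relevance threshold, highest first."""
    return sorted(
        (label for label, score in topic_summary.items() if score >= threshold),
        key=lambda label: -topic_summary[label],
    )

print(top_topics(summary))
# -> ['Automobiles>SelfDrivingCars', 'Business>CompanyEarnings']
```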
You can find a list of all the topic categories that can be detected in the AssemblyAI Documentation:
Keyword and Phrase Recognition
AssemblyAI can also automatically extract keywords and phrases based on the transcription text. Below is an example of how the model works on a small sample of transcription text.
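To build intuition for what keyword extraction does, here is a toy frequency-based sketch. This is not AssemblyAI's actual model, which is far more sophisticated; it simply ranks the most frequent non-trivial words in a snippet of transcript text.

```python
import re
from collections import Counter

# Naive illustration only: rank the most frequent words in a snippet,
# ignoring a small stopword list. AssemblyAI's feature uses a trained
# model, not simple counting.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in",
             "we", "our", "is", "was", "for"}

def top_keywords(text: str, n: int = 3) -> list:
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

sample = ("Revenue grew this quarter, and revenue guidance for next "
          "quarter reflects continued subscription growth.")
print(top_keywords(sample))  # 'revenue' and 'quarter' rank first
```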
Using the same dataset, we included the topics and keywords automatically detected in each file below:
Content Safety Detection
Telephony companies are now leveraging machine learning to flag inappropriate content on phone calls. With AssemblyAI's Content Safety Detection feature, we can flag portions of your transcription that contain sensitive content such as hate speech, profanity, or violence. Telephony platforms are using this feature to, for example, detect hate speech and profanity in call centers and within voicemails.
AssemblyAI's Content Safety Detection feature is built using state-of-the-art Deep Learning models. Our models look at the entire context of a word or sentence when deciding whether to flag a piece of content, rather than relying on error-prone blacklist approaches.
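Consuming the feature's output might look like the sketch below, which filters a response for confidently flagged labels. The response shape and the "financials" label are modeled on AssemblyAI's documentation at the time of writing, but verify against the current API reference.

```python
# Hedged sketch: filtering a Content Safety response for confident
# flags. The sample response below is invented for illustration.
sample_response = {
    "results": [
        {
            "text": "Turning to our financial results for the quarter...",
            "labels": [{"label": "financials", "confidence": 0.98}],
        }
    ]
}

def flagged_labels(response: dict, threshold: float = 0.5) -> set:
    """Collect labels whose confidence meets the threshold."""
    labels = set()
    for result in response["results"]:
        for entry in result["labels"]:
            if entry["confidence"] >= threshold:
                labels.add(entry["label"])
    return labels

print(flagged_labels(sample_response))  # {'financials'}
```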
Below we review the results of AssemblyAI's Content Safety Detection feature on the above dataset:
As you can see, there was fortunately no profanity or other sensitive content discussed on these earnings calls. However, the Content Safety Detection feature did correctly flag that "Company Financials" were discussed within these transcriptions, which many telephony platforms need to detect for compliance reasons.
Benchmark Your Data
Benchmarking accuracy across providers on your own content takes both time and money. We offer complimentary benchmark reports for any team seeking a transcription solution. To get started, you can go here.