Comparing Speech-to-Text APIs on Phone Call Transcription

Product managers and developers at telephony companies are consistently leveraging automatic speech recognition (ASR) to power core features in their products and platforms.

Telephony platforms like Convirza, CallRail, TalkRoute, and WhatConverts offer their customers industry-leading solutions built on Speech-to-Text. Below are a few examples:

The accuracy of a Speech-to-Text system is critical for these telephony platforms to build high-quality features that users and customers love.

In this report, we look at 5 different earnings calls from various companies (shown in more detail below), and review how accurately AssemblyAI, AWS Transcribe, and Google Speech-to-Text are able to automatically transcribe these recordings.

In addition to reviewing Speech Recognition accuracy, our research team reviewed the results of AssemblyAI's unique Personally Identifiable Information (PII) Redaction, Topic Recognition, Keyword Detection, and Content Safety features on these call recordings. This report is meant to serve as a point of reference for comparing the best Automatic Speech Recognition and Conversational Intelligence solutions on the market for Telephony Platforms.

Speech Recognition Accuracy

Our Dataset

We included earnings call recordings from 5 major companies: Twilio, Facebook, Apple, Microsoft, and MongoDB. We chose these recordings at random, with the intention of providing a representative sample of audio transcription performance.

More details about our dataset are below:

How We Calculate Accuracy

First, we transcribe the files in our dataset automatically through APIs (AssemblyAI, Google, and AWS).
Second, human transcriptionists transcribe the same files to approximately 100% accuracy.
Finally, we compare each API's transcription with the human transcription to calculate Word Error Rate (WER), which is described in more detail below.

Below, we outline the accuracy score that each transcription API achieved on each audio file. Each result is hyperlinked to a diff of the human transcript versus each API's automatically generated transcript. This helps to highlight the key differences between the human transcripts and the automated transcripts.

Accuracy Averages

WER Methodology

Word Error Rate (WER) is the industry standard for measuring the accuracy of an Automatic Speech Recognition system. For each file in our dataset, the WER compares the automatically generated transcription to the human transcription by counting the number of insertions, deletions, and substitutions made by the automatic system (Google, AWS, etc.) and dividing that count by the number of words in the human transcription.
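As a rough illustration of the calculation, the sketch below computes WER as (substitutions + deletions + insertions) divided by the number of words in the human transcript, using a standard word-level edit distance. The function name `word_error_rate` is ours for illustration and is not part of any provider's API. It assumes both inputs are already in the same format (lowercase, no punctuation).

```python
def word_error_rate(truth: str, prediction: str) -> float:
    """Return (substitutions + deletions + insertions) / words in truth."""
    ref = truth.split()
    hyp = prediction.split()
    if not ref:
        raise ValueError("truth transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j words of the prediction.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from prediction
                curr[j - 1] + 1,     # insertion: extra word in prediction
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "hi my name is bob i am seventy two years old"
    prediction = "hi my name is bob i am seventy years old"  # "two" was dropped
    print(round(word_error_rate(truth, prediction), 3))  # 0.091 (1 error / 11 words)
```

Because insertions count as errors, WER can exceed 1.0 when the prediction is much longer than the human transcript.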

Before calculating the WER for a particular file, both the truth (human transcriptions) and the automated transcriptions (predictions) must be normalized into the same format. To perform the most accurate comparison, all punctuation and casing are removed, and numbers are converted to the same format.

For example:

truth -> Hi my name is Bob I am 72 years old.

normalized truth -> hi my name is bob i am seventy two years old
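A minimal sketch of this kind of normalization is shown below. It is not the exact normalizer used for this report. The helper names (`number_to_words`, `normalize`) are hypothetical, and the sketch only spells out whole numbers below 100, leaving larger numbers as digits.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out 0-99 without hyphens, e.g. 72 -> 'seventy two'."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize(text: str) -> str:
    """Lowercase, spell out small numbers, and strip punctuation."""
    text = text.lower()
    # Spell out standalone one- or two-digit numbers (assumption: larger
    # numbers are out of scope for this sketch and stay as digits).
    text = re.sub(r"\b\d{1,2}\b", lambda m: number_to_words(int(m.group())), text)
    # Drop apostrophes so contractions stay one word ("i'm" -> "im").
    text = text.replace("'", "")
    # Replace all remaining punctuation with spaces.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize("Hi my name is Bob I am 72 years old."))
    # hi my name is bob i am seventy two years old
```

The same function would be applied to both the human transcript and each API's output, so that formatting differences are not counted as recognition errors.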

Personally Identifiable Information (PII) Redaction

Phone call recordings and transcripts often contain sensitive customer information like credit card numbers, addresses, and phone numbers. Other sensitive information like birth dates and medical info can also be stored in recordings and transcripts. AssemblyAI offers PII Detection and Redaction for both transcripts and audio files run through our API. You can learn more about how this works in the AssemblyAI Documentation.

The full list of information our PII Detection and Redaction feature can detect can be found below:

With PII Detection and Redaction enabled, we ran the above earnings calls through our API to generate a redacted transcript for each recording. Below is an excerpt from one of the redacted transcripts, followed by links to the full transcriptions.

Good afternoon. My name is [PERSON_NAME] and I will be your [OCCUPATION] 
[OCCUPATION] today. At this time, I would like to welcome everyone
to the [ORGANIZATION] [DATE] [DATE] and full Year [DATE] [DATE] earnings
conference call. All lines have been placed on mute to prevent any
background noise. After the speakers remarks, there will be a question
and answer session. If you would like to ask a question during
that time, please press star, then the number one on your telephone
keypad. This call will be recorded. Thank you very much, [PERSON_NAME]
[PERSON_NAME] [PERSON_NAME], [ORGANIZATION] [OCCUPATION] [OCCUPATION]
[OCCUPATION] [OCCUPATION] [OCCUPATION]. You may begin. Thank you. Good
afternoon and welcome to [ORGANIZATION] for quarter and full Year [DATE]
[DATE] earnings conference call. Joining me today to discuss our results
or [PERSON_NAME] [PERSON_NAME], [OCCUPATION] [PERSON_NAME] [PERSON_NAME]
[OCCUPATION] and [PERSON_NAME] [PERSON_NAME], [OCCUPATION]. Before we
get started, I would like to take this opportunity to remind you that
our remarks today will include forward looking statements. Actual
results may differ materially from those contemplated by these forward
looking statements. Factors that could cause these results to differ
materially are set forth in today's press release and in our quarterly
report on Form 10 [DATE] filed with the [ORGANIZATION].
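To get a quick sense of what was redacted in a transcript like the excerpt above, the bracketed placeholders can be tallied by label. This is only a sketch for post-processing the redacted text. The function name `count_redactions` is hypothetical, and it assumes placeholders always appear as uppercase labels in square brackets, as they do in the excerpt.

```python
import re
from collections import Counter


def count_redactions(redacted_text: str) -> Counter:
    """Count each [LABEL] placeholder found in a redacted transcript."""
    # Assumption: labels consist only of uppercase letters and underscores.
    return Counter(re.findall(r"\[([A-Z_]+)\]", redacted_text))


if __name__ == "__main__":
    sample = (
        "Good afternoon. My name is [PERSON_NAME] and I will be your "
        "[OCCUPATION] [OCCUPATION] today."
    )
    for label, count in count_redactions(sample).most_common():
        print(label, count)
    # OCCUPATION 2
    # PERSON_NAME 1
```

In the excerpt, multi-word entities appear as repeated placeholders (for example, `[DATE] [DATE]`), so these counts reflect redacted words and not distinct entities.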

Topic Detection

Our Topic Detection feature uses the IAB Taxonomy and can classify transcription text into up to 698 possible topics. For example, a Tesla earnings call would be classified with the "Automobiles>SelfDrivingCars" topic, among others.

You can find a list of all the topic categories that can be detected in the AssemblyAI Documentation.

Keyword and Phrase Recognition

AssemblyAI can also automatically extract keywords and phrases based on the transcription text. Below is an example of how the model works on a small sample of transcription text.

Original Transcription:

Hi I'm joy. Hi I'm Sharon. Do you have kids in school? I have 
grandchildren in school. Okay, well, my kids are in middle school in high
school. Do you think there is anything wrong with the school system?

Detected Keywords:

"high school"
"middle school"
"kids"

Below are the topics and keywords automatically detected in each file of the same dataset:

Content Safety Detection

Telephony companies are now leveraging machine learning to flag inappropriate content on phone calls. With AssemblyAI's Content Safety Detection feature, we can flag portions of your transcription that contain sensitive content such as hate speech, profanity, or violence. Telephony platforms are using this feature to, for example, detect hate speech and profanity in call centers and within voicemails.

AssemblyAI's Content Safety Detection feature is built using state-of-the-art deep learning models. Our models look at the entire context of a word or sentence when deciding whether to flag a piece of content. We don't rely on error-prone blacklist approaches.

Below we review the results of AssemblyAI's Content Safety Detection feature on the above dataset:

As you can see, there was fortunately no profanity or sensitive content discussed on these earnings calls. However, the Content Safety Detection feature did correctly flag that "Company Financials" were discussed within these transcriptions, which many telephony platforms need to be able to detect for compliance reasons.

Benchmark Your Data

Benchmarking accuracy across providers on your own content takes both time and money. We offer complimentary benchmark reports for any team seeking a transcription solution. To get started with a complimentary benchmark report, you can go here.