Building with Automatic Speech Recognition (ASR) models: Why accuracy matters

The speech and voice recognition market is expected to grow to nearly $60 billion by 2030, thanks to recent advances in AI research that have made speech recognition models more accurate, accessible, and affordable than ever before.

In conjunction with this, new Generative AI models like DALL-E 2, Stable Diffusion, and ChatGPT are driving a surge in demand for enterprise-facing Generative AI tools.

Companies that process large amounts of customer data—like sales intelligence platforms and contact centers—are exploring how to use this new research to build:

Useful speech transcription tools for their customers.
Generative AI tools and features on top of this transcribed audio and video data to offer intelligent, high-ROI insights for their customer base.

As product teams explore how to best integrate speech recognition models and Generative AI into their platforms, they may be wondering: how much does speech recognition accuracy matter?

In this article, we explore the answer to that question, as well as background information on what Automatic Speech Recognition (ASR) is, how ASR accuracy is measured, and some of the top industry use cases for building with ASR.

What are Automatic Speech Recognition (ASR) models?

Automatic Speech Recognition (ASR) models use Artificial Intelligence (AI) to process human speech into readable text. ASR models can transcribe audio and video files asynchronously, meaning after the file has been recorded, or synchronously with the aid of real-time transcription models.

While ASR has been around for decades, its full utility is finally being realized thanks to significant increases in model accuracy. In fact, ASR models today, such as AssemblyAI’s Conformer-2 model, are trained on enormous amounts of data to achieve near-human-level transcription accuracy.

How is ASR accuracy measured?

The accuracy of Automatic Speech Recognition models is typically measured by Word Error Rate, or WER. WER counts the “errors” in a model’s transcript relative to a human (reference) transcript: it adds the number of substitutions, deletions, and insertions and divides by the total number of words in the reference transcript.
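
As a rough illustration, the calculation described above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name word_error_rate is hypothetical, words are split on whitespace only, and the error count is taken as the word-level edit distance between the two transcripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / words in reference.

    Assumption: whitespace tokenization; no normalization is applied here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative transcripts (made up for this example):
# one substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.2%}")  # WER: 33.33%
```

Because the divisor is the length of the reference transcript, a hypothesis with many inserted words can produce a WER above 100%.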

However, small variations in transcription texts such as capitalization, punctuation, and spelling can dramatically alter the WER outcome. That’s why it’s recommended to “normalize” both transcriptions before comparing them, by doing the following:

Lowercase all text
Remove all punctuation
Change all numbers to their written form
Etc.
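
The first three steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the helper name normalize is hypothetical, only ASCII punctuation is removed, and numbers are spelled out one digit at a time (so "42" becomes "four two"), which is a deliberately simplistic stand-in for full number-to-words conversion.

```python
import re
import string

# Assumption: digit-by-digit spelling is enough for illustration purposes.
_DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, spell out digits, remove punctuation, collapse whitespace."""
    text = text.lower()
    # Replace each digit with its word, padded so it stays a separate token.
    text = re.sub(r"\d", lambda m: f" {_DIGIT_WORDS[m.group()]} ", text)
    text = text.translate(_STRIP_PUNCTUATION)
    return " ".join(text.split())


print(normalize("Hello, World! I have 2 cats."))  # hello world i have two cats
```

Applying the same normalization to both the human transcript and the model transcript before computing WER keeps formatting differences from being counted as errors.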

It’s also important to choose the right dataset for comparison by ensuring that the chosen dataset is (a) relevant to the intended use case and (b) representative of real-world (non-academic) audio that contains noise, volume differences, speaker accents, etc.

Combined, these nuances will help ensure that a more accurate, useful WER measurement is calculated.

Why does ASR accuracy matter?

The accuracy of an ASR model on real-world data is especially important, for two reasons.

First, if product teams are looking to incorporate video subtitles or meeting transcription texts into a platform, they need to ensure the output will accurately reflect the conversation that took place and be valuable to their users.

However, the audio and/or video processed by enterprise platforms can be of varying quality. Think of a Zoom meeting that takes place in a busy coffee shop, or one that includes speakers with a variety of accents, for example.

It is important that ASR models are robust to low-quality data and a wide variety of speakers and accents to ensure high accuracy in a production setting.

Second, if product teams are looking to build Generative AI tools and features on top of audio and video data, these tools can only ever be as accurate as the transcription data that is first fed into the model.

For example, a meeting platform might want to build a Generative AI feature that automatically summarizes each meeting and suggests the next steps for attendees to take once the meeting concludes. However, if a low-accuracy ASR model is used, the transcript will contain errors that can carry over into the list of next steps, making it of little use to the user.

Or a sales coaching platform might wish to create a Generative AI tool that helps coach sales representatives during a phone conversation with a customer. Again, the coaching suggestions here can only be as accurate, and useful, as the transcription that is fed into the model.

ASR accuracy matters. Always keep in mind that evaluating accuracy goes beyond a simple WER analysis: transcripts should be normalized and the evaluation dataset should reflect real-world conditions (see the “How is ASR accuracy measured?” section above).

Building with ASR

Today, product teams at top AI-first companies are starting to integrate ASR and Generative AI to offer an impressive portfolio of intelligent tools for their customers.

Let’s take a look at a few real-world use cases.

Use case 1: Reduce manual tasks

Screenloop, a hiring intelligence platform, added highly accurate ASR to build an interview transcription tool and additional AI-powered tools on top of this transcription data.

For its end users, the addition of these ASR and AI-powered tools results in 90% less time spent on manual tasks, fueling productivity and reducing time to hire.

Use case 2: Smarter and faster QA

As a Contact Center as a Service (CCaaS) provider, Aloware had gigabytes of unstructured data sitting untouched. Its product team wanted to help its customers transform this data into meaningful insights that would drive more intelligent business decisions.

By integrating highly accurate ASR, Aloware was able to build a Smart Transcription tool that dramatically expedited QA tasks and reduced the potential for human errors in the review process, while also setting the foundation for additional high-ROI tools built on top of transcription data.

Use case 3: Augment productivity

The product team at Grain, an AI-powered meeting recorder, wanted to integrate ASR to build tools and features that help its users better understand and advocate for their customers’ needs.

Grain’s team integrated an ASR model to transcribe all recorded conversations at near-human-level accuracy. On top of that transcription data, the team built Generative AI tools that can automatically identify, flag, highlight, clip, and summarize the customer meeting files processed through its platform, significantly augmenting user productivity.

Use case 4: Perform better customer research

As a qualitative data analysis platform, Marvin helps its users collect, organize, and analyze research data to help them build better products and services.

To augment this workflow, Marvin’s product team integrated ASR to offer audio and video transcription, as well as to build AI-powered tools on top of the transcription data. After this integration, Marvin’s users now spend 60% less time analyzing data while gaining more intelligent insights into customer behavior.

Shipping enterprise-ready ASR and AI tools

Automatic Speech Recognition has the power to augment productivity, reduce manual tasks, and serve as a foundational component that feeds AI-generated insights for companies across industries.

For those looking to quickly integrate and build with highly accurate ASR, consider working with an AI partner that offers a complete Speech AI system from ASR to Large Language Models—along with enterprise-grade data security and tailored customer support to help build accurate, high-performing AI tools and features faster.