Do I Need A Custom Speech Recognition Model?

Custom means better, right? Often. But with speech recognition models, this assumption doesn’t hold up. In fact, it’s hardly ever the case.

In the field of Automatic Speech Recognition, or ASR, custom models are actually considered an “old school” approach. In the past, the accuracy of classical ASR models had plateaued, so customizing them was the only option left for increasing accuracy.

Today’s deep learning models, like the one that powers our Speech-to-Text API, are nowhere near plateauing, however. Their accuracy keeps improving.

In this article, we’ll explain why this is, and address other misconceptions many people have about custom models.

Are Custom Models More Accurate than General Models?

In the field of ASR, custom models are rarely more accurate than the best general models (learn more about one measure of accuracy, Word Error Rate or WER, here). This is because general models are trained on huge datasets, and are constantly maintained and updated using the latest deep learning research.
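To make WER concrete, here is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the model’s output, divided by the number of words in the reference. The function name and the simple lowercase-and-split normalization are our own assumptions for illustration; production evaluations typically normalize punctuation and numbers more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "i would like a budweiser please"
    hypothesis = "i would like a bud wiser please"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

In this example, a single misrecognized brand name counts as two errors (one substitution plus one insertion) against a six-word reference, which is why proper nouns can have an outsized effect on WER for short utterances.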

For example, at AssemblyAI, we train large deep neural networks on almost 100,000 hours of speech data. This training data is a mix of many different types of audio (broadcast TV recordings, phone calls, Zoom meetings, videos, etc.), accents, and speakers. This massive amount of diverse training data helps our ASR models generalize extremely well across all types of audio, speakers, recording quality, and accents when converting Speech-to-Text in the real world.

Custom models usually come into the mix when dealing with audio data that have unique characteristics unseen by a general model. However, because large, accurate general models see most types of audio data during training, there are not many “unique characteristics” that would trip up a general model - or that a custom model would even be able to learn.

For an example of where a custom model would make sense, let’s look at children’s speech. The actual audio signal of children’s speech is highly distinctive, and it is often left out of the typical training datasets of general models due to privacy concerns. For this reason, custom models for children’s speech often work better than general models, since children’s speech contains “unique characteristics” previously unseen by a general model.

Does Custom Vocabulary Require a Custom Model?

Not necessarily. Most of the time, what you really need is better proper noun recognition for the custom vocabulary unique to your use case or application.

Many modern Speech-to-Text APIs allow you to dynamically add custom vocabulary to a general model when transcribing audio files. For example, at AssemblyAI, we support custom vocabulary with a feature called Word Boost. With our Word Boost feature, you can add words specific to your use case (industry terms, technical terms, names, etc.) when transcribing audio/video files in order to increase accuracy for those custom vocabulary terms.

Moreover, with AssemblyAI’s Word Boost feature, you can:

  1. Specify key words and/or phrases to boost (e.g., “The IQEZ iPhone App”).
  2. Control the weight of the boost (“low,” “default,” or “high”).

Let’s see how this might work with a brand name like “Budweiser”.

Let’s say our model predicts “Budweiser” as “Bud wiser”. With Word Boost, you can add “Budweiser” to the vocabulary list so that the model will accurately predict “Budweiser” every time.
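As a rough sketch, a request like this might be assembled as follows. The endpoint URL, the `word_boost` and `boost_param` field names, and the header names are assumptions based on AssemblyAI’s public v2 API at the time of writing, so check the current API documentation before relying on them. The helper functions and the example audio URL are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint; verify against the current API documentation.
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"

# The three boost weights described above.
ALLOWED_WEIGHTS = {"low", "default", "high"}


def build_boost_payload(audio_url, words, weight="default"):
    """Build the JSON body for a transcription request with boosted vocabulary."""
    if weight not in ALLOWED_WEIGHTS:
        raise ValueError(f"weight must be one of {sorted(ALLOWED_WEIGHTS)}")
    # Drop empty entries and stray whitespace from the vocabulary list.
    cleaned = [word.strip() for word in words if word.strip()]
    if not cleaned:
        raise ValueError("provide at least one word or phrase to boost")
    return {
        "audio_url": audio_url,
        "word_boost": cleaned,   # assumed field name
        "boost_param": weight,   # assumed field name
    }


def submit_transcript(payload, api_key):
    """POST the payload and return the parsed JSON response (needs a real key)."""
    request = urllib.request.Request(
        TRANSCRIPT_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    payload = build_boost_payload(
        "https://example.com/beer-ad.mp3",  # hypothetical audio file
        ["Budweiser", "The IQEZ iPhone App"],
        weight="high",
    )
    print(json.dumps(payload, indent=2))
```

Keeping payload construction separate from the network call makes it easy to validate the vocabulary list and boost weight before any audio is submitted.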

Are Custom Models Easier to Maintain?

In reality, custom models are much more expensive and time-consuming to both train and maintain. Here’s why:

First off, training a custom model requires sourcing enough data for training. This is a time-consuming and potentially expensive task that will quickly become an ongoing investment for any company considering a custom model. Given that custom ASR models rarely yield accuracy improvements over modern general models, collecting and labeling data for a custom model has little ROI.

All models, custom included, must also be continually maintained. Your model must stay up-to-date with new terms (e.g., “Covid”) and the latest research in deep learning methods (which is moving at a dizzying pace). Unfortunately, most of the time a custom model gets “frozen in time” because it isn’t continually updated with the latest research and neural network architectures. And if not continuously updated, any ASR model, custom or general, will quickly become obsolete. This is why at AssemblyAI, we push model updates almost weekly, as we are constantly exploring new deep learning techniques to improve the accuracy of our ASR models.

For example, our Core Transcription Model was recently updated to version 8, with 18.72% better accuracy across multiple audio and video types, as well as 24.47% better proper noun recognition.

Should You Invest In a Custom Model?

If you’re still considering a custom model, ask yourself these four questions:

  1. Is there something truly unique (e.g., children’s speech) about your data?
  2. Can you easily obtain large amounts of training data?
  3. Can you afford large amounts of training data?
  4. Do you have the ability, or budget, to continually update your model with the latest deep learning research?

If you answered yes to any of these questions, a custom model could be an option for you. But if you answered no, or are unsure, then you would probably be better suited to a modern, general ASR model built on the latest deep learning research.

If you’d like to chat more about custom models, AssemblyAI’s core transcription model, or anything else, please feel free to contact us here!