Releasing our v8 Transcription Model

Today, we’re happy to announce the release of our most accurate Speech Recognition model to date: version 8 (v8). This updated model delivers significant accuracy improvements across many types of audio and video data.

In the table below, we illustrate the accuracy improvements our v8 model provides across a few of our internal benchmark datasets.

We’ve heard from our customers how important proper noun recognition is, so we’re also excited to report that our v8 model introduces a major improvement in proper noun accuracy, as shown below.

To see an example of the v8 model in action, take a look at the video below. The captions in this video were generated automatically using the AssemblyAI API, with zero human editing.

How We Got Here

At our core, AssemblyAI is a deep learning company. Over 75% of our company’s employees are talented AI researchers and engineers who have worked for Apple, BMW, Cisco, and other leading technology companies.

We’re constantly researching and improving the models that power both our Speech-to-Text API and other features, like our Topic Detection feature, which can accurately detect the topics being discussed in an audio/video file.

Many of our customers choose to integrate with our API for that very reason: our speed and rate of improvement. It’s not unusual for our team to push model updates on a weekly basis.

Year over year, we’ve made huge strides in our API’s transcription accuracy. By the end of 2022, we expect to have developed speech recognition models that come very close to human-level accuracy, even on challenging audio and video files with heavy accents and background noise, firmly establishing AssemblyAI as the leader in this space.

The Research Behind v8

Improving our use of Transformers

Transformer-based neural networks are a popular type of neural network architecture for sequential data. Nowadays, they are used in many modern NLP, Computer Vision, and Speech Recognition systems, like the ones we build at AssemblyAI.

OpenAI’s popular GPT-3 model, for example, is a giant Transformer-based neural network.

While Transformers are very powerful at capturing global features of human speech, we’ve improved our models by interleaving Convolutional Neural Network (CNN) layers between Transformer layers. CNN layers are very good at modeling local features, so this interleaving approach forces our v8 model to pay attention to both local and global features of speech. As a result, v8 can better model patterns in human speech and make more accurate predictions.
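To make the local versus global distinction concrete, here is a minimal toy sketch in plain Python. It is not our production architecture: the function names, the scalar "frames", the smoothing kernel, and the block ordering are all assumptions chosen for illustration. A convolution mixes only neighboring frames, while self-attention lets every frame draw on every other frame.

```python
import math

def local_conv(frames, kernel):
    """1-D convolution over time: each output mixes only nearby frames.

    Implemented as cross-correlation with zero padding at the edges.
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, w in enumerate(kernel):
            i = t + k - half
            if 0 <= i < len(frames):
                acc += w * frames[i]
        out.append(acc)
    return out

def global_attention(frames):
    """Toy scalar self-attention: each output is a softmax-weighted
    average of ALL frames, so distant context can influence it."""
    out = []
    for q in frames:
        scores = [q * k for k in frames]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, frames)) / total)
    return out

def interleaved_stack(frames, num_blocks=2, kernel=(0.25, 0.5, 0.25)):
    """Alternate a global (attention) layer with a local (conv) layer."""
    for _ in range(num_blocks):
        frames = global_attention(frames)
        frames = local_conv(frames, kernel)
    return frames

# A single spike only spreads to its immediate neighbors under convolution.
spike = [0, 0, 1, 0, 0]
print(local_conv(spike, (0.25, 0.5, 0.25)))
print(interleaved_stack([0.1, 0.5, -0.3, 0.9]))
```

A real model would operate on learned, multi-dimensional features with residual connections and feed-forward layers; the sketch only shows how the two layer types see different amounts of context.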

In addition, by improving the regularization of our v8 model via techniques like Layer Norm, we were able to train a model that is robust to many different types of audio data (regardless of accent, medium, etc.). This is why we see such big accuracy gains across many different types of benchmark datasets (phone calls, Zoom meetings, broadcast TV, etc.).
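For readers unfamiliar with Layer Norm, the sketch below shows the standard computation on a single feature vector: subtract the mean, divide by the standard deviation, then apply a gain and bias. The function name, the scalar gain and bias, and the example values are assumptions for illustration; in a trained network the gain and bias are learned per feature.

```python
import math

def layer_norm(features, gain=1.0, bias=0.0, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(features) / len(features)
    var = sum((x - mean) ** 2 for x in features) / len(features)
    return [gain * (x - mean) / math.sqrt(var + eps) + bias for x in features]

quiet = layer_norm([1.0, 2.0, 3.0])
loud = layer_norm([10.0, 20.0, 30.0])  # same pattern, 10x the scale
print(quiet)
print(loud)
```

The two outputs are nearly identical even though the inputs differ in scale by a factor of ten. That scale invariance is one intuition for why normalization can help a model cope with varied inputs, though it is only part of the picture.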

Jointly trained Language Model

In the field of Speech Recognition, an external Language Model is commonly used to guide a Speech Recognition model into making more accurate predictions. For example, if a Speech Recognition model predicts both “two” and “too” as candidates, a Language Model will help prioritize the appropriate spelling based on the context of the word: “me too” (preferred by the Language Model) versus “me two”.
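The external Language Model approach described above can be sketched as a simple rescoring step, in which each candidate transcript's acoustic score is combined with a weighted Language Model score. All probabilities, the `lm_weight` value, and the `rescore` helper are made up for illustration and do not come from any real model.

```python
import math

# Hypothetical scores for two homophone candidates (illustration only).
acoustic_log_probs = {"me two": math.log(0.52), "me too": math.log(0.48)}
lm_log_probs = {"me two": math.log(0.01), "me too": math.log(0.20)}

def rescore(acoustic, lm, lm_weight=0.5):
    """Pick the hypothesis with the best combined score:
    acoustic log-prob + lm_weight * language model log-prob."""
    return max(acoustic, key=lambda hyp: acoustic[hyp] + lm_weight * lm[hyp])

best_acoustic_only = max(acoustic_log_probs, key=acoustic_log_probs.get)
best_with_lm = rescore(acoustic_log_probs, lm_log_probs)

print(best_acoustic_only)  # me two
print(best_with_lm)        # me too
```

On sound alone the two candidates are nearly tied, and the acoustic model slightly prefers the wrong spelling. Adding the Language Model score tips the decision toward “me too”.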

With our v8 model, we jointly train a Language Model with the Speech Recognition model. Rather than treating them as two separate models, jointly training them helps the system work synergistically and more accurately as a whole.

Word-Piece vs Characters

Our v8 model uses something called “word pieces”, as opposed to individual characters, when making predictions. This results in higher accuracy, especially for rare words such as proper nouns, and is one of the main drivers behind the 24.47% improvement in proper noun accuracy that v8 introduces.

This is because when a rare word is broken down into fewer but longer segments, there is less room for the model to make a mistake: it needs to make fewer correct predictions.

For example, assume the word “valedictorian” can be broken into “vale”, “dict”, and “orian”. Then, the model only needs to make 3 correct predictions (“vale”, “dict”, and “orian”), whereas a character-based model needs to make 13 correct predictions (“v”, “a”, “l”, etc.), leaving more room for error.
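The arithmetic behind this example can be sketched as follows. The three-entry vocabulary, the greedy longest-match splitter, and the 95% per-prediction accuracy are all assumptions for illustration; a real word-piece vocabulary is learned from data, and real predictions are neither independent nor equally accurate.

```python
def split_into_pieces(word, vocab):
    """Greedy longest-match split; falls back to single characters."""
    pieces = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches here, so emit one character.
            pieces.append(word[start])
            start += 1
    return pieces

def prob_all_correct(num_predictions, per_step_accuracy):
    """Chance that every prediction is right, assuming independence."""
    return per_step_accuracy ** num_predictions

vocab = {"vale", "dict", "orian"}  # hypothetical word-piece vocabulary
pieces = split_into_pieces("valedictorian", vocab)
chars = list("valedictorian")

p = 0.95  # assumed accuracy of each individual prediction
print(pieces, prob_all_correct(len(pieces), p))
print(len(chars), prob_all_correct(len(chars), p))
```

Under these simplified assumptions, the word-piece model gets the whole word right roughly 86% of the time (0.95 to the power of 3), versus roughly 51% for the character-based model (0.95 to the power of 13).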