
New Punctuation and Casing model released 🎉



We’ve been hard at work at AssemblyAI improving our speech-to-text features. This week, we're updating our punctuation and casing features with a brand new model.

A model in AI or machine learning is a mathematical algorithm that replicates a decision process to enable automation and understanding. 

At AssemblyAI, our Speech-to-Text models output transcripts in their raw form, for example: “my name is bob and i am seventy eight years old”. This raw text is then sent through a second model, called a Punctuation Restoration model, which adds punctuation and casing to that raw text.

Our Punctuation Restoration model is a multi-class classifier under the hood. For each word, it predicts a class that says what to do with that word: do nothing (i.e. leave the word as-is), change its casing, add punctuation after it, or both. For example, the model might predict that we should uppercase a word and add a comma to the end of it. Or, the model might predict that the word should stay lowercase, but a period should be added to the end of it.

After getting the predictions for each word, we end up with a well-punctuated transcription with proper casing!
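To make the per-word idea concrete, here is a minimal Python sketch of how predicted classes could be turned back into text. The label scheme (`WordLabel`, with its casing and punctuation values) and the helper names are hypothetical and chosen for illustration only; the post does not describe the actual class set the model uses, and the labels below are written by hand rather than predicted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordLabel:
    """Hypothetical per-word class: a casing action plus trailing punctuation."""
    casing: str = "keep"    # assumed values: "keep", "capitalize", "upper"
    punctuation: str = ""   # assumed values: "", ",", ".", "?"


def apply_label(word: str, label: WordLabel) -> str:
    """Apply one predicted label to one raw (lowercase) word."""
    if label.casing == "capitalize":
        word = word[:1].upper() + word[1:]
    elif label.casing == "upper":
        word = word.upper()
    return word + label.punctuation


def restore(words: list[str], labels: list[WordLabel]) -> str:
    """Rebuild the transcript from raw words and one label per word."""
    if len(words) != len(labels):
        raise ValueError("expected exactly one label per word")
    return " ".join(apply_label(w, l) for w, l in zip(words, labels))


raw = "my name is bob and i am seventy eight years old"

# Hand-written labels standing in for the classifier's predictions.
labels = [
    WordLabel("capitalize"),     # my  -> My
    WordLabel(),                 # name
    WordLabel(),                 # is
    WordLabel("capitalize"),     # bob -> Bob
    WordLabel(),                 # and
    WordLabel("upper"),          # i   -> I
    WordLabel(),                 # am
    WordLabel(),                 # seventy
    WordLabel(),                 # eight
    WordLabel(),                 # years
    WordLabel(punctuation="."),  # old -> old.
]

print(restore(raw.split(), labels))
# My name is Bob and I am seventy eight years old.
```

Because each word gets exactly one class, restoring the text is a single pass over the words. The hard part is the classifier that produces the labels, not the reconstruction step.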

Our current best model uses a transformer-based architecture, which produces better results than RNN-based models. This model was trained on over 1 billion tokens, and yields an accuracy of over 92% for punctuation and casing restoration!

For more information on the above transformer-based model architectures, check out this whitepaper.

What does that look like?

Take the following text example. This text block lacks any punctuation or casing.

Raw sample:

hi how are you good how are you i'm good i am just enjoying the weather oh nice where do you live i live in sf i work for a company called assemblyai oh nice is your whole team in sf with you no actually we are a remote team so we have people everywhere in the us and in europe as well wow that's cool i used to work at aws and we were also a remote team nice what did you have for lunch today i had a hot dog was it good yes

Once this is passed through the new model, we get the following output:

Hi. How are you? Good. How are you? I'm good. I am just enjoying the weather. Oh, nice. Where do you live? I live in SF. I work for a company called AssemblyAI. Oh, nice. Is your whole team in SF with you? No, actually, we are a remote team. So we have people everywhere in the us and in Europe as well. Wow, that's cool. I used to work at AWS, and we were also a remote team. Nice. What did you have for lunch today? I had a hot dog. Was it good? Yes.

As you can see, the model correctly interprets the different cadences within the speech, adding correct punctuation and casing. It will even correctly case rare words or business names like AssemblyAI and AWS.

Industry-specific language

Even with very industry-specific language, the model performs exceptionally well, as seen in the example below.

Raw sample:

all our statements are made as of today february 24 2021 based on information currently available to us except as required by law we assume no obligation to update any such statements during this call we will discuss non gaap financial measures you can find a reconciliation of these non gaap financial measures to gaap financial measures in our cfo commentary which is posted on our website during this call we may make forward looking statements based on current expectations

Once punctuation and casing are applied:

All our statements are made as of today February 24, 2021 based on information currently available to us except as required by law. We assume no obligation to update any such statements during this call. We will discuss non GAAP financial measures. You can find a reconciliation of these non GAAP financial measures to GAAP financial measures in our CFO commentary, which is posted on our website. During this call, we may make forward looking statements based on current expectations.

Punctuation and casing are applied by default to all API requests, so you don’t even need to do anything to utilize this fantastic new model!
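As a rough illustration of "you don't need to do anything", the sketch below builds a transcription request that carries only the audio URL and no punctuation-related options. The endpoint URL, the `audio_url` field, the `authorization` header, and the placeholder values are assumptions not stated in this post, so check them against the current API documentation before use. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Assumed endpoint; verify against the API documentation.
API_URL = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(audio_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a transcription request with default settings."""
    # No punctuation or casing options are included: per the post,
    # both are applied by default.
    payload = {"audio_url": audio_url}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "authorization": api_key,  # assumed header name
            "content-type": "application/json",
        },
        method="POST",
    )


# Placeholder values for illustration only.
request = build_transcript_request("https://example.com/audio.mp3", "YOUR_API_KEY")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` with a real API key and a reachable audio file.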

Get started with AssemblyAI speech-to-text transcriptions by signing up for a free account here.

If you have any questions or ideas, then I would love to hear from you!


