What if you received a raw transcript that looked like this?
if you picture a sound meter with a needle that bounces up and down
every time there's a sound the tone is supposed to put the needle
perfectly at this one spot on the meter with a black numbers end and
the red part of the meter begins there's like a zero at that spot
marking this is where you want to be and the tone is just supposed to
rest there rock solid but this particular day with this particular
recording we put it on and keith and i watched the meter as the needle
first dipped below the zero then climbed above the zero and then
floated sort of tentatively to the spot that it was supposed to be at
the zero and rested there
It’s legible, but it takes real effort to read: your mind naturally wants to add punctuation, casing, and line breaks to make sense of the long string of text.
Compare the transcript above to this:
If you picture a sound meter with a needle that bounces up and down
every time there's a sound, the tone is supposed to put the needle
perfectly at this one spot on the meter with a black numbers end, and
the red part of the meter begins there's like a zero at that spot
marking, this is where you want to be. And the tone is just supposed to
rest there rock solid. But this particular day, with this particular
recording, we put it on, and Keith and I watched the meter as the
needle first dipped below the zero, then climbed above the zero, and
then floated sort of tentatively to the spot that it was supposed to be
at the zero and rested there.
See how much easier it is to read? This is because common punctuation and casing have been automatically applied to the transcription text.
Automatic Punctuation and Casing at AssemblyAI
When you transcribe an audio or video file with the AssemblyAI Speech-to-Text API, your transcript is automatically passed through our Punctuation and Casing Model. Instead of one long chunk of text, your transcript has appropriately placed punctuation, such as commas, periods, and question marks, along with correctly capitalized proper nouns, acronyms, and more. This improves readability and makes the transcript more useful overall.
How We Train Our Models
At AssemblyAI, we use a deep neural network built with the latest Deep Learning techniques to accurately predict punctuation and casing for any transcription text. Our model is trained on text containing billions of words, which boosts prediction accuracy to industry-best standards.
To train the model, we first take a text and remove all punctuation and casing. We feed this raw text to our Deep Learning model; the model then tries to predict the correct punctuation and casing for the text. If there are any incorrect predictions, we correct them and feed the corrected text back into the model so it can learn to make better predictions in the future.
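The data-preparation step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not AssemblyAI's actual pipeline: the function names, the label names (LOWER, CAPITALIZED, UPPER), and the choice to attach one casing label and one trailing-punctuation label to each word are all assumptions made for the example.

```python
import string

# Punctuation marks this sketch treats as predictable labels (an assumption).
PUNCT_LABELS = ",.?!"


def casing_label(word: str) -> str:
    """Classify a word's casing as UPPER, CAPITALIZED, or LOWER."""
    if word.isupper() and len(word) > 1:
        return "UPPER"          # e.g. "NPR"
    if word[:1].isupper():
        return "CAPITALIZED"    # e.g. "Keith", "I"
    return "LOWER"


def make_training_example(text: str):
    """Turn punctuated text into (raw input words, per-word target labels)."""
    inputs, labels = [], []
    for token in text.split():
        # Remember the trailing mark, if any, as the punctuation target.
        trailing = token[-1] if token[-1] in PUNCT_LABELS else ""
        # Strip punctuation from both ends; inner apostrophes survive.
        core = token.strip(string.punctuation)
        if not core:
            continue
        inputs.append(core.lower())
        labels.append((casing_label(core), trailing))
    return inputs, labels


inputs, labels = make_training_example("Keith and I watched NPR.")
print(inputs)  # ['keith', 'and', 'i', 'watched', 'npr']
print(labels)
```

The lowercase word list plays the role of the raw text fed to the model, and the labels are what the model would be asked to predict back.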
Punctuation refers to any commas, periods, question marks, exclamation marks, etc. that need to be added to a transcription text.
Casing refers to two different categories:
- Proper Nouns
- Special Scenarios, e.g., acronyms and abbreviations like NASA or NY Times.
Because our model is trained on such a wealth of data, it can accurately predict even the most obscure or rare proper nouns. But if you have a special use case, like industry-specific jargon, you can also use our Word Boost feature to add custom vocabulary words or custom casings so that these words or scenarios are transcribed accurately each time.
The same model is used for both real-time and asynchronous transcriptions.
Inverse Text Normalization (ITN)
Inverse Text Normalization, or ITN, is a rule-based system (based on an FST, or Finite State Transducer) that also increases the readability of a transcript.
Essentially, ITN translates the spoken form of text (which is the output of the Speech-to-Text model) into its written form. For example, the raw transcript might read:
february fourth twenty twenty two
(spoken form)
The ITN model converts this to:
february 4th 2022
(written form)
ITN is helpful to ensure the proper written format of text such as emails, credit card numbers, social security numbers, dates, and more.
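To make the spoken-to-written idea concrete, here is a toy rule-based converter in Python. It is a sketch under loud assumptions: it uses plain dictionary lookups rather than a real finite state transducer, it covers only day ordinals after a month name and simple two-digit number words, and every name in it is hypothetical. A production ITN system handles far more cases.

```python
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}
ORDINALS = {"first": "1st", "second": "2nd", "third": "3rd", "fourth": "4th",
            "fifth": "5th", "sixth": "6th", "seventh": "7th",
            "eighth": "8th", "ninth": "9th", "tenth": "10th"}
ONES = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
        "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16, "seventeen": 17,
         "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def parse_two_digit(words, i):
    """Parse a number 1-99 starting at words[i]; return (value, next_index) or None."""
    word = words[i]
    if word in TENS:
        if i + 1 < len(words) and words[i + 1] in ONES:
            return TENS[word] + ONES[words[i + 1]], i + 2
        return TENS[word], i + 1
    if word in TEENS:
        return TEENS[word], i + 1
    if word in ONES:
        return ONES[word], i + 1
    return None


def inverse_normalize(spoken: str) -> str:
    """Convert a lowercase spoken-form string to a written form (toy rules only)."""
    words = spoken.split()
    out = []
    i = 0
    while i < len(words):
        # Rule 1: an ordinal right after a month name becomes a day number.
        if words[i] in ORDINALS and out and out[-1] in MONTHS:
            out.append(ORDINALS[words[i]])
            i += 1
            continue
        first = parse_two_digit(words, i)
        if first is None:
            out.append(words[i])  # not a number word: copy through unchanged
            i += 1
            continue
        value, i = first
        second = parse_two_digit(words, i) if i < len(words) else None
        if second is not None and value >= 10:
            # Rule 2: two number groups in a row read as a year (20 + 22 -> 2022).
            pair, i = second
            out.append(f"{value}{pair:02d}")
        else:
            out.append(str(value))
    return " ".join(out)


print(inverse_normalize("february fourth twenty twenty two"))  # february 4th 2022
```

Each rule looks at a word and its neighbors and rewrites a short span, which is the same kind of local, deterministic mapping an FST encodes, just written out by hand.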
Model Accuracy
Our latest Punctuation and Casing Model is about 93.5% accurate – well above industry standards.
Accuracy is calculated with a Validation Data Set that consists of text that our model hasn’t seen before.
Say, for example, the Validation Data Set is a 100-word blog post. We would feed the model the text with all of the punctuation and casing removed. The model would then predict where the punctuation and casing should be added back. Finally, we compare the model's output with the original text. If the model correctly predicts the punctuation and casing for 95 of the 100 words, it scores 95% accuracy.
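The word-level comparison in this example can be written as a short Python function. This is a minimal sketch that assumes a word counts as correct only when both its casing and its attached punctuation match the original exactly; the function name is hypothetical, and the exact scoring rules used for the reported figure are not specified here.

```python
def word_accuracy(reference: str, predicted: str) -> float:
    """Fraction of words whose casing and punctuation match the reference."""
    ref_words = reference.split()
    pred_words = predicted.split()
    if not ref_words:
        raise ValueError("reference text is empty")
    if len(ref_words) != len(pred_words):
        # The model only restores punctuation and casing, so word counts should agree.
        raise ValueError("reference and prediction have different word counts")
    correct = sum(ref == pred for ref, pred in zip(ref_words, pred_words))
    return correct / len(ref_words)


reference = "I would not be here without Keith. We were there."
predicted = "I would not be here without keith. We were there."
print(word_accuracy(reference, predicted))  # 0.9
```

Here the model missed the capital letter on one of ten words, so the score is 0.9, or 90%.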
AssemblyAI’s Punctuation and Casing Model in Action
Let’s look at the AssemblyAI Punctuation and Casing Model in action using this podcast.
First, we will feed the audio file through the AssemblyAI Speech-to-Text API with automatic punctuation and casing and ITN disabled.
Here’s what the raw transcript would look like:
fundrise is a real estate investing platform complete with an app a
website and all the bells and whistles you'd expect from a tech
company that makes investing in highend real estate incredibly easy
you tell fundrise your investing goals and fundraise puts your money
into the real estate deals that are right for you you join the over
one hundred and seventy zero investors using fundrise to diversify
their portfolios without compromise sign up for free today at
fundrise that's f u n d r i s e i remember back when i was first
starting in radio working at npr this was in their old studios on m
street in washington dc my boss back then and my mentor keith talbot
who side note taught me what was possible with radio i would not be
here without keith we were in the studio listening to some recording
and this was back in the days of realtor tape recorders and so it was
long ago right and back then any realtor tape that you would throw up
on a machine at npr would start with i think it was like thirty seconds
Next, we’ll feed the same audio file through the AssemblyAI Speech-to-Text API, this time leaving automatic punctuation and casing and ITN enabled.
Note:
Automatic punctuation and casing are always applied to a transcription text unless manually disabled.
Here’s what the new transcript would look like:
Fundrise is a real estate investing platform complete with an app, a
website, and all the bells and whistles you'd expect from a tech
company. That makes investing in highend real estate incredibly easy.
You tell fundrise your investing goals, and Fundraise puts your money
into the real estate deals that are right for you. You join the over
1700 investors using fundrise to diversify their portfolios without
compromise. Sign up for free today at fundrise. That's F-U-N-D-R-I-S-E.
I remember back when I was first starting in radio, working at NPR.
This was in their old Studios on M Street in Washington, DC. My boss
back then and my mentor, Keith Talbot, who, side note, taught me what
was possible with radio. I would not be here without Keith. We were in
the studio listening to some recording. And this was back in the days
of realtor tape recorders. And so it was long ago, right? And back
then, any realtor tape that you would throw up on a machine at NPR
would start with, I think it was like 30 seconds.
The second transcription is much easier to read, right?
Diffchecker is a great tool for visually comparing two transcripts. Here’s what it shows for the two above:
Raw transcript:
With punctuation and casing and ITN:
Using Punctuation and Casing with the AssemblyAI Speech-to-Text API
As stated above, the AssemblyAI Speech-to-Text API automatically punctuates the transcription text and applies correct casing to proper nouns. Numbers are also automatically converted to their written format.
However, if you would like to disable these features, simply set the punctuate and format_text parameters to false in the JSON body. More details can also be found in the AssemblyAI docs.
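As a rough illustration, the request could be built like this in Python using only the standard library. The punctuate and format_text parameters come from the text above; the endpoint URL, the authorization header, and the audio_url field are assumptions based on the public API documentation and should be checked against the current docs. The helper name, API key, and audio URL are placeholders.

```python
import json
import urllib.request

# Assumed endpoint; confirm against the AssemblyAI docs before use.
API_URL = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(audio_url, api_key, punctuate=True, format_text=True):
    """Build (but do not send) a transcription request with the two toggles."""
    body = {
        "audio_url": audio_url,      # assumed field name for the audio location
        "punctuate": punctuate,      # False disables automatic punctuation
        "format_text": format_text,  # False disables casing and number formatting
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


# Placeholder values; replace with a real key and a reachable audio file.
request = build_transcript_request(
    "https://example.com/podcast.mp3",
    "YOUR_API_KEY",
    punctuate=False,
    format_text=False,
)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would submit it. That call is left out so the sketch runs without network access or a real key.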
Future Model Updates
Since the AssemblyAI Punctuation and Casing Model is built using a neural network, we are always testing the latest Deep Learning approaches and new State-of-the-Art techniques to see whether they improve the model. Our changelog, updated weekly, records all of these changes and improvements.
We also regularly feed the model new training data. However, there is always a fine balance between model size and inference speed: if you increase the size of your model too much, you may unintentionally slow prediction to an unacceptable level.