April 3, 2024

Universal-1

Robust and accurate multilingual speech-to-text

We are excited to introduce Universal-1, our latest and most powerful speech recognition model. Trained on over 12.5 million hours of multilingual audio data, Universal-1 achieves best-in-class speech-to-text accuracy across four major languages: English, Spanish, French, and German.

Universal-1 achieves industry-leading performance over various dimensions in a multilingual setting. Training on more than 12.5 million hours of diverse multilingual audio data, 600M-parameter Conformer RNN-T based Universal-1 achieves remarkable robustness across the diverse set of key dimensions we have identified as crucial for our customers.

A chart showcasing the difference in Word Error Rate by Language between AssemblyAI Universal-1 and other models liks Whisper Large v3, Azure Batch v3.1, Deepgram Nova-2, Amazon, Google latest_lang.

Accurate and robust multilingual speech-to-text

Universal-1 represents another major milestone in our mission to provide accurate, faithful, and robust speech-to-text capabilities for multiple languages, helping our customers and developers worldwide build various Speech AI applications.

Universal-1 achieves 10% or greater improvement in English, Spanish, and German speech-to-text accuracy, compared to the next-best system we tested, including both commercially available open-source models and speech recognition providers [Link]
Universal-1 reduces hallucination rate by 30% on speech data and by 90% on ambient noise, over a widely used open-source model, Whisper Large-v3, providing customers with confidence in their transcripts. [Link]
People prefer the outputs from Universal-1 over Conformer-2, our previous Automatic Speech Recognition (ASR) model, 71% of the time when they have a preference. [Link]
Universal-1 exhibits the ability to code switch, transcribing multiple languages within a single audio file. [Link]

Precise timestamp estimation

Word-level timestamps are essential for various downstream applications, such as audio and video editing as well as conversation analytics.

Universal-1 improved our timestamp accuracy by 13% relative to Conformer-2, outperforming Whisper Large-v3 by 26%. [Link]
The improvement in timestamp estimation results in a positive impact on speaker diarization, improving concatenated minimum-permutation word error rate (cpWER) by 14% and speaker count estimation accuracy by 71% compared to Conformer-2. [Link]

Efficient parallel inference

Effective parallelization during inference is crucial to achieve very low turnaround processing time for long audio files.

Universal-1 achieves a 5x speed-up compared to a fast and batch-enabled implementation of Whisper Large-v3 on the same hardware. [Link]

Building Universal-1

To attain the improvements outlined above, we have leveraged state-of-the-art ASR research and assembled a robust system using the most fitting techniques.

Model

Universal-1 utilizes a Conformer encoder [3], building upon our Conformer-1 and Conformer-2 experiences. Within the landscape of ASR architectures, which includes the likes of connectionist temporal classification (CTC) and attention-encoder-decoder (AED), we chose to use the recurrent neural network transducer model (RNN-T) for its speed and accuracy. With its optimized decoder implementation, our RNN-T's compute is largely allocated to the encoder. As the encoder computation can effectively leverage highly parallelized processing, the model is a suitable choice for large-scale parallel inference, enabling low-cost, fast turnaround time. Additionally, this architecture is able to retain time alignment information well, facilitating precise word-level timestamp estimation.

Universal-1's encoder consists of a stack of convolutional layers for 4x subsampling, a positional encoder, and 24 Conformer layers comprising approximately 600 million parameters in total. Each Conformer block utilizes chunk-wise attention [1], with each chunk being 8 seconds long. Besides the benefit of robustness to audio duration variations, chunk-wise attention offers a processing speed benefit by limiting attention computation within each chunk. Our RNN-T decoder utilizes a two-layer LSTM predictor and a joiner for producing output tokens, where we use a WordPiece tokenizer trained on multilingual text corpora.

Training

The Universal-1 ASR model was trained on a vast amount of unlabeled multilingual audio data as well as a large-scale labeled dataset by leveraging the emerging self-supervised learning framework. As illustrated in Fig. 1, the encoder is first pre-trained based on the unlabeled data, preconditioning the model on a wide variety of acoustic and linguistic conditions. Then, a randomly initialized decoder was added, and the entire network was fine-tuned end-to-end to perform ASR.

Fig. 1—Overview of our model training pipeline.

Pre-training using unlabeled audio

We chose the BEST-RQ [7] algorithm for pre-training as our self-supervised learning (SSL) scheme. BEST-RQ is advantageous over other frequently used SSL methods such as Hubert [5] and wav2vec 2.0[6] in terms of data scalability, since it does not require extra representation learning. This is enabled by a random projection quantizer and a randomly initialized codebook. Efficient data scaling is crucial for dealing with tens of millions of hours of audio data.

In our BEST-RQ implementation, masking positions are randomly sampled with a probability of 1%. For each masking position, 10 consecutive frames are masked. The features at the masked frames are replaced by random vectors while the original features are used to compute codebook indices, which are used for training targets in BEST-RQ [7]. All input sequences are cropped to a random window of 32 seconds to maximize the utilization of available computation resources.

Pre-training is performed with a single pass over our full dataset comprising 12.5 million hours of audio. We utilize the AdamW optimizer with a linearly decaying learning rate after an initial warm-up period. A batch size of 2048 is used. Pre-training was performed on a cluster of v5e TPU chips, operating with 54% Model Flops Utilization (MFU), thanks to our optimized JAX implementation.

Supervised Fine-tuning

After pre-training the Conformer encoder, we fine-tune a full RNN-T ASR model by starting the training from a randomly initialized decoder and the pre-trained encoder. To achieve acoustic and linguistic robustness, we utilize a mix of human-transcribed and pseudo-labeled data, where the pseudo-labeled dataset is greater in quantity and covers a wider range of conditions.

We use an AdamW optimizer employing different learning schedules for the encoder and decoder. Specifically, a smaller peak learning rate with a longer warm-up period is applied to the encoder parameter update since it is already pre-trained. A batch size of 4096 is used with carefully chosen data sampling ratios for different dataset domains. This ensures consistency of the data distribution across training batches. We employ a custom JAX implementation of RNN-T loss [4], which we have optimized to substantially reduce TPU memory usage during training.

Infrastructure

Due to the huge data scale, a robust and scalable infrastructure and system design are critical. We have built our infrastructure on Google Cloud TPUs and used JAX for training, leveraging its expressiveness, flexibility, and speed. We have integrated Google Cloud’s Bigtableservice with pre-caching capabilities to aid in handling our massive-scale datasets, amounting to hundreds of millions of rows, each row comprising audio paths, labels, and various other elements that cannot be directly stored and loaded into memory locally. We have found this approach to be very reliable for our use case.

Data

To achieve our goals of robust and accurate speech-to-text, we paid special attention to the datasets we used for training. We combined various data enhancement approaches laid out in the literature.

Universal-1's Conformer encoder was pre-trained on 12.5 million hours of unlabeled multilingual audio data using BEST-RQ. Not only did this pre-training step expose the model to a very wide range of speech conditions, but it also allowed the fine-tuning step to converge quickly, facilitating experimentation cycles. To ensure the quality and diversity of the unlabeled dataset, various filtering and pre-processing techniques were implemented. These include careful selection of data sources, signal-to-noise ratio-based filtering, and speech density-based filtering, among others.

During fine-tuning, we utilized both human-transcribed and machine-transcribed (i.e. pseudo-labeled) datasets for all four languages. We employed the pseudo-labeled data because acquiring human-transcribed data is expensive and may not be adequate to capture the variations and nuances of speech, particularly in a multilingual context. This approach showed significant improvements in noise robustness for our previous ASR model, Conformer-2.

Pseudo-labeling involves using existing ASR models to transcribe unlabeled speech data for training purposes. While it allows us to leverage untranscribed data and circumvent the need for costly human labeling, it presents challenges as ASR models are inherently less reliable than human transcribers. Specifically, they may produce consistent error patterns, causing them to introduce systematic error into our system. To maintain high accuracy, we combined outputs from two ASR models and retained only the speech files for which the two machine-generated transcriptions exhibited sufficient consensus. As a result, we obtained a total of 1.62 million hours of fine-tuning data, comprising human-transcribed and pseudo-labeled speech files across the four languages Universal-1 is trained for, although dominated largely by English. This dataset was commonly used during fine-tuning for Spanish, German, and French, while a special data mix was used when fine-tuning for English to optimize for the English use case.

Performance Analysis

For an overview of how we evaluate ASR models at AssemblyAI, please reference our blog post: How to evaluate Speech Recognition models.

Speech-to-text Accuracy (English)

To evaluate Universal-1’s speech-to-text accuracy on English data, we measured the Word Error Rate (WER) across diverse testing scenarios using public and internal corpora. We compared Universal-1's performance with four commercial ASR providers and two large open-source models, OpenAI’s Whisper Large-v3 [2] and NVIDIA’s Canary-1B.

We used the default settings for each provider’s strongest model as of March 2024. Universal-1 achieved the lowest WER on 5 out of 11 datasets and exhibited strong robustness in acoustically challenging environments, including telephony and noisy scenarios. We confirmed that Universal-1 outperformed our predecessor, Conformer-2, by 11.0% relatively on the same datasets. Please refer to the Test Data section for the details of each test set.

Fig. 2—The above bar chart shows the WERs (lower is better) for Universal-1, commercial ASR providers, and open-source models, demonstrating Universal-1's superior robustness across various conditions. The average WER is calculated as a macro-average of the WERs over all datasets.

Dataset	AssemblyAI Universal-1	NVIDIA Canary-1B		Microsoft Azure Batch v3.1	Deepgram Nova-2	Amazon	Google Latest-long
Dataset	AssemblyAI Universal-1	5.4%	4.8%	6.0%	6.4%	5.9%	8.2%
CommonVoice v5.1	6.8%	6.9%	8.7%	9.2%	11.9%	9.0%	16.9%
CommonVoice v5.1	11.8%	9.4%	12.5%	12.7%	10.6%	11.3%	19.1%
Earnings 2021	9.9%	11.2%	9.6%	7.4%	12.3%	10.5%	11.9%
Earnings 2021	1.6%	1.5%	1.8%	2.8%	2.6%	2.9%	5.8%
LibriSpeech Test-Other	3.1%	3.0%	3.6%	6.4%	5.7%	6.6%	12.6%
Meanwhile	4.7%	6.0%	9.7%	6.2%	6.4%	7.3%	10.0%
Noisy	10.2%	12.9%	11.8%	16.8%	15.7%	27.5%	21.3%
Podcast	8.4%	11.7%	10.0%	9.7%	11.8%	10.2%	11.9%
TEDLIUM	7.5%	7.8%	7.4%	9.7%	6.6%	9.1%	11.3%
Telephony (internal)	11.0%	13.1%	12.8%	16.4%	15.1%	16.0%	20.4%
Average	7.3%	8.1%	8.4%	9.4%	9.5%	10.6%	13.6%

Table 1 - English WERs for different datasets, obtained by Universal-1 and other open-source and commercial ASR systems.

Speech-to-text Accuracy (Non-English)

We also measured the WER for Spanish, French, and German by using a mix of public and internal datasets. Our internal datasets focus on long-form audio resembling customer domains. Figure 3 and Table 1 compare Universal-1 with the open-source models and ASR providers. It achieved lower WERs in 5 out of 15 datasets, demonstrating its competitiveness in these languages.

Fig. 3—The bar chart above illustrates the WERs (lower is better) for each model and ASR provider in Spanish, French, and German ASR. The result demonstrates Universal-1's competitiveness across various datasets in these three non-English languages.

Dataset	AssemblyAI Universal-1	NVIDIA Canary-1B	OpenAI Whisper Large-v3	Microsoft Azure Batch v3.1	Deepgram Nova-2	Amazon	Google Latest-long
Spanish
Fleurs	5.0%	7.1%	2.8%	4.9%	7.0%	6.2%	5.0%
Multilingual LS	3.3%	3.0%	5.7%	5.7%	5.8%	3.4%	7.2%
Private	4.6%	11.3%	6.6%	8.9%	8.2%	6.2%	13.7%
Voxpopuli	7.5%	7.1%	9.6%	8.8%	9.3%	8.6%	10.7%
Common Voice v9	3.4%	4.2%	5.0%	7.5%	8.0%	4.7%	7.1%
Average	4.8%	6.5%	6.0%	7.2%	7.6%	5.8%	8.7%
German
Fleurs	10.2%	9.2%	7.5%	7.2%	12.4%	9.0%	12.1%
Multilingual LS	3.6%	4.5%	7.4%	5.7%	8.2%	3.2%	12.0%
Private	7.4%	11.9%	8.7%	9.5%	11.1%	9.7%	12.1%
Voxpopuli	12.2%	9.5%	12.6%	14.7%	15.1%	16.8%	17.4%
Common Voice v9	4.2%	4.6%	6.0%	7.4%	9.4%	5.8%	10.1%
Average	7.5%	7.9%	8.4%	8.9%	11.2%	8.9%	12.7%
French
Fleurs	6.8%	7.7%	5.6%	8.4%	9.7%	8.5%	12.4%
Multilingual LS	2.3%	4.0%	8.1%	9.7%	5.2%	4.5%	15.2%
Private	16.8%	27.6%	16.1%	23.7%	17.5%	14.9%	17.6%
Voxpopuli	9.2%	10.1%	11.2%	11.4%	11.2%	8.7%	14.7%
Common Voice v9	7.0%	6.5%	11.3%	12.9%	12.1%	7.4%	17.5%
Average	8.4%	11.2%	10.5%	13.2%	11.1%	8.8%	15.5%

Table 2 - Spanish, German, and French WERs for different datasets, obtained by Universal-1 and other open-source and commercial ASR systems.

Timestamp Accuracy

One of our objectives is to provide accurate timestamps for each recognized word. We have confirmed that, in addition to transcribing speech accurately, Universal-1 does provide more accurate word timestamps than Conformer-2 and Whisper Large-v3.

We developed an experimental setup using internal ASR evaluation datasets containing phone calls, podcast, news and webinar audio to evaluate timestamp accuracy. We used a high fidelity forced alignment algorithm to obtain reference timestamps. Forced alignment is a process of generating word-level timestamps based on audio and the corresponding human-labeled transcriptions. The ASR outputs are aligned with the human transcriptions to pair each recognized word with a ground-truth word. For each paired word, the difference between the reference and estimated timestamps is computed.

Figure 4 depicts the word-level timestamp estimation accuracy as a function of an estimation error tolerance threshold. For each x-axis value, the corresponding y-axis value shows the percentage of words whose estimated timestamp falls within this threshold when compared to the reference timestamp. A curve that approaches the upper-left corner indicates a model with more accurate timestamp estimation. The percentage of words with a predicted timestamp within 100ms of the reference improved by 25.5% relative to Whisper Large-v3, increasing from 67.2% to 84.3%. This improvement could be attributed to the utilization of the Transducer model architecture instead of a Transformer decoder. Universal-1 outperformed Conformer-2 by 12.6% in terms of word timestamp estimation accuracy for the 100ms tolerance threshold.

Fig. 4: Word-level timestamp estimation accuracy as a function of estimation error tolerance. A curve closer to the upper left corner indicates higher accuracy.

Inference Efficiency

The architectural choices described above enable Universal-1 to achieve very high inference speeds. First, chunk-wise attention reduces the overall number of attention computations. Moreover, the Transducer decoder is more compact than a typical Transformer decoder, and requires no attention computation at all. To validate Universal-1’s efficiency advantage using the same inference hardware, we compared Universal-1's processing speed with that of Whisper Large-v3 on popular NVIDIA Tesla T4 machines with 16GB of VRAM

Model	Batch Size	Time to transcribe 1h
Whisper (faster-whisper backend)	1	216s
Whisper (faster-whisper backend)	24	107s
Universal-1	1	68s
Universal-1	64	21s

Table 3 - Time required to transcribe 1 hour of audio for each model, with and without parallelized inference, benchmarked on T4 GPUs.

Table 3 shows the time it took for Universal-1 and Whisper Large-v3 to transcribe one hour of audio from our test sets, with and without parallelization.

As a strong baseline, we leveraged the highly optimized WhisperX implementation, which uses the faster-whisper backend, offering fast inference and low memory usage, and allows batched inference.

Without batching, Universal-1 is already 3x faster than faster-whisper. Since our model also reduces the overall memory footprint, it also allows for greater maximum batch sizes at inference. Overall, we observe a 5x speedup for parallelized inference over faster-whisper — a meaningful reduction in turnaround time for our users.

Hallucinations

A key improvement we measured in Universal-1 over popular open-source ASR models is its reduced propensity for hallucinations, often manifesting as long contiguous blocks of consecutive transcription errors.

Similar to large language models that sometimes generate non-factual text, ASR models can occasionally hallucinate by continuously producing words ungrounded in the input audio. These errors may involve auto-completing the transcription, repeating the same word, or completely ignoring speech segments present in the audio.

Such unpredictable recognition errors can be challenging to detect and may stem from factors like model architecture and quality of training data. The Whisper Large-v3 model, in particular, has exhibited an increased propensity for hallucinations compared to its predecessors. Universal-1 mitigates consecutive hallucinations through its Transducer-based architecture [4] and extensive data filtering pipelines, which remove low-quality samples which are likely to cause compounding errors during training.

To quantify the hallucination problem for ASR models, we introduce 3 proxy metrics based on consecutive transcription errors:

Omissions – Five or more consecutive speech deletions (skipping words that are present in the audio).
Fabrications - Five or more consecutive speech insertions (adding words not present in the original audio) or substitutions (replacing words with incorrect ones).
Hallucinations – Five or more consecutive insertions, substitutions, or deletions.

We feel this is an under-quantified aspect of ASR benchmarking and plan to continue reporting these numbers for future releases.

To evaluate hallucination rates, we measured the average number of hallucination events per hour of transcribed audio for each model on 146 hours of audio pulled from diverse datasets. Compared to Whisper Large-v3, Universal-1 shows a 30% reduction in hallucination rates (Figure 6).

Fig. 5–This chart compares the hallucination rates of Universal-1 vs. Canary-1B and Whisper Large-v3. Hallucination rates are defined as the number of times a model generates five or more consecutive errors per hour of transcribed audio. Universal-1 demonstrates reductions in hallucination rates compared to the other models.

The table below showcases some examples from our test sets where Universal-1 has fixed Whisper’s hallucinations.

Whisper	Universal-1	Ground-truth
hadja luis sima addjilu sime subtitles by the amara org community	her jewelry shimmering	her jewelry shimmered
the ride to price inte i daseline is about 3 feet tall and suites sizes is 하루	the tabet mountain chain is often considered the backbone of the korean venezuela	the taebaek mountain chain is often considered the backbone of the korean peninsula
does that mean we should not have interessant n	there's an englishman said nothing	the englishman said nothing
this time i am very happy and then thank you to my co workers get them back to jack corn again thank you to all of you who supported me the job you gave me ultimately gave me nothing however i thank all of you for supporting me thank you to everyone at jack corn thank you to michael john song trabalhar significant	marine a month of sundays	not in a month of sundays

Whisper’s hallucination problems are much more pronounced for noisy patches of audio that contain no real speech. The following table shows the transcription outputs on some examples of random sound effects collected from YouTube, some of which might be similar to background noises in online meetings. Whisper fabricates some transcription text, while Universal-1 correctly outputs empty transcriptions.

Whisper	Universal-1	Audio Description
thank you for watching you	(blank output)	Windows startup sounds
thank you thank you thank you	(blank output)	Waterfall sounds
tail sit thank you	(blank output)	Soft tapping and buzzing in the background
här	(blank output)	Humming girl
thank you	(blank output)	Typing noise
subtitles by the amara org community	(blank output)	Zoom app sound effects
I'm going to try to make it look like I'm eating a banana. It's really good. It's sweet and sour. It's so good. I love it. It's really good. It's so good. It's so good. It's so good. It's so good. It's so good.	(blank output)	Eating an apple

To systematically assess robustness against non-speech audio, we evaluated Universal-1 and Whisper Large-v3 on audio samples randomly selected from the AudioSet and DNC datasets (200 samples each), excluding sound categories that might contain human speech. As shown in the table below, Whisper Large-v3 almost always generates random text for such ambient sound samples, including long strings of unintelligible or repeated symbols. In contrast, Universal-1 only outputs a non-blank response around 10% of the time and generally avoids generating long text. These results demonstrate USM's ability to distinguish between speech and non-speech sounds, minimizing erroneous outputs for background noise.

	Dataset	Model
	Dataset	Whisper	Universal-1
Non-blank Response Rate	AudioSet	100.0%	10.5%
Non-blank Response Rate	DNC	99.5%	10.0%
Average #characters for non-blank response	AudioSet	16	3.6
Average #characters for non-blank response	DNC	7.6	2

Improvement over Conformer-2

Since we're upgrading our English ASR model from Conformer-2 to Universal-1, we've taken extra steps to ensure that this upgrade benefits our customers, beyond confirming the WER improvement.

Human Preference Evaluation

While WER and other objective metrics are critical for automatically measuring ASR performance, they do not necessarily capture the nuances of what humans deem a desirable transcription. To ensure continuous improvement over our previous generation model, we conducted a human preference test by comparing Universal-1 to Conformer-2.

Our human preference test involved several evaluators in a blind experiment setup, comparing Universal-1 and Conformer-2. Each evaluator compared unlabeled Universal-1 and Conformer-2 transcriptions side-by-side for 361 audio clips covering diverse topics and domains that were not included in our training set. The listeners were allowed to take breaks and come back as needed. To further reduce listening fatigue, each clip was shorter than 30 seconds. To conduct end-to-end performance analysis, we standardized our text formatting procedure applied to both model outputs, consisting of inverse text normalization, automatic punctuation, and truecasing.

Figure 6 shows the test result. Human evaluators preferred Universal-1’s outputs 60% of the time and chose Conformer-2’s outputs 24% of the time, with 16% of the votes resulting in ties. This result demonstrates substantial qualitative improvements over Conformer-2, a state-of-the-art model at its time of release.

Fig. 6—Human evaluators preferred Universal-1’s outputs 60% of the time, while Conformer-2 was chosen 24% of the time.

Speaker Diarization

Speaker diarization, the process of identifying "who spoke when" in an audio recording, is a critical capability for various use cases, including meeting transcription, call analytics, and audio content analysis.

For this analysis, we utilized our in-house speaker diarization algorithm, which assigns a speaker label to each word in a transcript given an audio file and the time-annotated words in its transcript. To assess Universal-1's performance in this domain, we conducted experiments using the AMI-IHM and DiPCo corpora, involving complex multi-speaker conversations. We observed significant improvements compared to Conformer-2:

7.7% relative reduction in Diarization Error Rate (DER), a metric that combines missed speech, false alarms, and speaker errors.
13.6% relative reduction in concatenated minimum-permutation Word Error Rate (cpWER), which jointly measures the accuracy of both speech-to-text and speaker labeling.
71.3% relative error reductionin speaker counting, which estimates the number of speakers present in the input audio file.

These improvements in speaker diarization accuracy are likely due to Universal-1's combined enhancements in timestamp prediction and word recognition capabilities.

Key Observations for Future Research

Codeswitching

Codeswitching refers to utterances comprising words from two different languages. We have found that training an ASR model with a multilingual tokenizer on multilingual data without language-specifying input results in the model exhibiting a certain level of code switching capability.

To test this aspect, we created a small dataset of 50 utterances by concatenating random samples from LibriSpeech test-clean and MLS Spanish. Universal-1 showed a 17.3% relative improvement in terms of WER over Whisper Large-v3 with automatic language detection (ALD). We found that most of this improvement came from consistently outputting tokens from the spoken language despite frequent switches. This is still an active area of research for us, but keep an eye out for more details in our follow-up report.

Universal-1 is Accessible through Our API Today

The easiest way to try Universal-1 is through our Playground, where you can upload a file or enter a YouTube link to see a transcription in just a few clicks.

You can also try out our API directly for free. Sign up for a free API token, and head over to our Docs or Welcome Colab to be up and running in just a few minutes.

If you’re thinking about integrating Universal-1 into your product, you can reach out to our Sales team with any questions you might have.

‍

Credits

Core Research

Francis McCann (lead), Luka Chkhetiani (lead), Andrew Ehrenberg, Robert McHardy, Rami Botros

Research Contributors

Yash Khare, Andrea Vanzo, Taufiquzzaman Peyash, Gabriel Oexle, Michael Liang, Ilya Sklyar

Research Data

Ahmed Etefy, Daniel McCrystal

Research Infrastructure

William Pipsico Ferreira, Ruben Bousbib

Production Engineering

Ben Gotthold, Soheyl Bahadoori, Enver Fakhan, Rahul Bagai, Mez Rashid, James He

Technical Leadership

Takuya Yoshioka, Travis Kupsche, Domenic Donato

Technical Writing

Marco Ramponi, Ryan O’Connor

Benchmarking

Sam Flamini

Quality Assurance

Sergio Ramirez Martin (lead)

English QA

Rajpreet Thethy, Lee Vaughn, Martin Schweiger, Anita Miller, Prachie Banthia, Audy Mulia, Amanda Lee, Christy Roach, Domenic Donato, Dylan Duan, Matt Lawler, Ryan O’Connor, and Whitney DeGraaf

Spanish QA

Francis McCann, Sergio Ramirez Martin, Jaime Lorenzo-Trueba

French QA

Ruben Bousbib, Michelle Asuamah, Marco Ramponi

German QA

Gabriel Oexle, Rami Botros, Ilya Skylar, Patrick Loeber

Special Thanks

Dillon Pulliam, Michael Chinen

Appendix

Evaluation Datasets

English Datasets

Common Voice V5.1: We used the English subset of the V5.1 dataset from the official website.
CORAAL: We used the version 2021.07 dataset from official sources and segmented according to the FairSpeech project.
TED-LIUM 3: We used 11 TED talks, following the Whisper’s TED-LIUM 3 long-form partition.
LibriSpeech: We used the test-clean and test-other splits from the LibriSpeech ASR corpus.
Earnings-21: We used the corpus of earnings calls from the speech-datasets repository from the 202206 version.
Meanwhile: We followed the dataset creation procedure from Whisper and downloaded the 64 segments from YouTube.
Podcast: We used a 18.2 hour human-labeled dataset of podcasts from a mix of public and internal sources.
Broadcast: We used a 7.5 hour human-labeled private dataset of news broadcasts from a mix of public and internal sources.
Telephony: We used a 8.2 hour human-labeled private dataset of telephone conversations from a mix of public and internal sources.
Noisy: We used a 7.2 hour human-labeled private dataset of noisy real world audio from a mix of public and internal sources.

Multilingual Datasets

Fleurs: We downloaded the test splits for each language from the HuggingFace distribution.
MLS: We used the test split of each language in the Multilingual LibriSpeech (MLS) corpus .
VoxPopuli: We downloaded and segmented the dataset according to the official instructions for each language.
Common Voice V9: We downloaded the V9 dataset from the official website for each language.
Private: We used a small 6 - 10 hour human-labeled dataset of real-world audio.

Timestamp Datasets

We used a combination of our internal ASR evaluation datasets (e.g. podcast, broadcast, telephony, etc) and a high fidelity forced alignment algorithm to create word-level reference timestamps.

References

[1] Zhang, Yu, et al. "Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages." arXiv preprint arXiv:2303.01037 (2023)

‍[2] Radford, Alec, et al. "Robust speech recognition via large-scale weak supervision." arXiv preprint arXiv:2212.04356 (2022).

‍[3] Gulati, Anmol, et al. "Conformer: Convolution-augmented transformer for speech recognition." arXiv preprint arXiv:2005.08100 (2020).

‍[4] Graves, Alex. "Sequence Transduction with Recurrent Neural Networks." arXiv preprint arXiv:1211.3711 (2012).

‍[5] Hsu, Wei-Ning, et al. "HuBERT: self-supervised speech representation learning by masked prediction of hidden units." arXiv preprint arXiv:2106.07447 (2021).

‍[6] Baevski, Alexei, et al. "wav2vec 2.0: a framework for self-supervised learning of speech representations." arXiv preprint arXiv:2006.11477 (2020).

‍[7] Chiu, Chung-Cheng, et al. “Self-supervised Learning with Random-projection Quantizer for Speech Recognition”. arXiv preprint arXiv:2202.01855(2022).