To build generative AI applications that leverage spoken data, product and development teams need accurate speech-to-text as a critical component of their AI pipeline. We’re excited to see the innovative AI products built with the improved results from our Conformer-2 model.
Conformer-1 used a technique called noisy student-teacher training (NST) to achieve state-of-the-art speech recognition. NST works by first training a “teacher” model on labeled data and then using it to generate predictions on unlabeled data. A “student” model is then trained on the same labeled data plus the unlabeled data, with the teacher model’s predictions acting as labels.
When training Conformer-2, we took this approach a step further with model ensembling: labels were produced by multiple strong teachers rather than by a single one. Ensembling is a well-known variance reduction technique and results in a model that is more robust when exposed to data not seen during training. With an ensemble of teacher models, the student model is exposed to a wider distribution of behaviors, and the failure cases of any individual teacher are offset by the successes of the others in the ensemble.
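The data flow described above can be sketched in a few lines of Python. This is only an illustration: the post does not say how the outputs of the teachers are combined, so the round-robin assignment below is an assumption, and `build_student_dataset` and the stand-in teachers are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

# A "teacher" here is any callable that maps an audio identifier to a transcript.
Teacher = Callable[[str], str]


def build_student_dataset(
    labeled: Sequence[Tuple[str, str]],
    unlabeled: Sequence[str],
    teachers: Sequence[Teacher],
) -> List[Tuple[str, str]]:
    """Combine human-labeled data with pseudo-labels from several teachers.

    Assumption: each unlabeled sample is labeled by one teacher, chosen
    round-robin, so the student sees the behavior of every teacher.
    """
    if not teachers:
        raise ValueError("at least one teacher is required")
    pseudo_labeled = [
        (audio, teachers[i % len(teachers)](audio))
        for i, audio in enumerate(unlabeled)
    ]
    return list(labeled) + pseudo_labeled


# Stand-in teachers that only tag their output; real teachers would be ASR models.
def teacher_a(audio: str) -> str:
    return f"A:{audio}"


def teacher_b(audio: str) -> str:
    return f"B:{audio}"


student_data = build_student_dataset(
    labeled=[("call_001.wav", "hello world")],
    unlabeled=["u1.wav", "u2.wav", "u3.wav"],
    teachers=[teacher_a, teacher_b],
)
```

The student would then be trained on `student_data` exactly as it would on fully human-labeled data.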
Scaling up to 1M+ hours
The other research thread that we pursued further with Conformer-2 was data and model parameter scaling. The Chinchilla paper demonstrated that current large language models are significantly undertrained, meaning that they are not trained on enough data for their size. In addition, the paper provided scaling laws that identify the appropriate amount of training data for different model sizes. For Conformer-1, we adapted these scaling laws to the speech recognition domain to determine that the model would require 650K[I] hours of audio for training. Conformer-2 follows this curve up and to the right, increasing model size to 450M parameters and training on 1.1 million hours of audio data.
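The conversion from text tokens to hours of audio can be written out as a short calculation. The two conversion constants come from footnote [I]; the ratio of roughly 20 training tokens per parameter is an assumption taken from the commonly quoted Chinchilla rule of thumb, not a figure stated in this post, so the output should be read as an order-of-magnitude estimate rather than the exact numbers above.

```python
WORDS_PER_AUDIO_HOUR = 7_200   # heuristic from footnote [I]
TOKENS_PER_WORD = 1.33         # heuristic from footnote [I]
TOKENS_PER_PARAMETER = 20      # assumption: approximate Chinchilla rule of thumb


def tokens_to_audio_hours(tokens: float) -> float:
    """Convert a text-token budget into the equivalent hours of audio."""
    return tokens / (WORDS_PER_AUDIO_HOUR * TOKENS_PER_WORD)


def estimated_training_hours(
    parameters: float, tokens_per_parameter: float = TOKENS_PER_PARAMETER
) -> float:
    """Estimate the audio hours suggested by a tokens-per-parameter ratio."""
    return tokens_to_audio_hours(parameters * tokens_per_parameter)


# Estimate for a 450M-parameter model under the assumed ratio.
hours_450m = estimated_training_hours(450e6)
print(f"{hours_450m:,.0f} hours")
```

Changing `tokens_per_parameter` shows how sensitive the estimate is to the assumed ratio.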
Usually, a larger model comes with higher cost and slower speeds. We were able to counter these tradeoffs by investing heavily in our serving infrastructure, so that Conformer-2 is actually faster than Conformer-1 by up to 55%, depending on the duration of the audio file.
Figure 1 shows processing duration as a function of audio duration for both Conformer-1 and Conformer-2. Conformer-2 shows significant reductions in relative processing duration across all audio file durations. For an hour-long file, transcription time drops from 4.01 minutes with Conformer-1 to 1.85 minutes with Conformer-2. These improvements will allow users to get their results faster.
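As a quick check on the hour-long example, the speedup can be computed directly from the two timings quoted above. The helper names are illustrative only.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a measurement shrank, e.g. 0.25 for a 25% reduction."""
    return (before - after) / before


def real_time_factor(processing_minutes: float, audio_minutes: float) -> float:
    """Processing time divided by audio duration; lower is faster."""
    return processing_minutes / audio_minutes


# Timings for a 60-minute file, taken from the text above.
speedup = relative_reduction(before=4.01, after=1.85)
rtf = real_time_factor(processing_minutes=1.85, audio_minutes=60)
print(f"reduction: {speedup:.1%}, real-time factor: {rtf:.3f}")
```

For this file length the reduction works out to roughly 54%, consistent with the "up to 55%" figure, which varies with audio duration.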
As an applied research company, the primary objective in creating Conformer-2 was to improve performance for domains that are relevant to real-world use cases. WER can be a useful metric to get a sense of model performance, but it doesn’t always reflect the nuances of how a good model performs on real-world data. While model ensembling and model/dataset scaling did not lead to any significant improvements in WER, we observed large gains in Alphanumeric Transcription Accuracy, Proper Noun Error, and Robustness to Noise.
Proper Nouns Performance
While the most common metric for evaluating ASR models is word error rate (WER), it is not a perfect benchmark. WER only counts the number of errors, not their significance. Certain errors obscure the intended meaning significantly more than others, which is especially important for real-world use cases. For example, incorrectly transcribing someone’s name or an address is much more consequential than transcribing “there” as “their”, and this reality is not captured by WER.
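A minimal sketch of WER makes this limitation concrete. It uses the standard definition, word-level edit distance divided by the number of reference words; the example sentences are invented for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of a hypothesis against a non-empty reference."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "please call maria at their office"
homophone_error = "please call maria at there office"  # harmless slip
name_error = "please call marina at their office"      # wrong person

# Both hypotheses contain exactly one substituted word.
print(wer(reference, homophone_error), wer(reference, name_error))
```

Both hypotheses receive the same score even though only one of them changes who should be called.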
Therefore, in order to quantify these particularly important errors, we crafted a novel metric which we refer to as Proper Noun Error Rate (PPNER). PPNER measures a model’s performance specifically on proper nouns, using a character-based metric called Jaro-Winkler similarity. For further details, please see our blog post on how to evaluate speech recognition models.
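For readers unfamiliar with Jaro-Winkler similarity, here is a minimal sketch of the standard textbook formulation. It illustrates only the underlying character-based similarity; how PPNER extracts proper nouns and aggregates scores is not described in this post and is not reproduced here.

```python
def jaro_similarity(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_matched = [c for c, hit in zip(a, a_hit) if hit]
    b_matched = [c for c, hit in zip(b, b_hit) if hit]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(x != y for x, y in zip(a_matched, b_matched)) / 2
    return (
        matches / len(a) + matches / len(b) + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro_similarity(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Because the score is computed over characters, a near miss on a name is penalized less than a completely different word, which a word-level metric like WER cannot express.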
Figure 2 shows the performance of Conformer-1 vs Conformer-2 across various datasets that reflect different industry domains. This data consists of 60+ hours of human-labeled audio, covering popular speech domains such as call centers, podcasts, broadcasts, and webinars. As the figure shows, we observed a 6.8% improvement in PPNER from Conformer-1 to Conformer-2. These improvements result in more consistent transcription of entities such as names, as well as more readable transcripts in general.
The fact that WER counts errors without weighing their significance also matters a great deal for numerical data. Incorrectly transcribing a credit card number or a confirmation code, for example, can cause serious issues for downstream applications. To test Conformer-2’s performance on alphanumeric data, we randomly generated 100 sequences of between 5 and 25 digits, and then used a commercial text-to-speech provider to voice each one with 1 of 10 different speakers. We then transcribed these voiced sequences with both Conformer-1 and Conformer-2 and calculated the Character Error Rate (CER) for each transcribed sequence. CER measures the proportion of digits that are incorrectly transcribed in a given sequence. To make the aggregate comparison of Conformer-1 and Conformer-2 more informative, we removed transcripts where both models achieved 0% CER.
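The evaluation setup above can be sketched as follows. This is an approximation under stated assumptions: CER is computed with the usual character-level edit distance, the text-to-speech and transcription steps are left out, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def char_edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of a hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return char_edit_distance(reference, hypothesis) / len(reference)


def random_digit_sequences(
    count: int = 100, min_len: int = 5, max_len: int = 25, seed: int = 0
) -> List[str]:
    """Generate digit strings whose lengths fall in [min_len, max_len]."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice("0123456789") for _ in range(rng.randint(min_len, max_len)))
        for _ in range(count)
    ]


def drop_jointly_perfect(
    cers_a: Sequence[float], cers_b: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Remove samples on which both models scored 0% CER."""
    kept = [(a, b) for a, b in zip(cers_a, cers_b) if a > 0 or b > 0]
    return [a for a, _ in kept], [b for _, b in kept]


sequences = random_digit_sequences()
print(cer("4111111111111111", "4111111111111171"))  # one wrong digit out of 16
```

Filtering out the jointly perfect samples keeps easy sequences from diluting the difference between the two models.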
The results showed a 30.7% relative reduction in the mean CER on our newly curated alphanumeric dataset. In addition, Conformer-2 shows reduced variance, meaning that it is much more likely to avoid significant mistakes. Conformer-2 even gets 0% CER on some files that Conformer-1 made mistakes on. Applications which rely on numerical accuracy can expect notable improvements with Conformer-2.
Conformer-1 demonstrated impressive noise robustness, achieving 43% fewer errors on our noisy test dataset than the next best provider. We hypothesized that Conformer-2 would push this robustness even further, so we calculated WER on the Librispeech-clean dataset with different levels of added white noise in order to test this hypothesis.
The results in the figure show that Conformer-2 improves upon Conformer-1, pushing noise robustness even further. This improvement can likely be attributed to the increased diversity of the training data provided by multiple teachers. At a Signal-to-Noise Ratio (SNR) of 0 (equal parts original audio and noise), Conformer-2 improves upon Conformer-1 by 12.0%. Real-world data is rarely clean, so this added robustness will allow users to successfully apply Conformer-2 to such data, and not just to the sterile, academic datasets against which speech recognition models are so commonly evaluated.
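Mixing white noise into clean audio at a chosen SNR can be sketched with the standard library alone. The details are assumptions: the post does not describe its exact noise-mixing procedure, so this uses Gaussian white noise scaled so that the ratio of signal power to noise power matches the target SNR in decibels (0 dB meaning equal power).

```python
import math
import random
from typing import List, Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average squared amplitude of a signal."""
    return sum(s * s for s in samples) / len(samples)


def add_white_noise(signal: Sequence[float], snr_db: float, seed: int = 0) -> List[float]:
    """Return the signal plus Gaussian white noise at the requested SNR (dB)."""
    signal_power = mean_power(signal) if signal else 0.0
    if signal_power == 0.0:
        raise ValueError("signal must be non-empty and non-silent")
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for the noise power.
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


# A 440 Hz tone sampled at 16 kHz stands in for clean speech.
clean = [math.sin(2 * math.pi * 440 * t / 16_000) for t in range(1_600)]
noisy = add_white_noise(clean, snr_db=0.0)
```

Sweeping `snr_db` over a range of values and transcribing each noisy copy gives the kind of WER-versus-noise curve described above.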
Building on In-House Hardware
Conformer-2 was trained on our own GPU compute cluster of 80GB A100s. To do this, we deployed a fault-tolerant, highly scalable cluster management and job scheduling system based on Slurm, capable of managing resources in the cluster, recovering from failures, and adding or removing specific nodes. We estimate that Conformer-2’s training speed was ~1.6x faster than it would have been on comparable infrastructure available via cloud providers. Additionally, the ability to train models on our own hardware gives us the flexibility to constantly experiment, inspiring research directions like the ones that lie at the foundation of Conformer-2’s impressive performance.
Conformer-1 extended cutting-edge research by demonstrating a clear relationship between pseudo-labeling wide distributions of data and the resulting robustness to noise. Conformer-2 took that further and produced a much more industry-friendly model via scaling and model ensembling. With Conformer-2 came an emphasis on quantifying what is important to our customers. We plan to continually incorporate feedback and develop more metrics to guide our optimizations towards what matters for real use cases. While this progress has been exciting, bootstrapping strong teacher models was bound to run into an asymptotic limit and stop bearing fruit. We have begun to observe diminishing returns and are already exploring other promising research directions in multimodality and self-supervised learning.
New Features and Generally Available Today
With the Conformer-2 launch, we’re introducing a new API parameter, speech_threshold (read the documentation). The speech_threshold parameter lets users optionally set a threshold for the proportion of speech that must be present in an audio file for it to be processed. Our API will automatically reject audio files whose proportion of speech is lower than the set threshold. This feature is meant to help users control costs with files like sleep podcasts, instrumental music, and empty audio files where transcription is not desired.
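As a rough sketch, a request that sets speech_threshold might be assembled like this. Only the parameter name comes from this post. The endpoint URL, the `audio_url` field, the `authorization` header, and the assumption that the threshold is a fraction between 0 and 1 are all assumptions that should be checked against the API documentation. The request is built but not sent.

```python
import json
import urllib.request
from typing import Optional

# Assumed endpoint; confirm against the API documentation before use.
API_URL = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(
    api_key: str, audio_url: str, speech_threshold: Optional[float] = None
) -> urllib.request.Request:
    """Build (but do not send) a transcription request with an optional threshold."""
    payload = {"audio_url": audio_url}  # assumed field name
    if speech_threshold is not None:
        # Assumption: the threshold is a proportion expressed between 0 and 1.
        if not 0.0 <= speech_threshold <= 1.0:
            raise ValueError("speech_threshold must be between 0 and 1")
        payload["speech_threshold"] = speech_threshold
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


request = build_transcript_request(
    api_key="YOUR_API_KEY",
    audio_url="https://example.com/audio.mp3",
    speech_threshold=0.5,
)
# To send it: urllib.request.urlopen(request)
```

With a threshold of 0.5, a file in which less than half of the audio is speech would be rejected rather than transcribed.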
Conformer-2 is accessible through our API today as our default model. Current users of our API will automatically be switched to Conformer-2 and will start seeing better performance with no changes required on their end.
The easiest way to try Conformer-2 is through our Playground, where you can upload a file or enter a YouTube link to see a transcription in just a few clicks.
If you’re thinking about integrating Conformer-2 into your product, you can reach out to our Sales team with any questions you might have.
Hoffmann, Jordan, et al. "Training compute-optimal large language models." arXiv preprint arXiv:2203.15556 (2022).
Panayotov, Vassil, et al. "Librispeech: An ASR corpus based on public domain audio books." 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 2015, pp. 5206-5210. doi: 10.1109/ICASSP.2015.7178964.
Xie, Qizhe, et al. "Self-training with noisy student improves ImageNet classification." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[I] To calculate this number, we used the number of text tokens and then converted it to hours of audio with the following heuristic: 1 hour of audio = 7,200 words and 1 word = 1.33 tokens.