How does Automatic Language Detection work?

Our Automatic Language Detection (ALD) model analyzes samples of the audio to determine the language spoken. It randomly selects up to 3 clips of 30 seconds each from the middle 50% of the audio duration (between 25% and 75% of the total length).
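This clip-selection step can be sketched roughly as follows. This is a simplified illustration, not our actual implementation; the function name, the uniform sampling of start times, and the fallback to a single clip for short audio are all assumptions.

```python
import random

def select_clip_starts(duration_s, clip_len_s=30.0, max_clips=3, seed=None):
    """Pick up to `max_clips` clip start times (in seconds), each for a clip
    of `clip_len_s`, drawn from the middle 50% of the audio (25%-75% of the
    total duration). Hypothetical sketch of the sampling described above."""
    rng = random.Random(seed)
    window_start = 0.25 * duration_s
    window_end = 0.75 * duration_s
    # Latest start time so a full clip still fits inside the window.
    latest_start = max(window_start, window_end - clip_len_s)
    # If the window is shorter than one clip, fall back to a single clip
    # (an assumption -- the source only says "up to 3 clips").
    n_clips = max_clips if (window_end - window_start) >= clip_len_s else 1
    return sorted(rng.uniform(window_start, latest_start) for _ in range(n_clips))
```

For a 10-minute file (600 seconds), this samples start times between 150 s and 420 s, so every 30-second clip ends by the 450 s mark.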

Each clip is passed through our ALD model, which predicts language probabilities for that clip. The probabilities are then averaged across the clips, and the candidate languages are sorted by their average probability score.
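The averaging and ranking step can be illustrated with a minimal sketch, assuming each clip's prediction is a mapping from language code to probability (the function name and data shapes are illustrative, not part of our API):

```python
def average_language_probs(clip_probs):
    """Average per-clip language probability dicts and rank languages by
    their mean probability, highest first. `clip_probs` is a list of
    {language_code: probability} dicts, one per sampled clip."""
    totals = {}
    for probs in clip_probs:
        for lang, p in probs.items():
            totals[lang] = totals.get(lang, 0.0) + p
    n = len(clip_probs)
    averaged = {lang: total / n for lang, total in totals.items()}
    # Sort languages by descending average probability.
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)
```

For example, clips scored `{"en": 0.9, "es": 0.1}`, `{"en": 0.7, "es": 0.3}`, and `{"en": 0.8, "es": 0.2}` would rank English first with an average score of 0.8.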

This approach helps ensure that language detection is based on a representative sample of the audio, rather than just the beginning or end portions, which may contain greetings, silence, or other non-representative speech.

If you are seeing low confidence scores for transcriptions in a particular language, it may be due to factors such as background noise, strong accents, or poor audio quality. Our transcription models perform best with clear audio recorded in a quiet environment.