Multichannel transcription is supported for all languages, regions, and models.
If you have a multichannel audio file with a different speaker on each channel, you can transcribe each channel separately.
The response includes an audio_channels property with the number of channels in the file, and an utterances property containing a list of turn-by-turn utterances.
Each utterance identifies the channel it came from. Channel numbering starts at 1.
Additionally, each word in the words array contains the channel identifier.
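As a rough illustration of the response shape described above, the sketch below groups utterance text by channel. The sample response is hand-written rather than real API output, and the text field name and the string type of channel are assumptions. Only audio_channels, utterances, words, and the channel identifier come from the description above.

```python
from collections import defaultdict

# Hand-written sample shaped like the response described above.
# "text" and the string-typed "channel" values are assumptions for illustration.
sample_response = {
    "audio_channels": 2,
    "utterances": [
        {
            "channel": "1",
            "text": "Thanks for calling.",
            "words": [
                {"text": "Thanks", "channel": "1"},
                {"text": "for", "channel": "1"},
                {"text": "calling.", "channel": "1"},
            ],
        },
        {
            "channel": "2",
            "text": "Hi there.",
            "words": [
                {"text": "Hi", "channel": "2"},
                {"text": "there.", "channel": "2"},
            ],
        },
        {
            "channel": "1",
            "text": "How can I help?",
            "words": [
                {"text": "How", "channel": "1"},
                {"text": "can", "channel": "1"},
                {"text": "I", "channel": "1"},
                {"text": "help?", "channel": "1"},
            ],
        },
    ],
}


def group_utterances_by_channel(response):
    """Return a dict mapping each channel identifier to its utterance texts, in order."""
    grouped = defaultdict(list)
    for utterance in response.get("utterances", []):
        # str() keeps the keys consistent whether the channel arrives as "1" or 1.
        grouped[str(utterance["channel"])].append(utterance["text"])
    return dict(grouped)


by_channel = group_utterances_by_channel(sample_response)
for channel, texts in sorted(by_channel.items()):
    print(f"Channel {channel}: {' '.join(texts)}")
```

Grouping on the utterance-level channel is enough for a per-channel transcript. The per-word channel identifier is useful when you need word-level alignment instead.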
Multichannel audio increases the transcription time by approximately 40%.
If you have a multichannel audio file where individual channels may contain multiple speakers, you can combine multichannel and speaker_labels to perform diarization within each channel.
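A minimal sketch of a request body that turns on both options might look like the following. The multichannel and speaker_labels names come from the text above. The audio_url field, the example URL, and the build_transcript_request helper are assumptions made for illustration, so check the API reference for the exact request schema.

```python
import json


def build_transcript_request(audio_url):
    """Build a request body enabling per-channel diarization (hypothetical helper)."""
    return {
        "audio_url": audio_url,   # assumed field name for the audio location
        "multichannel": True,     # transcribe each channel separately
        "speaker_labels": True,   # diarize speakers within each channel
    }


payload = build_transcript_request("https://example.com/call.wav")
print(json.dumps(payload, indent=2))
```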
When using multichannel with speaker_labels, the speaker_options parameters (min_speakers_expected and max_speakers_expected) are applied per channel, not globally across the entire file. For example, setting min_speakers_expected: 5 and max_speakers_expected: 7 on a 5-channel file means the model will find 5–7 speakers on each channel, resulting in 25–35 total speakers. Adjust your speaker options accordingly when using multichannel transcription.
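The per-channel arithmetic can be checked with a small helper. This is only a sketch of the multiplication described above, and total_speaker_range is a hypothetical name, not part of any SDK.

```python
def total_speaker_range(num_channels, min_per_channel, max_per_channel):
    """Return (min_total, max_total) speakers when the limits apply per channel."""
    if num_channels < 1:
        raise ValueError("num_channels must be at least 1")
    if min_per_channel > max_per_channel:
        raise ValueError("min_per_channel cannot exceed max_per_channel")
    return num_channels * min_per_channel, num_channels * max_per_channel


# The example from the text: 5 channels with 5 to 7 speakers expected on each.
print(total_speaker_range(5, 5, 7))  # (25, 35)
```

If you instead expect about six speakers across a 3-channel file, limits of 1 and 3 per channel give a total range of 3 to 9, which is closer to the intent than reusing the file-wide numbers.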
When both parameters are enabled, speaker labels take the form {channel}{speaker} (e.g., “1A”, “1B”, “2A”).
For example, if channel 1 has two speakers and channel 2 has one speaker, the labels would be:
1A
1B
2A
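Combined labels of this kind can be split back into a channel number and a speaker letter. The sketch below assumes the label is always one or more digits followed by one or more uppercase letters, which matches the examples above but is not a documented guarantee. Both parse_speaker_label and speakers_per_channel are hypothetical helper names.

```python
import re

# Assumed label shape: channel digits followed by speaker letters, e.g. "1A".
LABEL_PATTERN = re.compile(r"^(\d+)([A-Z]+)$")


def parse_speaker_label(label):
    """Split a combined label such as "1A" into (channel, speaker)."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"unexpected speaker label: {label!r}")
    return int(match.group(1)), match.group(2)


def speakers_per_channel(labels):
    """Map each channel number to the distinct speaker letters seen on it."""
    result = {}
    for label in labels:
        channel, speaker = parse_speaker_label(label)
        speakers = result.setdefault(channel, [])
        if speaker not in speakers:
            speakers.append(speaker)
    return result


print(speakers_per_channel(["1A", "1B", "2A"]))  # {1: ['A', 'B'], 2: ['A']}
```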