Using multichannel transcription and speaker diarization
Learn how multichannel transcription and speaker diarization work, what their outputs look like, when to use each feature, and how you can implement them.



Separating and identifying multiple speakers in audio recordings is essential for creating accurate, organized transcripts. Two techniques that make this possible are multichannel transcription and speaker diarization.
Multichannel transcription, also known as channel diarization, processes audio with separate channels for each speaker, making it easier to isolate individual contributions. Speaker diarization distinguishes speakers in single-channel recordings. Both methods help create structured transcripts that are easy to analyze and use.

In this blog post, we'll explain how multichannel transcription and speaker diarization work, what their outputs look like, and how you can implement them using AssemblyAI.
Understanding multichannel transcription
Multichannel transcription processes audio with multiple separate channels, each capturing input from a distinct source, such as different speakers or devices. This approach isolates each participant's speech, ensuring clarity and accuracy without overlap or confusion.

For instance, conference calls often record each participant's microphone on a separate channel, making it easy to attribute speech to the correct person. In stereo recordings, which have two channels (left and right), multichannel transcription can distinguish between the audio captured on each side, such as an interviewer on the left channel and an interviewee on the right. Similarly, podcast recordings may separate hosts and guests onto individual channels, and customer service calls often use one channel for the customer and another for the agent.
By keeping audio streams distinct, multichannel transcription minimizes background noise, enhances accuracy, and provides clear speaker attribution. It simplifies the transcription process and delivers organized, reliable transcripts that are easy to analyze and use across various applications.
Note: Multichannel audio increases transcription time by approximately 25%.
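If you're not sure whether a recording actually contains separate channels, you can inspect it before transcribing. Here is a minimal sketch using Python's built-in wave module; the file name is just a placeholder, and this check only works for WAV files (for MP3s or other formats you'd need a tool such as ffprobe):
import wave

# Inspect how many channels a WAV recording contains
# (placeholder file name; adjust to your own recording)
with wave.open("call-recording.wav", "rb") as f:
    channels = f.getnchannels()

print(f"{channels} channel(s) detected")  # e.g. 2 for a stereo interview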
Understanding speaker diarization
Speaker diarization identifies and distinguishes speakers within single-channel audio recordings. It answers the question: "Who spoke when?" by segmenting the audio into speaker-specific portions.

Unlike multichannel transcription, where speakers are separated by distinct channels, diarization works within a single audio track to attribute speech segments to individual speakers. Advanced algorithms analyze voice characteristics such as pitch, tone, and cadence to differentiate between participants, even when their speech overlaps or occurs in rapid succession.
This technique is especially valuable in scenarios like recorded meetings, interviews, and panel discussions where speakers share the same recording track. For instance, a single-channel recording of a business meeting can be processed with diarization to label each participant's speech, providing a structured transcript that makes conversations easy to follow.
Important: Speaker diarization and multichannel transcription cannot be used together. Enabling both features will result in an error. Choose the method that best fits your audio recording setup.
By using speaker diarization, you can create clear and organized transcripts without the need for separate audio channels. This ensures accurate speaker attribution, improves usability, and allows for deeper insights into speaker-specific contributions in any audio recording.
Multichannel response
With AssemblyAI, you can transcribe each audio channel independently by configuring the multichannel parameter. See how to implement it in the next section.
Here is an example JSON response for an audio file with two separate channels when multichannel transcription is enabled:
{
  "multichannel": true,
  "audio_channels": 2,
  "utterances": [
    {
      "text": "Here is Laura talking on channel one.",
      "speaker": "1",
      "channel": "1",
      "start": 1000,
      "end": 4500,
      "confidence": 0.98,
      "words": [
        {
          "text": "Here",
          "speaker": "1",
          "channel": "1",
          "start": 1000,
          "end": 1350,
          "confidence": 0.99
        },
        {
          "text": "is",
          "speaker": "1",
          "channel": "1",
          "start": 1400,
          "end": 1550,
          "confidence": 0.98
        }
      ]
    },
    {
      "text": "And here is Alex talking on channel two.",
      "speaker": "2",
      "channel": "2",
      "start": 5000,
      "end": 8200,
      "confidence": 0.97,
      "words": [
        {
          "text": "And",
          "speaker": "2",
          "channel": "2",
          "start": 5000,
          "end": 5250,
          "confidence": 0.99
        },
        {
          "text": "here",
          "speaker": "2",
          "channel": "2",
          "start": 5300,
          "end": 5600,
          "confidence": 0.98
        }
      ]
    }
  ]
}
The response contains the multichannel field set to true, and the audio_channels field indicating how many channels the file contains.
The important part is in the utterances field. This field contains an array of individual speech segments, each containing the details of one continuous utterance from a speaker. For each utterance, a unique identifier for the speaker (e.g., 1, 2) and the channel number are provided.
Additionally, the words field is provided, containing an array of information about each word, again with speaker and channel information.
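As a quick illustration of how you might consume this structure, here is a sketch that groups the utterance texts by channel. It assumes the JSON above has already been parsed into a Python dict called response (for example via requests' response.json()):
from collections import defaultdict

# `response` is the parsed JSON shown above
utterances_by_channel = defaultdict(list)

for utt in response["utterances"]:
    utterances_by_channel[utt["channel"]].append(utt["text"])

for channel, texts in sorted(utterances_by_channel.items()):
    print(f"Channel {channel}:")
    for text in texts:
        print(f"  {text}")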
How to implement multichannel transcription with AssemblyAI
You can use the API or one of the AssemblyAI SDKs to implement multichannel transcription (see developer documentation).
Let's see how to use multichannel transcription with the AssemblyAI Python SDK:
import assemblyai as aai

# Authenticate with your AssemblyAI API key
aai.settings.api_key = "YOUR_API_KEY"

audio_file = "./multichannel-example.mp3"
config = aai.TranscriptionConfig(multichannel=True)
transcript = aai.Transcriber().transcribe(audio_file, config)

print(f"Number of audio channels: {transcript.json_response['audio_channels']}")
for utt in transcript.utterances:
    print(f"Speaker {utt.speaker}, Channel {utt.channel}: {utt.text}")
To enable multichannel transcription in the Python SDK, set multichannel to True in your TranscriptionConfig. Then, create a Transcriber object and call the transcribe function with the audio file and the configuration.
When the transcription process is finished, we can print the number of audio channels and iterate over the separate utterances while accessing the speaker identifier, the channel, and the text of each utterance.
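The start and end values are millisecond offsets, so you can also turn the utterances into a timestamped transcript. A small sketch building on the transcript object from the example above:
def ms_to_clock(ms: int) -> str:
    # Convert a millisecond offset into an MM:SS label
    seconds = ms // 1000
    return f"{seconds // 60:02d}:{seconds % 60:02d}"

# Builds on the `transcript` object from the example above
for utt in transcript.utterances:
    print(f"[{ms_to_clock(utt.start)}] Channel {utt.channel} (Speaker {utt.speaker}): {utt.text}")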
Speaker diarization response
The API also supports speaker diarization by configuring the speaker_labels parameter. You'll see how to implement it in the next section.
Here is an example JSON response for a monochannel audio file when speaker_labels is enabled:
{
  "multichannel": null,
  "audio_channels": null,
  "utterances": [
    {
      "text": "Today, I'm joined by Alex. Welcome, Alex!",
      "speaker": "A",
      "channel": null,
      "start": 1000,
      "end": 4500,
      "confidence": 0.98,
      "words": [
        {
          "text": "Today",
          "speaker": "A",
          "channel": null,
          "start": 1000,
          "end": 1400,
          "confidence": 0.99
        },
        {
          "text": "I'm",
          "speaker": "A",
          "channel": null,
          "start": 1450,
          "end": 1650,
          "confidence": 0.97
        }
      ]
    },
    {
      "text": "I'm excited to be here!",
      "speaker": "B",
      "channel": null,
      "start": 5000,
      "end": 7200,
      "confidence": 0.96,
      "words": [
        {
          "text": "I'm",
          "speaker": "B",
          "channel": null,
          "start": 5000,
          "end": 5200,
          "confidence": 0.98
        },
        {
          "text": "excited",
          "speaker": "B",
          "channel": null,
          "start": 5250,
          "end": 5750,
          "confidence": 0.95
        }
      ]
    }
  ]
}
The response is similar to a multichannel response, with utterances and words fields that include a speaker label (e.g., "A", "B").
The difference from a multichannel transcription response is that speakers are labeled with letters ("A", "B", etc.) rather than channel numbers, and that the multichannel, audio_channels, and channel fields don't contain values.
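Because each utterance carries a speaker label along with start and end timestamps in milliseconds, you can also derive simple per-speaker statistics. A minimal sketch, assuming the JSON above has been parsed into a Python dict called response:
from collections import defaultdict

# `response` is the parsed diarization JSON shown above
talk_time_ms = defaultdict(int)

for utt in response["utterances"]:
    talk_time_ms[utt["speaker"]] += utt["end"] - utt["start"]

for speaker, ms in sorted(talk_time_ms.items()):
    print(f"Speaker {speaker}: {ms / 1000:.1f} seconds of speech")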
How to implement speaker diarization with AssemblyAI
Speaker diarization is also supported through the API or one of the AssemblyAI SDKs (see developer documentation).
Here's how to implement speaker diarization with the Python SDK:
import assemblyai as aai

# Authenticate with your AssemblyAI API key
aai.settings.api_key = "YOUR_API_KEY"

audio_file = "./monochannel-example.mp3"
config = aai.TranscriptionConfig(speaker_labels=True)
# If you know the number of speakers in advance:
# config = aai.TranscriptionConfig(speaker_labels=True, speakers_expected=2)
transcript = aai.Transcriber().transcribe(audio_file, config)

for utt in transcript.utterances:
    print(f"Speaker {utt.speaker}: {utt.text}")
To enable speaker diarization in the Python SDK, set speaker_labels to True in your TranscriptionConfig. Optionally, if you know the number of speakers in advance, you can improve the diarization performance by setting the speakers_expected parameter.
Then, create a Transcriber object and call the transcribe function with the audio file and the configuration. When the transcription process is finished, we can again iterate over the separate utterances while accessing the speaker label and the text of each utterance.
The code is similar to the multichannel example except that it enables speaker_labels instead of multichannel, and the result does not contain audio_channels or channel information.
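Since the labels "A", "B", and so on are generic, a common follow-up step is mapping them to real names once you know who is who. A small sketch of that idea, reusing the transcript object from above (the name mapping here is purely illustrative):
# Illustrative mapping from diarization labels to known participants
speaker_names = {"A": "Laura", "B": "Alex"}

for utt in transcript.utterances:
    name = speaker_names.get(utt.speaker, f"Speaker {utt.speaker}")
    print(f"{name}: {utt.text}")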
How speaker diarization works
Speaker diarization identifies speakers in single-channel audio by analyzing voice characteristics and organizing speech into distinct segments. For developers working with multi-speaker audio, understanding this process helps optimize implementation and troubleshoot results.
The process involves three key steps that work together to separate speakers:
- Segmentation: The audio is divided into time-based segments based on acoustic changes like pauses, tone shifts, or pitch variations. This creates boundaries where one speaker stops and another begins.
- Voice analysis: Each segment is processed to extract unique voice characteristics including pitch, timbre, and vocal texture. These characteristics create a numerical "voice fingerprint" for each speaker.
- Speaker grouping: Similar voice fingerprints are grouped together using clustering algorithms. Each group represents a distinct speaker, allowing the system to label speech segments accordingly.
This approach enables accurate speaker identification even when voices overlap or change throughout the recording, making it effective for real-world audio scenarios.
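To make the grouping step more concrete, here is a toy sketch of the clustering idea using made-up two-dimensional "voice fingerprints" and scikit-learn. Real diarization systems work with high-dimensional speaker embeddings learned by neural networks, so this illustrates the principle rather than any particular implementation:
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up 2-D "voice fingerprints", one per audio segment.
# Segments 0, 2, 4 come from one voice; segments 1, 3 from another.
segment_embeddings = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.85, 0.15],
    [0.25, 0.75],
    [0.88, 0.12],
])

# Group similar fingerprints together; each cluster represents one speaker
clustering = AgglomerativeClustering(n_clusters=2)
labels = clustering.fit_predict(segment_embeddings)

for i, label in enumerate(labels):
    print(f"Segment {i} -> Speaker {chr(ord('A') + label)}")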
Choosing between multichannel and speaker diarization
Deciding between multichannel transcription and speaker diarization depends on the structure of your audio and your specific needs. Both approaches are effective for separating and identifying speakers, but they are suited to different scenarios.
When to use multichannel transcription
Multichannel transcription is ideal when your recording setup allows for distinct audio channels for each speaker or source. For example, conference calls, podcast recordings, or customer service calls often produce multichannel audio files. With each speaker recorded on a separate channel, transcription becomes straightforward, as there's no need to differentiate speakers within a single track. Multichannel transcription ensures clarity, reduces overlap issues, and is particularly useful when high accuracy is required.
When to use speaker diarization
Speaker diarization is the better choice for single-channel recordings where all speakers share the same audio track. This technique is commonly applied in scenarios like in-person interviews, panel discussions, or courtroom recordings. Diarization uses advanced algorithms to differentiate speakers, making it effective when you don't have the option to record each participant on their own channel.
Making the right choice
If your recording setup supports separate channels for each speaker, multichannel transcription is generally the more precise and efficient option.
However, if your audio is limited to a single channel or includes overlapping voices, speaker diarization is essential for creating structured and accurate transcripts.
Ultimately, the choice depends on the recording setup and the level of detail needed for the transcript.
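If you want to automate that decision, one option is to derive the configuration from the channel count of the file. A minimal sketch, assuming a WAV input (so Python's built-in wave module can read the channel count) and one speaker per channel in the multichannel case:
import wave

import assemblyai as aai

def config_for(audio_path: str) -> aai.TranscriptionConfig:
    """Pick multichannel or speaker diarization based on the channel count."""
    with wave.open(audio_path, "rb") as f:
        channels = f.getnchannels()

    if channels > 1:
        # One speaker per channel: transcribe each channel independently
        return aai.TranscriptionConfig(multichannel=True)
    # Single channel: let the model separate speakers by voice
    return aai.TranscriptionConfig(speaker_labels=True)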
Conclusion
Both multichannel transcription and speaker diarization solve the challenge of organizing multi-speaker audio. Multichannel works with separate audio channels, while speaker diarization handles single-channel recordings with multiple speakers.
With AssemblyAI, implementing either approach requires just a simple parameter change—multichannel for channel-separated audio or speaker_labels for single-channel recordings. Both methods deliver structured transcripts that make multi-speaker conversations easy to analyze and use.