Speaker diarization (diarisation) is the process of splitting audio or video inputs automatically based on the speaker's identity. It helps you answer the question "who spoke when?".
Thanks to advances in deep learning over the last few years, it is now possible to verify and identify speakers automatically and with confidence.
Industries like media monitoring, telephony, podcasting, telemedicine, and web conferencing almost always have audio and video with multiple speakers. These same industries, which are heavily affected by the progress of automated transcription, rely on speaker diarization to fully replace human transcription in their workflows.
Speaker diarization, in combination with state-of-the-art transcription accuracy, has the potential to unlock a tremendous amount of value for any mono-channel recording. If you're interested in testing this out right now, check out our API Docs on Speaker Diarization.
To learn more about how Speech-to-Text works, see our guide on building an end-to-end model in PyTorch.
How It Works
In the past, i-vector-based audio embedding techniques were used for speaker verification and diarization. However, with recent breakthroughs in deep learning, neural network-based audio embeddings (also known as d-vectors) have proven more effective.
More specifically, combining LSTM-based d-vector audio embeddings with nonparametric clustering yields a state-of-the-art speaker diarization system.
Infrastructure of Speaker Diarization
 Speech Detection: Use a Voice Activity Detector (VAD) to identify speech and remove noise.
 Speech Segmentation: Extract short sliding windows from the audio and run an LSTM network to produce a d-vector for each window.
 Embedding Extraction: Aggregate the d-vectors of the windows belonging to each segment to produce segment-wise embeddings.
 Clustering: Cluster the segment-wise embeddings to produce the diarization results. We integrated two clustering algorithms into our diarization system to determine the number of speakers and each speaker's timestamps.
- Offline Clustering: speaker labels are produced after the embeddings of all segments are available.
  - K-Means Clustering: use K-Means++ for initialization. To determine the number of speakers k, we use the "elbow" of the derivatives of the conditional Mean Squared Cosine Distances (MSCD) between each embedding and its cluster centroid.
  - Spectral Clustering: 1) construct the affinity matrix, 2) perform refinement operations on it, 3) perform eigen-decomposition on the refined affinity matrix, 4) use K-Means to cluster the resulting new embeddings and produce speaker labels.
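The embedding aggregation step and the K-Means branch described above can be sketched in plain Python. This is a minimal illustration, not the production system: it assumes the window-level d-vectors have already been computed by some network, every function name is hypothetical, and reading the "elbow" as the k with the largest second difference in MSCD is one possible interpretation of that rule.

```python
import math
import random


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def average_embedding(vectors):
    """L2-normalize each vector, average them, and re-normalize the mean.

    Used both to turn a segment's window d-vectors into one segment
    embedding and to recompute cluster centroids.
    """
    unit = [l2_normalize(v) for v in vectors]
    return l2_normalize([sum(col) / len(unit) for col in zip(*unit)])


def cosine_distance(a, b):
    # Both inputs are unit length, so the dot product is the cosine similarity.
    return 1.0 - sum(x * y for x, y in zip(a, b))


def kmeans(points, k, rng, iterations=20):
    """Cosine K-Means with K-Means++ style seeding; returns (centroids, labels)."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Favor points that are far from every centroid chosen so far.
        weights = [min(cosine_distance(p, c) for c in centroids) ** 2 + 1e-12
                   for p in points]
        centroids.append(rng.choices(points, weights=weights)[0])
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: cosine_distance(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = average_embedding(members)
    return centroids, labels


def mscd(points, centroids):
    """Mean squared cosine distance from each point to its nearest centroid."""
    return sum(min(cosine_distance(p, c) for c in centroids) ** 2
               for p in points) / len(points)


def estimate_num_speakers(points, max_k, seed=0):
    """Pick k at the elbow of the MSCD curve (requires max_k >= 3)."""
    rng = random.Random(seed)
    scores = [mscd(points, kmeans(points, k, rng)[0])
              for k in range(1, max_k + 1)]
    # scores[k - 1] is MSCD(k). The elbow is taken as the k where the drop
    # in MSCD slows down the most, i.e. the largest second difference.
    best_k, best_bend = 2, float("-inf")
    for k in range(2, max_k):
        bend = (scores[k - 2] - scores[k - 1]) - (scores[k - 1] - scores[k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k
```

Given a list of unit-length segment embeddings, `estimate_num_speakers(embeddings, max_k=4)` returns the estimated speaker count, and `kmeans(embeddings, k, random.Random(0))` returns one speaker label per segment. Note that this simple elbow rule can never return a single speaker, so a real system would need a separate check for that case.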
Active Areas of Research
- Neural network-based clustering techniques - for example, UIS-RNN
- Better handling of crosstalk when multiple speakers talk at the same time - for example, source separation
- Improving the ability to detect the number of speakers in the audio/video file when there are many speakers
- Better handling of noisy audio files with high levels of background noise, music, or other channel disturbances
How to enable Speaker Diarization with AssemblyAI
AssemblyAI can automatically detect the number of speakers in your audio file, and each word in the transcription text can be associated with its speaker.
The code sample below shows how to submit an audio or video file for transcription in Python with Speaker Labels turned on. More code samples in other programming languages are available in our API docs.
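A minimal sketch of such a submission, using only the Python standard library, might look like the following. It assumes the v2 REST endpoint `https://api.assemblyai.com/v2/transcript`, an `authorization` header carrying your API key, and a `speaker_labels` flag in the JSON body; check the API docs for the current parameter names. The helper names and the example URL are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint; confirm against the API docs.
API_URL = "https://api.assemblyai.com/v2/transcript"


def build_submit_request(api_key, audio_url):
    """Build (but do not send) the POST request that starts a transcription."""
    payload = {
        "audio_url": audio_url,   # publicly reachable audio or video file
        "speaker_labels": True,   # assumed flag that turns diarization on
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


def submit(api_key, audio_url):
    """Send the request and return the ID of the new transcript job."""
    request = build_submit_request(api_key, audio_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["id"]
```

Calling `submit("YOUR_API_KEY", "https://example.com/meeting.mp3")` would return the transcript ID that later requests refer to.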
Once you submit your file for processing, you can get the transcription result by making a GET request.
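As a rough sketch, fetching the result could look like the code below. It assumes the transcript is available at the same v2 endpoint followed by the transcript ID, and that the response carries a `status` field that ends up as `completed` or `error`. The helper names are hypothetical, and the polling loop takes a `fetch` callable so it can be exercised without network access.

```python
import json
import time
import urllib.request

# Assumed endpoint; confirm against the API docs.
API_URL = "https://api.assemblyai.com/v2/transcript"


def build_status_request(api_key, transcript_id):
    """Build the GET request for one transcript."""
    return urllib.request.Request(
        f"{API_URL}/{transcript_id}",
        headers={"authorization": api_key},
        method="GET",
    )


def fetch_transcript(api_key, transcript_id):
    """Send the GET request and decode the JSON body."""
    request = build_status_request(api_key, transcript_id)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_transcript(fetch, poll_seconds=3.0):
    """Call fetch() until the job has finished, then return the result."""
    while True:
        result = fetch()
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(result.get("error", "transcription failed"))
        time.sleep(poll_seconds)  # still queued or processing
```

In practice the two pieces combine as `wait_for_transcript(lambda: fetch_transcript(api_key, transcript_id))`.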
You'll get a JSON response like the one below. The "utterances" key contains a list of turn-by-turn utterances in the order they appeared in the audio recording.
A "turn" is a change of speaker during the conversation. For example, Speaker A says "hello" (turn 1) and then Speaker B says "hi" (turn 2).
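To make the turn-by-turn structure concrete, here is a small sketch that prints one line per turn. The `example_response` dictionary is a hand-written, heavily trimmed illustration of the two-turn example above, not real API output, and it assumes each utterance carries `speaker`, `text`, `start`, and `end` fields with times in milliseconds.

```python
# Illustrative, trimmed response: values are made up for this example.
example_response = {
    "status": "completed",
    "utterances": [
        {"speaker": "A", "text": "hello", "start": 0, "end": 600},
        {"speaker": "B", "text": "hi", "start": 900, "end": 1300},
    ],
}


def format_turns(transcript):
    """Return one readable line per speaker turn."""
    lines = []
    for turn in transcript.get("utterances") or []:
        start_seconds = turn["start"] / 1000  # assumed to be milliseconds
        lines.append(
            f"[{start_seconds:.1f}s] Speaker {turn['speaker']}: {turn['text']}"
        )
    return lines


def print_turns(transcript):
    for line in format_turns(transcript):
        print(line)
```

Calling `print_turns(example_response)` prints `[0.0s] Speaker A: hello` followed by `[0.9s] Speaker B: hi`.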
Speaker Diarization Use Cases
Automatically label both the <patient> and <doctor> on appointment recordings to make them more readable and useful. Better tagging and indexing simplify importing into a healthcare ERP or database and make it possible to trigger actions like follow-up visits and prescription refills.
Automatically label multiple speakers on a conference call recording to make transcriptions more useful. This allows sales and support coaching platforms to analyze and display transcripts split by the <agent> and the <customer> in their interface to make search and navigation simpler. This can also help trigger actions like follow-ups, ticket status changes, and more.
Automatically label the podcast <host> and <guest> on a recording to bring transcripts to life without needing to listen to the audio or video. This is especially important to podcasters, as most files are recorded on a single mono channel and almost always include more than one speaker. Podcast hosting platforms can use transcripts to drive better SEO, improve search/navigation, and provide insights podcasters may not otherwise have access to.
Automatically label the <recruiter> and <applicant> on a hiring interview recording. Applicant Tracking Systems have customers who rely heavily on phone and video calls to recruit their applicants. Speaker diarization allows these platforms to separate what recruiters ask from how applicants respond without having to listen to the audio or video. This can also help trigger actions like applicant follow-ups and moving them to the next stage in the hiring process.
Automatically label multiple <speakers> on a video recording to make automated captions more useful. Video hosting platforms can now index files for better search, provide better accessibility and navigation for viewers, and create more useful content for SEO.
Automatically label multiple <guests> and <hosts> on broadcast radio or TV recordings for more precise search and analytics around keywords. Media monitoring platforms can now provide more insights to their customers by labeling which speaker mentioned their keyword (e.g. Coca-Cola). This also allows them to provide better indexing, search, and navigation for individual recording playback.