Speaker Diarization: Speaker Labels for Mono Channel Files

Introduction

Speaker diarization (diarisation) is the process of splitting audio or video inputs automatically based on the speaker's identity. It helps you answer the question "who spoke when?".

(Figure 1) Overview of diarization input/output on a monochannel file

With the advances in deep learning over the last few years, it is now possible to verify and identify speakers automatically, and with confidence.

Audio and video in industries like media monitoring, telephony, podcasting, telemedicine, and web conferencing almost always contain multiple speakers. These same industries, which are heavily impacted by advances in automated transcription, rely on speaker diarization to fully replace human transcription in their workflows.

Combined with state-of-the-art transcription accuracy, speaker diarization has the potential to unlock a tremendous amount of value from any mono-channel recording. If you're interested in testing this out right now, check out our API Docs on Speaker Diarization.

To learn more about how Speech-to-Text works, see our guide to building an end-to-end model in PyTorch here.

How it Works

In the past, i-vector-based audio embedding techniques were used for speaker verification and diarization. However, with recent breakthroughs in deep learning, neural network-based audio embeddings (also known as d-vectors) have proven to be the more effective approach.

More specifically, combining LSTM-based d-vector audio embeddings with nonparametric clustering yields a state-of-the-art speaker diarization system.

(Figure 2) DER (%) on two English-only datasets for i-vector and d-vector embeddings and clustering algorithms
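
For readers unfamiliar with the metric in Figure 2, DER (Diarization Error Rate) is commonly defined as the share of reference speech time that is attributed incorrectly. The formula below is the widely used general definition, not something taken from this article's own evaluation setup, so details such as scoring collars or overlap handling are left out.

```latex
\mathrm{DER} =
  \frac{T_{\text{false alarm}} + T_{\text{missed speech}} + T_{\text{speaker confusion}}}
       {T_{\text{total reference speech}}} \times 100\%
```

Lower is better: a DER of 10% means that roughly one tenth of the speech time was either missed, falsely detected, or assigned to the wrong speaker.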

Infrastructure of Speaker Diarization

(Figure 3) Flowchart of a d-vector based diarization system (post VAD)

1) Speech Detection: Use a Voice Activity Detector (VAD) to identify speech and discard noise.

(Figure 4) Voice Activity Detection (VAD) on noisy speech data
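
To make the idea of speech detection concrete, here is a minimal sketch of an energy-based VAD. It is only an illustration: the frame length, the fixed threshold, and the function names are assumptions made for this example, and production detectors (including the one used in the system described here) are typically model-based rather than a simple energy threshold.

```python
import math

def frame_energies(samples, frame_len):
    """Return the RMS energy of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies

def energy_vad(samples, frame_len=160, threshold=0.1):
    """Flag a frame as speech when its RMS energy exceeds a fixed threshold.

    frame_len and threshold are illustrative values, not tuned settings.
    """
    return [energy > threshold for energy in frame_energies(samples, frame_len)]

# Toy signal: two quiet frames, two loud frames, one quiet frame.
quiet = [0.01] * 160
loud = [0.5 if i % 2 == 0 else -0.5 for i in range(160)]
signal = quiet * 2 + loud * 2 + quiet

speech_flags = energy_vad(signal)
print(speech_flags)
```

Only the frames flagged as speech would be passed on to the later steps.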

2) Speech Segmentation: Extract short sliding windows from the audio and run an LSTM network to produce a d-vector for each window.

(Figure 5) d-vector visualization from a fixed length segment
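
The sliding-window part of this step can be sketched as follows. The window length and step size are made-up values for illustration, and the LSTM that would turn each window into a d-vector is not shown.

```python
def sliding_windows(num_frames, window_len, step):
    """Return (start, end) frame indices for every full window.

    A step smaller than window_len produces overlapping windows.
    """
    return [
        (start, start + window_len)
        for start in range(0, num_frames - window_len + 1, step)
    ]

# Illustrative values only: 100 feature frames, 24-frame windows, 50% overlap.
windows = sliding_windows(num_frames=100, window_len=24, step=12)
print(len(windows), windows[0], windows[-1])
```

Each (start, end) pair marks the slice of features that would be fed to the network to produce one d-vector.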

3) Embedding Extraction: Aggregate the d-vectors of the windows belonging to each segment to produce segment-wise embeddings.
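
One simple way to aggregate window-level d-vectors into a segment embedding is to L2-normalize them, average them, and normalize the result again. The sketch below assumes this averaging scheme and uses tiny 2-dimensional vectors for readability. Real d-vectors have many more dimensions, and the exact aggregation used in the system described here may differ.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def aggregate_segment(window_dvectors):
    """Average the normalized d-vectors of one segment, then re-normalize."""
    normalized = [l2_normalize(v) for v in window_dvectors]
    dim = len(normalized[0])
    mean = [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)

# Two toy d-vectors that belong to the same segment.
segment_embedding = aggregate_segment([[3.0, 4.0], [4.0, 3.0]])
print(segment_embedding)
```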

4) Clustering: Cluster the segment-wise embeddings to produce the diarization results. This step determines the number of speakers and each speaker's timestamps, using the two clustering algorithms that we integrated into our diarization system.

  • Offline clustering: Speaker labels are produced after the embeddings of all segments are available
--> K-Means: use K-Means++ for initialization. To determine the number of speakers k, we use the “elbow” of the derivatives of the conditional Mean Squared Cosine Distances (MSCD) between each embedding and its cluster centroid
--> Spectral: 1) construct the affinity matrix, 2) perform refinement operations on it, 3) perform eigen-decomposition on the refined affinity matrix, 4) use K-Means to cluster these new embeddings and produce speaker labels
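
The K-Means variant above can be sketched with NumPy as follows. This is a rough illustration under several assumptions: the embeddings are unit-normalized, cosine distance is used throughout, and the "elbow" is read as the first k after which adding another cluster barely lowers the MSCD (controlled by a hypothetical `tol` parameter). The exact elbow rule used in the production system is not specified here.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def kmeans_pp_init(x, k, rng):
    """K-Means++ seeding: favor points far from the centroids chosen so far."""
    centroids = [x[rng.integers(len(x))]]
    for _ in range(1, k):
        # Cosine distance to the nearest chosen centroid. On unit vectors this is
        # proportional to squared Euclidean distance, the usual K-Means++ weight.
        dist = np.clip(np.min(1.0 - x @ np.array(centroids).T, axis=1), 0.0, None)
        centroids.append(x[rng.choice(len(x), p=dist / dist.sum())])
    return np.array(centroids)

def kmeans_cosine(x, k, rng, iters=20):
    """Cluster unit vectors by cosine similarity; return labels and the MSCD."""
    centroids = kmeans_pp_init(x, k, rng)
    for _ in range(iters):
        labels = np.argmax(x @ centroids.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
        centroids = normalize(centroids)
    labels = np.argmax(x @ centroids.T, axis=1)
    # Mean Squared Cosine Distance between each embedding and its centroid.
    mscd = float(np.mean((1.0 - np.sum(x * centroids[labels], axis=1)) ** 2))
    return labels, mscd

def estimate_num_speakers(x, max_k, tol=0.05, seed=0):
    """Return the first k where going to k + 1 barely improves the MSCD."""
    rng = np.random.default_rng(seed)
    mscd = [kmeans_cosine(x, k, rng)[1] for k in range(1, max_k + 1)]
    drops = -np.diff(mscd)  # drops[i] = MSCD(k = i + 1) - MSCD(k = i + 2)
    for i, drop in enumerate(drops):
        if drop <= tol * mscd[0]:
            return i + 1
    return max_k

# Toy data: three well-separated "speakers", ten segment embeddings each.
data_rng = np.random.default_rng(42)
embeddings = normalize(
    np.repeat(np.eye(3), 10, axis=0) + 0.01 * data_rng.standard_normal((30, 3))
)
num_speakers = estimate_num_speakers(embeddings, max_k=5)
print(num_speakers)
```

On real recordings the MSCD curve is far less clean than in this toy data, which is one reason estimating the number of speakers remains difficult.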

(Figure 6) refinement operations on the affinity matrix
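
The four spectral clustering steps can be sketched as below. Several pieces are assumptions made for illustration: the affinity is a rescaled cosine similarity, the refinement is reduced to row-wise thresholding followed by symmetrization (the full set of operations in Figure 6 is not reproduced), the number of speakers is passed in rather than estimated, and a small deterministic K-Means stands in for a library implementation.

```python
import numpy as np

def refine(affinity, keep_fraction=0.5, shrink=0.01):
    """Assumed refinement: damp the weaker entries of each row, then symmetrize."""
    refined = affinity.copy()
    for row in refined:  # each row is a view, so edits happen in place
        cutoff = np.quantile(row, 1.0 - keep_fraction)
        row[row < cutoff] *= shrink
    return np.maximum(refined, refined.T)

def simple_kmeans(points, k, iters=20):
    """Plain K-Means with farthest-point seeding (a deterministic stand-in)."""
    centroids = [points[0]]
    for _ in range(1, k):
        nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(points[np.argmax(nearest)])
    centroids = np.array(centroids)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels

def spectral_cluster(embeddings, num_speakers):
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = 0.5 * (1.0 + unit @ unit.T)        # 1) cosine affinity in [0, 1]
    refined = refine(affinity)                    # 2) refinement operations
    _, eigvecs = np.linalg.eigh(refined)          # 3) eigen-decomposition (ascending)
    spectral_embeddings = eigvecs[:, -num_speakers:]  # top eigenvectors
    return simple_kmeans(spectral_embeddings, num_speakers)  # 4) K-Means

# Toy data: five segments from one speaker followed by five from another.
rng = np.random.default_rng(0)
base = np.repeat(np.eye(2), 5, axis=0)
speaker_labels = spectral_cluster(base + 0.01 * rng.standard_normal((10, 2)), 2)
print(speaker_labels)
```

The refinement step is what sharpens the block structure of the affinity matrix, so that the leading eigenvectors separate the speakers cleanly before K-Means is applied.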

Active areas of research

  • Neural network-based clustering techniques, for example UIS-RNN
  • Better handling of crosstalk when multiple speakers talk at the same time, for example through source separation
  • Improving the ability to detect the number of speakers in the audio/video file when there are many speakers
  • Improved handling of noisy audio files with high levels of background noise, music, or other channel disturbances

How to enable Speaker Diarization with AssemblyAI

AssemblyAI can automatically detect the number of speakers in your audio file, and each word in the transcription text can be associated with its speaker.

Submit an audio or video file for transcription with Speaker Labels (Python)

View code samples in more programming languages here

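import requests

# Transcript endpoint: POST here to submit a new transcription job.
endpoint = "https://api.assemblyai.com/v2/transcript"

# Request body: the file to transcribe, with speaker labels turned on.
# Named "payload" rather than "json" to avoid shadowing the standard json module.
payload = {
    "audio_url": "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    "speaker_labels": True
}

# Replace YOUR-API-TOKEN with your own API token.
headers = {
    "authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}

response = requests.post(endpoint, json=payload, headers=headers)

# Print the JSON body returned by the API.
print(response.json())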

Get the transcription result (Python)

import requests

endpoint = "https://api.assemblyai.com/v2/transcript/YOUR-TRANSCRIPT-ID-HERE"

headers = {
    "authorization": "YOUR-API-TOKEN",
}

response = requests.get(endpoint, headers=headers)

print(response.json())
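
Transcription is not instantaneous, so in practice the GET request above is usually repeated until the job finishes. The helper below is a minimal polling sketch. The function name, the polling interval, and the "queued", "processing", and "error" status values are assumptions for illustration (only "completed" appears in the sample response in this article). A fake fetch function is used so the example runs without network access.

```python
import time

def wait_for_transcript(fetch, poll_seconds=3.0, max_polls=100):
    """Call fetch() until the transcript is completed or has failed.

    fetch is any zero-argument callable returning the transcript JSON as a dict.
    """
    for _ in range(max_polls):
        result = fetch()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "error":  # assumed failure status
            raise RuntimeError(result.get("error", "transcription failed"))
        time.sleep(poll_seconds)
    raise TimeoutError("transcript was not ready in time")

# Stand-in for the real request so the sketch runs offline.
fake_responses = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "completed", "id": "demo"},
])
result = wait_for_transcript(lambda: next(fake_responses), poll_seconds=0)
print(result["status"])
```

With the real API, `fetch` could be something like `lambda: requests.get(endpoint, headers=headers).json()`, reusing the endpoint and headers from the snippet above.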

You'll get a JSON response like the one below. The "utterances" key contains a list of turn-by-turn utterances, in the order they appeared in the audio recording.

A "turn" is a change of speaker during the conversation. For example, Speaker A says "hello" (turn 1) and then Speaker B says "hi" (turn 2).

{
    "acoustic_model": "assemblyai_default",
    "audio_duration": 150.766167800454,
    "audio_url": "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    "confidence": 0.922175805047867,
    "dual_channel": true,
    "format_text": true,
    "id": "5552830-d8b1-4e60-a2b4-bdfefb3130b3",
    "language_model": "assemblyai_default",
    "punctuate": true,
    "status": "completed",
    "text": "Hi, I'm joy. Hi, I'm sharon. Do you have kids in school. ...",
    # the "utterances" key below is a list of the turn-by-turn utterances found in the audio
    "utterances": [
        {
            # speakers will be marked as "A" through "Z"
            "speaker": "A",
            "confidence": 0.97,
            "end": 1380,
            "start": 0,
            # the text for the entire speaker "turn"
            "text": "Hi, I'm joy.",
            # the individual words from the speaker "turn"
            "words": [
                {
                    "speaker": "A",
                    "confidence": 1.0,
                    "end": 320,
                    "start": 0,
                    "text": "Hi,"
                },
                ...
            ]
        },
        # the next "turn" by speaker "B" - for example
        {
            "speaker": "B",
            "confidence": 0.94,
            "end": 3260,
            "start": 0,
            "text": "Hi, I'm sharon.",
            "words": [
                {
                    "speaker": "B",
                    "confidence": 1.0,
                    "end": 480,
                    "start": 0,
                    "text": "Hi,"
                },
                ...
            ]
        },
        {
            "speaker": "A",
            "confidence": 0.94,
            "end": 5420,
            "start": 2820,
            "text": "Do you have kids in school.",
            "words": [
                {
                    "speaker": "A",
                    "confidence": 1.0,
                    "end": 4300,
                    "start": 2820,
                    "text": "Do"
                },
                ...
            ]
        },
    ],
    # all of the words found in the audio across all speakers
    "words": [
        {
            "speaker": "A",
            "confidence": 1.0,
            "end": 320,
            "start": 0,
            "text": "Hi,"
        },
        {
            "speaker": "A",
            "confidence": 1.0,
            "end": 720,
            "start": 320,
            "text": "do"
        },
        ...
    ]
}
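
A response shaped like the one above can be turned into a readable, speaker-labeled transcript with a few lines of Python. The sketch below assumes the "start" values are in milliseconds (which is consistent with the sample, but is an inference), and `format_turns` is a hypothetical helper name. The sample data is trimmed from the response above.

```python
def format_turns(transcript):
    """Return one 'Speaker X: text' line per utterance, in order."""
    lines = []
    for utterance in transcript.get("utterances") or []:
        start_seconds = utterance["start"] / 1000  # assumed to be milliseconds
        lines.append(
            f"[{start_seconds:.2f}s] Speaker {utterance['speaker']}: {utterance['text']}"
        )
    return lines

# Trimmed copy of the sample response shown above.
sample = {
    "status": "completed",
    "utterances": [
        {"speaker": "A", "start": 0, "end": 1380, "text": "Hi, I'm joy."},
        {"speaker": "B", "start": 0, "end": 3260, "text": "Hi, I'm sharon."},
        {"speaker": "A", "start": 2820, "end": 5420, "text": "Do you have kids in school."},
    ],
}

lines = format_turns(sample)
print("\n".join(lines))
```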

Speaker Diarization Use Cases

Telemedicine

Automatically label both the <patient> and <doctor> on appointment recordings to make them more readable and useful. Importing into a healthcare ERP or database is simplified with better tagging/indexing and the ability to trigger actions like follow-up visits and prescription refills.

(Figure 7) Speaker diarization example for a doctor's appointment recording
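
Note that the API response labels speakers with letters such as "A" and "B" rather than roles. Turning those letters into "doctor" and "patient" is up to the application. The sketch below assumes the mapping is already known (for example, because the doctor always speaks first), and both the `relabel` helper and the sample utterances are hypothetical.

```python
def relabel(utterances, roles):
    """Return copies of the utterances with speaker letters replaced by roles.

    Letters missing from the mapping are left unchanged.
    """
    return [
        {**utterance, "speaker": roles.get(utterance["speaker"], utterance["speaker"])}
        for utterance in utterances
    ]

# Hypothetical appointment snippet and role mapping.
appointment = [
    {"speaker": "A", "text": "How have you been feeling?"},
    {"speaker": "B", "text": "Much better, thanks."},
]
labeled = relabel(appointment, {"A": "doctor", "B": "patient"})
print(labeled)
```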

Conference calls

Automatically label multiple speakers on a conference call recording to make transcriptions more useful. This allows sales and support coaching platforms to analyze and display transcripts split by the <agent> and the <customer> in their interface, making search and navigation simpler. This can also help trigger actions like follow-ups, ticket status changes, and more.

Podcast hosting

Automatically label the podcast <host> and <guest> on a recording to bring transcripts to life without needing to listen to the audio or video. This is especially important to podcasters, as most files are recorded on a single mono channel and almost always include more than one speaker. Podcast hosting platforms can use transcripts to drive better SEO, improve search/navigation, and provide insights podcasters may not otherwise have access to.

Hiring platforms

Automatically label the <recruiter> and <applicant> on a hiring interview recording. Applicant Tracking Systems have customers who rely heavily on phone and video calls to recruit applicants. Speaker diarization allows these platforms to separate what recruiters ask from how applicants respond without having to listen to the audio or video. This can also help trigger actions like applicant follow-ups and moving applicants to the next stage in the hiring process.

Video hosting

Automatically label multiple <speakers> on a video recording to make automated captions more useful. Video hosting platforms can now index files for better search, provide better accessibility and navigation for viewers, and create more useful content for SEO.

Broadcast media

Automatically label multiple <guests> and <hosts> on broadcast radio or TV recordings for more precise search and analytics around keywords. Media monitoring platforms can now provide more insights to their customers by labeling which speaker mentioned their keyword (e.g. Coca-Cola). This also allows them to provide better indexing, search, and navigation for individual recording playback.
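
The keyword-by-speaker idea can be sketched against the word-level output, where every word carries a speaker label and a start time. The helper name and the sample words below are hypothetical, and the matching is deliberately simple: it handles single-token keywords only and ignores surrounding punctuation and case.

```python
import string
from collections import defaultdict

def keyword_mentions(words, keyword):
    """Map each speaker to the start times at which they said the keyword."""
    target = keyword.lower()
    hits = defaultdict(list)
    for word in words:
        token = word["text"].strip(string.punctuation).lower()
        if token == target:
            hits[word["speaker"]].append(word["start"])
    return dict(hits)

# Illustrative word list in the same shape as the "words" key of the response.
words = [
    {"speaker": "A", "start": 0, "text": "Hi,"},
    {"speaker": "B", "start": 1500, "text": "Hi,"},
    {"speaker": "A", "start": 2820, "text": "Do"},
]
mentions = keyword_mentions(words, "hi")
print(mentions)
```

Multi-word keywords would need matching across consecutive words, which this sketch does not attempt.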

Sources