How are individual speakers identified and how does the Speaker Label feature work?
When an audio file is submitted with `speaker_labels` set to `true`, word timings are used to cut the audio into separate chunks of words. Those chunks are fed into a model that builds a “speaker embedding” for each one, a numerical representation of that chunk’s speaker. A clustering algorithm then groups embeddings that are similar to each other; if there are two speakers, two distinct clusters of speaker embeddings will form. Those clusters are then used to assign a speaker to each chunk.
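For context, enabling the feature on a request looks roughly like the sketch below. This assumes the AssemblyAI v2 transcript endpoint and uses the `requests` library; the API key and audio URL are placeholders, so treat the exact values as illustrative.

```python
import requests

# Hedged sketch: submit an audio file for transcription with speaker
# labels enabled. Endpoint and field names assume the AssemblyAI v2 API.
response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers={"authorization": "YOUR_API_KEY"},  # placeholder key
    json={
        "audio_url": "https://example.com/meeting.mp3",  # hypothetical file
        "speaker_labels": True,
    },
)
print(response.json()["id"])  # poll this transcript ID until it completes
```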
When the model determines speakers, the entire audio file is taken into consideration. It takes a certain amount of speech for a person to be identified as a unique speaker (typically around 30 seconds). If a person doesn’t speak much over the course of a file, or only replies in short phrases like “okay” and “right”, they may not be identified as a unique speaker. In that case, their words are attributed to the existing speaker whose embedding the model considers most similar, as in the sketch below.
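The production models are not public, but the clustering and fallback assignment described above can be illustrated with a short sketch. Everything here is an assumption for illustration: the embedding dimension, the random stand-in vectors, the choice of agglomerative clustering with cosine distance, and the centroid-based fallback.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
chunk_embeddings = rng.normal(size=(12, 256))  # one vector per chunk of words

# Cluster similar embeddings; with two speakers, two clusters emerge.
clusterer = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average")
chunk_speaker_ids = clusterer.fit_predict(chunk_embeddings)

# A short reply like "okay" may be too brief to form its own cluster.
# One plausible fallback: attribute it to the most similar cluster centroid.
centroids = [chunk_embeddings[chunk_speaker_ids == k].mean(axis=0) for k in range(2)]

def most_similar_speaker(embedding: np.ndarray) -> int:
    """Return the index of the centroid with the highest cosine similarity."""
    sims = [np.dot(embedding, c) / (np.linalg.norm(embedding) * np.linalg.norm(c))
            for c in centroids]
    return int(np.argmax(sims))

short_utterance = rng.normal(size=256)  # embedding of a brief "okay"
print(most_similar_speaker(short_utterance))  # 0 or 1
```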
For more information on Speaker Labels, see this blog post.