Do you offer cross-file Speaker Identification?

Our API supports detecting and labeling different speakers within a single audio file (speaker diarization). We don’t currently offer native cross-file speaker identification or voice registration. By default, our system outputs speaker labels as “Speaker A,” “Speaker B,” “Speaker C,” and so on. Without additional metadata or processing, these labels aren’t consistent across different recordings.
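
For reference, here is a minimal sketch of requesting diarization with our Python SDK. The API key and audio URL are placeholders:

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

# Enable speaker diarization for this transcription
config = aai.TranscriptionConfig(speaker_labels=True)
transcript = aai.Transcriber().transcribe("https://example.com/meeting.mp3", config)

# Each utterance carries a generic label such as "A", "B", ...
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```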

However, there are two effective approaches you can implement to identify speakers across multiple recordings.

One approach is to use LeMUR to match speaker labels to the individuals in the recording. This works well for use cases where speakers’ names are spoken aloud in the audio, for example when participants introduce themselves. We have a cookbook with detailed instructions here: https://www.assemblyai.com/docs/guides/speaker-identification
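
As a rough illustration of this approach, the sketch below transcribes with diarization enabled and then asks LeMUR to infer real names from the conversation. The prompt wording and audio URL are placeholders; see the cookbook above for the full recipe:

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

config = aai.TranscriptionConfig(speaker_labels=True)
transcript = aai.Transcriber().transcribe("https://example.com/interview.mp3", config)

# Ask LeMUR to map generic labels to names mentioned in the transcript
# (works when speakers introduce themselves or address each other by name)
prompt = (
    "Map each generic speaker label (A, B, ...) to the speaker's real name, "
    "based only on names mentioned in the transcript. "
    "Answer as 'A: <name>' lines; use 'Unknown' if no name is given."
)
result = transcript.lemur.task(prompt)
print(result.response)
```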

For a more sophisticated approach, you can implement speaker identification using audio embeddings. First, submit your audio file to AssemblyAI for diarization with speaker labels. Next, use a model like Nvidia’s Titanet to generate a voice embedding for each diarized speaker. Finally, match these embeddings against a vector database of known speakers and replace our generic labels (“Speaker A,” “Speaker B”) with actual names. Refer to our speaker identification cookbook for more details: https://www.assemblyai.com/docs/guides/titanet-speaker-identification
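
Here is a minimal sketch of the embedding-matching step using Nvidia’s Titanet model via the NeMo toolkit. It assumes you have already exported one audio clip per diarized speaker, and it uses a small in-memory dictionary of enrolled speakers in place of a real vector database; all file paths and names are placeholders:

```python
import numpy as np
import nemo.collections.asr as nemo_asr

# Load Nvidia's Titanet speaker-verification model
model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(
    "nvidia/speakerverification_en_titanet_large"
)

def embed(wav_path: str) -> np.ndarray:
    """Return an L2-normalized Titanet embedding for one audio clip."""
    emb = model.get_embedding(wav_path).squeeze().cpu().numpy()
    return emb / np.linalg.norm(emb)

# Enrolled reference voices (in production, rows in a vector database)
known = {"Alice": embed("alice_sample.wav"), "Bob": embed("bob_sample.wav")}

# One clip per diarized speaker, cut from the original audio using
# the utterance timestamps in the transcript
for label, clip in {"A": "speaker_a.wav", "B": "speaker_b.wav"}.items():
    query = embed(clip)
    # Cosine similarity reduces to a dot product on normalized vectors
    name, score = max(
        ((n, float(query @ ref)) for n, ref in known.items()), key=lambda t: t[1]
    )
    print(f"Speaker {label} -> {name} (similarity {score:.2f})")
```

In practice you would also apply a similarity threshold, so that speakers who don’t match any enrolled voice keep their generic label rather than being assigned the nearest name.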

For a deeper understanding of the differences between speaker diarization and speaker recognition, check out this blog post on the topic: https://www.assemblyai.com/blog/speaker-diarization-vs-recognition/