Identifying speakers in audio recordings

In this guide, we'll walk you through the steps of identifying speakers in your audio recordings. We'll cover the prerequisites you need, how to set up the environment, and the step-by-step instructions to get you started.

Get started

Before we begin, make sure you have an AssemblyAI account and an API token. You can sign up for a free account and get your API token from your dashboard.

The entire source code of this guide can be viewed here.

Step-by-step instructions

  1. Create a new file and import the necessary libraries for making an HTTP request.

  2. Set up the API endpoint and headers. The headers should include your API token.

  3. Upload your local file to the AssemblyAI API.

  4. Use the upload_url returned by the AssemblyAI API to create a JSON payload containing the audio_url parameter.

  5. Make a POST request to the AssemblyAI API endpoint with the payload and headers.

  6. After making the request, you will receive an ID for the transcription. Use it to poll the API every few seconds to check the status of the transcript job. Once the status is completed, you can retrieve the transcript from the API response.
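The steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than the guide's official source code. It uses only the standard library and assumes that the base URL is https://api.assemblyai.com/v2, that files are uploaded to /upload and transcripts are created at /transcript, that the token is sent in an authorization header, that a speaker_labels flag in the payload turns on speaker labels, and that a failed job reports the status error. The helper names (build_headers, build_payload, upload_file, submit_transcript, poll_until_done, transcribe) are hypothetical and chosen only for this example. Check each of these assumptions against the API reference before relying on them.

```python
import json
import time
import urllib.request

# Assumed base URL; confirm it against the API reference.
BASE_URL = "https://api.assemblyai.com/v2"


def build_headers(api_token):
    """Step 2: headers carrying the API token (header name is an assumption)."""
    return {"authorization": api_token}


def build_payload(upload_url):
    """Step 4: JSON payload with audio_url.

    speaker_labels is assumed to be the flag that enables speaker labels.
    """
    return {"audio_url": upload_url, "speaker_labels": True}


def _request_json(url, headers, data=None, content_type=None):
    """Send a GET (no data) or POST (with data) and decode the JSON reply."""
    all_headers = dict(headers)
    if content_type:
        all_headers["content-type"] = content_type
    request = urllib.request.Request(url, data=data, headers=all_headers)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def upload_file(path, headers):
    """Step 3: upload a local file and return the upload_url from the reply."""
    with open(path, "rb") as audio_file:
        body = audio_file.read()
    reply = _request_json(
        f"{BASE_URL}/upload", headers, data=body,
        content_type="application/octet-stream",
    )
    return reply["upload_url"]


def submit_transcript(upload_url, headers):
    """Step 5: POST the payload and return the transcript ID."""
    data = json.dumps(build_payload(upload_url)).encode("utf-8")
    reply = _request_json(
        f"{BASE_URL}/transcript", headers, data=data,
        content_type="application/json",
    )
    return reply["id"]


def poll_until_done(fetch_status, interval_seconds=3.0, sleep=time.sleep):
    """Step 6: call fetch_status repeatedly until the job is completed.

    fetch_status is any callable returning the transcript JSON as a dict.
    The "error" status and "error" field are assumptions about failed jobs.
    """
    while True:
        result = fetch_status()
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(result.get("error", "transcription failed"))
        sleep(interval_seconds)


def transcribe(path, api_token):
    """Run steps 2 to 6 for one local audio file."""
    headers = build_headers(api_token)
    upload_url = upload_file(path, headers)
    transcript_id = submit_transcript(upload_url, headers)
    polling_url = f"{BASE_URL}/transcript/{transcript_id}"
    return poll_until_done(lambda: _request_json(polling_url, headers))
```

With a real token and audio file, calling transcribe("./my-audio.mp3", "<YOUR_API_TOKEN>") would return the completed transcript as a dictionary. Passing the status fetch into poll_until_done as a callable keeps the polling loop separate from the HTTP code, so the loop can be exercised without network access.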

Understanding the response

The speaker labels are included in the utterances key of the response. Each utterance object in the list includes a speaker field containing a string identifier for the speaker (for example, "A" or "B"), a text field containing the spoken text, and a confidence score for each word.
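As an illustration of reading that structure, the sketch below formats each utterance as a "Speaker X: text" line and flags words with low confidence. The sample dictionary is made-up data shaped like the description above, not real API output. The words key holding the per-word confidence scores and the helper names format_utterances and low_confidence_words are assumptions made for this example.

```python
def format_utterances(transcript):
    """Return one 'Speaker X: text' line per utterance in the transcript."""
    lines = []
    # Treat a missing or null utterances value as an empty list.
    for utterance in transcript.get("utterances") or []:
        lines.append(f"Speaker {utterance['speaker']}: {utterance['text']}")
    return lines


def low_confidence_words(utterance, threshold=0.5):
    """Return the words in an utterance whose confidence is below threshold.

    Assumes each utterance has a 'words' list of dicts with 'text' and
    'confidence' keys; adjust the key names if the real response differs.
    """
    return [
        word["text"]
        for word in utterance.get("words", [])
        if word["confidence"] < threshold
    ]


# Hypothetical sample, invented for illustration only.
sample_transcript = {
    "status": "completed",
    "utterances": [
        {
            "speaker": "A",
            "text": "Hello there.",
            "words": [
                {"text": "Hello", "confidence": 0.99},
                {"text": "there.", "confidence": 0.42},
            ],
        },
        {
            "speaker": "B",
            "text": "Hi.",
            "words": [{"text": "Hi.", "confidence": 0.97}],
        },
    ],
}

for line in format_utterances(sample_transcript):
    print(line)
```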

For more information, see the speaker diarization model documentation or refer to the API reference.


Automatically identifying different speakers from an audio recording, also called speaker diarization, is a multi-step process. It can unlock additional value from many genres of recording, including conference call transcripts, broadcast media, podcasts, and more. You can learn more about use cases for speaker diarization and the underlying research from the AssemblyAI blog.