Transcribe a phone call in real-time using Python with AssemblyAI and Twilio

Learn how to transcribe a phone call in real-time using Python, AssemblyAI, ngrok, and Twilio

In this tutorial, we’ll learn how to build a Flask application that transcribes phone calls in real-time. Here’s a look at the end result:

Overview

Let’s start with a high-level overview of how our service will work. First, a user will call a Twilio number which we will configure to forward the incoming data stream to AssemblyAI. Then, AssemblyAI’s real-time transcription service will transcribe this audio stream as it comes in, and send us partial transcripts in quick succession, which we’ll print to the terminal as they come in.

Now let’s take a look at how this will work more technically.

  1. First, a user calls the phone number that we provision with Twilio.
  2. Twilio then calls the specific endpoint associated with this number.
  3. In our case, we will configure the endpoint to be an ngrok URL, which provides a tunnel to a port on our local machine from a publicly accessible URL. Ngrok therefore allows us to expose our application to Twilio without having to provision a cloud machine or modify firewall rules.
  4. Through the ngrok tunnel, Twilio calls an endpoint in a Flask application, which responds with TwiML (Twilio Markup Language) that instructs Twilio on how to handle the call.
  5. In our case, the TwiML will tell Twilio to pass the incoming audio stream from the phone call to a WebSocket in our Flask application.
  6. This WebSocket will receive the audio stream and send it to AssemblyAI for transcription, printing the corresponding transcript to the terminal as it is received in real-time.

You can find all of the code for this tutorial in this GitHub repository.

Getting Started

To get started, you’ll need:

  1. An AssemblyAI account with funds added
  2. A Twilio account (free account should be sufficient)
  3. An ngrok account and ngrok installed on your system
  4. Python 3.10 or later installed on your system (the code in this tutorial uses match statements)

Now create and navigate into a project directory:

mkdir realtime-phone-transcription
cd realtime-phone-transcription

Step 1: Set up credentials and environment

We’ll be using python-dotenv to manage our credentials. Create a file called .env in your project directory, and add the below text:

NGROK_AUTHTOKEN=replace-this
TWILIO_ACCOUNT_SID=replace-this
TWILIO_API_KEY_SID=replace-this
TWILIO_API_SECRET=replace-this
ASSEMBLYAI_API_KEY=replace-this

Make sure to replace replace-this with your specific credential for each line. You can find your ngrok authtoken in the Getting Started > Your Authtoken tab on your ngrok dashboard, or by checking the file returned by running ngrok config check if you already have ngrok set up on your system. If you have not configured ngrok on your system, add your ngrok authtoken to your CLI by running the command

ngrok config add-authtoken YOUR-TOKEN-HERE

You can find your Twilio account SID (TWILIO_ACCOUNT_SID) in your Twilio console under Account > API keys & tokens. Here you can also create an API key for TWILIO_API_KEY_SID and TWILIO_API_SECRET. A Standard key type is sufficient to follow along with this tutorial.

You can find your AssemblyAI API key on your AssemblyAI dashboard.
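
Before moving on, it can help to confirm that every credential has actually been filled in. The following is a minimal sketch and not part of the original project: the helper name missing_credentials is hypothetical, and it only inspects a mapping (such as os.environ after load_dotenv() has run) for the five variable names listed above.

```python
import os

# The five variable names from the .env file above.
REQUIRED_VARS = [
    "NGROK_AUTHTOKEN",
    "TWILIO_ACCOUNT_SID",
    "TWILIO_API_KEY_SID",
    "TWILIO_API_SECRET",
    "ASSEMBLYAI_API_KEY",
]


def missing_credentials(env=os.environ):
    """Return the names that are unset, empty, or still the placeholder."""
    return [
        name for name in REQUIRED_VARS
        if env.get(name, "").strip() in ("", "replace-this")
    ]


# A plain dict stands in for the environment in this example.
example_env = {"NGROK_AUTHTOKEN": "abc123", "TWILIO_ACCOUNT_SID": "replace-this"}
print(missing_credentials(example_env))
```

In the example, only NGROK_AUTHTOKEN has a real value, so the other four names are reported as missing.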

Finally, create a file called .gitignore and copy the below text into it:

.env
venv
__pycache__

This will prevent you from accidentally tracking your .env file with git and potentially uploading it to a website like GitHub. Additionally, it will prevent you from tracking/uploading your virtual environment and cache files.

Now create and activate a virtual environment for the project:

# Mac/Linux
python3 -m venv venv
. venv/bin/activate

# Windows
python -m venv venv
.\venv\Scripts\activate.bat

Next, we’ll install all of the dependencies we will need for the project. Execute the below command:

pip install Flask flask-sock assemblyai python-dotenv ngrok twilio

Step 2: Create the Flask application

Create a file in the project directory called main.py and add the following imports:

from flask import Flask
from flask_sock import Sock

These lines import Flask and Sock so that we can create a web application with WebSockets. Next, add these lines that define some settings for our application:

PORT = 5000
DEBUG = False
INCOMING_CALL_ROUTE = '/'
WEBSOCKET_ROUTE = '/realtime'

In particular, we set the port that the app should run on, set debugging to false, and then define the route for the HTTP endpoint that will be hit when our Twilio number is called, and the route for the WebSocket to which the audio data will be sent.

Now add the below lines to main.py, which instantiate our app and define the functions for these endpoints. Additionally, we run the app on the specified port and debugging mode when python main.py is executed.

app = Flask(__name__)
sock = Sock(app)

@app.route(INCOMING_CALL_ROUTE)
def receive_call():
    pass

@sock.route(WEBSOCKET_ROUTE)
def transcription_websocket(ws):
    pass


if __name__ == "__main__":
    app.run(port=PORT, debug=DEBUG)

Step 3: Define the root endpoint

Now that the basic structure of our Flask app is defined, we can start to define our endpoints. We’ll start by defining the root endpoint that Twilio will hit when our Twilio phone number is called.

Modify your receive_call function as follows:

@app.route(INCOMING_CALL_ROUTE)
def receive_call():
    return "Real-time phone call transcription app"

Now run python main.py from a terminal in the project directory and go to http://localhost:5000 in your browser. You will see the return message displayed:

By default, the only HTTP request method available for Flask routes is GET, and the endpoint will respond with the value returned by the corresponding function. In our case, Twilio will send a POST request to the endpoint that we associate with our Twilio number, so we need to modify this Python function accordingly.

Modify your imports and receive_call function as follows:

from flask import Flask, request, Response

# ...

@app.route(INCOMING_CALL_ROUTE, methods=['GET', 'POST'])
def receive_call():
    if request.method == 'POST':
        xml = f"""
<Response>
    <Say>
        You have connected to the Flask application
    </Say>
</Response>
""".strip()
        return Response(xml, mimetype='text/xml')
    else:
        return "Real-time phone call transcription app"

First, we update the app.route decorator to allow both GET and POST requests. Then, inside the receive_call function, we access the HTTP request information using request imported from flask to check what the request type/method is. 

If it is a POST request, then we return a block of TwiML, which is Twilio’s version of XML that instructs Twilio on what to do when this endpoint is called. In our case, we use <Say> tags that tell Twilio to speak the sentence between the tags to the caller. We then return an HTTP Response which contains the TwiML, and we set the MIME type to XML.

Finally, if the request method is not POST, it must be a GET (the only other method the route allows), so the else block returns the same text as before.
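
Because the TwiML is assembled as a plain string, a typo in a tag would only surface when Twilio tries to read it. As a rough sanity check, the string can be parsed with Python’s standard XML parser before it is ever served. This is only a sketch: it verifies that the markup is well-formed XML, not that Twilio accepts it as valid TwiML.

```python
import xml.etree.ElementTree as ET

# The same TwiML string that receive_call returns for POST requests.
twiml = """
<Response>
    <Say>
        You have connected to the Flask application
    </Say>
</Response>
""".strip()

# ET.fromstring raises xml.etree.ElementTree.ParseError on malformed XML.
root = ET.fromstring(twiml)
say = root.find("Say")

print(root.tag)
print(say.text.strip())
```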

We now have a functional Flask application that will respond with TwiML if called by Twilio. The next step is to get a Twilio number and point it to this application.

Step 4: Get a Twilio number and open an ngrok tunnel

To get a Twilio number, go to your Twilio console and go to Phone Numbers > Manage > Buy a number. There, you will see a list of numbers you can purchase for a small monthly fee - select one and click Buy. Note that we only need Voice capabilities for this tutorial.

Next, we’ll open an ngrok tunnel on port 5000 (through which our Flask app will be served). In the terminal, execute the following command:

ngrok http http://localhost:5000/

In the terminal, you will see some information displayed about the tunnel. What we need is the public forwarding URL that ends in .ngrok-free.app, so copy this value now.

Back in your Twilio console, go to Phone Numbers > Manage > Active numbers and select the phone number you bought above. In the Voice Configuration, set a Webhook for when a call comes in, pasting the ngrok URL you just copied under URL and setting the HTTP method to HTTP POST:

Then scroll down and click Save Configuration to save this change.

You have now configured your Twilio number to send a POST request to the ngrok URL when your number is called, and opened a tunnel that forwards this ngrok URL to port 5000 on your local machine. With all of this in place, we can now test our application.

Open another terminal in your project directory and run python main.py. You can go to http://localhost:5000 again to confirm that the application is up and running - you will see a 200 response in the terminal if you do so. Now call your Twilio phone number - you will hear a voice say You have connected to the Flask application, and then the call will terminate.

We now have a working Flask application that can successfully receive and respond to a Twilio phone call - it’s time to add in the WebSocket that receives incoming speech.

Step 5: Set up a WebSocket to receive speech

Modify your receive_call function with the TwiML below:

@app.route(INCOMING_CALL_ROUTE, methods=['GET', 'POST'])
def receive_call():
    if request.method == 'POST':
        xml = f"""
<Response>
    <Say>
        Speak to see your audio data printed to the console
    </Say>
    <Connect>
        <Stream url='wss://{request.host}{WEBSOCKET_ROUTE}' />
    </Connect>
</Response>
""".strip()
        return Response(xml, mimetype='text/xml')
    else:
        return "Real-time phone call transcription app"

We’ve added <Connect> and <Stream> tags that tell Twilio to forward the incoming audio data to the specified WebSocket. In our case, we point it to a WebSocket in the same Flask app that we will define next.
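
To make the url attribute concrete, here is a small sketch of how that string is put together. It assumes the request reaches Flask through the ngrok tunnel with its Host header unchanged, so that request.host is the ngrok hostname. The helper name stream_url and the hostname example.ngrok-free.app are hypothetical and used only for illustration.

```python
WEBSOCKET_ROUTE = "/realtime"


def stream_url(host, route=WEBSOCKET_ROUTE):
    """Build the secure WebSocket URL that the <Stream> tag points at."""
    return f"wss://{host}{route}"


# Hypothetical ngrok hostname standing in for request.host.
print(stream_url("example.ngrok-free.app"))
```

The hostname is the same one Twilio used for the webhook, so the audio stream travels through the same tunnel as the original POST request.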

Our WebSocket will be defined in the transcription_websocket function that we defined at the beginning of this tutorial. Import the json package, and then modify the transcription_websocket function as follows:

import json

# ...

@sock.route(WEBSOCKET_ROUTE)
def transcription_websocket(ws):
    while True:
        data = json.loads(ws.receive())
        match data['event']:
            case "connected":
                print('twilio connected')
            case "start":
                print('twilio started')
            case "media":
                payload = data['media']['payload']
                print(payload)
            case "stop":
                print('twilio stopped')

Our WebSocket will receive four possible types of messages from Twilio:

  1. connected when the WebSocket connection is established
  2. start when the data stream begins sending data
  3. media which contain the raw audio data, and
  4. stop when the stream is stopped or the call has ended

We receive each message with ws.receive(), and then load it to a dictionary with json.loads. We then handle each message according to its type stored in the event key. For now, we print the Base64-encoded audio payload for media messages and a simple message for each remaining case.
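
The message handling can be tried offline, without a phone call. The sketch below feeds hand-written JSON strings through the same json.loads and event-key logic. The sample messages are illustrative and trimmed down to the two fields this tutorial reads (event and media.payload); real Twilio messages carry additional fields that are omitted here. The function name describe is hypothetical.

```python
import json


def describe(raw_message):
    """Return a short description of one Twilio stream message."""
    data = json.loads(raw_message)
    event = data["event"]
    if event == "media":
        payload = data["media"]["payload"]
        return f"media: {len(payload)} base64 characters"
    return event


# Hand-written stand-ins for the four message types.
samples = [
    '{"event": "connected"}',
    '{"event": "start"}',
    '{"event": "media", "media": {"payload": "/////w=="}}',
    '{"event": "stop"}',
]

for raw in samples:
    print(describe(raw))
```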

Start your Flask application by running python main.py from a terminal in the project directory, and then call your Twilio number. You will start seeing a stream of Base64-encoded audio data printed in your console:

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////w==
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////w==
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////w==       
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////fn38fXx9e/19//79/n39ff58/nz+ff59e/5+ff1+e/19/n19fn79fA==
fPt8/P99fH7+fnx9/P3+/f39/Pz//f58/nx9//5+fv59/f5+/n1+fv99ev38ff3+/X18eX57fH5+e35+fn59/Xz9+n1+/n78/3x8+v59ffv8/X7+fn19/v55fv9+/f78fX36fH37/n79ff/+fv79/X16fvr9/v79fX17fXz+fnt+/n57fn59fXx+fXz8fv/+fv58ff5+ev1+fn39/H59fg==
/Xx9/P59/f59fv97fX1+ff5+fP57fX5+fn58ff99ff17ff7+/n7+fnv8/3x8fv/8//7+//5+/f3//H36ffp4fP58/X1++/7//P16/X38fn59/X59+3x+/Xr9+339fn79fH3+/n19fv58fX1+/P5+fnv8e//9//37ff7+/35+fvt9ff3+/f56/v37/X57ef9+fv/8en3//X78fv39/vx9fQ==
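
The long runs of slashes are not noise: a Base64 payload made only of slashes decodes to bytes of value 0xFF, which in mu-law encoding is zero amplitude, i.e. silence. The sketch below illustrates this with a standard G.711 mu-law decoder written out by hand. The 160-byte chunk size is an assumption chosen for illustration (20 ms of audio at 8000 Hz), not a value taken from the output above.

```python
import base64

SAMPLE_RATE = 8000  # Hz; mu-law audio uses one byte per sample


def ulaw_to_linear(byte):
    """Decode one G.711 mu-law byte to a signed 16-bit PCM value."""
    byte = ~byte & 0xFF  # mu-law bytes are stored bit-inverted
    magnitude = (((byte & 0x0F) << 3) + 0x84) << ((byte >> 4) & 0x07)
    magnitude -= 0x84
    return -magnitude if byte & 0x80 else magnitude


# Assumed chunk: 160 bytes of 0xFF, which encodes to a run of slashes.
payload_b64 = base64.b64encode(b"\xff" * 160).decode("ascii")
audio = base64.b64decode(payload_b64)

print(payload_b64[-4:])                    # tail of the Base64 text: /w==
print(len(audio) / SAMPLE_RATE)            # seconds of audio: 0.02
print({ulaw_to_linear(b) for b in audio})  # distinct amplitudes: {0}
```

Once speech starts, the payloads contain other characters because the decoded amplitudes are no longer zero.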

Step 6: Define a real-time transcriber

We now have our Flask application running, receiving calls to a Twilio number via an ngrok tunnel, and printing the speech data to the console. It’s time to add real-time transcription.

Create a new file in your project directory called twilio_transcriber.py. We will define the object that we use to perform the real-time transcription in this module. At the top of the file, add the following code for imports and to set the AssemblyAI API key and define the Twilio audio sample rate:

import os

import assemblyai as aai
from dotenv import load_dotenv
load_dotenv()

aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')

TWILIO_SAMPLE_RATE = 8000 # Hz

Now we will add handlers for the four types of messages we will receive from AssemblyAI. Add the below functions to twilio_transcriber.py:

def on_open(session_opened: aai.RealtimeSessionOpened):
    "Called when the connection has been established."
    print("Session ID:", session_opened.session_id)


def on_data(transcript: aai.RealtimeTranscript):
    "Called when a new transcript has been received."
    if not transcript.text:
        return

    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(transcript.text, end="\r\n")
    else:
        print(transcript.text, end="\r")


def on_error(error: aai.RealtimeError):
    "Called when an error occurs."
    print("An error occurred:", error)


def on_close():
    "Called when the connection has been closed."
    print("Closing Session")

on_open and on_close are called when the connection is established and terminated, respectively, and on_error is called when there is an error. For each of these, we print a single line with related information. The on_data function is called when our Flask application receives data from AssemblyAI’s real-time transcription service. In this case, we do one of three things.

If there is no transcript (i.e. there was no speech in the audio data sent to AssemblyAI), then we simply return the default value None. If we receive a transcript, then we do one of two things based on the transcript type.

Every message that AssemblyAI’s server sends is one of two types - either a partial transcript or a final transcript. Partial transcripts are sent in real-time when someone is speaking, gradually building up the transcript of what is being uttered. Each time a partial transcript is sent, the entire partial transcript for that utterance is sent, and not just the new words that have been spoken since the last partial transcript was sent.

When the real-time model detects that an utterance is complete, the entire utterance is sent one final time (formatted and punctuated by default) as a final transcript rather than partial. Once this final transcript is sent, we start this process over with a blank slate for the next utterance.

We can see how this process works in the below diagram:

So to handle these incoming messages, we print each partial transcript followed by a carriage return, and each final transcript followed by a carriage return and a newline. The carriage return brings the cursor back to the beginning of the line so that each partial transcript is printed over the previous one, giving the visual effect that the newly-transcribed words appear over time. The newline after a final transcript moves the cursor to the next line, so the completed utterance stays on screen.
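
To see the effect of the carriage return without making a call, the terminal can be simulated in a few lines. This is only a sketch: the render function is hypothetical, the transcripts are made up, and the terminal is modelled as a single line that each print overwrites from column zero.

```python
def render(events):
    """Simulate the terminal output produced by on_data.

    events is a list of (text, is_final) pairs. Returns the lines that
    were committed to the screen and whatever is still on the current line.
    """
    lines, current = [], ""
    for text, is_final in events:
        # After "\r" the cursor is at column 0, so new text overwrites old.
        current = text + current[len(text):]
        if is_final:
            # "\r\n" ends the line, so the final transcript stays visible.
            lines.append(current)
            current = ""
    return lines, current


# Made-up transcripts: two partials, one final, then a new utterance.
events = [
    ("hello", False),
    ("hello world", False),
    ("Hello world.", True),
    ("how are", False),
]
print(render(events))
```

Only the final transcript is kept as its own line, while the partial for the next utterance is still being overwritten in place.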

Now that we have our handlers defined, we need to define the class that we will actually use to perform the transcription. Add the below class to twilio_transcriber.py:

class TwilioTranscriber(aai.RealtimeTranscriber):
    def __init__(self):
        super().__init__(
            on_data=on_data,
            on_error=on_error,
            on_open=on_open, # optional
            on_close=on_close, # optional
            sample_rate=TWILIO_SAMPLE_RATE,
            encoding=aai.AudioEncoding.pcm_mulaw
        )

Our TwilioTranscriber is a subclass of the RealtimeTranscriber class in AssemblyAI’s Python SDK. Its initializer passes our handlers to the parent class’s initializer, along with a sample rate of 8000 Hz and PCM mu-law encoding, which are the settings Twilio streams use.

Step 7: Add real-time transcription to the WebSocket

Now that we have defined TwilioTranscriber, we need to use it in our main application code. In main.py, import base64 and TwilioTranscriber, and then modify the transcription_websocket to match the below code:

import base64
from twilio_transcriber import TwilioTranscriber

# ...

@sock.route(WEBSOCKET_ROUTE)
def transcription_websocket(ws):
    print('called')
    while True:
        data = json.loads(ws.receive())
        match data['event']:
            case "connected":
                transcriber = TwilioTranscriber()
                transcriber.connect()
                print('transcriber connected')
            case "start":
                print('twilio started')
            case "media":
                payload_b64 = data['media']['payload']
                payload_mulaw = base64.b64decode(payload_b64)
                transcriber.stream(payload_mulaw)
            case "stop":
                print('twilio stopped')
                transcriber.close()
                print('transcriber closed')

We’ve updated our connected handler to instantiate a TwilioTranscriber and connect to AssemblyAI’s servers, updated the media handler to decode the Base64-encoded audio data and then pass it to the transcriber’s stream method, and updated the stop handler to close the transcriber’s connection to AssemblyAI’s servers.

Finally, update the <Say> tags in the receive_call function to contain a fitting phrase now that our console will print the audio transcription rather than just the audio data:

    <Say>
        Speak to see your speech transcribed in the console
    </Say>

Run python main.py in a terminal from the project directory, and call your Twilio number. As you speak, you will see your speech transcribed in the console.

Step 8: Automatically set the Twilio webhook and ngrok tunnel

Our application is running and fully functional, but we can further improve it. Currently, every time we want to run the application, we must open an ngrok tunnel in a separate terminal and then copy the forwarding URL from this terminal into Twilio’s console in the browser.

This is fairly laborious, so it’s time to automate these steps. First, update your .env file to include your Twilio number as a TWILIO_NUMBER environment variable:

NGROK_AUTHTOKEN=replace-this
TWILIO_ACCOUNT_SID=replace-this
TWILIO_API_KEY_SID=replace-this
TWILIO_API_SECRET=replace-this
ASSEMBLYAI_API_KEY=replace-this
TWILIO_NUMBER=replace-this

The number should be written in E.164 format: a plus sign followed by the country code and the rest of the number, with no spaces or punctuation. For example, +12345678901 has the right shape for a United States number (+1 followed by ten digits).
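
The format matters because the value is later compared, character for character, with the phone_number that Twilio reports for each number on the account. A quick format check can catch a mistyped value early. The sketch below assumes the E.164 convention (a plus sign, a non-zero first digit, at most 15 digits in total); the helper name is_e164 and the sample numbers are hypothetical.

```python
import re

# E.164: "+", a non-zero first digit, then up to 14 more digits.
E164_PATTERN = re.compile(r"\+[1-9]\d{1,14}")


def is_e164(number):
    """Return True if the string looks like an E.164 phone number."""
    return E164_PATTERN.fullmatch(number) is not None


print(is_e164("+12345678901"))    # True
print(is_e164("(234) 567-8901"))  # False
```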

Now, update the top of your main.py file as follows:

import base64
import json
import os

from flask import Flask, request, Response
from flask_sock import Sock
import ngrok
from twilio.rest import Client
from dotenv import load_dotenv
load_dotenv()

from twilio_transcriber import TwilioTranscriber


# ...

# Twilio authentication
account_sid = os.environ['TWILIO_ACCOUNT_SID']
api_key = os.environ['TWILIO_API_KEY_SID']
api_secret = os.environ['TWILIO_API_SECRET']
client = Client(api_key, api_secret, account_sid)

# Twilio phone number to call
TWILIO_NUMBER = os.environ['TWILIO_NUMBER']

# ngrok authentication
ngrok.set_auth_token(os.getenv("NGROK_AUTHTOKEN"))

We’ve added authentication variables to instantiate a Twilio Client, and imported our Twilio phone number environment variable. Finally, we’ve set our ngrok auth token through ngrok.set_auth_token.

Next, update the script’s main block as follows:

if __name__ == "__main__":
    try:
        # Open Ngrok tunnel
        listener = ngrok.forward(f"http://localhost:{PORT}")
        print(f"Ngrok tunnel opened at {listener.url()} for port {PORT}")
        NGROK_URL = listener.url()

        # Set ngrok URL to be the webhook for the appropriate Twilio number
        twilio_numbers = client.incoming_phone_numbers.list()
        twilio_number_sid = [num.sid for num in twilio_numbers if num.phone_number == TWILIO_NUMBER][0]
        client.incoming_phone_numbers(twilio_number_sid).update(account_sid, voice_url=f"{NGROK_URL}{INCOMING_CALL_ROUTE}")

        # run the app
        app.run(port=PORT, debug=DEBUG)
    finally:
        # Always disconnect the ngrok tunnel
        ngrok.disconnect()

First, we open up an ngrok tunnel with ngrok.forward, and then use the twilio library to programmatically set our Twilio number’s voice webhook to the URL of the tunnel. It appears that it is not possible to call the incoming_phone_numbers method directly on our Twilio number, so we first have to isolate its SID with a list comprehension and then pass the SID into this method. Finally, we run our app as before with app.run(). All of this code is wrapped in a try…finally block that ensures our ngrok tunnel is always terminated properly.
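
One caveat with the list comprehension is that indexing with [0] raises a bare IndexError if TWILIO_NUMBER does not match any number on the account, for example because of a formatting difference. A slightly more defensive lookup is sketched below. The function name find_number_sid is hypothetical, and SimpleNamespace objects with made-up SIDs stand in for the records returned by client.incoming_phone_numbers.list(), so the sketch runs without Twilio credentials.

```python
from types import SimpleNamespace


def find_number_sid(numbers, phone_number):
    """Return the SID of the record whose phone_number matches."""
    sid = next((n.sid for n in numbers if n.phone_number == phone_number), None)
    if sid is None:
        raise ValueError(f"{phone_number} was not found on this Twilio account")
    return sid


# Stand-ins for Twilio number records; the SIDs and numbers are made up.
numbers = [
    SimpleNamespace(sid="PN-example-1", phone_number="+12345678901"),
    SimpleNamespace(sid="PN-example-2", phone_number="+12345678902"),
]
print(find_number_sid(numbers, "+12345678902"))
```

In main.py, the same function could be given the real list in place of the stand-ins, and the error message would then name the number that failed to match.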

If you have a free ngrok account you can only have one tunnel open at a time, so close your previous tunnel if it is still open, and then run python main.py in order to execute our program. Call your Twilio number and speak to see your speech transcribed in the console without having to manually open an ngrok tunnel and update the Twilio console:

Final words

In this tutorial, we learned how to use Twilio and AssemblyAI to transcribe phone calls in real-time. Check out our other tutorials or our documentation to learn more about how you can build AI-powered features to analyze speech.

Alternatively, check out other content on our Blog or YouTube channel to learn more about AI, or feel free to join us on Twitter or Discord to stay in the loop when we release new content.