Build & Learn
August 6, 2025

Transcribe a phone call in real-time using Python with AssemblyAI and Twilio

Learn how to build real-time phone call transcription with Python using AssemblyAI's Universal-Streaming speech-to-text API and Twilio.

Ryan O'Connor
Senior Developer Educator

In this tutorial, we'll learn how to build a Flask application that transcribes phone calls in real-time. 

Here's a look at the end result:

Overview

Let's start with a high-level overview of how our service will work. First, a user will call a Twilio number which we will configure to forward the incoming data stream to AssemblyAI. Then, AssemblyAI's real-time transcription service will transcribe this audio stream as it comes in, and send us partial transcripts in quick succession, which we'll print to the terminal as they come in.

Now let's take a look at how this will work more technically.

  • First, a user calls the phone number that we provision with Twilio.
  • Twilio then calls the specific endpoint associated with this number.
  • In our case, we will configure the endpoint to be an ngrok URL, which provides a tunnel to a port on our local machine from a publicly accessible URL. Ngrok therefore allows us to expose our application to Twilio without having to provision a cloud machine or modify firewall rules.
  • Through the ngrok tunnel, Twilio calls an endpoint in a Flask application, which responds with TwiML (Twilio Markup Language) that instructs Twilio on how to handle the call.
  • In our case, the TwiML will tell Twilio to pass the incoming audio stream from the phone call to a WebSocket in our Flask application.
  • This WebSocket will receive the audio stream and send it to AssemblyAI for transcription, printing the corresponding transcript to the terminal as it is received in real-time.

You can find all of the code for this tutorial in this GitHub repository.

Getting Started

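To get started, you'll need:

  • Python 3.10 or later (the code uses match statements)
  • An AssemblyAI account and API key
  • A Twilio account
  • An ngrok account and authtoken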

Now create and navigate into a project directory:

mkdir realtime-phone-transcription
cd realtime-phone-transcription

Step 1: Set up credentials and environment

We'll be using python-dotenv to manage our credentials. Create a file called .env in your project directory, and add the below text:

NGROK_AUTHTOKEN=replace-this
TWILIO_ACCOUNT_SID=replace-this
TWILIO_API_KEY_SID=replace-this
TWILIO_API_SECRET=replace-this
ASSEMBLYAI_API_KEY=replace-this

Make sure to replace replace-this with your specific credential for each line. You can find your ngrok authtoken in the Getting Started > Your Authtoken tab on your ngrok dashboard, or by checking the file returned by running ngrok config check if you already have ngrok set up on your system. If you have not configured ngrok on your system, add your ngrok authtoken to your CLI by running the command:

ngrok config add-authtoken YOUR-TOKEN-HERE

You can find your Twilio account SID (TWILIO_ACCOUNT_SID) in your Twilio console under Account > API keys & tokens. Here you can also create an API key for TWILIO_API_KEY_SID and TWILIO_API_SECRET. A Standard key type is sufficient to follow along with this tutorial.

You can find your AssemblyAI API key on your AssemblyAI dashboard.
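A credential that is missing or still set to the placeholder tends to surface later as a confusing authentication error, so it can help to check the values up front. The sketch below is one possible way to do that with only the standard library. find_unset_credentials is a hypothetical helper name, and the sketch assumes the variables have already been loaded into the environment (for example by python-dotenv).

```python
import os

# Variable names taken from the .env file above
REQUIRED_VARS = [
    "NGROK_AUTHTOKEN",
    "TWILIO_ACCOUNT_SID",
    "TWILIO_API_KEY_SID",
    "TWILIO_API_SECRET",
    "ASSEMBLYAI_API_KEY",
]
PLACEHOLDER = "replace-this"


def find_unset_credentials(env=None):
    """Return names of required variables that are missing, empty, or still the placeholder.

    `env` defaults to os.environ; any mapping can be passed in for testing.
    """
    env = os.environ if env is None else env
    return [
        name for name in REQUIRED_VARS
        if not env.get(name) or env.get(name) == PLACEHOLDER
    ]


if __name__ == "__main__":
    unset = find_unset_credentials()
    if unset:
        print("Still need values for: " + ", ".join(unset))
    else:
        print("All credentials are set")
```

Running a check like this before starting the app makes it obvious which line of .env still needs attention.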

Finally, create a file called .gitignore and copy the below text into it:

.env
venv
__pycache__

This will prevent you from accidentally tracking your .env file with git and potentially uploading it to a website like GitHub. Additionally, it will prevent you from tracking/uploading your virtual environment and cache files.

Now create and activate a virtual environment for the project:

# Mac/Linux
python3 -m venv venv
. venv/bin/activate

# Windows
python -m venv venv
.\venv\Scripts\activate.bat

Next, we'll install all of the dependencies we will need for the project. Execute the below command:

pip install Flask flask-sock assemblyai python-dotenv ngrok twilio

Step 2: Create the Flask application

Create a file in the project directory called main.py and add the following imports:

from flask import Flask
from flask_sock import Sock

These lines import Flask and Sock so that we can create a web application with WebSockets. Next, add these lines that define some settings for our application:

PORT = 5000
DEBUG = False
INCOMING_CALL_ROUTE = '/'
WEBSOCKET_ROUTE = '/realtime'

In particular, we set the port that the app should run on, set debugging to false, and then define the route for the HTTP endpoint that will be hit when our Twilio number is called, and the route for the WebSocket to which the audio data will be sent.

Now add the below lines to main.py, which instantiate our app and define the functions for these endpoints. Additionally, we run the app on the specified port and debugging mode when python main.py is executed.

app = Flask(__name__)
sock = Sock(app)

@app.route(INCOMING_CALL_ROUTE)
def receive_call():
    pass

@sock.route(WEBSOCKET_ROUTE)
def transcription_websocket(ws):
    pass

if __name__ == "__main__":
    app.run(port=PORT, debug=DEBUG)

Step 3: Define the root endpoint

Now that the basic structure of our Flask app is defined, we can start to define our endpoints. We'll start by defining the root endpoint that Twilio will hit when our Twilio phone number is called.

Modify your receive_call function as follows:

@app.route(INCOMING_CALL_ROUTE)
def receive_call():
    return "Real-time phone call transcription app"

Now run python main.py from a terminal in the project directory and go to http://localhost:5000 in your browser. You will see the return message displayed.

By default, a Flask route only accepts GET requests (along with the implicit HEAD and OPTIONS), and the endpoint responds with the value returned by the corresponding function. Twilio, however, will send a POST request to the endpoint that we associate with our Twilio number, so we need to modify this Python function accordingly.

Modify your imports and receive_call function as follows:

from flask import Flask, request, Response

# ...

@app.route(INCOMING_CALL_ROUTE, methods=['GET', 'POST'])
def receive_call():
    if request.method == 'POST':
        xml = f"""
<Response>
    <Say>
        You have connected to the Flask application
    </Say>
</Response>
""".strip()
        return Response(xml, mimetype='text/xml')
    else:
        return "Real-time phone call transcription app"

First, we update the app.route decorator to allow both GET and POST requests. Then, inside the receive_call function, we access the HTTP request information using request imported from flask to check what the request type/method is.

If it is a POST request, then we return a block of TwiML, which is Twilio's version of XML that instructs Twilio on what to do when this endpoint is called. In our case, we use "<Say>" tags that tell Twilio to speak the sentence between the tags to the caller. We then return an HTTP Response which contains the TwiML, and we set the MIME type to XML.

Finally, if the HTTP request method is not POST, then it is a GET so we return the text we did previously in the else block.
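Hand-writing XML in a string works for a fixed message, but characters such as & or < in the spoken text would produce invalid TwiML. As an alternative sketch, the same response can be built with Python's standard xml.etree.ElementTree module, which escapes text automatically. build_say_twiml is a hypothetical helper and is not part of this tutorial's code.

```python
import xml.etree.ElementTree as ET


def build_say_twiml(message: str) -> str:
    """Build a <Response><Say>...</Say></Response> document as a string."""
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    say.text = message  # special characters are escaped on serialization
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(build_say_twiml("You have connected to the Flask application"))
```

In the Flask route, the returned string would be passed to Response(xml, mimetype='text/xml') in the same way as the hand-written version.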

We now have a functional Flask application that will respond with TwiML if called by Twilio. The next step is to get a Twilio number and point it to this application.

Step 4: Get a Twilio number and open an ngrok tunnel

To get a Twilio number, go to your Twilio console and go to Phone Numbers > Manage > Buy a number. There, you will see a list of numbers you can purchase for a small monthly fee - select one and click Buy. Note that we only need Voice capabilities for this tutorial.

Next, we'll open an ngrok tunnel on port 5000 (through which our Flask app will be served). In the terminal, execute the following command:

ngrok http http://localhost:5000/

In the terminal, you will see some information displayed about the tunnel. What we need is the public forwarding URL that ends in .ngrok-free.app, so copy this value now.

Back in your Twilio console, go to Phone Numbers > Manage > Active numbers and select the phone number you bought above. In the Voice Configuration, set a Webhook for when a call comes in, pasting the ngrok URL you just copied under URL and setting the HTTP method to HTTP POST.

Then scroll down and click Save Configuration to save this change.

You have now configured your Twilio number to send a POST request to the ngrok URL when your number is called, and opened a tunnel that forwards this ngrok URL to port 5000 on your local machine. With all of this in place, we can now test our application.

Open another terminal in your project directory and run python main.py. You can go to http://localhost:5000 again to confirm that the application is up and running - you will see a 200 response in the terminal if you do so. Now call your Twilio phone number - you will hear a voice say "You have connected to the Flask application", and then the call will terminate.

We now have a working Flask application that can successfully receive and respond to a Twilio phone call - it's time to add in the WebSocket that receives incoming speech.

Step 5: Set up a WebSocket to receive speech

Modify your receive_call function with the TwiML below:

@app.route(INCOMING_CALL_ROUTE, methods=['GET', 'POST'])
def receive_call():
    if request.method == 'POST':
        xml = f"""
<Response>
    <Say>
        Speak to see your audio data printed to the console
    </Say>
    <Connect>
        <Stream url='wss://{request.host}{WEBSOCKET_ROUTE}' />
    </Connect>
</Response>
""".strip()
        return Response(xml, mimetype='text/xml')
    else:
        return "Real-time phone call transcription app"

We've added "<Connect>" and "<Stream>" tags that tell Twilio to forward the incoming audio data to the specified WebSocket. In our case, we point it to a WebSocket in the same Flask app that we will define next.

Our WebSocket will be defined in the transcription_websocket function that we defined at the beginning of this tutorial. Import the json package, and then modify the transcription_websocket function as follows:

import json

# ...

@sock.route(WEBSOCKET_ROUTE)
def transcription_websocket(ws):
    while True:
        data = json.loads(ws.receive())
        match data['event']:
            case "connected":
                print('twilio connected')
            case "start":
                print('twilio started')
            case "media":
                payload = data['media']['payload']
                print(payload)
            case "stop":
                print('twilio stopped')

Our WebSocket will receive four possible types of messages from Twilio:

  • connected when the WebSocket connection is established
  • start when the data stream begins sending data
  • media which contain the raw audio data, and
  • stop when the stream is stopped or the call has ended

We receive each message with ws.receive(), and then load it into a dictionary with json.loads. We then handle each message according to its type, which is stored in the event key. For now, we print the base64-encoded payload for media messages and a simple message for each remaining case.
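To make the message handling concrete without placing a call, the sketch below runs the same parse-and-dispatch idea over a few hand-made messages. The sample messages are invented for illustration and contain only the keys this tutorial reads (event and media.payload); real Twilio messages carry additional fields. describe_message is a hypothetical helper.

```python
import base64
import json

# Made-up messages that mimic the shape of the four event types;
# the payload is 160 placeholder bytes, base64-encoded as text.
SAMPLE_MESSAGES = [
    json.dumps({"event": "connected"}),
    json.dumps({"event": "start"}),
    json.dumps({
        "event": "media",
        "media": {"payload": base64.b64encode(b"\xff" * 160).decode("ascii")},
    }),
    json.dumps({"event": "stop"}),
]


def describe_message(raw: str) -> str:
    """Parse one JSON message and return a short description of it."""
    data = json.loads(raw)
    event = data["event"]
    if event == "media":
        # The payload is base64 text; decoding yields the raw audio bytes
        audio = base64.b64decode(data["media"]["payload"])
        return f"media: {len(audio)} bytes of audio"
    return f"{event}: no audio"


if __name__ == "__main__":
    for message in SAMPLE_MESSAGES:
        print(describe_message(message))
```

The decoding step shown here is the same one the application needs later, when the audio bytes rather than the base64 text are sent on for transcription.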

Start your Flask application by running python main.py from a terminal in the project directory, and then call your Twilio number. You will start seeing a stream of base64-encoded audio data printed in your console.

Step 6: Define a real-time transcriber

We now have our Flask application running, receiving calls to a Twilio number via an ngrok tunnel, and printing the speech data to the console. It's time to add real-time transcription.

Create a new file in your project directory called twilio_transcriber.py. We will define the object that we use to perform the real-time transcription in this module. At the top of the file, add the following code for imports and to set the AssemblyAI API key and define the Twilio audio sample rate:

import os
import threading
from typing import Type
import assemblyai as aai
from assemblyai.streaming.v3 import (
    BeginEvent,
    StreamingClient,
    StreamingClientOptions,
    StreamingError,
    StreamingEvents,
    StreamingParameters,
    TerminationEvent,
    TurnEvent,
)
from dotenv import load_dotenv

load_dotenv()

aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')
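TWILIO_SAMPLE_RATE = 8000  # Hz
BUFFER_SIZE_MS = 100  # buffer roughly 100 ms of audio before sending
BYTES_PER_MS = TWILIO_SAMPLE_RATE // 1000  # mu-law is one byte per sample, so 8 bytes per ms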

Now we will add handlers for the four types of messages we will receive from AssemblyAI. Add the below functions to twilio_transcriber.py:

def on_begin(client: StreamingClient, event: BeginEvent):
    """Called when the connection has been established."""
    print(f"Session ID: {event.id}\n")

def on_turn(client: StreamingClient, event: TurnEvent):
    """Called when a new transcript has been received."""
    # Skip empty transcripts
    if event.transcript.strip():
        is_formatted = hasattr(event, 'turn_is_formatted') and event.turn_is_formatted

        if is_formatted:
            # Final formatted transcript: print it on its own line
            transcript_display.add_final(event.transcript)
        elif event.end_of_turn:
            # Final unformatted transcript: wait for the formatted version
            pass
        else:
            # Partial transcript: overwrite the current line
            transcript_display.update_partial(event.transcript)

def on_error(client: StreamingClient, error: StreamingError):
    """Called when an error occurs."""
    print(f"\nError: {error}")

def on_terminated(client: StreamingClient, event: TerminationEvent):
    """Called when the connection has been closed."""
    print(f"\nSession ended - {event.audio_duration_seconds} seconds processed")

We will also create a TranscriptDisplay class that manages the visual output of the transcript text.

class TranscriptDisplay:
    def __init__(self):
        self.current_partial = ""
        self.lock = threading.Lock()
        self.current_line_printed = False
    
    def update_partial(self, text):
        with self.lock:
            self.current_partial = text
            self._display_partial()
    
    def add_final(self, text):
        with self.lock:
            if self.current_partial:
                print("\r" + " " * (len(self.current_partial) + 20) + "\r", end="", flush=True)
            
            print(f"{text}")
            
            self.current_partial = ""
            self.current_line_printed = False
    
    def _display_partial(self):
        print("\r" + " " * 100 + "\r", end="", flush=True)
        if self.current_partial:
            print(f"{self.current_partial}", end="", flush=True)


transcript_display = TranscriptDisplay()

on_begin is called when a connection has been established, and on_terminated is called when it has been terminated. on_error is called when there is an error. For each of these message types, we print relevant information - most importantly, on_begin displays the unique session ID that AssemblyAI assigns to each streaming session.

The on_turn function is called when our Flask application receives transcription data from AssemblyAI's Universal-Streaming service. This function handles the core transcription logic and does one of several things based on the transcript content and type.

If there is no transcript (i.e., the transcript is empty), we simply skip processing and return early. If we receive a transcript, we handle it based on the transcript properties provided by the Universal-Streaming API.

AssemblyAI's Universal-Streaming API sends three types of turn events: partial transcripts, final unformatted transcripts, and final formatted transcripts. Partial transcripts are sent in real-time as someone is speaking, gradually building up the transcript of the current utterance. Each partial transcript contains the complete text for the current utterance up to that point, not just the new words since the last update.

When the model detects that an utterance is complete, it first sends a final unformatted transcript with end_of_turn set to true. Shortly after (because we enable format_turns=True when connecting), it sends a final formatted transcript with turn_is_formatted set to true. This formatted version includes proper punctuation, capitalization, and formatting for entities like numbers and dates.
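The three kinds of turn event can be told apart with the same checks that on_turn performs. The sketch below illustrates this on stand-in objects rather than real TurnEvent instances: the objects carry only the three attributes discussed here, the transcript text is invented, and classify_turn is a hypothetical helper.

```python
from types import SimpleNamespace


def classify_turn(event) -> str:
    """Label an event as 'empty', 'partial', 'final_unformatted', or 'final_formatted'."""
    if not event.transcript.strip():
        return "empty"
    if getattr(event, "turn_is_formatted", False):
        return "final_formatted"
    if event.end_of_turn:
        return "final_unformatted"
    return "partial"


# Stand-in events for one utterance, in the order described above
events = [
    SimpleNamespace(transcript="hello", end_of_turn=False, turn_is_formatted=False),
    SimpleNamespace(transcript="hello world", end_of_turn=False, turn_is_formatted=False),
    SimpleNamespace(transcript="hello world", end_of_turn=True, turn_is_formatted=False),
    SimpleNamespace(transcript="Hello world.", end_of_turn=True, turn_is_formatted=True),
]

if __name__ == "__main__":
    for event in events:
        print(classify_turn(event), "->", event.transcript)
```

Checking turn_is_formatted before end_of_turn matters in this sketch, since the stand-in formatted event has both flags set.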

To handle these incoming messages and create a smooth user experience, we use a TranscriptDisplay class that manages the visual output. For partial transcripts, we update the current line using carriage returns (\r) to overwrite the previous partial text, creating the visual effect of words appearing and updating in real-time as they're being spoken. When we receive the final formatted transcript, we clear any partial text and print the complete, properly formatted utterance on a new line. 
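The carriage-return technique is easy to try in isolation. The following is a minimal sketch of it, separate from the TranscriptDisplay class: render_partial is a hypothetical function that blanks out the previously shown text and writes the new text on the same line.

```python
import io
import sys


def render_partial(stream, text: str, previous_length: int) -> int:
    """Overwrite the current line with `text` and return its length.

    Writes a carriage return, blanks out the previous text, returns to the
    start of the line again, then writes the new text without a newline.
    """
    stream.write("\r" + " " * previous_length + "\r" + text)
    stream.flush()
    return len(text)


if __name__ == "__main__":
    shown = 0
    for partial in ["hello", "hello wor", "hello world"]:
        shown = render_partial(sys.stdout, partial, shown)
    print()  # move to a new line once the utterance is finished

    # Capturing the output shows the control characters that were written
    captured = io.StringIO()
    render_partial(captured, "hi", 5)
    print(repr(captured.getvalue()))
```

Tracking the length of the previous text, as this sketch does, avoids having to blank out a fixed number of columns on every update.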

Now that we have our handlers defined, we need to define the class that we will actually use to perform the transcription. Add the below class to twilio_transcriber.py:

class TwilioTranscriber(StreamingClient):
    def __init__(self):
        options = StreamingClientOptions(
            api_key=aai.settings.api_key,
            api_host="streaming.assemblyai.com"
        )


        super().__init__(options)
        
        self.on(StreamingEvents.Begin, on_begin)
        self.on(StreamingEvents.Turn, on_turn)  
        self.on(StreamingEvents.Termination, on_terminated)
        self.on(StreamingEvents.Error, on_error)
        
        self.audio_buffer = bytearray()
        self.buffer_size_bytes = BUFFER_SIZE_MS * BYTES_PER_MS  
        self.buffer_lock = threading.Lock()
        self.is_active = False
        
        global transcript_display
        transcript_display = TranscriptDisplay()
    
    def start_transcription(self):
        params = StreamingParameters(
            sample_rate=TWILIO_SAMPLE_RATE,
            encoding=aai.AudioEncoding.pcm_mulaw,
            format_turns=True
        )
        self.is_active = True
        self.connect(params)
    
    def stream_audio(self, audio_data: bytes):
        if not self.is_active:
            return
            
        with self.buffer_lock:
            self.audio_buffer.extend(audio_data)
            
            if len(self.audio_buffer) >= self.buffer_size_bytes:
                self._flush_buffer()
    
    def _flush_buffer(self):
        if len(self.audio_buffer) > 0:
            buffered_audio = bytes(self.audio_buffer)
            try:
                self.stream(buffered_audio)
            except Exception as e:
                print(f"\nError sending audio: {e}")
            
            self.audio_buffer.clear()
    
    def stop_transcription(self):
        self.is_active = False
        
        with self.buffer_lock:
            if len(self.audio_buffer) > 0:
                self._flush_buffer()
        
        self.disconnect(terminate=True)

Our TwilioTranscriber is a subclass of the StreamingClient class from AssemblyAI's Universal-Streaming v3 API. We define the initializer of TwilioTranscriber by first creating a StreamingClientOptions object that contains our API key and the correct streaming host (streaming.assemblyai.com), then passing these options to the parent StreamingClient constructor.

After initialization, we register our event handlers using the .on() method for different streaming events: Begin, Turn, Termination, and Error. 

The class also implements audio buffering, since Twilio sends very small audio chunks (around 20 ms each) but AssemblyAI's Universal-Streaming API requires chunks between 50 and 1000 ms long. We buffer incoming audio data until we have approximately 100 ms worth (800 bytes at 8 kHz, since μ-law audio uses one byte per sample), then send it to AssemblyAI.
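The buffer size follows from simple arithmetic: 8000 samples per second at one byte per sample is 8 bytes per millisecond, so 100 ms is 800 bytes and a 20 ms chunk is 160 bytes. The sketch below works through those numbers with a stripped-down buffer. ChunkBuffer and the constant names are hypothetical, and locking and the network call are left out to keep the idea visible.

```python
SAMPLE_RATE_HZ = 8000   # sample rate of the Twilio stream
BYTES_PER_SAMPLE = 1    # mu-law audio uses one byte per sample
BYTES_PER_MS = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE // 1000  # 8 bytes per millisecond
BUFFER_SIZE_MS = 100    # target chunk duration
BUFFER_SIZE_BYTES = BUFFER_SIZE_MS * BYTES_PER_MS         # 800 bytes


class ChunkBuffer:
    """Collect small audio chunks and release them in larger batches."""

    def __init__(self, flush_size: int = BUFFER_SIZE_BYTES):
        self.flush_size = flush_size
        self._buffer = bytearray()

    def add(self, chunk: bytes):
        """Append a chunk; return the buffered bytes once the threshold is reached, else None."""
        self._buffer.extend(chunk)
        if len(self._buffer) >= self.flush_size:
            batch = bytes(self._buffer)
            self._buffer.clear()
            return batch
        return None


if __name__ == "__main__":
    twilio_chunk = b"\xff" * (20 * BYTES_PER_MS)  # one 20 ms chunk = 160 bytes
    buffer = ChunkBuffer()
    for i in range(1, 6):
        batch = buffer.add(twilio_chunk)
        print(i, None if batch is None else len(batch))
```

With 20 ms chunks, every fifth call releases an 800-byte batch, which sits comfortably inside the accepted chunk range.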

When starting transcription, we create StreamingParameters that specify a sample rate of 8000 Hz and PCM μ-law encoding (which are the settings Twilio streams use), and we enable format_turns=True to receive properly formatted final transcripts with punctuation and capitalization. The connection is established by calling the connect() method with these parameters.

Step 7: Add real-time transcription to the WebSocket

Now that we have defined TwilioTranscriber, we need to use it in our main application code. In main.py, import base64 and TwilioTranscriber, and then modify the transcription_websocket to match the below code:

import base64
from twilio_transcriber import TwilioTranscriber

# ...

@sock.route(WEBSOCKET_ROUTE)
def transcription_websocket(ws):
    transcriber = None
    
    try:
        while True:
            data = json.loads(ws.receive())
            match data['event']:
                case "connected":
                    print('Twilio connected, starting transcriber...')
                    transcriber = TwilioTranscriber()
                    transcriber.start_transcription()
                case "start":
                    print('Call started')
                case "media": 
                    if transcriber:
                        payload_b64 = data['media']['payload']
                        payload_mulaw = base64.b64decode(payload_b64)
                        transcriber.stream_audio(payload_mulaw)
                case "stop":
                    print('Call ended')
                    if transcriber:
                        transcriber.stop_transcription()
                    break
    except Exception as e:
        print(f"Error in websocket: {e}")
        import traceback
        traceback.print_exc()
    finally:
        if transcriber:
            try:
                transcriber.stop_transcription()
            except Exception as cleanup_error:
                print(f"Cleanup error: {cleanup_error}")

We've updated our connected handler to instantiate a TwilioTranscriber and connect to AssemblyAI's servers, updated the media handler to decode the base64-encoded audio data and then pass the resulting bytes to the transcriber's stream_audio method, and updated the stop handler to close the transcriber's connection to AssemblyAI's servers.

Finally, update the "<Say>" tags in the receive_call function to contain a fitting phrase now that our console will print the audio transcription rather than just the audio data:

<Say>
    Speak to see your speech transcribed in the console
</Say>

Run python main.py in a terminal from the project directory, and call your Twilio number. As you speak, you will see your speech transcribed in the console.

Step 8: Automatically set the Twilio webhook and ngrok tunnel

Our application is running and fully functional, but we can further improve it. Currently, every time we want to run the application, we must open an ngrok tunnel in a separate terminal and then copy the forwarding URL from this terminal into Twilio's console in the browser.

This is fairly laborious, so it's time to automate these steps. First, update your .env file to include your Twilio number as a TWILIO_NUMBER environment variable:

NGROK_AUTHTOKEN=replace-this
TWILIO_ACCOUNT_SID=replace-this
TWILIO_API_KEY_SID=replace-this
TWILIO_API_SECRET=replace-this
ASSEMBLYAI_API_KEY=replace-this
TWILIO_NUMBER=replace-this

The number should be written in E.164 format: a plus sign, then the country code, then the remaining digits, with no spaces or punctuation. For example, a United States number starts with +1.
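Since the code later in this step compares TWILIO_NUMBER against the numbers on the account, a malformed value is worth catching early. The following is a minimal sketch of a format check, assuming the number is stored in E.164 form (a plus sign followed by at most 15 digits, the first of which is not zero). is_e164 is a hypothetical helper, and it only checks the shape of the string, not whether the number exists.

```python
import re

# E.164 shape: "+", a non-zero first digit, then 1 to 14 more digits
E164_PATTERN = re.compile(r"\+[1-9][0-9]{1,14}")


def is_e164(number: str) -> bool:
    """Return True if `number` has the shape of an E.164 phone number."""
    return E164_PATTERN.fullmatch(number) is not None


if __name__ == "__main__":
    for candidate in ["+1234567891", "1234567891", "+1 (234) 567-891"]:
        print(candidate, is_e164(candidate))
```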

Now, update the top of your main.py file as follows:

import base64
import json
import os
from flask import Flask, request, Response
from flask_sock import Sock
import ngrok
from twilio.rest import Client
from dotenv import load_dotenv

load_dotenv()

from twilio_transcriber import TwilioTranscriber

# ...

# Twilio authentication
account_sid = os.environ['TWILIO_ACCOUNT_SID']
api_key = os.environ['TWILIO_API_KEY_SID']
api_secret = os.environ['TWILIO_API_SECRET']
client = Client(api_key, api_secret, account_sid)

# Twilio phone number to call
TWILIO_NUMBER = os.environ['TWILIO_NUMBER']

# ngrok authentication
ngrok.set_auth_token(os.getenv("NGROK_AUTHTOKEN"))

We've added authentication variables to instantiate a Twilio Client, and imported our Twilio phone number environment variable. Finally, we've set our ngrok auth token through ngrok.set_auth_token.

Next, update the script's main block as follows:

if __name__ == "__main__":
    try:
        # Open an ngrok tunnel to the local Flask port
        listener = ngrok.forward(f"http://localhost:{PORT}")
        print(f"Ngrok tunnel opened at {listener.url()} for port {PORT}")
        NGROK_URL = listener.url()

        # Point the Twilio number's voice webhook at the tunnel
        twilio_numbers = client.incoming_phone_numbers.list()
        twilio_number_sid = [num.sid for num in twilio_numbers if num.phone_number == TWILIO_NUMBER][0]
        client.incoming_phone_numbers(twilio_number_sid).update(account_sid, voice_url=f"{NGROK_URL}{INCOMING_CALL_ROUTE}")

        # Run the app
        app.run(port=PORT, debug=DEBUG)
    finally:
        # Always close the tunnel, even if the app exits with an error
        ngrok.disconnect()

First, we open up an ngrok tunnel with ngrok.forward, and then use the twilio library to programmatically set our Twilio number's voice webhook to the URL of the tunnel. The incoming_phone_numbers resource is looked up by SID rather than by phone number, so we first isolate our number's SID with a list comprehension and then pass that SID in to update the webhook. Finally, we run our app as before with app.run(). All of this code is wrapped in a try…finally block that ensures our ngrok tunnel is always closed, even if the app exits with an error.
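One thing to be aware of is that a list comprehension followed by [0] raises a bare IndexError if TWILIO_NUMBER does not match any number on the account, for instance because of a typo or a missing plus sign. The sketch below shows one possible lookup with a clearer error message. find_number_sid is a hypothetical helper, and the stand-in objects with invented values take the place of the list returned by client.incoming_phone_numbers.list().

```python
from types import SimpleNamespace


def find_number_sid(numbers, phone_number: str) -> str:
    """Return the SID of the entry whose phone_number matches, or raise a clear error."""
    for number in numbers:
        if number.phone_number == phone_number:
            return number.sid
    raise LookupError(f"{phone_number} was not found on this Twilio account")


# Stand-in objects with invented values, exposing only the two attributes used
numbers = [
    SimpleNamespace(sid="PN-example-1", phone_number="+15550000001"),
    SimpleNamespace(sid="PN-example-2", phone_number="+15550000002"),
]

if __name__ == "__main__":
    print(find_number_sid(numbers, "+15550000002"))
```

In main.py, the result of such a helper could replace twilio_number_sid without changing the rest of the block.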

If you have a free ngrok account you can only have one tunnel open at a time, so close your previous tunnel if it is still open, and then run python main.py in order to execute our program. Call your Twilio number and speak to see your speech transcribed to the console without having to manually open an ngrok tunnel and update the Twilio console.
