Transcribe phone calls in real-time in Go with Twilio and AssemblyAI

Learn how to transcribe a Twilio phone call in real-time using Go in this tutorial.

In this tutorial, you'll build a server application in Go that transcribes an incoming Twilio phone call into text in real-time.

You'll use a Twilio MediaStream to stream the voice data to your local server application. You'll pass the voice data to AssemblyAI to transcribe the call into text, and then print it in your terminal in real-time.

By the end of this tutorial, you'll be able to:

  • Record a phone call in real-time using a Twilio MediaStream connection.
  • Transcribe audio in real-time using the AssemblyAI Go SDK.

Before you get started

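To complete this tutorial, you'll need:

  • Go installed on your computer.
  • A Twilio account with a phone number.
  • An AssemblyAI account and API key.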

Step 1: Set up ngrok

Twilio will need to access your server through a publicly available URL. In this tutorial, you'll use ngrok to create a publicly available URL for an application running on your local computer.

💡
If you already have a preferred way to expose your application publicly, you may skip this step.

  1. Sign up for an ngrok account.

  2. Install ngrok for your platform.

  3. Authenticate your ngrok agent using Your Authtoken.

    ngrok config add-authtoken <YOUR_TOKEN>
    
  4. Open an ngrok tunnel for port 8080. ngrok will only tunnel connections while the following command is running.

    ngrok http 8080
    

You'll see something similar to the output below, where the URL next to Forwarding is the publicly available URL that forwards to your local 8080 port (https://84c5df474.ngrok-free.dev in the example output).

ngrok                                                                   (Ctrl+C to quit)

Session Status                online
Account                       inconshreveable (Plan: Free)
Version                       3.0.0
Region                        United States (us)
Latency                       78ms
Web Interface                 http://127.0.0.1:4040
Forwarding                    https://84c5df474.ngrok-free.dev -> http://localhost:8080

Connections                   ttl     opn     rt1     rt5     p50     p90
                              0       0       0.00    0.00    0.00    0.00

Sample output for ngrok http 8080

Copy the Forwarding URL in your terminal output and save it for the next step.

Step 2: Set up Twilio

You'll need to register a phone number with Twilio and configure it to call your server application whenever someone calls that number. You can also use the Twilio console to update the voice URL for your phone number.

  1. Sign up for a Twilio account.

  2. Download Twilio CLI.

  3. In a new terminal, log in using Twilio CLI. You'll be asked to enter an identifier for your new profile, for example dev.

    twilio login
    
  4. Select the profile you created.

    twilio profiles:use <YOUR_PROFILE_ID>
    
  5. Update the voice URL for your phone number.

    twilio phone-numbers:update <YOUR_TWILIO_NUMBER> --voice-url <YOUR_NGROK_URL>
    
💡
You'll find the Account SID, Auth Token, and phone number under Account info on your Twilio console.

Now, when someone calls your phone number, Twilio sends a request to your ngrok URL, which forwards it to port 8080 on your local computer. Because you don't have to deploy every change to a cloud instance, you can iterate faster during development.

Next up, you'll build the server application to handle the phone call.

Step 3: Create a WebSocket server for Twilio media streams

In this step, you'll set up the Go server application to accept the Twilio MediaStream.

💡
This tutorial walks you through the code step by step, but you can also find the complete source code at the end of this page.

  1. Create and navigate into a new project directory.

    mkdir realtime-transcription-go
    cd realtime-transcription-go
    
  2. Initialize your Go module.

    go mod init realtime-transcription-go
    
  3. Install the AssemblyAI Go SDK.

    go get github.com/AssemblyAI/assemblyai-go-sdk
    
  4. Install the nhooyr.io/websocket module by Anmol Sethi. You'll need this to handle the incoming Twilio MediaStream connection.

    go get nhooyr.io/websocket
    
  5. Create a new file called main.go with the following content:

    package main
    
    import (
        "log"
        "net/http"
        "os"
    )
    
    var apiKey = os.Getenv("ASSEMBLYAI_API_KEY")
    
    func main() {
        http.HandleFunc("/", twilio)
        http.HandleFunc("/media", media)
        
        log.Println("Server is running on port 8080")
    
        if err := http.ListenAndServe(":8080", nil); err != nil {
            log.Fatal(err)
        }
    }
    
    func twilio(w http.ResponseWriter, r *http.Request) {
        // Respond to Twilio request and initiate a Twilio MediaStream.
    }
    
    func media(w http.ResponseWriter, r *http.Request) {
        // Serve the incoming WebSocket connection from Twilio.
    }
    
  6. Set the ASSEMBLYAI_API_KEY environment variable. Replace <YOUR_API_KEY> with your AssemblyAI API key.

    export ASSEMBLYAI_API_KEY=<YOUR_API_KEY>
    
  7. To start the server, enter the following in your terminal:

    go run main.go
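
The server reads ASSEMBLYAI_API_KEY once at startup, so a missing variable only shows up later as a failed connection to AssemblyAI. The following is a minimal sketch of a startup check you could add, not part of the tutorial's code. The helper name apiKeyFrom is hypothetical; it takes a lookup function with the same signature as os.LookupEnv so that it can be exercised without touching the real environment.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// apiKeyFrom is a hypothetical helper: it returns the AssemblyAI API key
// found by lookup, or an error if the variable is unset or empty.
// lookup has the same signature as os.LookupEnv.
func apiKeyFrom(lookup func(string) (string, bool)) (string, error) {
	key, ok := lookup("ASSEMBLYAI_API_KEY")
	if !ok || key == "" {
		return "", errors.New("ASSEMBLYAI_API_KEY is not set")
	}
	return key, nil
}

func main() {
	if _, err := apiKeyFrom(os.LookupEnv); err != nil {
		// In main.go you would likely call log.Fatal(err) here instead.
		fmt.Println("startup check failed:", err)
		return
	}
	fmt.Println("API key found")
}
```

Failing early like this makes a misconfigured terminal session obvious before you place a test call.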
    

You now have a running server. Next, you'll implement the twilio and media HTTP handler functions to accept the incoming request from Twilio.

Step 4: Initiate Twilio media stream

When someone calls the phone number, Twilio makes an HTTP request to the voice URL you configured. Your response tells Twilio how to handle the call.

In this step, you'll use TwiML to define the instructions that tell Twilio what to do when you receive an incoming call:

  1. Instruct the caller on how to use the transcriber.
  2. Ask Twilio to stream the audio to the Go server using a WebSocket.

To issue your TwiML instructions, add "fmt" to your imports and add the following code to your twilio function:

func twilio(w http.ResponseWriter, r *http.Request) {
	if r.Method != "POST" {
		w.WriteHeader(http.StatusMethodNotAllowed)
		return
	}

	twiML := `<?xml version="1.0" encoding="UTF-8"?>
<Response>
   <Say>
      Speak to see your audio transcribed in the console.
   </Say>
   <Connect>
      <Stream url='%s' />
   </Connect>
</Response>`

	w.Header().Set("Content-Type", "application/xml")
	fmt.Fprintf(w, twiML+"\n", "wss://"+r.Host+"/media")
}

When Twilio receives your instructions, it attempts to establish a WebSocket connection to the /media path of your server application.
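
You can check the handler's response without placing a call. The sketch below repeats the handler from this step and drives it with net/http/httptest from the standard library. The host example.ngrok-free.dev and the helpers streamURL and statusFor are made up for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

const twiMLTemplate = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
   <Say>
      Speak to see your audio transcribed in the console.
   </Say>
   <Connect>
      <Stream url='%s' />
   </Connect>
</Response>
`

// streamURL builds the WebSocket URL that Twilio is asked to connect to.
func streamURL(host string) string {
	return "wss://" + host + "/media"
}

// twilio mirrors the handler from this step.
func twilio(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		w.WriteHeader(http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/xml")
	fmt.Fprintf(w, twiMLTemplate, streamURL(r.Host))
}

// statusFor sends a request with the given method to the handler
// and returns the response status code.
func statusFor(method string) int {
	req := httptest.NewRequest(method, "https://example.ngrok-free.dev/", nil)
	rec := httptest.NewRecorder()
	twilio(rec, req)
	return rec.Code
}

func main() {
	// Simulate the POST request that Twilio makes for an incoming call.
	req := httptest.NewRequest(http.MethodPost, "https://example.ngrok-free.dev/", nil)
	rec := httptest.NewRecorder()
	twilio(rec, req)

	fmt.Println("status:", rec.Code)
	fmt.Print(rec.Body.String())
}
```

The printed TwiML should contain a Stream element whose url is wss://example.ngrok-free.dev/media, because the handler derives the host from the incoming request.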

Step 5: Handle incoming Twilio media stream

In this step, you'll define how to accept the incoming WebSocket connection from Twilio and process incoming messages. This tutorial handles four types of messages:

  • connected tells you that the media stream is connected.
  • start tells you that Twilio started the media stream. After this you'll receive several media messages.
  • media contains audio samples for the incoming phone call.
  • stop tells you that Twilio stopped the media stream.

To accept the WebSocket connection, and handle the messages:

  1. Define a struct to represent each incoming WebSocket message. Messages are JSON encoded, so you'll use wsjson from the websocket module to decode them into structs.

    type TwilioMessage struct {
        Event string `json:"event"`
        Media struct {
            // Contains audio samples. Twilio sends them base64-encoded;
            // the JSON decoder turns them into raw bytes.
            Payload []byte `json:"payload"`
        } `json:"media"`
    }
    
  2. Import the context package and the WebSocket modules:

    import (
       "context"
       // ...
       
       "nhooyr.io/websocket"
       "nhooyr.io/websocket/wsjson"
    )
    
  3. Add the following code to the media HTTP handler function to accept the connection and log each event:

    func media(w http.ResponseWriter, r *http.Request) {
    	// Upgrade HTTP request to WebSocket.
    	c, err := websocket.Accept(w, r, nil)
    	if err != nil {
    		// Accept has already written an error response to w.
    		log.Println("unable to upgrade connection to websocket:", err)
    		return
    	}
    	defer c.CloseNow()
    
    	ctx, cancel := context.WithCancel(r.Context())
    	defer cancel()
    
    	for {
    		var message TwilioMessage
    		err = wsjson.Read(ctx, c, &message)
    		if err != nil {
    			log.Println("unable to read twilio message:", err)
    			c.Close(websocket.StatusInternalError, err.Error())
    			return
    		}
    
    		switch message.Event {
    		case "connected":
    			log.Println("twilio mediastream connected")
    		case "start":
    			log.Println("twilio mediastream started")
    		case "media":
    			// TODO: Send audio to AssemblyAI for transcription.
    		case "stop":
    			log.Println("twilio mediastream stopped")
    
    			c.Close(websocket.StatusNormalClosure, "")
    
    			return
    		}
    	}
    }
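
To see why Payload can be declared as []byte, it helps to decode a message by hand. The sketch below uses encoding/json from the standard library, which applies the same base64 rule for byte slices. The message is a hypothetical, trimmed-down example; real Twilio messages carry additional fields, which the decoder ignores. The decode helper is also made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TwilioMessage matches the struct defined in this step.
type TwilioMessage struct {
	Event string `json:"event"`
	Media struct {
		// Base64 text in the JSON becomes raw bytes here.
		Payload []byte `json:"payload"`
	} `json:"media"`
}

// decode parses a single JSON-encoded message.
func decode(raw string) (TwilioMessage, error) {
	var m TwilioMessage
	err := json.Unmarshal([]byte(raw), &m)
	return m, err
}

func main() {
	// "//8A" is base64 for the three bytes 0xFF 0xFF 0x00.
	raw := `{"event":"media","media":{"payload":"//8A"}}`

	msg, err := decode(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}

	// Prints: media: 3 bytes ff ff 00
	fmt.Printf("%s: %d bytes % x\n", msg.Event, len(msg.Media.Payload), msg.Media.Payload)
}
```

Because the decoding happens while reading the message, the handler can pass message.Media.Payload along as raw audio without a separate base64 step.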
    

Next, you'll transcribe the incoming audio samples using AssemblyAI's Real-Time Transcription.

Step 6: Transcribe media stream using AssemblyAI real-time transcription

AssemblyAI Real-Time Transcription lets you transcribe voice data while it's being spoken.

  1. Create a transcriber that implements RealTimeHandler to handle real-time messages from AssemblyAI.

    type realtimeTranscriber struct {
    }
    
    func (t *realtimeTranscriber) SessionBegins(ev aai.SessionBegins) {
        log.Println("session begins")
    }
    
    func (t *realtimeTranscriber) SessionTerminated(ev aai.SessionTerminated) {
        log.Println("session terminated")
    }
    
    func (t *realtimeTranscriber) FinalTranscript(transcript aai.FinalTranscript) {
        fmt.Printf("%s\r\n", transcript.Text)
    }
    
    func (t *realtimeTranscriber) PartialTranscript(transcript aai.PartialTranscript) {
        // Ignore silence.
        if transcript.Text == "" {
            return
        }
    
        fmt.Printf("%s\r", transcript.Text)
    }
    
    func (t *realtimeTranscriber) Error(err error) {
        log.Println("something bad happened:", err)
    }
    
  2. Import the AssemblyAI SDK.

    import (
       // ...
       
       aai "github.com/AssemblyAI/assemblyai-go-sdk"
    )
    
  3. In the media HTTP handler function, create and connect the real-time client just before the for loop.

    handler := new(realtimeTranscriber)
    
    client := aai.NewRealTimeClientWithOptions(
        aai.WithRealTimeAPIKey(apiKey),
    
        // Twilio MediaStream sends audio in mu-law format.
        aai.WithRealTimeEncoding(aai.RealTimeEncodingPCMMulaw),
    
        // Twilio MediaStream sends audio at 8000 samples per second.
        aai.WithRealTimeSampleRate(8000),
    
        aai.WithHandler(handler),
    )
    
    if err := client.Connect(ctx); err != nil {
        log.Println("unable to connect to real-time transcription:", err)
        c.Close(websocket.StatusInternalError, err.Error())
        return
    }
    
    log.Println("connected to real-time transcription")
    
  4. In the switch statement, forward the incoming audio samples to AssemblyAI.

    case "media":
        if err := client.Send(ctx, message.Media.Payload); err != nil {
            log.Println("unable to send audio for real-time transcription:", err)
            c.Close(websocket.StatusInternalError, err.Error())
            return
        }
    
  5. Disconnect the transcriber once the media stream has stopped.

    case "stop":
        log.Println("twilio mediastream stopped")
    
        // Passing true waits for AssemblyAI to confirm that the session has terminated.
        if err := client.Disconnect(ctx, true); err != nil {
            log.Println("unable to disconnect from real-time transcription:", err)
            c.Close(websocket.StatusInternalError, err.Error())
            return
        }
    
        log.Println("disconnected from real-time transcription")
        c.Close(websocket.StatusNormalClosure, "")
    
        return
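
The encoding and sample rate passed to the client determine how much audio each payload represents. As a rough sketch: mu-law stores one sample per byte, so at 8000 samples per second, 8000 bytes make up one second of audio. The audioDuration helper is hypothetical, and the 160-byte chunk size is only an assumed example.

```go
package main

import (
	"fmt"
	"time"
)

// sampleRate matches the value passed to WithRealTimeSampleRate.
const sampleRate = 8000

// audioDuration returns how much audio n bytes of 8-bit mu-law represent
// at 8000 samples per second (one byte per sample).
func audioDuration(n int) time.Duration {
	return time.Duration(n) * time.Second / sampleRate
}

func main() {
	fmt.Println(audioDuration(160))  // 20ms
	fmt.Println(audioDuration(8000)) // 1s
}
```

A calculation like this can be handy for logging how much audio you have forwarded, for example by summing len(message.Media.Payload) across media messages.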
    

Step 7: Test your application

Start the server with go run main.go and call your phone number. If prompted by your operating system, allow the application to access the network.

Once instructed by the voice, start speaking to see your call transcribed in your server log.

session begins
connected to real-time transcription
twilio mediastream connected
twilio mediastream started
Hi. I've arrived at the gate with your food delivery.
twilio mediastream stopped
session terminated
disconnected from real-time transcription

Learn more

In this tutorial, you built a Go application that transcribes incoming phone calls in real-time, using Twilio and AssemblyAI.



Complete source code

package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"

	"nhooyr.io/websocket"
	"nhooyr.io/websocket/wsjson"

	aai "github.com/AssemblyAI/assemblyai-go-sdk"
)

type realtimeTranscriber struct {
}

func (t *realtimeTranscriber) SessionBegins(ev aai.SessionBegins) {
	log.Println("session begins")
}

func (t *realtimeTranscriber) SessionTerminated(ev aai.SessionTerminated) {
	log.Println("session terminated")
}

func (t *realtimeTranscriber) FinalTranscript(transcript aai.FinalTranscript) {
	fmt.Printf("%s\r\n", transcript.Text)
}

func (t *realtimeTranscriber) PartialTranscript(transcript aai.PartialTranscript) {
	// Ignore silence.
	if transcript.Text == "" {
		return
	}

	fmt.Printf("%s\r", transcript.Text)
}

func (t *realtimeTranscriber) Error(err error) {
	log.Println("something bad happened:", err)
}

var apiKey = os.Getenv("ASSEMBLYAI_API_KEY")

func main() {
	http.HandleFunc("/", twilio)
	http.HandleFunc("/media", media)

	log.Println("Server is running on port 8080")

	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatal(err)
	}
}

func twilio(w http.ResponseWriter, r *http.Request) {
	if r.Method != "POST" {
		w.WriteHeader(http.StatusMethodNotAllowed)
		return
	}

	twiML := `<?xml version="1.0" encoding="UTF-8"?>
<Response>
   <Say>
      Speak to see your audio transcribed in the console.
   </Say>
   <Connect>
      <Stream url='%s' />
   </Connect>
</Response>`

	w.Header().Set("Content-Type", "application/xml")
	fmt.Fprintf(w, twiML+"\n", "wss://"+r.Host+"/media")
}

type TwilioMessage struct {
	Event string `json:"event"`
	Media struct {
		// Contains audio samples. Twilio sends them base64-encoded;
		// the JSON decoder turns them into raw bytes.
		Payload []byte `json:"payload"`
	} `json:"media"`
}

func media(w http.ResponseWriter, r *http.Request) {
	// Upgrade HTTP request to WebSocket.
	c, err := websocket.Accept(w, r, nil)
	if err != nil {
		// Accept has already written an error response to w.
		log.Println("unable to upgrade connection to websocket:", err)
		return
	}
	defer c.CloseNow()

	ctx, cancel := context.WithCancel(r.Context())
	defer cancel()

	handler := new(realtimeTranscriber)

	client := aai.NewRealTimeClientWithOptions(
		aai.WithRealTimeAPIKey(apiKey),

		// Twilio MediaStream sends audio in mu-law format.
		aai.WithRealTimeEncoding(aai.RealTimeEncodingPCMMulaw),

		// Twilio MediaStream sends audio at 8000 samples per second.
		aai.WithRealTimeSampleRate(8000),

		aai.WithHandler(handler),
	)

	if err := client.Connect(ctx); err != nil {
		log.Println("unable to connect to real-time transcription:", err)
		c.Close(websocket.StatusInternalError, err.Error())
		return
	}

	log.Println("connected to real-time transcription")

	for {
		var message TwilioMessage
		err = wsjson.Read(ctx, c, &message)
		if err != nil {
			log.Println("unable to read twilio message:", err)
			c.Close(websocket.StatusInternalError, err.Error())
			return
		}

		switch message.Event {
		case "connected":
			log.Println("twilio mediastream connected")
		case "start":
			log.Println("twilio mediastream started")
		case "media":
			if err := client.Send(ctx, message.Media.Payload); err != nil {
				log.Println("unable to send audio for real-time transcription:", err)
				c.Close(websocket.StatusInternalError, err.Error())
				return
			}
		case "stop":
			log.Println("twilio mediastream stopped")

			if err := client.Disconnect(ctx, true); err != nil {
				log.Println("unable to disconnect from real-time transcription:", err)
				c.Close(websocket.StatusInternalError, err.Error())
				return
			}

			log.Println("disconnected from real-time transcription")
			c.Close(websocket.StatusNormalClosure, "")

			return
		}
	}
}

main.go