LiveKit

This guide assumes prior knowledge of LiveKit. If you're new to LiveKit, please check out our Building a Voice Agent with LiveKit and AssemblyAI guide first.

Overview

LiveKit is an open source platform for developers building realtime media applications. In this guide, we’ll show you how to integrate AssemblyAI’s streaming speech-to-text model into your LiveKit voice agent using the Agents framework.

Quick start

Installation

Install the plugin from PyPI:

pip install "livekit-agents[assemblyai]"

Authentication

The AssemblyAI plugin requires an AssemblyAI API key. Set ASSEMBLYAI_API_KEY in your .env file.
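For example, a minimal `.env` entry looks like the following (the key value is a placeholder):

```shell
# .env
ASSEMBLYAI_API_KEY=<your-assemblyai-api-key>
```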

You can obtain an AssemblyAI API key by signing up here.

Basic usage

Use AssemblyAI STT in an AgentSession or as a standalone transcription service:

from livekit.agents import AgentSession
from livekit.plugins import assemblyai, silero

session = AgentSession(
    stt=assemblyai.STT(
        end_of_turn_confidence_threshold=0.7,
        min_end_of_turn_silence_when_confident=160,  # in ms
        max_turn_silence=2400,  # in ms
    ),
    # ... llm, tts, etc.
    vad=silero.VAD.load(),  # VAD enabled for interruptions
    turn_detection="stt",   # enable STT-based turn detection
)

Configuration

Turn Detection (Key Feature)

AssemblyAI’s turn detection model was built specifically for voice agents, and you can tune it to fit your use case. It processes both acoustic and linguistic information to produce an end-of-turn confidence score on every inference; if that score exceeds the configured threshold, end of turn is triggered.

This custom model was designed to address two major issues with voice agents. With traditional VAD (voice activity detection) approaches based on silence alone, the agent may not wait for the user to finish their turn even when the words themselves suggest more is coming. Consider an utterance like “My credit card number is____”: if the user pauses to look up the number, traditional VAD may cut them off, whereas our turn detection model handles these situations far better.

Additionally, when the user is clearly done speaking, as in “What is my credit score?”, the model returns a high end-of-turn confidence that exceeds the threshold and triggers end of turn immediately, minimizing turnaround latency in those scenarios.
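The interaction between the confidence score and the silence timeouts can be sketched as a simple decision rule. This is an illustrative model of the behavior described above, with invented names, not the plugin’s actual implementation:

```python
def should_end_turn(confidence: float, silence_ms: int,
                    threshold: float = 0.7,
                    min_silence_when_confident: int = 160,
                    max_silence: int = 2400) -> bool:
    """Illustrative sketch: end the turn after a short silence when the
    model is confident; otherwise fall back to a longer silence timeout."""
    if confidence >= threshold:
        # Confident path: only minimal silence is required.
        return silence_ms >= min_silence_when_confident
    # Fallback path: wait out the maximum silence window.
    return silence_ms >= max_silence

# "What is my credit score?" -> high confidence, ends after a short pause
print(should_end_turn(confidence=0.95, silence_ms=200))   # True
# "My credit card number is..." -> low confidence, keeps waiting
print(should_end_turn(confidence=0.2, silence_ms=1000))   # False
```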

# STT-based turn detection (recommended)
turn_detection="stt"

stt=assemblyai.STT(
    end_of_turn_confidence_threshold=0.7,
    min_end_of_turn_silence_when_confident=160,  # in ms
    max_turn_silence=2400,  # in ms
)

Parameter tuning:

  • end_of_turn_confidence_threshold: Raise or lower the threshold depending on how confident the model should be before triggering end of turn based on the confidence score.
  • min_end_of_turn_silence_when_confident: Increase or decrease how long we wait before triggering end of turn when the model is confident.
  • max_turn_silence: Raise or lower how much silence is needed to trigger end of turn when it isn’t triggered by a high confidence score.
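As an example of how these parameters trade off, an agent that frequently collects long numbers or IDs might favor patience, while an agent handling short, predictable queries might favor latency. The values below are illustrative starting points, not recommendations:

```python
from livekit.plugins import assemblyai

# Patient: avoid cutting users off mid-number or mid-ID
patient_stt = assemblyai.STT(
    end_of_turn_confidence_threshold=0.85,       # require higher confidence
    min_end_of_turn_silence_when_confident=320,  # in ms
    max_turn_silence=4000,                       # in ms
)

# Snappy: minimize latency for short, predictable queries
snappy_stt = assemblyai.STT(
    end_of_turn_confidence_threshold=0.6,
    min_end_of_turn_silence_when_confident=160,  # in ms
    max_turn_silence=1600,                       # in ms
)
```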

You can also set turn_detection="vad" if you’d like turn detection to be based on Silero VAD instead of our advanced turn detection model.
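A sketch of the VAD-based alternative, assuming the Silero plugin is installed alongside the AssemblyAI plugin:

```python
from livekit.agents import AgentSession
from livekit.plugins import assemblyai, silero

session = AgentSession(
    stt=assemblyai.STT(),
    vad=silero.VAD.load(),
    turn_detection="vad",  # silence-based turn detection via Silero VAD
    # ... llm, tts, etc.
)
```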

For more information, see our Universal-Streaming end-of-turn detection guide and message-by-message breakdown.

Parameters

api_key
str

Your AssemblyAI API key.

sample_rate
int, defaults to 16000

The sample rate of the audio stream, in Hz.

encoding
str, defaults to pcm_s16le

The encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw.

format_turns
bool, defaults to True

Whether to return formatted final transcripts. If enabled, formatted final transcripts will be emitted shortly following an end-of-turn detection.

end_of_turn_confidence_threshold
float, defaults to 0.7

The confidence threshold to use when determining if the end of a turn has been reached.
In our API the default is 0.4, but the default in LiveKit is set to 0.65.

min_end_of_turn_silence_when_confident
int, defaults to 160

The minimum amount of silence, in milliseconds, required to detect end of turn when confident.

max_turn_silence
int, defaults to 2400

The maximum amount of silence, in milliseconds, allowed in a turn before end of turn is triggered.