LiveKit
This guide assumes prior knowledge of LiveKit. If you haven’t used LiveKit before, please check out our Building a Voice Agent with LiveKit and AssemblyAI guide first.
Overview
LiveKit is an open-source platform for developers building realtime media applications. In this guide, we’ll show you how to integrate AssemblyAI’s streaming speech-to-text model into your LiveKit voice agent using the Agents framework.
Quick start
Installation
Install the plugin from PyPI:
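The original install command did not survive extraction; assuming the plugin is published under the name livekit-plugins-assemblyai (the usual LiveKit plugin naming convention), the install looks like:

```shell
pip install livekit-plugins-assemblyai
```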
Authentication
The AssemblyAI plugin requires an AssemblyAI API key. Set ASSEMBLYAI_API_KEY in your .env file.
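For example, the .env entry would look like this (placeholder value shown):

```
ASSEMBLYAI_API_KEY=<your-api-key>
```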
You can obtain an AssemblyAI API key by signing up here.
Basic usage
Use AssemblyAI STT in an AgentSession or as a standalone transcription service:
Configuration
Turn Detection (Key Feature)
AssemblyAI’s turn detection model was built specifically for voice agents, and you can tune it to fit your use case. It processes both audio and linguistic information to produce an end-of-turn confidence score on every inference; if that score exceeds the configured threshold, end of turn is triggered.
This custom model was designed to address two major issues with voice agents. Traditional VAD (voice activity detection) approaches, which rely on silence alone, can cut a user off before they finish their turn even when the words suggest more is coming. Think of a situation like “My credit card number is____”: if the user pauses to look that up, traditional VAD may not wait for them, whereas our turn detection model handles these situations far better.
Additionally, when it is clear the user is done speaking, as in “What is my credit score?”, the model returns a high end-of-turn confidence score that exceeds the threshold and triggers end of turn, allowing for minimal turnaround latency in those scenarios.
Parameter tuning:
- end_of_turn_confidence_threshold: Raise or lower the confidence score required before end of turn is triggered.
- min_end_of_turn_silence_when_confident: Increase or decrease how long we wait before triggering end of turn when the model is confident.
- max_turn_silence: Raise or lower the amount of silence that triggers end of turn when a high confidence score hasn’t already done so.
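As a mental model, the way these three parameters interact can be sketched in plain Python. This is an illustration only, not the plugin’s actual implementation: 0.65 matches the LiveKit default threshold noted in the Parameters section below, while the 160 ms and 2400 ms silence defaults are assumed values for the example.

```python
def should_end_turn(
    eot_confidence: float,
    silence_ms: float,
    end_of_turn_confidence_threshold: float = 0.65,
    min_end_of_turn_silence_when_confident: float = 160.0,
    max_turn_silence: float = 2400.0,
) -> bool:
    """Illustrative decision rule: end the turn either when the model is
    confident (after a short minimum silence), or when silence alone
    exceeds the maximum allowed, regardless of confidence."""
    if eot_confidence >= end_of_turn_confidence_threshold:
        # Confident path: only a short silence is required.
        return silence_ms >= min_end_of_turn_silence_when_confident
    # Fallback path: a long enough silence triggers end of turn anyway.
    return silence_ms >= max_turn_silence
```

For example, a confident score with 200 ms of silence ends the turn immediately, while a low-confidence pause must stretch past max_turn_silence before the agent responds.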
You can also set turn_detection="vad" if you’d like turn detection to be based on Silero VAD instead of our advanced turn detection model.
For more information, see our Universal-Streaming end-of-turn detection guide and message-by-message breakdown.
Parameters
- Your AssemblyAI API key.
- The sample rate of the audio stream.
- The encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw.
- Whether to return formatted final transcripts. If enabled, formatted final transcripts will be emitted shortly following an end-of-turn detection.
- end_of_turn_confidence_threshold: The confidence threshold used to determine whether the end of a turn has been reached. In our API the default is 0.4, but the default in LiveKit is set to 0.65.
- min_end_of_turn_silence_when_confident: The minimum amount of silence required to detect end of turn when confident.
- max_turn_silence: The maximum amount of silence allowed in a turn before end of turn is triggered.