Pipecat
This guide assumes prior knowledge of Pipecat. If you're new to Pipecat, please check out our Building a Voice Agent with Pipecat and AssemblyAI guide first.
Overview
Pipecat is an open-source framework for developers building real-time media applications. In this guide, we'll show you how to integrate AssemblyAI's streaming speech-to-text model into your Pipecat voice agent using the Pipeline framework.
Quick start
Installation
Install the AssemblyAI service from PyPI:
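The AssemblyAI service ships as an optional extra of the `pipecat-ai` package:

```bash
pip install "pipecat-ai[assemblyai]"
```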
Authentication
The AssemblyAI service requires an AssemblyAI API key. Set `ASSEMBLYAI_API_KEY` in your `.env` file.
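For example, a minimal `.env` entry (the value shown is a placeholder):

```
ASSEMBLYAI_API_KEY=your-api-key-here
```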
You can obtain an AssemblyAI API key by signing up here.
Basic usage
Use AssemblyAI STT in a `Pipeline`:
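Here's a minimal sketch of wiring the service into a pipeline. The import paths reflect recent Pipecat releases and may differ in older versions; the `transport` and downstream processors stand in for whatever services your agent already uses:

```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.assemblyai.stt import AssemblyAISTTService

# Create the STT service with the API key from your environment.
stt = AssemblyAISTTService(api_key=os.getenv("ASSEMBLYAI_API_KEY"))

# Drop the service in right after your transport's audio input.
pipeline = Pipeline(
    [
        transport.input(),  # audio in (transport configured elsewhere)
        stt,                # AssemblyAI streaming speech-to-text
        # ... your LLM, TTS, and output processors ...
    ]
)
```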
Configuration
Turn Detection (Key Feature)
AssemblyAI's turn detection model was built specifically for voice agents, and you can tune it to fit your use case. It processes both acoustic and linguistic information to produce an end-of-turn confidence score on every inference; if that score exceeds the configured threshold, end of turn is triggered.
This custom model was designed to address two major issues with voice agents. With traditional VAD (voice activity detection) approaches based on silence alone, the agent may cut in before a user has finished their turn, even when the content of the speech suggests more is coming. Think of a situation like "My credit card number is____": if the user pauses to look that up, silence-based VAD may not wait for them, whereas our turn detection model handles these situations far better.
Conversely, in situations where the user is clearly done speaking, like "What is my credit score?", the model returns a high end-of-turn confidence score that exceeds the threshold and triggers end of turn, allowing for minimal turnaround latency in those scenarios.
You can set the `vad_force_turn_endpoint` parameter within the `AssemblyAISTTService` constructor:
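For example, a minimal sketch (import path per recent Pipecat releases):

```python
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    # False relies on AssemblyAI's turn detection model;
    # True forces end of turn from your pipeline's VAD instead.
    vad_force_turn_endpoint=False,
)
```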
Parameter tuning (see the sketch after this list):
- `end_of_turn_confidence_threshold`: Raise or lower the confidence score required before end of turn is triggered
- `min_end_of_turn_silence_when_confident`: Increase or decrease how long we wait before triggering end of turn once the model is confident
- `max_turn_silence`: Raise or lower how much silence triggers end of turn when a high confidence score hasn't already done so
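These knobs are passed through the connection parameters. A hedged sketch follows; the `AssemblyAIConnectionParams` import path mirrors recent Pipecat releases, and the values are illustrative, not recommendations:

```python
import os

from pipecat.services.assemblyai.models import AssemblyAIConnectionParams
from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        # Require more model confidence before ending the turn.
        end_of_turn_confidence_threshold=0.8,
        # Minimum silence (ms) before ending the turn when confident.
        min_end_of_turn_silence_when_confident=200,
        # Silence (ms) that ends the turn without a confident prediction.
        max_turn_silence=2000,
    ),
)
```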
You can also set `vad_force_turn_endpoint=True` if you'd like turn detection to be based on VAD instead of our advanced turn detection model.
For more information, see our Universal-Streaming end-of-turn detection guide and message-by-message breakdown.
Parameters
Constructor Parameters
- `api_key`: Your AssemblyAI API key.
- `connection_params`: Connection parameters for the AssemblyAI WebSocket connection. See below for details.
- `vad_force_turn_endpoint`: When true, sends a `ForceEndpoint` event to AssemblyAI when a `UserStoppedSpeakingFrame` is received. Requires a VAD (Voice Activity Detection) processor in the pipeline to generate these frames.
- `language`: Language for transcription. AssemblyAI currently only supports English Streaming transcription.
Connection Parameters
- `sample_rate`: The sample rate of the audio stream.
- `encoding`: The encoding of the audio stream. Allowed values: `pcm_s16le`, `pcm_mulaw`.
- `format_turns`: Whether to return formatted final transcripts. If enabled, formatted final transcripts are emitted shortly following an end-of-turn detection.
- `end_of_turn_confidence_threshold`: The confidence threshold to use when determining if the end of a turn has been reached.
- `min_end_of_turn_silence_when_confident`: The minimum amount of silence, in milliseconds, required to detect end of turn when confident.
- `max_turn_silence`: The maximum amount of silence, in milliseconds, allowed in a turn before end of turn is triggered.
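Putting both parameter groups together, an illustrative configuration might look like the following. The `AssemblyAIConnectionParams` import path, the `format_turns` field name, and `Language.EN` mirror recent Pipecat releases and AssemblyAI's streaming API; treat them as assumptions and check against your installed version:

```python
import os

from pipecat.services.assemblyai.models import AssemblyAIConnectionParams
from pipecat.services.assemblyai.stt import AssemblyAISTTService
from pipecat.transcriptions.language import Language

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    vad_force_turn_endpoint=False,
    language=Language.EN,  # English is currently the only streaming option
    connection_params=AssemblyAIConnectionParams(
        sample_rate=16000,     # 16 kHz audio input
        encoding="pcm_s16le",  # or "pcm_mulaw"
        format_turns=True,     # emit formatted finals after end of turn
    ),
)
```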