Pipecat

This guide assumes prior knowledge of Pipecat. If you haven’t used Pipecat before, please check out our Building a Voice Agent with Pipecat and AssemblyAI guide.

Overview

Pipecat is an open-source platform for developers building real-time media applications. In this guide, we’ll show you how to integrate AssemblyAI’s streaming speech-to-text model into your Pipecat voice agent using the Pipeline framework.

Quick start

Installation

Install the AssemblyAI service from PyPI:

pip install "pipecat-ai[assemblyai]"

Authentication

The AssemblyAI service requires an AssemblyAI API key. Set ASSEMBLYAI_API_KEY in your .env file.

You can obtain an AssemblyAI API key by signing up here.
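
If you use python-dotenv to load the .env file (an assumption; any mechanism that puts ASSEMBLYAI_API_KEY in the environment works), a minimal sketch looks like this:

import os

from dotenv import load_dotenv

load_dotenv()  # reads .env in the working directory into the process environment
api_key = os.getenv("ASSEMBLYAI_API_KEY")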

Basic usage

Use AssemblyAI STT in a Pipeline:

import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.assemblyai.stt import AssemblyAISTTService, AssemblyAIConnectionParams

# Configure the AssemblyAI STT service
stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    vad_force_turn_endpoint=False,  # use AssemblyAI's STT-based turn detection
    connection_params=AssemblyAIConnectionParams(
        end_of_turn_confidence_threshold=0.7,
        min_end_of_turn_silence_when_confident=160,  # in ms
        max_turn_silence=2400,  # in ms
    ),
)

# Use in a pipeline (transport, llm, and tts are created elsewhere)
pipeline = Pipeline([
    transport.input(),
    stt,
    llm,
    tts,
    transport.output(),
])
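
To actually run the pipeline, Pipecat wraps it in a task driven by a runner. A minimal sketch, assuming Pipecat's standard PipelineTask and PipelineRunner classes and an async entry point:

import asyncio

from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask

async def main():
    task = PipelineTask(pipeline)  # wrap the pipeline in a runnable task
    runner = PipelineRunner()      # drives frames through the processors
    await runner.run(task)

asyncio.run(main())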

Configuration

Turn Detection (Key Feature)

AssemblyAI’s new turn detection model was built specifically for voice agents, and you can tune it to fit your use case. It processes both audio and linguistic information to produce an end-of-turn confidence score on every inference; if that score exceeds the configured threshold, end of turn is triggered.

This custom model was designed to address two major issues with voice agents. Traditional VAD (voice activity detection) approaches rely on silence alone, so an agent can cut the user off mid-turn even when their words suggest they aren’t finished. Think of a situation like “My credit card number is____”: if the user pauses to look the number up, silence-based VAD may not wait for them, whereas our turn detection model handles these situations far better.

Conversely, when it is clear the user is done speaking, as with “What is my credit score?”, the model returns a high end-of-turn confidence that exceeds the threshold and triggers end of turn right away, minimizing turnaround latency in those scenarios.
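
Conceptually, the decision on each inference reduces to a rule like the sketch below. This is an illustrative approximation for intuition only, not AssemblyAI’s actual implementation; the function and its parameters are hypothetical:

# Hypothetical sketch of the end-of-turn decision, for intuition only
def should_end_turn(confidence: float, silence_ms: int,
                    threshold: float = 0.7,
                    min_silence_when_confident_ms: int = 160,
                    max_turn_silence_ms: int = 2400) -> bool:
    if confidence >= threshold:
        # Model is confident the user is done: end the turn after a short silence
        return silence_ms >= min_silence_when_confident_ms
    # Model is unsure: fall back to a longer silence timeout
    return silence_ms >= max_turn_silence_ms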

You can set the vad_force_turn_endpoint parameter within the AssemblyAISTTService constructor:

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    vad_force_turn_endpoint=False,  # use AssemblyAI's STT-based turn detection
    connection_params=AssemblyAIConnectionParams(
        end_of_turn_confidence_threshold=0.7,
        min_end_of_turn_silence_when_confident=160,  # in ms
        max_turn_silence=2400,  # in ms
    ),
)

Parameter tuning:

  • end_of_turn_confidence_threshold: raise or lower the confidence score the model must reach before it triggers end of turn
  • min_end_of_turn_silence_when_confident: increase or decrease the amount of silence (in ms) we wait before triggering end of turn when the model is confident
  • max_turn_silence: raise or lower the amount of silence (in ms) required to trigger end of turn when the confidence score never exceeds the threshold (see the example after this list)
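
For example, a latency-sensitive agent might lower the threshold and silence windows, while a more patient agent (say, one collecting card numbers) might raise them. The values below are illustrative, not recommendations:

# Snappier turn-taking (illustrative values)
fast_params = AssemblyAIConnectionParams(
    end_of_turn_confidence_threshold=0.6,        # trigger on lower confidence
    min_end_of_turn_silence_when_confident=120,  # shorter wait when confident (ms)
    max_turn_silence=1500,                       # shorter silence fallback (ms)
)

# More patient turn-taking, e.g. for users reading out numbers (illustrative values)
patient_params = AssemblyAIConnectionParams(
    end_of_turn_confidence_threshold=0.8,
    min_end_of_turn_silence_when_confident=400,
    max_turn_silence=3500,
)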

You can also set vad_force_turn_endpoint=True if you’d like turn detection to be based on VAD instead of our advanced turn detection model.
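
If you take the VAD route, the pipeline’s transport must run a VAD analyzer so that UserStoppedSpeakingFrame events are generated. A sketch assuming Pipecat’s bundled Silero VAD analyzer (transport wiring varies by transport class):

from pipecat.audio.vad.silero import SileroVADAnalyzer

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    vad_force_turn_endpoint=True,  # end turns on VAD silence instead of the STT model
)

# The transport must be constructed with a VAD analyzer so that
# UserStoppedSpeakingFrame is emitted, e.g. vad_analyzer=SileroVADAnalyzer()
# in your transport's params; see your transport's documentation.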

For more information, see our Universal-Streaming end-of-turn detection guide and message-by-message breakdown.

Parameters

Constructor Parameters

api_key
str, required

Your AssemblyAI API key.

connection_params
AssemblyAIConnectionParams

Connection parameters for the AssemblyAI WebSocket connection. See below for details.

vad_force_turn_endpoint
bool, defaults to True

When true, sends a ForceEndpoint event to AssemblyAI when a UserStoppedSpeakingFrame is received. Requires a VAD (Voice Activity Detection) processor in the pipeline to generate these frames.

language
Language, defaults to Language.EN

Language for transcription. AssemblyAI Streaming currently supports English only.

Connection Parameters

sample_rate
int, defaults to 16000

The sample rate of the audio stream.

encoding
str, defaults to pcm_s16le

The encoding of the audio stream. Allowed values: pcm_s16le, pcm_mulaw.

format_turns
bool, defaults to True

Whether to return formatted final transcripts. If enabled, formatted final transcripts will be emitted shortly following an end-of-turn detection.

end_of_turn_confidence_threshold
float, defaults to 0.7

The confidence threshold to use when determining if the end of a turn has been reached.

min_end_of_turn_silence_when_confident
int, defaults to 160

The minimum amount of silence, in milliseconds, required to detect end of turn when the model is confident.

max_turn_silence
int, defaults to 2400

The maximum amount of silence, in milliseconds, allowed in a turn before end of turn is triggered.
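
Pulling these together, a telephony-style configuration might use mu-law audio at 8 kHz. This is a sketch; make sure the sample rate and encoding match what your transport actually delivers:

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        sample_rate=8000,      # telephony audio is commonly 8 kHz
        encoding="pcm_mulaw",  # mu-law encoded audio from the phone network
        format_turns=True,     # emit formatted finals after end-of-turn detection
    ),
)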