Turn detection
Overview
AssemblyAI’s turn detection model uses a neural network to detect when someone has finished speaking. Unlike traditional voice activity detection that only listens for silence, our model understands the meaning and flow of speech to make better decisions about when a turn has ended.
The model has two ways to detect end-of-turn:
- Semantic detection - The model predicts when speech naturally ends based on meaning and context
- Acoustic detection - Traditional silence-based detection using VAD as a fallback
When either method detects an end-of-turn, the model returns `end_of_turn=True` in the response.
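For example, a client handling streaming messages might watch for that flag. This is a minimal sketch that assumes Turn events arrive as JSON objects carrying `type`, `transcript`, and `end_of_turn` fields:

```python
import json

def handle_message(raw: str) -> None:
    # Sketch only: assumes Turn events are JSON objects with a "type"
    # field, the running "transcript", and an "end_of_turn" flag.
    msg = json.loads(raw)
    if msg.get("type") == "Turn" and msg.get("end_of_turn"):
        # Either semantic or acoustic detection fired; the turn is complete.
        print("Turn ended:", msg.get("transcript", ""))
```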
This approach solves common voice agent problems:
- No more awkward long pauses waiting for silence thresholds
- No more cutting people off mid-sentence during natural pauses
- Better handling of “um” and thinking pauses
- Easy fine-tuning to your use case
Quick start configurations
Aggressive
Ends turns very quickly, optimized for short responses and rapid back-and-forth.
Balanced
Provides a natural middle ground, allowing enough pause for most conversational turns without feeling sluggish or overly eager.
Conservative
Holds the floor longer, optimized for reflective or complex speech where users may pause to think before finishing.
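As a sketch, the presets might map onto the turn-detection parameters described under How it works. The keys below are the real API parameters; the aggressive and conservative values are illustrative assumptions rather than official presets, and the balanced values simply restate the documented defaults:

```python
# Illustrative presets only; tune against your own traffic.
TURN_DETECTION_PRESETS = {
    "aggressive": {
        "end_of_turn_confidence_threshold": 0.5,        # assumption: fire on weaker evidence
        "min_end_of_turn_silence_when_confident": 100,  # assumption, ms
        "max_turn_silence": 1500,                       # assumption, ms
    },
    "balanced": {
        "end_of_turn_confidence_threshold": 0.7,        # documented default
        "min_end_of_turn_silence_when_confident": 160,  # documented default, ms
        "max_turn_silence": 2400,                       # documented default, ms
    },
    "conservative": {
        "end_of_turn_confidence_threshold": 0.85,       # assumption: require stronger evidence
        "min_end_of_turn_silence_when_confident": 400,  # assumption, ms
        "max_turn_silence": 3600,                       # assumption, ms
    },
}
```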
These configurations are just starting points and can be fine-tuned based on your specific use case. Reach out to support@assemblyai.com for help.
How it works
The turn detection model has two ways to detect end-of-turn:
Semantic Detection
Triggers when all conditions are met:
- Model confidence threshold: the model predicts a semantic end-of-turn confidence greater than `end_of_turn_confidence_threshold`. Default: `0.7` (user configurable).
- Minimum silence duration: after VAD detects the end of speech, `min_end_of_turn_silence_when_confident` milliseconds must pass. Default: `160 ms` (user configurable).
- Minimum speech duration: the user must have spoken for at least 80 ms since the last end-of-turn (ensures at least one word). Set to `80 ms` (internal).
- Word finalized: the last word in `turn.words` has been finalized. Internal configuration.
Acoustic Detection
Triggers when all conditions are met:
- Model confidence threshold: the model predicts a semantic end-of-turn confidence less than `end_of_turn_confidence_threshold`. Default: `0.7` (user configurable).
- Maximum silence duration: after VAD detects the end of speech, `max_turn_silence` milliseconds must pass. Default: `2400 ms` (user configurable).
- Minimum speech duration: the user must have spoken for at least 80 ms since the last end-of-turn (ensures at least one word). Set to `80 ms` (internal).
- Word finalized: the last word in `turn.words` has been finalized. Internal configuration.
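Putting the two paths together, the decision logic looks roughly like this. It is a simplified sketch, not the production implementation: the parameter names and defaults match the API, while `TurnState` and its fields are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class TurnState:
    eot_confidence: float      # model's semantic end-of-turn confidence
    silence_ms: float          # silence since VAD detected end of speech
    speech_ms: float           # speech duration since the last end-of-turn
    last_word_finalized: bool  # last word in turn.words is finalized

def is_end_of_turn(
    state: TurnState,
    end_of_turn_confidence_threshold: float = 0.7,
    min_end_of_turn_silence_when_confident: float = 160.0,
    max_turn_silence: float = 2400.0,
    min_speech_ms: float = 80.0,  # internal, not user configurable
) -> bool:
    # Gates shared by both paths: enough speech for at least one word,
    # and a finalized last word.
    if state.speech_ms < min_speech_ms or not state.last_word_finalized:
        return False
    # Semantic detection: a confident prediction plus a short silence.
    if (state.eot_confidence > end_of_turn_confidence_threshold
            and state.silence_ms >= min_end_of_turn_silence_when_confident):
        return True
    # Acoustic detection: confidence stayed low, but silence hit the maximum.
    return state.silence_ms >= max_turn_silence
```

Because `max_turn_silence` is larger than `min_end_of_turn_silence_when_confident`, a confident semantic prediction always endpoints sooner than the acoustic fallback.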
Disable turn detection
To disable model-based turn detection, you have two options:
- Set a VAD-based silence latency for each turn: set `end_of_turn_confidence_threshold` to `1`. This will cause the model to end a turn after a predetermined amount of silence (based on `max_turn_silence`). Most useful when you are using the model as your VAD for silence-based turns.
- Return turns as fast as possible on silence: set `end_of_turn_confidence_threshold` to `0`. This will cause the model to force end-of-turn as soon as silence is detected (based on `min_end_of_turn_silence_when_confident`). Most useful when you are using a custom turn detection model on top of the transcript results.
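Concretely, the two options correspond to parameter settings like these. The keys are the real API parameters; the dict shape is just for illustration:

```python
# Option 1: pure silence-based endpointing. Semantic detection can never
# fire (confidence is never greater than 1), so every turn ends after
# max_turn_silence milliseconds of silence.
vad_only_config = {
    "end_of_turn_confidence_threshold": 1.0,
    "max_turn_silence": 2400,  # documented default, ms
}

# Option 2: endpoint as fast as possible. Semantic detection effectively
# always fires, so a turn ends after min_end_of_turn_silence_when_confident
# milliseconds of silence.
fastest_config = {
    "end_of_turn_confidence_threshold": 0.0,
    "min_end_of_turn_silence_when_confident": 160,  # documented default, ms
}
```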
If you are using your own form of turn detection (such as VAD or a custom turn detection model), you can send a `ForceEndpoint` event to the server to force the end of a turn and receive the final turn transcript.
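Over a raw WebSocket connection, that might look like the following. This sketch assumes the `websocket-client` package and a message shape of `{"type": "ForceEndpoint"}`; check the API reference for the exact schema:

```python
import json

import websocket  # pip install websocket-client

def force_endpoint(ws: websocket.WebSocket) -> None:
    # Ask the server to end the current turn immediately. The message
    # shape here is an assumption; verify it against the API reference.
    ws.send(json.dumps({"type": "ForceEndpoint"}))
```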
Important notes
- Silence-based detection can override model-based detection, even with a high end-of-turn confidence threshold
- Word finalization always takes precedence: endpointing won't occur until the last word is finalized
- We define end-of-turn detection as the process of detecting the end of sustained speech activity, often called endpointing in the voice agent context