Turn detection
Overview
AssemblyAI’s end-of-turn detection functionality is integrated into our Streaming STT model, leveraging both acoustic and semantic features, and is coupled with a traditional silence-based heuristic approach. Both mechanisms work jointly and either can trigger end-of-turn detection throughout the audio stream. This joint approach significantly enhances the speed and accuracy of end-of-turn detection while allowing this functionality to fall back to the traditional method when the model makes a misprediction.
This functionality is built natively into the STT model, making it ideal for addressing two common problems voice agent developers face: awkward pauses (long silence) or the risk of making premature interruptions (short silence), both of which disrupt the natural flow of conversation.
This approach allows for:
- Semantic understanding of natural speech patterns rather than simple silence thresholds
- Support for natural pauses and thinking time without premature responses
- Smooth conversation flow without awkward interruptions or artificial delays
End-of-turn and end-of-utterances refer to the same thing and may be used interchangeably in these docs.
Model-based detection
Triggers when all conditions are met:
EOT token predicted
- Model predicts semantic end-of-turn with a probability greater than
end_of_turn_confidence_threshold
- Default: 0.5 (user configurable)
Minimum silence duration has passed
- After the last non-silence word token,
min_end_of_turn_silence_when_confident
milliseconds must pass - Default: 400ms (user configurable)
Minimum speech duration spoken
- The user must speak for at least 80ms since the last end-of-turn (ensures at least one word)
- Set to 80 ms (internal)
Word finalized
- Last word in
turn.words
has been finalized - Internal configuration
Silence-based detection
Triggers when all conditions are met:
Minimum speech duration spoken
- The user must speak for at least 80ms since the last end-of-turn (ensures at least one word)
- Set to 80 ms (internal)
Maximum silence duration has passed
- After the last non-silence word token,
max_turn_silence
milliseconds must pass - Default: 2400ms (user configurable)
Disable turn detection
To disable model-based turn detection, set end_of_turn_confidence_threshold
to 1. This will stop the model from predicting end-of-turns and will only use silence-based detection.
If you are using your own form of turn detection (such as VAD or a custom turn detection model), you can send a ForceEndpoint
event to the server to force the end of a turn and receive the final turn transcript.
Important notes
- Silence-based detection can override model-based detection even with high EOT confidence thresholds
- Word finalization always takes precedence — endpointing won’t occur until the last word is finalized
- We define end-of-turn detection as the process of detecting the end of sustained speech activity, often called end-pointing in the Voice Agents context