Turn detection | AssemblyAI

Overview

AssemblyAI’s turn detection model uses a neural network to detect when someone has finished speaking. Unlike traditional voice activity detection that only listens for silence, our model understands the meaning and flow of speech to make better decisions about when a turn has ended.

The model has two ways to detect end-of-turn:

Semantic detection - The model predicts when speech naturally ends based on meaning and context
Acoustic detection - Traditional silence-based detection using voice activity detection (VAD), kept as a backup

When either method detects an end-of-turn, the model returns end_of_turn=True in the response.
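On the client side, this usually means checking the flag on every message you receive. The following is a minimal sketch, assuming messages arrive as JSON text carrying a boolean `end_of_turn` field. The `transcript` field and the `handle_message` helper name are assumptions made for illustration.

```python
import json
from typing import Optional


def handle_message(raw: str) -> Optional[str]:
    """Return the transcript if this message ends a turn, otherwise None.

    Assumes a JSON payload with a boolean `end_of_turn` field and a
    `transcript` string field (the latter is an assumption).
    """
    message = json.loads(raw)
    if message.get("end_of_turn"):
        return message.get("transcript", "")
    return None
```

A caller would typically pass the returned transcript on to the agent's response logic and ignore messages that return `None`.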

This approach solves common voice agent problems:

No more awkward long pauses waiting for silence thresholds
No more cutting people off mid-sentence during natural pauses
Better handling of “um” and thinking pauses
Easy to fine-tune for your use case

Quick start configurations

Aggressive

Ends turns very quickly, optimized for short responses and rapid back-and-forth.

"end_of_turn_confidence_threshold": 0.4,
"min_end_of_turn_silence_when_confident": 160,
"max_turn_silence": 400

Recommended use cases: Agent Assist, IVR replacements, Retail/E-commerce (order confirmations, delivery status), Telecom (outage reporting, yes/no checks)

Balanced

Provides a natural middle ground, allowing enough pause for most conversational turns without feeling sluggish or overly eager.

"end_of_turn_confidence_threshold": 0.4,
"min_end_of_turn_silence_when_confident": 400,
"max_turn_silence": 1280

Recommended use cases: Customer Support, Tech Support/SaaS, Financial Services (account inquiries, balance checks), Travel & Hospitality, Education, Government Services

Conservative

Holds the floor longer, optimized for reflective or complex speech where users may pause to think before finishing.

"end_of_turn_confidence_threshold": 0.7,
"min_end_of_turn_silence_when_confident": 800,
"max_turn_silence": 3600

Recommended use cases: Healthcare, Mental Health Support, Sales & Consulting, Legal & Insurance, Language Learning

These configurations are just starting points and can be fine-tuned based on your specific use case. Reach out to support@assemblyai.com for help.
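If you switch between these presets in code, it can help to keep them in one place. The sketch below stores the three configurations as Python dictionaries and encodes one as a URL query string. It assumes the parameters are sent as query parameters when opening the connection, and the `PRESETS` and `build_query` names are hypothetical.

```python
from urllib.parse import urlencode

# Values copied from the three quick start configurations above.
PRESETS = {
    "aggressive": {
        "end_of_turn_confidence_threshold": 0.4,
        "min_end_of_turn_silence_when_confident": 160,
        "max_turn_silence": 400,
    },
    "balanced": {
        "end_of_turn_confidence_threshold": 0.4,
        "min_end_of_turn_silence_when_confident": 400,
        "max_turn_silence": 1280,
    },
    "conservative": {
        "end_of_turn_confidence_threshold": 0.7,
        "min_end_of_turn_silence_when_confident": 800,
        "max_turn_silence": 3600,
    },
}


def build_query(preset: str, **overrides) -> str:
    """Merge a preset with per-call overrides and encode it as a query string."""
    params = {**PRESETS[preset], **overrides}
    return urlencode(params)
```

For example, `build_query("balanced", max_turn_silence=2000)` keeps the balanced threshold and minimum silence while extending the maximum silence.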

How it works

The model ends a turn through one of two detection paths, each with its own set of conditions:

Semantic Detection

Triggers when all conditions are met:

Model confidence threshold

Model predicts semantic end-of-turn confidence greater than end_of_turn_confidence_threshold
Default: 0.4 (user configurable)

Minimum silence duration

After the end of speech detected by VAD, min_end_of_turn_silence_when_confident milliseconds must pass
Default: 400 ms (user configurable)

Minimum speech duration

The user must speak for at least 80 ms since the last end-of-turn (ensures at least one word)
Set to 80 ms (internal)

Word finalized

Last word in turn.words has been finalized
Internal configuration

Acoustic Detection

Triggers when all conditions are met:

Model confidence threshold

Model predicts semantic end-of-turn confidence less than end_of_turn_confidence_threshold
Default: 0.4 (user configurable)

Maximum silence duration

After the end of speech detected by VAD, max_turn_silence milliseconds must pass
Default: 1280 ms (user configurable)

Minimum speech duration

The user must speak for at least 80 ms since the last end-of-turn (ensures at least one word)
Set to 80 ms (internal)

Word finalized

Last word in turn.words has been finalized
Internal configuration
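The two sets of conditions above can be sketched as a single decision function. This is an illustrative approximation and not the service's actual implementation. The function and argument names are hypothetical, and the handling of a confidence exactly equal to the threshold (treated here as not confident) is an assumption, since the text only specifies "greater than" and "less than".

```python
from dataclasses import dataclass

MIN_SPEECH_MS = 80  # internal minimum speech duration described above


@dataclass(frozen=True)
class TurnConfig:
    end_of_turn_confidence_threshold: float = 0.4
    min_end_of_turn_silence_when_confident: int = 400  # ms
    max_turn_silence: int = 1280  # ms


def is_end_of_turn(
    confidence: float,
    silence_ms: int,
    speech_ms: int,
    last_word_final: bool,
    cfg: TurnConfig = TurnConfig(),
) -> bool:
    """Return True if either the semantic or the acoustic path would trigger."""
    # Conditions shared by both paths: enough speech and a finalized last word.
    if speech_ms < MIN_SPEECH_MS or not last_word_final:
        return False
    confident = confidence > cfg.end_of_turn_confidence_threshold
    if confident:
        # Semantic path: a short silence is enough when the model is confident.
        return silence_ms >= cfg.min_end_of_turn_silence_when_confident
    # Acoustic path: wait for the longer silence when the model is not confident.
    return silence_ms >= cfg.max_turn_silence
```

With the defaults, a confident prediction ends the turn after 400 ms of silence, while a low-confidence prediction waits the full 1280 ms.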

Disable turn detection

To disable model-based turn detection, you have two options:

Set a VAD-based silence latency for each turn: Set end_of_turn_confidence_threshold to 1. This will cause the model to end a turn after a pre-determined amount of silence (based on max_turn_silence).
- Most useful when you are using the model as your VAD for silence-based turns.
Return turns as fast as possible on silence: Set end_of_turn_confidence_threshold to 0. This will cause the model to force end-of-turn as soon as silence is detected (based on min_end_of_turn_silence_when_confident).
- Most useful when you are using a custom turn detection model on top of the transcript results.
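As a rough sketch, the two options correspond to the parameter sets below. The helper names are hypothetical, and the silence values you pass in are your own choice.

```python
def silence_only_params(silence_ms: int) -> dict:
    """Option 1: end every turn after a fixed silence (threshold set to 1)."""
    return {
        "end_of_turn_confidence_threshold": 1,
        "max_turn_silence": silence_ms,
    }


def fastest_turn_params(silence_ms: int) -> dict:
    """Option 2: end turns as soon as silence is detected (threshold set to 0)."""
    return {
        "end_of_turn_confidence_threshold": 0,
        "min_end_of_turn_silence_when_confident": silence_ms,
    }
```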

If you are using your own form of turn detection (such as VAD or a custom turn detection model), you can send a ForceEndpoint event to the server to force the end of a turn and receive the final turn transcript.

# Force the current turn to end; `ws` is your open WebSocket connection (requires `import json`)
ws.send(json.dumps({"type": "ForceEndpoint"}))
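If your own VAD drives the turn boundaries, one way to wire this up is a small helper that counts silent frames and sends ForceEndpoint once a silence budget is spent. This is a minimal sketch under stated assumptions: the `SilenceEndpointer` class, its default values, and the per-frame `is_speech` input are all hypothetical, and `ws` can be any object with a `send` method.

```python
import json

FORCE_ENDPOINT = json.dumps({"type": "ForceEndpoint"})


class SilenceEndpointer:
    """Send ForceEndpoint once after `silence_ms` of silence following speech."""

    def __init__(self, ws, silence_ms: int = 500, frame_ms: int = 20):
        self.ws = ws
        self.silence_ms = silence_ms
        self.frame_ms = frame_ms
        self._silent_for = 0
        self._heard_speech = False

    def on_frame(self, is_speech: bool) -> bool:
        """Feed one VAD decision per audio frame; True means a turn was forced."""
        if is_speech:
            self._heard_speech = True
            self._silent_for = 0
            return False
        self._silent_for += self.frame_ms
        if self._heard_speech and self._silent_for >= self.silence_ms:
            self.ws.send(FORCE_ENDPOINT)
            # Reset so the same silence does not force a second endpoint.
            self._heard_speech = False
            self._silent_for = 0
            return True
        return False
```

Requiring speech before the silence counts avoids forcing empty turns during leading silence.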

Important notes

Silence-based detection can still end a turn even when the end-of-turn (EOT) confidence threshold is set high: once max_turn_silence elapses, the acoustic path ends the turn
Word finalization always takes precedence — endpointing won’t occur until the last word is finalized
We define end-of-turn detection as the process of detecting the end of sustained speech activity, often called endpointing in voice agent contexts