Turn detection

Overview

AssemblyAI’s turn detection model uses a neural network to detect when someone has finished speaking. Unlike traditional voice activity detection that only listens for silence, our model understands the meaning and flow of speech to make better decisions about when a turn has ended.

The model has two ways to detect end-of-turn:

  1. Semantic detection - The model predicts when speech naturally ends based on meaning and context
  2. Acoustic detection - Traditional silence-based (VAD) detection as a fallback

When either method detects an end-of-turn, the model returns end_of_turn=True in the response.
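
In a raw WebSocket integration, your message handler can key off this flag. Here is a minimal sketch, assuming the response is JSON with "type" and "transcript" fields alongside end_of_turn (see the API reference for the exact message schema):

import json

def handle_completed_turn(transcript: str) -> None:
    # Hand the finished turn to the rest of your voice agent pipeline.
    print("Turn ended:", transcript)

def on_message(message: str) -> None:
    msg = json.loads(message)
    # "type" and "transcript" are assumed field names for illustration;
    # end_of_turn is the flag described above.
    if msg.get("type") == "Turn" and msg.get("end_of_turn"):
        handle_completed_turn(msg.get("transcript", ""))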

This approach solves common voice agent problems:

  • No more awkward long pauses waiting for silence thresholds
  • No more cutting people off mid-sentence during natural pauses
  • Better handling of “um” and thinking pauses
  • Easy fine-tuning of the model to your use case

Quick start configurations

Aggressive

Ends turns very quickly, optimized for short responses and rapid back-and-forth.

1"end_of_turn_confidence_threshold": 0.5,
2"min_end_of_turn_silence_when_confident": 160,
3"max_turn_silence": 400
Recommended use cases: Agent Assist, IVR replacements, Retail/E-commerce (order confirmations, delivery status), Telecom (outage reporting, yes/no checks)

Balanced

Provides a natural middle ground, allowing enough pause for most conversational turns without feeling sluggish or overly eager.

1"end_of_turn_confidence_threshold": 0.7,
2"min_end_of_turn_silence_when_confident": 400,
3"max_turn_silence": 1200
Recommended use cases: Customer Support, Tech Support/SaaS, Financial Services (account inquiries, balance checks), Travel & Hospitality, Education, Government Services

Conservative

Holds the floor longer, optimized for reflective or complex speech where users may pause to think before finishing.

1"end_of_turn_confidence_threshold": 0.9,
2"min_end_of_turn_silence_when_confident": 800,
3"max_turn_silence": 3600
Recommended use cases: Healthcare, Mental Health Support, Sales & Consulting, Legal & Insurance, Language Learning

These configurations are just starting points and can be fine-tuned based on your specific use case. Reach out to support@assemblyai.com for help.
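
To apply one of these presets, pass the parameters when you open a streaming session. The sketch below assumes they are accepted as query parameters on the streaming WebSocket URL; check the API reference for the exact endpoint and connection options.

from urllib.parse import urlencode

# The "Balanced" preset from above; swap in the Aggressive or Conservative
# values to change behavior.
turn_detection_params = {
    "end_of_turn_confidence_threshold": 0.7,
    "min_end_of_turn_silence_when_confident": 400,
    "max_turn_silence": 1200,
}

# The endpoint and query-parameter handling here are illustrative assumptions;
# see the streaming API reference for the exact connection URL.
base_url = "wss://streaming.assemblyai.com/v3/ws"
url = f"{base_url}?{urlencode({'sample_rate': 16000, **turn_detection_params})}"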

How it works

The turn detection model uses a neural network to detect when someone has finished speaking. It has two ways to detect end-of-turn:

1. Semantic Detection

Triggers when all conditions are met:

Model confidence threshold

  • Model predicts semantic end-of-turn confidence greater than end_of_turn_confidence_threshold
  • Default: 0.7 (user configurable)

Minimum silence duration

  • After VAD detects the end of speech, at least min_end_of_turn_silence_when_confident milliseconds of silence must pass
  • Default: 160 ms (user configurable)

Minimum speech duration

  • The user must speak for at least 80 ms since the last end-of-turn (ensures at least one word)
  • Set to 80 ms (internal)

Word finalized

  • Last word in turn.words has been finalized
  • Internal configuration

2. Acoustic Detection

Triggers when all conditions are met:

Model confidence threshold

  • Model predicts semantic end-of-turn confidence less than end_of_turn_confidence_threshold
  • Default: 0.7 (user configurable)

Maximum silence duration

  • After VAD detects the end of speech, at least max_turn_silence milliseconds of silence must pass
  • Default: 2400 ms (user configurable)

Minimum speech duration

  • The user must speak for at least 80 ms since the last end-of-turn (ensures at least one word)
  • Set to 80 ms (internal)

Word finalized

  • Last word in turn.words has been finalized
  • Internal configuration
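
Putting the two paths together, the logic above can be summarized roughly as follows. This is a simplified sketch for illustration only, not the actual implementation, and the variable names are illustrative.

def should_end_turn(
    confidence: float,          # model's semantic end-of-turn confidence
    silence_ms: float,          # silence since VAD detected the end of speech
    speech_ms: float,           # speech duration since the last end-of-turn
    last_word_finalized: bool,  # last word in turn.words has been finalized
    end_of_turn_confidence_threshold: float = 0.7,
    min_end_of_turn_silence_when_confident: float = 160,
    max_turn_silence: float = 2400,
) -> bool:
    # Shared preconditions: enough speech to contain a word, last word finalized.
    if speech_ms < 80 or not last_word_finalized:
        return False

    # Semantic detection: confident prediction plus a short minimum silence.
    if (confidence > end_of_turn_confidence_threshold
            and silence_ms >= min_end_of_turn_silence_when_confident):
        return True

    # Acoustic detection: the model is not confident, but silence has reached
    # the maximum allowed for the turn.
    if (confidence <= end_of_turn_confidence_threshold
            and silence_ms >= max_turn_silence):
        return True

    return False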

Disable turn detection

To disable model-based turn detection, you have two options:

  • Set a VAD-based silence latency for each turn: Set end_of_turn_confidence_threshold to 1. This will cause the model to end a turn after a pre-determined amount of silence (based on max_turn_silence).
    • Most useful when you are using the model as your VAD for silence-based turns.
  • Return turns as fast as possible on silence: Set end_of_turn_confidence_threshold to 0. This will cause the model to force end-of-turn as soon as silence is detected (based on min_end_of_turn_silence_when_confident).
    • Most useful when you are using a custom turn detection model on top of the transcript results.
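
For reference, the two options correspond to settings like these (the silence values shown are the defaults from the section above):

Silence-based turns only:

"end_of_turn_confidence_threshold": 1,
"max_turn_silence": 2400

Fastest possible turns on silence:

"end_of_turn_confidence_threshold": 0,
"min_end_of_turn_silence_when_confident": 160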

If you are using your own form of turn detection (such as VAD or a custom turn detection model), you can send a ForceEndpoint event to the server to force the end of a turn and receive the final turn transcript.

ws.send(json.dumps({"type": "ForceEndpoint"}))

Important notes

  • Silence-based detection can override model-based detection even with high end-of-turn confidence thresholds
  • Word finalization always takes precedence — endpointing won’t occur until the last word is finalized
  • We define end-of-turn detection as the process of detecting the end of sustained speech activity, often called end-pointing in the Voice Agents context