Turn detection

Overview

AssemblyAI’s end-of-turn detection functionality is integrated into our Streaming STT model, leveraging both acoustic and semantic features, and is coupled with a traditional silence-based heuristic approach. Both mechanisms work jointly and either can trigger end-of-turn detection throughout the audio stream. This joint approach significantly enhances the speed and accuracy of end-of-turn detection while allowing this functionality to fall back to the traditional method when the model makes a misprediction.

This functionality is built natively into the STT model, making it ideal for addressing two common problems voice agent developers face: awkward pauses (long silence) or the risk of making premature interruptions (short silence), both of which disrupt the natural flow of conversation.

This approach allows for:

  • Semantic understanding of natural speech patterns rather than simple silence thresholds
  • Support for natural pauses and thinking time without premature responses
  • Smooth conversation flow without awkward interruptions or artificial delays

End-of-turn and end-of-utterances refer to the same thing and may be used interchangeably in these docs.

Model-based detection

Triggers when all conditions are met:

EOT token predicted

  • Model predicts semantic end-of-turn with a probability greater than end_of_turn_confidence_threshold
  • Default: 0.5 (user configurable)

Minimum silence duration has passed

  • After the last non-silence word token, min_end_of_turn_silence_when_confident milliseconds must pass
  • Default: 400ms (user configurable)

Minimum speech duration spoken

  • The user must speak for at least 80ms since the last end-of-turn (ensures at least one word)
  • Set to 80 ms (internal)

Word finalized

  • Last word in turn.words has been finalized
  • Internal configuration

Silence-based detection

Triggers when all conditions are met:

Minimum speech duration spoken

  • The user must speak for at least 80ms since the last end-of-turn (ensures at least one word)
  • Set to 80 ms (internal)

Maximum silence duration has passed

  • After the last non-silence word token, max_turn_silence milliseconds must pass
  • Default: 2400ms (user configurable)

Disable turn detection

To disable model-based turn detection, set end_of_turn_confidence_threshold to 1. This will stop the model from predicting end-of-turns and will only use silence-based detection. If you are using your own form of turn detection (such as VAD or a custom turn detection model), you can send a ForceEndpoint event to the server to force the end of a turn and receive the final turn transcript.

1ws.send(json.dumps({"type": "ForceEndpoint"}))

Important notes

  • Silence-based detection can override model-based detection even with high EOT confidence thresholds
  • Word finalization always takes precedence — endpointing won’t occur until the last word is finalized
  • We define end-of-turn detection as the process of detecting the end of sustained speech activity, often called end-pointing in the Voice Agents context