Turn detection and barge-in are on by default and based on what the user actually said, not just silence or volume. There’s nothing to wire up.
Leave it on default. With no turn_detection config, the agent adapts to each speaker’s pace and automatically slows down to capture values your tools need, like a phone number or email. The way to make turn-taking better is good tool descriptions, not VAD knobs. Tune the raw thresholds only as a last resort.

What you get by default

  • Semantic end-of-turn. The agent decides you’re done from the meaning of what you said, so it won’t cut you off mid-thought or sit on a long pause once you’ve clearly finished.
  • Smart waiting for tool values. When a tool parameter expects a phone number, email, date, or other entity, the agent waits for the whole value before ending your turn, instead of jumping in after the first pause.
  • Adaptive pacing. If a speaker pauses a lot, the agent gives them more room; if they’re crisp, it replies faster. This gets better over the call.
  • Semantic barge-in. Back-channels like “uh-huh” or “makes sense” don’t interrupt; “wait, stop” does.
A normal user turn emits input.speech.started → transcript.user.delta (partials) → input.speech.stopped → transcript.user (final) → reply.started. You never signal end-of-turn yourself.
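The event order above can be followed with a small client-side state tracker. This is a minimal sketch, not part of any SDK: createTurnTracker is a hypothetical helper, and the text field on transcript events is an assumption about the payload shape (as is treating each delta as a fragment to append). Only the event type strings come from this page.

```javascript
// Sketch: follow one user turn through the documented event sequence.
// ASSUMPTION: transcript events carry their text in `event.text`, and
// each delta is a fragment to append. Adjust to the real payload.
function createTurnTracker() {
  const state = { phase: "idle", partial: "", final: null };
  return {
    state,
    handle(event) {
      switch (event.type) {
        case "input.speech.started":
          // A new user turn begins: reset what we collected before.
          state.phase = "user_speaking";
          state.partial = "";
          state.final = null;
          break;
        case "transcript.user.delta":
          state.partial += event.text ?? "";
          break;
        case "input.speech.stopped":
          // The server decided the turn ended; the client sent nothing.
          state.phase = "user_done";
          break;
        case "transcript.user":
          state.final = event.text ?? state.partial;
          break;
        case "reply.started":
          state.phase = "agent_replying";
          break;
        default:
          break; // Ignore events this sketch does not model.
      }
      return state;
    },
  };
}
```

Feed it every parsed message from the socket, for example tracker.handle(JSON.parse(data)), and read tracker.state.phase to drive UI such as a "listening" indicator.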

Get better turn-taking with tool descriptions

The agent waits for a complete value when it knows one is coming. It learns that from your tool’s parameters, so describe the important ones well. No turn-detection config involved:
{
  "name": "verify_account",
  "description": "Look up a customer by phone number.",
  "parameters": {
    "type": "object",
    "properties": {
      "phone": {
        "type": "string",
        "description": "The caller's phone number, including country code.",
        "examples": ["+14155552671"]
      }
    },
    "required": ["phone"]
  }
}
With this, when the agent asks for the phone number it waits for all the digits instead of cutting in after “my number is four one five…”. A clear description (plus optional examples or pattern) is what drives it. See parameter hints.
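The same idea applies to other entities, such as an email address. The sketch below shows a second tool definition that uses pattern alongside description and examples, plus a small check that flags parameters with no description. The tool send_receipt, its fields, and the findWeakParameters helper are all hypothetical and exist only for illustration; the regular expression is a loose placeholder, not a recommended email validator.

```javascript
// HYPOTHETICAL tool definition, shaped like the verify_account example.
const sendReceipt = {
  name: "send_receipt",
  description: "Email a receipt for the caller's most recent order.",
  parameters: {
    type: "object",
    properties: {
      email: {
        type: "string",
        description: "The caller's full email address, spelled out if needed.",
        examples: ["jane.doe@example.com"],
        pattern: "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$", // loose placeholder
      },
    },
    required: ["email"],
  },
};

// HYPOTHETICAL helper, not part of any SDK: return the names of
// parameters that have no non-empty description string.
function findWeakParameters(tool) {
  const props = tool.parameters?.properties ?? {};
  const weak = [];
  for (const [name, schema] of Object.entries(props)) {
    const described =
      typeof schema.description === "string" &&
      schema.description.trim().length > 0;
    if (!described) weak.push(name);
  }
  return weak;
}
```

Running findWeakParameters over each tool before registering it is one cheap way to catch parameters that give the agent nothing to wait for.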

Interruptions

When the user truly interrupts, the server emits reply.done with status "interrupted". On that signal, stop and clear your queued audio so the user doesn’t keep hearing stale speech:
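// flushPlayback() is your own function that stops and clears queued audio.
ws.onmessage = ({ data }) => {
  const m = JSON.parse(data);
  // The user started speaking: flush right away for the snappiest barge-in.
  if (m.type === "input.speech.started") flushPlayback();
  // The server confirmed the reply was cut off: drop any audio still queued.
  if (m.type === "reply.done" && m.status === "interrupted") flushPlayback();
};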
See Handling interruptions for the full audio-flush pattern.
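For orientation, here is one way a flushPlayback function could be structured. This is a minimal sketch under stated assumptions, not the pattern from Handling interruptions: createPlaybackQueue is a hypothetical helper, and it assumes each scheduled chunk is represented by an object with a stop() method, as a Web Audio AudioBufferSourceNode is in the browser.

```javascript
// Sketch of a flushable playback queue (hypothetical helper).
// ASSUMPTION: every tracked `source` exposes stop(), like a Web Audio
// AudioBufferSourceNode.
function createPlaybackQueue() {
  let active = [];
  return {
    // Remember a source when you schedule it for playback.
    track(source) {
      active.push(source);
      return source;
    },
    // Stop everything queued or playing, then forget it.
    // Returns how many sources were dropped.
    flushPlayback() {
      const dropped = active.length;
      for (const source of active) {
        try {
          source.stop();
        } catch {
          // stop() can throw if the source never started; ignore it.
        }
      }
      active = [];
      return dropped;
    },
    get size() {
      return active.length;
    },
  };
}
```

In a browser client you would call queue.track(node) right after node.start(...), and call queue.flushPlayback() from the message handler when an interruption arrives.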

Override the defaults (rarely needed)

If you must tune sensitivity, set the VAD knobs in session.input.turn_detection. All fields are optional; send only what you change.
{
  "type": "session.update",
  "session": {
    "input": {
      "turn_detection": {
        "vad_threshold": 0.5,
        "min_silence": 1000,
        "max_silence": 3000,
        "interrupt_response": true
      }
    }
  }
}
Field                 Description
vad_threshold         Speech detection sensitivity (0.0–1.0). Lower is more sensitive.
min_silence           Minimum silence for a confident end-of-turn, in ms.
max_silence           Maximum silence before forcing end-of-turn, in ms.
interrupt_response    Set false to disable barge-in entirely.
Setting min_silence or max_silence turns off the adaptive pacing and entity-aware waiting described above for the rest of the session. Prefer leaving them unset.
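Because setting either silence field has a session-wide side effect, it can help to build the update in one place that sends only the fields you changed and tells you when you are about to give up the adaptive behavior. The helper below is a hypothetical sketch (buildTurnDetectionUpdate is not part of any SDK); the message shape and field names are the ones shown on this page.

```javascript
// Field names as documented in the table above.
const TURN_DETECTION_FIELDS = [
  "vad_threshold",
  "min_silence",
  "max_silence",
  "interrupt_response",
];

// HYPOTHETICAL helper: build a session.update that contains only the
// overrides actually provided, and report whether it would turn off
// adaptive pacing and entity-aware waiting.
function buildTurnDetectionUpdate(overrides) {
  const turn_detection = {};
  for (const field of TURN_DETECTION_FIELDS) {
    if (overrides[field] !== undefined) {
      turn_detection[field] = overrides[field];
    }
  }
  const disablesAdaptivePacing =
    "min_silence" in turn_detection || "max_silence" in turn_detection;
  return {
    message: {
      type: "session.update",
      session: { input: { turn_detection } },
    },
    disablesAdaptivePacing,
  };
}
```

A caller might log a warning when disablesAdaptivePacing is true and then send the update with ws.send(JSON.stringify(message)).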
If the agent keeps interrupting itself, the mic is picking up its own TTS. Use headphones or a browser-based client (which has echo cancellation). See Troubleshooting.