Turn detection and interruptions

Turn detection and barge-in are on by default. Both are driven by what the user actually said, not just silence or volume, so there’s nothing to wire up.

Leave the defaults on. With no turn_detection config, the agent adapts to each speaker’s pace and automatically slows down to capture values your tools need, like a phone number or email. To improve turn-taking, write good tool descriptions rather than adjusting VAD knobs. Tune the raw thresholds only as a last resort.

What you get by default

Semantic end-of-turn. The agent decides you’re done from the meaning of what you said, so it won’t cut you off mid-thought or sit on a long pause once you’ve clearly finished.
Smart waiting for tool values. When a tool parameter expects a phone number, email, date, or other entity, the agent waits for the whole value before ending your turn, instead of jumping in after the first pause.
Adaptive pacing. If a speaker pauses a lot, the agent gives them more room; if they’re crisp, it replies faster. This gets better over the call.
Semantic barge-in. Back-channels like “uh-huh” or “makes sense” don’t interrupt; “wait, stop” does.

A normal user turn emits input.speech.started → transcript.user.delta (partials) → input.speech.stopped → transcript.user (final) → reply.started. You never signal end-of-turn yourself.
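As an illustration of that event order, here is a minimal sketch of a client-side tracker that follows one user turn from start to final transcript. The payload field names are assumptions: it supposes that transcript.user.delta carries an incremental fragment in `delta` and that transcript.user carries the final text in `text`. `createTurnTracker` is a hypothetical helper, not part of the API.

```javascript
// Hypothetical helper: follows one user turn through the event sequence.
// Assumed fields (not confirmed by this page): `delta` on
// transcript.user.delta (incremental text), `text` on transcript.user.
function createTurnTracker(onTurn) {
  let speaking = false;
  let partial = "";

  return function handleEvent(m) {
    switch (m.type) {
      case "input.speech.started":
        speaking = true;
        partial = "";
        break;
      case "transcript.user.delta":
        partial += m.delta ?? "";
        break;
      case "input.speech.stopped":
        speaking = false;
        break;
      case "transcript.user":
        // Prefer the final text; fall back to the accumulated partials.
        onTurn({ text: m.text ?? partial });
        partial = "";
        break;
      default:
        break; // other events (reply.started, ...) are ignored here
    }
    return { speaking, partial };
  };
}
```

The tracker only observes events; consistent with the text above, it never sends anything to mark the end of a turn.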

Get better turn-taking with tool descriptions

The agent waits for a complete value when it knows one is coming. It learns that from your tool’s parameters, so describe the important ones well. No turn-detection config involved:

{
  "name": "verify_account",
  "description": "Look up a customer by phone number.",
  "parameters": {
    "type": "object",
    "properties": {
      "phone": {
        "type": "string",
        "description": "The caller's phone number, including country code.",
        "examples": ["+14155552671"]
      }
    },
    "required": ["phone"]
  }
}

With this, when the agent asks for the phone number it waits for all the digits instead of cutting in after “my number is four one five…”. A clear description (plus optional examples or pattern) is what drives it. See parameter hints.
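Because parameter descriptions are what drive this behavior, it can help to check tool definitions before sending them. The following is a rough sketch of such a check; `findWeakParams` and its length threshold are hypothetical and purely local, not something the API provides or enforces.

```javascript
// Hypothetical local lint: flags parameters whose description is missing
// or very short. The 15-character threshold is an arbitrary assumption.
function findWeakParams(tool, minDescriptionLength = 15) {
  const props = tool.parameters?.properties ?? {};
  const required = new Set(tool.parameters?.required ?? []);
  const problems = [];

  for (const [name, schema] of Object.entries(props)) {
    const description = (schema.description ?? "").trim();
    if (description.length < minDescriptionLength) {
      problems.push({
        name,
        required: required.has(name),
        reason: "missing or very short description",
      });
    }
  }
  return problems;
}
```

Run against the verify_account definition above, this returns an empty list; a `phone` parameter with no description would be reported as a required parameter with a weak description.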

Interruptions

When the user truly interrupts, the server emits:

reply.done with status: "interrupted"
transcript.agent with interrupted: true and text trimmed to what the user actually heard.

On that signal, stop playback and clear any queued audio so the user doesn’t keep hearing stale speech. For the snappiest barge-in, also flush as soon as input.speech.started arrives:

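// `ws` is your open session WebSocket; `flushPlayback` is your own function
// that stops the current audio and empties the playback queue.
ws.onmessage = ({ data }) => {
  const m = JSON.parse(data);
  if (m.type === "input.speech.started") flushPlayback();   // snappiest barge-in
  if (m.type === "reply.done" && m.status === "interrupted") flushPlayback();
};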

See Handling interruptions for the full audio-flush pattern.
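For orientation, here is one way a `flushPlayback` could be backed by a small queue. This is a sketch under stated assumptions, not the pattern from the linked guide: `PlaybackQueue` is hypothetical, and `playChunk(chunk, onEnded)` stands in for whatever audio output you use. It must return a handle with a `stop()` method and call `onEnded` asynchronously when the chunk finishes.

```javascript
// Hypothetical playback queue. `playChunk(chunk, onEnded)` is supplied by
// you (for example, a Web Audio wrapper) and must return { stop() } and
// invoke onEnded asynchronously once the chunk has finished playing.
class PlaybackQueue {
  constructor(playChunk) {
    this.playChunk = playChunk;
    this.chunks = [];     // audio waiting to be played
    this.current = null;  // handle of the chunk playing now
  }

  enqueue(chunk) {
    this.chunks.push(chunk);
    if (this.current === null) this._next();
  }

  _next() {
    if (this.chunks.length === 0) {
      this.current = null;
      return;
    }
    const chunk = this.chunks.shift();
    const handle = this.playChunk(chunk, () => {
      // Ignore callbacks from chunks that were flushed in the meantime.
      if (this.current === handle) this._next();
    });
    this.current = handle;
  }

  // Drop everything queued and stop whatever is playing right now.
  flush() {
    this.chunks.length = 0;
    if (this.current !== null) {
      this.current.stop();
      this.current = null;
    }
  }
}
```

With this in place, `flushPlayback` in the handler above could simply be `() => queue.flush()`. The handle comparison keeps a late `onEnded` from a stopped chunk from restarting playback.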

Override the defaults (rarely needed)

If you must tune sensitivity, set the VAD knobs in session.input.turn_detection. All fields are optional; send only what you change.

{
  "type": "session.update",
  "session": {
    "input": {
      "turn_detection": {
        "vad_threshold": 0.5,
        "min_silence": 1000,
        "max_silence": 3000,
        "interrupt_response": true
      }
    }
  }
}

Field	Description
`vad_threshold`	Speech detection sensitivity (0.0–1.0). Lower is more sensitive.
`min_silence`	Minimum silence for a confident end-of-turn, in ms.
`max_silence`	Maximum silence before forcing end-of-turn, in ms.
`interrupt_response`	Set `false` to disable barge-in entirely.

Setting min_silence or max_silence turns off the adaptive pacing and entity-aware waiting described above for the rest of the session. Prefer leaving them unset.
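To follow the "send only what you change" advice and make the trade-off above visible in code, a client could build the message with a small helper. This is a sketch: `buildTurnDetectionUpdate` is hypothetical, and the range check on `vad_threshold` is a local sanity check based on the table, not documented server behavior.

```javascript
const TURN_DETECTION_FIELDS = [
  "vad_threshold",
  "min_silence",
  "max_silence",
  "interrupt_response",
];

// Hypothetical helper: builds a session.update containing only the fields
// that were explicitly overridden, and reports whether the overrides turn
// off adaptive pacing and entity-aware waiting.
function buildTurnDetectionUpdate(overrides) {
  const turn_detection = {};
  for (const key of TURN_DETECTION_FIELDS) {
    if (overrides[key] !== undefined) turn_detection[key] = overrides[key];
  }

  const threshold = turn_detection.vad_threshold;
  if (threshold !== undefined && !(threshold >= 0 && threshold <= 1)) {
    throw new RangeError("vad_threshold must be between 0.0 and 1.0");
  }

  const disablesAdaptive =
    "min_silence" in turn_detection || "max_silence" in turn_detection;

  return {
    message: { type: "session.update", session: { input: { turn_detection } } },
    disablesAdaptive,
  };
}
```

For example, `buildTurnDetectionUpdate({ vad_threshold: 0.4 })` produces a message with a single field and `disablesAdaptive: false`, while adding `min_silence` flips the flag so the caller can log or reconsider the change before sending `JSON.stringify(message)`.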

If the agent keeps interrupting itself, the mic is picking up its own TTS. Use headphones or a browser-based client (which has echo cancellation). See Troubleshooting.
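In a browser client, echo cancellation can be requested explicitly when opening the microphone. The sketch below uses the standard `navigator.mediaDevices.getUserMedia` call with the `echoCancellation` constraint; `micConstraints` and `openMic` are hypothetical helper names, and whether the constraint is honored depends on the browser and device.

```javascript
// Constraints asking the browser for an audio-only, echo-cancelled mic.
function micConstraints() {
  return { audio: { echoCancellation: true }, video: false };
}

// Browser-only: resolves to a MediaStream, or rejects if the user denies
// microphone access or no input device is available.
async function openMic() {
  return navigator.mediaDevices.getUserMedia(micConstraints());
}
```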
