Leave it on default. With no `turn_detection` config, the agent adapts to each speaker’s pace and automatically slows down to capture values your tools need, such as a phone number or email. The way to improve turn-taking is good tool descriptions, not VAD knobs. Tune the raw thresholds only as a last resort.
What you get by default
- Semantic end-of-turn. The agent decides you’re done from the meaning of what you said, so it won’t cut you off mid-thought or sit on a long pause once you’ve clearly finished.
- Smart waiting for tool values. When a tool parameter expects a phone number, email, date, or other entity, the agent waits for the whole value before ending your turn, instead of jumping in after the first pause.
- Adaptive pacing. If a speaker pauses a lot, the agent gives them more room; if they’re crisp, it replies faster. This adaptation improves as the call goes on.
- Semantic barge-in. Back-channels like “uh-huh” or “makes sense” don’t interrupt; “wait, stop” does.
Each user turn produces this event sequence: `input.speech.started` → `transcript.user.delta` (partials) → `input.speech.stopped` → `transcript.user` (final) → `reply.started`. You never signal end-of-turn yourself.
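A client only needs to observe this sequence, never drive it. The sketch below is a minimal illustration under stated assumptions: each server event arrives as a dict with a `type` key, user transcript events carry their text in a `text` field, and `transcript.user.delta` text is incremental (appended) rather than cumulative. `TurnTracker` is a hypothetical helper, not part of the documented API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnTracker:
    """Follows one user turn through the documented server event sequence.

    Assumptions (not documented): events are dicts with a "type" key, and
    user transcript events carry their text under "text". Deltas are
    treated as incremental chunks and appended together.
    """

    partial: str = ""              # running partial transcript built from deltas
    final: Optional[str] = None    # final transcript once the turn has closed
    user_speaking: bool = False
    agent_replying: bool = False
    seen: List[str] = field(default_factory=list)  # event types in arrival order

    def handle(self, event: dict) -> None:
        kind = event["type"]
        self.seen.append(kind)
        if kind == "input.speech.started":
            # A new user turn begins: reset per-turn transcript state.
            self.user_speaking = True
            self.partial = ""
            self.final = None
        elif kind == "transcript.user.delta":
            # Partial text: good for live captions, not for acting on.
            self.partial += event.get("text", "")
        elif kind == "input.speech.stopped":
            self.user_speaking = False
        elif kind == "transcript.user":
            # Final transcript for the turn.
            self.final = event["text"]
        elif kind == "reply.started":
            # The server ended the turn; the client sends no end-of-turn signal.
            self.agent_replying = True


# Usage: replay a hypothetical event stream for one turn.
tracker = TurnTracker()
for event in [
    {"type": "input.speech.started"},
    {"type": "transcript.user.delta", "text": "Call me at "},
    {"type": "transcript.user.delta", "text": "415 555 0100"},
    {"type": "input.speech.stopped"},
    {"type": "transcript.user", "text": "Call me at 415 555 0100."},
    {"type": "reply.started"},
]:
    tracker.handle(event)
```

The tracker only reacts to events; it has no method for ending a turn, which mirrors the point that end-of-turn is decided server-side.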
Get better turn-taking with tool descriptions
The agent waits for a complete value when it knows one is coming, and it learns that from your tool’s parameters, so describe the important ones well. No turn-detection config is involved: the parameter’s `description` (plus optional `examples` or `pattern`) is what drives it. See parameter hints.
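As an illustration, here is a sketch of a tool whose parameters carry those hints. Only `description`, `examples`, and `pattern` come from the text above; the overall tool shape (`name`, `description`, and a JSON Schema `parameters` object) and the `book_callback` tool are assumptions, so adapt them to the actual tool format. The two helper functions are hypothetical lint checks you might run before registering tools.

```python
import re
from typing import List

# Hypothetical tool definition. The per-parameter hints (description,
# examples, pattern) are what tell the agent a full value is coming.
BOOK_CALLBACK = {
    "name": "book_callback",
    "description": "Schedule a callback to the caller.",
    "parameters": {
        "type": "object",
        "properties": {
            "phone_number": {
                "type": "string",
                "description": "Caller's full phone number, including area or country code.",
                "examples": ["415 555 0100", "+44 20 7946 0958"],
                "pattern": r"^\+?[0-9 ()-]{7,20}$",
            },
            "email": {
                "type": "string",
                "description": "Email address where the confirmation is sent.",
                "examples": ["jane.doe@example.com"],
            },
        },
        "required": ["phone_number"],
    },
}


def missing_descriptions(tool: dict) -> List[str]:
    """Return names of parameters that lack a description (the main hint)."""
    props = tool["parameters"]["properties"]
    return [name for name, spec in props.items() if not spec.get("description")]


def examples_match_patterns(tool: dict) -> bool:
    """Check every example against its parameter's pattern, when one is set."""
    for spec in tool["parameters"]["properties"].values():
        pattern = spec.get("pattern")
        if pattern:
            for example in spec.get("examples", []):
                if not re.fullmatch(pattern, example):
                    return False
    return True
```

Keeping `examples` consistent with `pattern` avoids giving the agent contradictory hints about what a complete value looks like.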
Interruptions
When the user truly interrupts, the server emits:
- `reply.done` with `status: "interrupted"`
- `transcript.agent` with `interrupted: true` and `text` trimmed to what the user actually heard.
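The useful consequence is that your conversation log can record what the user actually heard rather than what the agent planned to say. A minimal sketch, assuming events are dicts with a `type` key; `AgentHistory` and its `on_interrupted` hook (for example, flushing locally queued audio) are hypothetical names, not part of the documented API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AgentHistory:
    """Records the agent's turns as the user actually heard them.

    Assumption (not documented): events are dicts with a "type" key.
    on_interrupted is a hypothetical hook, e.g. to stop local playback.
    """

    on_interrupted: Callable[[], None] = lambda: None
    turns: List[Tuple[str, bool]] = field(default_factory=list)  # (text, interrupted)
    last_status: str = ""

    def handle(self, event: dict) -> None:
        kind = event["type"]
        if kind == "transcript.agent":
            # On interruption, text is already trimmed to what was heard,
            # so store it verbatim instead of the full planned reply.
            self.turns.append((event["text"], bool(event.get("interrupted", False))))
        elif kind == "reply.done":
            self.last_status = event.get("status", "")
            if self.last_status == "interrupted":
                self.on_interrupted()


# Usage: a hypothetical interrupted reply.
flushed: List[str] = []
history = AgentHistory(on_interrupted=lambda: flushed.append("flush"))
history.handle({"type": "reply.done", "status": "interrupted"})
history.handle({"type": "transcript.agent", "text": "Your order ships on", "interrupted": True})
```

The handler does not depend on which of the two events arrives first.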
Override the defaults (rarely needed)
If you must tune sensitivity, set the VAD knobs in `session.input.turn_detection`. All fields are optional; send only the ones you change.
| Field | Description |
|---|---|
| `vad_threshold` | Speech detection sensitivity (0.0–1.0). Lower is more sensitive. |
| `min_silence` | Minimum silence for a confident end-of-turn, in ms. |
| `max_silence` | Maximum silence before forcing end-of-turn, in ms. |
| `interrupt_response` | Set `false` to disable barge-in entirely. |
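Since every field is optional, a partial override should contain only what you change. The sketch below builds the nested `session.input.turn_detection` object from the documented field names and omits anything left unset. The message envelope used to send it is not shown in this section and is left out. `turn_detection_override` is a hypothetical helper, and its `min_silence <= max_silence` check is an assumption rather than a documented rule.

```python
from typing import Any, Dict, Optional


def turn_detection_override(
    vad_threshold: Optional[float] = None,
    min_silence: Optional[int] = None,
    max_silence: Optional[int] = None,
    interrupt_response: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build a session object containing only the turn-detection fields you set."""
    fields: Dict[str, Any] = {}
    if vad_threshold is not None:
        if not 0.0 <= vad_threshold <= 1.0:
            raise ValueError("vad_threshold must be between 0.0 and 1.0")
        fields["vad_threshold"] = vad_threshold
    for name, value in (("min_silence", min_silence), ("max_silence", max_silence)):
        if value is not None:
            if value < 0:
                raise ValueError(f"{name} must be a non-negative number of ms")
            fields[name] = value
    if min_silence is not None and max_silence is not None and min_silence > max_silence:
        # Assumption: a minimum above the maximum is almost certainly a mistake.
        raise ValueError("min_silence should not exceed max_silence")
    if interrupt_response is not None:
        fields["interrupt_response"] = interrupt_response
    return {"session": {"input": {"turn_detection": fields}}}


# Usage: give slow speakers a bit more room, leave everything else on default.
update = turn_detection_override(min_silence=700)
```

Omitting unset fields, rather than sending defaults explicitly, keeps the server’s adaptive behavior in charge of everything you did not deliberately change.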
If the agent keeps interrupting itself, the mic is picking up its own TTS. Use headphones or a browser-based client (which has echo cancellation). See Troubleshooting.