# Message Sequence

This page is the canonical reference for the streaming WebSocket protocol: every message the client sends, every message the server emits, and the order they appear in. For the field-level schema of each message, see the [Streaming API reference](/api-reference/streaming-api/universal-streaming).

## Sequence diagram

```mermaid theme={null}
sequenceDiagram
  participant C as Client
  participant S as Server
  C->>S: WebSocket connect (token, speech_model, sample_rate, ...)
  S->>C: Begin
  loop while streaming
    C->>S: binary audio frames (raw PCM, ~50 ms)
    S->>C: SpeechStarted
    S->>C: Turn (end_of_turn: false) xN (partials)
    S->>C: Turn (end_of_turn: true)
  end
  C->>S: Terminate
  S-->>C: in-flight Turn(s)
  S->>C: Termination
  S->>C: close (1000)
```

## Messages at a glance

**Client → server**

| Message               | Frame type  | Purpose                                                                |
| --------------------- | ----------- | ---------------------------------------------------------------------- |
| Audio                 | **Binary**  | Raw audio bytes (PCM, no JSON or base64 wrapper). Send \~50 ms chunks. |
| `UpdateConfiguration` | Text (JSON) | Change transcription settings mid-session.                             |
| `ForceEndpoint`       | Text (JSON) | Immediately end the current turn.                                      |
| `KeepAlive`           | Text (JSON) | Reset the inactivity timer (only needed with `inactivity_timeout`).    |
| `Terminate`           | Text (JSON) | End the session.                                                       |

**Server → client**

| Message           | When                                                  | Purpose                                                                         |
| ----------------- | ----------------------------------------------------- | ------------------------------------------------------------------------------- |
| `Begin`           | Once, on connect                                      | Session ID, expiration time, and the configuration the server actually applied. |
| `SpeechStarted`   | Per turn (Universal-3 Pro and Universal-3.5 Pro only) | Speech detected; precedes the turn's first `Turn` message.                      |
| `Turn`            | Repeatedly                                            | Partial and final transcripts; see the walkthrough below.                       |
| `SpeakerRevision` | If `speaker_labels=true`                              | Revised speaker labels for earlier turns.                                       |
| `Termination`     | Once, last message                                    | Session totals; nothing follows it.                                             |
| `Error`           | On failure, before close                              | Error code and detail; the connection then closes.                              |
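
Because every server message is a JSON text frame with a `type` field, a client can route on that field. The following is a minimal sketch, not the SDK's implementation: `dispatch` and the `handlers` mapping are hypothetical names, and silently ignoring unknown types is an assumption of this example.

```python
import json
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], None]


def dispatch(raw_frame: str, handlers: Dict[str, Handler]) -> str:
    """Decode one server text frame and call the handler registered for its type.

    Returns the message type so the caller can react to it (for example, stop
    reading after "Termination"). Types without a handler are ignored, which is
    a choice made for this sketch.
    """
    message = json.loads(raw_frame)
    message_type = message.get("type", "")
    handler = handlers.get(message_type)
    if handler is not None:
        handler(message)
    return message_type


# Example: record the session ID from Begin and the text of each Turn.
seen = []
handlers = {
    "Begin": lambda m: seen.append(("begin", m["id"])),
    "Turn": lambda m: seen.append(("turn", m["transcript"])),
}
dispatch('{"type": "Begin", "id": "abc", "expires_at": 0}', handlers)
dispatch('{"type": "Turn", "transcript": "My name is", "end_of_turn": false}', handlers)
```

Returning the type keeps the read loop simple: the caller can break out as soon as `dispatch` returns `"Termination"`.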

## Session initialization

When the session begins, you receive a `Begin` message with the session ID, expiration time, and the configuration the server applied to your session.

```json theme={null}
{
  "type": "Begin",
  "id": "3207b601-2054-48df-ba77-8784dfcf9fb8",
  "expires_at": 1772570132,
  "configuration": {
    "model": "universal-3-5-pro",
    "mode": "balanced",
    "api_version": "1.0.0"
  }
}
```

`expires_at` is a Unix timestamp (seconds). When it is reached, the server closes the session with error `3008`.

<Tip>
  Always check that `configuration.model` matches the `speech_model` you requested. Unrecognized or misspelled query parameters are ignored rather than rejected, so this echo is the fastest way to catch a typo (for example, `speechModel` instead of `speech_model`).
</Tip>
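
The check described in the tip, together with the `expires_at` countdown, might look like the sketch below. `check_begin` is a hypothetical helper, and the `now` parameter exists only so the example is deterministic.

```python
import json
import time
from typing import Optional


def check_begin(raw_frame: str, requested_model: str, now: Optional[float] = None) -> float:
    """Validate a Begin message and return the seconds left until expires_at.

    Raises ValueError if the frame is not a Begin message or if the model the
    server applied differs from the one that was requested.
    """
    message = json.loads(raw_frame)
    if message.get("type") != "Begin":
        raise ValueError(f"expected Begin, got {message.get('type')!r}")
    applied = message["configuration"]["model"]
    if applied != requested_model:
        raise ValueError(f"requested {requested_model!r} but server applied {applied!r}")
    current = time.time() if now is None else now
    return message["expires_at"] - current


# The Begin message from the example above, checked 60 seconds before expiry.
begin_frame = json.dumps({
    "type": "Begin",
    "id": "3207b601-2054-48df-ba77-8784dfcf9fb8",
    "expires_at": 1772570132,
    "configuration": {"model": "universal-3-5-pro", "mode": "balanced", "api_version": "1.0.0"},
})
remaining = check_begin(begin_frame, "universal-3-5-pro", now=1772570072)
```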

## Sending audio

Audio is sent as **binary WebSocket frames** containing raw audio bytes in the encoding and sample rate you set at connection time (default: 16 kHz, 16-bit mono PCM). Do not wrap audio in JSON and do not base64-encode it. A text frame that isn't valid JSON is rejected and closes the session.

Send audio in chunks of roughly 50 ms (800 samples at 16 kHz). You can send audio faster than real time (for example when streaming from a file), but throughput is throttled at \~1.25× real time. If more than five minutes of audio is buffered ahead of processing, the session closes with error `3007`.

<Tabs>
  <Tab language="python" title="Python" default>
    ```python theme={null}
    import websocket

    # ws is an already-open websocket.WebSocket connection (websocket-client package).
    # audio_chunk is raw PCM bytes (e.g. 1600 bytes = 800 samples ≈ 50 ms at 16 kHz mono)
    ws.send(audio_chunk, websocket.ABNF.OPCODE_BINARY)
    ```
  </Tab>

  <Tab language="python-sdk" title="Python SDK">
    ```python theme={null}
    # audio_iterator yields raw PCM byte chunks; the SDK handles binary framing.
    client.stream(audio_iterator)
    ```
  </Tab>

  <Tab language="javascript" title="JavaScript">
    ```javascript theme={null}
    // audioChunk is a Buffer/Uint8Array of raw PCM bytes
    ws.send(audioChunk);
    ```
  </Tab>

  <Tab language="javascript-sdk" title="JavaScript SDK">
    ```javascript theme={null}
    // Pipe any readable byte stream into transcriber.stream(); the SDK handles framing.
    Readable.toWeb(audioSource).pipeTo(transcriber.stream());
    ```
  </Tab>
</Tabs>
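
The chunk-size arithmetic above can be sketched as follows. Both helper names are hypothetical, and the defaults assume the 16-bit mono PCM default described in this section.

```python
from typing import Iterator


def chunk_size_bytes(sample_rate: int, chunk_ms: int = 50,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Number of bytes in one chunk of raw PCM audio of the given duration."""
    samples = sample_rate * chunk_ms // 1000
    return samples * bytes_per_sample * channels


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes; the last may be shorter."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


# One second of 16 kHz silence, split into ~50 ms chunks.
one_second = bytes(chunk_size_bytes(16000, chunk_ms=1000))
chunks = list(iter_chunks(one_second, chunk_size_bytes(16000)))
```

Each yielded chunk would be sent as one binary frame. When streaming from a file, sleeping for the chunk duration between sends keeps the stream at real time.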

## Speech started

Immediately before the **first `Turn` message of each turn**, the server sends a `SpeechStarted` message. The `timestamp` field is the start of the turn, in milliseconds relative to the beginning of the audio stream. The `confidence` field is the average word confidence of the initial transcript.

```json theme={null}
{
  "type": "SpeechStarted",
  "timestamp": 1216,
  "confidence": 0.987654
}
```

<Note>
  Universal Streaming skips `SpeechStarted` and goes straight to the first `Turn`.
</Note>

## Partial transcript

As the speaker is talking, the server emits one or more `Turn` messages with `end_of_turn: false`. These are partial transcripts.

```json expandable theme={null}
{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": false,
  "transcript": "My name is",
  "end_of_turn_confidence": 0,
  "words": [
    {
      "start": 1216,
      "end": 1627,
      "text": "My",
      "confidence": 0.956314,
      "word_is_final": false
    },
    {
      "start": 1668,
      "end": 2490,
      "text": "name",
      "confidence": 0.999393,
      "word_is_final": false
    },
    {
      "start": 2531,
      "end": 3067,
      "text": "is",
      "confidence": 0.753325,
      "word_is_final": false
    }
  ],
  "utterance": "",
  "type": "Turn"
}
```

The cadence and shape of partials depend on the model. See [Universal-3 Pro Streaming](/streaming/getting-started/transcribe-streaming-audio) and [Universal Streaming](/streaming/getting-started/transcribe-streaming-audio) for the details of how each model produces partials.

## End of turn

When the turn ends, the server emits a `Turn` message with `end_of_turn: true` and `turn_is_formatted: true`. This is the last message for this `turn_order` and the transcript is fully formatted.

```json expandable theme={null}
{
  "turn_order": 0,
  "turn_is_formatted": true,
  "end_of_turn": true,
  "transcript": "My name is Sonny.",
  "end_of_turn_confidence": 1,
  "words": [
    {
      "start": 1216,
      "end": 1635,
      "text": "My",
      "confidence": 0.956583,
      "word_is_final": true
    },
    {
      "start": 1676,
      "end": 2515,
      "text": "name",
      "confidence": 0.999199,
      "word_is_final": true
    },
    {
      "start": 2556,
      "end": 2975,
      "text": "is",
      "confidence": 0.999535,
      "word_is_final": true
    },
    {
      "start": 3016,
      "end": 4155,
      "text": "Sonny.",
      "confidence": 0.316031,
      "word_is_final": true
    }
  ],
  "utterance": "My name is Sonny.",
  "type": "Turn"
}
```

<Note>
  **Universal Streaming differs.** By default (`format_turns=false`), the final arrives with `turn_is_formatted: false`. With `format_turns=true`, you receive two `end_of_turn: true` messages for the same `turn_order`: the unformatted final first, then the formatted final immediately after. In that case, treat a turn as complete only when both `end_of_turn` and `turn_is_formatted` are `true`, or you'll process every turn twice.
</Note>

<Accordion title="Universal Streaming end-of-turn examples">
  Default (`format_turns=false`) — single unformatted final:

  ```json expandable theme={null}
  {
    "turn_order": 0,
    "turn_is_formatted": false,
    "end_of_turn": true,
    "transcript": "my name is sonny",
    "end_of_turn_confidence": 0.812345,
    "words": [
      { "start": 1216, "end": 1635, "text": "my",    "confidence": 0.956583, "word_is_final": true },
      { "start": 1676, "end": 2515, "text": "name",  "confidence": 0.999199, "word_is_final": true },
      { "start": 2556, "end": 2975, "text": "is",    "confidence": 0.999535, "word_is_final": true },
      { "start": 3016, "end": 4155, "text": "sonny", "confidence": 0.316031, "word_is_final": true }
    ],
    "type": "Turn"
  }
  ```

  With `format_turns=true` — the unformatted final above, then a formatted final right after:

  ```json expandable theme={null}
  {
    "turn_order": 0,
    "turn_is_formatted": true,
    "end_of_turn": true,
    "transcript": "My name is Sonny.",
    "end_of_turn_confidence": 0.812345,
    "words": [
      { "start": 1216, "end": 1635, "text": "My",     "confidence": 0.956583, "word_is_final": true },
      { "start": 1676, "end": 2515, "text": "name",   "confidence": 0.999199, "word_is_final": true },
      { "start": 2556, "end": 2975, "text": "is",     "confidence": 0.999535, "word_is_final": true },
      { "start": 3016, "end": 4155, "text": "Sonny.", "confidence": 0.316031, "word_is_final": true }
    ],
    "type": "Turn"
  }
  ```
</Accordion>
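
One way to apply the end-of-turn rules above in client code is sketched below. `is_turn_complete` is a hypothetical helper, and passing `format_turns` in explicitly (so the function knows whether a formatted final is still coming) is this example's interpretation of the rules, not a documented API.

```python
from typing import Any, Dict


def is_turn_complete(message: Dict[str, Any], format_turns: bool) -> bool:
    """Return True only on the message that should be treated as the end of a turn."""
    if message.get("type") != "Turn" or not message.get("end_of_turn"):
        return False
    if format_turns:
        # A formatted final is expected, so skip the unformatted one to avoid
        # processing the same turn twice.
        return bool(message.get("turn_is_formatted"))
    # No formatted final will follow; the first end_of_turn message is final.
    return True


partial = {"type": "Turn", "turn_order": 0, "end_of_turn": False, "turn_is_formatted": False}
unformatted_final = {"type": "Turn", "turn_order": 0, "end_of_turn": True, "turn_is_formatted": False}
formatted_final = {"type": "Turn", "turn_order": 0, "end_of_turn": True, "turn_is_formatted": True}
```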

## Forcing an endpoint

To end the current turn immediately, for example when your own VAD or a push-to-talk button decides the user is done, send a `ForceEndpoint` message:

<Tabs>
  <Tab language="python" title="Python" default>
    ```python theme={null}
    ws.send(json.dumps({"type": "ForceEndpoint"}))
    ```
  </Tab>

  <Tab language="python-sdk" title="Python SDK">
    ```python theme={null}
    client.force_endpoint()
    ```
  </Tab>

  <Tab language="javascript" title="JavaScript">
    ```javascript theme={null}
    ws.send(JSON.stringify({ type: "ForceEndpoint" }));
    ```
  </Tab>

  <Tab language="javascript-sdk" title="JavaScript SDK">
    ```javascript theme={null}
    transcriber.forceEndpoint();
    ```
  </Tab>
</Tabs>

The server responds with the `end_of_turn: true` message(s) for the current turn right away, without waiting for silence or punctuation.

## Updating configuration mid-session

You can change transcription settings at any point after `Begin` without reconnecting. Only the fields you include are changed:

<Tabs>
  <Tab language="python" title="Python" default>
    ```python theme={null}
    ws.send(json.dumps({
        "type": "UpdateConfiguration",
        "end_of_turn_confidence_threshold": 0.8,
        "min_end_of_turn_silence_when_confident": 400,
        "max_turn_silence": 1500,
        "keyterms_prompt": ["Keanu Reeves", "AssemblyAI"],
    }))
    ```
  </Tab>

  <Tab language="python-sdk" title="Python SDK">
    ```python theme={null}
    client.update_configuration(
        end_of_turn_confidence_threshold=0.8,
        min_end_of_turn_silence_when_confident=400,
        max_turn_silence=1500,
        keyterms_prompt=["Keanu Reeves", "AssemblyAI"],
    )
    ```
  </Tab>

  <Tab language="javascript" title="JavaScript">
    ```javascript theme={null}
    ws.send(JSON.stringify({
      type: "UpdateConfiguration",
      end_of_turn_confidence_threshold: 0.8,
      min_end_of_turn_silence_when_confident: 400,
      max_turn_silence: 1500,
      keyterms_prompt: ["Keanu Reeves", "AssemblyAI"],
    }));
    ```
  </Tab>

  <Tab language="javascript-sdk" title="JavaScript SDK">
    ```javascript theme={null}
    transcriber.updateConfiguration({
      endOfTurnConfidenceThreshold: 0.8,
      minEndOfTurnSilenceWhenConfident: 400,
      maxTurnSilence: 1500,
      keytermsPrompt: ["Keanu Reeves", "AssemblyAI"],
    });
    ```
  </Tab>
</Tabs>

There is no acknowledgement message; the new settings apply to audio processed after the update. They do not retroactively affect the turn in progress. Some fields are model-specific (for example `mode`, `prompt`, and `agent_context` are Universal-3 Pro and Universal-3.5 Pro only). See [Updating configuration mid-stream](/streaming/updating-configuration-mid-stream) for the full list.
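
Since only the included fields are changed, a small builder that drops unset values can help avoid sending unintended settings. This is a sketch: `build_update_configuration` is a hypothetical helper, and rejecting an empty update is a choice made here, not server behavior.

```python
import json
from typing import Any


def build_update_configuration(**fields: Any) -> str:
    """Serialize an UpdateConfiguration message containing only the set fields."""
    if "type" in fields:
        raise ValueError("'type' is reserved for the message type")
    changes = {key: value for key, value in fields.items() if value is not None}
    if not changes:
        raise ValueError("no settings to update")
    return json.dumps({"type": "UpdateConfiguration", **changes})


# Only max_turn_silence is sent; the None value is dropped.
update_frame = build_update_configuration(max_turn_silence=1500, keyterms_prompt=None)
```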

## Keep alive

**`KeepAlive` messages are not required.** By default, sessions remain open until [explicitly terminated](#session-termination) or until the 3-hour maximum session duration is reached.

`KeepAlive` is only relevant if you set the `inactivity_timeout` connection parameter, which closes the session after a period in which no audio or messages are sent. To keep such a session open while no audio is flowing, send a `KeepAlive` message to reset the inactivity timer:

<Tabs>
  <Tab language="python" title="Python" default>
    ```python theme={null}
    ws.send(json.dumps({"type": "KeepAlive"}))
    ```
  </Tab>

  <Tab language="python-sdk" title="Python SDK">
    ```python theme={null}
    client.keep_alive()
    ```
  </Tab>

  <Tab language="javascript" title="JavaScript">
    ```javascript theme={null}
    ws.send(JSON.stringify({ type: "KeepAlive" }));
    ```
  </Tab>

  <Tab language="javascript-sdk" title="JavaScript SDK">
    ```javascript theme={null}
    transcriber.keepAlive();
    ```
  </Tab>
</Tabs>

If the inactivity timeout elapses, the session closes with error `3006` and the message `Session terminated due to inactivity: No messages received for N seconds`.
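
A client that pauses audio needs to decide when a `KeepAlive` is due. The sketch below is one way to do that. `KeepAliveTimer` is a hypothetical class, and sending at half the configured timeout (`margin=0.5`) is an assumption of this example rather than a documented recommendation.

```python
import json
import time
from typing import Callable, Optional


class KeepAliveTimer:
    """Tracks when the client last sent anything and reports when a KeepAlive is due."""

    def __init__(self, inactivity_timeout: float, margin: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._interval = inactivity_timeout * margin
        self._clock = clock
        self._last_sent = clock()

    def mark_sent(self) -> None:
        """Call after sending audio or any message; both reset the inactivity timer."""
        self._last_sent = self._clock()

    def poll(self) -> Optional[str]:
        """Return a KeepAlive frame to send if one is due, otherwise None."""
        if self._clock() - self._last_sent < self._interval:
            return None
        self.mark_sent()
        return json.dumps({"type": "KeepAlive"})


# Deterministic example with a fake clock and a 30-second inactivity_timeout.
fake_now = [0.0]
timer = KeepAliveTimer(30, clock=lambda: fake_now[0])
early = timer.poll()   # nothing due yet
fake_now[0] = 15.0
due = timer.poll()     # half the timeout has elapsed
```

In a real client, `poll()` would be called periodically from the send loop and any returned frame passed to `ws.send`.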

<Note>
  **Ordering guarantees.** `turn_order` is monotonically increasing, and all messages for a turn are delivered before the first message of the next turn. Within a turn, each `Turn` message supersedes the previous one. Render the latest `transcript`; do not append. A turn is complete on the message where both `end_of_turn` and `turn_is_formatted` are `true`.
</Note>
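
The "render the latest, do not append" rule can be sketched with a small state holder keyed by `turn_order`. `TranscriptView` is a hypothetical name, and joining turns with a single space is a presentation choice made for this example.

```python
from typing import Any, Dict, List


class TranscriptView:
    """Keeps the latest transcript per turn; each Turn message replaces the previous one."""

    def __init__(self) -> None:
        self._order: List[int] = []
        self._latest: Dict[int, str] = {}

    def update(self, message: Dict[str, Any]) -> None:
        if message.get("type") != "Turn":
            return
        turn = message["turn_order"]
        if turn not in self._latest:
            self._order.append(turn)
        # Replace rather than append: a later message supersedes earlier ones.
        self._latest[turn] = message["transcript"]

    def render(self) -> str:
        """Join the latest transcript of every turn, skipping empty ones."""
        return " ".join(self._latest[t] for t in self._order if self._latest[t])


view = TranscriptView()
view.update({"type": "Turn", "turn_order": 0, "transcript": "My name is"})
view.update({"type": "Turn", "turn_order": 0, "transcript": "My name is Sonny."})
view.update({"type": "Turn", "turn_order": 1, "transcript": "Hello"})
rendered = view.render()
```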

## Session termination

To end a session, send a `Terminate` message:

<Tabs>
  <Tab language="python" title="Python" default>
    ```python theme={null}
    ws.send(json.dumps({"type": "Terminate"}))
    ```
  </Tab>

  <Tab language="python-sdk" title="Python SDK">
    ```python theme={null}
    client.disconnect(terminate=True)
    ```
  </Tab>

  <Tab language="javascript" title="JavaScript">
    ```javascript theme={null}
    ws.send(JSON.stringify({ type: "Terminate" }));
    ```
  </Tab>

  <Tab language="javascript-sdk" title="JavaScript SDK">
    ```javascript theme={null}
    await transcriber.close();
    ```
  </Tab>
</Tabs>

After you send `Terminate`, **keep reading from the WebSocket until you receive the `Termination` message**. The server first flushes any in-flight messages, which can include:

* the final (and formatted) `Turn` for audio you already sent;
* on Universal Streaming, a closing `Turn` with an empty `transcript` for an unfinished turn;
* a `SpeakerRevision` if `speaker_labels=true`. The end-of-session refinement adds approximately 400 ms of latency at session close. See [Revised speaker labels](/streaming/label-speakers-and-separate-channels#revised-speaker-labels) for the message schema and consumption guidance.

If you close the socket as soon as you send `Terminate`, these in-flight messages are lost, including your last transcript.

The `Termination` message is always the last message:

```json theme={null}
{
  "type": "Termination",
  "audio_duration_seconds": 13,
  "session_duration_seconds": 14
}
```

* `audio_duration_seconds` is the total audio processed in the session.
* `session_duration_seconds` is the total wall-clock time the connection was open. **This is the duration the session is billed on.**

After `Termination`, no further messages are sent and the server closes the connection with code `1000`.
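
The shutdown sequence described in this section (send `Terminate`, then keep reading until `Termination`) can be sketched as a drain loop. `drain_until_termination` is a hypothetical helper, and `recv` stands in for whatever your WebSocket library uses to read the next text frame. This sketch assumes `recv` returns `None` if the socket closes early.

```python
import json
from typing import Any, Callable, Dict, List, Optional


def drain_until_termination(recv: Callable[[], Optional[str]]) -> List[Dict[str, Any]]:
    """Read frames until Termination and return every message, Termination last."""
    messages: List[Dict[str, Any]] = []
    while True:
        frame = recv()
        if frame is None:
            raise ConnectionError("socket closed before Termination arrived")
        message = json.loads(frame)
        messages.append(message)
        if message.get("type") == "Termination":
            return messages


# Simulated frames that arrive after the client sends Terminate.
frames = iter([
    '{"type": "Turn", "turn_order": 3, "end_of_turn": true, "turn_is_formatted": true, "transcript": "Goodbye."}',
    '{"type": "Termination", "audio_duration_seconds": 13, "session_duration_seconds": 14}',
])
flushed = drain_until_termination(lambda: next(frames, None))
billed_seconds = flushed[-1]["session_duration_seconds"]
```

The final `Turn` is still delivered to the caller, and the billed duration is read from the `Termination` message.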

<Warning>
  Always terminate sessions explicitly. Streaming is [billed per session](/billing-and-pricing#streaming-speech-to-text-billing). Sessions that are not terminated remain open and continue to accrue charges until the server auto-closes them after 3 hours (error code `3008`). See [Common errors](/streaming/common-session-errors-and-closures) for more details.
</Warning>

## Errors

If the session fails at any point, the server sends an `Error` message as a text frame and then closes the connection:

```json theme={null}
{
  "type": "Error",
  "error_code": 3007,
  "error": "Audio transmission rate exceeded: too much audio buffered"
}
```

See [Common errors](/streaming/common-session-errors-and-closures) for the full list of error codes and how to handle reconnection.
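
A client can turn `Error` messages into exceptions so that failures are not mistaken for ordinary messages. The sketch below uses hypothetical names (`StreamingError`, `raise_on_error`). The short descriptions in `KNOWN_CODES` paraphrase only the three codes mentioned on this page and are not the full list.

```python
import json
from typing import Any, Dict

# Codes mentioned on this page; see the linked error reference for the rest.
KNOWN_CODES = {
    3006: "inactivity timeout elapsed",
    3007: "too much audio buffered ahead of processing",
    3008: "session expired",
}


class StreamingError(Exception):
    """Raised when the server sends an Error message before closing."""

    def __init__(self, code: int, detail: str) -> None:
        super().__init__(f"{code}: {detail}")
        self.code = code
        self.detail = detail
        self.summary = KNOWN_CODES.get(code, "unknown error code")


def raise_on_error(raw_frame: str) -> Dict[str, Any]:
    """Decode a text frame; raise StreamingError if it is an Error message."""
    message = json.loads(raw_frame)
    if message.get("type") == "Error":
        raise StreamingError(message["error_code"], message["error"])
    return message


error_frame = json.dumps({
    "type": "Error",
    "error_code": 3007,
    "error": "Audio transmission rate exceeded: too much audio buffered",
})
try:
    raise_on_error(error_frame)
    caught = None
except StreamingError as exc:
    caught = exc
```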