Whisper streaming allows you to transcribe audio streams in 99+ languages using the WhisperLiveKit model.
Streaming is billed per session. Whisper Streaming is billed on the total duration that your WebSocket connection stays open, not on the amount of audio you send. Always send a Terminate message when you’re done with a stream: sessions that aren’t closed auto-close after 3 hours and are billed for the full duration. See Billing and pricing for details.
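Because billing follows connection time, it can help to wrap the Terminate step in a small helper that is easy to call from every exit path. The sketch below is illustrative only: `terminate_message` and `close_session` are hypothetical names, and `ws` is assumed to be any object with an async `send` method (for example, a connection from the websockets library).

```python
import asyncio
import json


def terminate_message() -> str:
    """Serialize the Terminate message as a JSON text frame."""
    return json.dumps({"type": "Terminate"})


async def close_session(ws) -> None:
    """Send Terminate over an open connection.

    `ws` is an assumption: any object exposing an async `send(str)` method.
    """
    await ws.send(terminate_message())
```

Calling `await close_session(ws)` in a `finally` block is one way to make sure the message goes out even when the streaming loop exits with an error.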

Configuration

To use Whisper streaming, include speech_model=whisper-rt as a query parameter in the WebSocket URL.
The whisper-rt model does not support the language parameter. The model automatically detects the language being spoken. Do not include a language parameter when using this model.
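As a minimal sketch of the configuration rules above, the helper below builds the connection URL and refuses a language parameter. The function name `build_whisper_url` and its keyword handling are assumptions made for illustration; the base URL and the sample_rate parameter are taken from the examples elsewhere on this page.

```python
from urllib.parse import urlencode

BASE_URL = "wss://streaming.assemblyai.com/v3/ws"


def build_whisper_url(sample_rate: int = 16000, **options) -> str:
    """Build a WebSocket URL that selects the whisper-rt model.

    Extra query parameters can be passed as keyword arguments.
    Booleans are written as lowercase "true"/"false".
    """
    if "language" in options:
        # whisper-rt detects the language automatically
        raise ValueError("whisper-rt does not support the language parameter")
    params = {"sample_rate": sample_rate, "speech_model": "whisper-rt"}
    for key, value in options.items():
        params[key] = str(value).lower() if isinstance(value, bool) else value
    return f"{BASE_URL}?{urlencode(params)}"
```

Rejecting `language` at URL-build time surfaces the mistake before a connection is opened.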

Supported languages

Whisper streaming supports 99+ languages:
Code  Language
af    Afrikaans
am    Amharic
ar    Arabic
as    Assamese
az    Azerbaijani
ba    Bashkir
be    Belarusian
bg    Bulgarian
bn    Bengali
bo    Tibetan
br    Breton
bs    Bosnian
ca    Catalan
cs    Czech
cy    Welsh
da    Danish
de    German
el    Greek
en    English
es    Spanish
et    Estonian
eu    Basque
fa    Persian
fi    Finnish
fo    Faroese
fr    French
gl    Galician
gu    Gujarati
ha    Hausa
haw   Hawaiian
he    Hebrew
hi    Hindi
hr    Croatian
ht    Haitian Creole
hu    Hungarian
hy    Armenian
id    Indonesian
is    Icelandic
it    Italian
ja    Japanese
jw    Javanese
ka    Georgian
kk    Kazakh
km    Khmer
kn    Kannada
ko    Korean
la    Latin
lb    Luxembourgish
ln    Lingala
lo    Lao
lt    Lithuanian
lv    Latvian
mg    Malagasy
mi    Maori
mk    Macedonian
ml    Malayalam
mn    Mongolian
mr    Marathi
ms    Malay
mt    Maltese
my    Myanmar
ne    Nepali
nl    Dutch
nn    Nynorsk
no    Norwegian
oc    Occitan
pa    Punjabi
pl    Polish
ps    Pashto
pt    Portuguese
ro    Romanian
ru    Russian
sa    Sanskrit
sd    Sindhi
si    Sinhala
sk    Slovak
sl    Slovenian
sn    Shona
so    Somali
sq    Albanian
sr    Serbian
su    Sundanese
sv    Swedish
sw    Swahili
ta    Tamil
te    Telugu
tg    Tajik
th    Thai
tk    Turkmen
tl    Tagalog
tr    Turkish
tt    Tatar
uk    Ukrainian
ur    Urdu
uz    Uzbek
vi    Vietnamese
yi    Yiddish
yo    Yoruba
yue   Cantonese
zh    Chinese

Language detection

The Whisper streaming model supports automatic language detection, allowing you to identify which language is being spoken in real time. When enabled, the model returns the detected language code and a confidence score with each complete utterance and final turn.

Configuration

To enable language detection, include language_detection=true as a query parameter in the WebSocket URL:
wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&speech_model=whisper-rt&language_detection=true

Output format

When language detection is enabled, each Turn message (with either a complete utterance or end_of_turn: true) will include two additional fields:
  • language_code: The language code of the detected language (e.g., "es" for Spanish, "fr" for French)
  • language_confidence: A confidence score between 0 and 1 indicating how confident the model is in the language detection
The language_code and language_confidence fields only appear when either:
  • The utterance field is non-empty and contains a complete utterance
  • The end_of_turn field is true

Example response

Here’s an example Turn message with language detection enabled, showing Spanish being detected:
{
  "turn_order": 0,
  "turn_is_formatted": false,
  "end_of_turn": true,
  "transcript": "buenos días",
  "end_of_turn_confidence": 1.0,
  "words": [
    {
      "start": 1200,
      "end": 2596,
      "text": "buenos",
      "confidence": 0.0,
      "word_is_final": true
    },
    {
      "start": 2828,
      "end": 3760,
      "text": "días",
      "confidence": 0.0,
      "word_is_final": true
    }
  ],
  "utterance": "Buenos días.",
  "language_code": "es",
  "language_confidence": 0.846999,
  "type": "Turn"
}
In this example, the model detected Spanish ("es") with a confidence of 0.846999.
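Since the language fields are only present on some Turn messages, client code usually needs to read them defensively. The following is a rough sketch, not an official helper: `detected_language` and its `min_confidence` threshold are hypothetical names introduced here, and the field names come from the example response above.

```python
from typing import Optional, Tuple


def detected_language(
    message: dict, min_confidence: float = 0.0
) -> Optional[Tuple[str, float]]:
    """Return (language_code, language_confidence) from a Turn message.

    Returns None when the message is not a Turn, when the language
    fields are absent, or when confidence is below `min_confidence`
    (a hypothetical, caller-chosen threshold).
    """
    if message.get("type") != "Turn":
        return None
    code = message.get("language_code")
    confidence = message.get("language_confidence")
    if code is None or confidence is None:
        return None
    if confidence < min_confidence:
        return None
    return code, confidence
```

With the Spanish example above, `detected_language(turn)` would return `("es", 0.846999)`, while a stricter call such as `detected_language(turn, min_confidence=0.9)` would return `None`.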

Non-speech tags

The Whisper streaming model can detect and transcribe non-speech audio events. These are returned as bracketed tags in the utterance field. Common non-speech tags include:
  • [Silence] - Periods of silence or no speech
  • [Música] / [Music] - Background music detected
  • Other audio events may appear in similar bracketed format

Example response with non-speech

Here’s an example Turn message showing silence detection:
{
  "turn_order": 1,
  "turn_is_formatted": false,
  "end_of_turn": true,
  "transcript": " silence  silence",
  "end_of_turn_confidence": 1.0,
  "words": [
    {
      "start": 6300,
      "end": 6338,
      "text": "",
      "confidence": 0.0,
      "word_is_final": true
    },
    {
      "start": 6376,
      "end": 6687,
      "text": "silence",
      "confidence": 0.0,
      "word_is_final": true
    }
  ],
  "utterance": "[ Silence] [ Silence]",
  "language_code": "fr",
  "language_confidence": 0.480619,
  "type": "Turn"
}
Non-speech tags appear in the utterance field with brackets. The transcript field contains the raw text without formatting. You can filter out non-speech turns by checking if the utterance contains bracketed tags like [Silence] or [Music].
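One possible way to implement that filter is a regular expression that matches any bracketed span. This is a sketch under a stated assumption: it treats every `[...]` span in the utterance as a non-speech tag, so genuine bracketed text in speech would also be removed. The names `strip_non_speech` and `is_non_speech_only` are illustrative.

```python
import re

# Matches any bracketed span such as "[Silence]", "[ Silence]" or "[Música]"
NON_SPEECH_TAG = re.compile(r"\[[^\[\]]*\]")


def strip_non_speech(utterance: str) -> str:
    """Remove bracketed tags and collapse the remaining whitespace."""
    return " ".join(NON_SPEECH_TAG.sub(" ", utterance).split())


def is_non_speech_only(utterance: str) -> bool:
    """True when the utterance contains tags and nothing else."""
    has_tag = NON_SPEECH_TAG.search(utterance) is not None
    return has_tag and strip_non_speech(utterance) == ""
```

For the silence example above, `is_non_speech_only("[ Silence] [ Silence]")` evaluates to `True`, so that turn could be skipped before further processing.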

Understanding formatting

By default, the Whisper streaming model returns unformatted transcripts. To receive formatted transcripts with proper punctuation and capitalization, you must set format_turns=true as a query parameter.
For voice agent pipelines, formatting is not required since LLMs process unformatted text directly. For notetaking and closed captioning applications, enable format_turns to make output human-readable.

Configuration

To enable formatted transcripts, include format_turns=true in the WebSocket URL:
wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&speech_model=whisper-rt&format_turns=true

Example comparison

Here’s how the same Spanish phrase appears with and without formatting:
Unformatted (format_turns=false, default):
{
  "transcript": "buenos días",
  "turn_is_formatted": false
}
Formatted (format_turns=true):
{
  "transcript": "Buenos días.",
  "turn_is_formatted": true
}
When formatting is enabled, the transcript includes proper capitalization and punctuation.
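For captioning or note-taking, a client may want to display only completed turns, and only the formatted version when formatting is switched on. The sketch below shows one way to express that check. It assumes that, with format_turns=true, the completed turn arrives with turn_is_formatted set to true; whether an unformatted end-of-turn message is also sent first is not stated here, so the helper simply ignores unformatted turns in that mode. `caption_text` is a hypothetical name.

```python
from typing import Optional


def caption_text(turn: dict, formatting_enabled: bool) -> Optional[str]:
    """Return display-ready text for a completed turn, else None.

    `formatting_enabled` should mirror whether format_turns=true
    was included in the connection URL.
    """
    if turn.get("type") != "Turn" or not turn.get("end_of_turn"):
        return None
    if formatting_enabled and not turn.get("turn_is_formatted"):
        # Assumption: a formatted version of this turn will follow
        return None
    return turn.get("transcript") or None
```

A voice agent pipeline would call this with `formatting_enabled=False` and pass the unformatted transcript straight to the LLM.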

Quickstart

First, install the required dependencies.
pip install websockets pyaudio
The Python example uses the websockets library. If you’re using websockets version 13.0 or later, use the additional_headers parameter. For older versions (< 13.0), use extra_headers instead.
import websockets
import asyncio
import json
from urllib.parse import urlencode

import pyaudio

FRAMES_PER_BUFFER = 3200
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 48000
p = pyaudio.PyAudio()

stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    frames_per_buffer=FRAMES_PER_BUFFER
)

BASE_URL = "wss://streaming.assemblyai.com/v3/ws"
CONNECTION_PARAMS = {
    "sample_rate": RATE,
    "speech_model": "whisper-rt",
    "language_detection": "true",  # lowercase string, so the URL carries language_detection=true
}
URL = f"{BASE_URL}?{urlencode(CONNECTION_PARAMS)}"

async def send_receive():

    print(f'Connecting websocket to url {URL}')

    async with websockets.connect(
        URL,
        additional_headers={"Authorization": "YOUR-API-KEY"},
        ping_interval=5,
        ping_timeout=20
    ) as _ws:
        await asyncio.sleep(0.1)
        print("Receiving SessionBegins ...")

        session_begins = await _ws.recv()
        print(session_begins)
        print("Sending messages ...")

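        async def send():
            while True:
                try:
                    # Read one buffer of microphone audio and forward it as a binary frame
                    data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
                    await _ws.send(data)
                except websockets.exceptions.ConnectionClosed as e:
                    # Stop sending once the connection is closed
                    print(e)
                    break
                except Exception as e:
                    print(e)
                    break
                await asyncio.sleep(0.01)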

        async def receive():
            while True:
                try:
                    result_str = await _ws.recv()
                    data = json.loads(result_str)
                    # Use .get() so messages without these fields (non-Turn messages) don't raise KeyError
                    transcript = data.get('transcript')
                    utterance = data.get('utterance')

                    if data['type'] == 'Turn':
                        if not data.get('end_of_turn') and transcript:
                            print(f"[PARTIAL TURN TRANSCRIPT]: {transcript}")
                        if data.get('utterance'):
                            print(f"[PARTIAL TURN UTTERANCE]: {utterance}")
                            # Display language detection info if available
                            if 'language_code' in data:
                                print(f"[UTTERANCE LANGUAGE DETECTION]: {data['language_code']} - {data['language_confidence']:.2%}")
                        if data.get('end_of_turn'):
                            print(f"[FULL TURN TRANSCRIPT]: {transcript}")
                            # Display language detection info if available
                            if 'language_code' in data:
                                print(f"[END OF TURN LANGUAGE DETECTION]: {data['language_code']} - {data['language_confidence']:.2%}")
                    else:
                        pass

                except websockets.exceptions.ConnectionClosed:
                    break
                except Exception as e:
                    print(f"\nError receiving data: {e}")
                    break

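        try:
            await asyncio.gather(send(), receive())
        except (KeyboardInterrupt, asyncio.CancelledError):
            # On Python 3.11+, asyncio.run() delivers Ctrl+C to the coroutine as CancelledError
            # Terminate must be sent as a JSON string, not a dict
            await _ws.send(json.dumps({"type": "Terminate"}))
            # Wait for the server to close the connection after receiving the message
            await _ws.wait_closed()
            print("Session terminated and connection closed.")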

if __name__ == "__main__":
    try:
        asyncio.run(send_receive())
    finally:
        stream.stop_stream()
        stream.close()
        p.terminate()
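The note above about additional_headers versus extra_headers can also be handled at runtime rather than by checking a version number. The sketch below inspects the connect callable's signature and picks whichever keyword it declares. This is an illustrative approach, not part of the websockets API: `header_kwarg` is a hypothetical helper, and it assumes the header parameter is named explicitly in the signature.

```python
import inspect


def header_kwarg(connect_fn) -> str:
    """Return the header keyword that `connect_fn` declares.

    Falls back to "extra_headers" when "additional_headers" is
    not listed among the callable's parameters.
    """
    params = inspect.signature(connect_fn).parameters
    if "additional_headers" in params:
        return "additional_headers"
    return "extra_headers"


# Usage sketch (requires the websockets package and the URL built above):
#   kwargs = {header_kwarg(websockets.connect): {"Authorization": "YOUR-API-KEY"}}
#   async with websockets.connect(URL, **kwargs) as ws:
#       ...
```

Detecting the keyword this way keeps the quickstart working across library versions without hard-coding a version boundary.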