Prompting

Use prompt engineering to control transcription style and improve accuracy for domain-specific terminology. This guide documents best practices for crafting effective prompts for Universal-3-Pro speech transcription.

How prompting works

Universal-3-Pro is a Speech-augmented Large Language Model (SpeechLLM): a multi-modal LLM with an audio encoder and an LLM decoder, designed to understand and process speech, audio, and text inputs in the same workflow.

SpeechLLM prompting works more like selecting modes and knobs than open-ended instruction following. The model is trained primarily to transcribe, then fine-tuned to respond to common transcription instructions for style, speakers, and speech events.
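To show where a prompt fits in practice, here is a minimal sketch that builds a transcription request body. The `audio_url`, `speech_model`, and `prompt` field names and the model identifier are illustrative assumptions, not a confirmed API contract.

```python
# Sketch: building a transcription request payload that carries a prompt.
# The field names ("audio_url", "speech_model", "prompt") are hypothetical
# placeholders for illustration, not a confirmed API contract.

def build_request(audio_url: str, prompt: str) -> dict:
    """Assemble a request body pairing the audio with a transcription prompt."""
    return {
        "audio_url": audio_url,
        "speech_model": "universal-3-pro",  # hypothetical model identifier
        "prompt": prompt,                   # steers style, speakers, speech events
    }

payload = build_request(
    "https://example.com/call.mp3",
    "Transcribe verbatim. Include all disfluencies (um, uh, like, you know).",
)
print(payload["prompt"])
```

The prompt travels with the request like any other configuration field; the sections below cover what to put in it.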

What prompts can do

Capability | Description | Reliability
Verbatim transcription and disfluencies | Include um, uh, false starts, repetitions, stutters | High
Output style and formatting | Control punctuation, capitalization, number formatting | High
Context-aware clues | Help with jargon, names, and domain expectations | Medium
Entity accuracy and spelling | Improve accuracy for proper nouns, brands, technical terms | Medium
Speaker attribution | Mark speaker turns and add labels | High
Audio event tags | Mark laughter, music, applause, background sounds | Medium
Code-switching and multilingual | Handle multilingual audio in the same transcript | Medium
Numbers and measurements | Control how numbers, percentages, and measurements are formatted | Medium
Difficult audio handling | Guidance for unclear audio, overlapping speech, interruptions | Medium

What prompts cannot reliably do

Limitation | Why
Invent correct spellings | Unknown rare words need hotwords or context injection
Resolve ambiguous audio | The model transcribes what it hears, not what it infers
Perfect proper nouns | Rare names and entities may need a post-processing LLM correction step
Code-switch and translate | The model either preserves the spoken language or translates, not both

For consistent proper-noun accuracy or domain vocabulary, you’ll usually need to inject context into the prompt, use a keyterms prompt to highlight known keyterms, or add a post-processing LLM correction step.
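The post-processing correction step can start as simply as a deterministic find-and-replace pass over misspellings you have already observed, before (or instead of) a full LLM call. A minimal sketch, assuming you maintain your own mapping of known errors:

```python
import re

# Sketch: a deterministic post-processing pass that fixes known misrecognitions
# of domain terms. A real pipeline might hand harder cases to an LLM; this
# mapping is an illustrative example, not a shipped feature.
CORRECTIONS = {
    "omeprizole": "omeprazole",
    "metforman": "metformin",
}

def correct_entities(transcript: str) -> str:
    """Replace known misrecognitions with their correct spellings (case-insensitive)."""
    for wrong, right in CORRECTIONS.items():
        transcript = re.sub(
            rf"\b{re.escape(wrong)}\b", right, transcript, flags=re.IGNORECASE
        )
    return transcript

print(correct_entities("Patient takes omeprizole daily."))
# -> Patient takes omeprazole daily.
```

Keeping the mapping outside the prompt also avoids the hallucination risk of listing specific errors in the prompt itself (see the caution under Entity accuracy below).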

Example prompts and behavior

Sample prompt 1: Simple, readable transcription

Transcribe this audio

Characteristics of prompt output:

  • With minimal instruction, the model falls back on its training to produce a human-readable transcript
  • Audio is transcribed as a human would read it, omitting disfluencies, false starts, hesitations, and other speech patterns
  • No extra weight is given to contextual accuracy of keyterms or in-domain data (e.g., medical), resulting in strong overall WER but a higher error rate on rare words

Useful for:

  • General transcription
  • Readability and sentence structure
  • Balanced WER vs contextual accuracy

Sample prompt 2: Enhanced speech patterns and entity accuracy

Transcribe accurately with attention to proper nouns, technical terms,
and natural speech patterns.

Characteristics of prompt output:

  • With more instruction, the model attempts to add contextual correctness to rare words and entities
  • Audio is transcribed with more natural speech patterns, including basic disfluencies, false starts, hesitations, and other speech patterns
  • Extra weight is given to contextual accuracy of keyterms or in-domain data (e.g., medical), resulting in a small WER degradation in exchange for a lower error rate on rare words

Useful for:

  • In-domain transcription where you know the general context (e.g., medical, legal, finance, technical)
  • Basic capture of speech patterns
  • Higher WER in exchange for contextual accuracy

Sample prompt 3: Maximum speech patterns, context, and multilingual accuracy

Transcribe this audio with beautiful punctuation and formatting.
Include spoken filler words, hesitations, plus repetitions and false starts when clearly spoken.
Use standard spelling and the most contextually correct spelling of all words and names,
brands, drug names, medical terms, person names, and all proper nouns.
Transcribe in the original language mix (code-switching), preserving the words in the language they are spoken.

Characteristics of prompt output:

  • With many instructions, the model attempts to be as verbatim as possible across several dimensions: punctuation, formatting, speech patterns, contextual correctness, entity accuracy, and code-switching/multilingual transcription
  • Audio is transcribed with as many natural speech patterns as possible, including all potential disfluencies, false starts, hesitations, and other speech patterns
  • Because the model is searching for all audio patterns, prompts like this are more likely to also pick up crosstalk or background noise
  • Extra weight is given to contextual accuracy, covering not just rare words and entities but also mixed languages in the audio

Useful for:

  • Maximum capture of speech patterns; generates the most transcribed words
  • Higher WER in exchange for contextual and multilingual accuracy
  • Varied audio where you expect many of these patterns to appear

Prompt capabilities

Each capability below acts as a “knob” you can turn. Combine 3-6 capabilities maximum for best results.

1. Verbatim transcription and disfluencies

What it does: Preserves natural speech patterns including filler words, false starts, repetitions, and self-corrections.

Reliability: High

Example prompts:

Include spoken filler words like "um," "uh," "you know," "like," plus repetitions
and false starts when clearly spoken.
Preserve all disfluencies exactly as spoken including verbal hesitations,
restarts, and self-corrections (um, uh, I—I mean).
Transcribe verbatim:
- Fillers: yes (um, uh, like, you know)
- Repetitions: yes (I I, the the the)
- Stutters: yes (th-that, b-but)
- False starts: yes (I was— I went)
- Colloquial: yes (gonna, wanna, gotta)

2. Output style and formatting

What it does: Controls punctuation, capitalization, and readability without changing words.

Reliability: High

Example prompts:

Transcribe this audio with beautiful punctuation and formatting.
Use expressive punctuation to reflect emotion and prosody.
Use standard punctuation and sentence breaks for readability.

3. Context-aware clues

What it does: Helps with jargon, names, and domain expectations that are known from the audio file.

Reliability: Medium

Example prompts:

Transcribe this audio. Context: a customer testimonial about contact center software.
Transcribe this audio. Context: a medical consultation discussing medications and symptoms.
Transcribe this audio. Context: a technical lecture about GPUs, CUDA, and inference.

4. Entity accuracy and spelling

What it does: Improves accuracy for proper nouns, brands, technical terms, and domain vocabulary.

Reliability: Medium

Example prompts:

Use standard spelling and the most contextually correct spelling of all words
including names, brands, drug names, medical terms, and proper nouns.
Non-negotiable: Pharmaceutical accuracy required across all medications and drug names
Preserve acronyms and capitalization of company names and legal entities

Caution: Over-instructing the model with specific error examples that occur in a file can cause hallucinations when those examples are encountered. We recommend describing the pattern rather than the specific error (i.e., "Pharmaceutical accuracy required across all medications and drug names" rather than "Pharmaceutical accuracy required (omeprazole over omeprizole, metformin over metforman)").


5. Speaker attribution

What it does: Marks speaker turns and adds identifying labels.

Reliability: High

Example prompts:

Mark speaker turns clearly.
Tag speaker changes with context like name, role, or gender based on speech content.
Label speakers by role when identifiable (Justice Roberts:, Petitioner's Counsel:).

Speaker labels can be tagged with names, roles, genders, and more from the audio file. Simply add the desired category for the labels into your prompt.
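Once the prompt produces labeled turns, downstream code can split the transcript on those labels. A small sketch, assuming the model emits one `Label: utterance` pair per line as in the examples above:

```python
import re

# Sketch: split a transcript with "Speaker: text" lines into structured turns.
# Assumes the prompt produced one "Label: utterance" pair per line.
TURN = re.compile(r"^(?P<speaker>[^:]+):\s*(?P<text>.+)$")

def parse_turns(transcript: str) -> list[tuple[str, str]]:
    """Return (speaker, utterance) pairs for each labeled line."""
    turns = []
    for line in transcript.splitlines():
        m = TURN.match(line.strip())
        if m:
            turns.append((m.group("speaker"), m.group("text")))
    return turns

sample = "CEO: Revenue grew, um, twelve percent.\nAnalyst: Thanks, that's helpful."
print(parse_turns(sample))
```

Label formats can vary with audio quality, so treat the parse as best-effort rather than guaranteed structure.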


6. Audio event tags

What it does: Marks non-speech sounds like music, laughter, applause, and background noise.

Reliability: Medium

Example prompts:

Preserve non-speech audio in tags to indicate when the audio occurred.
Tag sounds: [laughter], [silence], [noise], [cough], [sigh].
Include audio event markers for music, laughter, and applause.

7. Code-switching and multilingual

What it does: Handles audio where speakers switch between languages.

Reliability: Medium

Example prompts:

Transcribe in the original language mix (code-switching), preserving words
in the language they are spoken.
Preserve natural code-switching between English and Spanish. Retain spoken language as-is with mixed language words.

Note: Requires language_detection: true on your request. If a single language code is specified, the model will try to transcribe only that language.
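As a sketch of how that flag might sit alongside the prompt in a request body, the `language_detection` key comes from the note above, while the other field names are illustrative assumptions:

```python
# Sketch: pairing a code-switching prompt with language detection.
# "language_detection" comes from the note above; "audio_url" and "prompt"
# are illustrative field names, not a confirmed API contract.
payload = {
    "audio_url": "https://example.com/bilingual-call.mp3",
    "language_detection": True,  # required for code-switching prompts
    "prompt": (
        "Transcribe in the original language mix (code-switching), "
        "preserving words in the language they are spoken."
    ),
}
print(payload["language_detection"])
```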


8. Numbers and measurements

What it does: Controls how numbers, percentages, and measurements are formatted.

Reliability: Medium

Example prompts:

Convert spoken numbers to digits.
Use digits for numbers, percentages, and measurements.
Format financial figures with standard notation and format numbers for maximum readability.

9. Difficult audio handling

What it does: Provides guidance for unclear audio, overlapping speech, and interruptions.

Reliability: Medium (results vary)

Example prompts:

If unintelligible, write (unclear).
Mark inaudible segments. Preserve overlapping speech and crosstalk.
Include pause markers where the speaker hesitates significantly.

Best practices

What helps

Practice | Impact | Example | Why it helps
Authoritative language | Massive | Mandatory:, Non-negotiable:, Required: | Signals the model to pay extra attention to the desired instruction
3-6 instructions maximum | Massive | Transcribe verbatim. Include all disfluencies. Pay attention to rare words and entities. Preserve natural speech patterns. | Prevents conflicting instructions
Desired output format | High | Pharmaceutical accuracy required across all medications and drug names | Gives the model the domain context and entities to attend to while transcribing
Explicit disfluency examples | High | Include all disfluencies (um, uh, like, you know) | Shows the model the speech patterns and linguistic cues to attend to

What hurts

Anti-pattern | Impact | Example | Why it hurts
Explicit examples of errors from the file | Potential hallucinations | Pharmaceutical accuracy required (omeprazole over omeprizole, metformin over metforman) | The model over-eagerly "corrects" exact phrases in the transcript
Negative language | Severe | Don't, Avoid, Never, Not | The model does not process negative instructions well and gets confused
Conflicting instructions | Severe | Include disfluencies. Maximum readability. | The model must choose which instruction to follow, leading to less deterministic results
Short, vague instructions | High | Be accurate, Best transcript ever, Superhero human transcriptionist | The model cannot identify a concrete pattern to attend to and correct
Missing disfluency instructions | Medium | Transcribe verbatim, Transcribe this audio | Not a failure, but by default the model will not be expressive with disfluencies unless instructed

Prompt generator

This prompt generator helps you create a starting prompt based on your selected transcription style. Paste a sample of your transcript and select your preferred style to get a customized prompt recommendation.

Click a button to open your preferred AI assistant with your transcript sample and instructions pre-loaded. The AI will generate an optimized prompt based on our prompt engineering best practices.

Prompt library

Browse community-submitted prompts, vote on the ones that work best, and share your own.


Domain-specific sample prompts

Legal transcription

Best for: Court proceedings, depositions, legal hearings

Mandatory: Transcribe legal proceedings with precise terminology intact.
Required: Preserve all disfluencies including verbal pauses, false starts, and
self-corrections (uh, um, I—I mean over cleaned speech).
Non-negotiable: Distinguish between speakers through clear attribution
(Justice Roberts: over generic labels).
Capture Latin legal phrases exactly as spoken (certiorari, habeas corpus,
amicus curiae).

Why it works: Combines authoritative language (Mandatory, Required, Non-negotiable), explicit disfluency examples with contrast format, speaker attribution guidance, and domain terminology.


Medical transcription

Best for: Clinical documentation, medical dictation, patient-provider conversations

Mandatory: Preserve all clinical terminology exactly as spoken including drug names, dosages, and diagnostic terms.
Required: Capture every hesitation, false start, and filler word (um, uh, ah) as spoken.
Label physician and patient speech clearly when identifiable.

Why it works: Combines authoritative language (Mandatory, Required) with explicit disfluency examples, while ensuring clinical terminology accuracy and clear speaker attribution for medical documentation.


Financial/Earnings calls

Best for: Quarterly earnings calls, investor presentations, financial meetings

Mandatory: Corporate earnings call transcription with precise financial
terminology.
Required: Preserve all speaker disfluencies including hesitations, false starts,
and filler words (um, uh, you know) exactly as spoken.
Non-negotiable: Financial term accuracy (EBITDA over ebitda, year-over-year
over y-o-y, basis points over bps).
Format numerical data with standard notation. Label executive speakers by role
when identifiable (CEO:, CFO:, Analyst:).

Why it works: Balances financial terminology precision with verbatim capture of executive speech patterns important for tone analysis.


Software/Technical meetings

Best for: Engineering standups, code reviews, technical discussions

Mandatory: Technical meeting transcription with multiple participants.
Required: Preserve all verbal disfluencies including hesitations, false starts,
and thinking sounds (um, uh, hmm, so) exactly as spoken.
Non-negotiable: Technical terminology accuracy (Kubernetes over kubernetties,
PostgreSQL over postgres, API over A.P.I.).
Mark speaker transitions explicitly. Capture self-corrections and restarts
(I was— I went, th-that).

Why it works: Preserves natural developer speech patterns while ensuring technical terms are spelled correctly.


Code-switching (Bilingual)

Best for: Multilingual conversations, Spanglish, language mixing

Mandatory: Transcribe verbatim, preserving natural code-switching between
English and Spanish.
Required: Retain spoken language as-is without translation
(correct "I was hablando con mi manager").
Non-negotiable: Preserve fillers, repetitions, and false starts across both
languages (eh, o sea— I mean, you know).
Resolve sound-alike errors using bilingual context (pero over perro,
meeting over mitin).

Why it works: Explicitly instructs preservation over translation, handles cross-language disfluencies, and addresses common bilingual transcription errors.


Customer support call

Best for: Contact center calls, customer service interactions, agent-customer conversations

Customer support call between agent and customer.
Mandatory: Transcribe any overlapping speech across channels including crosstalk.
Required: Pay attention to proper nouns like names, balance amounts, and bank name
being correct.
Non-negotiable: Preserve all disfluencies exactly as spoken including verbal
hesitations, restarts, and self-corrections (um, uh, I—I mean).

Why it works: Combines multichannel awareness for overlapping speech with entity accuracy for critical customer data (names, amounts, institutions), while preserving verbatim speech patterns essential for quality assurance and compliance review.


How to build your prompt

Step 1: Start with your base need

Choose your primary transcription goal:

Goal | Base instruction
Verbatim/disfluencies | Include spoken filler words, hesitations, plus repetitions and false starts when clearly spoken.
Output style/formatting | Transcribe this audio with beautiful punctuation and formatting.
Context-aware clues | Transcribe this audio. Context: [describe the audio content and domain].
Entity accuracy | Use standard spelling and contextually correct spelling of all proper nouns.
Speaker attribution | Mark speaker turns clearly. Tag speaker changes with context like name, role, or gender.
Audio event tags | Preserve non-speech audio in tags to indicate when the audio occurred.
Code-switching | Transcribe in the original language mix, preserving words in the language spoken.
Numbers/measurements | Use digits for numbers, percentages, and measurements.
Difficult audio | If unintelligible, write (unclear). Mark inaudible segments.

Step 2: Add authoritative language

Prefix each instruction with:

  • Non-negotiable:
  • Mandatory:
  • Required:
  • Strict requirement:

Step 3: Add instructions one by one

We recommend layering on instructions one at a time to see each one's impact on the transcription output. Because conflicting instructions can degrade output, adding instructions incrementally lets you test and evaluate how each one improves or degrades your transcript.
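The layering workflow can be sketched as a loop that grows the prompt one instruction at a time and records each intermediate output for comparison. `transcribe` here is a hypothetical stand-in for your real transcription call:

```python
# Sketch: layer instructions one at a time and keep each intermediate result
# so you can see which instruction helped or hurt. `transcribe` is a
# hypothetical placeholder for your real transcription call.
def transcribe(prompt: str) -> str:
    return f"<transcript for prompt: {prompt!r}>"  # stub for illustration

instructions = [
    "Transcribe verbatim.",
    "Include all disfluencies (um, uh, like, you know).",
    "Pay attention to rare words and entities.",
]

results = []
prompt = ""
for instruction in instructions:
    prompt = (prompt + " " + instruction).strip()
    results.append((prompt, transcribe(prompt)))

for cumulative_prompt, output in results:
    print(cumulative_prompt)
```

Comparing consecutive entries in `results` shows exactly which added instruction changed the output.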

Step 4: Iterate and test

  1. Identify target terms - What words/phrases are being transcribed incorrectly?
  2. Find the error pattern - Vowel substitution? Sound-alike? Phonetic spelling?
  3. Choose example terms - Pick 2-3 common terms with the SAME error pattern
  4. Test and verify - Listen to the audio to confirm correctness
  5. Measure success rate - Test variations on sample files

Final prompt structure

[Authoritative language]: [Specific instruction] + [Explicit examples]
[Authoritative language]: [Specific instruction] + [Explicit examples]
[Authoritative language]: [Specific instruction] + [Explicit examples]
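That structure can also be assembled programmatically. A minimal sketch; the instruction text below is drawn from the examples in this guide, and the helper itself is illustrative:

```python
# Sketch: assemble a prompt following the
# "[Authoritative language]: [Specific instruction] + [Explicit examples]" structure.
def build_prompt(lines: list[tuple[str, str]]) -> str:
    """Each entry is (authoritative_prefix, instruction_with_examples)."""
    return "\n".join(f"{prefix}: {instruction}" for prefix, instruction in lines)

prompt = build_prompt([
    ("Mandatory", "Transcribe verbatim with precise terminology intact."),
    ("Required", "Include all disfluencies (um, uh, like, you know)."),
    ("Non-negotiable", "Use contextually correct spelling of all proper nouns."),
])
print(prompt)
```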

Need help?

Prompt engineering is a new and evolving practice with SpeechLLM models. If you need help generating a prompt, our engineering team is happy to support you. Feel free to open a live chat or send an email via the widget in the bottom right-hand corner (more contact info here).