Prompting
Use prompt engineering to control transcription style and improve accuracy for domain-specific terminology. This guide documents best practices for crafting effective prompts for Universal-3-Pro speech transcription.
How prompting works
Universal-3-Pro is a Speech-augmented Large Language Model (SpeechLLM). Its architecture pairs an audio encoder with an LLM decoder, allowing the model to understand and process speech, audio, and text inputs in the same workflow.
SpeechLLM prompting works more like selecting modes and knobs than open-ended instruction following. The model is trained primarily to transcribe, then fine-tuned to respond to common transcription instructions for style, speakers, and speech events.
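In practice, the prompt travels alongside the audio in the transcription request. A minimal sketch of that payload, assuming illustrative field names (the exact request shape should be checked against the API reference):

```python
# Minimal sketch of a transcription request payload.
# NOTE: the field names ("audio_url", "speech_model", "prompt") are
# illustrative assumptions, not the official API reference.
def build_request(audio_url: str, prompt: str) -> dict:
    """Attach a style/behavior prompt to a transcription request."""
    return {
        "audio_url": audio_url,
        "speech_model": "universal-3-pro",  # hypothetical model identifier
        "prompt": prompt,
    }

request = build_request(
    "https://example.com/meeting.mp3",
    "Transcribe with clean punctuation and readable sentence structure.",
)
```

The prompt string is where the "modes and knobs" described below are set; the rest of the request stays the same across styles.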
What prompts can do
What prompts cannot reliably do
For consistent proper-noun accuracy or domain vocabulary, you'll usually need to inject context into the prompt, use a keyterms prompt to highlight known key terms, or add a post-processing LLM correction step.
Example prompts and behavior
Sample prompt 1: Simple, readable transcription
Characteristics of prompt output:
- Without much instruction, the model follows its training data and produces a human-readable transcript
- Audio is transcribed as a human would read it, omitting disfluencies, false starts, hesitations, and other speech patterns
- No extra weight is given to contextual accuracy of key terms or in-domain data (e.g. medical), resulting in strong overall WER but a higher error rate on rare words
Useful for:
- General transcription
- Readability and sentence structure
- A balance between WER and contextual accuracy
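An illustrative prompt in this simple style (assumed wording, not an official sample):

```text
Transcribe the audio with clean punctuation, capitalization, and
readable sentence structure.
```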
Sample prompt 2: Enhanced speech patterns and entity accuracy
Characteristics of prompt output:
- With more instruction, the model attempts to add contextual correctness for rare words and entities
- Audio is transcribed with more natural speech patterns, including basic disfluencies, false starts, hesitations, and other speech patterns
- Extra weight is given to contextual accuracy of key terms or in-domain data (e.g. medical), resulting in a small WER degradation in exchange for a lower error rate on rare words
Useful for:
- In-domain transcription where you know the general context (e.g. medical, legal, finance, technical)
- Basic capture of speech patterns
- Accepting a slightly higher WER in exchange for contextual accuracy
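An illustrative prompt in this enhanced style (assumed wording, not an official sample):

```text
Transcribe the audio, preserving basic disfluencies and false starts.
Required: medical terminology must be transcribed accurately, including
drug names and clinical abbreviations.
```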
Sample prompt 3: Maximum speech patterns, context, and multilingual accuracy
Characteristics of prompt output:
- With many instructions, the model attempts to be as verbatim as possible across a number of domains: punctuation, formatting, speech patterns, contextual correctness, entity accuracy, and code-switching/multilingual transcription
- Audio is transcribed with as many natural speech patterns as possible, including all potential disfluencies, false starts, hesitations, and other speech patterns
- Because the model is listening for every audio pattern, prompts like this are more likely to pick up cross-talk and background noise as well
- Extra weight is given to contextual accuracy across not just rare words and entities but also mixed languages in the audio
Useful for:
- Maximum capture of speech patterns; generates the most transcribed words
- Accepting a higher WER in exchange for contextual and multilingual accuracy
- Varied audio where you expect these different patterns to be encountered
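An illustrative prompt in this maximal style (assumed wording, not an official sample):

```text
Mandatory: transcribe verbatim, preserving all disfluencies, false
starts, hesitations, and repetitions. Required: maintain accuracy on
rare words, named entities, and technical terms. Non-negotiable: if
speakers switch languages, transcribe each language as spoken; do not
translate.
```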
Prompt capabilities
Each capability below acts as a “knob” you can turn. For best results, combine no more than 3-6 capabilities.
1. Verbatim transcription and disfluencies
What it does: Preserves natural speech patterns including filler words, false starts, repetitions, and self-corrections.
Reliability: High
Example prompts:
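For example, an instruction along these lines (illustrative, not an official example):

```text
Non-negotiable: preserve all filler words (um, uh), false starts,
repetitions, and self-corrections exactly as spoken.
```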
2. Output style and formatting
What it does: Controls punctuation, capitalization, and readability without changing words.
Reliability: High
Example prompts:
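For example (illustrative, not an official example):

```text
Required: use standard punctuation and capitalization. Break the
transcript into readable sentences without changing any words.
```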
3. Context-aware clues
What it does: Helps with jargon, names, and domain expectations you already know apply to the audio file.
Reliability: Medium
Example prompts:
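For example, a context clue along these lines (illustrative, not an official example):

```text
This is a quarterly earnings call for a software company; expect
financial and SaaS terminology throughout.
```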
4. Entity accuracy and spelling
What it does: Improves accuracy for proper nouns, brands, technical terms, and domain vocabulary.
Reliability: Medium
Example prompts:
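For example, a pattern-level instruction like this (illustrative, not an official example):

```text
Strict requirement: pharmaceutical accuracy across all medication and
drug names.
```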
Caution: Over-instructing the model with specific examples that occur in a file can cause hallucinations when those examples are encountered. We recommend describing the pattern rather than the specific error (e.g. “Pharmaceutical accuracy required across all medications and drug names” rather than “Pharmaceutical accuracy required (omeprazole over omeprizole, metformin over metforman)”).
5. Speaker attribution
What it does: Marks speaker turns and adds identifying labels.
Reliability: High
Example prompts:
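For example (illustrative, not an official example):

```text
Mandatory: label each speaker turn. Use names when speakers introduce
themselves; otherwise use roles such as Agent and Customer.
```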
Speaker labels can be tagged with names, roles, genders, and more from the audio file. Simply add the desired category for the labels into your prompt.
6. Audio event tags
What it does: Marks non-speech sounds like music, laughter, applause, and background noise.
Reliability: Medium
Example prompts:
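For example (illustrative, not an official example):

```text
Mark non-speech audio events inline with bracketed tags, e.g. [music],
[laughter], [applause], [background noise].
```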
7. Code-switching and multilingual
What it does: Handles audio where speakers switch between languages.
Reliability: Medium
Example prompts:
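For example (illustrative, not an official example):

```text
Required: transcribe each language exactly as spoken. Do not translate
between languages, even mid-sentence.
```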
Note: Requires language_detection: true on your request. If a single language code is specified, the model will try to transcribe only that language.
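A minimal sketch of such a request; `language_detection` is taken from the note above, while the other field names are illustrative assumptions:

```python
# Sketch of a request enabling code-switching support.
# NOTE: "audio_url", "speech_model", and "prompt" field names are
# illustrative assumptions; language_detection is from the note above.
request = {
    "audio_url": "https://example.com/bilingual-call.mp3",
    "speech_model": "universal-3-pro",  # hypothetical identifier
    "language_detection": True,         # required for multilingual output
    "prompt": "Required: transcribe each language as spoken; do not translate.",
}
```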
8. Numbers and measurements
What it does: Controls how numbers, percentages, and measurements are formatted.
Reliability: Medium
Example prompts:
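For example (illustrative, not an official example):

```text
Required: write all numbers as numerals. Format percentages with the %
symbol and keep units attached to their values, e.g. 5 mg, 20%.
```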
9. Difficult audio handling
What it does: Provides guidance for unclear audio, overlapping speech, and interruptions.
Reliability: Medium (results vary with audio quality)
Example prompts:
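For example (illustrative, not an official example):

```text
When speech is unclear, transcribe your best interpretation rather than
omitting it. If speakers overlap, transcribe each speaker's words on the
turn where they begin.
```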
Best practices
What helps
What hurts
Domain-specific sample prompts
Legal transcription
Best for: Court proceedings, depositions, legal hearings
Why it works: Combines authoritative language (Mandatory, Required, Non-negotiable), explicit disfluency examples with contrast format, speaker attribution guidance, and domain terminology.
Medical transcription
Best for: Clinical documentation, medical dictation, patient-provider conversations
Why it works: Combines authoritative language (Mandatory, Required) with explicit disfluency examples, while ensuring clinical terminology accuracy and clear speaker attribution for medical documentation.
Financial/Earnings calls
Best for: Quarterly earnings calls, investor presentations, financial meetings
Why it works: Balances financial terminology precision with verbatim capture of executive speech patterns important for tone analysis.
Software/Technical meetings
Best for: Engineering standups, code reviews, technical discussions
Why it works: Preserves natural developer speech patterns while ensuring technical terms are spelled correctly.
Code-switching (Bilingual)
Best for: Multilingual conversations, Spanglish, language mixing
Why it works: Explicitly instructs preservation over translation, handles cross-language disfluencies, and addresses common bilingual transcription errors.
Customer support call
Best for: Contact center calls, customer service interactions, agent-customer conversations
Why it works: Combines multichannel awareness for overlapping speech with entity accuracy for critical customer data (names, amounts, institutions), while preserving verbatim speech patterns essential for quality assurance and compliance review.
How to build your prompt
Step 1: Start with your base need
Choose your primary transcription goal:
Step 2: Add authoritative language
Prefix each instruction with:
- Non-negotiable:
- Mandatory:
- Required:
- Strict requirement:
Step 3: Add instructions one by one
We recommend layering on instructions one at a time so you can see the impact of each on the transcription output. Conflicting instructions can cause outputs to degrade, so adding instructions incrementally lets you test and evaluate how each one improves or degrades your transcript.
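The layering workflow can be sketched as follows; the evaluation step is a placeholder for your own accuracy check, and the instruction strings are illustrative:

```python
# Sketch: layer instructions one at a time and evaluate each addition.
# The transcribe-and-evaluate step is a placeholder for your own
# WER/accuracy check against a reference transcript.
instructions = [
    "Required: use standard punctuation and capitalization.",
    "Mandatory: preserve filler words, false starts, and repetitions.",
    "Strict requirement: medical terminology must be spelled correctly.",
]

prompt = ""
for instruction in instructions:
    candidate = (prompt + " " + instruction).strip()
    # transcribe with `candidate` and compare metrics here; keep it only
    # if the new instruction improves (or does not degrade) the output
    prompt = candidate
```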
Step 4: Iterate and test
- Identify target terms - What words/phrases are being transcribed incorrectly?
- Find the error pattern - Vowel substitution? Sound-alike? Phonetic spelling?
- Choose example terms - Pick 2-3 common terms with the SAME error pattern
- Test and verify - Listen to the audio to confirm correctness
- Measure success rate - Test variations on sample files
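To measure success rates across prompt variations, a simple word error rate helps; here is a minimal sketch using word-level edit distance (for production use, a maintained library is preferable):

```python
# Sketch: word-level error rate via edit distance, for comparing prompt
# variations against a trusted reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # classic dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Run the same audio with each prompt variation and compare `wer(reference, output)` across runs to see which instruction set actually helps.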
Final prompt structure
Need help?
Prompt engineering with SpeechLLM models is a new and evolving practice. If you need help generating a prompt, our Engineering team is happy to support you. Feel free to open a new live chat or send an email from the widget in the bottom right-hand corner (more contact info here).