Changelog

Follow along to see weekly accuracy and product improvements.

Introducing Universal-2

Last week we released Universal-2, our latest Speech-to-Text model. Universal-2 builds upon our previous model Universal-1 to make significant improvements in "last mile" challenges critical to real-world use cases - proper nouns, formatting, and alphanumerics.

Comparison of error rates for Universal-2 vs Universal-1 across overall performance (Standard ASR) and four last-mile areas, each measured by the appropriate metric

Universal-2 is now the default model for English files sent to our `v2/transcript` endpoint for async processing. You can read more about Universal-2 in our announcement blog or research blog, or you can try it out now on our Playground.
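
For reference, here's what a minimal async request looks like with our Python SDK; no model parameter is needed, since English files now default to Universal-2 (the API key and audio URL below are placeholders):

import assemblyai as aai

aai.settings.api_key = "YOUR-API-KEY"

# English files sent for async processing now default to Universal-2
transcript = aai.Transcriber().transcribe("https://example.org/audio.mp3")

print(transcript.text)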

Claude Instant 1.2 removed from LeMUR

The following models have been removed from LeMUR: anthropic/claude-instant-1-2 and basic (a legacy alias equivalent to anthropic/claude-instant-1-2). Calls to either model will now return a 400 validation error.

These models were removed due to Anthropic sunsetting legacy models in favor of newer models which are more performant, faster, and cheaper. We recommend users who were using the removed models switch to Claude 3 Haiku (anthropic/claude-3-haiku).
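
Switching is a one-line change to the final_model value. Here's a minimal sketch with our Python SDK, assuming the claude3_haiku enum value exposed by recent SDK versions (the API key and audio URL are placeholders):

import assemblyai as aai

aai.settings.api_key = "YOUR-API-KEY"

transcript = aai.Transcriber().transcribe("https://example.org/call.mp3")

# previously: final_model=aai.LemurModel.basic (now removed)
result = transcript.lemur.task(
    "Summarize this call in two sentences.",
    final_model=aai.LemurModel.claude3_haiku,
)

print(result.response)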

French performance patch; bugfix

We recently observed a degradation in accuracy when transcribing French files through our API. We have since pushed a bugfix to restore performance to prior levels.

We've improved error messaging for greater clarity for both our file download service and Invalid LLM response errors from LeMUR.

We've released a fix to ensure that rate limit headers are always returned from LeMUR requests, not just for 200 and 400 responses.

New and improved - AssemblyAI Q3 recap

Check out our quarterly wrap-up for a summary of the new features and integrations we launched this quarter, as well as improvements we made to existing models and functionality.

Claude 3 in LeMUR

We added support for Claude 3 in LeMUR, allowing users to prompt the following LLMs in relation to their transcripts:

  • Claude 3.5 Sonnet
  • Claude 3 Opus
  • Claude 3 Sonnet
  • Claude 3 Haiku

Check out our related blog post to learn more.

Automatic Language Detection

We made significant improvements to our Automatic Language Detection (ALD) Model, supporting 10 new languages for a total of 17, with best-in-class accuracy in 15 of those 17 languages. We also added a customizable confidence threshold for ALD.

Learn more about these improvements in our announcement post.

We released the AssemblyAI Ruby SDK and the AssemblyAI C# SDK, allowing Ruby and C# developers to easily add SpeechAI to their applications with AssemblyAI. The SDKs let developers use our asynchronous Speech-to-Text and Audio Intelligence models, as well as LeMUR through a simple interface.

Learn more in our Ruby SDK announcement post and our C# SDK announcement post.

This quarter, we shipped two new integrations:

Activepieces 🤝 AssemblyAI

The AssemblyAI integration for Activepieces allows no-code and low-code builders to incorporate AssemblyAI's powerful SpeechAI in Activepieces automations. Learn how to use AssemblyAI in Activepieces in our Docs.

Langflow 🤝 AssemblyAI

We've released the AssemblyAI integration for Langflow, allowing users to build with AssemblyAI in Langflow - a popular open-source, low-code app builder for RAG and multi-agent AI applications. Check out the Langflow docs to learn how to use AssemblyAI in Langflow.

Assembly Required

This quarter we launched Assembly Required - a series of candid conversations with AI founders sharing insights, learnings, and the highs and lows of building a company.

Click here to check out the first conversation in the series, between Edo Liberty, founder and CEO of Pinecone, and Dylan Fox, founder and CEO of AssemblyAI.

We released the AssemblyAI API Postman Collection, which provides a convenient way for Postman users to try our API, featuring endpoints for Speech-to-Text, Audio Intelligence, LeMUR, and Streaming for you to use. Similar to our API reference, the Postman collection also provides example responses so you can quickly browse endpoint results.

Free offer improvements

This quarter, we improved our free offer with:

  • $50 in free credits upon signing up
  • Access to usage dashboard, billing rates, and concurrency limit information
  • Transfer of unused free credits to account balance upon upgrading to Pay as you go

We released 36 new blogs this quarter, from tutorials to projects to technical deep dives. Here are some of the blogs we released this quarter:

  1. Build an AI-powered video conferencing app with Next.js and Stream
  2. Decoding Strategies: How LLMs Choose The Next Word
  3. Florence-2: How it works and how to use it
  4. Speaker diarization vs speaker recognition - what's the difference?
  5. Analyze Audio from Zoom Calls with AssemblyAI and Node.js

We also released 10 new YouTube videos, demonstrating how to build SpeechAI applications and more, including:

  1. Best AI Tools and Helpers Apps for Software Developers in 2024
  2. Build a Chatbot with Claude 3.5 Sonnet and Audio Data
  3. How to build an AI Voice Translator
  4. Real-Time Medical Transcription Analysis Using AI - Python Tutorial

We also made improvements to a range of other features, including:

  1. Timestamp accuracy, with 86% of timestamps accurate to within 0.1s and 96% accurate to within 0.2s
  2. Enhancements to the AssemblyAI app for Zapier, supporting 5 new events. Check out our tutorial on generating subtitles with Zapier to see it in action.
  3. Various upgrades to our API, including improved error messaging and scaling improvements that reduce p90 latency
  4. Improvements to billing, now alerting users upon auto-refill failures
  5. Speaker Diarization improvements, especially robustness in distinguishing speakers with similar voices
  6. A range of new and improved Docs

And more!

We can't wait for you to see what we have in store to close out the year 🚀

Claude 1 & 2 sunset

Recently, Anthropic announced that they will be deprecating legacy LLM models that are usable via LeMUR. We will therefore be sunsetting these models in advance of Anthropic's end-of-life for them:

  • Claude Instant 1.2 ("LeMUR Basic") will be sunset on October 28th, 2024
  • Claude 2.0 and 2.1 ("LeMUR Default") will be sunset on February 6th, 2025

LeMUR requests that use either of the above models after their sunset dates will be rejected with an API error. Users who have recently used these models have been alerted via email and asked to select an alternative model to use via LeMUR.

We have a number of newer models to choose from, which are not only more performant but also ~50% more cost-effective than the legacy models. 

  • If you are using Claude Instant 1.2 ("LeMUR Basic"), we recommend switching to Claude 3 Haiku.
  • If you are using Claude 2.0 ("LeMUR Default") or Claude 2.1, we recommend switching to Claude 3.5 Sonnet.

Check out our docs to learn how to select which model you use via LeMUR.

Langflow 🤝 AssemblyAI

We've released the AssemblyAI integration for Langflow, allowing low-code builders to incorporate Speech AI into their workflows.

Langflow is a popular open-source, low-code app builder for RAG and multi-agent AI applications. Using Langflow, you can easily connect different components via drag and drop and build your AI flow. Check out the Langflow docs for AssemblyAI's integration here to learn more.

Speaker Labels bugfix

We've fixed an edge-case issue that would cause requests using Speaker Labels to fail for some files.

Activepieces 🤝 AssemblyAI

We've released the AssemblyAI integration for Activepieces, allowing no-code and low-code builders to incorporate Speech AI into their workflows.

Activepieces is an open-source, no-code automation platform that allows users to build workflows that connect various applications. Now, you can use AssemblyAI's powerful models to transcribe speech, analyze audio, and build generative features in Activepieces.

Read more about how you can use AssemblyAI in Activepieces in our Docs.

Language confidence threshold bugfix

We've fixed an edge-case which would sometimes occur due to language fallback when Automatic Language Detection (ALD) was used in conjunction with language_confidence_threshold, resulting in executed transcriptions that violated the user-set language_confidence_threshold. Now such transcriptions will not execute, and instead return an error to the user.

Automatic Language Detection improvements

We've made improvements to our Automatic Language Detection (ALD) model, yielding increased accuracy, expanded language support, and customizable confidence thresholds.

In particular, we have added support for 10 new languages, including Chinese, Finnish, and Hindi, to support a total of 17 languages in our Best tier. Additionally, we've achieved best-in-class accuracy in 15 of those 17 languages when benchmarked against four leading providers.

Finally, we've added a customizable confidence threshold for ALD, allowing you to set a minimum confidence threshold for the detected language and be alerted if this threshold is not satisfied.

Read more about these recent improvements in our announcement post.
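
For example, you can enable ALD together with a minimum confidence threshold through our Python SDK; the threshold value and audio URL below are illustrative:

import assemblyai as aai

aai.settings.api_key = "YOUR-API-KEY"

config = aai.TranscriptionConfig(
    language_detection=True,
    language_confidence_threshold=0.8,  # require at least 80% confidence in the detected language
)

transcript = aai.Transcriber().transcribe("https://example.org/audio.mp3", config)

if transcript.status == aai.TranscriptStatus.error:
    print(transcript.error)  # threshold not satisfied
else:
    print(transcript.text)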

Free Offer improvements

We've made a series of improvements to our Free Offer:

  1. All new and existing users will get $50 in free credits (equivalent to 135 hours of Best transcription, or 417 hours of Nano transcription)
  2. All unused free credits will be automatically transferred to a user's account balance after upgrade to pay-as-you-go pricing.
  3. Free Offer users will now see a tracker in their dashboard to see how many credits they have remaining
  4. Free Offer users will now have access to the usage dashboard, their billing rates, concurrency limit, and billing alerts

Learn more about our Free Offer on our Pricing page, and then check out our Quickstart in our Docs to get started.

Speaker Diarization improvements

We've made improvements to our Speaker Diarization model, especially robustness in distinguishing between speakers with similar voices.

We've fixed an error in which the last word in a transcript was always attributed to the same speaker as the second-to-last word.

File upload improvements and more

We've made improvements to error handling for file uploads that fail. Now if there is an error, such as a file containing no audio, the following 422 error will be returned:

Upload failed, please try again. If you continue to have issues please reach out to support@assemblyai.com

We've made scaling improvements that reduce p90 latency for some non-English languages when using the Best tier

We've made improvements to notifications for auto-refill failures. Now, users will be alerted more rapidly when their automatic payments are unsuccessful.

New endpoints for LeMUR Claude 3

Last month, we announced support for Claude 3 in LeMUR. Today, we are adding support for two new endpoints - Question & Answer and Summary (in addition to the pre-existing Task endpoint) - for these newest models:

  • Claude 3 Opus
  • Claude 3.5 Sonnet
  • Claude 3 Sonnet
  • Claude 3 Haiku

Here's how you can use Claude 3.5 Sonnet to summarize a virtual meeting with LeMUR:

import assemblyai as aai

aai.settings.api_key = "YOUR-KEY-HERE"

audio_url = "https://storage.googleapis.com/aai-web-samples/meeting.mp4"
transcript = aai.Transcriber().transcribe(audio_url)

result = transcript.lemur.summarize(
    final_model=aai.LemurModel.claude3_5_sonnet,
    context="A GitLab meeting to discuss logistics",
    answer_format="TLDR"
)

print(result.response)

Learn more about these specialized endpoints and how to use them in our Docs.
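
The Question & Answer endpoint follows the same pattern; here's a sketch with our Python SDK (the question and model choice are illustrative):

import assemblyai as aai

aai.settings.api_key = "YOUR-KEY-HERE"

transcript = aai.Transcriber().transcribe("https://storage.googleapis.com/aai-web-samples/meeting.mp4")

qa_results = transcript.lemur.question(
    [aai.LemurQuestion(question="What decisions were made in this meeting?")],
    final_model=aai.LemurModel.claude3_5_sonnet,
)

for qa in qa_results.response:
    print(qa.question, "->", qa.answer)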

Enhanced AssemblyAI app for Zapier

We've launched our Zapier integration v2.0, which makes it easy to use our API in a no-code way. The enhanced app is more flexible, supports more Speech AI features, and integrates more closely into the Zap editor.

The Transcribe event (formerly Get Transcript) now supports all of the options available in our transcript API, making all of our Speech Recognition and Audio Intelligence features available to Zapier users, including asynchronous transcription. In addition, we've added 5 new events to the AssemblyAI app for Zapier:

  • Get Transcript: Retrieve a transcript that you have previously created.
  • Get Transcript Subtitles: Generate SRT or VTT subtitles for the transcript.
  • Get Transcript Paragraphs: Retrieve the transcript segmented into paragraphs.
  • Get Transcript Sentences: Retrieve the transcript segmented into sentences.
  • Get Transcript Redacted Audio Result: Retrieve the result of the PII audio redaction model. The result contains the status and the URL to the redacted audio file.

Read more about how to use the new app in our Docs, or check out our tutorial to see how you can generate subtitles with Zapier and AssemblyAI.

LeMUR browser support

LeMUR can now be used from browsers, either via our JavaScript SDK or fetch.

LeMUR - Claude 3 support

Last week, we released Anthropic's Claude 3 model family into LeMUR, our LLM framework for speech.

  • Claude 3.5 Sonnet
  • Claude 3 Opus
  • Claude 3 Sonnet
  • Claude 3 Haiku

You can now easily apply any of these models to your audio data. Learn more about how to get started in our docs or try out the new models in a no-code way through our playground.

For more information, check out our blog post about the release.

import assemblyai as aai

# Step 1: Transcribe an audio file
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("./common_sports_injuries.mp3")

# Step 2: Define a prompt
prompt = "Provide a brief summary of the transcript."

# Step 3: Choose an LLM to use with LeMUR
result = transcript.lemur.task(
    prompt,
    final_model=aai.LemurModel.claude3_5_sonnet
)

print(result.response)

JavaScript SDK fix

We've fixed an issue which was causing the JavaScript SDK to surface the following error when using the SDK in the browser:

Access to fetch at 'https://api.assemblyai.com/v2/transcript' from origin 'https://exampleurl.com' has been blocked by CORS policy: Request header field assemblyai-agent is not allowed by Access-Control-Allow-Headers in preflight response.

Timestamps improvement; bugfixes

We've made significant improvements to the timestamp accuracy of our Speech-to-Text Best tier for English, Spanish, and German. 96% of timestamps are accurate within 200ms, and 86% of timestamps are now accurate within 100ms.

We've fixed a bug in which confidence scores of transcribed words for the Nano tier would sometimes fall outside of the range [0, 1].

We've fixed a rare issue in which only one channel of a short dual-channel file would be transcribed when disfluencies were also enabled.

Streaming (formerly Real-time) improvements

We've made model improvements that significantly improve the accuracy of timestamps when using our Streaming Speech-to-Text service. Most timestamps are now accurate within 100 ms.

Our Streaming Speech-to-Text service will now return a new error 'Audio too small to be transcoded' (code 4034) when a client submits an audio chunk that is too small to be transcoded (less than 10 ms).

Variable-bitrate video support; bugfix

We've deployed changes which now permit variable-bitrate video files to be submitted to our API.

We've fixed a recent bug in which audio files with a large amount of silence at the beginning of the file would fail to transcribe.

LeMUR improvements

We have added two new keys to the LeMUR response, input_tokens and output_tokens, which can help users track usage.

We've implemented a new fallback system to further boost the reliability of LeMUR.

We have addressed an edge case issue affecting LeMUR and certain XML tags. In particular, when LeMUR responds with a <question> XML tag, it will now always close it with a </question> tag rather than erroneous tags which would sometimes be returned (e.g. </answer>).

PII Redaction and Entity Detection improvements

We've improved our PII Text Redaction and Entity Detection models, yielding more accurate detection and removal of PII and other entities from transcripts.

We've added 16 new entities, including vehicle_id and account_number, and updated 4 of our existing entities. Users may need to update to the latest version of our SDKs to use these new entities.

We've added PII Text Redaction and Entity Detection support in 4 new languages:

  • Chinese
  • Dutch
  • Japanese
  • Georgian

PII Text Redaction and Entity Detection now support a total of 47 languages between our Best and Nano tiers.
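
As a refresher, both models are enabled through the transcription config; here's a short Python SDK sketch (the audio URL and policy list are illustrative; see the docs for the full set of entities, including newly added ones such as vehicle_id and account_number):

import assemblyai as aai

aai.settings.api_key = "YOUR-API-KEY"

config = aai.TranscriptionConfig(
    entity_detection=True,
    redact_pii=True,
    redact_pii_policies=[
        aai.PIIRedactionPolicy.person_name,
        aai.PIIRedactionPolicy.phone_number,
    ],
)

transcript = aai.Transcriber().transcribe("https://example.org/call.mp3", config)

print(transcript.text)  # PII is replaced in the transcript text
for entity in transcript.entities:
    print(entity.entity_type, entity.text)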

Usage and spend alerts

Users can now set up billing alerts in their user portals. Billing alerts notify you when your monthly spend or account balance reaches a threshold.

To set up a billing alert, go to the billing page of your portal, and click Set up a new alert under the Your alerts widget.

You can then set up an alert by specifying whether to alert on monthly spend or account balance, as well as the specific threshold at which to send an alert.

Universal-1 now available in German

Universal-1, our most powerful and accurate multilingual Speech-to-Text model, is now available in German.

No special action is needed to utilize Universal-1 on German audio - all requests sent to our /v2/transcript endpoint with German audio files will now use Universal-1 by default. Learn more about how to integrate Universal-1 into your apps in our Getting Started guides.
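
For example, a German file can be transcribed by setting the language code in the request; a minimal Python SDK sketch (the API key and audio URL are placeholders):

import assemblyai as aai

aai.settings.api_key = "YOUR-API-KEY"

config = aai.TranscriptionConfig(language_code="de")

transcript = aai.Transcriber().transcribe("https://example.org/interview_de.mp3", config)

print(transcript.text)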

New languages for Speaker Diarization

Speaker Diarization is now available in five additional languages for both the Best and Nano tiers:

  • Chinese
  • Hindi
  • Japanese 
  • Korean 
  • Vietnamese

New API Reference, Timestamps improvements

We've released a new version of the API Reference section of our docs for an improved developer experience. Here's what's new:

  1. New API Reference pages with exhaustive endpoint documentation for transcription, LeMUR, and streaming
  2. cURL examples for every endpoint
  3. Interactive Playground: Test our API endpoints with the interactive playground. It includes a form-builder for generating requests and corresponding code examples in cURL, Python, and TypeScript
  4. Always up to date: The new API Reference is autogenerated based on our Open-Source OpenAPI and AsyncAPI specs

We've made improvements to Universal-1's timestamps for both the Best and Nano tiers, yielding improved timestamp accuracy and a reduced incidence of overlapping timestamps.

We've fixed an issue in which users could receive an `Unable to create transcription. Developers have been alerted` error when using long files with Sentiment Analysis.

New codec support; account deletion support

We've upgraded our transcoding library and now support the following new codecs:

  • Bonk, APAC, Mi-SC4, 100i, VQC, FTR PHM, WBMP, XMD ADPCM, WADY DPCM, CBD2 DPCM
  • HEVC, VP9, AV1 codec in enhanced flv format

Users can now delete their accounts by selecting the Delete account option on the Account page of their AssemblyAI Dashboards.

Users will now receive a 400 error when using an invalid tier and language code combination, with an error message such as The selected language_code is supported by the following speech_models: best, conformer-2. See https://www.assemblyai.com/docs/concepts/supported-languages.

We've fixed an issue in which nested JSON responses from LeMUR would cause Invalid LLM response, unable to fulfill request. Please try again. errors.

We've fixed a bug in which very long files would sometimes fail to transcribe, leading to timeout errors.

AssemblyAI app for Make.com

Make (formerly Integromat) is a no-code automation platform that makes it easy to build tasks and workflows that synthesize many different services.

We've released the AssemblyAI app for Make that allows Make users to incorporate AssemblyAI into their workflows, or scenarios. In other words, in Make you can now use our AI models to

  1. Transcribe audio data with speech recognition models
  2. Analyze audio data with audio intelligence models
  3. Build generative features on top of audio data with LLMs

For example, in our tutorial on Redacting PII with Make, we demonstrate how to build a Make scenario that automatically creates a redacted audio file and redacted transcription for any audio file uploaded to a Google Drive folder.

GDPR and PCI DSS compliance

AssemblyAI is now officially PCI Compliant. The Payment Card Industry Data Security Standard Requirements and Security Assessment Procedures (PCI DSS) certification is a rigorous assessment that ensures card holder data is being properly and securely handled and stored. You can read more about PCI DSS here.

Additionally, organizations which have signed an NDA can go to our Trust Portal in order to view our PCI attestation of compliance, as well as other security-related documents.

AssemblyAI is also GDPR Compliant. The General Data Protection Regulation (GDPR) is regulation regarding privacy and security for the European Union that applies to businesses that serve customers within the EU. You can read more about GDPR here.

Additionally, organizations which have signed an NDA can go to our Trust Portal in order to view our GDPR assessment on compliance, as well as other security-related documents.

Self-serve invoices; dual-channel improvement

Users of our API can now view and download their self-serve invoices in their dashboards under Billing > Your invoices.

We've made readability improvements to the formatting of utterances for dual-channel transcription by combining sequential utterances from the same channel.

We've added a patch to improve stability in turnaround times for our async transcription and LeMUR services.

We've fixed an issue in which timestamp accuracy would be degraded in certain edge cases when using our async transcription service.

Introducing Universal-1

Last week we released Universal-1, a state-of-the-art multilingual speech recognition model. Universal-1 is trained on 12.5M hours of multilingual audio data, yielding impressive performance across the four key languages for which it was trained - English, Spanish, German, and French.

Word Error Rate across four languages for several providers. Lower is better.

Universal-1 is now the default model for English and Spanish audio files sent to our v2/transcript endpoint for async processing, while German and French will be rolled out in the coming weeks.

You can read more about Universal-1 in our announcement blog or research blog, or you can try it out now on our Playground.

New Streaming STT features

We've added a new message type to our Streaming Speech-to-Text (STT) service. This new message type, SessionInformation, is sent immediately before the final SessionTerminated message when closing a Streaming session, and it contains a field called audio_duration_seconds with the total audio duration processed during the session. This feature allows customers to run end-user-specific billing calculations.

To enable this feature, set the enable_extra_session_information query parameter to true when connecting to a Streaming WebSocket.

endpoint_str = 'wss://api.assemblyai.com/v2/realtime/ws?sample_rate=8000&enable_extra_session_information=true'

This feature will be rolled out in our SDKs soon.

We've added a new feature to our Streaming STT service, allowing users to disable Partial Transcripts in a Streaming session. Our Streaming API sends two types of transcripts - Partial Transcripts (unformatted and unpunctuated) that gradually build up the current utterance, and Final Transcripts which are sent when an utterance is complete, containing the entire utterance punctuated and formatted.

Users can now set the disable_partial_transcripts query parameter to true when connecting to a Streaming WebSocket to disable the sending of Partial Transcript messages.

endpoint_str = 'wss://api.assemblyai.com/v2/realtime/ws?sample_rate=8000&disable_partial_transcripts=true'

This feature will be rolled out in our SDKs soon.

We have fixed a bug in our async transcription service, eliminating File does not appear to contain audio errors. Previously, this error would be surfaced in edge cases where our transcoding pipeline would not have enough resources to transcode a given file, thus failing due to resource starvation.

Dual channel transcription improvements

We've made improvements to how utterances are handled during dual-channel transcription. In particular, the transcription service now has elevated sensitivity when detecting utterances, leading to improved utterance insertions when there is overlapping speech on the two channels.

LeMUR concurrency fix

We've fixed a temporary issue in which users with low account balances would occasionally be rate-limited to a value less than 30 when using LeMUR.

Fewer "File does not appear to contain audio" errors

We've fixed an edge-case bug in our async API, leading to a significant reduction in errors that say File does not appear to contain audio. Users can expect to see an immediate reduction in this type of error. If this error does occur, users should retry their requests given that retries are generally successful.

We've made improvements to our transcription service autoscaling, leading to improved turnaround times for requests that use Word Boost when there is a spike in requests to our API.

New developer controls for real-time end-of-utterance

We have released developer controls for real-time end-of-utterance detection, providing developers control over when an utterance is considered complete. Developers can now either manually force the end of an utterance, or set a threshold for time of silence before an utterance is considered complete. 
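
Here's a sketch of how these controls look with our Python SDK, assuming the force_end_utterance() and configure_end_utterance_silence_threshold() helpers on the streaming transcriber (the callbacks are minimal and audio streaming is omitted for brevity):

import assemblyai as aai

aai.settings.api_key = "YOUR-API-KEY"

transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    on_data=lambda transcript: print(transcript.text),
    on_error=lambda error: print("error:", error),
)

transcriber.connect()

# Option 1: treat 700 ms of silence as the end of an utterance
transcriber.configure_end_utterance_silence_threshold(700)

# Option 2: force the current utterance to end immediately
transcriber.force_end_utterance()

transcriber.close()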

We have made changes to our English async transcription service that improve sentence segmentation for our Sentiment Analysis, Topic Detection, and Content Moderation models. The improvements fix a bug in which these models would sometimes delineate sentences on titles that end in periods like Dr. and Mrs.

We have fixed an issue in which transcriptions of very long files (8h+) with disfluencies enabled would error out.

PII Redaction and Entity Detection available in 13 additional languages

We have launched PII Text Redaction and Entity Detection for 13 new languages:

  1. Spanish
  2. Finnish
  3. French
  4. German
  5. Hindi
  6. Italian
  7. Korean
  8. Polish
  9. Portuguese
  10. Russian
  11. Turkish
  12. Ukrainian
  13. Vietnamese

We have increased the memory of our transcoding service workers, leading to a significant reduction in errors that say File does not appear to contain audio.

Fewer LeMUR 500 errors

We've made improvements to our LeMUR service to reduce the number of 500 errors.

We've made improvements to our real-time service that provide a small increase to the accuracy of timestamps in some edge cases.

Free tier limit increase; Real-time concurrency increase

We have increased the usage limit for our free tier to 100 hours. New users can now use our async API to transcribe up to 100 hours of audio, with a concurrency limit of 5, before needing to upgrade their accounts.

We have rolled out the concurrency limit increase for our real-time service. Users now have access to up to 100 concurrent streams by default when using our real-time service.

Higher concurrency is available upon request with no limit to what our API can support. If you need a higher concurrency limit, please either contact our Sales team or reach out to us at support@assemblyai.com. Note that our real-time service is only available for upgraded accounts.

Latency and cost reductions, concurrency increase

We introduced major improvements to our API's inference latency, with the majority of audio files now completing in well under 45 seconds regardless of audio duration, corresponding to a Real-Time Factor (RTF) of up to .008.

To put an RTF of .008x into perspective, this means you can now convert a:

  • 1h3min (75MB) meeting in 35 seconds
  • 3h15min (191MB) podcast in 133 seconds
  • 8h21min (464MB) video course in 300 seconds

In addition to these latency improvements, we have reduced our Speech-to-Text pricing. You can now access our Speech AI models with the following pricing:

  • Async Speech-to-Text for $0.37 per hour (previously $0.65) 
  • Real-time Speech-to-Text for $0.47 per hour (previously $0.75)

We've also reduced our pricing for the following Audio Intelligence models: Key Phrases, Sentiment Analysis, Summarization, PII Audio Redaction, PII Redaction, Auto Chapters, Entity Detection, Content Moderation, and Topic Detection. You can view the complete list of pricing updates on our Pricing page.

Finally, we've increased the default concurrency limits for both our async and real-time services. The increase is immediate for async, and will be rolled out soon for real-time. These new limits are now:

  • 200 for async (up from 32)
  • 100 for real-time (up from 32)

These new changes stem from the efficiencies that our incredible research and engineering teams drive at every level of our inference pipeline, including optimized model compilation, intelligent mini batching, hardware parallelization, and optimized serving infrastructure.

Learn more about these changes and our inference pipeline in our blog post.

Claude 2.1 available through LeMUR

Anthropic's Claude 2.1 is now generally available through LeMUR. Claude 2.1 is similar to our Default model, with reduced hallucinations, a larger context window, and better performance on citations.

Claude 2.1 can be used by setting the final_model parameter to anthropic/claude-2-1 in API requests to LeMUR. Here's an example of how to do this through our Python SDK:

import assemblyai as aai

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://example.org/customer.mp3")

result = transcript.lemur.task(
  "Summarize the following transcript in three to five sentences.",
  final_model=aai.LemurModel.claude2_1,
)

print(result.response)

You can learn more about setting the model used with LeMUR in our docs.

Real-time Binary support, improved async timestamps

Our real-time service now supports binary mode for sending audio segments. Users no longer need to encode audio segments as base64 sequences inside of JSON objects - the raw binary audio segment can now be directly sent to our API.

Moving forward, sending audio segments through WebSockets via the audio_data field is deprecated, although it remains the default for now to avoid breaking changes. We plan to support the audio_data field until 2025.

If you are using our SDKs, no changes are required on your end.

We have fixed a bug that would yield a degradation to timestamp accuracy at the end of very long files with many disfluencies.

New Node/JavaScript SDK works in multiple runtimes

We've released v4 of our Node JavaScript SDK. Previously, the SDK was developed specifically for Node, but the latest version now works in additional runtimes without any extra steps. The SDK can now be used in the browser, Deno, Bun, Cloudflare Workers, etc.

Check out the SDK's GitHub repository for additional information.

New Punctuation Restoration and Truecasing models, PCM Mu-law support

We've released new Punctuation and Truecasing models, achieving significant improvements for acronyms, mixed-case words, and more.

Below is a visual comparison between our previous Punctuation Restoration and Truecasing models (red) and the new models (green):

Going forward, the new Punctuation Restoration and Truecasing models will automatically be used for async and real-time transcriptions, with no need to upgrade for special access. Use the punctuate and format_text parameters, respectively, to enable or disable these models in a request (both are enabled by default).

Read more about our new models here.
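
Both toggles are standard request parameters; here's a minimal Python SDK sketch for an async request (the API key and audio URL are placeholders):

import assemblyai as aai

aai.settings.api_key = "YOUR-API-KEY"

# Both models are on by default; set these to False to opt out
config = aai.TranscriptionConfig(punctuate=True, format_text=True)

transcript = aai.Transcriber().transcribe("https://example.org/audio.mp3", config)

print(transcript.text)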

Our real-time transcription service now supports PCM Mu-law, an encoding used primarily in the telephony industry. This encoding is set by using the `encoding` parameter in requests to our API. You can read more about our PCM Mu-law support here.

We have improved internal reporting for our transcription service, which will allow us to better monitor traffic.

New LeMUR parameter, reduced hold music hallucinations

Users can now directly pass in custom text inputs into LeMUR through the input_text parameter as an alternative to transcript IDs. This gives users the ability to use any information from the async API, formatted however they want, with LeMUR for maximum flexibility.

For example, users can assign action items per user by inputting speaker-labeled transcripts, or pull citations by inputting timestamped transcripts. Learn more about the new input_text parameter in our LeMUR API reference, or check out examples of how to use the input_text parameter in the AssemblyAI Cookbook.
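
Here's a sketch of the new parameter with our Python SDK, using a speaker-labeled transcript formatted as plain text before it is passed to LeMUR (the audio URL and prompt are illustrative):

import assemblyai as aai

aai.settings.api_key = "YOUR-API-KEY"

config = aai.TranscriptionConfig(speaker_labels=True)
transcript = aai.Transcriber().transcribe("https://example.org/meeting.mp3", config)

# Format the transcript however you like before sending it to LeMUR
text = "\n".join(f"Speaker {u.speaker}: {u.text}" for u in transcript.utterances)

result = aai.Lemur().task(
    "Assign action items to each speaker.",
    input_text=text,
)

print(result.response)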

We've made improvements that reduce hallucinations which sometimes occurred when transcribing hold music on phone calls. This improvement is effective immediately with no changes required by users.

We've fixed an issue that would sometimes result in an inability to fulfill a request when XML was returned by the LeMUR /task endpoint.

Reduced latency, improved error messaging

We've made improvements to our file downloading pipeline which reduce transcription latency. Latency has been reduced by at least 3 seconds for all audio files, with greater improvements for large audio files provided via external URLs.

We've improved error messaging for increased clarity in the case of internal server errors.

New Dashboard features and LeMUR fix

We have released the beta for our new usage dashboard. You can now see a usage summary broken down by async transcription, real-time transcription, Audio Intelligence, and LeMUR. Additionally, you can see charts of usage over time broken down by model.

We have added support for AWS marketplace on the dashboard/account management pages of our web application.

We have fixed an issue in which LeMUR would sometimes fail when handling extremely short transcripts.

New LeMUR features and other improvements

We have added a new parameter to LeMUR that allows users to specify a temperature for LeMUR generation. Temperature refers to how stochastic the generated text is and can be a value from 0 to 1, inclusive, where 0 corresponds to low creativity and 1 corresponds to high creativity. Lower values are preferred for tasks like multiple choice, and higher values are preferred for tasks like coming up with creative summaries of clips for social media.

Here is an example of how to set the temperature parameter with our Python SDK (which is available in version 0.18.0 and up):

import assemblyai as aai

aai.settings.api_key = f"{API_TOKEN}"

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://storage.googleapis.com/aai-web-samples/meeting.mp4")

result = transcript.lemur.summarize(
	temperature=0.25
)

print(result.response)

We have added a new endpoint that allows users to delete the data for a previously submitted LeMUR request. The response data as well as any context provided in the original request will be removed. Continuing the example from above, we can see how to delete LeMUR data using our Python SDK:

request_id = result.request_id

deletion_result = aai.Lemur.purge_request_data(request_id)
print(deletion_result)

We have improved the error messaging for our Word Search functionality. Each phrase used with Word Search must be 5 words or fewer, and the error message returned when a request contains a phrase exceeding this limit is now clearer.

We have fixed an edge case error that would occur when both disfluencies and Auto Chapters were enabled for audio files that contained non-fluent English.

Improvements - observability, logging, and patches

We have improved logging for our LeMUR service to allow for the surfacing of more detailed errors to users.

We have increased observability into our Speech API internally, allowing for finer grained metrics of usage.

We have fixed a minor bug that would sometimes lead to incorrect timestamps for zero-confidence words.

We have fixed an issue in which requests to LeMUR would occasionally hang during peak usage due to a memory leak issue.

Multi-language speaker labels

We have recently launched Speaker Labels for 10 additional languages:

  • Spanish
  • Portuguese
  • German
  • Dutch
  • Finnish
  • French
  • Italian
  • Polish
  • Russian
  • Turkish

Audio Intelligence unbundling and price decreases

We have unbundled and lowered the price for our Audio Intelligence models. Previously, the bundled price for all Audio Intelligence models was $2.10/hr, regardless of the number of models used.

We have made each model accessible at a lower, unbundled, per-model rate:

  • Auto Chapters: $0.30/hr
  • Content Moderation: $0.25/hr
  • Entity Detection: $0.15/hr
  • Key Phrases: $0.06/hr
  • PII Redaction: $0.20/hr
  • Audio Redaction: $0.05/hr
  • Sentiment Analysis: $0.12/hr
  • Summarization: $0.06/hr
  • Topic Detection: $0.20/hr

New language support and improvements to existing languages

We now support the following additional languages for asynchronous transcription through our /v2/transcript endpoint:

  • Chinese
  • Finnish
  • Korean
  • Polish
  • Russian
  • Turkish
  • Ukrainian
  • Vietnamese

Additionally, we've made improvements in accuracy and quality to the following languages:

  • Dutch
  • French
  • German
  • Italian
  • Japanese
  • Portuguese
  • Spanish

You can see a full list of supported languages and features here. You can see how to specify a language in your API request here. Note that not all languages support Automatic Language Detection.

Pricing decreases

We have decreased the price of Core Transcription from $0.90 per hour to $0.65 per hour, and decreased the price of Real-Time Transcription from $0.90 per hour to $0.75 per hour.

Both decreases were effective as of August 3rd.

Significant Summarization model speedups

We've implemented changes that yield between a 43% and 200% increase in processing speed for our Summarization models, depending on which model is selected, with no measurable impact on the quality of results.

We have standardized the response from our API for automatically detected languages that do not support requested features. In particular, when Automatic Language Detection is used and the detected language does not support a feature requested in the transcript request, our API will return null in the response for that feature.

Introducing LeMUR, the easiest way to build LLM apps on spoken data

We've released LeMUR - our framework for applying LLMs to spoken data - for general availability. LeMUR is optimized for high accuracy on specific tasks:

  1. Custom Summary allows users to automatically summarize files in a flexible way
  2. Question & Answer allows users to ask specific questions about audio files and receive answers to these questions
  3. Action Items allows users to automatically generate a list of action items from virtual or in-person meetings

Additionally, LeMUR can be applied to groups of transcripts in order to analyze a set of files at once, allowing users to, for example, summarize many podcast episodes or ask questions about a series of customer calls.

Our Python SDK allows users to work with LeMUR in just a few lines of code:

# version 0.15 or greater
import assemblyai as aai

# set your API key
aai.settings.api_key = f"{API_TOKEN}"

# transcribe the audio file (meeting recording)
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://storage.googleapis.com/aai-web-samples/meeting.mp4")

# generate and print action items
result = transcript.lemur.action_items(
    context="A GitLab meeting to discuss logistics",
    answer_format="**<topic header>**\n<relevant action items>\n",
)

print(result.response)

Learn more about LeMUR in our blog post, or jump straight into the code in our associated Colab notebook.

Introducing our Conformer-2 model

We've released Conformer-2, our latest AI model for automatic speech recognition. Conformer-2 is trained on 1.1M hours of English audio data, extending Conformer-1 to provide improvements on proper nouns, alphanumerics, and robustness to noise.

Conformer-2 is now the default model for all English audio files sent to the v2/transcript endpoint for async processing and introduces no breaking changes.

We'll be releasing Conformer-2 for real-time English transcriptions within the next few weeks.

Read our full blog post about Conformer-2 here. You can also try it out in our Playground.

New parameter and timestamps fix

We've introduced a new, optional speech_threshold parameter, allowing users to only transcribe files that contain at least a specified percentage of spoken audio, represented as a ratio in the range [0, 1].

You can use the speech_threshold parameter with our Python SDK as below:

import assemblyai as aai

aai.settings.api_key = f"{ASSEMBLYAI_API_KEY}"

config = aai.TranscriptionConfig(speech_threshold=0.1)

file_url = "https://github.com/AssemblyAI-Examples/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3"

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(file_url, config)

print(transcript.text)
Smoke from hundreds of wildfires in Canada is triggering air quality alerts throughout the US. Skylines from ...

If the percentage of speech in the audio file does not meet or surpass the provided threshold, then the value of transcript.text will be None and you will receive an error:

if not transcript.text:
	print(transcript.error)
Audio speech threshold 0.9461 is below the requested speech threshold value 1.0

As usual, you can also include the speech_threshold parameter in the JSON of raw HTTP requests for any language.

We've fixed a bug in which timestamps could sometimes be incorrectly reported for our Topic Detection and Content Safety models.

We've made improvements to detect and remove a hallucination that would sometimes occur with specific audio patterns.

Character sequence improvements

We've fixed an issue in which the last character in an alphanumeric sequence could fail to be transcribed. The fix is effective immediately and constitutes a 95% reduction in errors of this type.

We've fixed an issue in which consecutive identical numbers in a long number sequence could fail to be transcribed. This fix is effective immediately and constitutes a 66% reduction in errors of this type.

Speaker Labels improvement

We've made improvements to the Speaker Labels model, adjusting the impact of the speakers_expected parameter to better allow the model to determine the correct number of unique speakers, especially in cases where one or more speakers talks substantially less than others.

We've expanded our caching system to include additional third-party resources to help further ensure our continued operations in the event of external resources being down.

Significant processing time improvement

We've made significant improvements to our transcoding pipeline, resulting in a 98% overall speedup in transcoding time and a 12% overall improvement in processing time for our asynchronous API.

We've implemented a caching system for some third-party resources to ensure our continued operations in the event of external resources being down.

Announcing LeMUR - our new framework for applying powerful LLMs to transcribed speech

We're introducing our new framework LeMUR, which makes it simple to apply Large Language Models (LLMs) to transcripts of audio files up to 10 hours in length.

LLMs unlock a range of impressive capabilities that allow teams to build powerful Generative AI features. However, building these features is difficult due to the limited context windows of modern LLMs, among other challenges that necessitate the development of complicated processing pipelines.

LeMUR circumvents this problem by making it easy to apply LLMs to transcribed speech, meaning that product teams can focus on building differentiating Generative AI features rather than focusing on building infrastructure. Learn more about what LeMUR can do and how it works in our announcement blog, or jump straight to trying LeMUR in our Playground.

New PII and Entity Detection Model

We've upgraded to a new and more accurate PII Redaction model, which improves credit card detections in particular.

We've made stability improvements regarding the handling and caching of web requests. These improvements additionally fix a rare issue with punctuation detection.

Multilingual and stereo audio fixes, & Japanese model retraining

We've fixed two edge cases in our async transcription pipeline that were producing non-deterministic results from multilingual and stereo audio.

We've improved word boundary detection in our Japanese automatic speech recognition model. These changes are effective immediately for all Japanese audio files submitted to AssemblyAI.

Decreased latency and improved password reset

We've implemented a range of improvements to our English pipeline, leading to an average 38% improvement in overall latency for asynchronous English transcriptions.

We've made improvements to our password reset process, offering greater clarity to users attempting to reset their passwords while still ensuring security throughout the reset process.

Conformer-1 now available for Real-Time transcription, new Speaker Labels parameter, and more

We're excited to announce that our new Conformer-1 Speech Recognition model is now available for real-time English transcriptions, offering a 24.3% relative accuracy improvement.

Effective immediately, this state-of-the-art model will be the default model for all English audio data sent to the wss://api.assemblyai.com/v2/realtime/ws WebSocket API.

The Speaker Labels model now accepts a new optional parameter called speakers_expected. If you have high confidence in the number of speakers in an audio file, then you can specify it with speakers_expected in order to improve Speaker Labels performance, particularly for short utterances.
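
For example, if you know a recording has exactly two speakers, you can pass that hint along with your request; here's a short sketch using raw HTTP requests (the audio URL is a placeholder):

import requests

endpoint = "https://api.assemblyai.com/v2/transcript"
json = {
    "audio_url": "https://example.org/phone_call.mp3",
    "speaker_labels": True,
    "speakers_expected": 2  # hint: we expect exactly two speakers
}
headers = {
    "authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}
response = requests.post(endpoint, json=json, headers=headers)
print(response.json())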

TLS 1.3 is now available for use with the AssemblyAI API. Using TLS 1.3 can decrease latency when establishing a connection to the API.

Our PII redaction scaling has been improved to increase stability, particularly when processing longer files.

We've improved the quality and accuracy of our Japanese model.

Short transcripts that are unable to be summarized will now return an empty summary and a successful transcript.

Introducing our Conformer-1 model

We've released our new Conformer-1 model for speech recognition. Conformer-1 was trained on 650K hours of audio data and is our most accurate model to date.

Conformer-1 is now the default model for all English audio files sent to the /v2/transcript endpoint for async processing.

We'll be releasing it for real-time English transcriptions within the next two weeks, and will add support for more languages soon.

New AI Models for Italian / Japanese Punctuation Improvements

Our Content Safety and Topic Detection models are now available for use with Italian audio files.

We've made improvements to our Japanese punctuation model, increasing relative accuracy by 11%. These changes are effective immediately for all Japanese audio files submitted to AssemblyAI.

Hindi Punctuation Improvements

We've made improvements to our Hindi punctuation model, increasing relative accuracy by 26%. These changes are effective immediately for all Hindi audio files submitted to AssemblyAI.

We've tuned our production infrastructure to reduce latency and improve overall consistency when using the Topic Detection and Content Moderation models.

Improved PII Redaction

We've released a new version of our PII Redaction model to improve PII detection accuracy, especially for credit card and phone number edge cases. Improvements are effective immediately for all API calls that include PII redaction.

Automatic Language Detection Upgrade

We've released a new version of our Automatic Language Detection model that better targets speech-dense parts of audio files, yielding improved accuracy. Additionally, support for dual-channel and low-volume files has been improved. All changes are effective immediately.

Our Core Transcription API has been migrated from EC2 to ECS in order to ensure scalable, reliable service and preemptively protect against service interruptions.

Password Reset

Users can now reset their passwords from our web UI. From the Dashboard login, simply click "Forgot your password?" to initiate a password reset. Alternatively, users who are already logged in can change their passwords from the Account tab on the Dashboard.

The maximum phrase length for our Word Search feature has been increased from 2 to 5, effective immediately.
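
For example, with our Python SDK you can search a completed transcript for phrases of up to five words (the phrases below are illustrative):

import assemblyai as aai

aai.settings.api_key = "YOUR-API-KEY"

transcript = aai.Transcriber().transcribe("https://example.org/audio.mp3")

# Each phrase may now be up to 5 words long
matches = transcript.word_search(["machine learning", "speech to text api"])

for match in matches:
    print(match.text, match.count)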

Dual Channel Support for Conversational Summarization / Improved Timestamps

We've made updates to our Conversational Summarization model to support dual-channel files. Effective immediately, dual_channel may be set to True when summary_model is set to conversational.
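
For example, a stereo call recording can now be summarized with the conversational model (the audio URL is a placeholder):

import requests

endpoint = "https://api.assemblyai.com/v2/transcript"
json = {
    "audio_url": "https://example.org/stereo_call.mp3",
    "dual_channel": True,
    "summarization": True,
    "summary_model": "conversational",
    "summary_type": "bullets"
}
headers = {
    "authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}
response = requests.post(endpoint, json=json, headers=headers)
print(response.json())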

We've made significant improvements to timestamps for non-English audio. Timestamps are now typically accurate between 0 and 100 milliseconds. This improvement is effective immediately for all non-English audio files submitted to AssemblyAI for transcription.

Improved Transcription Accuracy for Phone Numbers

We've made updates to our Core Transcription model to improve the transcription accuracy of phone numbers by 10%. This improvement is effective immediately for all audio files submitted to AssemblyAI for transcription.

We've improved scaling for our read-only database, resulting in improved performance for read-only requests.

v9 Transcription Model Released

We are happy to announce the release of our most accurate Speech Recognition model to date - version 9 (v9). This updated model delivers increased performance across many metrics on a wide range of audio types.

Word Error Rate, or WER, is the primary quantitative metric by which the performance of an automatic transcription model is measured. Our new v9 model shows significant improvements across a range of different audio types, as seen in the chart below, with a more than 11% improvement on average.

In addition to standard overall WER advancements, the new v9 model shows marked improvements with respect to proper nouns. In the chart below, we can see the relative performance increase of v9 over v8 for various types of audio, with a nearly 15% improvement on average.

The new v9 transcription model is currently live in production. This means that customers will see improved performance with no changes required on their end. The new model will automatically be used for all transcriptions created by our /v2/transcript endpoint going forward, with no need to upgrade for special access.

While our customers enjoy the elevated performance of the v9 model, our AI research team is already hard at work on our v10 model, which is slated to launch in early 2023. Building upon v9, the v10 model is expected to radically improve the state of the art in speech recognition.

Try our new v9 transcription model through your browser using the AssemblyAI Playground. Alternatively, sign up for a free API token to test it out through our API, or schedule a time with our AI experts to learn more.

New Summarization Models Tailored to Use Cases

We are excited to announce that new Summarization models are now available! Developers can now choose between multiple summary models that best fit their use case and customize the output based on the summary length.

The new models are:

  • Informative which is best for files with a single speaker, like a presentation or lecture
  • Conversational which is best for any multi-person conversation, like customer/agent phone calls or interview/interviewee calls
  • Catchy which is best for creating video, podcast, or media titles

Developers can use the summary_model parameter in their POST request to specify which of our summary models they would like to use. This new parameter can be used along with the existing summary_type parameter to allow the developer to customize the summary to their needs.

import requests
endpoint = "https://api.assemblyai.com/v2/transcript"
json = {
    "audio_url": "https://bit.ly/3qDXLG8",
    "summarization": True,
    "summary_model": "informative", # conversational | catchy
    "summary_type": "bullets" # bullets_verbose | gist | headline | paragraph
}
headers = {
	"authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}
response = requests.post(endpoint, json=json, headers=headers)
print(response.json())

Check out our latest blog post to learn more about the new Summarization models or head to the AssemblyAI Playground to test Summarization in your browser!

Improved Transcription Accuracy for COVID

We've made updates to our Core Transcription model to improve the transcription accuracy of the word COVID. This improvement is effective immediately for all audio files submitted to AssemblyAI for transcription.

Static IP support for webhooks is now generally available!

Outgoing webhook requests sent from AssemblyAI will now originate from a static IP address 44.238.19.20, rather than a dynamic IP address. This gives you the ability to easily validate that the source of the incoming request is coming from our server. Optionally, you can choose to whitelist this static IP address to add an additional layer of security to your system.

See our walkthrough on how to start receiving webhooks for your transcriptions.

New Audio Intelligence Models: Summarization

import requests
endpoint = "https://api.assemblyai.com/v2/transcript"
json = {
    "audio_url": "https://bit.ly/3qDXLG8",
    "summarization": True,
    "summary_type": "bullets" # paragraph | headline | gist
}
headers = {
    "authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}
response = requests.post(endpoint, json=json, headers=headers)
print(response.json())

Starting today, you can now transcribe and summarize entire audio files with a single API call.

To enable our new Summarization models, include the following parameter: "summarization": true in your POST request to /v2/transcript. When the transcription finishes, you will see the summary key in the JSON response containing the summary of your transcribed audio or video file.

By default, summaries will be returned in the style of bullet points. You can customize the style of summary by including the optional summary_type parameter in your POST request along with one of the following values: paragraph, headline, or gist. Here is the full list of summary types we support.

// summary_type = "paragraph"

"summary": "Josh Seiden and Brian Donohue discuss the
topic of outcome versus output on Inside Intercom.
Josh Seiden is a product consultant and author who has
just released a book called Outcomes Over Output.
Brian is product management director and he's looking
forward to the chat."

// summary_type = "headline"

"summary": "Josh Seiden and Brian Donohue discuss the
topic of outcomes versus output."

// summary_type = "gist"

"summary": "Outcomes over output"

// summary_type = "bullets"

"summary": "Josh Seiden and Brian Donohue discuss
the topic of outcome versus output on Inside Intercom.
Josh Seiden is a product consultant and author who has
just released a book called Outcomes Over Output.
Brian is product management director and he's looking
forward to the chat.\n- ..."

Examples of use cases for Summarization include:

  • Identify key takeaways from phone calls to speed up post-call review and reduce manual summarization
  • Summarize long podcasts into short descriptions so users can preview before they listen.
  • Instantly generate meetings summaries to quickly recap virtual meetings and highlight post-meeting actions
  • Suggest 3-5 word video titles automatically for user-generated content
  • Synthesize long educational courses, lectures, and media broadcasts into their most important points for faster consumption

We're really excited to see what you build with our new Summarization models. To get started, try it out for free in our no-code playground or visit our documentation for more info on how to enable Summarization in your API requests.

Automatic Casing / Short Utterances

Weā€™ve improved our Automatic Casing model and fixed a minor bug that caused over-capitalization in English transcripts. The Automatic Casing model is enabled by default with our Core Transcription API to improve transcript readability for video captions (SRT/VTT). See our documentation for more info on Automatic Casing.

Our Core Transcription model has been fine-tuned to better detect short utterances in English transcripts. Examples of short utterances include one-word answers such as "No." and "Right." This update will take effect immediately for all customers.

Static IP Support for Webhooks

Over the next few weeks, we will begin rolling out Static IP support for webhooks to customers in stages.

Outgoing webhook requests sent from AssemblyAI will now originate from the static IP address 44.238.19.20, rather than a dynamic IP address. This makes it easy to validate that incoming requests come from our servers. Optionally, you can whitelist this static IP address to add an additional layer of security to your system.

See our walkthrough on how to start receiving webhooks for your transcriptions.

Improved Number Transcription
PII Redaction Examples

We've made improvements to our Core Transcription model to better identify and transcribe numbers present in your audio files.

Accurate number transcription is critical for customers that need to redact Personally Identifiable Information (PII) that gets exchanged during phone calls. Examples of PII include credit card numbers, addresses, phone numbers, and social security numbers.

In order to help you handle sensitive user data at scale, our PII Redaction model automatically detects and removes sensitive info from transcriptions. For example, when PII redaction is enabled, a phone number like 412-412-4124 would become ###-###-####.
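
As a sketch, a request enabling PII Redaction might look like the following (the audio URL is a placeholder, and the policy names shown are only a subset; see the blog post below for the full list of policies):

import requests

endpoint = "https://api.assemblyai.com/v2/transcript"
json = {
    "audio_url": "https://example.com/phone-call.mp3",  # placeholder file URL
    "redact_pii": True,
    "redact_pii_policies": ["phone_number", "credit_card_number", "us_social_security_number"]
}
headers = {
    "authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}
response = requests.post(endpoint, json=json, headers=headers)
print(response.json()["id"])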

To learn more, check out our blog that covers all of our PII Redaction Policies or try our PII Redaction model in our Sandbox here!

Improved Disfluency Timestamps

We've updated our Disfluency Detection model to improve the accuracy of timestamps for disfluency words.

By default, disfluencies such as "um", "uh", and "hm" are automatically excluded from transcripts. However, customers can include these filler words by simply setting the disfluencies parameter to true in their POST request to /v2/transcript, which enables our Disfluency Detection model.
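
For example, a minimal request enabling Filler Words might look like this (the audio URL is a placeholder):

import requests

response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    json={"audio_url": "https://example.com/audio.mp3", "disfluencies": True},  # placeholder URL
    headers={"authorization": "YOUR-API-TOKEN"}
)
print(response.json()["id"])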

More info and code examples can be found here.

Speaker Label Improvement

We've improved the Speaker Label model's ability to identify unique speakers for single-word or short utterances.

Historical Transcript Bug Fix

We've fixed a bug with the Historical Transcript endpoint that was causing null to appear as the value of the completed key.

Japanese Transcription Now Available
Code snippet for Japanese transcription

Today, we're releasing our new Japanese transcription model to help you transcribe and analyze your Japanese audio and video files using our cutting-edge AI.

Now you can automatically convert any Japanese audio or video file to text by including "language_code": "ja" in your POST request to our /v2/transcript endpoint.
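
For example, a minimal request might look like this (the audio URL is a placeholder):

import requests

response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    json={"audio_url": "https://example.com/japanese-audio.mp3", "language_code": "ja"},  # placeholder URL
    headers={"authorization": "YOUR-API-TOKEN"}
)
print(response.json()["id"])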

In conjunction with transcription, we've also added Japanese support for our AI models including Custom Vocabulary (Word Boost), Custom Spelling, Automatic Punctuation / Casing, Profanity Filtering, and more. This means you can boost transcription accuracy with more granularity based on your use case. See the full list of supported models available for Japanese transcriptions here.

To get started, visit our walkthrough on Specifying a Language on our AssemblyAI documentation page or try it out now in our Sandbox!

Hindi Transcription / Custom Webhook Headers
Code snippet for Hindi transcriptions

We've released our new Hindi transcription model to help you transcribe and analyze your Hindi audio and video files.

Now you can automatically convert any Hindi audio or video file to text by including "language_code": "hi" in your POST request to our /v2/transcript endpoint.

We've also added Hindi support for our AI models including Custom Vocabulary (Word Boost), Custom Spelling, Automatic Punctuation / Casing, Profanity Filtering, and more. See the full list of supported models available for Hindi transcriptions here.

To get started with Hindi transcription, visit our walkthrough on Specifying a Language on our AssemblyAI documentation page.

Our Webhook service now supports the use of Custom Headers for authentication.

A Custom Header can be used for added security to authenticate webhook requests from AssemblyAI. This feature allows a developer to optionally provide a value to be used as an authorization header on the returning webhook from AssemblyAI, giving the ability to validate incoming webhook requests.

To use a Custom Header, include two additional parameters in your POST request to /v2/transcript: webhook_auth_header_name and webhook_auth_header_value. The webhook_auth_header_name parameter accepts a string containing the name of the header to be inserted into the webhook request, and webhook_auth_header_value accepts a string containing that header's value. See our Using Webhooks documentation to learn more and view our code examples.
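
As a sketch, a request using these parameters might look like this (the audio URL, webhook URL, header name, and header value below are placeholders you choose yourself):

import requests

json = {
    "audio_url": "https://example.com/audio.mp3",          # placeholder file URL
    "webhook_url": "https://example.com/assemblyai-hook",  # placeholder webhook receiver
    "webhook_auth_header_name": "X-My-Webhook-Secret",     # example header name
    "webhook_auth_header_value": "my-shared-secret"        # example header value
}
headers = {"authorization": "YOUR-API-TOKEN", "content-type": "application/json"}
response = requests.post("https://api.assemblyai.com/v2/transcript", json=json, headers=headers)
print(response.json()["id"])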

Improved Speaker Labels Accuracy and Speaker Segmentation

  • Improved the overall accuracy of the Speaker Labels feature and the model's ability to segment speakers.

  • Fixed a small edge case that would occasionally cause some transcripts to complete with NULL as the language_code value.
Content Moderation and Topic Detection Available for Portuguese

  • Improved Inverse Text Normalization of money amounts in transcript text.

  • Addressed an issue with Real-Time Transcription that would occasionally cause variance in timestamps over the course of a session.
  • Fixed an edge case with transcripts including Filler Words that would occasionally cause server errors.
Automatic Language Detection Available for Dutch and Portuguese

  • Accuracy of the Automatic Language Detection model improved on files with large amounts of silence.
  • Improved speaker segmentation accuracy for Speaker Labels.
Dutch and Portuguese Support Released

  • Dutch and Portuguese transcription is now generally available for our /v2/transcript endpoint. See our documentation for more information on specifying a language in your POST request.
Content Moderation and Topic Detection Available for French, German, and Spanish

  • Improved redaction accuracy for credit_card_number, credit_card_expiration, and credit_card_cvv policies in our PII Redaction feature.

  • Fixed an edge case that would occasionally affect the capitalization of words in transcripts when disfluencies was set to true.
French, German, and Italian Support Released

  • French, German, and Italian transcription is now publicly available. Check out our documentation for more information on Specifying a Language in your POST request.

  • Released v2 of our Spanish model, improving absolute accuracy by ~4%.
  • Automatic Language Detection now supports French, German, and Italian.
  • Reduced the volume of the beep used to redact PII information in redacted audio files.
Miscellaneous Bug Fixes

  • Fixed an edge case that would occasionally affect timestamps for a small number of words when disfluencies was set to true.
  • Fixed an edge case where PII audio redaction would occasionally fail when using local files.
New Policies Added for PII Redaction and Entity Detection

Spanish Language Support, Automatic Language Detection, and Custom Spelling Released

  • Spanish transcription is now publicly available. Check out our documentation for more information on Specifying a Language in your POST request.
  • Automatic Language Detection is now available for our /v2/transcript endpoint. This feature can identify the dominant language that's spoken in an audio file and route the file to the appropriate model for the detected language.
  • Our new Custom Spelling feature gives you the ability to specify how words are spelled or formatted in the transcript text. For example, Custom Spelling could be used to change all instances of "CS 50" to "CS50".
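
For example, a request using Custom Spelling might look like the following sketch (the audio URL is a placeholder):

import requests

json = {
    "audio_url": "https://example.com/lecture.mp3",  # placeholder file URL
    "custom_spelling": [
        {"from": ["CS 50"], "to": "CS50"}
    ]
}
headers = {"authorization": "YOUR-API-TOKEN", "content-type": "application/json"}
response = requests.post("https://api.assemblyai.com/v2/transcript", json=json, headers=headers)
print(response.json()["id"])
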
Auto Chapters v6 Released

  • Released Auto Chapters v6, improving the summarization of longer chapters.
Auto Chapters v5 Released

  • Auto Chapters v5 released, improving headline and gist generation and quote formatting in the summary key.

  • Fixed an edge case in Dual-Channel files where initial words in an audio file would occasionally be missed in the transcription.
Regional Spelling Improvements

  • Region-specific spelling improved for en_uk and en_au language codes.
  • Improved the formatting of "MP3" in transcripts.
  • Improved Real-Time transcription error handling for corrupted audio files.
Real-Time v3 Released

  • Released v3 of our Real-Time Transcription model, improving overall accuracy by 18% and proper noun recognition by 23% relative to the v2 model.

  • Improved PII Redaction and Entity Detection for CREDIT_CARD_CVV and LOCATION.
Auto Chapters v4 Released, Auto Retry Feature Added

  • Added an Auto Retry feature, which automatically retries transcripts that fail with a "Server error, developers have been alerted" message. This feature is enabled by default. To disable it, visit the Account tab in your Developer Dashboard.

  • Auto Chapters v4 released, improving chapter summarization in the summary key.
  • Added a trailing period for the gist key in the Auto Chapters feature.
Auto Chapters v3 Released

  • Released v3 of our Auto Chapters model, improving the model's ability to segment audio into chapters and chapter boundary detection by 56.3%.
  • Improved formatting for Auto Chapters summaries. The summary, headline, and gist keys now include better punctuation, casing, and text formatting.
Webhook Status Codes, Entity Detection Improved

  • POST requests from the API to webhook URLs will now accept any status code from 200 to 299 as a successful HTTP response. Previously only 200 status codes were accepted.
  • Updated the text key in our Entity Detection feature to return the proper noun rather than the possessive noun. For example, Andrew instead of Andrew's.

  • Fixed an edge case with Entity Detection where under certain contexts, a disfluency could be identified as an entity.
Punctuation and Casing Accuracy Improved, Inverse Text Normalization Model Updated

  • Released v4 of our Punctuation model, increasing punctuation and casing accuracy by ~2%.
  • Updated our Inverse Text Normalization (ITN) model for our /v2/transcript endpoint, improving web address and email address formatting and fixing the occasional number formatting issue.

  • Fixed an edge case where multi-channel files would return no text when the two channels were out of phase with each other.
Support for Non-English Languages Coming Soon

  • Our Deep Learning team has been hard at work training our new non-English language models. In the coming weeks, we will be adding support for French, German, Italian, and Spanish.
Shorter Summaries Added to Auto Chapters, Improved Filler Word Detection

  • Added a new gist key to the Auto Chapters feature. This new key provides an ultra-short, usually 3 to 8 word summary of the content spoken during that chapter.

  • Implemented profanity filtering into Auto Chapters, which will prevent the API from generating a summary, headline, or gist that includes profanity.
  • Improved Filler Word (aka, disfluencies) detection by ~5%.
  • Improved accuracy for Real-Time Streaming Transcription.

  • Fixed an edge case where WebSocket connections for Real-Time Transcription sessions would occasionally not close properly after the session was terminated. This resulted in the client receiving a 4031 error code even after sending a session termination message.
  • Corrected a bug that occasionally attributed disfluencies to the wrong utterance when Speaker Labels or Dual-Channel Transcription was enabled.
v8.5 Asynchronous Transcription Model Released

  • Our Asynchronous Speech Recognition model is now even better with the release of v8.5.
  • This update improves overall accuracy by 4% relative to our v8 model.
  • This is achieved by improving the model's ability to handle noisy or difficult-to-decipher audio.
  • The v8.5 model also improves Inverse Text Normalization for numbers.
New and Improved API Documentation

  • Launched the new AssemblyAI Docs, with more complete documentation and an easy-to-navigate interface so developers can effectively use and integrate with our API. Click here to view the new and improved documentation.

  • Added two new fields to the FinalTranscript response for Real-time Transcriptions. The punctuated key is a Boolean value indicating if punctuation was successful. The text_formatted key is a Boolean value indicating if Inverse Text Normalization (ITN) was successful.
Inverse Text Normalization Added to Real-Time, Word Boost Accuracy Improved

  • Inverse Text Normalization (ITN) added for our /v2/realtime and /v2/stream endpoints. ITN improves formatting of entities like numbers, dates, and proper nouns in the transcription text.

  • Improved accuracy for Custom Vocabulary (aka, Word Boosts) with the Real-Time transcription API.

  • Fixed an edge case that would sometimes cause transcription errors when disfluencies was set to true and no words were identified in the audio file.
Entity Detection Released, Improved Filler Word Detection, Usage Alerts

  • v1 release of Entity Detection - automatically detects a wide range of entities like person and company names, emails, addresses, dates, locations, events, and more.
  • To include Entity Detection in your transcript, set entity_detection to true in your POST request to /v2/transcript.
  • When your transcript is complete, you will see an entities key towards the bottom of the JSON response containing the entities detected, as shown in the example below.
  • Read more about Entity Detection in our official documentation.
  • Usage Alert feature added, allowing customers to set a monthly usage threshold on their account along with a list of email addresses to be notified when that monthly threshold has been exceeded. This feature can be enabled by clicking "Set up alerts" on the "Developers" tab in the Dashboard.
  • When Content Safety is enabled, a summary of the severity scores detected will now be returned in the API response under the severity_score_summary key, nested inside of the content_safety_labels key, as shown below.
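
For illustration, a completed transcript can be fetched and both keys inspected like this (the shapes in the comments are illustrative and the values hypothetical):

import requests

headers = {"authorization": "YOUR-API-TOKEN"}
transcript_id = "YOUR-TRANSCRIPT-ID"
transcript = requests.get(
    f"https://api.assemblyai.com/v2/transcript/{transcript_id}", headers=headers
).json()

# Entities detected in the transcript, e.g.
# [{"entity_type": "person_name", "text": "Andrew", "start": 7440, "end": 7680}, ...]
print(transcript["entities"])

# Severity scores summarized per Content Safety label, e.g.
# {"profanity": {"low": 0.82, "medium": 0.15, "high": 0.03}, ...}
print(transcript["content_safety_labels"]["severity_score_summary"])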

  • Improved Filler Word (aka, disfluencies) detection by ~25%.

  • Fixed a bug in Auto Chapters that would occasionally add an extra space between sentences for headlines and summaries.
Additional MIME Type Detection Added for OPUS Files

  • Added additional MIME type detection to detect a wider variety of OPUS files.

  • Fixed an issue with word timing calculations that caused issues with speaker labeling for a small number of transcripts.
Custom Vocabulary Accuracy Significantly Improved

  • Significantly improved the accuracy of Custom Vocabulary, as well as the impact of the boost_param field, which controls the weight applied to Custom Vocabulary.
  • Improved precision of word timings.
New Auto Chapters, Sentiment Analysis, and Disfluencies Features Released

  • v1 release of Auto Chapters - which provides a "summary over time" by breaking audio/video files into "chapters" based on the topic of conversation. Check out our blog to read more about this new feature. To enable Auto Chapters in your request, you can set auto_chapters: true in your POST request to /v2/transcript.
  • v1 release of Sentiment Analysis - that determines the sentiment of sentences in a transcript as "positive", "negative", or "neutral". Sentiment Analysis can be enabled by including the sentiment_analysis: true parameter in your POST request to /v2/transcript.
  • Filler-words like "um" and "uh" can now be included in the transcription text. Simply include disfluencies: true in your POST request to /v2/transcript.

  • Deployed Speaker Labels version 1.3.0. Improves overall diarization/labeling accuracy.
  • Improved our internal auto-scaling for asynchronous transcription, to keep turnaround times consistently low during periods of high usage.
New Language Code Parameter for English Spelling

  • Added a new language_code parameter when making requests to /v2/transcript.
  • Developers can set this to en_us (default), en_uk, or en_au to ensure the correct English spelling is used - US English, British English, or Australian English.
  • Quick note: for customers that were historically using the assemblyai_en_au or assemblyai_en_uk acoustic models, the language_code parameter is essentially redundant and doesn't need to be used.

  • Fixed an edge-case where some files with prolonged silences would occasionally have a single word predicted, such as "you" or "hi."
New Features Coming Soon, Bug Fixes

  • This week, our engineering team has been hard at work preparing for the release of exciting new features like:
  • Chapter Detection: Automatically summarize audio and video files into segments (aka "chapters").
  • Sentiment Analysis: Determine the sentiment of sentences in your transcript as "positive", "negative", or "neutral".
  • Disfluencies: Detects filler-words like "um" and "uh".

  • Improved average real-time latency by 2.1% and p99 latency by 0.06%.

  • Fixed an edge-case where utterances for dual-channel audio files would occasionally receive a confidence score greater than 1.0.
Improved v8 Model Processing Speed

  • Improved the API's ability to handle audio/video files with a duration over 8 hours.

  • Further improved transcription processing times by 12%.
  • Fixed an edge case in our responses for dual-channel audio files where, if speaker 2 interrupted speaker 1, the text from speaker 2 would cause the text from speaker 1 to be split into multiple turns, rather than contextually keeping all of speaker 1's text together.
v8 Transcription Model Released

  • Today, we're happy to announce the release of our most accurate Speech Recognition model for asynchronous transcription to dateā€”version 8 (v8).
  • This new model dramatically improves overall accuracy (up to 19% relative), and proper noun accuracy as well (up to 25% relative).
  • You can read more about our v8 model in our blog here.

  • Fixed an edge case where a small percentage of short (<60 seconds in length) dual-channel audio files, with the same audio on each channel, resulted in repeated words in the transcription.
v2 Real-Time and v4 Topic Detection Models Released

  • Launched our v2 Real-Time Streaming Transcription model (read more on our blog).
  • This new model improves accuracy of our Real-Time Streaming Transcription by ~10%.
  • Launched our Topic Detection v4 model, with an accuracy boost of ~8.37% over v3 (read more on our blog).
v3 Topic Detection Model, PII Redaction Bug Fixes

  • Released our v3 Topic Detection model.
  • This model dramatically improves the Topic Detection feature's ability to accurately detect topics based on context.
  • For example, in the following text, the model was able to accurately predict "Rugby" without the mention of the sport directly, due to the mention of "Ed Robinson" (a Rugby coach).

  • PII Redaction has been improved to better identify (and redact) phone numbers even when they are not explicitly referred to as a phone number.

  • Released a fix for PII Redaction that corrects an issue where the model would sometimes detect phone numbers as credit card numbers or social security numbers.
Severity Scores for Content Safety
  • The API now returns a severity score along with the confidence and label keys when using the Content Safety feature.
  • The severity score measures how intense a detected Content Safety label is on a scale of 0 to 1.
  • For example, a natural disaster that leads to mass casualties will have a score of 1.0, while a small storm that breaks a mailbox will only be 0.1.

  • Fixed an edge case where a small number of transcripts with Automatic Transcript Highlights turned on were not returning any results.
Real-time Transcription and Streaming Fixes

  • Fixed an edge case where higher sample rates would occasionally trigger a Client sent audio too fast error from the Real-Time Streaming WebSocket API.
  • Fixed an edge case where some streams from Real-Time Streaming WebSocket API were held open after a customer idled their session.
  • Fixed an edge case in the /v2/stream endpoint, where large periods of silence would occasionally cause automatic punctuation to fail.
  • Improved error handling when non-JSON input is sent to the /v2/transcript endpoint.
Punctuation v3, Word Search, Bug Fixes

  • v3 Punctuation Model released.
  • v3 brings improved accuracy to automatic punctuation and casing for both async (/v2/transcript) and real-time (WebSocket API) transcripts.
  • Released an all-new Word Search feature that will allow developers to search for words in a completed transcript.
  • This new feature returns how many times the word was spoken, the index of that word in the transcript's JSON response word list/array, and the associated timestamps for each matched word.
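
As a rough sketch, Word Search can be queried for a completed transcript like this (the transcript ID is a placeholder; see our Word Search docs for the exact endpoint and response shape):

import requests

headers = {"authorization": "YOUR-API-TOKEN"}
transcript_id = "YOUR-TRANSCRIPT-ID"
response = requests.get(
    f"https://api.assemblyai.com/v2/transcript/{transcript_id}/word-search",
    params={"words": "hello,world"},  # comma-separated list of words to search for
    headers=headers
)
print(response.json())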

  • Fixed an issue causing a small subset of words not to be filtered when profanity filtering was turned on.
New Dashboard and Real-Time Data

This week, we released an entirely new dashboard for developers:

The new developer dashboard introduces:

  • Better reporting for API usage and spend
  • Easier access to API Docs, API Tokens, and Account information
  • A no-code demo to test the API on your audio/video files without having to write any code
General Improvements
  • Fixed a bug with PII Redaction, where sometimes dollar amount and date tokens were not being properly redacted.
  • AssemblyAI now supports even more audio/video file formats thanks to improvements to our audio transcoding pipeline!
  • Fixed a rare bug where a small percentage of transcripts (0.01%) would incorrectly sit in a status of "queued" for up to 60 seconds.
ITN Model Update

Today we've released a major improvement to our ITN (Inverse Text Normalization) model. This results in better formatting for entities within the transcription, such as phone numbers, money amounts, and dates.

For example:

Money:

  • Spoken: "Hey, do you have five dollars?"
  • Model output with ITN: "Hey, do you have $5?"

Years:

  • Spoken: "Yes, I believe it was back in two thousand eight"
  • Model output with ITN: "Yes, I believe it was back in 2008."
Punctuation Model v2.5 Released

Today we've released an updated Automatic Punctuation and Casing Restoration model (Punctuation v2.5)! This update improves capitalization of proper nouns in transcripts, reduces over-capitalization issues where some words were being incorrectly capitalized, and improves some edge cases around words adjacent to commas. For example:

  • "....in the Us" now becomes "....in the US."
  • "whatsapp," now becomes "WhatsApp,"
Content Safety Model (v7) Released

We have released an updated Content Safety Model - v7! Performance has improved for 10 of our 19 Content Safety labels, with the biggest improvements in the Profanity and Natural Disasters labels.

Real-Time Transcription Model v1.1 Released

We have just released a major real-time update!

Developers will now be able to use the word_boost parameter in requests to the real-time API, allowing you to introduce your own custom vocabulary to the model for that given session! This custom vocabulary will lead to improved accuracy for the provided words.

General Improvements

We now limit each real-time session to one websocket connection, to ensure the integrity of a customer's transcription and prevent multiple users/clients from using the same session.

Note: Developers can still have multiple real-time sessions open in parallel, up to the Concurrency Limit on the account. For example, if an account has a Concurrency Limit of 32, that account could have up to 32 concurrent real-time sessions open.

Topic Detection Model v2 Released

Today we have released v2 of our Topic Detection Model. This new model will predict multiple topics for each paragraph of text, whereas v1 was limited to predicting a single topic. For example, given the text:

"Elon Musk just released a new Tesla that drives itself!"

v1:

  • Automotive>AutoType>DriverlessCars: 1

v2:

  • Automotive>AutoType>DriverlessCars: 1
  • PopCulture: 0.84
  • PopCulture>CelebrityStyle: 0.56

This improvement results in significantly better visual output and more informative responses for developers!

Increased Number of Categories Returned for Topic Detection Summary

In this minor improvement, we have increased the number of topics the model can return in the summary key of the JSON response from 10 to 20.

Temporary Tokens for Real-Time

Oftentimes, developers need to expose their AssemblyAI API key in their client applications when establishing connections with our real-time streaming transcription API. Now, developers can create a temporary API token that expires after a customizable amount of time (similar to an AWS S3 Temporary Authorization URL) and can safely be exposed in client applications and front ends.

This allows developers to create short-lived API tokens designed to be used securely in the browser, with authorization passed in the query string!

For example, authenticating in the query parameters with a temporary token would look like so:

wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token={TEMP_TOKEN}
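
The temporary token itself is created server-side with your regular API key. As a rough sketch (assuming the temporary token endpoint and expires_in parameter described in our Docs, with a one-hour expiry as an example):

import requests

response = requests.post(
    "https://api.assemblyai.com/v2/realtime/token",
    json={"expires_in": 3600},  # token lifetime in seconds (example value)
    headers={"authorization": "YOUR-API-TOKEN"}
)
temp_token = response.json()["token"]
ws_url = f"wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token={temp_token}"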

For more information, you can view our Docs!

Adding "Marijuana" and "Sensitive Social Issues" as Possible Content Safety Labels

In this minor update, we improve the accuracy across all Content Safety labels, and add two new labels for better content categorization. The two new labels are sensitive_social_issues and marijuana.

New label definitions:

  • sensitive_social_issues: This category includes content that may be considered insensitive, irresponsible, or harmful to specific groups based on their beliefs, political affiliation, sexual orientation, or gender identity.
  • marijuana: This category includes content that discusses marijuana or its usage.
Real-Time Transcription is Now GA

We are pleased to announce the official release of our Real-Time Streaming Transcription API! This API uses WebSockets and a fast Conformer Neural Network architecture that allows for quick and accurate transcription in real time.

Find out more in our Docs here!

Content Safety Detection and Topic Detection are now GA!

Today we have released two of our enterprise-level models, Content Safety Detection and Topic Detection, to all users!

Now any developer can make use of these cutting edge models within their applications and products. Explore these new features in our Docs:

Minor Update to PII Redaction

With this minor update, our Redaction Model will better detect Social Security Numbers and Medical References for additional security and data protection!

New Punctuation Model (v2)

Today we released a new punctuation model that is more extensive than its predecessor, and will drive improvements in punctuation and casing accuracy!

New Features & Updates

List Historical Transcripts

  • Developers can get a list of their historical transcriptions. This list can be filtered by status and date. This new endpoint will allow developers to see if they have any queued, processing, or throttled transcriptions.

Pre-Formatted Paragraphs

  • Developers can now get pre-formatted paragraphs by calling our new paragraphs endpoint! The model will attempt to semantically break the transcript up into paragraphs of five sentences or less.
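
As a rough sketch, paragraphs for a completed transcript can be fetched like this (the transcript ID is a placeholder):

import requests

headers = {"authorization": "YOUR-API-TOKEN"}
transcript_id = "YOUR-TRANSCRIPT-ID"
response = requests.get(
    f"https://api.assemblyai.com/v2/transcript/{transcript_id}/paragraphs",
    headers=headers
)
for paragraph in response.json()["paragraphs"]:
    print(paragraph["text"])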

You can explore each feature further in our Docs:

Topic Detection Response Improvements

  • Now each topic will include timestamps for each segment of classified text. We have also added a new summary key that will contain the confidence of all unique topics detected throughout the entire transcript.

  • We have made improvements to our Speaker Diarization Model that increases accuracy over short and long transcripts.
New PII Classes

We have released an update to our PII Redaction Model that will now support detecting and redacting additional classes!

  • blood_type
  • medical_condition
  • drug (including vitamins/minerals)
  • injury
  • medical_process

Entity Definitions:

  • blood_type: Blood type.
  • medical_condition: A medical condition. Includes diseases, syndromes, deficits, and disorders. E.g., chronic fatigue syndrome, arrhythmia, depression.
  • drug: Medical drug, including vitamins and minerals. E.g., Advil, Acetaminophen, Panadol.
  • injury: Human injury, e.g., I broke my arm, I have a sprained wrist. Includes mutations, miscarriages, and dislocations.
  • medical_process: Medical process, including treatments, procedures, and tests. E.g., "heart surgery," "CT scan."