Changelog

Follow along to see weekly accuracy and product improvements.

Usage and spend alerts

Users can now set up billing alerts in their user portals. Billing alerts notify you when your monthly spend or account balance reaches a threshold.

To set up a billing alert, go to the billing page of your portal, and click Set up a new alert under the Your alerts widget:

You can then set up an alert by specifying whether to alert on monthly spend or account balance, as well as the specific threshold at which to send an alert.

Universal-1 now available in German

Universal-1, our most powerful and accurate multilingual Speech-to-Text model, is now available in German.

No special action is needed to utilize Universal-1 on German audio - all requests sent to our /v2/transcript endpoint with German audio files will now use Universal-1 by default. Learn more about how to integrate Universal-1 into your apps in our Getting Started guides.

New languages for Speaker Diarization

Speaker Diarization is now available in five additional languages for both the Best and Nano tiers:

  • Chinese
  • Hindi
  • Japanese 
  • Korean 
  • Vietnamese
New API Reference, Timestamps improvements

We’ve released a new version of the API Reference section of our docs for an improved developer experience. Here’s what’s new:

  1. New API Reference pages with exhaustive endpoint documentation for transcription, LeMUR, and streaming
  2. cURL examples for every endpoint
  3. Interactive Playground: Test our API endpoints with the interactive playground. It includes a form-builder for generating requests and corresponding code examples in cURL, Python, and TypeScript
  4. Always up to date: The new API Reference is autogenerated based on our Open-Source OpenAPI and AsyncAPI specs

We’ve made improvements to Universal-1’s timestamps for both the Best and Nano tiers, yielding improved timestamp accuracy and a reduced incidence of overlapping timestamps.

We’ve fixed an issue in which users could receive an `Unable to create transcription. Developers have been alerted` error that would be surfaced when using long files with Sentiment Analysis.

New codec support; account deletion support

We’ve upgraded our transcoding library and now support the following new codecs:

  • Bonk, APAC, Mi-SC4, 100i, VQC, FTR PHM, WBMP, XMD ADPCM, WADY DPCM, CBD2 DPCM
  • HEVC, VP9, AV1 codec in enhanced flv format

Users can now delete their accounts by selecting the Delete account option on the Account page of their AssemblyAI Dashboards.

Users will now receive a 400 error when using an invalid tier and language code combination, with an error message such as The selected language_code is supported by the following speech_models: best, conformer-2. See https://www.assemblyai.com/docs/concepts/supported-languages..

We’ve fixed an issue in which nested JSON responses from LeMUR would cause Invalid LLM response, unable to fulfill request. Please try again. errors.

We’ve fixed a bug in which very long files would sometimes fail to transcribe, leading to timeout errors.

AssemblyAI app for Make.com

Make (formerly Integromat) is a no-code automation platform that makes it easy to build tasks and workflows that synthesize many different services.

We’ve released the AssemblyAI app for Make that allows Make users to incorporate AssemblyAI into their workflows, or scenarios. In other words, in Make you can now use our AI models to

  1. Transcribe audio data with speech recognition models
  2. Analyze audio data with audio intelligence models
  3. Build generative features on top of audio data with LLMs

For example, in our tutorial on Redacting PII with Make, we demonstrate how to build a Make scenario that automatically creates a redacted audio file and redacted transcription for any audio file uploaded to a Google Drive folder.

GDPR and PCI DSS compliance

AssemblyAI is now officially PCI Compliant. The Payment Card Industry Data Security Standard Requirements and Security Assessment Procedures (PCI DSS) certification is a rigorous assessment that ensures card holder data is being properly and securely handled and stored. You can read more about PCI DSS here.

Additionally, organizations which have signed an NDA can go to our Trust Portal in order to view our PCI attestation of compliance, as well as other security-related documents.

AssemblyAI is also GDPR Compliant. The General Data Protection Regulation (GDPR) is regulation regarding privacy and security for the European Union that applies to businesses that serve customers within the EU. You can read more about GDPR here.

Additionally, organizations which have signed an NDA can go to our Trust Portal in order to view our GDPR assessment on compliance, as well as other security-related documents.

Self-serve invoices; dual-channel improvement

Users of our API can now view and download their self-serve invoices in their dashboards under Billing > Your invoices.

We’ve made readability improvements to the formatting of utterances for dual-channel transcription by combining sequential utterances from the same channel.

We’ve added a patch to improve stability in turnaround times for our async transcription and LeMUR services.

We’ve fixed an issue in which timestamp accuracy would be degraded in certain edge cases when using our async transcription service.

Introducing Universal-1

Last week we released Universal-1, a state-of-the-art multimodal speech recognition model. Universal-1 is trained on 12.5M hours of multilingual audio data, yielding impressive performance across the four key languages for which it was trained - English, Spanish, German, and French.

Word Error Rate across four languages for several providers. Lower is better.

Universal-1 is now the default model for English and Spanish audio files sent to our v2/transcript endpoint for async processing, while German and French will be rolled out in the coming weeks.

You can read more about Universal-1 in our announcement blog or research blog, or you can try it out now on our Playground.

New Streaming STT features

We’ve added a new message type to our Streaming Speech-to-Text (STT) service. This new message type SessionInformation is sent immediately before the final SessionTerminated message when closing a Streaming session, and it contains a field called audio_duration_seconds which contains the total audio duration processed during the session. This feature allows customers to run end-user-specific billing calculations.

To enable this feature, set the enable_extra_session_information query parameter to true when connecting to a Streaming WebSocket.

endpoint_str = 'wss://api.assemblyai.com
/v2/realtime/ws?sample_rate=8000&enable_extra_session_information=true'

This feature will be rolled out in our SDKs soon.

We’ve added a new feature to our Streaming STT service, allowing users to disable Partial Transcripts in a Streaming session. Our Streaming API sends two types of transcripts - Partial Transcripts (unformatted and unpunctuated) that gradually build up the current utterance, and Final Transcripts which are sent when an utterance is complete, containing the entire utterance punctuated and formatted.

Users can now set the disable_partial_transcripts query parameter to true when connecting to a Streaming WebSocket to disable the sending of Partial Transcript messages.

endpoint_str = 'wss://api.assemblyai.com
/v2/realtime/ws?sample_rate=8000&disable_partial_transcripts=true'

This feature will be rolled out in our SDKs soon.

We have fixed a bug in our async transcription service, eliminating File does not appear to contain audio errors. Previously, this error would be surfaced in edge cases where our transcoding pipeline would not have enough resources to transcode a given file, thus failing due to resource starvation.

Dual channel transcription improvements

We’ve made improvements to how utterances are handled during dual-channel transcription. In particular, the transcription service now has elevated sensitivity when detecting utterances, leading to improved utterance insertions when there is overlapping speech on the two channels.

LeMUR concurrency fix

We’ve fixed a temporary issue in which users with low account balances would occasionally be rate-limited to a value less than 30 when using LeMUR.

Fewer "File does not appear to contain audio" errors

We’ve fixed an edge-case bug in our async API, leading to a significant reduction in errors that say File does not appear to contain audio. Users can expect to see an immediate reduction in this type of error. If this error does occur, users should retry their requests given that retries are generally successful.

We’ve made improvements to our transcription service autoscaling, leading to improved turnaround times for requests that use Word Boost when there is a spike in requests to our API.

New developer controls for real-time end-of-utterance

We have released developer controls for real-time end-of-utterance detection, providing developers control over when an utterance is considered complete. Developers can now either manually force the end of an utterance, or set a threshold for time of silence before an utterance is considered complete. 

We have made changes to our English async transcription service that improve sentence segmentation for our Sentiment Analysis, Topic Detection, and Content Moderation models. The improvements fix a bug in which these models would sometimes delineate sentences on titles that end in periods like Dr. and Mrs.

We have fixed an issue in which transcriptions of very long files (8h+) with disfluencies enabled would error out.

PII Redaction and Entity Detection available in 13 additional languages

We have launched PII Text Redaction and Entity Detection for 13 new languages:

  1. Spanish
  2. Finnish
  3. French
  4. German
  5. Hindi
  6. Italian
  7. Korean
  8. Polish
  9. Portuguese
  10. Russian
  11. Turkish
  12. Ukrainian
  13. Vietnamese

We have increased the memory of our transcoding service workers, leading to a significant reduction in errors that say File does not appear to contain audio.

Fewer LeMUR 500 errors

We’ve made improvements to our LeMUR service to reduce the number of 500 errors.

We’ve made improvements to our real-time service, which provides a small increase to the accuracy of timestamps in some edge cases.

Free tier limit increase; Real-time concurrency increase

We have increased the usage limit for our free tier to 100 hours. New users can now use our async API to transcribe up to 100 hours of audio, with a concurrency limit of 5, before needing to upgrade their accounts.

We have rolled out the concurrency limit increase for our real-time service. Users now have access to up to 100 concurrent streams by default when using our real-time service.

Higher concurrency is available upon request with no limit to what our API can support. If you need a higher concurrency limit, please either contact our Sales team or reach out to us at support@assemblyai.com. Note that our real-time service is only available for upgraded accounts.

Latency and cost reductions, concurrency increase

We introduced major improvements to our API’s inference latency, with the majority of audio files now completing in well under 45 seconds regardless of audio duration, with a Real-Time Factor (RTF) of up to .008.

To put an RTF of .008x into perspective, this means you can now convert a:

  • 1h3min (75MB) meeting in 35 seconds
  • 3h15min (191MB) podcast in 133 seconds
  • 8h21min (464MB) video course in 300 seconds

In addition to these latency improvements, we have reduced our Speech-to-Text pricing. You can now access our Speech AI models with the following pricing:

  • Async Speech-to-Text for $0.37 per hour (previously $0.65) 
  • Real-time Speech-to-Text for $0.47 per hour (previously $0.75)

We’ve also reduced our pricing for the following Audio Intelligence models: Key Phrases, Sentiment Analysis, Summarization, PII Audio Redaction, PII Redaction, Auto Chapters, Entity Detection, Content Moderation, and Topic Detection. You can view the complete list of pricing updates on our Pricing page.

Finally, we've increased the default concurrency limits for both our async and real-time services. The increase is immediate for async, and will be rolled out soon for real-time. These new limits are now:

  • 200 for async (up from 32)
  • 100 for real-time (up from 32)

These new changes stem from the efficiencies that our incredible research and engineering teams drive at every level of our inference pipeline, including optimized model compilation, intelligent mini batching, hardware parallelization, and optimized serving infrastructure.

Learn more about these changes and our inference pipeline in our blog post.

Claude 2.1 available through LeMUR

Anthropic’s Claude 2.1 is now generally available through LeMUR. Claude 2.1 is similar to our Default model and has reduced hallucinations, a larger context window, and performs better in citations.

Claude 2.1 can be used by setting the final_model parameter to anthropic/claude-2-1 in API requests to LeMUR. Here's an example of how to do this through our Python SDK:

import assemblyai as aai

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://example.org/customer.mp3")

result = transcript.lemur.task(
  "Summarize the following transcript in three to five sentences.",
  final_model=aai.LemurModel.claude2_1,
)


print(result.response)

You can learn more about setting the model used with LeMUR in our docs.

Real-time Binary support, improved async timestamps

Our real-time service now supports binary mode for sending audio segments. Users no longer need to encode audio segments as base64 sequences inside of JSON objects - the raw binary audio segment can now be directly sent to our API.

Moving forward, sending audio segments through websockets via the audio_data field is considered a deprecated functionality, although it remains the default for now to avoid breaking changes. We plan to support the audio_data field until 2025.

If you are using our SDKs, no changes are required on your end.

We have fixed a bug that would yield a degradation to timestamp accuracy at the end of very long files with many disfluencies.

New Node/JavaScript SDK works in multiple runtimes

We’ve released v4 of our Node JavaScript SDK. Previously, the SDK was developed specifically for Node, but the latest version now works in additional runtimes without any extra steps. The SDK can now be used in the browser, Deno, Bun, Cloudflare Workers, etc.

Check out the SDK’s GitHub repository for additional information.

New Punctuation Restoration and Truecasing models, PCM Mu-law support

We’ve released new Punctuation and Truecasing models, achieving significant improvements for acronyms, mixed-case words, and more.

Below is a visual comparison between our previous Punctuation Restoration and Truecasing models (red) and the new models (green):

Going forward, the new Punctuation Restoration and Truecasing models will automatically be used for async and real-time transcriptions, with no need to upgrade for special access. Use the parameters punctuate and format_text, respectively, to enable/disable the models in a request (enabled by default).

Read more about our new models here.

Our real-time transcription service now supports PCM Mu-law, an encoding used primarily in the telephony industry. This encoding is set by using the `encoding` parameter in requests to our API. You can read more about our PCM Mu-law support here.

We have improved internal reporting for our transcription service, which will allow us to better monitor traffic.

New LeMUR parameter, reduced hold music hallucinations

Users can now directly pass in custom text inputs into LeMUR through the input_text parameter as an alternative to transcript IDs. This gives users the ability to use any information from the async API, formatted however they want, with LeMUR for maximum flexibility.

For example, users can assign action items per user by inputting speaker-labeled transcripts, or pull citations by inputting timestamped transcripts. Learn more about the new input_text parameter in our LeMUR API reference, or check out examples of how to use the input_text parameter in the AssemblyAI Cookbook.

We’ve made improvements that reduce hallucinations which sometimes occurred from transcribing hold music on phone calls. This improvement is effective immediately with no changes required by users.

We’ve fixed an issue that would sometimes yield an inability to fulfill a request when XML was returned by LeMUR /task endpoint.

Reduced latency, improved error messaging

We’ve made improvements to our file downloading pipeline which reduce transcription latency. Latency has been reduced by at least 3 seconds for all audio files, with greater improvements for large audio files provided via external URLs.

We’ve improved error messaging for increased clarity in the case of internal server errors.

New Dashboard features and LeMUR fix

We have released the beta for our new usage dashboard. You can now see a usage summary broken down by async transcription, real-time transcription, Audio Intelligence, and LeMUR. Additionally, you can see charts of usage over time broken down by model.

We have added support for AWS marketplace on the dashboard/account management pages of our web application.

We have fixed an issue in which LeMUR would sometimes fail when handling extremely short transcripts.

New LeMUR features and other improvements

We have added a new parameter to LeMUR that allows users to specify a temperature for LeMUR generation. Temperature refers to how stochastic the generated text is and can be a value from 0 to 1, inclusive, where 0 corresponds to low creativity and 1 corresponds to high creativity. Lower values are preferred for tasks like multiple choice, and higher values are preferred for tasks like coming up with creative summaries of clips for social media.

Here is an example of how to set the temperature parameter with our Python SDK (which is available in version 0.18.0 and up):

import assemblyai as aai

aai.settings.api_key = f"{API_TOKEN}"

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://storage.googleapis.com/aai-web-samples/meeting.mp4")

result = transcript.lemur.summarize(
	temperature=0.25
)

print(result.response)

We have added a new endpoint that allows users to delete the data for a previously submitted LeMUR request. The response data as well as any context provided in the original request will be removed. Continuing the example from above, we can see how to delete LeMUR data using our Python SDK:

request_id = result.request_id

deletion_result = aai.Lemur.purge_request_data(request_id)
print(deletion_result)

We have improved the error messaging for our Word Search functionality. Each phrase used in a Word Search functionality must be 5 words or fewer. We have improved the clarity of the error message when a user makes a request which contains a phrase that exceeds this limit.

We have fixed an edge case error that would occur when both disfluencies and Auto Chapters were enabled for audio files that contained non-fluent English.

Improvements - observability, logging, and patches

We have improved logging for our LeMUR service to allow for the surfacing of more detailed errors to users.

We have increased observability into our Speech API internally, allowing for finer grained metrics of usage.

We have fixed a minor bug that would sometimes lead to incorrect timestamps for zero-confidence words.

We have fixed an issue in which requests to LeMUR would occasionally hang during peak usage due to a memory leak issue.

Multi-language speaker labels

We have recently launched Speaker Labels for 10 additional languages:

  • Spanish
  • Portuguese
  • German
  • Dutch
  • Finnish
  • French
  • Italian
  • Polish
  • Russian
  • Turkish
Audio Intelligence unbundling and price decreases

We have unbundled and lowered the price for our Audio Intelligence models. Previously, the bundled price for all Audio Intelligence models was $2.10/hr, regardless of the number of models used.

We have made each model accessible at a lower, unbundled, per-model rate:

  • Auto chapters: $0.30/hr
  • Content Moderation: $0.25/hr
  • Entity detection: $0.15/hr
  • Key Phrases: $0.06/hr
  • PII Redaction: $0.20/hr
  • Audio Redaction: $0.05/hr
  • Sentiment analysis: $0.12/hr
  • Summarization: $0.06/hr
  • Topic detection: $0.20/hr
New language support and improvements to existing languages

We now support the following additional languages for asynchronous transcription through our /v2/transcript endpoint:

  • Chinese
  • Finnish
  • Korean
  • Polish
  • Russian
  • Turkish
  • Ukrainian
  • Vietnamese

Additionally, we've made improvements in accuracy and quality to the following languages:

  • Dutch
  • French
  • German
  • Italian
  • Japanese
  • Portuguese
  • Spanish

You can see a full list of supported languages and features here. You can see how to specify a language in your API request here. Note that not all languages support Automatic Language Detection.

Pricing decreases

We have decreased the price of Core Transcription from $0.90 per hour to $0.65 per hour, and decreased the price of Real-Time Transcription from $0.90 per hour to $0.75 per hour.

Both decreases were effective as of August 3rd.

Significant Summarization model speedups

We’ve implemented changes that yield between a 43% to 200% increase in processing speed for our Summarization models, depending on which model is selected, with no measurable impact on the quality of results.

We have standardized the response from our API for automatically detected languages that do not support requested features. In particular, when Automatic Language Detection is used and the detected language does not support a feature requested in the transcript request, our API will return null in the response for that feature.

Introducing LeMUR, the easiest way to build LLM apps on spoken data

We've released LeMUR - our framework for applying LLMs to spoken data - for general availability. LeMUR is optimized for high accuracy on specific tasks:

  1. Custom Summary allows users to automatically summarize files in a flexible way
  2. Question & Answer allows users to ask specific questions about audio files and receive answers to these questions
  3. Action Items allows users to automatically generate a list of action items from virtual or in-person meetings

Additionally, LeMUR can be applied to groups of transcripts in order to simultaneously analyze a set of files at once, allowing users to, for example, summarize many podcast episode or ask questions about a series of customer calls.

Our Python SDK allows users to work with LeMUR in just a few lines of code:

# version 0.15 or greater
import assemblyai as aai

# set your API key
aai.settings.api_key = f"{API_TOKEN}"

# transcribe the audio file (meeting recording)
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://storage.googleapis.com/aai-web-samples/meeting.mp4")

# generate and print action items
result = transcript.lemur.action_items(
    context="A GitLab meeting to discuss logistics",
    answer_format="**<topic header>**\n<relevant action items>\n",
)

print(result.response)

Learn more about LeMUR in our blog post, or jump straight into the code in our associated Colab notebook.

Introducing our Conformer-2 model

We've released Conformer-2, our latest AI model for automatic speech recognition. Conformer-2 is trained on 1.1M hours of English audio data, extending Conformer-1 to provide improvements on proper nouns, alphanumerics, and robustness to noise.

Conformer-2 is now the default model for all English audio files sent to the v2/transcript endpoint for async processing and introduces no breaking changes.

We’ll be releasing Conformer-2 for real-time English transcriptions within the next few weeks.

Read our full blog post about Conformer-2 here. You can also try it out in our Playground.

New parameter and timestamps fix

We’ve introduced a new, optional speech_threshold parameter, allowing users to only transcribe files that contain at least a specified percentage of spoken audio, represented as a ratio in the range [0, 1].

You can use the speech_threshold parameter with our Python SDK as below:

import assemblyai as aai

aai.settings.api_key = f"{ASSEMBLYAI_API_KEY}"

config = aai.TranscriptionConfig(speech_threshold=0.1)

file_url = "https://github.com/AssemblyAI-Examples/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3"

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(file_url, config)

print(transcript.text)
Smoke from hundreds of wildfires in Canada is triggering air quality alerts throughout the US. Skylines from ...

If the percentage of speech in the audio file does not meet or surpass the provided threshold, then the value of transcript.text will be None and you will receive an error:

if not transcript.text:
	print(transcript.error)
Audio speech threshold 0.9461 is below the requested speech threshold value 1.0

As usual, you can also include the speech_threshold parameter in the JSON of raw HTTP requests for any language.

We’ve fixed a bug in which timestamps could sometimes be incorrectly reported for our Topic Detection and Content Safety models.

We’ve made improvements to detect and remove a hallucination that would sometimes occur with specific audio patterns.

Character sequence improvements

We’ve fixed an issue in which the last character in an alphanumeric sequence could fail to be transcribed. The fix is effective immediately and constitutes a 95% reduction in errors of this type.

We’ve fixed an issue in which consecutive identical numbers in a long number sequence could fail to be transcribed. This fix is effective immediately and constitutes a 66% reduction in errors of this type.

Speaker Labels improvement

We’ve made improvements to the Speaker Labels model, adjusting the impact of the speakers_expected parameter to better allow the model to determine the correct number of unique speakers, especially in cases where one or more speakers talks substantially less than others.

We’ve expanded our caching system to include additional third-party resources to help further ensure our continued operations in the event of external resources being down.

Significant processing time improvement

We’ve made significant improvements to our transcoding pipeline, resulting in a 98% overall speedup in transcoding time and a 12% overall improvement in processing time for our asynchronous API.

We’ve implemented a caching system for some third-party resources to ensure our continued operations in the event of external resources being down.

Announcing LeMUR - our new framework for applying powerful LLMs to transcribed speech

We’re introducing our new framework LeMUR, which makes it simple to apply Large Language Models (LLMs) to transcripts of audio files up to 10 hours in length.

LLMs unlock a range of impressive capabilities that allow teams to build powerful Generative AI features. However, building these features is difficult due to the limited context windows of modern LLMs, among other challenges that necessitate the development of complicated processing pipelines.

LeMUR circumvents this problem by making it easy to apply LLMs to transcribed speech, meaning that product teams can focus on building differentiating Generative AI features rather than focusing on building infrastructure. Learn more about what LeMUR can do and how it works in our announcement blog, or jump straight to trying LeMUR in our Playground.

New PII and Entity Detection Model

We’ve upgraded to a new and more accurate PII Redaction model, which improves credit card detections in particular.

We’ve made stability improvements regarding the handling and caching of web requests. These improvements additionally fix a rare issue with punctuation detection.

Multilingual and stereo audio fixes, & Japanese model retraining

We’ve fixed two edge cases in our async transcription pipeline that were producing non-deterministic results from multilingual and stereo audio.

We’ve improved word boundary detection in our Japanese automatic speech recognition model. These changes are effective immediately for all Japanese audio files submitted to AssemblyAI.

Decreased latency and improved password reset

We’ve implemented a range of improvements to our English pipeline, leading to an average 38% improvement in overall latency for asynchronous English transcriptions.

We’ve made improvements to our password reset process, offering greater clarity to users attempting to reset their passwords while still ensuring security throughout the reset process.

Conformer-1 now available for Real-Time transcription, new Speaker Labels parameter, and more

We're excited to announce that our new Conformer-1 Speech Recognition model is now available for real-time English transcriptions, offering a 24.3% relative accuracy improvement.

Effective immediately, this state-of-the-art model will be the default model for all English audio data sent to the wss://api.assemblyai.com/v2/realtime/ws WebSocket API.

The Speaker Labels model now accepts a new optional parameter called speakers_expected. If you have high confidence in the number of speakers in an audio file, then you can specify it with speakers_expected in order to improve Speaker Labels performance, particularly for short utterances.

TLS 1.3 is now available for use with the AssemblyAI API. Using TLS 1.3 can decrease latency when establishing a connection to the API.

Our PII redaction scaling has been improved to increase stability, particularly when processing longer files.

We've improved the quality and accuracy of our Japanese model.

Short transcripts that are unable to be summarized will now return an empty summary and a successful transcript.

Introducing our Conformer-1 model

We've released our new Conformer-1 model for speech recognition. Conformer-1 was trained on 650K hours of audio data and is our most accurate model to date.

Conformer-1 is now the default model for all English audio files sent to the /v2/transcript endpoint for async processing.

We'll be releasing it for real-time English transcriptions within the next two weeks, and will add support for more languages soon.

New AI Models for Italian / Japanese Punctuation Improvements

Our Content Safety and Topic Detection models are now available for use with Italian audio files.

We’ve made improvements to our Japanese punctuation model, increasing relative accuracy by 11%. These changes are effective immediately for all Japanese audio files submitted to AssemblyAI.

Hindi Punctuation Improvements

We’ve made improvements to our Hindi punctuation model, increasing relative accuracy by 26%. These changes are effective immediately for all Hindi audio files submitted to AssemblyAI.

We’ve tuned our production infrastructure to reduce latency and improve overall consistency when using the Topic Detection and Content Moderation models.

Improved PII Redaction

We’ve released a new version of our PII Redaction model to improve PII detection accuracy, especially for credit card and phone number edge cases. Improvements are effective immediately for all API calls that include PII redaction.

Automatic Language Detection Upgrade

We’ve released a new version of our Automatic Language Detection model that better targets speech-dense parts of audio files, yielding improved accuracy. Additionally, support for dual-channel and low-volume files has been improved. All changes are effective immediately.

Our Core Transcription API has been migrated from EC2 to ECS in order to ensure scalable, reliable service and preemptively protect against service interruptions.

Password Reset

Users can now reset their passwords from our web UI. From the Dashboard login, simply click “Forgot your password?” to initiate a password reset. Alternatively, users who are already logged in can change their passwords from the Account tab on the Dashboard.

The maximum phrase length for our Word Search feature has been increased from 2 to 5, effective immediately.

Dual Channel Support for Conversational Summarization / Improved Timestamps

We’ve made updates to our Conversational Summarization model to support dual-channel files. Effective immediately, dual_channel may be set to True when summary_model is set to conversational.

We've made significant improvements to timestamps for non-English audio. Timestamps are now typically accurate between 0 and 100 milliseconds. This improvement is effective immediately for all non-English audio files submitted to AssemblyAI for transcription.

Improved Transcription Accuracy for Phone Numbers

We’ve made updates to our Core Transcription model to improve the transcription accuracy of phone numbers by 10%. This improvement is effective immediately for all audio files submitted to AssemblyAI for transcription.

We've improved scaling for our read-only database, resulting in improved performance for read-only requests.

v9 Transcription Model Released

We are happy to announce the release of our most accurate Speech Recognition model to date - version 9 (v9). This updated model delivers increased performance across many metrics on a wide range of audio types.

Word Error Rate, or WER, is the primary quantitative metric by which the performance of an automatic transcription model is measured. Our new v9 model shows significant improvements across a range of different audio types, as seen in the chart below, with a more than 11% improvement on average.

In addition to standard overall WER advancements, the new v9 model shows marked improvements with respect to proper nouns. In the chart below, we can see the relative performance increase of v9 over v8 for various types of audio, with a nearly 15% improvement on average.

The new v9 transcription model is currently live in production. This means that customers will see improved performance with no changes required on their end. The new model will automatically be used for all transcriptions created by our /v2/transcript endpoint going forward, with no need to upgrade for special access.

While our customers enjoy the elevated performance of the v9 model, our AI research team is already hard at work on our v10 model, which is slated to launch in early 2023. Building upon v9, the v10 model is expected to radically improve the state of the art in speech recognition.

Try our new v9 transcription model through your browser using the AssemblyAI Playground. Alternatively, sign up for a free API token to test it out through our API, or schedule a time with our AI experts to learn more.

New Summarization Models Tailored to Use Cases

We are excited to announce that new Summarization models are now available! Developers can now choose between multiple summary models that best fit their use case and customize the output based on the summary length.

The new models are:

  • Informative which is best for files with a single speaker, like a presentation or lecture
  • Conversational which is best for any multi-person conversation, like customer/agent phone calls or interview/interviewee calls
  • Catchy which is best for creating video, podcast, or media titles

Developers can use the summary_model parameter in their POST request to specify which of our summary models they would like to use. This new parameter can be used along with the existing summary_type parameter to allow the developer to customize the summary to their needs.

import requests
endpoint = "https://api.assemblyai.com/v2/transcript"
json = {
    "audio_url": "https://bit.ly/3qDXLG8",
    "summarization": True,
    "summary_model": "informative", # conversational | catchy
    "summary_type": "bullets" # bullets_verbose | gist | headline | paragraph
}
headers = {
	"authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}
response = requests.post(endpoint, json=json, headers=headers)
print(response.json())

Check out our latest blog post to learn more about the new Summarization models or head to the AssemblyAI Playground to test Summarization in your browser!

Improved Transcription Accuracy for COVID

We’ve made updates to our Core Transcription model to improve the transcription accuracy of the word COVID. This improvement is effective immediately for all audio files submitted to AssemblyAI for transcription.

Static IP support for webhooks is now generally available!

Outgoing webhook requests sent from AssemblyAI will now originate from a static IP address 44.238.19.20, rather than a dynamic IP address. This gives you the ability to easily validate that the source of the incoming request is coming from our server. Optionally, you can choose to whitelist this static IP address to add an additional layer of security to your system.

See our walkthrough on how to start receiving webhooks for your transcriptions.

New Audio Intelligence Models: Summarization
import requests
endpoint = "https://api.assemblyai.com/v2/transcript"
json = {
  "audio_url": "https://bit.ly/3qDXLG8",
    "summarization": True,
    "summary_type": "bullets" # paragraph | headline | gist 
}
headers = {
  "authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}
response = requests.post(endpoint, json=json, headers=headers)
print(response.json())

Starting today, you can now transcribe and summarize entire audio files with a single API call.

To enable our new Summarization models, include the following parameter: "summarization": true in your POST request to /v2/transcript. When the transcription finishes, you will see the summary key in the JSON response containing the summary of your transcribed audio or video file.

By default, summaries will be returned in the style of bullet points. You can customize the style of summary by including the optional summary_type parameter in your POST request along with one of the following values: paragraph, headline, or gist. Here is the full list of summary types we support.

// summary_type = "paragraph"

"summary": "Josh Seiden and Brian Donohue discuss the
topic of outcome versus output on Inside Intercom.
Josh Seiden is a product consultant and author who has
just released a book called Outcomes Over Output.
Brian is product management director and he's looking
forward to the chat."

// summary_type = "headline"

"summary": "Josh Seiden and Brian Donohue discuss the
topic of outcomes versus output."

// summary_type = "gist"

"summary": "Outcomes over output"

// summary_type = = "bullets"

"summary": "Josh Seiden and Brian Donohue discuss
the topic of outcome versus output on Inside Intercom.
Josh Seiden is a product consultant and author who has
just released a book called Outcomes Over Output.
Brian is product management director and he's looking
forward to the chat.\n- ..."

Examples of use cases for Summarization include:

  • Identify key takeaways from phone calls to speed up post-call review and reduce manual summarization
  • Summarize long podcasts into short descriptions so users can preview before they listen.
  • Instantly generate meetings summaries to quickly recap virtual meetings and highlight post-meeting actions
  • Suggest 3-5 word video titles automatically for user-generated content
  • Synthesize long educational courses, lectures, and media broadcasts into their most important points for faster consumption

We're really excited to see what you build with our new Summarization models. To get started, try it out for free in our no-code playground or visit our documentation for more info on how to enable Summarization in your API requests.

Automatic Casing / Short Utterances

We’ve improved our Automatic Casing model and fixed a minor bug that caused over-capitalization in English transcripts. The Automatic Casing model is enabled by default with our Core Transcription API to improve transcript readability for video captions (SRT/VTT). See our documentation for more info on Automatic Casing.

Our Core Transcription model has been fine-tuned to better detect short utterances in English transcripts. Examples of short utterances include one-word answers such as “No.” and “Right.” This update will take effect immediately for all customers.

Static IP Support for Webhooks

Over the next few weeks, we will begin rolling out Static IP support for webhooks to customers in stages.

Outgoing webhook requests sent from AssemblyAI will now originate from a static IP address 44.238.19.20, rather than a dynamic IP address. This gives you the ability to easily validate that the source of the incoming request is coming from our server. Optionally, you can choose to whitelist this static IP address to add an additional layer of security to your system.

See our walkthrough on how to start receiving webhooks for your transcriptions.

Improved Number Transcription
PII Redaction Examples

We’ve made improvements to our Core Transcription model to better identify and transcribe numbers present in your audio files.

Accurate number transcription is critical for customers that need to redact Personally Identifiable Information (PII) that gets exchanged during phone calls. Examples of PII include credit card numbers, addresses, phone numbers, and social security numbers.

In order to help you handle sensitive user data at scale, our PII Redaction model automatically detects and removes sensitive info from transcriptions. For example, when PII redaction is enabled, a phone number like 412-412-4124 would become ###-###-####.

To learn more, check out our blog that covers all of our PII Redaction Policies or try our PII Redaction model in our Sandbox here!

Improved Disfluency Timestamps

We've updated our Disfluency Detection model to improve the accuracy of timestamps for disfluency words.

By default, disfluencies such as "um" or "uh" and "hm" are automatically excluded from transcripts. However, we allow customers to include these filler words by simply setting the disfluencies parameter to true in their POST request to /v2/transcript, which enables our Disfluency Detection model.

More info and code examples can be found here.

Speaker Label Improvement

We've improved the Speaker Label model’s ability to identify unique speakers for single word or short utterances.

Historical Transcript Bug Fix

We've fixed a bug with the Historical Transcript endpoint that was causing null to appear as the value of the completed key.

Japanese Transcription Now Available
Code snippet for Japanese transcription

Today, we’re releasing our new Japanese transcription model to help you transcribe and analyze your Japanese audio and video files using our cutting-edge AI.

Now you can automatically convert any Japanese audio or video file to text by including "language_code": "ja" in your POST request to our /v2/transcript endpoint.

In conjunction with transcription, we’ve also added Japanese support for our AI models including Custom Vocabulary (Word Boost), Custom Spelling, Automatic Punctuation / Casing, Profanity Filtering, and more. This means you can boost transcription accuracy with more granularity based on your use case. See the full list of supported models available for Japanese transcriptions here.

To get started, visit our walkthrough on Specifying a Language on our AssemblyAI documentation page or try it out now in our Sandbox!

Hindi Transcription / Custom Webhook Headers
Code snippet for Hindi transcriptions

We’ve released our new Hindi transcription model to help you transcribe and analyze your Hindi audio and video files.

Now you can automatically convert any Hindi audio or video file to text by including "language_code": "hi" in your POST request to our /v2/transcript endpoint.

We’ve also added Hindi support for our AI models including Custom Vocabulary (Word Boost), Custom Spelling, Automatic Punctuation / Casing, Profanity Filtering, and more. See the full list of supported models available for Hindi transcriptions here.

To get started with Hindi transcription, visit our walkthrough on Specifying a Language on our AssemblyAI documentation page.

Our Webhook service now supports the use of Custom Headers for authentication.

A Custom Header can be used for added security to authenticate webhook requests from AssemblyAI. This feature allows a developer to optionally provide a value to be used as an authorization header on the returning webhook from AssemblyAI, giving the ability to validate incoming webhook requests.

To use a Custom Header, you will include two additional parameters in your POST request to /v2/transcript: webhook_auth_header_name and webhook_auth_header_value. The webhook_auth_header_name parameter accepts a string containing the header's name which will be inserted into the webhook request. The webhook_auth_header_value parameter accepts a string with the value of the header that will be inserted into the webhook request. See our Using Webhooks documentation to learn more and view our code examples.

Improved Speaker Labels Accuracy and Speaker Segmentation

  • Improved the overall accuracy of the Speaker Labels feature and the model’s ability to segment speakers.

  • Fix a small edge case that would occasionally cause some transcripts to complete with NULL as the language_code value.
Content Moderation and Topic Detection Available for Portuguese

  • Improved Inverse Text Normalization of money amounts in transcript text.

  • Addressed an issue with Real-Time Transcription that would occasionally cause variance in timestamps over the course of a session.
  • Fixed an edge case with transcripts including Filler Words that would occasionally cause server errors.
Automatic Language Detection Available for Dutch and Portuguese

  • Accuracy of the Automatic Language Detection model improved on files with large amounts of silence.
  • Improved speaker segmentation accuracy for Speaker Labels.
Dutch and Portuguese Support Released

  • Dutch and Portuguese transcription is now generally available for our /v2/transcript endpoint. See our documentation for more information on specifying a language in your POST request.
Content Moderation and Topic Detection Available for French, German, and Spanish

  • Improved redaction accuracy for credit_card_number, credit_card_expiration, and credit_card_cvv policies in our PII Redaction feature.

  • Fixed an edge case that would occasionally affect the capitalization of words in transcripts when disfluencies was set to true.
French, German, and Italian Support Released

  • French, German, and Italian transcription is now publicly available. Check out our documentation for more information on Specifying a Language in your POST request.

  • Released v2 of our Spanish model, improving absolute accuracy by ~4%.
  • Automatic Language Detection now supports French, German, and Italian.
  • Reduced the volume of the beep used to redact PII information in redacted audio files.
Miscellaneous Bug Fixes

  • Fixed an edge case that would occasionally affect timestamps for a small number of words when disfluencies was set to true.
  • Fixed an edge case where PII audio redaction would occasionally fail when using local files.
New Policies Added for PII Redaction and Entity Detection

Spanish Language Support, Automatic Language Detection, and Custom Spelling Released

  • Spanish transcription is now publicly available. Check out our documentation for more information on Specifying a Language in your POST request.
  • Automatic Language Detection is now available for our /v2/transcript endpoint. This feature can identify the dominant language that’s spoken in an audio file and route the file to the appropriate model for the detected language.
  • Our new Custom Spelling feature gives you the ability to specify how words are spelled or formatted in the transcript text. For example, Custom Spelling could be used to change all instances "CS 50" to "CS50".
Auto Chapters v6 Released

  • Released Auto Chapters v6, improving the summarization of longer chapters.
Auto Chapters v5 Released

  • Auto Chapters v5 released, improving headline and gist generation and quote formatting in the summary key.

  • Fixed an edge case in Dual-Channel files where initial words in an audio file would occasionally be missed in the transcription.
Regional Spelling Improvements

  • Region-specific spelling improved for en_uk and en_au language codes.
  • Improved the formatting of “MP3” in transcripts.
  • Improved Real-Time transcription error handling for corrupted audio files.
Real-Time v3 Released

  • Released v3 of our Real-Time Transcription model, improving overall accuracy by 18% and proper noun recognition by 23% relative to the v2 model.

  • Improved PII Redaction and Entity Detection for CREDIT_CARD_CVV and LOCATION.
Auto Chapters v4 Released, Auto Retry Feature Added

  • Added an Auto Retry feature, which automatically retries transcripts that fail with a Server error, developers have been alerted message. This feature is enabled by default. To disable it, visit the Account tab in your Developer Dashboard.

  • Auto Chapters v4 released, improving chapter summarization in the summary key.
  • Added a trailing period for the gist key in the Auto Chapters feature.
Auto Chapters v3 Released

  • Released v3 of our Auto Chapters model, improving the model’s ability to segment audio into chapters and chapter boundary detection by 56.3%.
  • Improved formatting for Auto Chapters summaries. The summary, headline, and gist keys now include better punctuation, casing, and text formatting.
Webhook Status Codes, Entity Detection Improved

  • POST requests from the API to webhook URLs will now accept any status code from 200 to 299 as a successful HTTP response. Previously only 200 status codes were accepted.
  • Updated the text key in our Entity Detection feature to return the proper noun rather than the possessive noun. For example, Andrew instead of Andrew’s.

  • Fixed an edge case with Entity Detection where under certain contexts, a disfluency could be identified as an entity.
Punctuation and Casing Accuracy Improved, Inverse Text Normalization Model Updated

  • Released v4 of our Punctuation model, increasing punctuation and casing accuracy by ~2%.
  • Updated our Inverse Text Normalization (ITN) model for our /v2/transcript endpoint, improving web address and email address formatting and fixing the occasional number formatting issue.

  • Fixed an edge case where multi-channel files would return no text when the two channels were out of phase with each other.
Support for Non-English Languages Coming Soon

  • Our Deep Learning team has been hard at work training our new non-English language models. In the coming weeks, we will be adding support for French, German, Italian, and Spanish.
Shorter Summaries Added to Auto Chapters, Improved Filler Word Detection

  • Added a new gist key to the Auto Chapters feature. This new key provides an ultra-short, usually 3 to 8 word summary of the content spoken during that chapter.

  • Implemented profanity filtering into Auto Chapters, which will prevent the API from generating a summary, headline, or gist that includes profanity.
  • Improved Filler Word (aka, disfluencies) detection by ~5%.
  • Improved accuracy for Real-Time Streaming Transcription.

  • Fixed an edge case where WebSocket connections for Real-Time Transcription sessions would occasionally not close properly after the session was terminated. This resulted in the client receiving a 4031 error code even after sending a session termination message.
  • Corrected a bug that occasionally attributed disfluencies to the wrong utterance when Speaker Labels or Dual-Channel Transcription was enabled.
v8.5 Asynchronous Transcription Model Released

  • Our Asynchronous Speech Recognition model is now even better with the release of v8.5.
  • This update improves overall accuracy by 4% relative to our v8 model.
  • This is achieved by improving the model’s ability to handle noisy or difficult-to-decipher audio.
  • The v8.5 model also improves Inverse Text Normalization for numbers.
New and Improved API Documentation

  • Launched the new AssemblyAI Docs, with more complete documentation and an easy-to-navigate interface so developers can effectively use and integrate with our API. Click here to view the new and improved documentation.

  • Added two new fields to the FinalTranscript response for Real-time Transcriptions. The punctuated key is a Boolean value indicating if punctuation was successful. The text_formatted key is a Boolean value indicating if Inverse Text Normalization (ITN) was successful.
Inverse Text Normalization Added to Real-Time, Word Boost Accuracy Improved

  • Inverse Text Normalization (ITN) added for our /v2/realtime and /v2/stream endpoints. ITN improves formatting of entities like numbers, dates, and proper nouns in the transcription text.

  • Improved accuracy for Custom Vocabulary (aka, Word Boosts) with the Real-Time transcription API.

  • Fixed an edge case that would sometimes cause transcription errors when disfluencies was set to true and no words were identified in the audio file.
Entity Detection Released, Improved Filler Word Detection, Usage Alerts

  • v1 release of Entity Detection - automatically detects a wide range of entities like person and company names, emails, addresses, dates, locations, events, and more.
  • To include Entity Detection in your transcript, set entity_detection to true in your POST request to /v2/transcript.
  • When your transcript is complete, you will see an entities key towards the bottom of the JSON response containing the entities detected, as shown here:
  • Read more about Entity Detection in our official documentation.
  • Usage Alert feature added, allowing customers to set a monthly usage threshold on their account along with a list of email addresses to be notified when that monthly threshold has been exceeded. This feature can be enabled by clicking “Set up alerts” on the “Developers” tab in the Dashboard.
  • When Content Safety is enabled, a summary of the severity scores detected will now be returned in the API response under the severity_score_summary nested inside of the content_safety_labels  key, as shown below.

  • Improved Filler Word (aka, disfluencies) detection by ~25%.

  • Fixed a bug in Auto Chapters that would occasionally add an extra space between sentences for headlines and summaries.
Additional MIME Type Detection Added for OPUS Files

  • Added additional MIME type detection to detect a wider variety of OPUS files.

  • Fixed an issue with word timing calculations that caused issues with speaker labeling for a small number of transcripts.
Custom Vocabulary Accuracy Significantly Improved

  • Significantly improved the accuracy of Custom Vocabulary, and the impact of the boost_param field to control the weight for Custom Vocabulary.
  • Improved precision of word timings.
New Auto Chapters, Sentiment Analysis, and Disfluencies Features Released

  • v1 release of Auto Chapters - which provides a "summary over time" by breaking audio/video files into "chapters" based on the topic of conversation. Check out our blog to read more about this new feature. To enable Auto Chapters in your request, you can set auto_chapters: true in your POST request to /v2/transcript.
  • v1 release of Sentiment Analysis - that determines the sentiment of sentences in a transcript as "positive", "negative", or "neutral". Sentiment Analysis can be enabled by including the sentiment_analysis: true parameter in your POST request to /v2/transcript.
  • Filler-words like "um" and "uh" can now be included in the transcription text. Simply include disfluencies: true in your POST request to /v2/transcript.

  • Deployed Speaker Labels version 1.3.0. Improves overall diarization/labeling accuracy.
  • Improved our internal auto-scaling for asynchronous transcription, to keep turnaround times consistently low during periods of high usage.
New Language Code Parameter for English Spelling

  • Added a new language_code parameter when making requests to /v2/transcript.
  • Developers can set this to en_us, en_uk, and en_au, which will ensure the correct English spelling is used - British English, Australian English, or US English (Default).
  • Quick note: for customers that were historically using the assemblyai_en_au or assemblyai_en_uk acoustic models, the language_code parameter is essentially redundant and doesn't need to be used.

  • Fixed an edge-case where some files with prolonged silences would occasionally have a single word predicted, such as "you" or "hi."
New Features Coming Soon, Bug Fixes

  • This week, our engineering team has been hard at work preparing for the release of exciting new features like:
  • Chapter Detection: Automatically summarize audio and video files into segments (aka "chapters").
  • Sentiment Analysis: Determine the sentiment of sentences in your transcript as "positive", "negative", or "neutral".
  • Disfluencies: Detects filler-words like "um" and "uh".

  • Improved average real-time latency by 2.1% and p99 latency by 0.06%.

  • Fixed an edge-case where confidence scores in the utterances category for dual-channel audio files would occasionally receive a confidence score greater than 1.0.
Improved v8 Model Processing Speed

  • Improved the API's ability to handle audio/video files with a duration over 8 hours.

  • Further improved transcription processing times by 12%.
  • Fixed an edge case in our responses for dual channel audio files where if speaker 2 interrupted speaker 1,  the text from speaker 2 would cause the text from speaker 1 to be split into multiple turns, rather than contextually keeping all of speaker 1's text together.
v8 Transcription Model Released

  • Today, we're happy to announce the release of our most accurate Speech Recognition model for asynchronous transcription to date—version 8 (v8).
  • This new model dramatically improves overall accuracy (up to 19% relative), and proper noun accuracy as well (up to 25% relative).
  • You can read more about our v8 model in our blog here.

  • Fixed an edge case where a small percentage of short (<60 seconds in length) dual-channel audio files, with the same audio on each channel, resulted in repeated words in the transcription.
v2 Real-Time and v4 Topic Detection Models Released

  • Launched our v2 Real-Time Streaming Transcription model (read more on our blog).
  • This new model improves accuracy of our Real-Time Streaming Transcription by ~10%.
  • Launched our Topic Detection v4 model, with an accuracy boost of ~8.37% over v3 (read more on our blog).
v3 Topic Detection Model, PII Redaction Bug Fixes

  • Released our v3 Topic Detection model.
  • This model dramatically improves the Topic Detection feature's ability to accurately detect topics based on context.
  • For example, in the following text, the model was able to accurately predict "Rugby" without the mention of the sport directly, due to the mention of "Ed Robinson" (a Rugby coach).

  • PII Redaction has been improved to better identify (and redact) phone numbers even when they are not explicitly referred to as a phone number.

  • Released a fix for PII Redaction that corrects an issue where the model would sometimes detect phone numbers as credit card numbers or social security numbers.
Severity Scores for Content Safety
  • The API now returns a severity score along with the confidence and label keys when using the Content Safety feature.
  • The severity score measures how intense a detected Content Safety label is on a scale of 0 to 1.
  • For example, a natural disaster that leads to mass casualties will have a score of 1.0, while a small storm that breaks a mailbox will only be 0.1.

  • Fixed an edge case where a small number of transcripts with Automatic Transcript Highlights turned on were not returning any results.
Real-time Transcription and Streaming Fixes

  • Fixed an edge case where higher sample rates would occasionally trigger a Client sent audio too fast error from the Real-Time Streaming WebSocket API.
  • Fixed an edge case where some streams from Real-Time Streaming WebSocket API were held open after a customer idled their session.
  • Fixed an edge case in the /v2/stream endpoint, where large periods of silence would occasionally cause automatic punctuation to fail.
  • Improved error handling when non-JSON input is sent to the /v2/transcript endpoint.
Punctuation v3, Word Search, Bug Fixes

  • v3 Punctuation Model released.
  • v3 brings improved accuracy to automatic punctuation and casing for both async (/v2/transcript) and real-time (WebSocket API) transcripts.
  • Released an all-new Word Search feature that will allow developers to search for words in a completed transcript.
  • This new feature returns how many times the word was spoken, the index of that word in the transcript's JSON response word list/array, and the associated timestamps for each matched word.

  • Fixed an issue causing a small subset of words not to be filtered when profanity filtering was turned on.
New Dashboard and Real-Time Data

This week, we released an entirely new dashboard for developers:

The new developer dashboard introduces:

  • Better usage reports for API usage and spend
  • Easier access to API Docs, API Tokens, and Account information
  • A no-code demo to test the API on your audio/video files without having to write any code
General Improvements
  • Fixed a bug with PII Redaction, where sometimes dollar amount and date tokens were not being properly redacted.
  • AssemblyAI now supports even more audio/video file formats thanks to improvements to our audio transcoding pipeline!
  • Fixed a rare bug where a small percentage of transcripts (0.01%) would incorrectly sit in a status of "queued" for up to 60 seconds.
ITN Model Update

Today we've released a major improvement to our ITN (Inverse Text Normalization) model. This results in better formatting for entities within the transcription, such as phone numbers, money amounts, and dates.

For example:

Money:

  • Spoken: "Hey, do you have five dollars?"
  • Model output with ITN: "Hey, do you have $5?"

Years:

  • Spoken: "Yes, I believe it was back in two thousand eight"
  • Model output with ITN: "Yes, I believe it was back in 2008."
Punctuation Model v2.5 Released

Today we've released an updated Automatic Punctuation and Casing Restoration model (Punctuation v2.5)! This update results in improved capitalization of proper nouns in transcripts, reduces over-capitalization issues where some words like were being incorrectly capitalized, and improves some edge cases around words with commas around them. For example:

  • "....in the Us" now becomes "....in the US."
  • "whatsapp," now becomes "WhatsApp,"
Content Safety Model (v7) Released

We have released an updated Content Safety Model - v7! Performance for 10 out of all 19 Content Safety labels has been improved, with the biggest improvements being for the Profanity and Natural Disasters labels.

Real-Time Transcription Model v1.1 Released

We have just released a major real-time update!

Developers will now be able to use the word_boost parameter in requests to the real-time API, allowing you to introduce your own custom vocabulary to the model for that given session! This custom vocabulary will lead to improved accuracy for the provided words.

General Improvements

We will now be limiting one websocket connection per real-time session to ensure the integrity of a customer's transcription and prevent multiple users/clients from using the websocket same session.

Note: Developers can still have multiple real-time sessions open in parallel, up to the Concurrency Limit on the account. For example, if an account has a Concurrency Limit of 32, that account could have up to 32 concurrent real-time sessions open.

Topic Detection Model v2 Released

Today we have released v2 of our Topic Detection Model. This new model will predict multiple topics for each paragraph of text, whereas v1 was limited to predicting a single. For example, given the text:

"Elon Musk just released a new Tesla that drives itself!"

v1:

  • Automotive>AutoType>DriverlessCars: 1

v2:

  • Automotive>AutoType>DriverlessCars: 1
  • PopCulture : 0.84
  • PopCulture>CelebrityStyle: 0.56

This improvement will result in the visual output looking significantly better, and containing more informative responses for developers!

Increased Number of Categories Returned for Topic Detection Summary

In this minor improvement, we have increased the number of topics the model can return in the summary key of the JSON response from 10 to 20.

Temporary Tokens for Real-Time

Often times, developers will need to expose their AssemblyAI API Key in their client applications when establishing connections with our real-time streaming transcription API. Now, developers can create a temporary API token that expires in a customizable amount of time (similar to an AWS S3 Temporary Authorization URL) that can safely be exposed in the client applications and front-ends.

This will allow developers to create short-lived API tokens designed to be used securely in the browser, along with authorization within the query string!

For example, authenticating in the query parameters with a temporary token would look like so:

wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token={TEMP_TOKEN}

For more information, you can view our Docs!

Adding "Marijuana" and "Sensitive Social Issues" as Possible Content Safety Labels

In this minor update, we improve the accuracy across all Content Safety labels, and add two new labels for better content categorization. The two new labels are sensitive_social_issues and marijuana.

New label definitions:

  • sensitive_social_issues: This category includes content that may be considered insensitive, irresponsible, or harmful to specific groups based on their beliefs, political affiliation, sexual orientation, or gender identity.
  • marijuana: This category includes content that discusses marijuana or its usage.
Real-Time Transcription is Now GA

We are pleased to announce the official release of our Real-Time Streaming Transcription API! This API uses WebSockets and a fast Conformer Neural Network architecture that allows for a quick and accurate transcription in real-time.

Find out more in our Docs here!

Content Safety Detection and Topic Detection are now GA!

Today we have released two of our enterprise-level models, Content Safety Detection and Topic Detection, to all users!

Now any developer can make use of these cutting edge models within their applications and products. Explore these new features in our Docs:

Minor Update to PII Redaction

With this minor update, our Redaction Model will better detect Social Security Numbers and Medical References for additional security and data protection!

New Punctuation Model (v2)

Today we released a new punctuation model that is more extensive than its predecessor, and will drive improvements in punctuation and casing accuracy!

New Features & Updates

List Historical Transcripts

  • Developers can get a list of their historical transcriptions. This list can be filtered by status and date. This new endpoint will allow developers to see if they have any queued, processing, or throttled transcriptions.

Pre-Formatted Paragraphs

  • Developers can now get pre-formatted paragraphs by calling our new paragraphs endpoint! The model will attempt to semantically break the transcript up into paragraphs of five sentences or less.

You can explore each feature further in our Docs:

Topic Detection Response Improvements

  • Now each topic will include timestamps for each segment of classified text. We have also added a new summary key that will contain the confidence of all unique topics detected throughout the entire transcript.

  • We have made improvements to our Speaker Diarization Model that increases accuracy over short and long transcripts.
New PII Classes

We have released an update to our PII Redaction Model that will now support detecting and redacting additional classes!

  • blood_type
  • medical_condition
  • drug (including vitamins/minerals)
  • injury
  • medical_process

Entity Definitions:

  • blood_type: Blood type
  • medical_condition: A medical condition. Includes diseases, syndromes, deficits, disorders. E.g., chronic fatigue syndrome, arrhythmia, depression.
  • drug: Medical drug, including vitamins and minerals. E.g., Advil, Acetaminophen, Panadol
  • injury: Human injury, e.g., I broke my arm, I have a sprained wrist. Includes mutations, miscarriages, and dislocations.
  • medical_process: Medical process, including treatments, procedures, and tests. E.g., "heart surgery," "CT scan."