We've made a series of improvements to our Free Offer:
All new and existing users will get $50 in free credits (equivalent to 135 hours of Best transcription, or 417 hours of Nano transcription)
All unused free credits will be automatically transferred to a user's account balance after upgrade to pay-as-you-go pricing.
Free Offer users will now see a tracker in their dashboard showing how many credits they have remaining
Free Offer users will now have access to the usage dashboard, their billing rates, concurrency limit, and billing alerts
Learn more about our Free Offer on our Pricing page, and then check out our Quickstart in our Docs to get started.
Speaker Diarization improvements
We've made improvements to our Speaker Diarization model, especially its robustness in distinguishing between speakers with similar voices.
We've fixed an error in which the last word in a transcript was always attributed to the same speaker as the second-to-last word.
File upload improvements and more
We've made improvements to error handling for file uploads that fail. Now if there is an error, such as a file containing no audio, the following 422 error will be returned:
Upload failed, please try again. If you continue to have issues please reach out to support@assemblyai.com
We've made scaling improvements that reduce p90 latency for some non-English languages when using the Best tier
We've made improvements to notifications for auto-refill failures. Now, users will be alerted more rapidly when their automatic payments are unsuccessful.
New endpoints for LeMUR Claude 3
Last month, we announced support for Claude 3 in LeMUR. Today, we are adding support for two new endpoints - Question & Answer and Summary (in addition to the pre-existing Task endpoint) - for these newest models:
Claude 3 Opus
Claude 3.5 Sonnet
Claude 3 Sonnet
Claude 3 Haiku
Here's how you can use Claude 3.5 Sonnet to summarize a virtual meeting with LeMUR:
import assemblyai as aai
aai.settings.api_key = "YOUR-KEY-HERE"
audio_url = "https://storage.googleapis.com/aai-web-samples/meeting.mp4"
transcript = aai.Transcriber().transcribe(audio_url)
result = transcript.lemur.summarize(
    final_model=aai.LemurModel.claude3_5_sonnet,
    context="A GitLab meeting to discuss logistics",
    answer_format="TLDR"
)
print(result.response)
Learn more about these specialized endpoints and how to use them in our Docs.
Enhanced AssemblyAI app for Zapier
We've launched our Zapier integration v2.0, which makes it easy to use our API in a no-code way. The enhanced app is more flexible, supports more Speech AI features, and integrates more closely into the Zap editor.
The Transcribe event (formerly Get Transcript) now supports all of the options available in our transcript API, making all of our Speech Recognition and Audio Intelligence features available to Zapier users, including asynchronous transcription. In addition, we've added 5 new events to the AssemblyAI app for Zapier:
Get Transcript: Retrieve a transcript that you have previously created.
Get Transcript Subtitles: Generate SRT or VTT subtitles for the transcript.
Get Transcript Paragraphs: Retrieve the transcript segmented into paragraphs.
Get Transcript Sentences: Retrieve the transcript segmented into sentences.
Get Transcript Redacted Audio Result: Retrieve the result of the PII audio redaction model. The result contains the status and the URL to the redacted audio file.
LeMUR can now be used from browsers, either via our JavaScript SDK or fetch.
LeMUR - Claude 3 support
Last week, we released Anthropic's Claude 3 model family into LeMUR, our LLM framework for speech.
Claude 3.5 Sonnet
Claude 3 Opus
Claude 3 Sonnet
Claude 3 Haiku
You can now easily apply any of these models to your audio data. Learn more about how to get started in our docs or try out the new models in a no-code way through our playground.
For more information, check out our blog post about the release.
import assemblyai as aai
# Step 1: Transcribe an audio file
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("./common_sports_injuries.mp3")
# Step 2: Define a prompt
prompt = "Provide a brief summary of the transcript."
# Step 3: Choose an LLM to use with LeMUR
result = transcript.lemur.task(
    prompt,
    final_model=aai.LemurModel.claude3_5_sonnet
)
print(result.response)
JavaScript SDK fix
We've fixed an issue which was causing the JavaScript SDK to surface the following error when using the SDK in the browser:
Access to fetch at 'https://api.assemblyai.com/v2/transcript' from origin 'https://exampleurl.com' has been blocked by CORS policy: Request header field assemblyai-agent is not allowed by Access-Control-Allow-Headers in preflight response.
Timestamps improvement; bugfixes
We've made significant improvements to the timestamp accuracy of our Speech-to-Text Best tier for English, Spanish, and German. 96% of timestamps are accurate within 200ms, and 86% of timestamps are now accurate within 100ms.
We've fixed a bug in which confidence scores of transcribed words for the Nano tier would sometimes be outside of the range [0, 1]
We've fixed a rare issue in which the speech for only one channel in a short dual channel file would be transcribed when disfluencies was also enabled.
Streaming (formerly Real-time) improvements
We've made model improvements that significantly improve the accuracy of timestamps when using our Streaming Speech-to-Text service. Most timestamps are now accurate within 100 ms.
Our Streaming Speech-to-Text service will now return a new error 'Audio too small to be transcoded' (code 4034) when a client submits an audio chunk that is too small to be transcoded (less than 10 ms).
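A client can avoid the 4034 error by checking the duration of each chunk before sending it. The following is a minimal sketch that assumes 16-bit mono PCM audio (2 bytes per sample); the helper names and constants are hypothetical and not part of any SDK.

```python
MIN_CHUNK_MS = 10      # chunks shorter than this trigger error 4034, per the note above
BYTES_PER_SAMPLE = 2   # assumption: 16-bit mono PCM


def chunk_duration_ms(chunk: bytes, sample_rate: int) -> float:
    """Return the duration of a raw PCM chunk in milliseconds."""
    samples = len(chunk) // BYTES_PER_SAMPLE
    return samples * 1000 / sample_rate


def is_long_enough(chunk: bytes, sample_rate: int) -> bool:
    """True if the chunk meets the minimum duration for transcoding."""
    return chunk_duration_ms(chunk, sample_rate) >= MIN_CHUNK_MS


# At 16 kHz, 10 ms of 16-bit mono audio is 160 samples, or 320 bytes.
print(is_long_enough(bytes(320), 16_000))  # True
print(is_long_enough(bytes(100), 16_000))  # False
```

Buffering small reads until they reach the minimum size is usually simpler than handling the error after the fact.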
Variable-bitrate video support; bugfix
We've deployed changes which now permit variable-bitrate video files to be submitted to our API.
We've fixed a recent bug in which audio files with a large amount of silence at the beginning of the file would fail to transcribe.
LeMUR improvements
We have added two new keys to the LeMUR response, input_tokens and output_tokens, which can help users track usage.
We've implemented a new fallback system to further boost the reliability of LeMUR.
We have addressed an edge case issue affecting LeMUR and certain XML tags. In particular, when LeMUR responds with a <question> XML tag, it will now always close it with a </question> tag rather than erroneous tags which would sometimes be returned (e.g. </answer>).
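The input_tokens and output_tokens keys mentioned above can be summed across requests for rough usage tracking. This is a sketch only: the sample responses are made up, and whether the keys sit at the top level of the response or under a nested object such as "usage" is an assumption, so the helper accepts both.

```python
def token_usage(response: dict) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) from a LeMUR response dict.

    Assumption: the keys sit either at the top level or under "usage".
    """
    usage = response.get("usage", response)
    return usage.get("input_tokens", 0), usage.get("output_tokens", 0)


# Hypothetical responses, shaped for illustration only.
responses = [
    {"response": "Summary A", "input_tokens": 1200, "output_tokens": 85},
    {"response": "Summary B", "usage": {"input_tokens": 900, "output_tokens": 60}},
]

total_in = sum(token_usage(r)[0] for r in responses)
total_out = sum(token_usage(r)[1] for r in responses)
print(total_in, total_out)  # 2100 145
```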
PII Redaction and Entity Detection improvements
We've improved our PII Text Redaction and Entity Detection models, yielding more accurate detection and removal of PII and other entities from transcripts.
We've added 16 new entities, including vehicle_id and account_number, and updated 4 of our existing entities. Users may need to update to the latest version of our SDKs to use these new entities.
We've added PII Text Redaction and Entity Detection support in 4 new languages:
Chinese
Dutch
Japanese
Georgian
PII Text Redaction and Entity Detection now support a total of 47 languages between our Best and Nano tiers.
Usage and spend alerts
Users can now set up billing alerts in their user portals. Billing alerts notify you when your monthly spend or account balance reaches a threshold.
To set up a billing alert, go to the billing page of your portal, and click Set up a new alert under the Your alerts widget:
You can then set up an alert by specifying whether to alert on monthly spend or account balance, as well as the specific threshold at which to send an alert.
Universal-1 now available in German
Universal-1, our most powerful and accurate multilingual Speech-to-Text model, is now available in German.
No special action is needed to utilize Universal-1 on German audio - all requests sent to our /v2/transcript endpoint with German audio files will now use Universal-1 by default. Learn more about how to integrate Universal-1 into your apps in our Getting Started guides.
We’ve released a new version of the API Reference section of our docs for an improved developer experience. Here’s what’s new:
New API Reference pages with exhaustive endpoint documentation for transcription, LeMUR, and streaming
cURL examples for every endpoint
Interactive Playground: Test our API endpoints with the interactive playground. It includes a form-builder for generating requests and corresponding code examples in cURL, Python, and TypeScript
Always up to date: The new API Reference is autogenerated based on our Open-Source OpenAPI and AsyncAPI specs
We’ve made improvements to Universal-1’s timestamps for both the Best and Nano tiers, yielding improved timestamp accuracy and a reduced incidence of overlapping timestamps.
We’ve fixed an issue in which users could receive an `Unable to create transcription. Developers have been alerted` error when using Sentiment Analysis with long files.
New codec support; account deletion support
We’ve upgraded our transcoding library and now support the following new codecs:
Users can now delete their accounts by selecting the Delete account option on the Account page of their AssemblyAI Dashboards.
Users will now receive a 400 error when using an invalid tier and language code combination, with an error message such as `The selected language_code is supported by the following speech_models: best, conformer-2. See https://www.assemblyai.com/docs/concepts/supported-languages.`
We’ve fixed an issue in which nested JSON responses from LeMUR would cause `Invalid LLM response, unable to fulfill request. Please try again.` errors.
We’ve fixed a bug in which very long files would sometimes fail to transcribe, leading to timeout errors.
AssemblyAI app for Make.com
Make (formerly Integromat) is a no-code automation platform that makes it easy to build tasks and workflows that synthesize many different services.
We’ve released the AssemblyAI app for Make that allows Make users to incorporate AssemblyAI into their workflows, or scenarios. In other words, in Make you can now use our AI models to:
Transcribe audio data with speech recognition models
Analyze audio data with audio intelligence models
Build generative features on top of audio data with LLMs
For example, in our tutorial on Redacting PII with Make, we demonstrate how to build a Make scenario that automatically creates a redacted audio file and redacted transcription for any audio file uploaded to a Google Drive folder.
GDPR and PCI DSS compliance
AssemblyAI is now officially PCI Compliant. The Payment Card Industry Data Security Standard Requirements and Security Assessment Procedures (PCI DSS) certification is a rigorous assessment that ensures card holder data is being properly and securely handled and stored. You can read more about PCI DSS here.
Additionally, organizations which have signed an NDA can go to our Trust Portal in order to view our PCI attestation of compliance, as well as other security-related documents.
AssemblyAI is also GDPR Compliant. The General Data Protection Regulation (GDPR) is a European Union regulation on privacy and security that applies to businesses that serve customers within the EU. You can read more about GDPR here.
Additionally, organizations which have signed an NDA can go to our Trust Portal in order to view our GDPR assessment on compliance, as well as other security-related documents.
Self-serve invoices; dual-channel improvement
Users of our API can now view and download their self-serve invoices in their dashboards under Billing > Your invoices.
We’ve made readability improvements to the formatting of utterances for dual-channel transcription by combining sequential utterances from the same channel.
We’ve added a patch to improve stability in turnaround times for our async transcription and LeMUR services.
We’ve fixed an issue in which timestamp accuracy would be degraded in certain edge cases when using our async transcription service.
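To illustrate the dual-channel readability change above, here is a minimal sketch of combining sequential utterances from the same channel. The field names (channel, text, start, end) and the sample data are hypothetical, and this is not necessarily how the service implements it.

```python
def merge_sequential_utterances(utterances: list[dict]) -> list[dict]:
    """Combine back-to-back utterances that come from the same channel."""
    merged: list[dict] = []
    for utt in utterances:
        if merged and merged[-1]["channel"] == utt["channel"]:
            # Same channel as the previous entry: extend it instead of appending.
            merged[-1]["text"] += " " + utt["text"]
            merged[-1]["end"] = utt["end"]
        else:
            merged.append(dict(utt))  # copy so the input is left untouched
    return merged


utterances = [
    {"channel": "1", "text": "Hi, thanks for calling.", "start": 0, "end": 1500},
    {"channel": "1", "text": "How can I help?", "start": 1600, "end": 2400},
    {"channel": "2", "text": "I have a billing question.", "start": 2600, "end": 4000},
]
for utt in merge_sequential_utterances(utterances):
    print(utt["channel"], utt["text"])
```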
Introducing Universal-1
Last week we released Universal-1, a state-of-the-art multilingual speech recognition model. Universal-1 is trained on 12.5M hours of multilingual audio data, yielding impressive performance across the four key languages for which it was trained - English, Spanish, German, and French.
Universal-1 is now the default model for English and Spanish audio files sent to our v2/transcript endpoint for async processing, while German and French will be rolled out in the coming weeks.
We’ve added a new message type to our Streaming Speech-to-Text (STT) service. This new message type, SessionInformation, is sent immediately before the final SessionTerminated message when closing a Streaming session, and it contains an audio_duration_seconds field holding the total audio duration processed during the session. This feature allows customers to run end-user-specific billing calculations.
To enable this feature, set the enable_extra_session_information query parameter to true when connecting to a Streaming WebSocket.
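As a rough sketch of how a client might pick up the session duration for billing, the snippet below parses incoming messages and extracts audio_duration_seconds. The "message_type" field name and the sample messages are assumptions made for illustration.

```python
import json
from typing import Optional


def session_audio_seconds(raw_message: str) -> Optional[float]:
    """Return audio_duration_seconds if this is a SessionInformation message."""
    message = json.loads(raw_message)
    # Assumption: the message type is carried in a "message_type" field.
    if message.get("message_type") == "SessionInformation":
        return float(message["audio_duration_seconds"])
    return None


# Hypothetical messages as they might arrive near the end of a session.
incoming = [
    '{"message_type": "FinalTranscript", "text": "Thanks, goodbye."}',
    '{"message_type": "SessionInformation", "audio_duration_seconds": 42.5}',
    '{"message_type": "SessionTerminated"}',
]

durations = [d for d in map(session_audio_seconds, incoming) if d is not None]
print(durations)  # [42.5]
```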
We’ve added a new feature to our Streaming STT service, allowing users to disable Partial Transcripts in a Streaming session. Our Streaming API sends two types of transcripts - Partial Transcripts (unformatted and unpunctuated) that gradually build up the current utterance, and Final Transcripts which are sent when an utterance is complete, containing the entire utterance punctuated and formatted.
Users can now set the disable_partial_transcripts query parameter to true when connecting to a Streaming WebSocket to disable the sending of Partial Transcript messages.
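Query parameters like these are appended to the WebSocket URL when connecting. A minimal sketch of building such a URL follows; the streaming_url helper is hypothetical, and the sample_rate parameter is an assumption about what the connection also requires.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.assemblyai.com/v2/realtime/ws"


def streaming_url(sample_rate: int, **flags: bool) -> str:
    """Build a Streaming WebSocket URL; boolean flags become "true"/"false"."""
    params: dict[str, object] = {"sample_rate": sample_rate}
    params.update({name: str(value).lower() for name, value in flags.items()})
    return f"{BASE_URL}?{urlencode(params)}"


print(streaming_url(
    16_000,
    disable_partial_transcripts=True,
    enable_extra_session_information=True,
))
```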
We have fixed a bug in our async transcription service, eliminating `File does not appear to contain audio` errors. Previously, this error would be surfaced in edge cases where our transcoding pipeline would not have enough resources to transcode a given file, thus failing due to resource starvation.
Dual channel transcription improvements
We’ve made improvements to how utterances are handled during dual-channel transcription. In particular, the transcription service now has elevated sensitivity when detecting utterances, leading to improved utterance insertions when there is overlapping speech on the two channels.
LeMUR concurrency fix
We’ve fixed a temporary issue in which users with low account balances would occasionally be rate-limited to a value less than 30 when using LeMUR.
Fewer "File does not appear to contain audio" errors
We’ve fixed an edge-case bug in our async API, leading to a significant reduction in errors that say File does not appear to contain audio. Users can expect to see an immediate reduction in this type of error. If this error does occur, users should retry their requests given that retries are generally successful.
We’ve made improvements to our transcription service autoscaling, leading to improved turnaround times for requests that use Word Boost when there is a spike in requests to our API.
New developer controls for real-time end-of-utterance
We have released developer controls for real-time end-of-utterance detection, providing developers control over when an utterance is considered complete. Developers can now either manually force the end of an utterance, or set a threshold for time of silence before an utterance is considered complete.
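Both controls are typically exercised by sending small JSON messages over the open WebSocket. The sketch below only builds those messages. The field names force_end_utterance and end_utterance_silence_threshold are assumptions based on our reading of the Streaming API and should be checked against the docs.

```python
import json


def force_end_utterance_message() -> str:
    """JSON message asking the service to end the current utterance now."""
    # Assumption: field name as we understand the Streaming API.
    return json.dumps({"force_end_utterance": True})


def silence_threshold_message(milliseconds: int) -> str:
    """JSON message setting how much silence ends an utterance."""
    if milliseconds < 0:
        raise ValueError("silence threshold must be non-negative")
    # Assumption: field name as we understand the Streaming API.
    return json.dumps({"end_utterance_silence_threshold": milliseconds})


print(force_end_utterance_message())   # {"force_end_utterance": true}
print(silence_threshold_message(500))  # {"end_utterance_silence_threshold": 500}
```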
We have made changes to our English async transcription service that improve sentence segmentation for our Sentiment Analysis, Topic Detection, and Content Moderation models. The improvements fix a bug in which these models would sometimes delineate sentences on titles that end in periods, like "Dr." and "Mrs."
We have fixed an issue in which transcriptions of very long files (8h+) with disfluencies enabled would error out.
PII Redaction and Entity Detection available in 13 additional languages
We have increased the memory of our transcoding service workers, leading to a significant reduction in errors that say File does not appear to contain audio.
Fewer LeMUR 500 errors
We’ve made improvements to our LeMUR service to reduce the number of 500 errors.
We’ve made improvements to our real-time service, which provide a small increase in the accuracy of timestamps in some edge cases.
We have increased the usage limit for our free tier to 100 hours. New users can now use our async API to transcribe up to 100 hours of audio, with a concurrency limit of 5, before needing to upgrade their accounts.
We have rolled out the concurrency limit increase for our real-time service. Users now have access to up to 100 concurrent streams by default when using our real-time service.
Higher concurrency is available upon request with no limit to what our API can support. If you need a higher concurrency limit, please either contact our Sales team or reach out to us at support@assemblyai.com. Note that our real-time service is only available for upgraded accounts.
Latency and cost reductions, concurrency increase
We introduced major improvements to our API’s inference latency, with the majority of audio files now completing in well under 45 seconds regardless of audio duration, with a Real-Time Factor (RTF) as low as 0.008.
To put an RTF of 0.008x into perspective, this means you can now convert a:
1h3min (75MB) meeting in 35 seconds
3h15min (191MB) podcast in 133 seconds
8h21min (464MB) video course in 300 seconds
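For readers who want to check the arithmetic, RTF is commonly defined as processing time divided by audio duration. That definition is an assumption here, since the text does not spell it out. A small sketch using the three examples above:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; lower is faster."""
    return processing_seconds / audio_seconds


def to_seconds(hours: int, minutes: int) -> int:
    """Convert an hours-and-minutes duration to seconds."""
    return hours * 3600 + minutes * 60


examples = {
    "meeting (1h3min)": (35, to_seconds(1, 3)),
    "podcast (3h15min)": (133, to_seconds(3, 15)),
    "video course (8h21min)": (300, to_seconds(8, 21)),
}
for name, (processing, audio) in examples.items():
    print(f"{name}: RTF ~ {real_time_factor(processing, audio):.4f}")
```

Under this definition, the listed examples work out to roughly 0.009 to 0.011, in the neighborhood of the quoted best-case figure.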
In addition to these latency improvements, we have reduced our Speech-to-Text pricing. You can now access our Speech AI models with the following pricing:
Async Speech-to-Text for $0.37 per hour (previously $0.65)
Real-time Speech-to-Text for $0.47 per hour (previously $0.75)
We’ve also reduced our pricing for the following Audio Intelligence models: Key Phrases, Sentiment Analysis, Summarization, PII Audio Redaction, PII Redaction, Auto Chapters, Entity Detection, Content Moderation, and Topic Detection. You can view the complete list of pricing updates on our Pricing page.
Finally, we've increased the default concurrency limits for both our async and real-time services. The increase is immediate for async, and will be rolled out soon for real-time. These new limits are now:
200 for async (up from 32)
100 for real-time (up from 32)
These new changes stem from the efficiencies that our incredible research and engineering teams drive at every level of our inference pipeline, including optimized model compilation, intelligent mini batching, hardware parallelization, and optimized serving infrastructure.
Learn more about these changes and our inference pipeline in our blog post.
Claude 2.1 available through LeMUR
Anthropic’s Claude 2.1 is now generally available through LeMUR. Claude 2.1 is similar to our Default model, with reduced hallucinations, a larger context window, and better performance on citations.
Claude 2.1 can be used by setting the final_model parameter to anthropic/claude-2-1 in API requests to LeMUR. Here's an example of how to do this through our Python SDK:
import assemblyai as aai
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://example.org/customer.mp3")
result = transcript.lemur.task(
    "Summarize the following transcript in three to five sentences.",
    final_model=aai.LemurModel.claude2_1,
)
print(result.response)
You can learn more about setting the model used with LeMUR in our docs.
Our real-time service now supports binary mode for sending audio segments. Users no longer need to encode audio segments as base64 sequences inside of JSON objects - the raw binary audio segment can now be directly sent to our API.
Moving forward, sending audio segments through websockets via the audio_data field is considered a deprecated functionality, although it remains the default for now to avoid breaking changes. We plan to support the audio_data field until 2025.
If you are using our SDKs, no changes are required on your end.
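For clients that do not use an SDK, the difference between the two modes can be sketched as follows. The audio format (16-bit mono PCM at 16 kHz) and the chunk size are assumptions chosen for illustration.

```python
import base64
import json

# 100 ms of silence as 16-bit mono PCM at 16 kHz (assumed format): 3200 bytes.
chunk = bytes(3200)

# Deprecated style: base64 text inside a JSON object under audio_data.
json_frame = json.dumps({"audio_data": base64.b64encode(chunk).decode("ascii")})

# Binary mode: the raw bytes are the frame.
binary_frame = chunk

print(len(json_frame), len(binary_frame))
```

Base64 adds about one third to the payload size before the JSON wrapper, so binary mode also saves bandwidth.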
We have fixed a bug that would yield a degradation to timestamp accuracy at the end of very long files with many disfluencies.
New Node/JavaScript SDK works in multiple runtimes
We’ve released v4 of our Node JavaScript SDK. Previously, the SDK was developed specifically for Node, but the latest version now works in additional runtimes without any extra steps. The SDK can now be used in the browser, Deno, Bun, Cloudflare Workers, etc.
New Punctuation Restoration and Truecasing models, PCM Mu-law support
We’ve released new Punctuation and Truecasing models, achieving significant improvements for acronyms, mixed-case words, and more.
Below is a visual comparison between our previous Punctuation Restoration and Truecasing models (red) and the new models (green):
Going forward, the new Punctuation Restoration and Truecasing models will automatically be used for async and real-time transcriptions, with no need to upgrade for special access. Use the parameters punctuate and format_text, respectively, to enable/disable the models in a request (enabled by default).
Our real-time transcription service now supports PCM Mu-law, an encoding used primarily in the telephony industry. This encoding is set by using the `encoding` parameter in requests to our API. You can read more about our PCM Mu-law support here.
We have improved internal reporting for our transcription service, which will allow us to better monitor traffic.
New LeMUR parameter, reduced hold music hallucinations
Users can now directly pass in custom text inputs into LeMUR through the input_text parameter as an alternative to transcript IDs. This gives users the ability to use any information from the async API, formatted however they want, with LeMUR for maximum flexibility.
For example, users can assign action items per user by inputting speaker-labeled transcripts, or pull citations by inputting timestamped transcripts. Learn more about the new input_text parameter in our LeMUR API reference, or check out examples of how to use the input_text parameter in the AssemblyAI Cookbook.
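As a sketch of the speaker-labeled case, the snippet below formats utterances into plain text and places the result in an input_text field. The utterance shape, the "prompt" key, and the sample data are assumptions made for illustration; nothing is sent to the API.

```python
import json


def speaker_labeled_text(utterances: list[dict]) -> str:
    """Render utterances as 'Speaker X: text' lines for use as input_text."""
    return "\n".join(f"Speaker {u['speaker']}: {u['text']}" for u in utterances)


# Hypothetical utterances, e.g. taken from a transcript with speaker labels.
utterances = [
    {"speaker": "A", "text": "Can you send the report by Friday?"},
    {"speaker": "B", "text": "Yes, I'll have it ready Thursday."},
]

payload = {
    "prompt": "List the action items for each speaker.",
    "input_text": speaker_labeled_text(utterances),
}
print(json.dumps(payload, indent=2))
```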
We’ve made improvements that reduce hallucinations which sometimes occurred from transcribing hold music on phone calls. This improvement is effective immediately with no changes required by users.
We’ve fixed an issue in which requests would sometimes fail when XML was returned by the LeMUR /task endpoint.
Reduced latency, improved error messaging
We’ve made improvements to our file downloading pipeline which reduce transcription latency. Latency has been reduced by at least 3 seconds for all audio files, with greater improvements for large audio files provided via external URLs.
We’ve improved error messaging for increased clarity in the case of internal server errors.
New Dashboard features and LeMUR fix
We have released the beta for our new usage dashboard. You can now see a usage summary broken down by async transcription, real-time transcription, Audio Intelligence, and LeMUR. Additionally, you can see charts of usage over time broken down by model.
We have added support for AWS marketplace on the dashboard/account management pages of our web application.
We have fixed an issue in which LeMUR would sometimes fail when handling extremely short transcripts.
New LeMUR features and other improvements
We have added a new parameter to LeMUR that allows users to specify a temperature for LeMUR generation. Temperature refers to how stochastic the generated text is and can be a value from 0 to 1, inclusive, where 0 corresponds to low creativity and 1 corresponds to high creativity. Lower values are preferred for tasks like multiple choice, and higher values are preferred for tasks like coming up with creative summaries of clips for social media.
Here is an example of how to set the temperature parameter with our Python SDK (which is available in version 0.18.0 and up):
import assemblyai as aai
aai.settings.api_key = "YOUR-KEY-HERE"
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://storage.googleapis.com/aai-web-samples/meeting.mp4")
result = transcript.lemur.summarize(
    temperature=0.25
)
print(result.response)
We have added a new endpoint that allows users to delete the data for a previously submitted LeMUR request. The response data as well as any context provided in the original request will be removed. Continuing the example from above, we can see how to delete LeMUR data using our Python SDK:
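The SDK snippet is not reproduced on this page. As a rough stand-in, the equivalent raw HTTP request might be built as shown here. The endpoint path /lemur/v3/<request_id>, the Authorization header, and the helper name are assumptions, and the request is constructed but not sent.

```python
import urllib.request


def build_purge_request(request_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a DELETE request for a LeMUR request's data."""
    # Assumption: the purge endpoint lives at /lemur/v3/<request_id>.
    url = f"https://api.assemblyai.com/lemur/v3/{request_id}"
    return urllib.request.Request(
        url, method="DELETE", headers={"Authorization": api_key}
    )


req = build_purge_request("YOUR-REQUEST-ID", "YOUR-KEY-HERE")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```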
We have improved the error messaging for our Word Search functionality. Each phrase in a Word Search request must be 5 words or fewer, and the error returned when a phrase exceeds this limit is now clearer.
We have fixed an edge case error that would occur when both disfluencies and Auto Chapters were enabled for audio files that contained non-fluent English.
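The Word Search limit mentioned above can also be checked on the client before a request is made. A minimal sketch follows, with a hypothetical helper name and whitespace-based word counting as a simplifying assumption.

```python
MAX_WORDS_PER_PHRASE = 5  # limit stated above


def invalid_phrases(phrases: list[str]) -> list[str]:
    """Return the phrases that exceed the Word Search word limit."""
    return [p for p in phrases if len(p.split()) > MAX_WORDS_PER_PHRASE]


phrases = ["wildfire smoke", "air quality alerts across the entire country"]
print(invalid_phrases(phrases))
# ['air quality alerts across the entire country']
```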
Improvements - observability, logging, and patches
We have improved logging for our LeMUR service to allow for the surfacing of more detailed errors to users.
We have increased observability into our Speech API internally, allowing for finer grained metrics of usage.
We have fixed a minor bug that would sometimes lead to incorrect timestamps for zero-confidence words.
We have fixed an issue in which requests to LeMUR would occasionally hang during peak usage due to a memory leak issue.
Multi-language speaker labels
We have recently launched Speaker Labels for 10 additional languages:
Spanish
Portuguese
German
Dutch
Finnish
French
Italian
Polish
Russian
Turkish
Audio Intelligence unbundling and price decreases
We have unbundled and lowered the price for our Audio Intelligence models. Previously, the bundled price for all Audio Intelligence models was $2.10/hr, regardless of the number of models used.
We have made each model accessible at a lower, unbundled, per-model rate:
Auto chapters: $0.30/hr
Content Moderation: $0.25/hr
Entity detection: $0.15/hr
Key Phrases: $0.06/hr
PII Redaction: $0.20/hr
Audio Redaction: $0.05/hr
Sentiment analysis: $0.12/hr
Summarization: $0.06/hr
Topic detection: $0.20/hr
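To see what unbundling means for a given workload, the per-model rates above can be summed for the models actually used. This is a small sketch; the dictionary keys are labels chosen for illustration, not API parameter names.

```python
from decimal import Decimal

BUNDLED_RATE = Decimal("2.10")  # previous price per hour for all models
RATES = {
    "auto_chapters": Decimal("0.30"),
    "content_moderation": Decimal("0.25"),
    "entity_detection": Decimal("0.15"),
    "key_phrases": Decimal("0.06"),
    "pii_redaction": Decimal("0.20"),
    "audio_redaction": Decimal("0.05"),
    "sentiment_analysis": Decimal("0.12"),
    "summarization": Decimal("0.06"),
    "topic_detection": Decimal("0.20"),
}


def hourly_cost(models: list[str]) -> Decimal:
    """Sum the per-model hourly rates for the selected models."""
    return sum((RATES[m] for m in models), Decimal("0"))


print(hourly_cost(["summarization", "sentiment_analysis"]))  # 0.18
print(hourly_cost(list(RATES)))                               # 1.39
```

Even with every model enabled, the total of $1.39/hr comes in under the previous bundled rate of $2.10/hr.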
New language support and improvements to existing languages
We now support the following additional languages for asynchronous transcription through our /v2/transcript endpoint:
Chinese
Finnish
Korean
Polish
Russian
Turkish
Ukrainian
Vietnamese
Additionally, we've made improvements in accuracy and quality to the following languages:
Dutch
French
German
Italian
Japanese
Portuguese
Spanish
You can see a full list of supported languages and features here. You can see how to specify a language in your API request here. Note that not all languages support Automatic Language Detection.
Pricing decreases
We have decreased the price of Core Transcription from $0.90 per hour to $0.65 per hour, and decreased the price of Real-Time Transcription from $0.90 per hour to $0.75 per hour.
Both decreases were effective as of August 3rd.
Significant Summarization model speedups
We’ve implemented changes that yield a 43% to 200% increase in processing speed for our Summarization models, depending on which model is selected, with no measurable impact on the quality of results.
We have standardized the response from our API for automatically detected languages that do not support requested features. In particular, when Automatic Language Detection is used and the detected language does not support a feature requested in the transcript request, our API will return null in the response for that feature.
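Client code can rely on this by checking for null (None in Python) rather than catching errors. The sketch below uses a hypothetical response. The specific keys and the choice of which feature is unsupported are assumptions made for illustration.

```python
def available_features(transcript: dict, requested: list[str]) -> dict:
    """Split requested feature keys into those with results and those set to null."""
    present = [k for k in requested if transcript.get(k) is not None]
    missing = [k for k in requested if transcript.get(k) is None]
    return {"present": present, "missing": missing}


# Hypothetical response: Dutch was detected and, for illustration, one of the
# requested features is treated as unsupported for that language.
transcript = {
    "language_code": "nl",
    "text": "Welkom in Amsterdam.",
    "entities": [{"entity_type": "location", "text": "Amsterdam"}],
    "sentiment_analysis_results": None,
}
result = available_features(transcript, ["entities", "sentiment_analysis_results"])
print(result)
```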
Introducing LeMUR, the easiest way to build LLM apps on spoken data
We've released LeMUR - our framework for applying LLMs to spoken data - for general availability. LeMUR is optimized for high accuracy on specific tasks:
Custom Summary allows users to automatically summarize files in a flexible way
Question & Answer allows users to ask specific questions about audio files and receive answers to these questions
Action Items allows users to automatically generate a list of action items from virtual or in-person meetings
Additionally, LeMUR can be applied to groups of transcripts in order to analyze a set of files at once, allowing users to, for example, summarize many podcast episodes or ask questions about a series of customer calls.
Our Python SDK allows users to work with LeMUR in just a few lines of code:
# version 0.15 or greater
import assemblyai as aai
# set your API key
aai.settings.api_key = "YOUR-KEY-HERE"
# transcribe the audio file (meeting recording)
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://storage.googleapis.com/aai-web-samples/meeting.mp4")
# generate and print action items
result = transcript.lemur.action_items(
    context="A GitLab meeting to discuss logistics",
    answer_format="**<topic header>**\n<relevant action items>\n",
)
print(result.response)
Learn more about LeMUR in our blog post, or jump straight into the code in our associated Colab notebook.
Introducing our Conformer-2 model
We've released Conformer-2, our latest AI model for automatic speech recognition. Conformer-2 is trained on 1.1M hours of English audio data, extending Conformer-1 to provide improvements on proper nouns, alphanumerics, and robustness to noise.
Conformer-2 is now the default model for all English audio files sent to the v2/transcript endpoint for async processing and introduces no breaking changes.
We’ll be releasing Conformer-2 for real-time English transcriptions within the next few weeks.
Read our full blog post about Conformer-2 here. You can also try it out in our Playground.
New parameter and timestamps fix
We’ve introduced a new, optional speech_threshold parameter, allowing users to only transcribe files that contain at least a specified percentage of spoken audio, represented as a ratio in the range [0, 1].
You can use the speech_threshold parameter with our Python SDK as below:
Smoke from hundreds of wildfires in Canada is triggering air quality alerts throughout the US. Skylines from ...
If the percentage of speech in the audio file does not meet or surpass the provided threshold, then the value of transcript.text will be None and you will receive an error:
if not transcript.text:
    print(transcript.error)
Audio speech threshold 0.9461 is below the requested speech threshold value 1.0
As usual, you can also include the speech_threshold parameter in the JSON of raw HTTP requests for any language.
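For the raw HTTP case, a sketch of building the JSON body with a range check on speech_threshold might look like the following. The helper name and the audio_url value are placeholders, and the body is built but not sent.

```python
import json


def transcript_request_body(audio_url: str, speech_threshold: float) -> str:
    """Build the JSON body for a transcript request with speech_threshold."""
    if not 0 <= speech_threshold <= 1:
        raise ValueError("speech_threshold must be in the range [0, 1]")
    return json.dumps({"audio_url": audio_url, "speech_threshold": speech_threshold})


print(transcript_request_body("https://example.org/audio.mp3", 0.5))
```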
We’ve fixed a bug in which timestamps could sometimes be incorrectly reported for our Topic Detection and Content Safety models.
We’ve made improvements to detect and remove a hallucination that would sometimes occur with specific audio patterns.
Character sequence improvements
We’ve fixed an issue in which the last character in an alphanumeric sequence could fail to be transcribed. The fix is effective immediately and constitutes a 95% reduction in errors of this type.
We’ve fixed an issue in which consecutive identical numbers in a long number sequence could fail to be transcribed. This fix is effective immediately and constitutes a 66% reduction in errors of this type.
Speaker Labels improvement
We’ve made improvements to the Speaker Labels model, adjusting the impact of the speakers_expected parameter to better allow the model to determine the correct number of unique speakers, especially in cases where one or more speakers talk substantially less than others.
We’ve expanded our caching system to include additional third-party resources to help further ensure our continued operations in the event of external resources being down.
Significant processing time improvement
We’ve made significant improvements to our transcoding pipeline, resulting in a 98% overall speedup in transcoding time and a 12% overall improvement in processing time for our asynchronous API.
We’ve implemented a caching system for some third-party resources to ensure our continued operations in the event of external resources being down.
Announcing LeMUR - our new framework for applying powerful LLMs to transcribed speech
We’re introducing our new framework LeMUR, which makes it simple to apply Large Language Models (LLMs) to transcripts of audio files up to 10 hours in length.
LLMs unlock a range of impressive capabilities that allow teams to build powerful Generative AI features. However, building these features is difficult due to the limited context windows of modern LLMs, among other challenges that necessitate the development of complicated processing pipelines.
LeMUR circumvents this problem by making it easy to apply LLMs to transcribed speech, meaning that product teams can focus on building differentiating Generative AI features rather than focusing on building infrastructure. Learn more about what LeMUR can do and how it works in our announcement blog, or jump straight to trying LeMUR in our Playground.
New PII and Entity Detection Model
We’ve upgraded to a new and more accurate PII Redaction model, which improves credit card detections in particular.
We’ve made stability improvements regarding the handling and caching of web requests. These improvements additionally fix a rare issue with punctuation detection.
Multilingual and stereo audio fixes, & Japanese model retraining
We’ve fixed two edge cases in our async transcription pipeline that were producing non-deterministic results from multilingual and stereo audio.
We’ve improved word boundary detection in our Japanese automatic speech recognition model. These changes are effective immediately for all Japanese audio files submitted to AssemblyAI.
Decreased latency and improved password reset
We’ve implemented a range of improvements to our English pipeline, leading to an average 38% improvement in overall latency for asynchronous English transcriptions.
We’ve made improvements to our password reset process, offering greater clarity to users attempting to reset their passwords while still ensuring security throughout the reset process.
Conformer-1 now available for Real-Time transcription, new Speaker Labels parameter, and more
We're excited to announce that our new Conformer-1 Speech Recognition model is now available for real-time English transcriptions, offering a 24.3% relative accuracy improvement.
Effective immediately, this state-of-the-art model will be the default model for all English audio data sent to the wss://api.assemblyai.com/v2/realtime/ws WebSocket API.
The Speaker Labels model now accepts a new optional parameter called speakers_expected. If you have high confidence in the number of speakers in an audio file, then you can specify it with speakers_expected in order to improve Speaker Labels performance, particularly for short utterances.
TLS 1.3 is now available for use with the AssemblyAI API. Using TLS 1.3 can decrease latency when establishing a connection to the API.
Our PII redaction scaling has been improved to increase stability, particularly when processing longer files.
We've improved the quality and accuracy of our Japanese model.
Short transcripts that are unable to be summarized will now return an empty summary and a successful transcript.
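As a rough sketch of how the speakers_expected parameter mentioned above might be passed, the snippet below builds a JSON body for a POST request to /v2/transcript. The helper name build_speaker_labels_request and the audio URL are hypothetical, and speaker_labels is assumed to be the flag that enables the Speaker Labels model.

```python
import json


def build_speaker_labels_request(audio_url, speakers_expected=None):
    """Build a /v2/transcript request body with Speaker Labels enabled (hypothetical helper)."""
    body = {"audio_url": audio_url, "speaker_labels": True}
    if speakers_expected is not None:
        if speakers_expected < 1:
            raise ValueError("speakers_expected must be a positive integer")
        # Only send the hint when the speaker count is known with confidence.
        body["speakers_expected"] = speakers_expected
    return body


request_body = build_speaker_labels_request(
    "https://example.com/two-person-call.mp3", speakers_expected=2
)
print(json.dumps(request_body, indent=2))
```

Leaving speakers_expected unset keeps the parameter out of the request entirely, which fits its optional nature.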
Introducing our Conformer-1 model
We've released our new Conformer-1 model for speech recognition. Conformer-1 was trained on 650K hours of audio data and is our most accurate model to date.
Conformer-1 is now the default model for all English audio files sent to the /v2/transcript endpoint for async processing.
We'll be releasing it for real-time English transcriptions within the next two weeks, and will add support for more languages soon.
New AI Models for Italian / Japanese Punctuation Improvements
We’ve made improvements to our Japanese punctuation model, increasing relative accuracy by 11%. These changes are effective immediately for all Japanese audio files submitted to AssemblyAI.
Hindi Punctuation Improvements
We’ve made improvements to our Hindi punctuation model, increasing relative accuracy by 26%. These changes are effective immediately for all Hindi audio files submitted to AssemblyAI.
We’ve tuned our production infrastructure to reduce latency and improve overall consistency when using the Topic Detection and Content Moderation models.
Improved PII Redaction
We’ve released a new version of our PII Redaction model to improve PII detection accuracy, especially for credit card and phone number edge cases. Improvements are effective immediately for all API calls that include PII redaction.
Automatic Language Detection Upgrade
We’ve released a new version of our Automatic Language Detection model that better targets speech-dense parts of audio files, yielding improved accuracy. Additionally, support for dual-channel and low-volume files has been improved. All changes are effective immediately.
Our Core Transcription API has been migrated from EC2 to ECS in order to ensure scalable, reliable service and preemptively protect against service interruptions.
Password Reset
Users can now reset their passwords from our web UI. From the Dashboard login, simply click “Forgot your password?” to initiate a password reset. Alternatively, users who are already logged in can change their passwords from the Account tab on the Dashboard.
The maximum phrase length for our Word Search feature has been increased from 2 to 5, effective immediately.
Dual Channel Support for Conversational Summarization / Improved Timestamps
We’ve made updates to our Conversational Summarization model to support dual-channel files. Effective immediately, dual_channel may be set to True when summary_model is set to conversational.
We've made significant improvements to timestamps for non-English audio. Timestamps are now typically accurate between 0 and 100 milliseconds. This improvement is effective immediately for all non-English audio files submitted to AssemblyAI for transcription.
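A minimal sketch of a request body that combines dual_channel with the Conversational Summarization model, as described in this update. The helper name and the audio URL are hypothetical; the summarization flag is assumed to be required alongside summary_model.

```python
def build_dual_channel_summary_request(audio_url):
    """Request body for a conversational summary of a dual-channel file (hypothetical helper)."""
    return {
        "audio_url": audio_url,
        "summarization": True,                # assumed to be needed to enable summaries
        "summary_model": "conversational",
        "dual_channel": True,                 # supported with the conversational model
    }


dual_channel_body = build_dual_channel_summary_request("https://example.com/support-call.wav")
print(dual_channel_body)
```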
Improved Transcription Accuracy for Phone Numbers
We’ve made updates to our Core Transcription model to improve the transcription accuracy of phone numbers by 10%. This improvement is effective immediately for all audio files submitted to AssemblyAI for transcription.
We've improved scaling for our read-only database, resulting in improved performance for read-only requests.
v9 Transcription Model Released
We are happy to announce the release of our most accurate Speech Recognition model to date - version 9 (v9). This updated model delivers increased performance across many metrics on a wide range of audio types.
Word Error Rate, or WER, is the primary quantitative metric by which the performance of an automatic transcription model is measured. Our new v9 model shows significant improvements across a range of different audio types, as seen in the chart below, with a more than 11% improvement on average.
In addition to standard overall WER advancements, the new v9 model shows marked improvements with respect to proper nouns. In the chart below, we can see the relative performance increase of v9 over v8 for various types of audio, with a nearly 15% improvement on average.
The new v9 transcription model is currently live in production. This means that customers will see improved performance with no changes required on their end. The new model will automatically be used for all transcriptions created by our /v2/transcript endpoint going forward, with no need to upgrade for special access.
While our customers enjoy the elevated performance of the v9 model, our AI research team is already hard at work on our v10 model, which is slated to launch in early 2023. Building upon v9, the v10 model is expected to radically improve the state of the art in speech recognition.
Try our new v9 transcription model through your browser using the AssemblyAI Playground. Alternatively, sign up for a free API token to test it out through our API, or schedule a time with our AI experts to learn more.
New Summarization Models Tailored to Use Cases
We are excited to announce that new Summarization models are now available! Developers can now choose between multiple summary models that best fit their use case and customize the output based on the summary length.
The new models are:
Informative which is best for files with a single speaker, like a presentation or lecture
Conversational which is best for any multi-person conversation, like customer/agent phone calls or interviewer/interviewee calls
Catchy which is best for creating video, podcast, or media titles
Developers can use the summary_model parameter in their POST request to specify which of our summary models they would like to use. This new parameter can be used along with the existing summary_type parameter to allow the developer to customize the summary to their needs.
Check out our latest blog post to learn more about the new Summarization models or head to the AssemblyAI Playground to test Summarization in your browser!
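The following sketch shows one way the summary_model and summary_type parameters might be combined in a request body. It assumes the lowercase spellings informative, conversational, and catchy for the model names, and the summary_type values bullets, paragraph, headline, and gist; the helper name and audio URL are hypothetical.

```python
SUMMARY_MODELS = {"informative", "conversational", "catchy"}  # assumed spellings
SUMMARY_TYPES = {"bullets", "paragraph", "headline", "gist"}


def build_summary_request(audio_url, summary_model, summary_type="bullets"):
    """Validate the two summary parameters and build a request body (hypothetical helper)."""
    if summary_model not in SUMMARY_MODELS:
        raise ValueError(f"unknown summary_model: {summary_model!r}")
    if summary_type not in SUMMARY_TYPES:
        raise ValueError(f"unknown summary_type: {summary_type!r}")
    return {
        "audio_url": audio_url,
        "summarization": True,
        "summary_model": summary_model,
        "summary_type": summary_type,
    }


lecture_body = build_summary_request(
    "https://example.com/lecture.mp3", "informative", summary_type="paragraph"
)
print(lecture_body)
```

Validating on the client side is a design choice for this sketch only; it surfaces typos before a request is made.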
Improved Transcription Accuracy for COVID
We’ve made updates to our Core Transcription model to improve the transcription accuracy of the word COVID. This improvement is effective immediately for all audio files submitted to AssemblyAI for transcription.
Static IP support for webhooks is now generally available!
Outgoing webhook requests sent from AssemblyAI will now originate from a static IP address 44.238.19.20, rather than a dynamic IP address. This gives you the ability to easily validate that the source of the incoming request is coming from our server. Optionally, you can choose to whitelist this static IP address to add an additional layer of security to your system.
See our walkthrough on how to start receiving webhooks for your transcriptions.
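As an illustration of validating the source of a webhook request, the sketch below compares a caller's address against the static IP given above. The function name is hypothetical, and how the remote address is obtained depends on your web framework and on any proxies in front of your server.

```python
import ipaddress

# Static IP address that outgoing AssemblyAI webhook requests originate from.
ASSEMBLYAI_WEBHOOK_IP = ipaddress.ip_address("44.238.19.20")


def is_from_assemblyai(remote_addr):
    """Return True if remote_addr matches the static webhook IP (hypothetical helper)."""
    try:
        return ipaddress.ip_address(remote_addr) == ASSEMBLYAI_WEBHOOK_IP
    except ValueError:
        # Malformed addresses are never trusted.
        return False


print(is_from_assemblyai("44.238.19.20"))
print(is_from_assemblyai("10.0.0.1"))
```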
Starting today, you can now transcribe and summarize entire audio files with a single API call.
To enable our new Summarization models, include the following parameter: "summarization": true in your POST request to /v2/transcript. When the transcription finishes, you will see the summary key in the JSON response containing the summary of your transcribed audio or video file.
By default, summaries will be returned in the style of bullet points. You can customize the style of summary by including the optional summary_type parameter in your POST request along with one of the following values: paragraph, headline, or gist. Here is the full list of summary types we support.
// summary_type = "paragraph"
"summary": "Josh Seiden and Brian Donohue discuss the
topic of outcome versus output on Inside Intercom.
Josh Seiden is a product consultant and author who has
just released a book called Outcomes Over Output.
Brian is product management director and he's looking
forward to the chat."
// summary_type = "headline"
"summary": "Josh Seiden and Brian Donohue discuss the
topic of outcomes versus output."
// summary_type = "gist"
"summary": "Outcomes over output"
// summary_type = "bullets"
"summary": "Josh Seiden and Brian Donohue discuss
the topic of outcome versus output on Inside Intercom.
Josh Seiden is a product consultant and author who has
just released a book called Outcomes Over Output.
Brian is product management director and he's looking
forward to the chat.\n- ..."
Examples of use cases for Summarization include:
Identify key takeaways from phone calls to speed up post-call review and reduce manual summarization
Summarize long podcasts into short descriptions so users can preview before they listen.
Instantly generate meeting summaries to quickly recap virtual meetings and highlight post-meeting actions
Suggest 3-5 word video titles automatically for user-generated content
Synthesize long educational courses, lectures, and media broadcasts into their most important points for faster consumption
We're really excited to see what you build with our new Summarization models. To get started, try it out for free in our no-code playground or visit our documentation for more info on how to enable Summarization in your API requests.
Automatic Casing / Short Utterances
We’ve improved our Automatic Casing model and fixed a minor bug that caused over-capitalization in English transcripts. The Automatic Casing model is enabled by default with our Core Transcription API to improve transcript readability for video captions (SRT/VTT). See our documentation for more info on Automatic Casing.
Our Core Transcription model has been fine-tuned to better detect short utterances in English transcripts. Examples of short utterances include one-word answers such as “No.” and “Right.” This update will take effect immediately for all customers.
Static IP Support for Webhooks
Over the next few weeks, we will begin rolling out Static IP support for webhooks to customers in stages.
Outgoing webhook requests sent from AssemblyAI will now originate from a static IP address 44.238.19.20, rather than a dynamic IP address. This gives you the ability to easily validate that the source of the incoming request is coming from our server. Optionally, you can choose to whitelist this static IP address to add an additional layer of security to your system.
See our walkthrough on how to start receiving webhooks for your transcriptions.
Improved Number Transcription
We’ve made improvements to our Core Transcription model to better identify and transcribe numbers present in your audio files.
Accurate number transcription is critical for customers that need to redact Personally Identifiable Information (PII) that gets exchanged during phone calls. Examples of PII include credit card numbers, addresses, phone numbers, and social security numbers.
In order to help you handle sensitive user data at scale, our PII Redaction model automatically detects and removes sensitive info from transcriptions. For example, when PII redaction is enabled, a phone number like 412-412-4124 would become ###-###-####.
To learn more, check out our blog that covers all of our PII Redaction Policies or try our PII Redaction model in our Sandbox here!
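To make the redacted output format concrete, here is a small local illustration that replaces every digit with a # character. It only mimics the shape of the phone number example above; it is not how the PII Redaction model detects sensitive content, and the function name is hypothetical.

```python
import re


def mask_digits(text):
    """Replace each digit with '#', keeping separators such as hyphens intact."""
    return re.sub(r"\d", "#", text)


print(mask_digits("412-412-4124"))  # prints ###-###-####
```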
Improved Disfluency Timestamps
We've updated our Disfluency Detection model to improve the accuracy of timestamps for disfluency words.
By default, disfluencies such as "um" or "uh" and "hm" are automatically excluded from transcripts. However, we allow customers to include these filler words by simply setting the disfluencies parameter to true in their POST request to /v2/transcript, which enables our Disfluency Detection model.
We've improved the Speaker Label model’s ability to identify unique speakers for single word or short utterances.
Historical Transcript Bug Fix
We've fixed a bug with the Historical Transcript endpoint that was causing null to appear as the value of the completed key.
Japanese Transcription Now Available
Today, we’re releasing our new Japanese transcription model to help you transcribe and analyze your Japanese audio and video files using our cutting-edge AI.
Now you can automatically convert any Japanese audio or video file to text by including "language_code": "ja" in your POST request to our /v2/transcript endpoint.
In conjunction with transcription, we’ve also added Japanese support for our AI models including Custom Vocabulary (Word Boost), Custom Spelling, Automatic Punctuation / Casing, Profanity Filtering, and more. This means you can boost transcription accuracy with more granularity based on your use case. See the full list of supported models available for Japanese transcriptions here.
We’ve released our new Hindi transcription model to help you transcribe and analyze your Hindi audio and video files.
Now you can automatically convert any Hindi audio or video file to text by including "language_code": "hi" in your POST request to our /v2/transcript endpoint.
We’ve also added Hindi support for our AI models including Custom Vocabulary (Word Boost), Custom Spelling, Automatic Punctuation / Casing, Profanity Filtering, and more. See the full list of supported models available for Hindi transcriptions here.
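The language_code values described above ("ja" and "hi") could be sent as in the sketch below, which prepares a POST request without sending it. The base URL https://api.assemblyai.com, the authorization header carrying the API key, the helper name, and the audio URLs are assumptions made for illustration.

```python
import json
import urllib.request

TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"  # assumed base URL


def build_transcript_post(api_key, audio_url, language_code):
    """Prepare (but do not send) a POST request for the given language (hypothetical helper)."""
    payload = {"audio_url": audio_url, "language_code": language_code}
    return urllib.request.Request(
        TRANSCRIPT_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


japanese_request = build_transcript_post(
    "YOUR-KEY-HERE", "https://example.com/interview-ja.mp3", "ja"
)
hindi_request = build_transcript_post(
    "YOUR-KEY-HERE", "https://example.com/interview-hi.mp3", "hi"
)

# Sending is left to the caller, for example urllib.request.urlopen(japanese_request).
print(json.loads(japanese_request.data))
```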
Our Webhook service now supports the use of Custom Headers for authentication.
A Custom Header can be used for added security to authenticate webhook requests from AssemblyAI. This feature allows a developer to optionally provide a value to be used as an authorization header on the returning webhook from AssemblyAI, giving the ability to validate incoming webhook requests.
To use a Custom Header, you will include two additional parameters in your POST request to /v2/transcript: webhook_auth_header_name and webhook_auth_header_value. The webhook_auth_header_name parameter accepts a string containing the header's name which will be inserted into the webhook request. The webhook_auth_header_value parameter accepts a string with the value of the header that will be inserted into the webhook request. See our Using Webhooks documentation to learn more and view our code examples.
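A sketch of both sides of this flow, under stated assumptions: the first helper builds a request body with the two parameters named above, and the second checks the header on an incoming webhook. The helper names, the header name X-Webhook-Secret, the example URLs, and the webhook_url parameter are assumptions for illustration.

```python
import hmac


def build_webhook_request(audio_url, webhook_url, header_name, header_value):
    """Request body asking for a webhook that carries a custom auth header (hypothetical helper)."""
    return {
        "audio_url": audio_url,
        "webhook_url": webhook_url,  # assumed name of the webhook destination parameter
        "webhook_auth_header_name": header_name,
        "webhook_auth_header_value": header_value,
    }


def is_authentic(headers, header_name, expected_value):
    """Check an incoming webhook's headers for the expected custom header value."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    received = normalized.get(header_name.lower(), "")
    # Constant-time comparison avoids leaking the secret through timing.
    return hmac.compare_digest(received.encode(), expected_value.encode())


webhook_body = build_webhook_request(
    "https://example.com/call.mp3",
    "https://example.com/hooks/transcripts",
    "X-Webhook-Secret",
    "s3cr3t-value",
)
print(is_authentic({"x-webhook-secret": "s3cr3t-value"}, "X-Webhook-Secret", "s3cr3t-value"))
```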
Improved Speaker Labels Accuracy and Speaker Segmentation
Improved the overall accuracy of the Speaker Labels feature and the model’s ability to segment speakers.
Fixed a small edge case that would occasionally cause some transcripts to complete with NULL as the language_code value.
Content Moderation and Topic Detection Available for Portuguese
Accuracy of the Automatic Language Detection model improved on files with large amounts of silence.
Improved speaker segmentation accuracy for Speaker Labels.
Dutch and Portuguese Support Released
Dutch and Portuguese transcription is now generally available for our /v2/transcript endpoint. See our documentation for more information on specifying a language in your POST request.
Content Moderation and Topic Detection Available for French, German, and Spanish
Improved redaction accuracy for credit_card_number, credit_card_expiration, and credit_card_cvv policies in our PII Redaction feature.
Fixed an edge case that would occasionally affect the capitalization of words in transcripts when disfluencies was set to true.
French, German, and Italian Support Released
French, German, and Italian transcription is now publicly available. Check out our documentation for more information on Specifying a Language in your POST request.
Released v2 of our Spanish model, improving absolute accuracy by ~4%.
Spanish Language Support, Automatic Language Detection, and Custom Spelling Released
Spanish transcription is now publicly available. Check out our documentation for more information on Specifying a Language in your POST request.
Automatic Language Detection is now available for our /v2/transcript endpoint. This feature can identify the dominant language that’s spoken in an audio file and route the file to the appropriate model for the detected language.
Our new Custom Spelling feature gives you the ability to specify how words are spelled or formatted in the transcript text. For example, Custom Spelling could be used to change all instances of "CS 50" to "CS50".
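The effect of a spelling rule like the "CS 50" example can be illustrated locally with the sketch below. It is only an approximation of the outcome on transcript text: the function name and the rule format (desired spelling mapped to the variants it replaces) are hypothetical and do not describe the request format of the feature.

```python
def apply_custom_spelling(text, rules):
    """Rewrite each listed variant to its desired spelling (local illustration only)."""
    for desired, variants in rules.items():
        for variant in variants:
            text = text.replace(variant, desired)
    return text


rules = {"CS50": ["CS 50"]}
print(apply_custom_spelling("Welcome to CS 50, the intro course.", rules))
```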
Auto Chapters v6 Released
Released Auto Chapters v6, improving the summarization of longer chapters.
Auto Chapters v5 Released
Auto Chapters v5 released, improving headline and gist generation and quote formatting in the summary key.
Fixed an edge case in Dual-Channel files where initial words in an audio file would occasionally be missed in the transcription.
Regional Spelling Improvements
Region-specific spelling improved for en_uk and en_au language codes.
Improved the formatting of “MP3” in transcripts.
Improved Real-Time transcription error handling for corrupted audio files.
Real-Time v3 Released
Released v3 of our Real-Time Transcription model, improving overall accuracy by 18% and proper noun recognition by 23% relative to the v2 model.
Improved PII Redaction and Entity Detection for CREDIT_CARD_CVV and LOCATION.
Auto Chapters v4 Released, Auto Retry Feature Added
Added an Auto Retry feature, which automatically retries transcripts that fail with a "Server error, developers have been alerted" message. This feature is enabled by default. To disable it, visit the Account tab in your Developer Dashboard.
Auto Chapters v4 released, improving chapter summarization in the summary key.
Added a trailing period for the gist key in the Auto Chapters feature.
Auto Chapters v3 Released
Released v3 of our Auto Chapters model, improving the model’s ability to segment audio into chapters and chapter boundary detection by 56.3%.
Improved formatting for Auto Chapters summaries. The summary, headline, and gist keys now include better punctuation, casing, and text formatting.
Webhook Status Codes, Entity Detection Improved
POST requests from the API to webhook URLs will now accept any status code from 200 to 299 as a successful HTTP response. Previously only 200 status codes were accepted.
Updated the text key in our Entity Detection feature to return the proper noun rather than the possessive noun. For example, Andrew instead of Andrew’s.
Fixed an edge case with Entity Detection where under certain contexts, a disfluency could be identified as an entity.
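The webhook status code rule described above amounts to a simple range check, sketched here with a hypothetical function name. Any response your webhook endpoint sends in this range would count as a successful delivery.

```python
def is_successful_webhook_response(status_code):
    """Return True for any status code from 200 to 299, inclusive."""
    return 200 <= status_code <= 299


for code in (200, 204, 299, 301, 500):
    print(code, is_successful_webhook_response(code))
```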
Punctuation and Casing Accuracy Improved, Inverse Text Normalization Model Updated
Released v4 of our Punctuation model, increasing punctuation and casing accuracy by ~2%.
Updated our Inverse Text Normalization (ITN) model for our /v2/transcript endpoint, improving web address and email address formatting and fixing the occasional number formatting issue.
Fixed an edge case where multi-channel files would return no text when the two channels were out of phase with each other.
Support for Non-English Languages Coming Soon
Our Deep Learning team has been hard at work training our new non-English language models. In the coming weeks, we will be adding support for French, German, Italian, and Spanish.
Shorter Summaries Added to Auto Chapters, Improved Filler Word Detection
Added a new gist key to the Auto Chapters feature. This new key provides an ultra-short, usually 3 to 8 word summary of the content spoken during that chapter.
Implemented profanity filtering into Auto Chapters, which will prevent the API from generating a summary, headline, or gist that includes profanity.
Improved Filler Word (aka, disfluencies) detection by ~5%.
Improved accuracy for Real-Time Streaming Transcription.
Fixed an edge case where WebSocket connections for Real-Time Transcription sessions would occasionally not close properly after the session was terminated. This resulted in the client receiving a 4031 error code even after sending a session termination message.
Corrected a bug that occasionally attributed disfluencies to the wrong utterance when Speaker Labels or Dual-Channel Transcription was enabled.
v8.5 Asynchronous Transcription Model Released
Our Asynchronous Speech Recognition model is now even better with the release of v8.5.
This update improves overall accuracy by 4% relative to our v8 model.
This is achieved by improving the model’s ability to handle noisy or difficult-to-decipher audio.
The v8.5 model also improves Inverse Text Normalization for numbers.
New and Improved API Documentation
Launched the new AssemblyAI Docs, with more complete documentation and an easy-to-navigate interface so developers can effectively use and integrate with our API. Click here to view the new and improved documentation.
Added two new fields to the FinalTranscript response for Real-time Transcriptions. The punctuated key is a Boolean value indicating if punctuation was successful. The text_formatted key is a Boolean value indicating if Inverse Text Normalization (ITN) was successful.
Inverse Text Normalization Added to Real-Time, Word Boost Accuracy Improved
Inverse Text Normalization (ITN) added for our /v2/realtime and /v2/stream endpoints. ITN improves formatting of entities like numbers, dates, and proper nouns in the transcription text.
Improved accuracy for Custom Vocabulary (aka, Word Boosts) with the Real-Time transcription API.
Fixed an edge case that would sometimes cause transcription errors when disfluencies was set to true and no words were identified in the audio file.
Entity Detection Released, Improved Filler Word Detection, Usage Alerts
v1 release of Entity Detection - automatically detects a wide range of entities like person and company names, emails, addresses, dates, locations, events, and more.
To include Entity Detection in your transcript, set entity_detection to true in your POST request to /v2/transcript.
When your transcript is complete, you will see an entities key towards the bottom of the JSON response containing the entities detected, as shown here:
Usage Alert feature added, allowing customers to set a monthly usage threshold on their account along with a list of email addresses to be notified when that monthly threshold has been exceeded. This feature can be enabled by clicking “Set up alerts” on the “Developers” tab in the Dashboard.
When Content Safety is enabled, a summary of the severity scores detected will now be returned in the API response under the severity_score_summary nested inside of the content_safety_labels key, as shown below.
Improved Filler Word (aka, disfluencies) detection by ~25%.
Fixed a bug in Auto Chapters that would occasionally add an extra space between sentences for headlines and summaries.
Additional MIME Type Detection Added for OPUS Files
Added additional MIME type detection to detect a wider variety of OPUS files.
Fixed an issue with word timing calculations that caused issues with speaker labeling for a small number of transcripts.
Custom Vocabulary Accuracy Significantly Improved
Significantly improved the accuracy of Custom Vocabulary, and the impact of the boost_param field to control the weight for Custom Vocabulary.
Improved precision of word timings.
New Auto Chapters, Sentiment Analysis, and Disfluencies Features Released
v1 release of Auto Chapters - which provides a "summary over time" by breaking audio/video files into "chapters" based on the topic of conversation. Check out our blog to read more about this new feature. To enable Auto Chapters in your request, you can set auto_chapters: true in your POST request to /v2/transcript.
v1 release of Sentiment Analysis - that determines the sentiment of sentences in a transcript as "positive", "negative", or "neutral". Sentiment Analysis can be enabled by including the sentiment_analysis: true parameter in your POST request to /v2/transcript.
Filler-words like "um" and "uh" can now be included in the transcription text. Simply include disfluencies: true in your POST request to /v2/transcript.
Deployed Speaker Labels version 1.3.0. Improves overall diarization/labeling accuracy.
Improved our internal auto-scaling for asynchronous transcription, to keep turnaround times consistently low during periods of high usage.
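The three new parameters from this release (auto_chapters, sentiment_analysis, and disfluencies) could be combined in a single request body, as in this sketch. The helper name and the audio URL are hypothetical.

```python
def build_analysis_request(audio_url, auto_chapters=False,
                           sentiment_analysis=False, disfluencies=False):
    """Build a /v2/transcript request body with only the requested features enabled."""
    options = {
        "auto_chapters": auto_chapters,
        "sentiment_analysis": sentiment_analysis,
        "disfluencies": disfluencies,
    }
    body = {"audio_url": audio_url}
    # Include a flag only when it is switched on, keeping the body minimal.
    body.update({name: True for name, enabled in options.items() if enabled})
    return body


analysis_body = build_analysis_request(
    "https://example.com/podcast.mp3", auto_chapters=True, disfluencies=True
)
print(analysis_body)
```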
New Language Code Parameter for English Spelling
Added a new language_code parameter when making requests to /v2/transcript.
Developers can set this to en_us, en_uk, or en_au to ensure the correct English spelling is used: US English (default), British English, or Australian English, respectively.
Quick note: for customers that were historically using the assemblyai_en_au or assemblyai_en_uk acoustic models, the language_code parameter is essentially redundant and doesn't need to be used.
Fixed an edge-case where some files with prolonged silences would occasionally have a single word predicted, such as "you" or "hi."
New Features Coming Soon, Bug Fixes
This week, our engineering team has been hard at work preparing for the release of exciting new features like:
Chapter Detection: Automatically summarize audio and video files into segments (aka "chapters").
Sentiment Analysis: Determine the sentiment of sentences in your transcript as "positive", "negative", or "neutral".
Disfluencies: Detects filler-words like "um" and "uh".
Improved average real-time latency by 2.1% and p99 latency by 0.06%.
Fixed an edge-case where utterances for dual-channel audio files would occasionally receive a confidence score greater than 1.0.
Improved v8 Model Processing Speed
Improved the API's ability to handle audio/video files with a duration over 8 hours.
Further improved transcription processing times by 12%.
Fixed an edge case in our responses for dual channel audio files where if speaker 2 interrupted speaker 1, the text from speaker 2 would cause the text from speaker 1 to be split into multiple turns, rather than contextually keeping all of speaker 1's text together.
v8 Transcription Model Released
Today, we're happy to announce the release of our most accurate Speech Recognition model for asynchronous transcription to date—version 8 (v8).
This new model dramatically improves overall accuracy (up to 19% relative), and proper noun accuracy as well (up to 25% relative).
You can read more about our v8 model in our blog here.
Fixed an edge case where a small percentage of short (<60 seconds in length) dual-channel audio files, with the same audio on each channel, resulted in repeated words in the transcription.
v2 Real-Time and v4 Topic Detection Models Released
Released v2 of our Real-Time model, which improves the accuracy of our Real-Time Streaming Transcription by ~10%.
Launched our Topic Detection v4 model, with an accuracy boost of ~8.37% over v3 (read more on our blog).
v3 Topic Detection Model, PII Redaction Bug Fixes
Released our v3 Topic Detection model.
This model dramatically improves the Topic Detection feature's ability to accurately detect topics based on context.
For example, in the following text, the model was able to accurately predict "Rugby" without the mention of the sport directly, due to the mention of "Ed Robinson" (a Rugby coach).
PII Redaction has been improved to better identify (and redact) phone numbers even when they are not explicitly referred to as a phone number.
Released a fix for PII Redaction that corrects an issue where the model would sometimes detect phone numbers as credit card numbers or social security numbers.
Severity Scores for Content Safety
The API now returns a severity score along with the confidence and label keys when using the Content Safety feature.
The severity score measures how intense a detected Content Safety label is on a scale of 0 to 1.
For example, a natural disaster that leads to mass casualties will have a score of 1.0, while a small storm that breaks a mailbox will only be 0.1.
Fixed an edge case where a small number of transcripts with Automatic Transcript Highlights turned on were not returning any results.
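One way the severity score described above might be used is to keep only the Content Safety results above a chosen threshold. In this sketch the key name severity, the label value, and all numbers are made up for illustration; only the idea of a 0 to 1 score alongside label and confidence comes from the notes above.

```python
def filter_by_severity(labels, min_severity):
    """Keep results whose severity is at least min_severity (hypothetical helper)."""
    return [item for item in labels if item["severity"] >= min_severity]


# Illustrative values only; real responses will differ.
example_labels = [
    {"label": "disasters", "confidence": 0.98, "severity": 1.0},
    {"label": "disasters", "confidence": 0.91, "severity": 0.1},
]

severe = filter_by_severity(example_labels, min_severity=0.5)
print(severe)
```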
Real-time Transcription and Streaming Fixes
Fixed an edge case where higher sample rates would occasionally trigger a Client sent audio too fast error from the Real-Time Streaming WebSocket API.
Fixed an edge case where some streams from Real-Time Streaming WebSocket API were held open after a customer idled their session.
Fixed an edge case in the /v2/stream endpoint, where large periods of silence would occasionally cause automatic punctuation to fail.
Improved error handling when non-JSON input is sent to the /v2/transcript endpoint.
Punctuation v3, Word Search, Bug Fixes
v3 Punctuation Model released.
v3 brings improved accuracy to automatic punctuation and casing for both async (/v2/transcript) and real-time (WebSocket API) transcripts.
Released an all-new Word Search feature that will allow developers to search for words in a completed transcript.
This new feature returns how many times the word was spoken, the index of that word in the transcript's JSON response word list/array, and the associated timestamps for each matched word.
Fixed an issue causing a small subset of words not to be filtered when profanity filtering was turned on.
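To show the kind of information the Word Search feature returns (count, word indexes, and timestamps), here is a local sketch that computes the same three things over a list of words. The word dictionaries with text, start, and end keys, the result layout, and the function name are assumptions for illustration rather than the feature's actual response format.

```python
def search_words(words, query):
    """Find a word in a transcript word list, ignoring case and trailing punctuation."""
    query = query.lower()
    indexes = [
        i for i, word in enumerate(words)
        if word["text"].lower().strip(".,!?") == query
    ]
    return {
        "text": query,
        "count": len(indexes),
        "indexes": indexes,
        "timestamps": [[words[i]["start"], words[i]["end"]] for i in indexes],
    }


# Illustrative word list with start/end times in milliseconds.
words = [
    {"text": "Hello", "start": 0, "end": 400},
    {"text": "world,", "start": 450, "end": 900},
    {"text": "hello", "start": 1000, "end": 1400},
    {"text": "again.", "start": 1450, "end": 1900},
]

result = search_words(words, "hello")
print(result)
```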
New Dashboard and Real-Time Data
This week, we released an entirely new dashboard for developers:
The new developer dashboard introduces:
Better usage reports for API usage and spend
Easier access to API Docs, API Tokens, and Account information
A no-code demo to test the API on your audio/video files without having to write any code
General Improvements
Fixed a bug with PII Redaction, where sometimes dollar amount and date tokens were not being properly redacted.
AssemblyAI now supports even more audio/video file formats thanks to improvements to our audio transcoding pipeline!
Fixed a rare bug where a small percentage of transcripts (0.01%) would incorrectly sit in a status of "queued" for up to 60 seconds.
ITN Model Update
Today we've released a major improvement to our ITN (Inverse Text Normalization) model. This results in better formatting for entities within the transcription, such as phone numbers, money amounts, and dates.
For example:
Money:
Spoken: "Hey, do you have five dollars?"
Model output with ITN: "Hey, do you have $5?"
Years:
Spoken: "Yes, I believe it was back in two thousand eight"
Model output with ITN: "Yes, I believe it was back in 2008."
Punctuation Model v2.5 Released
Today we've released an updated Automatic Punctuation and Casing Restoration model (Punctuation v2.5)! This update improves capitalization of proper nouns in transcripts, reduces over-capitalization where some words were being incorrectly capitalized, and improves some edge cases involving words surrounded by commas. For example:
"....in the Us" now becomes "....in the US."
"whatsapp," now becomes "WhatsApp,"
Content Safety Model (v7) Released
We have released an updated Content Safety Model - v7! Performance has improved for 10 of the 19 Content Safety labels, with the biggest improvements being for the Profanity and Natural Disasters labels.
Real-Time Transcription Model v1.1 Released
We have just released a major real-time update!
Developers will now be able to use the word_boost parameter in requests to the real-time API, allowing you to introduce your own custom vocabulary to the model for that given session! This custom vocabulary will lead to improved accuracy for the provided words.
General Improvements
We now limit each real-time session to one WebSocket connection to ensure the integrity of a customer's transcription and prevent multiple users/clients from using the same WebSocket session.
Note: Developers can still have multiple real-time sessions open in parallel, up to the Concurrency Limit on the account. For example, if an account has a Concurrency Limit of 32, that account could have up to 32 concurrent real-time sessions open.
Topic Detection Model v2 Released
Today we have released v2 of our Topic Detection Model. This new model will predict multiple topics for each paragraph of text, whereas v1 was limited to predicting a single topic. For example, given the text:
"Elon Musk just released a new Tesla that drives itself!"
v1:
Automotive>AutoType>DriverlessCars: 1
v2:
Automotive>AutoType>DriverlessCars: 1
PopCulture: 0.84
PopCulture>CelebrityStyle: 0.56
This improvement will result in the visual output looking significantly better, and containing more informative responses for developers!
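A minimal Python sketch of how a client might consume multi-label output like the v2 example above: keep every topic whose score clears a threshold. The dictionary layout, the `topics_above` helper, and the 0.6 cutoff are assumptions for illustration, not the API's response schema.

```python
def topics_above(scores, threshold=0.5):
    """Return (label, score) pairs at or above `threshold`, highest first."""
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Scores copied from the v2 example above.
v2_scores = {
    "Automotive>AutoType>DriverlessCars": 1.0,
    "PopCulture": 0.84,
    "PopCulture>CelebrityStyle": 0.56,
}

# Hypothetical cutoff: drop topics scoring below 0.6.
selected = topics_above(v2_scores, threshold=0.6)
for label, score in selected:
    print(f"{label}: {score}")
```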
Increased Number of Categories Returned for Topic Detection Summary
In this minor improvement, we have increased the number of topics the model can return in the summary key of the JSON response from 10 to 20.
Temporary Tokens for Real-Time
Often, developers need to expose their AssemblyAI API Key in their client applications when establishing connections with our real-time streaming transcription API. Now, developers can create a temporary API token that expires after a customizable amount of time (similar to an AWS S3 Temporary Authorization URL) and can safely be exposed in client applications and front ends.
This will allow developers to create short-lived API tokens designed to be used securely in the browser, along with authorization within the query string!
For example, authenticating in the query parameters with a temporary token would look like so:
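A hedged Python sketch of building such a connection URL. The WebSocket base URL, the `sample_rate` and `token` parameter names, the `build_realtime_url` helper, and the placeholder token value are all assumptions for illustration; check the real-time documentation for the exact values.

```python
from urllib.parse import urlencode

# Assumed real-time WebSocket endpoint; verify against the current docs.
REALTIME_URL = "wss://api.assemblyai.com/v2/realtime/ws"


def build_realtime_url(temporary_token, sample_rate=16000):
    """Return a WebSocket URL that authenticates via the query string."""
    query = urlencode({"sample_rate": sample_rate, "token": temporary_token})
    return f"{REALTIME_URL}?{query}"


# "TEMPORARY-TOKEN" is a placeholder for a token created server-side.
url = build_realtime_url("TEMPORARY-TOKEN")
print(url)
```

Because only the short-lived token appears in the URL, the long-lived API key never has to leave the server.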
Adding "Marijuana" and "Sensitive Social Issues" as Possible Content Safety Labels
In this minor update, we improve the accuracy across all Content Safety labels, and add two new labels for better content categorization. The two new labels are sensitive_social_issues and marijuana.
New label definitions:
sensitive_social_issues: This category includes content that may be considered insensitive, irresponsible, or harmful to specific groups based on their beliefs, political affiliation, sexual orientation, or gender identity.
marijuana: This category includes content that discusses marijuana or its usage.
Real-Time Transcription is Now GA
We are pleased to announce the official release of our Real-Time Streaming Transcription API! This API uses WebSockets and a fast Conformer Neural Network architecture that allows for quick and accurate transcription in real time.
With this minor update, our Redaction Model will better detect Social Security Numbers and Medical References for additional security and data protection!
New Punctuation Model (v2)
Today we released a new punctuation model that is more extensive than its predecessor, and will drive improvements in punctuation and casing accuracy!
New Features & Updates
List Historical Transcripts
Developers can get a list of their historical transcriptions. This list can be filtered by status and date. This new endpoint will allow developers to see if they have any queued, processing, or throttled transcriptions.
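As a rough sketch of what such a filtered listing request could look like in Python, the snippet below builds (but does not send) a GET request. The endpoint URL, the `status`, `created_on`, and `limit` query parameter names, the example date, and the `build_list_request` helper are assumptions for illustration; consult the API reference for the exact names.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint and query parameter names; verify against the API reference.
TRANSCRIPT_LIST_URL = "https://api.assemblyai.com/v2/transcript"


def build_list_request(api_key, status=None, created_on=None, limit=10):
    """Build (but do not send) a GET request for historical transcripts."""
    params = {"limit": limit}
    if status is not None:
        params["status"] = status          # e.g. "queued" or "processing"
    if created_on is not None:
        params["created_on"] = created_on  # illustrative date string
    url = f"{TRANSCRIPT_LIST_URL}?{urlencode(params)}"
    return Request(url, headers={"Authorization": api_key}, method="GET")


request = build_list_request("YOUR-KEY-HERE", status="queued", created_on="2021-06-01")
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without network access.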
Pre-Formatted Paragraphs
Developers can now get pre-formatted paragraphs by calling our new paragraphs endpoint! The model will attempt to semantically break the transcript up into paragraphs of five sentences or fewer.
Now each topic will include timestamps for each segment of classified text. We have also added a new summary key that will contain the confidence of all unique topics detected throughout the entire transcript.
We have made improvements to our Speaker Diarization Model that increase accuracy over short and long transcripts.
New PII Classes
We have released an update to our PII Redaction Model that will now support detecting and redacting additional classes!
blood_type
medical_condition
drug (including vitamins/minerals)
injury
medical_process
Entity Definitions:
blood_type: Blood type
medical_condition: A medical condition. Includes diseases, syndromes, deficits, disorders. E.g., chronic fatigue syndrome, arrhythmia, depression.
drug: Medical drug, including vitamins and minerals. E.g., Advil, Acetaminophen, Panadol
injury: Human injury, e.g., I broke my arm, I have a sprained wrist. Includes mutations, miscarriages, and dislocations.
medical_process: Medical process, including treatments, procedures, and tests. E.g., "heart surgery," "CT scan."
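A small Python sketch of how a request body enabling these new classes might be assembled. The `redact_pii` and `redact_pii_policies` field names, the `build_redaction_request` helper, and the example audio URL are assumptions for illustration; only the five class names come from this release note.

```python
import json

# The five classes added in this release.
NEW_PII_POLICIES = ["blood_type", "medical_condition", "drug", "injury", "medical_process"]


def build_redaction_request(audio_url, policies):
    """Return a JSON body for a transcript request with PII redaction enabled.

    The "redact_pii" and "redact_pii_policies" keys are assumed field names.
    """
    return json.dumps({
        "audio_url": audio_url,
        "redact_pii": True,
        "redact_pii_policies": list(policies),
    })


# Placeholder audio URL; replace with a real, accessible file.
body = build_redaction_request("https://example.com/audio.mp3", NEW_PII_POLICIES)
print(body)
```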