Data retention and model training
Model training
We consider model training critical to providing you with the most accurate models and services we can. Only certain files submitted to the API, as permitted by the applicable contract, are used for model training. These files undergo a redaction process designed to remove personally identifiable information before any remaining data is used for model training. We will not use files you submit for model training if you are subject to a Business Associate Addendum, are using our European servers, or have opted out of model training. You can find more information on whether and how to opt out here.
LLM Gateway model training
AssemblyAI has opted out of model training with all LLM Gateway providers.
Please note this is separate from whether AssemblyAI may train our models with your data. You can find more information on whether and how to opt out of data sharing for our model improvement program here.
Encryption
Data at rest is encrypted with AES-128 or AES-256, and data in transit uses TLS 1.2+. AssemblyAI posts quarterly SSL scans to its Trust Center to verify that its service uses TLS with modern cipher suites.
Async
For transcription of pre-recorded audio, AssemblyAI supports the following TLS versions and cipher suites:
Supported TLS versions:
- TLS 1.3
- TLS 1.2
Supported cipher suites:
- TLS_AES_128_GCM_SHA256
- TLS_AES_256_GCM_SHA384
- TLS_CHACHA20_POLY1305_SHA256
- ECDHE-ECDSA-AES128-GCM-SHA256
- ECDHE-RSA-AES128-GCM-SHA256
- ECDHE-ECDSA-AES256-GCM-SHA384
- ECDHE-RSA-AES256-GCM-SHA384
Streaming
For transcription of streaming audio, AssemblyAI supports the following TLS version and cipher suites:
Supported TLS versions:
- TLS 1.3
Supported cipher suites:
- TLS_AES_128_GCM_SHA256
- TLS_AES_256_GCM_SHA384
- TLS_CHACHA20_POLY1305_SHA256
Ensure your client or application is configured to use one of the supported TLS versions and cipher suites when connecting to AssemblyAI services.
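As a minimal sketch of that check, a Python client can pin its minimum TLS version with the standard library and inspect what a connection actually negotiates. The hostname below assumes AssemblyAI's public API endpoint, and the helper name `negotiated_tls` is ours for illustration:

```python
import socket
import ssl

# Build a client context that refuses anything older than TLS 1.2,
# the minimum version the async endpoints accept.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2


def negotiated_tls(host="api.assemblyai.com", port=443):
    """Open a TLS connection and return (protocol_version, cipher_suite)."""
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            # e.g. ("TLSv1.3", "TLS_AES_256_GCM_SHA384")
            return tls.version(), tls.cipher()[0]
```

Calling `negotiated_tls()` lets you confirm that the negotiated protocol and cipher suite appear in the supported lists above; for streaming connections, raise `ctx.minimum_version` to `ssl.TLSVersion.TLSv1_3`.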
GDPR compliance
We have designed our products with GDPR principles top of mind, but we also understand that privacy compliance is a moving target. As privacy requirements continue to evolve rapidly, we are constantly working to assess and improve our practices. You can read more about our privacy practices in our Privacy Policy here and our Data Processing Addendum here.
SOC 2 certification
We have both SOC 2 Type 1 and Type 2 certifications. You can find more information on this on our Trust Center. We also have a blog post on the subject, which you can find here.
Data retention
Streaming production environment
If you are opted out of model training, we offer zero data retention of audio and transcripts for our Streaming product. Certain metadata about the transcript is stored and maintained for logging and billing purposes.
The model training environment differs from the production environment. You can find more information on model training in our Model Training section. If you would like to opt out of model training, please see our Opt-Out FAQ.
Asynchronous production environment
*Certain metadata is stored for logging and billing purposes.
**The minimum TTL that AssemblyAI may set for Final Transcription Artifacts in the asynchronous production environment is 1 (one) hour. AssemblyAI implements TTL through Amazon Web Services' (“AWS”) DynamoDB TTL mechanism (the “AWS TTL”). The deletion process begins in AWS at TTL expiration but is subject to AWS TTL processing times. In practice, deletion typically takes place anywhere from a few minutes to a few hours after the process begins in AWS, depending on circumstances, including server location; however, we have seen lag times ranging from 2–3 hours to a few days. Once the artifact is deleted in AWS, AssemblyAI processes the deletion almost immediately. See here for more information about AWS's TTL mechanism.
Confirming deletion
Should you wish to confirm that a file has been deleted, or if you did not store the transcript_id when the transcription request was made, you can retrieve a list of all transcripts. Make a GET request to https://api.assemblyai.com/v2/transcript to return a list of all transcripts created, or append a transcript_id to review a single transcript.
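The lookup above can be sketched with Python's standard library. This is a minimal illustration, not an SDK method: the helper name `get_transcript` is ours, and `api_key` stands in for your own AssemblyAI API key:

```python
import json
import urllib.request

BASE_URL = "https://api.assemblyai.com/v2/transcript"


def get_transcript(api_key, transcript_id=None):
    """Fetch the list of transcripts, or one transcript when an ID is given.

    Pass your AssemblyAI API key in the `authorization` header; inspect the
    response to confirm the transcript's data is no longer present.
    """
    url = BASE_URL if transcript_id is None else f"{BASE_URL}/{transcript_id}"
    req = urllib.request.Request(url, headers={"authorization": api_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `get_transcript("YOUR_API_KEY")` returns the paginated transcript list, and `get_transcript("YOUR_API_KEY", "some-transcript-id")` returns a single transcript record.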
The model training environment differs from the production environment. You can find more information on model training in our Model Training section. If you would like to opt out of model training, please see our Opt-Out FAQ.
LLM Gateway production environment
- If you have an executed BAA and use either Anthropic or Google inference models, we offer zero data retention for LLM Gateway inputs and outputs. Certain metadata is stored for logging and billing purposes.
- If you have a designated TTL on your LLM Gateway account, we delete inputs and outputs on an hourly basis. Certain metadata is stored for logging and billing purposes.
- If a customer initiates a deletion request, inputs and outputs are deleted at the time of the request. Certain metadata is stored for logging and billing purposes. For deletion of speech understanding requests, please see below.
- For speech understanding requests, such as translation, speaker ID, or custom formatting, retention is linked to the life of an asynchronous Final Transcription Artifact, noted above.
For more information about how Anthropic, Google, and OpenAI retain data, please refer to the Available Models charts in our LLM Gateway Overview.
AssemblyAI has opted out of model training with all LLM Gateway providers. Please note this is separate from whether AssemblyAI may train our models with your data, and the model training environment differs from the production environment. You can find more information on model training in our Model Training section. If you would like to opt out of model training, please see our Opt-Out FAQ.