Does AssemblyAI offer Zero Data Retention?
As of July 23, 2024, AssemblyAI does not have a formal Zero Data Retention (ZDR) policy. Under our general customer data retention policy, we retain customer data in the ordinary course of business, subject to minimum purpose limitation principles.
Specifically, we retain data for the longest of the following periods:
- as specified in the applicable customer contract
- as required by applicable law or industry certification requirements
- as needed for the limited purpose for which the data was collected and processed.
Nonetheless, we apply a risk-conservative data deletion protocol (described below) for key customer files, and we provide mechanisms through which customers can delete their own files, which lets them configure and achieve ZDR status.
Default Data Retention: Audio & Transcripts
Implementation of ZDR: Pre-recorded audio
This section applies only to the pre-recorded audio transcription endpoint: https://api.assemblyai.com/v2/transcript
ZDR for pre-recorded audio can be achieved by submitting a deletion request after the API has transcribed a file. Each file is assigned a unique identifier, known as the transcript_id, which the customer can store and/or retrieve via a GET request.
After a successful POST request for transcription to https://api.assemblyai.com/v2/transcript, the JSON response includes an id key. This id is the unique identifier for the transcript, and it can be used to retrieve the final transcript once the transcription job has completed on the AssemblyAI side. We do not recommend running any deletion before a GET request has retrieved the final transcript, since you will not yet have received the result from the API.
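The submission step can be sketched as follows. This is a minimal illustration using only the Python standard library, not an official client. The authorization header and the audio_url body field are assumptions that do not appear in this article, and the helper names are hypothetical.

```python
import json
import urllib.request

TRANSCRIPT_URL = "https://api.assemblyai.com/v2/transcript"


def build_submit_request(api_key, audio_url):
    """Build the POST request that starts a transcription job.

    Assumption: the API key goes in an "authorization" header and the
    audio location in an "audio_url" JSON field.
    """
    body = json.dumps({"audio_url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        TRANSCRIPT_URL,
        data=body,
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


def extract_transcript_id(response_body):
    """Return the id key from the JSON response of the POST request."""
    return json.loads(response_body)["id"]


def submit(api_key, audio_url):
    """Send the request and return the transcript_id to store for later."""
    request = build_submit_request(api_key, audio_url)
    with urllib.request.urlopen(request) as response:
        return extract_transcript_id(response.read())
```

Storing the returned id on your side at this point avoids having to look it up later through the transcript list.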
After a successful GET request to https://api.assemblyai.com/v2/transcript/:transcript_id, check the status key of the response for a value of completed or error. If the status is neither completed nor error, continue polling until the transcript is ready. Once the status is completed, store all outputs in your own database for record keeping. If the status is error, check the error key of the response to diagnose what went wrong, then re-run the file with the recommended changes.
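The polling loop described above might look like the sketch below. The fetch function is passed in as a parameter so the loop can be exercised without network access. The authorization header, the interval, and the attempt limit are illustrative assumptions, and the function names are hypothetical.

```python
import json
import time
import urllib.request

API_BASE = "https://api.assemblyai.com/v2/transcript"
TERMINAL_STATUSES = {"completed", "error"}


def fetch_transcript(api_key, transcript_id):
    """GET a single transcript and return the parsed JSON response.

    Assumption: the API key is sent in an "authorization" header.
    """
    request = urllib.request.Request(
        f"{API_BASE}/{transcript_id}",
        headers={"authorization": api_key},
        method="GET",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def poll_until_done(fetch, transcript_id, interval_s=3.0, max_attempts=200,
                    sleep=time.sleep):
    """Call fetch(transcript_id) until status is completed or error."""
    for _ in range(max_attempts):
        result = fetch(transcript_id)
        if result.get("status") in TERMINAL_STATUSES:
            return result
        sleep(interval_s)  # wait before asking again
    raise TimeoutError(f"transcript {transcript_id} did not finish in time")
```

A caller could bind the API key with `lambda tid: fetch_transcript(api_key, tid)` and then inspect the error key of the returned dictionary when the status is error.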
Once a transcript has been retrieved with a completed status, or has returned an error status, delete it with a DELETE request to https://api.assemblyai.com/v2/transcript/:transcript_id. This triggers a deletion job on AssemblyAI’s end, after which any sensitive data from the request is no longer accessible via the API. Sensitive data here means the information contained in the actual request (audio file URL, transcript text, and any model outputs). Other logged data about the API request is retained for record keeping.
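The ordering that matters here (store the result first, delete second) can be sketched as follows. The store and delete callables are hypothetical hooks supplied by the caller, and the authorization header is an assumption not stated in this article.

```python
import urllib.request

API_BASE = "https://api.assemblyai.com/v2/transcript"


def build_delete_request(api_key, transcript_id):
    """Build the DELETE request for one transcript.

    Assumption: the API key is sent in an "authorization" header.
    """
    return urllib.request.Request(
        f"{API_BASE}/{transcript_id}",
        headers={"authorization": api_key},
        method="DELETE",
    )


def store_then_delete(transcript_id, result, store, delete):
    """Persist a finished result locally, then request its deletion.

    Refuses to delete while the job is still running, because the
    result could no longer be retrieved afterwards.
    """
    if result.get("status") not in ("completed", "error"):
        raise ValueError("transcript is not finished; keep polling first")
    store(result)          # save outputs (or the error) in your own database
    delete(transcript_id)  # only now ask for deletion


def send_delete(api_key, transcript_id):
    """Send the DELETE request and return the HTTP status code."""
    request = build_delete_request(api_key, transcript_id)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Because store runs before delete, an exception raised while saving the result prevents the deletion request from being sent.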
Regarding the deletion of audio data: if you used a presigned URL and host the audio file in your own cloud environment, all audio data will also be deleted when the deletion request for the given transcript_id is submitted. If you used the upload endpoint for your audio file (https://www.assemblyai.com/docs/api-reference/files/upload), the audio file data from that endpoint is deleted on a schedule after 2 days. All intermediate artifacts used to process the file (transcoded audio, original audio files, etc.) are deleted on a schedule after 3 days, unless a deletion request is submitted.
You can also reference the API documentation for the requests above:
- Transcribe a file - POST request
- Retrieve a transcript - GET request
- Delete a transcript - DELETE request
- Handling transcription errors
Confirming Data Deletion: Pre-recorded audio
Should you wish to confirm that a file has been deleted, or if you did not store the transcript_id when the transcription request was made, you can make a GET request to https://api.assemblyai.com/v2/transcript, which returns a list of all transcripts associated with your account and their current status. Alternatively, specify a transcript_id to get the status of a single transcript.
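A lookup over the transcript list could be sketched like this. The shape of the list response (a transcripts array whose entries carry id and status keys) and the authorization header are assumptions not confirmed by this article, so check them against the API reference before relying on this.

```python
import json
import urllib.request

TRANSCRIPT_URL = "https://api.assemblyai.com/v2/transcript"


def build_list_request(api_key):
    """Build the GET request that lists transcripts for the account.

    Assumption: the API key is sent in an "authorization" header.
    """
    return urllib.request.Request(
        TRANSCRIPT_URL,
        headers={"authorization": api_key},
        method="GET",
    )


def index_statuses(listing):
    """Map each transcript id to its status.

    Assumption: the response looks like
    {"transcripts": [{"id": "...", "status": "..."}, ...]}.
    """
    return {item["id"]: item.get("status")
            for item in listing.get("transcripts", [])}


def list_statuses(api_key):
    """Fetch the list and return an id -> status dictionary."""
    with urllib.request.urlopen(build_list_request(api_key)) as response:
        return index_statuses(json.loads(response.read()))
```

The resulting dictionary can be compared against the ids you expected to delete, or used to recover ids that were never stored.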
Implementation of ZDR: Streaming audio
This section applies only to the streaming audio transcription endpoint: wss://streaming.assemblyai.com/v3/ws
ZDR for the streaming audio endpoint is natively supported by the design of the API. By default, we do not store or maintain any of the audio streamed to us: it is processed for transcription and then discarded. Certain metadata about the transcript is stored and maintained for logging and billing purposes, but none of the original audio is stored.