Entity Detection

The Entity Detection model is a powerful tool that can automatically identify and categorize key information in transcribed audio content, such as the names of people, organizations, addresses, phone numbers, medical data, social security numbers, and more.


When submitting files for transcription, include the entity_detection parameter in your request body and set it to true.
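As a minimal sketch of such a request body, the helper below builds the JSON payload with Entity Detection enabled. The endpoint URL, audio URL, and helper name are illustrative; substitute your own values and authentication when actually submitting the request.

```python
import json

# Transcript endpoint (substitute your own API key via the
# `authorization` header when sending the request).
API_ENDPOINT = "https://api.assemblyai.com/v2/transcript"

def build_transcript_request(audio_url: str) -> dict:
    """Build a transcription request body with Entity Detection enabled."""
    return {
        "audio_url": audio_url,
        "entity_detection": True,  # turn the Entity Detection model on
    }

body = build_transcript_request("https://example.com/audio.mp3")
print(json.dumps(body))
```

The resulting dictionary can then be POSTed to the transcript endpoint with your HTTP client of choice.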

You can explore the full JSON response here:


You can run this code snippet in Colab here, or view the full source code here.

Understanding the response

The JSON object above contains all information about the transcription. Depending on which models are used to analyze the audio, the attributes of this object will vary. For example, in the quickstart above we did not enable Summarization, which is reflected by the summarization: false key-value pair in the JSON. Had we enabled Summarization, the summary, summary_type, and summary_model keys would contain the file summary (and additional details) rather than their current null values.

To access the Entity Detection information, we use the entity_detection and entities keys:
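For example, assuming the parsed JSON response is available as a dictionary called results (the entity values shown here are made up for illustration), the detected entities can be read like this:

```python
# Hypothetical parsed transcript response -- in practice, `results` is the
# JSON object returned by the transcription request.
results = {
    "entity_detection": True,
    "entities": [
        {"entity_type": "person_name", "text": "Doug Jones", "start": 1200, "end": 1680},
        {"entity_type": "organization", "text": "CNN", "start": 4000, "end": 4300},
    ],
}

# Only read `entities` when Entity Detection was enabled for the request.
if results["entity_detection"]:
    for entity in results["entities"]:
        print(f'{entity["entity_type"]}: {entity["text"]} '
              f'({entity["start"]}-{entity["end"]} ms)')
```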

The reference table below lists all relevant attributes along with their descriptions, where we've called the JSON response object results. Object attributes are accessed via dot notation, and arbitrary array elements are denoted with [i]. For example, results.words[i].text refers to the text attribute of the i-th element of the words array in the JSON results object.

| Attribute | Type | Description |
| --- | --- | --- |
| results.entity_detection | boolean | Whether Entity Detection was enabled in the transcription request |
| results.entities | array | An array of detected entities |
| results.entities[i].entity_type | string | The type of entity for the i-th detected entity |
| results.entities[i].text | string | The text for the i-th detected entity |
| results.entities[i].start | number | The starting time, in milliseconds, at which the i-th detected entity appears in the audio file |
| results.entities[i].end | number | The ending time, in milliseconds, for the i-th detected entity in the audio file |

All entities supported by the model

The model is designed to automatically detect and classify various types of entities within the transcription text. The detected entities and their corresponding types are listed individually in the entities key of the response object, ordered by when they first appear in the transcript.

| entity_type | Description |
| --- | --- |
| banking_information | Banking information, including account and routing numbers |
| blood_type | Blood type (e.g., O-, AB positive) |
| credit_card_cvv | Credit card verification code (e.g., CVV: 080) |
| credit_card_expiration | Expiration date of a credit card |
| credit_card_number | Credit card number |
| date | Specific calendar date (e.g., December 18) |
| date_of_birth | Date of birth (e.g., Date of Birth: March 7, 1961) |
| drivers_license | Driver's license number (e.g., DL #356933-540) |
| drug | Medications, vitamins, or supplements (e.g., Advil, Acetaminophen, Panadol) |
| email_address | Email address |
| event | Name of an event or holiday (e.g., Olympics, Yom Kippur) |
| injury | Bodily injury (e.g., I broke my arm, I have a sprained wrist) |
| language | Name of a natural language (e.g., Spanish, French) |
| location | Any location reference, including mailing address, postal code, city, state, province, or country |
| medical_condition | Name of a medical condition, disease, syndrome, deficit, or disorder (e.g., chronic fatigue syndrome, arrhythmia, depression) |
| medical_process | Medical process, including treatments, procedures, and tests (e.g., heart surgery, CT scan) |
| money_amount | Name and/or amount of currency (e.g., 15 pesos, $94.50) |
| nationality | Terms indicating nationality, ethnicity, or race (e.g., American, Asian, Caucasian) |
| occupation | Job title or profession (e.g., professor, actors, engineer, CPA) |
| organization | Name of an organization (e.g., CNN, McDonalds, University of Alaska) |
| password | Account passwords, PINs, access keys, or verification answers (e.g., 27%alfalfa, temp1234, My mother's maiden name is Smith) |
| person_age | Number associated with an age (e.g., 27, 75) |
| person_name | Name of a person (e.g., Bob, Doug Jones) |
| phone_number | Telephone or fax number |
| political_affiliation | Terms referring to a political party, movement, or ideology (e.g., Republican, Liberal) |
| religion | Terms indicating religious affiliation (e.g., Hindu, Catholic) |
| time | Expressions indicating clock times (e.g., 19:37:28, 10pm EST) |
| url | Internet addresses |
| us_social_security_number | Social Security Number or equivalent |
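Because each element of the entities array carries an entity_type, it is straightforward to pull out all entities of one type while preserving transcript order. The sample entities and the helper name below are illustrative assumptions, not part of the API response:

```python
# Hypothetical `entities` fragment from a transcript response.
entities = [
    {"entity_type": "person_name", "text": "Bob", "start": 500, "end": 800},
    {"entity_type": "location", "text": "Alaska", "start": 2100, "end": 2500},
    {"entity_type": "person_name", "text": "Doug Jones", "start": 6000, "end": 6400},
]

def entities_of_type(entities: list, entity_type: str) -> list:
    """Collect the text of every entity with a given entity_type,
    preserving the order in which they appear in the transcript."""
    return [e["text"] for e in entities if e["entity_type"] == entity_type]

print(entities_of_type(entities, "person_name"))  # → ['Bob', 'Doug Jones']
```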


How does the Entity Detection model handle misspellings or variations of entities?

The model is capable of identifying entities with variations in spelling or formatting. However, the accuracy of the detection may depend on the severity of the variation or misspelling.

Can the Entity Detection model identify custom entity types?

No, the Entity Detection model currently doesn't support the detection of custom entity types. However, the model is capable of detecting a wide range of predefined entity types, including people, organizations, locations, dates, times, addresses, phone numbers, medical data, and banking information, among others.

How can I improve the accuracy of the Entity Detection model?

To improve the accuracy of the Entity Detection model, it's recommended to provide high-quality audio files with clear, distinct speech. In addition, it's important to ensure that the audio content is relevant to the use case and that the entities being detected are relevant to the intended analysis. Finally, it may be helpful to review and adjust the model's configuration parameters, such as the confidence threshold for entity detection, to optimize the results.