Introducing Auto Chapters - Summarize Audio and Video Files

Today, we're excited to officially announce our newest feature at AssemblyAI - Auto Chapters.

The Auto Chapters, or Text Summarization, feature provides a "summary over time" for audio content transcribed with AssemblyAI's Speech-to-Text API. It works by first breaking audio/video files into logical "chapters" as the topic of conversation changes, and then generating a summary for each "chapter" of content.

In the above graphic, we demonstrate the "chapters" that were extracted, and the summaries that were generated, from Joe Biden's State of the Union address. For each "chapter" that is detected, the API returns a JSON object like the one below:

"chapters": [
    {
        "start": 0,
        "end": 20000,
        "summary": "The American job plan is going to create millions of good paying jobs.  jobs created in an American jobs plan do not require a College degree. 75% don't require an associate's degree.",
        "headline": "The American job plan is going to create millions of good paying jobs."
    },
    ...
]

As you can see above, for each "chapter" that was detected the API responds with the start and end timestamps (in milliseconds), a summary of a few sentences covering the content spoken during that timeframe, and a short headline, which can be thought of as a "summary of the summary".

Auto Chapters In Action

Below is the entire 1 hour and 43 minute State of the Union address that Biden gave to Congress on April 28, 2021, and the "chapters" that were detected by the AssemblyAI API, along with their summaries.

1:45: I have the high privilege and distinct honor to present to you the President of the United States.

31:42: 90% of Americans now live within 5 miles of a vaccination site.

44:28: The American job plan is going to create millions of good paying jobs.

47:59: No one working 40 hours a week should live below the poverty line.

48:22: American jobs finally be the biggest increase in non defense research and development.

49:21: The National Institute of Health, the NIH, should create a similar advanced research Projects agency for Health.

50:31: It would have a singular purpose to develop breakthroughs to prevent, detect and treat diseases like Alzheimer's, diabetes and cancer.

51:29: I wanted to lay out before the Congress my plan.

52:19: When this nation made twelve years of public education universal in the last century, it made us the best educated, best prepared nation in the world.

54:25: The American Family's Plan guarantees four additional years of public education for every person in America, starting as early as we can.

57:08: American Family's Plan will provide access to quality, affordable childcare.

61:58: I will not impose any tax increase on people making less than $400,000.

67:34: He said the U.S. will become an Arsenal for vaccines for other countries.

74:12: After 20 years of value, Valor and sacrifice, it's time to bring those troops home.

76:01: We have to come together to heal the soul of this nation.

80:02: Gun violence has become an epidemic in America.

84:23: If you believe we need to secure the border, pass it.

85:00: Congress needs to pass legislation this year to finally secure protection for dreamers.

87:02: If we want to restore the soul of America, we need to protect the right to vote.
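
A listing like the one above can be produced from the chapters array with a few lines of code. The following is a minimal Python sketch, assuming the response has already been parsed into a list of dictionaries with the start and headline keys shown earlier. The helper names format_timestamp and chapter_listing, and the sample data, are illustrative and not part of the AssemblyAI API. Minutes are deliberately not wrapped at 60, to match the listing above.

```python
def format_timestamp(ms: int) -> str:
    """Convert milliseconds to M:SS, letting minutes run past 59."""
    total_seconds = ms // 1000
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


def chapter_listing(chapters: list[dict]) -> list[str]:
    """Return one 'M:SS: headline' line per chapter."""
    return [f"{format_timestamp(c['start'])}: {c['headline']}" for c in chapters]


if __name__ == "__main__":
    # Hypothetical sample data in the shape of the "chapters" array.
    sample = [
        {"start": 2668000, "end": 2879000,
         "headline": "The American job plan is going to create millions of good paying jobs."},
        {"start": 2879000, "end": 2902000,
         "headline": "No one working 40 hours a week should live below the poverty line."},
    ]
    for line in chapter_listing(sample):
        print(line)
```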

How Auto Chapters Works

Behind the Auto Chapters feature is a pair of machine learning models. The first model segments an audio file into "chapters" (i.e., it detects when the topic changes), and the second model condenses each chapter into a bite-sized summary.

Use Cases

Below are some of the ways our customers are already using the Auto Chapters feature:

Video Platforms - Automatically create "video chapters" so users can navigate videos more easily and jump to the content they're looking for.
Podcast Players - Extract interesting segments of a podcast episode, and make podcast episodes more searchable so users can jump to key parts of an episode to "sample" an episode before listening to the entire thing.
Virtual Meeting Platforms - Offer summaries of the key parts of a meeting, and make meeting recordings easier to consume after the fact.
Telephony - Make phone calls easier to navigate, especially when doing QA within contact centers.

Using the Auto Chapters Feature

When requesting a transcription with the AssemblyAI API, include the auto_chapters parameter, set to true, in the JSON body of your POST request. For example, in cURL:

curl --request POST \
  --url https://api.assemblyai.com/v2/transcript \
  --header 'authorization: YOUR-API-TOKEN' \
  --header 'content-type: application/json' \
  --data '{"audio_url": "https://foo.bar/7510.mp3", "auto_chapters": true}'
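
The same request can be made from Python. The sketch below uses only the standard library and mirrors the cURL call above: the URL, the authorization and content-type headers, and the request body are taken from that example, while the function names build_transcript_request and submit are hypothetical. Building the request is kept separate from sending it so the payload can be inspected without touching the network.

```python
import json
import urllib.request

API_URL = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(api_token: str, audio_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request with Auto Chapters enabled."""
    payload = {"audio_url": audio_url, "auto_chapters": True}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "authorization": api_token,
            "content-type": "application/json",
        },
        method="POST",
    )


def submit(request: urllib.request.Request) -> dict:
    """Send the request and return the parsed JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling submit(build_transcript_request("YOUR-API-TOKEN", "https://foo.bar/7510.mp3")) would then issue the request; error handling and retries are left out of this sketch.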

When your transcription is completed, you'll see a chapters key in the JSON response, like the one below:

{
    "audio_duration": 12.0960090702948,
    "audio_url": "https://s3-us-west-2.amazonaws.com/blog.assemblyai.com/audio/8-7-2018-post/7510.mp3",
    "confidence": 0.956,
    "id": "5551722-f677-48a6-9287-39c0aafd9ac1",
    "status": "completed",
    "text": "The American job plan ...",
    # auto chapter results can be found in the JSON result here
    "chapters": [
        {
            "start": 0,
            "end": 20000,
            "summary": "The American job plan is going to create millions of good paying jobs.  jobs created in an American jobs plan do not require a College degree. 75% don't require an associate's degree.",
            "headline": "The American job plan is going to create millions of good paying jobs."
        },
        ...
    ],
    "words": [
        {
            "confidence": 1.0,
            "end": 440,
            "start": 0,
            "text": "You"
        },
        ...
    ]
}

Isolating the chapters key for a moment, we can drill into the JSON response here:

"chapters": [
    {
        "start": 0,
        "end": 20000,
        "summary": "The American job plan is going to create millions of good paying jobs.  jobs created in an American jobs plan do not require a College degree. 75% don't require an associate's degree.",
        "headline": "The American job plan is going to create millions of good paying jobs."
    },
    ...
]

For each chapter that was detected, the API will include the start and end timestamps (in milliseconds), a summary of a few sentences covering the content spoken during that timeframe, and a short headline, which can be thought of as a "summary of the summary".
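
Because both the chapters and the words arrays carry millisecond timestamps, the two can be combined to recover the transcript text that belongs to each chapter. The following Python sketch assumes a parsed response with the chapters and words keys shown above. The function names are hypothetical, and treating a chapter's end timestamp as exclusive is an assumption made for this example rather than documented API behavior.

```python
def words_in_chapter(words: list[dict], chapter: dict) -> str:
    """Join the text of every word that starts inside the chapter's time range.

    Assumption: the range is [start, end), i.e. the end timestamp is exclusive.
    """
    return " ".join(
        w["text"] for w in words
        if chapter["start"] <= w["start"] < chapter["end"]
    )


def chapter_transcripts(response: dict) -> list[tuple[str, str]]:
    """Return (headline, transcript text) pairs, one per detected chapter."""
    words = response.get("words", [])
    return [
        (chapter["headline"], words_in_chapter(words, chapter))
        for chapter in response.get("chapters", [])
    ]


if __name__ == "__main__":
    # Hypothetical, shortened response in the shape shown above.
    response = {
        "chapters": [
            {"start": 0, "end": 1000, "summary": "...", "headline": "First topic"},
            {"start": 1000, "end": 2000, "summary": "...", "headline": "Second topic"},
        ],
        "words": [
            {"start": 0, "end": 440, "text": "You", "confidence": 1.0},
            {"start": 500, "end": 900, "text": "know", "confidence": 0.9},
            {"start": 1000, "end": 1400, "text": "Next", "confidence": 0.9},
        ],
    }
    for headline, text in chapter_transcripts(response):
        print(f"{headline}: {text}")
```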