Have you ever needed to take notes from a YouTube video lecture? Back in college I had some hard classes where I leaned heavily on "YouTube University" and took notes from whatever videos I could find. A lot of the time I had to stop, rewind, and replay a video several times because it was going too fast to take notes. Today I'll show you a way around this problem: getting the transcript of the video via AssemblyAI's transcription API. You can find the source code here.
Prerequisites
I'm going to show you how to build a command line tool that will download a video from a YouTube link and extract the transcription for you via AssemblyAI in Python 3. You'll need:
- youtube-dl
- ffmpeg, ffprobe, ffplay
- an AssemblyAI API key
- click (a Python library)
- internet access
AssemblyAI is an API for fast, automatic speech-to-text conversion. We'll use it to transcribe the YouTube videos we download. To get an AssemblyAI API key, visit AssemblyAI and sign up. Once you're signed in, you'll see your API key clearly displayed; I've circled where it should be in the picture.

Next we'll have to download youtube-dl for Python. youtube-dl is an open source library for easily downloading YouTube videos. There are multiple ways to do this, but I suggest using pip:
pip install youtube_dl
Next comes installing ffmpeg. FFmpeg is free, open source software for handling video, audio, and other multimedia files. We'll be using it in conjunction with youtube-dl to convert the video we download into an audio file. This part is different for Windows and OSX users. First, we're going to download the binaries from https://ffbinaries.com/downloads
If you're a Windows user, download the binaries and unzip the files. You'll see an executable file for each of the three binaries we need: ffmpeg, ffprobe, and ffplay. Copy each executable to a folder and make sure you know where that folder is. For the purposes of this tutorial, I copied them to the same folder that I run the Python program from. Later we'll add an option to the settings we pass to youtube_dl that tells it where to find them.
If you're an OSX user, download the binaries from the site, copy them into /usr/local/bin, and make sure that folder is on your PATH, like so:
1. Run
sudo cp ./ffmpeg ./ffplay ./ffprobe /usr/local/bin
2. Open up ~/.zshrc with whatever text editor you'd like. I just run
vim ~/.zshrc
3. Add the line
PATH="/usr/local/bin:$PATH"
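Before moving on, it can help to confirm that the three binaries are actually reachable. The following is a minimal sketch, not part of the original tool: it uses Python's standard `shutil.which` to look each name up on the PATH, and `missing_binaries` is a hypothetical helper name made up for this check.

```python
import shutil

# Hypothetical helper (not part of youtube-dl or ffmpeg): return the names
# of the given executables that cannot be found on the current PATH.
def missing_binaries(names=("ffmpeg", "ffprobe", "ffplay")):
    return [name for name in names if shutil.which(name) is None]

if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All three ff- binaries were found on PATH")
```

If anything is reported as missing on OSX, open a new terminal so the updated ~/.zshrc is loaded and try again.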
The last prerequisite to install is Click. Click is a Python library, short for “command line interface creation kit”. Click’s three main points of interest are arbitrary command nesting, automatic help page generation, and lazy loading for subcommands at runtime. I just install Click with pip in the terminal like so:
pip install click
At this point we've finished all the prerequisite steps for building our application, so let's start setting up. One thing to note is that I have a configure.py file in which I store my auth key from AssemblyAI, and you'll need to create one too. The whole file can be just one line that says:
auth_key = '<Your AssemblyAI API key here>'
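If you would rather not keep the key in a source file, one alternative is to read it from an environment variable. This is only a sketch: the variable name `ASSEMBLYAI_API_KEY` and the helper `load_auth_key` are assumptions made for this example, not something AssemblyAI or the original code defines.

```python
import os

# Hypothetical helper: read the API key from an environment variable
# (the variable name is an assumption) and fail loudly if it is missing.
def load_auth_key(env_var="ASSEMBLYAI_API_KEY"):
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key

# In configure.py you could then write:
# auth_key = load_auth_key()
```

Either way, the rest of the tutorial only needs a module-level name called auth_key in configure.py.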
Setup
For our setup, we need to know a few things:
- What options to pass to youtube_dl
- The AssemblyAI endpoints
- Some other constants
For the youtube_dl options, we only care about the audio, so we'll go with bestaudio as our format option. To get the audio out as an mp3 we need to pass a postprocessor, and that's where ffmpeg comes in. You'll notice I also added an ffmpeg location of './'. That's for Windows users who have moved the ff- binaries into the folder with the program in it. I also added an outtmpl (output template) that names the file after the YouTube id of the video. This is totally optional; I did it because the title of a video can get long and cumbersome to work with in some settings, especially if there are spaces in it.
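ydl_opts = {
    'format': 'bestaudio/best',
    # Convert the downloaded audio to a 192 kbps mp3 with ffmpeg
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
        'preferredquality': '192',
    }],
    # The Python API spells this option with an underscore
    'ffmpeg_location': './',
    'outtmpl': "./%(id)s.%(ext)s",
}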
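To see what the output template does: youtube_dl fills in outtmpl using Python's %-style named placeholders, with fields taken from the video's metadata. The sketch below mimics that substitution with a hand-written dictionary. The id and extension values are made up for illustration, and nothing is downloaded.

```python
# Same template as in ydl_opts
outtmpl = "./%(id)s.%(ext)s"

# Made-up metadata standing in for what youtube_dl would supply
fake_meta = {"id": "abc123XYZ_0", "ext": "mp3"}

# Named %-formatting pulls each field out of the dictionary by key
filename = outtmpl % fake_meta
print(filename)  # ./abc123XYZ_0.mp3
```

Because the id never contains spaces, the resulting filename is easy to pass around on the command line.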
We’ll interact with two of the AssemblyAI endpoints here, one to upload the audio of the YouTube video to, and the other to get a transcription from. We’ll define them in our code like so:
transcript_endpoint = "https://api.assemblyai.com/v2/transcript"
upload_endpoint = 'https://api.assemblyai.com/v2/upload'
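Finally, we'll set up a few more constants: the headers that we need to send when interacting with the AssemblyAI API, and the chunk size to use when reading a file. We'll also import the auth key from configure.py. It all looks like so:
from configure import auth_key

# Used for the upload request, which sends raw bytes
headers_auth_only = {'authorization': auth_key}
# Used for the JSON requests to the transcript endpoint
headers = {
    "authorization": auth_key,
    "content-type": "application/json"
}
CHUNK_SIZE = 5242880  # 5 MB (5 * 1024 * 1024 bytes)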
We've installed our prerequisite libraries and set up our constants. Now it's time to dive into making the app itself.
Let’s break this down into four steps (and conveniently also four commands):
- Downloading the audio from YouTube (download)
- Uploading the audio file to Assembly (upload)
- Transcribing the audio file via Assembly (transcribe)
- Getting the transcribed text file (poll)
Downloading the audio from YouTube
Our end goal with this step is to create a function that takes a link, downloads it, and returns that download location back to us. When it’s done, it should look something like this:

First, let's initialize our CLI. We'll import the click library and define an apis group.
import click

@click.group()
def apis():
    """A CLI for getting transcriptions of YouTube videos"""

def main():
    apis(prog_name='apis')

# Keep this block at the very bottom of the file, below all the commands
if __name__ == '__main__':
    main()
Now let's make our download function. youtube_dl can work directly from the link we pass in, so all we do to it first is strip any stray whitespace. After that we'll use youtube_dl with the options we made earlier to save the audio, then print and return the save location.
import youtube_dl

@apis.command()
@click.argument('link')
def download(link):
    _id = link.strip()
    # extract_info downloads the video and runs the mp3 postprocessor
    meta = youtube_dl.YoutubeDL(ydl_opts).extract_info(_id)
    save_location = meta['id'] + ".mp3"
    print(save_location)
    return save_location
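Since the output file is named after the video id, it can be handy to know that id before downloading anything. Below is a rough sketch using only the standard library. `extract_video_id` is a hypothetical helper, not a youtube_dl function, and it only handles the two common link shapes (youtube.com/watch?v=... and youtu.be/...).

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical helper: pull the video id out of a YouTube link.
# Returns None for anything it does not recognise.
def extract_video_id(link):
    parsed = urlparse(link.strip())
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the id as the first path segment
        return parsed.path.strip("/").split("/")[0] or None
    if host in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        # Regular watch links carry the id in the "v" query parameter
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None

print(extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=10s"))
```

The download command above does not need this, because youtube_dl reports the id in the metadata it returns.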
Uploading the audio to Assembly
Cool, now we can download a YouTube video and save the audio file as a .mp3 locally. Now, we need to upload this audio file somewhere to host online. Luckily Assembly offers an easy to use upload endpoint and storage. At the end of this step, we will have something that looks like this.

To upload a file we’ll have to make a function that can read the data and send that as the “data” in the upload request. When we get the upload response we simply print that out and then return the url.
import requests

@apis.command()
@click.argument('filename')
def upload(filename):
    # Yield the file in CHUNK_SIZE pieces so it is never fully in memory
    def read_file(filename):
        with open(filename, 'rb') as _file:
            while True:
                data = _file.read(CHUNK_SIZE)
                if not data:
                    break
                yield data

    upload_response = requests.post(
        upload_endpoint,
        headers=headers_auth_only, data=read_file(filename)
    )
    print(upload_response.json())
    return upload_response.json()['upload_url']
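The chunked reader is a plain Python generator, so it can be tried without touching the network. The sketch below repeats the same loop at module level and runs it over a small temporary file with a deliberately tiny chunk size (4 bytes instead of 5 MB) so the chunk boundaries are visible. The chunk_size parameter is added here for the demonstration only.

```python
import os
import tempfile

# Same loop as in the upload command, with the chunk size as a parameter
def read_file(filename, chunk_size=5242880):
    with open(filename, "rb") as _file:
        while True:
            data = _file.read(chunk_size)
            if not data:
                break
            yield data

# Write ten bytes to a throwaway file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    path = tmp.name

# Read it back four bytes at a time, then clean up
chunks = list(read_file(path, chunk_size=4))
os.remove(path)
print(chunks)  # [b'0123', b'4567', b'89']
```

When a generator like this is passed as data, requests streams the body piece by piece instead of loading the whole mp3 first.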
Transcribing the audio via Assembly
Alright, so now that we’ve got the audio uploaded, we can transcribe it via the Assembly API. When we’re done, we should have something that looks like this upon sending a request.

What we're going to do in this step is create a transcript request and send it off to the AssemblyAI transcription endpoint. Then, we sit and wait. I imported pprint here specifically to make the printout look nice; you can use regular print if you want a more condensed visual. I also included a flag, -c or --categories, that controls whether we ask AssemblyAI to return the categories related to the text of our transcription.
import pprint

@apis.command()
@click.argument('audio_url')
@click.option('-c', '--categories', is_flag=True, help="Pass if you want to get the categories of this transcript back")
def transcribe(audio_url, categories: bool):
    transcript_request = {
        'audio_url': audio_url,
        # Send a real boolean so it serializes as JSON true/false
        'iab_categories': categories,
    }
    transcript_response = requests.post(transcript_endpoint, json=transcript_request, headers=headers)
    pprint.pprint(transcript_response.json())
    return transcript_response.json()['id']
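To make the request body concrete, here is a sketch of what gets serialized when a dictionary is passed through json=. `build_transcript_request` is a hypothetical helper for this example and the audio URL is a placeholder. It assumes the API wants iab_categories as a JSON boolean rather than a string.

```python
import json

# Hypothetical helper: build the transcript request body.
# bool() guarantees a JSON true/false rather than a string.
def build_transcript_request(audio_url, categories=False):
    return {"audio_url": audio_url, "iab_categories": bool(categories)}

# Placeholder URL; a real one comes back from the upload step
body = json.dumps(build_transcript_request("https://example.com/audio.mp3", categories=True))
print(body)
```

A Python True becomes a lowercase true in the JSON, whereas the string 'True' would be sent as quoted text.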
Getting the transcribed text file
We're almost there; this is the last command we have to write. It polls our transcription endpoint to check whether our transcription is done. AssemblyAI's docs say to expect transcription to take 15-30% of the video's length (https://docs.assemblyai.com/overview/processing-times). If the status of the response is 'processing', the poll command will print something just like what the transcribe command returned, as outlined in red in the picture.

Or it will return the location of where the transcription has been saved locally if the status of the returned response is ‘completed’.

For this function, we want to do three things. First, construct the endpoint to poll from the transcription endpoint and the 'id' returned above (in this case, dx4mgdwjz-a413-4204-87e0-666d97727113). Second, create a filename to save the transcribed text to. Finally, check the response: if the AssemblyAI model is still processing (notice the status: processing in the first image) we display the response, and if it's done we save the text to the file.
@apis.command()
@click.argument('transcript_id')
def poll(transcript_id):
    polling_endpoint = transcript_endpoint + "/" + transcript_id
    polling_response = requests.get(polling_endpoint, headers=headers)
    filename = transcript_id + '.txt'
    if polling_response.json()['status'] != 'completed':
        # Not done yet: show the full response, including its status
        pprint.pprint(polling_response.json())
    else:
        with open(filename, 'w') as f:
            f.write(polling_response.json()['text'])
        print('Transcript saved to', filename)
        return filename
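The status is not always just processing or completed: AssemblyAI also reports queued and error. The sketch below separates the decision from the HTTP call so it can be tried with plain dictionaries. `summarize_poll` is a hypothetical helper, and the sample responses are hand-written, not real API output.

```python
# Hypothetical helper: turn a polling response (already parsed from JSON)
# into a short (state, detail) pair.
def summarize_poll(result):
    status = result.get("status")
    if status == "completed":
        return ("done", result.get("text", ""))
    if status == "error":
        return ("failed", result.get("error", "unknown error"))
    # queued, processing, or anything unexpected: keep waiting
    return ("waiting", status)

# Hand-written sample responses for illustration
print(summarize_poll({"status": "processing"}))
print(summarize_poll({"status": "completed", "text": "hello world"}))
print(summarize_poll({"status": "error", "error": "bad audio"}))
```

Checking for error explicitly avoids polling forever on a transcript that has already failed.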
Bonus Round
If you don't need any of the in-between steps (except poll) as standalone functions, we can make one big function that does everything, from downloading the YouTube video to uploading it to Assembly to transcribing it via Assembly. Sometimes the DNS lookup will time out, so we'll wrap the polling request in a try/except that prints an expected wait time and returns the transcript id if the request fails. When we're done it should look like this:

Or:

Then, we wait the estimated 120.8 seconds and call the poll command to get our transcription and voila!
from time import sleep

@apis.command()
@click.argument('link')
@click.option('-c', '--categories', is_flag=True, help="Pass this flag if you want to get the categories of this transcript back")
def transcribe_from_link(link, categories: bool):
    _id = link.strip()
    # Step 1: download the audio and convert it to mp3
    def get_vid(_id):
        with youtube_dl.YoutubeDL(ydl_opts) as ydl:
            return ydl.extract_info(_id)
    meta = get_vid(_id)
    save_location = meta['id'] + ".mp3"
    duration = meta['duration']
    print('Saved mp3 to', save_location)
    # Step 2: upload the mp3 in chunks
    def read_file(filename):
        with open(filename, 'rb') as _file:
            while True:
                data = _file.read(CHUNK_SIZE)
                if not data:
                    break
                yield data
    upload_response = requests.post(
        upload_endpoint,
        headers=headers_auth_only, data=read_file(save_location)
    )
    audio_url = upload_response.json()['upload_url']
    print('Uploaded to', audio_url)
    # Step 3: request the transcription
    transcript_request = {
        'audio_url': audio_url,
        'iab_categories': categories,
    }
    transcript_response = requests.post(transcript_endpoint, json=transcript_request, headers=headers)
    transcript_id = transcript_response.json()['id']
    polling_endpoint = transcript_endpoint + "/" + transcript_id
    print("Transcribing at", polling_endpoint)
    # Step 4: poll every 30 seconds until the transcript is completed
    polling_response = requests.get(polling_endpoint, headers=headers)
    while polling_response.json()['status'] != 'completed':
        sleep(30)
        try:
            polling_response = requests.get(polling_endpoint, headers=headers)
        except requests.exceptions.RequestException:
            # Network trouble: hand back the id so poll can be run later
            print("Expected wait time:", duration*2/5, "seconds")
            print("After wait time is up, call poll with id", transcript_id)
            return transcript_id
    _filename = transcript_id + '.txt'
    with open(_filename, 'w') as f:
        f.write(polling_response.json()['text'])
    print('Transcript saved to', _filename)
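The loop above keeps polling until the status is completed, so a transcript that ends in an error would never exit it. One possible variation, sketched here under assumptions, gives the loop a time budget and stops on error as well. `poll_until_done`, `estimate_wait_seconds`, and the injected fetch and sleep callables are all hypothetical names for this example. The 2/5 factor is the same rough estimate used in the code above.

```python
import time

# Same rough estimate as above: two fifths of the video's length
def estimate_wait_seconds(duration_seconds):
    return duration_seconds * 2 / 5

# Hypothetical helper: call fetch() until it reports a final status
# or the time budget runs out. fetch must return a parsed response dict.
def poll_until_done(fetch, interval=30, max_wait=600, sleep=time.sleep):
    waited = 0
    while True:
        result = fetch()
        if result.get("status") in ("completed", "error"):
            return result
        if waited >= max_wait:
            return None  # gave up; the caller can run poll later
        sleep(interval)
        waited += interval

# Demo with canned responses and a no-op sleep, so nothing is sent
responses = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "completed", "text": "hello"},
])
final = poll_until_done(lambda: next(responses), interval=0, sleep=lambda s: None)
print(estimate_wait_seconds(302))  # 120.8
print(final["text"])  # hello
```

In the real command, fetch would wrap the requests.get call on the polling endpoint and return its parsed JSON.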
Wrapping Up
To recap, we’ve just made our own command line interface for downloading YouTube videos with youtube-dl and transcribing them with AssemblyAI using Python. AssemblyAI is a straightforward to use, fast, and powerful speech to text API. You can follow AssemblyAI on Twitter @assemblyai and you can follow me @yujian_tang.