Everyone wants to know how to predict the stock market. Everyone also knows that it’s basically impossible. Here at AssemblyAI, we wanted to know whether negative events in news podcasts could predict the stock market in some way. In this post, we’ll walk through how to compare podcast data to stock market data, and what last year’s news negativity ratings for two prominent news podcasts, The Daily and Up First, had to say about the Dow Jones Industrial Average, the NASDAQ, and Royal Gold.
How Did We Use Negative News in Podcasts to Predict the Market?
We took the audio files of the podcasts The Daily and Up First and transcribed them with AssemblyAI’s automatic speech to text transcription API. AssemblyAI’s speech to text API also provides the option to enable content safety ratings which detect negative news. We used this option when passing the audio files to the speech to text API endpoint to detect negative news in each podcast episode. The returned response from the AssemblyAI API with this option looks like:
Each paragraph that is returned has a label with a confidence score and a timestamp.
We labeled each episode with a negativity rating equal to the proportion of negative news in the audio file. For example, if 50 blocks of text were returned and 5 of them mentioned negative news (above 85% confidence), the audio file would have a negativity rating of 0.1.
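A minimal sketch of this calculation follows. It assumes each returned block is a dict with a labels list whose entries carry a confidence value, and that a block counts as negative when any of its labels clears the 0.85 threshold. The helper name and the exact data layout are assumptions for illustration, not something this post specifies.

```python
# Hypothetical helper: the block/label layout below is an assumption about
# the content safety response, not something this post specifies.
CONFIDENCE_THRESHOLD = 0.85


def negativity_rating(blocks, threshold=CONFIDENCE_THRESHOLD):
    """Return the share of blocks with at least one label above threshold."""
    if not blocks:
        return 0.0
    negative = sum(
        1
        for block in blocks
        if any(label["confidence"] > threshold for label in block.get("labels", []))
    )
    return negative / len(blocks)


# 50 blocks, 5 of which carry a high-confidence label, as in the example above.
example_blocks = [{"labels": [{"label": "disasters", "confidence": 0.95}]}] * 5
example_blocks += [{"labels": []}] * 45
print(negativity_rating(example_blocks))  # 5 / 50
```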
From the graphs, it looks like our most negative news days come 1-3 days before rises in gold as well as dips in the market. We can zoom in and confirm this with the graphs below. Almost every blue line, which marks a date with exceptionally negative news, is followed by a dip or a series of red days on both DJIA and NDAQ, and by a rise or a series of green days on RGLD.
So what does this mean for you practically? If you’re hearing really negative news, maybe you want to buy the dip in the next 1-3 days, unless you’re looking at gold prices. If you want to explore this on your own, we’ll take a deep dive into the code so you can run the process yourself. The steps are:
- Create a Python Web Crawler to collect links to audio files
- Transcribe the audio files with AssemblyAI’s speech to text API
- Calculate and compile Negativity News Ratings
- Download Stock Data
- Graph negative News Ratings against the DJIA, NDAQ, and RGLD
Create a Python Web Crawler to Collect links to Audio Files
We’re going to use Selenium to crawl Listen Notes and get links to the podcasts of The Daily and Up First (in this example code the link is to Up First).
To install Selenium, run the command pip install selenium in your terminal.
We’ll use Selenium and Chromedriver to open up the link in Chrome. Then we’ll look for the ‘Load More’ button and click it 36 times in order to get all the podcasts from the last year. After we’ve extended our page, we’ll then get the ‘Download’ link from clicking the ‘MORE’ button and getting the ‘href’ value of the link to the ‘Download’ button. We’ll save all of these in a .csv file and use that to transcribe our links later.
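The crawl described above might be sketched as follows. The XPath expressions, the pause between clicks, and the function names are hypothetical placeholders, since the real page markup is not shown here and may differ. Treat this as a starting point to adapt, not as the original crawler.

```python
import csv
import time

# Assumptions for illustration only: the button texts used in the XPath
# lookups and the pause length are hypothetical and depend on the live page.
LOAD_MORE_CLICKS = 36


def collect_download_links(page_url, clicks=LOAD_MORE_CLICKS, pause=2.0):
    """Open the page in Chrome, expand it, and return the Download hrefs."""
    # Imported here so the CSV helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(page_url)
        for _ in range(clicks):
            driver.find_element(By.XPATH, "//button[contains(., 'Load More')]").click()
            time.sleep(pause)  # give the next batch of episodes time to load
        # Open every episode menu so its Download link is present in the page.
        for button in driver.find_elements(By.XPATH, "//*[contains(text(), 'MORE')]"):
            button.click()
        anchors = driver.find_elements(By.XPATH, "//a[contains(., 'Download')]")
        return [a.get_attribute("href") for a in anchors if a.get_attribute("href")]
    finally:
        driver.quit()


def save_links(links, path):
    """Write one link per row, dropping duplicates but keeping page order."""
    unique = list(dict.fromkeys(links))
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows([link] for link in unique)
    return unique
```

Calling save_links(collect_download_links(url), "links.csv") with the podcast page URL would produce the .csv file used in the next step.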
When you’re done, you should have a .csv file that looks something like this:
Transcribe the audio files with AssemblyAI’s speech to text API
Now that we’ve got our links, we’ll use AssemblyAI’s AI powered automatic speech recognition API endpoint to get content safety ratings for each of our podcasts. It’s important to note here that we are running a LARGE amount of audio through. We are transcribing roughly 700 podcast episodes of about half an hour each, which is at least 300 hours of audio. It would take a person almost 13 days of nonstop work, likely more, to transcribe that. We’re going to do it with AssemblyAI’s automatic speech to text API in a day. We’ll need to sign up for an AssemblyAI API key, which will be located where I blocked out in the picture.
After we get our API key, we’ll create several functions to transcribe the audio. First, we’ll create a function that transcribes the audio files using AssemblyAI’s speech to text API. We’ll need the transcript endpoint, headers with authorization, and some constants to help us run our program: one to let us know the status of the transcript processing and another to help us run multiple transcripts at a time. We’ll also make functions to poll the AssemblyAI speech to text API and to save the transcript to a file if the status of the transcript is complete.
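A minimal sketch of these functions is below. It uses urllib from the standard library rather than any particular HTTP client, and the function names are hypothetical. The endpoint, the authorization header, the content_safety flag, and the completed status reflect my understanding of AssemblyAI’s v2 REST API and should be checked against the current documentation.

```python
import json
import os
import urllib.request

# The endpoint and field names follow AssemblyAI's v2 REST API as I understand
# it; treat them, and every helper name here, as assumptions to verify.
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def build_payload(audio_url):
    """Request body asking for a transcript with content safety labels."""
    return {"audio_url": audio_url, "content_safety": True}


def _call(url, api_key, payload=None):
    """POST when a payload is given, otherwise GET; return the parsed JSON."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        headers={"authorization": api_key, "content-type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def submit(audio_url, api_key):
    """Queue one audio file and return its transcript id."""
    return _call(TRANSCRIPT_ENDPOINT, api_key, build_payload(audio_url))["id"]


def poll(transcript_id, api_key):
    """Fetch the current state of a transcript."""
    return _call(f"{TRANSCRIPT_ENDPOINT}/{transcript_id}", api_key)


def save_if_complete(result, out_dir):
    """Save the response as JSON when finished; return the path or None."""
    if result.get("status") != "completed":
        return None
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{result['id']}.json")
    with open(path, "w") as handle:
        json.dump(result, handle)
    return path
```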
Now that we’ve created functions to handle transcribing our audio via AssemblyAI’s speech recognition API, we need some functions to handle running so many files through it. We’ll make a function that appends to a .csv tracking the links we have already checked and transcribed, and we’ll create a function that keeps track of the transcription ids currently being processed via AssemblyAI’s speech to text API. Then comes the script.
Our script will open up a csv file containing links to the podcast audio files. It will check to see if we have already created a csv for checked links and check for a csv of currently running transcript ids. If the number of checked links is equal to the number of total links, we’re done. If not, we remove the checked links from our list of links.
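The bookkeeping in this step could look roughly like the sketch below. The file layout (one link per row, in the first column) and the function names are assumptions made for illustration.

```python
import csv
import os


def read_column(path):
    """Return the first column of a CSV file, or [] if the file is missing."""
    if not os.path.exists(path):
        return []
    with open(path, newline="") as handle:
        return [row[0] for row in csv.reader(handle) if row]


def remaining_links(links_path, checked_path):
    """Links still to transcribe, in their original order."""
    checked = set(read_column(checked_path))
    return [link for link in read_column(links_path) if link not in checked]


def mark_checked(link, checked_path):
    """Append one finished link to the checked-links file."""
    with open(checked_path, "a", newline="") as handle:
        csv.writer(handle).writerow([link])
```

When remaining_links returns an empty list, every link has been checked and the script can stop.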
If we already have a set of transcription ids from AssemblyAI’s speech to text transcription API, we’ll poll for their status. We’ll count the number of files that are completed, and then replace them in order. Of course, this won’t always preserve chronological order, but we can preserve it by polling at a rate at which we can expect all the audio to already be transcribed. The transcription time of AssemblyAI’s speech to text service can be safely estimated at ⅓ of the audio file length.
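As a small worked example of the ⅓ estimate, the sketch below computes how long to wait before polling a batch. The helper names are hypothetical, and waiting on the longest file in the batch is an assumption about how to apply the estimate when several transcripts run at once.

```python
def estimated_wait_seconds(audio_seconds, divisor=3):
    """Rough time to wait before polling, given the audio length."""
    return audio_seconds / divisor


def batch_wait_seconds(audio_lengths_seconds):
    """Wait long enough for the longest file in a concurrent batch."""
    if not audio_lengths_seconds:
        return 0.0
    return estimated_wait_seconds(max(audio_lengths_seconds))


# A half-hour episode (1800 s) should be ready after about ten minutes.
print(estimated_wait_seconds(30 * 60) / 60)
```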
We run this script multiple times until we’ve finished transcribing all of the links in our file. After we’re done transcribing our links, we should end up with a folder that contains JSON files with blocks of text that look like this.
Calculate and compile Negativity News Ratings
After running all of our audio files through AssemblyAI’s speech to text API and getting JSON files back with content safety labels, we can start to calculate and compile negative news ratings. This part gets a bit complex and is not entirely precise, so buckle your seatbelts. I made the unfortunate mistake of not saving the dates corresponding to the podcasts when I initially downloaded them, and I thought I was doomed. Then I realized that I could just backdate them by the order in which I downloaded them, starting with August 21, 2021. The first thing we’re going to do is get a list of the negativity ratings of each news podcast, in chronological order. The function that we’ll use to get these negativity ratings is:
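A minimal sketch of the backdating idea, not the original function, is shown here. It assumes the ratings arrive newest first and that consecutive episodes are one calendar day apart. The post only states the start date, so both of those are assumptions, and the step is adjustable.

```python
from datetime import date, timedelta

# Assumptions: ratings arrive newest first, and episodes are one calendar day
# apart. The post only states the start date, so adjust step_days as needed.
START_DATE = date(2021, 8, 21)


def backdate(ratings_newest_first, start=START_DATE, step_days=1):
    """Pair each rating with a date, returned in chronological order."""
    dated = [
        (start - timedelta(days=i * step_days), rating)
        for i, rating in enumerate(ratings_newest_first)
    ]
    return dated[::-1]
```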
We’re going to create three negativity ratings by running this on The Daily, on Up First, and then creating a third that combines their ratings. We’re also going to normalize our data to lie between 0 and 1 and invert it, so that 0 marks the most negative days and 1 marks the least negative days, rather than 0 meaning no mentions of negative news and 1 meaning an entirely negative podcast.
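One way to do this normalization is min-max scaling followed by flipping the scale, sketched below. Min-max scaling is an assumption here, since the post does not say which normalization was used, and the function name is hypothetical.

```python
def normalize_inverted(ratings):
    """Min-max scale to [0, 1], flipped so 0 is the most negative episode."""
    low, high = min(ratings), max(ratings)
    if high == low:
        # No spread in the data; arbitrarily treat every day as least negative.
        return [1.0 for _ in ratings]
    return [1 - (r - low) / (high - low) for r in ratings]
```

For example, raw ratings of 0.0, 0.5, and 1.0 become 1.0, 0.5, and 0.0, so the episode with the most negative news ends up at 0.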
We’ll find the indices corresponding to the minimum values in all three of these lists using a function that returns the 10 minimum indices. Then we’ll coalesce these indices and graph them against the graphs of the Dow Jones Industrial Average, the NASDAQ, and Royal Gold.
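A sketch of these two helpers is below. The names are hypothetical, and merging the index lists only makes sense if the three rating lists are aligned so that the same index means the same day, which is an assumption.

```python
def lowest_indices(values, k=10):
    """Indices of the k smallest values, smallest first."""
    return sorted(range(len(values)), key=lambda i: values[i])[:k]


def coalesce(*index_lists):
    """Union of several index lists, sorted and without duplicates."""
    return sorted(set().union(*index_lists))
```

With the inverted scale, the smallest values are the most negative days, so coalesce(lowest_indices(daily), lowest_indices(up_first), lowest_indices(combined)) would give the days to mark on the charts.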
Compiling the days, we find that they are:
- August 13, 2021
- July 14, 15, and 16, 2021
- July 8, 2021
- July 3, 2021
- June 1, 2021
- May 4, 2021
- March 18, 2021
- February 14, 2021
- January 25, 2021
- October 21, 2020
Download Stock Data
We can download stock data with yfinance, an open source Python library for Yahoo Finance data. You can install it by running the command pip install yfinance in your terminal.
Then we can download last year’s stock market data like so
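A minimal sketch of this step is below. It assumes the year ends on August 21, 2021 (the backdating start date mentioned earlier) and that one CSV file per ticker is wanted. The file names and function names are placeholders. Depending on the yfinance version, the downloaded columns may need flattening before saving.

```python
from datetime import date, timedelta

# "^DJI" is Yahoo Finance's symbol for the Dow Jones Industrial Average; NDAQ
# and RGLD are the tickers named in this post. File names are placeholders.
TICKERS = {"DJIA": "^DJI", "NDAQ": "NDAQ", "RGLD": "RGLD"}


def one_year_window(end=date(2021, 8, 21)):
    """Return (start, end) ISO date strings covering the 365 days before end."""
    start = end - timedelta(days=365)
    return start.isoformat(), end.isoformat()


def download_all(out_dir="."):
    """Download daily prices for each ticker and save one CSV per ticker."""
    import yfinance as yf  # imported lazily; requires network access

    start, end = one_year_window()
    for name, symbol in TICKERS.items():
        frame = yf.download(symbol, start=start, end=end)
        frame.to_csv(f"{out_dir}/{name}.csv")
```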
Graph negative News Ratings against the DJIA, NDAQ, and RGLD
We’re going to create some candlestick charts in Python so that we can graph our stock data. We’ll need pandas, matplotlib, and mplfinance for this. To install them, run the command pip install pandas matplotlib mplfinance in your terminal.
We'll read our file in and add labels.
Before we plot this, we’ll want to normalize our data.
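Reading the file and normalizing it might look like the sketch below. It assumes the CSV has dates in its first column and the usual Open, High, Low, and Close columns, and it scales all four columns with one shared minimum and maximum so the candles keep their shape. The function names are hypothetical.

```python
PRICE_COLUMNS = ["Open", "High", "Low", "Close"]


def min_max(values, low, high):
    """Scale values so that low maps to 0 and high maps to 1."""
    return (values - low) / (high - low)


def load_prices(path):
    """Read a saved price CSV with dates as the index (needs pandas)."""
    import pandas as pd  # imported lazily so min_max stays dependency free

    return pd.read_csv(path, index_col=0, parse_dates=True)


def normalize_prices(frame, columns=PRICE_COLUMNS):
    """Scale all price columns with one shared min and max."""
    low = frame[columns].min().min()
    high = frame[columns].max().max()
    scaled = frame.copy()
    scaled[columns] = min_max(frame[columns], low, high)
    return scaled
```

Using a shared minimum and maximum, rather than scaling each column on its own, keeps the open, high, low, and close of a given day in the same order they had before scaling.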
Now we go back to the dates we identified as having the most negative news in The Daily and Up First, the podcasts we transcribed with AssemblyAI’s speech to text API, and plot them as vertical lines:
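A sketch of the plotting step with mplfinance is below. The dates are the ones listed earlier in this post, written as ISO strings. The function names are hypothetical, and filtering out dates that fall outside the chart is an assumption made to avoid errors when zooming in on a shorter window.

```python
# The most negative news days listed earlier in this post, as ISO strings.
NEGATIVE_DATES = [
    "2020-10-21", "2021-01-25", "2021-02-14", "2021-03-18", "2021-05-04",
    "2021-06-01", "2021-07-03", "2021-07-08", "2021-07-14", "2021-07-15",
    "2021-07-16", "2021-08-13",
]


def dates_in_range(dates, start, end):
    """Keep only the ISO date strings that fall inside [start, end]."""
    return [d for d in dates if start <= d <= end]


def plot_with_negative_days(frame, negative_dates=NEGATIVE_DATES, title=""):
    """Draw a candlestick chart with a blue vertical line per negative day."""
    import mplfinance as mpf  # imported lazily; frame needs a DatetimeIndex

    start = frame.index.min().strftime("%Y-%m-%d")
    end = frame.index.max().strftime("%Y-%m-%d")
    marks = dates_in_range(negative_dates, start, end)
    kwargs = {"type": "candle", "title": title}
    if marks:
        kwargs["vlines"] = dict(vlines=marks, colors="b")
    mpf.plot(frame, **kwargs)
```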
We should then see graphs like the ones below.
How to read these graphs
These candlestick graphs are zoomed in to the days of particularly negative news. The y-axis is scaled from 0 to 1 because a) normalizing a set of nonnegative numbers to this scale is a shape preserving transformation and b) if we want, we can plot the frequency of negative news in the podcasts against this scale and see how it lines up in a more absolute manner. I’ve chosen not to include those graphs because they’re harder to read than the ones I have included. The negative news peaks are marked with vertical blue lines.
Let’s zoom in and examine the days around our particularly negative news. As I said earlier, it looks like the price of gold is inversely related to negative news peaks, rising in the days after. The NASDAQ and DJIA, on the other hand, look directly correlated with negative news peaks, with their prices dropping in the days after.
In this first graph, we see a sharp rise in gold with a lag of 1 day behind the negative news peak along with a surprise drop immediately the day after. We can also see that the NASDAQ had a pretty big dip in the days after. The DJIA does not reflect as much of a drop so I didn’t include the graph. This was around the time Google was facing anti-trust issues and global protests.
This graph shows us another jump in gold immediately after this negative news day. I already included this date for the NASDAQ in the graph above, and once again the DJIA did not show as drastic a change. I think this was around the election.
Here we see a pretty big rise in the price of gold in the days following another negative news peak. This time we can see a response from both the DJIA and the NASDAQ with prices dropping on both indexes in the days after this cluster of negative news. I believe these news points were around the time when there was a vaccine controversy in Texas and a rise in racial violence against Asian Americans.
There’s another rise in the price of gold immediately after another negative news peak here. The NASDAQ and the DJIA did not respond strongly to this negative news. This was around the time that there were many protests around racial violence in the US.
This rise in gold price following negative news was around the time that the COVID delta variant was becoming more widespread. We can see that the DJIA and NASDAQ also responded to this news with drops over the next few days, with the DJIA really taking a dip right after.
This last graph for gold prices shows rises around late July of 2021. We can see that the NASDAQ and DJIA both responded to this and took dips in the days after. This was around the wildfires in the west and more concern around climate change. I’m not sold that buying gold will protect anyone from climate change, but a rise in commodity prices as a fear response to negative news always makes sense.
Can you use podcasts to predict the stock market? Most of the time no, but on days with exceptionally bad news, you can expect to see a dip in the stock market in the next 1-3 days. How bad is exceptionally bad? I arbitrarily chose the top 10 worst days over the last year, and combined two sets of them to find that these days did indeed predict drops. Their negativity ratings via AssemblyAI’s content safety option on the speech to text API were all over 0.7. For more information on speech to text and other cool tutorials follow us @assemblyai and @yujian_tang on Twitter!