In Automatic Speech Recognition (ASR), Sentiment Analysis refers to detecting the sentiment of specific speech segments throughout an audio or video file. Sentiment Analysis is sometimes referred to as Sentiment “Mining” because you are identifying and extracting, or mining, subjective information from your source material.
Sentiment Analysis is a well-studied field with interesting and useful applications across a wide range of industries. At AssemblyAI, we just released our own Sentiment Analysis feature for our Speech-to-Text API to give our customers the ability to capitalize on this helpful tool.
In this post, we’ll look more closely at how Sentiment Analysis works, current models, applications and use cases, limitations, and future projections.
How Does Sentiment Analysis Work?
In Sentiment Analysis, the goal is to take an audio or video file, or its transcript, and produce one of three outputs: positive, negative, or neutral.
To achieve this, our model outputs a number between -1 and 1 with:
- -1 = negative
- 0 = neutral
- 1 = positive
This is also referred to as sentiment polarity. A model can be set up to categorize these numbers either on a scale or by probability. On a scale, for example, an output of 0.6 would be classified as positive since it is closer to 1 than to 0 or -1. With probability, the model instead uses multiclass classification to output a certainty for each class: say, 25% sure it is positive, 50% sure it is negative, and 25% sure it is neutral. The sentiment with the highest probability, in this case negative, would be your output.
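The two labeling strategies above can be sketched in a few lines. This is a toy illustration, not AssemblyAI's internal implementation:

```python
# Toy sketch of the two labeling strategies described above:
# thresholding a polarity score on a scale, and taking the
# argmax of multiclass probabilities.

def label_by_scale(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to the nearest anchor (-1, 0, 1)."""
    nearest = min((-1, 0, 1), key=lambda anchor: abs(polarity - anchor))
    return {-1: "NEGATIVE", 0: "NEUTRAL", 1: "POSITIVE"}[nearest]

def label_by_probability(probs: dict) -> str:
    """Pick the class with the highest predicted probability."""
    return max(probs, key=probs.get)

print(label_by_scale(0.6))  # POSITIVE: 0.6 is closest to 1
print(label_by_probability(
    {"POSITIVE": 0.25, "NEGATIVE": 0.50, "NEUTRAL": 0.25}
))  # NEGATIVE
```

Either approach yields the same three-way output; the probability version simply carries a confidence score along with the label.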
In our model at AssemblyAI, your Sentiment Analysis response on your transcript would look something like this:
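Below is an illustrative sketch of a single entry in the response, with made-up values; the exact shape may differ, so treat this as an assumption and check the API documentation:

```json
{
  "text": "I absolutely love this product.",
  "start": 12500,
  "end": 14150,
  "sentiment": "POSITIVE",
  "confidence": 0.93,
  "speaker": "A"
}
```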
Here’s how to interpret the above example:
- Text: The text being measured
- Start: Starting timestamp (ms) of the text in the transcript
- End: Ending timestamp (ms) of the text in the transcript
- Sentiment: The detected sentiment - POSITIVE, NEGATIVE, or NEUTRAL
- Confidence: Confidence score for the detected sentiment
- Speaker: If using dual_channel or speaker_labels (Speaker Diarization), then the associated speaker will be surfaced
You can find more information about how to use AssemblyAI’s Sentiment Analysis feature in our documentation.
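As a rough sketch, enabling the feature amounts to adding a flag to the transcript request payload. The parameter names below (`sentiment_analysis`, `speaker_labels`, `audio_url`) are assumptions based on this post; consult the documentation for the current API surface:

```python
import json

# Illustrative request payload with Sentiment Analysis enabled.
# Parameter names are assumptions; see the official docs.
payload = {
    "audio_url": "https://example.com/meeting.mp3",  # hypothetical file
    "sentiment_analysis": True,
    "speaker_labels": True,  # optional: surfaces a speaker per result
}

body = json.dumps(payload)
print(body)
```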
Sentiment Analysis Models
Sentiment Analysis is a very active area of study in the field of Natural Language Processing (NLP). Mainly, Sentiment Analysis is accomplished by fine-tuning transformers, since this method has been proven to handle sequential data like text and speech well, and scales extremely well to parallel processing hardware like GPUs.

Learn More: Fine-tuning Transformers for NLP
There are also strong open source datasets and benchmarks to work with as you fine-tune. Review and social platforms, such as Amazon, IMDB, Yelp, and Twitter, all make excellent sources of training data since the sentiments expressed are usually strong and lean toward one side of the positive-negative scale.
Applications and Use Cases
What is Sentiment Analysis used for? A lot! At AssemblyAI, our telephony customers use Sentiment Analysis to extract the sentiments of customer-agent conversations. Then, they can track customer feelings toward particular products, events, or even agents. They can also use it to analyze agent behavior.
Other customers use Sentiment Analysis for virtual meetings to determine participant sentiments by portion of meeting, meeting topic, meeting time, etc.
This can be a powerful analytic tool that helps companies make better informed decisions to improve products, customer relations, agent training, and more.
Limitations
Currently, Sentiment Analysis in ASR can only ascribe three attributes: positive, negative, or neutral. As we know, human sentiments are much more nuanced than this black-and-white output; we can't currently label a speaker with a more descriptive adjective, such as enthusiastic, hateful, or elated.
Another limitation is in our open source datasets. While there is an abundance of datasets available to train Sentiment Analysis models, the majority of them are text, not audio. Because of this, models can miss connotations that are conveyed in an audio stream but lost in a text transcript.
For example, someone could say the same phrase, “Let’s go to the grocery store,” enthusiastically, neutrally, or begrudgingly, depending on the situation.
In the Pipeline
Researchers are actively working to solve the limitations described above. For example, zero-shot text classification would let you assign more descriptive sentiments than simply positive, neutral, or negative. Instead, text could be classified as “upset”, “frustrated”, “excited”, “enthusiastic”, etc., making Sentiment Analysis an even more useful and powerful tool for analytics.
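To illustrate the idea, one common zero-shot approach scores how well each candidate label (framed as a hypothesis like “This text expresses {label}”) is entailed by the input, then picks the highest-scoring label. The sketch below uses hypothetical, hard-coded entailment logits as a stand-in for a trained NLI model, which a real system would require:

```python
import math

def zero_shot_classify(entailment_scores: dict):
    """Softmax-normalize raw entailment scores and return the top label.

    In a real zero-shot setup, `entailment_scores` would come from an
    NLI model scoring one hypothesis per candidate label; here they are
    hypothetical stand-in values.
    """
    total = sum(math.exp(s) for s in entailment_scores.values())
    probs = {label: math.exp(s) / total
             for label, s in entailment_scores.items()}
    best = max(probs, key=probs.get)
    return best, probs

# Hypothetical raw entailment logits for the utterance
# "I can't believe we won, this is amazing!"
scores = {"excited": 2.4, "frustrated": -1.1,
          "upset": -0.8, "enthusiastic": 1.9}
label, probs = zero_shot_classify(scores)
print(label)  # excited
```

The key property is that the candidate label set is supplied at inference time, so new sentiment categories can be added without retraining.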
By sourcing more audio training data, we also hope to increase Sentiment Analysis model accuracy.
Already, our researchers and others are actively working to make both of these happen, unlocking even more analytical power for users.