Selling the Hype: Coding Sentiment Analysis for Stock Market News in 4 STEPS

Feb 13, 2022

HOW TO: write code that automatically finds worthy stock market insights and detects hype in news coverage at scale (in Python 🐍)

As you may know, we analyze market news coverage to detect sentiment and trends around stocks and cryptocurrencies. Today we will be answering one of our most frequently asked questions: how does one write code to analyze stock market news sentiment algorithmically? We’ll spend the rest of this issue explaining how we’ve done it, and how you can do it yourself. If you’d rather watch than read, check out our 15-minute presentation on the subject here (or you can find it at the bottom of this post). Here’s the breakdown:

Motivation: 🌟 why sentiment matters
Trading on Sentiment: 🔎 $KR case study
Python Demo: 🐍 how to code sentiment analysis in Python
Takeaways: ⏭️ what’s next if you don’t write code?

1. Motivation 🌟

To set the table, let’s start with why. Investors who are consistently successful in the markets get there by mastering the art of combing through noise. To succeed often, you need a refined bullshit detector: the ability to read through vast amounts of online information and separate what matters from what doesn’t, by knowing where to look and whom to trust. In today’s world, this is easier said than done.

Now more than ever before, stock market outcomes are driven by online conversation, with millions of posts and articles published daily about each and every stock in the market. This sheer abundance of conversation makes it exceedingly difficult — and frankly infeasible — to diligently find good “buy” or “sell” insights about a stock without spending vast amounts of time combing through the noise. But what if we could outsource some of the reading legwork to a computer? Could we identify when stocks were over/underhyped in the news? Could we all be consistently successful?

Let’s say that we do want to try outsourcing the reading to a computer; how do we do that? Over the past year, we’ve developed a framework for automatically distilling stock market insights from online conversation (news articles, social media posts, etc.) at scale, which we’ll outline in more detail below.

In essence, we follow a four-step process. First, we gather a large volume of news data about stocks and the market. Second, we clean and format the raw text into something a computer can read effectively, essentially whittling away the irrelevant pieces to distill the keywords and phrases that truly matter for a given stock. Third (and most important), we identify valuable features within the cleaned text (names, topics, the mood about those names and topics, and the tense, whether past, present, or future, in which the author refers to them) so that we can quantify the author’s outlook on a given stock as bullish or bearish sentiment. Finally, once we can score the sentiment of an individual post, we compare these scores across posts, authors, and time to arrive at the aggregate outlook for the stock.

Below, we’ll walk through each of these steps and the corresponding Python code you can use to do this yourself — but first, let’s take a look at a brief example of how we’ve used this algorithmic news sentiment process to make successful trades ourselves (skip to part 3 if you just want code!)

2. Trading on Sentiment 🔎

The case study here is Kroger. Back in September, we noticed that Kroger’s ($KR) stock was up 40% over a six-month period, trading at an all-time high price near $48 per share heading into their 2nd quarter earnings report. We supposed that the earnings report would bring some volatility to the stock, creating a potential opportunity to buy or sell. We also knew that the consensus estimates for their expected earnings numbers were encouraging. So given the interesting setup, we decided to take a look at Kroger’s sentiment in the news to help us make the case for whether the stock would continue to rise after the report, or if it would fall back to Earth.

Using our sentiment pipeline, we analyzed a few hundred articles written about Kroger over the three months leading up to the earnings call. To our surprise, our algorithm detected bearish sentiment about Kroger in the news, which was a bit counterintuitive to us — with such good stock results over the period, why would people be speaking pessimistically about Kroger? We did some more research and looked at the articles being written about the company, and found that people were quite pessimistic on Kroger, for a few primary reasons: first, authors were upset with the amount of debt the company had taken on, and second, they were generally upset with the company’s recent managerial decisions — particularly the appointment of Elaine Chao to their board of directors.

So with this research, informed by the bearish sentiment score from our algorithm, we felt relatively confident that Kroger was overhyped in its price relative to what people were saying, and that a good case could be made that the stock would go down after their earnings call. We decided to short Kroger’s stock into the end of September, then waited for the results. Lo and behold, Kroger released their Q2 numbers, and despite exceeding many analysts’ expectations financially, $KR stock fell 20% over the next 20 days on the bearish sentiment, proving our bearish hypothesis correct. We cashed out on our shorts, and felt confident that our sentiment algorithm could truly be a useful tool for identifying over/underhyped sentiment scenarios. Now of course, this is just one highly compressed example (you can read our full Kroger report from before their earnings here). With that, let’s jump into the actual code:

3. How To Do It In Python 🐍

First things first: if you’re unfamiliar with the Python coding language, check out this article from Coursera to get yourself up to speed. Now, here’s our four-step process for creating an algorithmic news sentiment analysis pipeline. Note that this is not standalone code for implementing the process; for direct access to our code repository, email us at code@babbl.dev. To make things more tangible, we’ll be analyzing the example article “Is Nike’s Stock Ready to Reach All-Time Highs?” from Benzinga and attempting to quantify the author’s outlook:

STEP 1: Getting Market News Data

There are a few ways to scrape stock market news data from the internet in Python: we can parse HTML directly using libraries like Beautiful Soup, or we can pay for an API that gives us news data directly without the hassle. Some great finance news API options include IEX Cloud, EOD Historical Data, and StockNewsAPI; each has its own pros and cons, which we talk more about in the video at the bottom of this page.
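To give a feel for what the parse-it-yourself route involves, here is a minimal sketch that pulls paragraph text out of a page. It swaps in Python's built-in `html.parser` rather than Beautiful Soup so that it runs without extra installs; the `ParagraphExtractor` class, the `extract_paragraphs` helper, and the sample HTML are all made up for illustration, and real news pages will need site-specific handling.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text inside <p> tags, ignoring <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._buffer = []      # text chunks of the paragraph being read
        self._in_p = False
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "p" and self._in_p:
            # collapse runs of whitespace and drop empty paragraphs
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and self._skip_depth == 0:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the list of non-empty paragraph texts found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical page snippet, made up for illustration
sample_html = """
<html><body>
  <script>trackVisitor();</script>
  <p>Nike shares <b>jumped</b> after earnings.</p>
  <p>   </p>
  <p>Analysts remain positive.</p>
</body></html>
"""
print(extract_paragraphs(sample_html))
```

Even this toy version has to deal with nested tags, empty paragraphs, and script content, which hints at why a paid API can be the lower-hassle option.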

We’ve generally used IEX Cloud’s API in the past, mainly because it’s more straightforward than parsing HTML on our own, it’s relatively cheap, and it provides a decently consistent volume of news data. To start, you’ll need to go to their website to subscribe and create an API token; once you’ve done that, the implementation for getting articles is pretty simple:

## 1. get news data
## (requires an IEX Cloud API token; iexfinance reads it from the IEX_TOKEN environment variable)
import iexfinance
from iexfinance.stocks import Stock

## get last 50 Nike articles
stonk = Stock('NKE')
news_df = stonk.get_news(last=50)
article = news_df.iloc[0]  # get first article

The code above essentially calls IEX Cloud’s API and asks for the 50 most recent articles written about Nike ($NKE). We can do this for hundreds of different stocks by simply changing out the ticker from “NKE” to whichever stock we want data for. From here, the next step is to clean the data:

STEP 2: Cleaning and Formatting Raw Text

Once we’ve got some raw article text into Python, the next step is to clean it so that we keep only the sentences that actually matter for Nike’s outlook. Many articles today include advertisements and other material irrelevant to the stock inside the body, and some mention many stocks; in this case, we only care about the phrases mentioning Nike.

To get there, we can use open-source Python libraries like spaCy and the Natural Language Toolkit (NLTK), along with Python’s built-in regular expressions module (re), to do some of the heavy lifting for us. Here’s the gist for getting from raw text to a clean set of relevant Nike sentences:

import re
import pandas as pd
import spacy
import neuralcoref  # spaCy coreference-resolution add-on (built for spaCy 2.x)
from nltk import sent_tokenize

## clean and format text (article is assumed to be the raw article text as a string)
article = decontracted(article)  # expand contractions; decontracted() is a helper you define yourself
article = article.replace('&', 'and')  # swap '&' for 'and' before stripping special characters
article = re.sub(r'[,*)@#"(_^]', '', article).replace('\n', ' ')

## isolate $NKE sentences
nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)  # pronoun resolution pipeline
article = nlp(article)._.coref_resolved  # resolve pronouns on the full text
sentences = sent_tokenize(article)  # separate sentences
sentences = [sentence for sentence in sentences if "Nike" in sentence]
sentences_df = pd.DataFrame(sentences, columns=["text"])
print(sentences_df["text"])  # show Nike sentences

The code above takes a raw text article and starts by removing things that are hard for a computer to interpret consistently. In the first few lines after the imports, the code expands any contractions in the text (“don’t” » “do not”) and removes unnecessary special characters. From here, we use spaCy to take any ambiguous references within the text (i.e. pronouns like “it” and “they”) and replace them with their underlying nouns, which lets us say explicitly whether or not a sentence refers to Nike. Once the pronouns are resolved, we filter out the sentences that don’t reference Nike. Finally, we create a pandas DataFrame with the Nike sentences as rows, so that we can add columns for each feature later. With the filtered and cleaned sentences, we can move on to valuing the features within the text.
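One loose end: the snippet above calls `decontracted()`, which is not part of spaCy or NLTK and has to be supplied by you. A minimal sketch of such a helper, assuming a small hand-written set of regular-expression rules (the two rule tables below are illustrative, not exhaustive), might look like this:

```python
import re

# Irregular contractions that must be handled before the general rules.
# Hypothetical, deliberately small tables; extend them for real use.
SPECIFIC_CONTRACTIONS = {
    r"won['’]t": "will not",
    r"can['’]t": "cannot",
    r"shan['’]t": "shall not",
}
GENERAL_CONTRACTIONS = {
    r"n['’]t\b": " not",
    r"['’]re\b": " are",
    r"['’]ll\b": " will",
    r"['’]ve\b": " have",
    r"['’]m\b": " am",
}


def decontracted(text):
    """Expand common English contractions (straight or curly apostrophes)."""
    for pattern, replacement in SPECIFIC_CONTRACTIONS.items():
        text = re.sub(r"\b" + pattern + r"\b", replacement, text, flags=re.IGNORECASE)
    for pattern, replacement in GENERAL_CONTRACTIONS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


print(decontracted("Nike doesn't expect it won't grow; they're confident"))
```

This sketch deliberately leaves “’s” and “’d” alone, since they are ambiguous (“Nike’s” may be a possessive, and “’d” may mean “would” or “had”). It also does not preserve capitalization for the irregular forms, so a sentence-initial “Won’t” becomes “will not”.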

STEP 3: Sentiment Feature Valuation

Now that we have our cleaned and filtered Nike sentences, we want to get to one singular score for how bullish or bearish the author’s outlook is for Nike. This score has two main components: first, we measure the mood of the text (i.e. optimism or pessimism about Nike), and second, we measure the tense of the text (i.e. past, present, or future). We can then combine these measures to determine how optimistic or pessimistic the text is about Nike’s future.

Let’s start with mood. There are simple open-source coding libraries that can do this for you by scoring the amount of positive or negative language in text, but these end up being relatively inaccurate; instead, we’ll want to create a vocabulary of finance-language keywords that we’d consider to be “optimistic” or “pessimistic” about a stock. This can get complicated, and you can add as much nuance as you like (e.g. handling n-grams, giving different words different scores, accounting for negation, or training the words against a machine-learning model to handle previously unseen text). The language you choose will ultimately determine how your algorithm scores the text (we wrote an article about why traditional open-source mood algorithms are bad for finance news here). Assuming you’ve defined some optimistic / pessimistic vocabs, the code will look something like this:

import numpy as np

optimistic_vocab = ["soar", "jump", "positive", "surge"]
pessimistic_vocab = ["fell", "dip", "drop", "negative"]

# split each sentence into lowercase words, then keep the words that appear in each vocab
sentences_df["optimism"] = sentences_df["text"].apply(lambda sentence: np.intersect1d(sentence.lower().split(), optimistic_vocab))
sentences_df["pessimism"] = sentences_df["text"].apply(lambda sentence: np.intersect1d(sentence.lower().split(), pessimistic_vocab))

The code above is relatively simplified. First, we define our sample optimistic or pessimistic vocabs, then we check each sentence for the presence of either type of word. If we find an optimistic or pessimistic word, we save those words into a column of our dataframe so that we can see them later. From there, we can count the portion of optimistic or pessimistic words — and the ratio between the number of each — to give a sense of the mood of each sentence.
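As a sketch of the counting step, and of one nuance mentioned earlier (negation), the version below uses plain Python instead of pandas. The `negators` set and the rule "flip a word's polarity when the word directly before it is a negator" are simplifying assumptions for illustration, not the production approach, and the helper names are hypothetical.

```python
import re

# Sample vocabularies from the article; real ones would be much larger
optimistic_vocab = {"soar", "jump", "positive", "surge"}
pessimistic_vocab = {"fell", "dip", "drop", "negative"}
# Assumption: a tiny, illustrative list of negation words
negators = {"not", "no", "never"}


def mood_counts(sentence):
    """Count optimistic and pessimistic words, flipping the polarity of a
    vocab word when the word directly before it is a negator."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    opt = pess = 0
    for i, token in enumerate(tokens):
        if token in optimistic_vocab:
            polarity = 1
        elif token in pessimistic_vocab:
            polarity = -1
        else:
            continue
        if i > 0 and tokens[i - 1] in negators:
            polarity = -polarity
        if polarity > 0:
            opt += 1
        else:
            pess += 1
    return opt, pess


def mood_portions(sentence):
    """Share of all words in the sentence that are optimistic / pessimistic."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    if not tokens:
        return 0.0, 0.0
    opt, pess = mood_counts(sentence)
    return opt / len(tokens), pess / len(tokens)


print(mood_counts("Sales fell but margins did not drop"))
```

Note that exact matching misses inflected forms: “jumped” will not match “jump” unless the vocabulary lists it or the text is lemmatized first.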

Now onto classifying the tense of a particular sentence. Similar to classifying the mood, we can define vocabularies for words that are speculative (i.e. relating to the future) or reactive (i.e. relating to the past). However, we can also use the underlying part-of-speech tags for the words in a sentence (which we can identify with spaCy or NLTK) to help us find the tense. Below, we use the verb tags in each sentence to tell us whether it’s written in the past tense; specifically, a “VBD” tag indicates a past-tense verb:

def detect_past(sentence):
    # True if the sentence's root verb, or one of its auxiliaries, is tagged VBD (past tense)
    sent = list(nlp(sentence).sents)[0]
    return (
        sent.root.tag_ == "VBD" or
        any(w.dep_ == "aux" and w.tag_ == "VBD" for w in sent.root.children))

tense_weights = {"past-tense": 1, "non-past-tense": 2}
sentences_df["tense"] = sentences_df["text"].apply(
    lambda sentence: "past-tense" if detect_past(sentence) else "non-past-tense")

The code above defines a function that checks whether the root verb of a sentence, or one of its auxiliary verbs, carries the “VBD” (past-tense verb) tag. From here, we classify the text as either past-tense or non-past-tense, and with that, we have the components needed to create an overall “sentiment” score for each sentence. Since we are concerned with quantifying the author’s outlook about a particular stock, we take the mood score and weight it by the tense to get the author’s sentiment.

def score(row):
    opt = len(row["optimism"])
    pess = len(row["pessimism"])

    ## create score: share of mood words belonging to the dominant mood, signed
    if opt == pess:
        mood = 0
    elif opt > pess:
        mood = opt / (opt + pess)
    else:
        mood = -1 * (pess / (pess + opt))

    ## weight the mood by the tense of the sentence
    sentiment = mood * tense_weights[row["tense"]]
    return sentiment

sentences_df["score"] = sentences_df.apply(score, axis=1)

Above we define one simple way of creating an overall sentiment score for a given sentence. The function takes the share of mood words belonging to the dominant mood (positive when optimistic words outnumber pessimistic ones, negative when the reverse holds, and zero on a tie), then multiplies it by whatever weight you give to the tense of the sentence (since we care more about the future, we give non-past-tense sentences a higher weight than past-tense sentences). Once we have a score for each individual sentence, the final step is to aggregate the sentence scores into one score for the article, which we can do by summing or averaging them. To make the article score easier to compare, we might also normalize it to the range of -100% to 100%. With that, we have successfully created a pipeline to quantify the sentiment of a finance news article!
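The aggregation and normalization step can be sketched in a few lines. This version assumes averaging (rather than summing) and rescales by the largest tense weight: each sentence score is a mood in [-1, 1] multiplied by a weight of at most 2, so dividing the average by 2 lands it in [-1, 1]. The `article_score` name and the sample scores are hypothetical.

```python
# Same weights as in the tense step above
tense_weights = {"past-tense": 1, "non-past-tense": 2}


def article_score(sentence_scores, max_weight=max(tense_weights.values())):
    """Average the sentence scores and rescale the result to [-100, 100] percent.

    Each sentence score lies in [-max_weight, max_weight], so dividing the
    mean by max_weight bounds the article score between -100 and 100.
    """
    if not sentence_scores:
        return 0.0  # no relevant sentences: treat the article as neutral
    mean_score = sum(sentence_scores) / len(sentence_scores)
    return 100.0 * mean_score / max_weight


# Made-up sentence scores for illustration
print(article_score([2.0, -1.0, 0.0, 1.0]))  # 25.0
```

Averaging keeps long and short articles comparable; summing would instead reward articles that simply mention the stock more often, which may or may not be what you want.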

STEP 4: Calculate and Compare Scores

So there we have it, a rough shell for creating your own pipeline to automatically detect the sentiment of a finance news article — again, the code above is primarily demonstrative, meant to be more of a template than a plug-and-play. If you’d like access to our code repository for all of this, send us an email to code@babbl.dev.

Of course, the ability to analyze one article is great; but in reality, we need to apply this to thousands of articles across authors, platforms, stocks, and time to gain a truly aggregate perspective about the mood of the market. So the final step of the 4-step process is to replicate the code above for many articles about a stock across time. Once we’ve got a good representation for an individual stock, we can apply this replication to other stocks, and ultimately group stocks into different segments for industries, sectors, or anything of the like (we can even group by meme-stocks if that’s what we care about). A simple framework for doing this is shown here:

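tickers = ["NKE", "FL", "UAA", "CROX"]
output = {}
for ticker in tickers:
    ticker_output = []
    # all_articles is assumed to map each ticker to its list of articles
    for article in all_articles[ticker]:
        # append() modifies the list in place and returns None, so don't reassign
        ticker_output.append(sentiment_analysis(article))
    output[ticker] = ticker_output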

In the example above, we create a list of tickers that we’re interested in analyzing (in this case, four top consumer discretionary shoe stocks). We can then use IEX Cloud to pull articles for each of them and store them in a variable (in this case, “all_articles”). We also assume a function called “sentiment_analysis” that combines the pipeline steps outlined above for a given article, and we iterate through each of the tickers and each of their articles to run the analysis. Finally, we save the scores from the function into a dictionary with each ticker as a key and the list of article scores as the value. From here, we can average the scores over time for each ticker to tell us which ones are most bullish or bearish, and whether sentiment is increasing or decreasing on a daily basis.
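The averaging-over-time step might be sketched as follows, using only the standard library. It assumes each scored article is recorded as a `(ticker, date, score)` tuple with ISO-formatted date strings; that record shape, the function names, and the sample numbers are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def daily_average(scored_articles):
    """Group (ticker, date, score) records into {ticker: {date: mean score}}.

    Dates are ISO strings (YYYY-MM-DD), so sorting them as text is chronological.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for ticker, day, score in scored_articles:
        buckets[ticker][day].append(score)
    return {
        ticker: {day: mean(scores) for day, scores in sorted(days.items())}
        for ticker, days in buckets.items()
    }


def rank_by_sentiment(daily):
    """Order tickers from most bullish to most bearish by the mean of their daily averages."""
    overall = {ticker: mean(days.values()) for ticker, days in daily.items()}
    return sorted(overall, key=overall.get, reverse=True)


# Hypothetical article scores on the -100 to 100 scale
scored = [
    ("NKE", "2022-02-01", 40.0),
    ("NKE", "2022-02-01", 20.0),
    ("NKE", "2022-02-02", 50.0),
    ("FL", "2022-02-01", -10.0),
    ("FL", "2022-02-02", -30.0),
]
daily = daily_average(scored)
print(daily)
print(rank_by_sentiment(daily))
```

Comparing consecutive daily averages for a ticker then gives a rough read on whether sentiment is rising or falling.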

4. Takeaways ⏭️

There we have it, our process for taking news articles and detecting the sentiment about stocks — once we have this, what are the implications? In the grand scheme of things, this allows us to augment our human limitations by outsourcing some of the things that are hard for us to do as humans to a computer. By quantifying bullishness or bearishness reliably at scale, we can start to identify consensus opinions across different platforms, we can begin to detect articles that are written by bots, or articles that are merely clickbait trying to sell you ads. In essence, we can determine which articles are truly insightful, and which are not worthy of our time, to ultimately help us make better decisions in the market. But what if you don’t want to write the code yourself? Well, that’s where we come in:

Over the past few months, we’ve been creating a website that does all of the news analysis for you (shown above). Our website scrapes the entire internet for all stock market news articles to identify trends, detect events, and find consensus for individual stocks, sectors, and the markets as a whole — allowing you to save time, reduce your information biases, and gain actionable insight. We plan on rolling the site out by the end of 2021, so if you want to be one of our first users to take full advantage of stock market news at scale, join our waitlist below and we’ll send you an email once it’s finished:

SIGN ME UP FOR THE WAITLIST

To close, I’ll leave you with one of my favorite quotes: “life is too short to leave important words unread” — the insight is out there, we’ve shown that there are ways to find it more effectively using code, and we can do it all ourselves in Python to find what matters and what doesn’t, to ultimately make smarter decisions with our money. That’s all for today, thanks for being here and thanks for reading — if you liked this post let us know by liking / commenting below, or sharing with a friend!

Presentation at Minnebar:

This write-up was adapted from a more formal presentation that I gave last month at Minnebar — the nation’s largest and longest-running technology “unconference”. You can check out the original video of me presenting this topic below (and see a few of the other wonderful presentations from the event as well, if you’re interested). That’s all for today, thanks for reading, and feel free to reach out with any questions by commenting, replying, or emailing us!