Well-written financial research is full of nuance, contextual analysis, and supporting graphs and figures. But it can often be boiled down to one of two conclusions: BUY or SELL. These one-word conclusions are frequently stated explicitly by the author, giving us a natural target for report classification.
These conclusions, roughly equivalent to “sentiment”, are subtly baked into the writing of the report: the presence of certain words can indicate which way a report swings. We can count how often those words appear in our labeled (buy or sell) research and use that information to predict the conclusions of unlabeled reports.
There are a few use cases for training this type of classifier. Perhaps we have a large volume of research from various sources but no time to read it all, and we need the aggregated conclusions. Or maybe, in a very rough interpretation of generative adversarial networks, we’re writing financial fake news and need to verify that we’re writing positive or negative reports.
So we’ll walk through the basic creation of a Naïve Bayes classifier using NLTK, the Natural Language Toolkit. We’ll train it on 8,000+ reports labeled with the equivalent of BUY, SELL, or HOLD. This will be more theoretical than technical, to explain the basic concepts of an NB classifier.
We’ll divide our reports into a training set and a testing set. The training set is used to train the classifier, and the testing set to verify its accuracy.
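As a minimal sketch of that split, assuming a hypothetical `reports` list of (text, label) pairs:

```python
import random

# Hypothetical: `reports` is a list of (report_text, label) pairs,
# with each label one of "BUY", "SELL", or "HOLD".
random.shuffle(reports)

split_point = int(len(reports) * 0.8)  # e.g. an 80/20 split
train_reports = reports[:split_point]
test_reports = reports[split_point:]
```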
Our first step is to count ALL the words in the training set’s reports, which lets us build a list of the 300 most common words (our features). Then we go back through each report and check it for the presence of each feature, generating a per-report dictionary that maps each of the 300 features to True or False depending on whether that word appears in the report.
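Here’s a rough sketch of that step, continuing from the `train_reports` list above; the tokenization (lowercasing and splitting on whitespace) is deliberately simplistic:

```python
import nltk

# Count every word across the training reports and keep the 300 most
# common as our feature words.
all_words = nltk.FreqDist(
    word.lower()
    for text, label in train_reports
    for word in text.split()
)
feature_words = [word for word, count in all_words.most_common(300)]

def report_features(text):
    """Map each of the 300 feature words to True/False for this report."""
    words_in_report = set(text.lower().split())
    return {word: (word in words_in_report) for word in feature_words}
```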
Finally, we pass this dictionary, along with the report’s sentiment, into the classifier. We would expect a SELL report, for example, to have lots of True values for features like “weak”, “decline”, and “pressure”.
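Concretely, training boils down to pairing each feature dictionary with its label and handing the list to NLTK’s Naïve Bayes classifier (continuing the sketch above):

```python
# Each training example is a (feature dictionary, label) pair.
train_set = [(report_features(text), label) for text, label in train_reports]

classifier = nltk.NaiveBayesClassifier.train(train_set)
```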
The classifier learns which features appear most often in reports with each of the three ratings, then uses that to read unlabeled reports and return the most probable label. After training, we can see how the classifier performed on the testing set: NLTK reports the overall accuracy and can display the most informative features, listing for each one the ratio of its appearances under one label compared to another.
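With NLTK that evaluation might look roughly like this; `nltk.classify.accuracy` gives the overall accuracy, and `show_most_informative_features` prints the per-feature label ratios:

```python
test_set = [(report_features(text), label) for text, label in test_reports]

# Fraction of test reports whose predicted label matches the true label.
print("Accuracy:", nltk.classify.accuracy(classifier, test_set))

# Features whose presence most strongly favors one label over another,
# along with the ratio between the two labels.
classifier.show_most_informative_features(20)
```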
If we’re unsatisfied with the most informative features, we can fine-tune the training process by blacklisting certain domain-specific words, or by adding specific words to the top-300 feature list.
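One way to do that, sketched here with made-up blacklist and extra words, is to filter the frequency list before taking the top 300:

```python
# Hypothetical domain-specific words to exclude, and words to force in.
blacklist = {"company", "quarter", "inc"}
extra_words = ["downgrade", "upgrade"]

feature_words = [
    word for word, count in all_words.most_common(500)
    if word not in blacklist
][:300] + extra_words
```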
With a completed classifier, we can gauge the sentiment of any new or unlabeled report, no matter where it came from!
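Classifying a new report then takes a single call; the report text below is made up, and the predicted label is only illustrative:

```python
new_report = "Margins remain under pressure and we expect demand to decline."

print(classifier.classify(report_features(new_report)))  # might print "SELL"
```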