Buy this Data Science! Coronavirus! (Part 2: Data Collection and Exploration)

Tim Lee
3 min read · Mar 18, 2020

This is Part 2 of my series on applied data science in the field of financial research.

As part of my exploration of financial research, I’ve begun analyzing the articles published by a handful of online publications. I’ve collected and compiled over 12,000 articles from eight publications, saving each article’s text to a text file and its metadata to a CSV. Keep in mind that this is simply the foundation for further analysis, so these first analyses are very simple, and you’ll watch as I try to work through foundational mistakes.

Library of Articles

My first analysis was to count the number of exclamation points within each article. Well-written journalism typically keeps a level head and doesn’t need to shout at its audience. This was easy: I simply counted the occurrences of “!” in each article and then looked at how many exclamation points appear per article. Publications that catered to a more “casual” audience, and publications that relied on member-written content, seemed to have the most exclamation points.
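Here’s a minimal sketch of that count, assuming each publication’s articles sit as plain-text files in their own folder (the articles/ layout below is hypothetical):

```python
# A minimal sketch of the exclamation-point count, assuming one folder
# per publication with each article saved as a .txt file.
from pathlib import Path

def exclamations_per_article(publication_dir):
    """Return the average number of '!' characters per article in a folder."""
    counts = []
    for article in Path(publication_dir).glob("*.txt"):
        text = article.read_text(encoding="utf-8", errors="ignore")
        counts.append(text.count("!"))
    return sum(counts) / len(counts) if counts else 0.0

# Hypothetical layout: articles/<publication>/<article>.txt
for pub in Path("articles").iterdir():
    if pub.is_dir():
        print(pub.name, round(exclamations_per_article(pub), 2))
```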

Publications and various analyses

NLTK is a library that facilitates natural-language processing. It can take in text and break it into individual words, called tokens. It can also tag those tokens with parts of speech, recognize named entities, and work with semantic logic. I haven’t yet explored the full power of NLTK, so for now I’ve simply broken the articles down into words and counted the highest occurrences.
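A small example of the tokenization and part-of-speech tagging NLTK offers (the sample sentence is just an illustration):

```python
# A small NLTK sketch: tokenize a sentence and tag parts of speech.
# Requires the 'punkt' and 'averaged_perceptron_tagger' data packages.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Coronavirus fears sent markets tumbling on Wednesday."
tokens = nltk.word_tokenize(sentence)   # ['Coronavirus', 'fears', 'sent', ...]
tagged = nltk.pos_tag(tokens)           # [('Coronavirus', 'NNP'), ('fears', 'NNS'), ...]
print(tagged)
```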

I’ve filtered out “stop words”, common words such as articles (“a”, “the”) that don’t contribute much to a sentence’s meaning. Each run counts up all the words per publication and then tallies the most frequently used ones. I’d want to take a closer look at any weasel words or buzzwords, such as whether “buy” is used a little too much. Right now, the hot word seems to be “coronavirus”.
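That word-frequency pass might look something like the sketch below, assuming the article texts are already loaded into memory (the texts_by_publication mapping is hypothetical):

```python
# A sketch of the word-frequency pass: tokenize each article, drop stop
# words and punctuation, and tally the most common remaining words.
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def top_words(texts, n=20):
    counts = Counter()
    for text in texts:
        for token in nltk.word_tokenize(text.lower()):
            if token.isalpha() and token not in STOP_WORDS:
                counts[token] += 1
    return counts.most_common(n)

# texts_by_publication: {publication_name: [article_text, ...]} (hypothetical)
# for pub, texts in texts_by_publication.items():
#     print(pub, top_words(texts, n=10))
```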

Lastly, I used a library that “plays nicely” with NLTK: TextBlob. TextBlob has a sentiment analysis function, which analyzes individual sentences and assigns each a score on a scale of -1 to 1. I disregarded all sentences with a score of zero and took the mean of the remaining sentences’ scores for each publication. My assumption is that journalism should be level-headed and measured, though I couldn’t draw much of a distinction among the various publications.
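A sketch of that sentiment averaging, using the same hypothetical mapping of publications to article texts as above:

```python
# A sketch of the per-publication sentiment average with TextBlob:
# score every sentence, skip the exact zeros, and take the mean.
from textblob import TextBlob

def mean_nonzero_sentiment(texts):
    scores = []
    for text in texts:
        for sentence in TextBlob(text).sentences:
            polarity = sentence.sentiment.polarity  # ranges from -1 to 1
            if polarity != 0:
                scores.append(polarity)
    return sum(scores) / len(scores) if scores else 0.0

# for pub, texts in texts_by_publication.items():
#     print(pub, round(mean_nonzero_sentiment(texts), 3))
```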

My next step will be to analyze the article metadata, because it may contain some good predictors, especially within the tags or keywords. I’ve been storing metadata in CSVs, which was a mistake because many headlines, and sometimes author fields, contain commas. My solution will be to convert these metadata sheets into MySQL database tables.

Many headlines contain commas, so using a CSV was a mistake
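One way to make that move, assuming the metadata loads cleanly into a pandas DataFrame (the connection string, file name, and table name below are all placeholders):

```python
# A sketch of loading the metadata CSV and writing it to a MySQL table.
# The credentials, database name, and column layout are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/articles_db")

# Quoted fields keep their embedded commas intact on the way in.
metadata = pd.read_csv("metadata.csv", quotechar='"')
metadata.to_sql("article_metadata", engine, if_exists="replace", index=False)
```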

Laying the groundwork for clean, accessible data is important. I still don’t know what exploratory and analytical techniques I’ll learn next, but my data will be there when I’m ready. Stay tuned for more!

