Coronavirus is a Strong Buy! (Part 3: Text Generation and Language Models)
This is Part 3 of my series on applied data science in the field of financial research.
GPT-2 is a transformer-based language model released in February 2019. It’s revered for its ability to generate text nearly indistinguishable from text written by a human. That fluency comes from the model’s size: with 1.5 billion parameters, it is roughly 10 times larger than GPT-1, and it was trained on a dataset of 8 million web pages. With so much training data, the transformer can not only generate realistic next words and sentences, but produce coherent paragraphs of text, thanks to the long-range context captured by its self-attention mechanism.
It also has the potential to, in the wrong hands, generate incredibly persuasive fake news. OpenAI, the AI research organization that built GPT-2, initially refused to release the full model, citing the danger of fake news proliferation. They also withheld the training dataset and fine-tuning code, again fearing that this would allow malicious actors to write fake news. They did, however, release smaller versions of the model in stages.
So let’s write some fake news!
I conducted a small exercise in transfer learning: training a model to solve one problem, then adapting it to a different but related one. The data for GPT-2 was collected no later than 2017, but we’re now living in a brave new era of news: the age of non-stop coronavirus coverage. I’ve fine-tuned the smallest released version of the GPT-2 model on financial news articles written in March 2020: 1,896 from CNBC and 1,822 from Motley Fool. The original model has never heard of “coronavirus”, but let’s see if it can write convincing articles about it. It’s also worth seeing whether the writing style of each publication is reflected in the results.
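Before fine-tuning, the scraped articles need to be compiled into one plain-text training file per publication. Here’s a minimal sketch of that step; the function name and the sample article strings are my own illustrations, but the `<|startoftext|>`/`<|endoftext|>` delimiters are the tokens gpt-2-simple recognizes for separating documents.

```python
def compile_dataset(articles):
    """Join raw article texts into one training-file body, wrapping each
    article in the start/end tokens that gpt-2-simple recognizes as
    document boundaries."""
    parts = []
    for text in articles:
        parts.append("<|startoftext|>\n" + text.strip() + "\n<|endoftext|>")
    return "\n".join(parts)

# Hypothetical usage: build one combined corpus per publication,
# then write it to disk (e.g. cnbc.txt) for fine-tuning.
cnbc_corpus = compile_dataset([
    "Stocks fell sharply on Monday as investors weighed new data...",
    "The Fed announced new emergency measures on Tuesday...",
])
```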
I used the third-party library “gpt-2-simple” (based on this ad hoc implementation), which makes fine-tuning the released models straightforward. I ran the TensorFlow 1.15 code inside a Docker container so I could easily use my NVIDIA video card for GPU-based training. My GeForce 1060 has only 3GB of VRAM, so I could only fine-tune the smallest model, and even then I had to tinker inside the gpt-2-simple library to reduce the sample length for each batch.
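The fine-tuning call itself is short. Below is a sketch of how this looks with gpt-2-simple; it assumes the library and TensorFlow 1.15 are installed, and the dataset path and run name are illustrative placeholders, not the author’s actual filenames. The import is deferred into the function so the sketch stays self-contained.

```python
import os

def finetune_publication(dataset_path, run_name, steps=15000):
    """Fine-tune the smallest released GPT-2 on one publication's articles.
    Sketch only: assumes gpt-2-simple and TensorFlow 1.15 are installed."""
    import gpt_2_simple as gpt2  # heavy third-party import, deferred

    model_name = "124M"  # the smallest released model
    if not os.path.isdir(os.path.join("models", model_name)):
        gpt2.download_gpt2(model_name=model_name)  # fetch base weights once

    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess,
                  dataset=dataset_path,  # plain-text file of compiled articles
                  model_name=model_name,
                  run_name=run_name,     # e.g. "cnbc" or "fool"
                  steps=steps,
                  print_every=1,         # print loss and average loss each step
                  sample_every=500)      # print a random sample periodically
    return sess
```

Running this once per publication produces two separate checkpoints that can be loaded independently for generation.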
After compiling the articles into two files, I trained two separate fine-tuned models, one per publication, for 15,000 training steps each. The loss and average loss are printed after each step, and a randomly generated text sample every few hundred steps, letting me monitor the training progress.
After 1,500 training steps, the CNBC model was already talking about coronavirus, and citing an inaccurate figure for American deaths: we’ve only had 3,800 deaths, not 7,800. So it looks like we already have somewhat randomly generated fake financial news, and it wrote about coronavirus unprompted! Let’s keep going.
I watched the loss gradually fall during training, from about 3 to about 2 after 14,500 steps. Again, there’s evidence of random generation: SBUX hasn’t traded at $22 in nearly a decade. Here the model tried to talk about something other than coronavirus, but couldn’t help itself.
Now the fun part: after training on both the CNBC and Motley Fool articles, we can generate any fake news we want! We can give each model a seed word, phrase, sentence, or paragraph, and it will write the rest. To start, I gave them both the prompt “Economists are sounding the alarm”, which shouldn’t be too difficult given the bearish state of the news. The output sounds smart, until we realize that April hasn’t happened yet.
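Prompted generation looks something like the sketch below, using gpt-2-simple’s `load_gpt2` and `generate` calls. The function name and parameter values (length, temperature, sample count) are my own illustrative choices, not necessarily what was used here.

```python
def generate_fake_news(run_name, prompt, nsamples=3):
    """Generate samples from a fine-tuned checkpoint, seeded with a prompt.
    Sketch only: assumes gpt-2-simple and TensorFlow 1.15 are installed."""
    import gpt_2_simple as gpt2  # heavy third-party import, deferred

    sess = gpt2.start_tf_sess()
    gpt2.load_gpt2(sess, run_name=run_name)  # restore fine-tuned weights
    return gpt2.generate(sess,
                         run_name=run_name,
                         prefix=prompt,      # the model continues from here
                         length=200,         # tokens to generate per sample
                         temperature=0.7,    # lower = more conservative text
                         nsamples=nsamples,
                         return_as_list=True)
```

Calling this with `run_name="cnbc"` versus `run_name="fool"` is what lets us compare how each publication’s fine-tuned model continues the same prompt.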
Next, I tested “fake news” fake news, with the prompt “Coronavirus actually isn’t that bad”.
I wanted to see how far I could go with this. “Coronavirus actually isn’t that bad! I got it and then gave it to five other people, and then we all got rich, just by”. The Fool model seemed to pick up on my enthusiasm.
I learned that it was best if I pandered to my model. “Coronavirus is a strong buy! Futures of the virus are”
These results were cherry-picked, of course, and they aren’t quite Wall Street Journal-quality. But even the smallest model could produce text that could fool an average reader for a few moments, which is enough in today’s clickbait-headline, attention-deprived online atmosphere.
I’m sure you can see how dangerous this can get.