Fake news is all around us – whether we can identify it or not. Individuals and organizations publish fake news all the time, whether as a persuasion tactic or simply to drown out unfavorable truths. Take the search for a Covid-19 vaccine, an issue that is especially relevant right now. Before a vaccine came out, some sources claimed that a fully effective vaccine was already available, some said one was coming very soon, and others insisted that a safe and functional vaccine would take decades to arrive. Trusting and following the wrong source can do more harm than good.
Now the question becomes: which websites do we trust, and which do we ignore? In practice it is rarely obvious which sites to trust and which to reject, or which are real and which are fake.
Fortunately, Big Data can save the day! In today’s world of ever-growing data streams, one can imagine crunching through the volumes of data to detect patterns that can then be analyzed to separate real news from fake.
That is exactly the project I executed: a fake news detection machine learning model that uses natural language processing techniques to classify news websites as either fake or real.
This machine learning model performs binary classification to decide whether a news site is fake or real: an output of ‘1’ indicates that the website is most likely fake, and ‘0’ indicates that the site is trustworthy. It takes a list of website URLs and the corresponding raw HTML as input data and trains a logistic regression model to output a label of either 0 or 1.
The core of this model is the set of natural language processing techniques used to transform the input data from words into numbers that the machine can understand and learn from. I transformed the data by writing several functions generally referred to as featurizers. Each featurizer extracts key features of the URL and HTML that may help predict the trustworthiness of the site and converts them into numerical values to feed into the logistic regression model.
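To make the featurizer idea concrete, here is a minimal sketch. The function name `example_featurizer`, the placeholder data, and the specific features it extracts are illustrative assumptions, not the exact ones I used.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical placeholder data: (url, html) pairs.
data = [
    ("http://example-news.com", "<html><body>Plain reporting.</body></html>"),
    ("http://totally-real.news", "<html><body>SHOCKING!!! Click now!</body></html>"),
]

def example_featurizer(url, html):
    """Map a (url, html) pair to a dictionary of numeric features."""
    return {
        "url_length": len(url),
        "html_length": len(html),
        "exclamation_count": html.count("!"),
    }

# Each website's feature dictionary is converted into a numeric vector
# that a logistic regression model can consume.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform([example_featurizer(u, h) for u, h in data])
```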
To obtain the data necessary for my model, I scraped the web for news websites and compiled a set of 2,557 sites, consisting of roughly 50% fake and 50% real. I then split my data into a training set, a cross-validation set, and a test set.
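A sketch of the three-way split is below; the roughly 60/20/20 proportions and the placeholder data are illustrative assumptions, not necessarily the exact split I used.

```python
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data: (url, html) pairs and 0/1 labels.
sites = [("http://site{}.com".format(i), "<html></html>") for i in range(10)]
labels = [i % 2 for i in range(10)]

# First hold out a test set, then split the remainder into train / cross-validation.
sites_rest, sites_test, y_rest, y_test = train_test_split(
    sites, labels, test_size=0.2, random_state=0)
sites_train, sites_val, y_train, y_val = train_test_split(
    sites_rest, y_rest, test_size=0.25, random_state=0)
```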
I created my first baseline featurizer as a domain featurizer that extracts basic features from the domain name extension of each website. This domain featurizer takes in a URL and its HTML and returns a dictionary mapping feature descriptions to numerical features. The accuracy of this model was only 55%, which was not surprising: the domain extension, while it might provide some clues, cannot be a deterministic predictor of a website’s trustworthiness.
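A minimal sketch of what such a domain featurizer might look like, assuming the URL is a bare domain; the list of extensions checked here is an illustrative assumption.

```python
def domain_featurizer(url, html):
    """Return 0/1 features describing the URL's domain extension."""
    features = {}
    for extension in [".com", ".org", ".net", ".info", ".edu", ".gov"]:
        features["ends_with_" + extension] = int(url.rstrip("/").endswith(extension))
    return features

print(domain_featurizer("http://example-news.com", "<html></html>"))
# e.g. {'ends_with_.com': 1, 'ends_with_.org': 0, ...}
```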
The key problem with this model is that it simply does not have enough information. To address this, my next step was to use specific (and potentially predictive) keywords from the HTML, in addition to the domain extension, as features for the logistic regression model. After featurizing the data this way and training a logistic regression model, the accuracy rose to 73%.
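A hedged sketch of the keyword idea: count how often a few potentially predictive words appear in the page’s HTML. The keyword list here is an illustrative assumption, not the exact list I used.

```python
# Illustrative keyword list, not the exact one used in the project.
KEYWORDS = ["sponsored", "miracle", "shocking", "hoax", "sources"]

def keyword_featurizer(url, html):
    """Count occurrences of each keyword in the page's HTML."""
    html_lower = html.lower()
    features = {"count_" + kw: html_lower.count(kw) for kw in KEYWORDS}
    # In the full pipeline, these counts are combined with the domain features.
    return features
```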
The model performed considerably better than the domain method, but this is still a relatively simple approach, so I started to think of more nuanced ones. The meta description in a website’s HTML is a great source of information conveying the core content of that website. As an improvement over my keyword featurizer, I applied the bag-of-words NLP model to these descriptions. Once I obtained the score reports for this model, I observed that all of the metrics yielded much higher percentages than before.
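A minimal bag-of-words sketch, assuming the meta descriptions have already been extracted from each page’s HTML (for example with an HTML parser); the toy descriptions, labels, and vocabulary cap are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy meta descriptions and labels (1 = fake, 0 = real).
descriptions = [
    "Shocking miracle cure that doctors do not want you to know about",
    "Independent reporting on politics, business and science",
]
labels = [1, 0]

# Bag-of-words: each description becomes a vector of word counts.
vectorizer = CountVectorizer(max_features=300)  # vocabulary cap is an assumed value
X = vectorizer.fit_transform(descriptions)
model = LogisticRegression().fit(X, labels)
```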
Now, a shortcoming of the bag-of-words model is that it only looks at the counts of words in each website’s description. I wondered whether there was a way to capture the meaning of the words in the descriptions. This is where word vectors come in: I used a pretrained model called GloVe to accomplish this task, which yielded an accuracy of about 87%.
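A hedged sketch of the word-vector approach: each description is represented by the average of its words’ GloVe vectors. The loading of the pretrained vectors is omitted and assumed here, as is the 300-dimensional vector size.

```python
import numpy as np

VEC_DIM = 300  # dimensionality of the assumed pretrained GloVe vectors

def glove_featurizer(description, glove):
    """Average the GloVe vectors of the words in a description.

    `glove` is assumed to be a dict mapping word -> numpy array,
    e.g. parsed from a pretrained glove.6B.300d.txt file.
    """
    vectors = [glove[w] for w in description.lower().split() if w in glove]
    if not vectors:
        return np.zeros(VEC_DIM)
    return np.mean(vectors, axis=0)
```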
Having tried several different featurizers and observed the score reports for each, I was curious whether combining all of the featurization approaches would improve the results. I concatenated the feature vectors, passed the combined vector into my logistic regression model, and obtained an accuracy of 91%, the highest yet.
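A minimal sketch of the combination step; the array shapes and random placeholder values are illustrative assumptions standing in for the real featurized data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative shapes: 100 sites, 20 domain/keyword features, 300 GloVe dimensions.
keyword_features = np.random.rand(100, 20)
glove_features = np.random.rand(100, 300)
y = np.random.randint(0, 2, size=100)  # toy 0/1 labels

# Concatenate the feature blocks side by side and train on the combined vector.
X_combined = np.hstack([keyword_features, glove_features])
model = LogisticRegression(max_iter=1000).fit(X_combined, y)
```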
It was then time to evaluate the model on the unseen test data to obtain its real accuracy. The score reports showed that my model predicted the trustworthiness of news websites with 91% accuracy.
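Continuing the sketch above, the final evaluation might look like this, where `X_test` and `y_test` are stand-ins for the held-out test data featurized the same way as the training data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Stand-ins for the real featurized test data (same 320 combined features as above).
X_test = np.random.rand(30, 320)
y_test = np.random.randint(0, 2, size=30)

y_pred = model.predict(X_test)  # `model` is the combined-features model from the previous sketch
print("test accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```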
As with any machine learning model, there is room to improve the score metrics even further, for example by obtaining a larger dataset or developing more featurization approaches.