Unsupervised Machine Learning
September 2019 – December 2019
Link to Paper
As the final project for my Unsupervised Machine Learning class at Northeastern, I used a combination of Locality Sensitive Hashing (LSH) and clustering to find similar news articles. This project required me to perform natural language processing on the articles, test different algorithms and tuning parameters, and evaluate different methods for assessing the accuracy of the results. These last two steps were particularly challenging because the results of the algorithms were subjective and there was no simple accuracy measurement.
A non-trivial aspect of this project was transforming the documents into shingles, which were used as the basis for similarity testing. The first step was cleaning the documents, such as removing punctuation and converting text to lowercase. Additionally, the Python text-processing library NLTK was used to reduce words to their base form. For example, a word ending in -ing, such as "writing", would be converted to simply "write".
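The cleaning step described above can be sketched as follows. This is a minimal illustration, assuming NLTK's Porter stemmer; the project may have used a different NLTK stemmer or lemmatizer, and the `clean_and_stem` helper name is hypothetical:

```python
import re
from nltk.stem import PorterStemmer  # requires: pip install nltk


def clean_and_stem(text):
    """Lowercase the text, strip punctuation, and stem each word to its base form."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # drop punctuation and digits
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in text.split()]


print(clean_and_stem("Writing articles, quickly!"))  # first token becomes "write"
```

Note that stemming produces a normalized base form rather than a dictionary word (e.g. "articles" may stem to "articl"), which is fine here because the same normalization is applied to every document before comparison.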
Once the documents were cleaned, shingles were created. Several shingling methods were tested, including shingles of different lengths and shingles anchored at stop words. Stop words are generally defined as the most common words in a language. Shingle length changed the results significantly: very short shingles grouped too many articles together, while long shingles returned very few similar articles. Finally, any duplicate shingles were removed, producing a binary array indicating whether or not each shingle appeared in a document. This was done partly because the documents varied considerably in length.
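Word-level shingling with deduplication can be sketched in a few lines. This is an illustrative helper (the name `word_shingles` is my own), using a shingle length of two words, the length ultimately chosen in the Results section:

```python
def word_shingles(tokens, k=2):
    """Build the set of k-word shingles; duplicates collapse automatically via the set."""
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


doc = "the cat sat on the mat".split()
print(word_shingles(doc))  # 5 unique two-word shingles
```

Representing each document as a set (rather than a count vector) is what makes the binary presence/absence encoding and the Jaccard similarity used by LSH natural fits.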
To determine similar articles, two main methods were used: Locality Sensitive Hashing (LSH) and clustering. LSH works by hashing documents into buckets, with the goal of placing similar documents into the same bucket. The algorithm aims for a high probability that the articles in a bucket are actually similar to a given degree, although it can produce both false negatives and false positives. LSH is especially useful for very large datasets, such as this news article dataset of over 143,000 articles. The key parameters in LSH are the similarity threshold (t), measured by Jaccard similarity, the number of min-hash bands (b), and the number of rows per band (r). The approximate threshold at which pairs become candidates is given by:

t ≈ (1/b)^(1/r)
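The min-hash and banding scheme described above can be sketched as follows. This is a simplified, self-contained illustration (function names and the use of Python's built-in `hash` are my own choices, not the project's implementation); it uses the 24-band, 5-row configuration reported in the Results section:

```python
import random


def minhash_signature(shingles, num_hashes=120, prime=2**31 - 1, seed=0):
    """Min-hash signature: each entry is the minimum of a random hash function
    over the document's shingle set. Matching entries estimate Jaccard similarity."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    return [min((a * hash(s) + b) % prime for s in shingles) for a, b in coeffs]


def lsh_buckets(signature, bands=24, rows=5):
    """Split the signature into bands; each band hashes to one bucket key.
    Documents sharing any bucket key become a candidate pair."""
    return [hash(tuple(signature[i * rows:(i + 1) * rows])) for i in range(bands)]
```

Because the same seeded hash functions are applied to every document, two documents with high Jaccard similarity agree on many signature entries and are therefore likely to collide in at least one of the 24 bands.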
For clustering, the k-means method was used after testing other methods such as spectral clustering. To optimize the number of clusters (k), the elbow method was used: the SSE (sum of squared errors) was examined as k increased, and the chosen value was the point at which the SSE stopped decreasing sharply. Once the clustering was complete, the clusters and the LSH output were checked against each other to confirm the results.
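The elbow method can be sketched with scikit-learn. This is a toy illustration on synthetic 2-D data standing in for the shingle vectors (the data and parameter choices here are mine, not the project's):

```python
import numpy as np
from sklearn.cluster import KMeans  # requires: pip install scikit-learn

rng = np.random.default_rng(0)
# toy data with three well-separated groups
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

# SSE (inertia) for each candidate k
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)}

# the "elbow": SSE drops steeply up to k=3, then flattens
for k, v in sse.items():
    print(k, round(v, 1))
```

Plotting `sse` against k and picking the bend in the curve is the elbow heuristic; on this toy data the bend falls at k = 3, matching the three generated groups.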
Results and Evaluation
After testing many different combinations of bands and rows, the final configuration was 24 bands and 5 rows, while looking for documents that were at least 60% similar. This combination of bands and rows yields an approximate candidate threshold of 52.96%, significantly lower than 60%. The threshold was deliberately set lower to reduce the number of false negatives, which cannot be fixed in post-processing. The S-curve for this combination can be seen below.
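These numbers can be checked directly. The standard S-curve formula gives the probability that a pair with Jaccard similarity s becomes an LSH candidate, and the approximate threshold is (1/b)^(1/r) (a small sketch, with my own function name):

```python
def candidate_probability(s, bands=24, rows=5):
    """P(a pair with Jaccard similarity s collides in at least one band)."""
    return 1 - (1 - s**rows)**bands


threshold = (1 / 24) ** (1 / 5)
print(round(threshold, 4))                    # 0.5296, the ~52.96% threshold
print(round(candidate_probability(0.60), 3))  # 0.857: a 60%-similar pair is usually caught
```

Lowering the effective threshold below 60% trades extra false positives (removable by a later exact check) for fewer false negatives (unrecoverable once missed).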
A shingle length of two words was determined to be the best length. The number of candidate pairs, around 25,000, was large enough to be useful but not so large that a check for false positives could not be run. For some applications this number of candidate pairs may be considered too low, but for the purposes of this project it provided a good pool of candidate articles.
The clustering evaluation supported the LSH results: about 76% of the pairs LSH returned as similar were also placed in the same cluster by k-means. This indicates that k-means picked up on the same similarity trends that LSH detected. Cross-checking LSH against clustering increases confidence in the results of each algorithm without having to manually check the matches for similarity.
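The agreement figure above is straightforward to compute once both outputs exist. A minimal sketch (the function name and toy data are mine):

```python
def lsh_cluster_agreement(candidate_pairs, cluster_labels):
    """Fraction of LSH candidate pairs whose two documents share a k-means cluster."""
    same = sum(1 for a, b in candidate_pairs if cluster_labels[a] == cluster_labels[b])
    return same / len(candidate_pairs)


# toy example: document ids mapped to cluster labels
labels = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C"}
pairs = [(0, 1), (2, 3), (0, 2), (3, 4)]
print(lsh_cluster_agreement(pairs, labels))  # 2 of 4 pairs agree -> 0.5
```

Applied to the real outputs, this is the computation behind the 76% agreement between LSH candidate pairs and the k-means clustering.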