Data Science Madness 5: First Recommendations

in #programming

The above image was made with Stable Diffusion using the prompt 'letters falling from the sky.'

I'm mildly obsessed with news article recommendation algorithms. At WantToKnow.info, we have an archive of 12k news article summaries. Three-fourths of these are on high-level corruption and cover-ups, while the remainder are on inspiring topics. For fun on my own time, I've been playing around with a copy of our news archive, exploring relationships between articles based on article text.

My first post on the project described my initial encounter with the data. My second post described the trials and tribulations of preprocessing. My third post showed how I finally got all of the data into a pandas dataframe. My fourth post showed how I scrubbed and standardized the data to make deeper analysis possible.

Cosine Similarity

Our current data has several features: article date, upload date, ID number, title, text, publication, category tags, and manually-assigned rating. Right now on our website, article searches can be sorted by article date, upload date, and rating. This is great, but it would be better if sorting by publication were also possible. One happy byproduct of my project so far has been a publication standardization key for cleaning up our messy archive publication data.
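
To give a sense of how a key like that gets used, here's a minimal sketch of applying one with pandas. The publication spellings and the standalone 'Publication' column below are invented for illustration; the real key feeds the 'cpub' column that shows up in the code later in this post.

import pandas as pd

#hypothetical excerpt of a publication standardization key:
#messy archive spellings -> canonical publication names
pub_key = {
    'ny times': 'The New York Times',
    'NYTimes': 'The New York Times',
    'wash post': 'The Washington Post',
}

df = pd.DataFrame({'Publication': ['ny times', 'wash post', 'NYTimes']})

#map messy values to their standardized form, keeping unknown values as-is
df['cpub'] = df['Publication'].map(pub_key).fillna(df['Publication'])
print(df)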

These features on their own would be sufficient to generate a list of recommended articles for a selected article. But there's other good information buried in the data. So I decided to use a natural language processing technique that turns article text into mathematical coordinates and then measures the distance between these sets of coordinates. What I ended up with was a list of the ten closest matches for every article.
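
If you're not familiar with the technique, here's a toy example, separate from my actual pipeline, of TF-IDF plus cosine similarity on three miniature 'articles':

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    'Declassified documents reveal a decades-long cover-up',
    'New documents reveal details of the cover-up',
    'Community garden project inspires local volunteers',
]

#each document becomes a vector of TF-IDF weights (its "coordinates")
tfidf = TfidfVectorizer().fit_transform(docs)

#values near 1 mean very similar wording, values near 0 mean little overlap;
#the first two toy articles score far higher with each other than with the third
print(cosine_similarity(tfidf))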

First, I had some cleanup to do on my preprocessed file. When all was said and done, about 10% of the entries were trimmed from the dataframe, mostly due to missing essential data. Then I used scikit-learn to turn the article text into TF-IDF (Term Frequency - Inverse Document Frequency) vectors. After that, I computed the pairwise cosine similarity for all of the article vectors and wrote a list of the ten closest articles to a new column in my dataframe.

The next step is to test out the results to see if the articles on the list are actually good recommendations. I'm still thinking about the best way to do this. Realistically, I'll probably just randomly pick a dozen to check manually. Even if the matches aren't great, I've added a nifty new feature to the data.
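
For what it's worth, the spot check itself should only take a few lines. Here's a rough sketch of what I have in mind, which assumes the 'Related' column built by the code below:

#print a dozen random article titles next to the titles of their ten
#recommended matches, for a quick manual sanity check
titles_by_id = df.set_index('ArticleId')['Title']
for _, row in df.sample(12, random_state=42).iterrows():
    print('ARTICLE:', row['Title'])
    for rec_id in row['Related']:
        print('   ->', titles_by_id[rec_id])
    print()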

Below is my code. Note that this was computationally intensive and took several minutes to run.

import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#read in selected columns of preprocessed file
df = pd.read_csv("C:\\datasources\\WTKfullraw.csv", sep='|', usecols=['ArticleId','Title','PublicationDate','cpub','Links','wtkURL','Description','Note','tags','Priority'])

#deduplication and NaN cleanup
df = df.drop_duplicates(subset='Title')
df = df[df['tags'].notna()]
df = df[df['Priority'].notna()]
df = df[df['wtkURL'].notna()]

#collapse runs of whitespace into a single space
df['Description'] = df['Description'].apply(lambda x: re.sub(r'\s+', ' ', str(x)))
#collapse doubled double quotes into single double quotes
df['Description'] = df['Description'].apply(lambda r: r.replace('""', '"'))
#strip inline paragraph styling down to a plain <p> tag
df['Description'] = df['Description'].apply(lambda r: r.replace('<p style="text-align: justify;font-size: 11pt; font-family: Arial;margin: 0 0 11pt 0">', '<p>'))

#compute the TF-IDF vectors for the preprocessed article text
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['Description'])

#compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

#iterate through each article and find the most similar articles based on cosine similarity
related_articles = []
for i in range(len(df)):
    similar_articles = []
    
    # get the cosine similarity scores for the current article
    scores = list(enumerate(cosine_sim[i]))
    
    # sort the scores in descending order
    scores = sorted(scores, key=lambda x: x[1], reverse=True)
    
    # get the top 10 most similar articles (excluding itself)
    top_similar = scores[1:11]
    for j, _ in top_similar:
        similar_articles.append(df.iloc[j]['ArticleId'])
    related_articles.append(similar_articles)

#add the column of related articles to the DataFrame
df['Related'] = related_articles

#df.tail(6)
#df.to_csv('C:\\datasources\\WTKrelated.csv', index=False)
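
A caveat on the computational cost: the full pairwise matrix for roughly 11,000 articles is an 11,000-by-11,000 array of 64-bit floats, which works out to close to a gigabyte of RAM. If that ever becomes a problem, one alternative I may try (just a sketch, not something I've run against the archive yet) is letting scikit-learn's NearestNeighbors find the ten closest articles directly instead of materializing the whole similarity matrix:

from sklearn.neighbors import NearestNeighbors

#ask for 11 neighbors: the nearest one is the article itself,
#so neighbors 2-11 become the ten recommendations
nn = NearestNeighbors(n_neighbors=11, metric='cosine').fit(tfidf_matrix)
_, indices = nn.kneighbors(tfidf_matrix)

related_articles = [
    [df.iloc[j]['ArticleId'] for j in row[1:]]  #drop the article itself
    for row in indices
]
df['Related'] = related_articles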

Until this point in the project, I was mostly revising code from five years ago, when I last worked on this archive. But today's code is all new. Now I'm trying to decide if I want to get serious enough about PyScript to attempt making a webpage from this data. Right now, I'm on the fence.


See my NFTs:

  • Small Gods of Time Travel is a 41 piece Tezos NFT collection on Objkt that goes with my book by the same name.
  • History and the Machine is a 20 piece Tezos NFT collection on Objkt based on my series of oil paintings of interesting people from history.
  • Artifacts of Mind Control is a 15 piece Tezos NFT collection on Objkt based on declassified CIA documents from the MKULTRA program.
Comments
So I decided to use a natural language processing technique that turns article text into mathematical coordinates and then measures the distance between these sets of coordinates.

You mentioned this earlier and my newbie mind is blown by what's possible in the data science world :) So grateful for your hard work on creating a publication standardization key for us!

Now I'm trying to decide if I want to get serious enough about PyScript to attempt making a webpage from this data.

Is this what you meant when you were talking about running Python code in HTML? If so, is this what would allow people to work with the code you created without needing to know Python or use its programs?

Your hard work here is super impressive!

is this what would allow people to work with the code you created without needing to know Python or use its programs?

Sort of. I'd have to write new code for a PyScript webpage, but it would allow people to interact with the article data in a variety of ways. It might also be interesting to post stats on the whole collection, like how many articles from each publication, the distribution of ratings, etc.
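
If I go that route, the collection-wide stats are only a couple of lines in pandas. A quick sketch using the same dataframe as in the post, assuming 'cpub' holds the standardized publication and 'Priority' holds the rating:

#articles per standardized publication
print(df['cpub'].value_counts())

#distribution of manually-assigned ratings
print(df['Priority'].value_counts().sort_index())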