Data Science Madness 3: Into a Pandas Dataframe

in #programming · last year

[Image: panda holding a newspaper]

The above image was made with Stable Diffusion using the prompt 'Panda holding a newspaper.'

I want to start playing around with news article recommendation algorithms. I'm also interested in using machine intelligence to identify and map out connections between news reports to see what larger stories emerge. This is related to my work for WantToKnow.info, but I'm just doing it for fun on my own time.

My first post on the project described my initial encounter with the data. My second post described the trials and tribulations of preprocessing. That post ended with a problem I was stuck on: NaNs were appearing where strings should have been. I solved it by restructuring the preprocessing script slightly. Here's how the preprocessing code ended up:

import pandas as pd
import numpy as np
import os
import re

#read in category tables data 
dftemp_A = pd.read_csv("C:/datasources/CategoryArticle.csv", sep='|')
dftemp_B = pd.read_csv("C:/datasources/Category.csv", sep='|', usecols=['cid','url'])

#cast to str for merge
dftemp_A['cid'] = dftemp_A['cid'].astype(str)
dftemp_B['cid'] = dftemp_B['cid'].astype(str)

# Merge the two dataframes on the 'cid' column
merged_df = pd.merge(dftemp_A, dftemp_B, on='cid')

# Group the merged dataframe by 'ArticleId' and aggregate the 'url' values into a list
dftemp_C = merged_df.groupby('ArticleId')['url'].apply(list).reset_index()
dftemp_C.rename(columns={"url": "tags"}, inplace=True)

# Save the grouped dataframe to the alltags.csv file
dftemp_C.to_csv('C:/datasources/alltags.csv', index=False, encoding='utf-8')

#transform dataframe to dictionary
v = dftemp_C.groupby('ArticleId')['tags'].apply(lambda t: list(t)).to_dict()

#read in raw articles table data
dftemp_DB = pd.read_csv("C:\\datasources\\articlesScrubbed.csv", sep='|', usecols=['ArticleId','Title','PublicationDate','Publication','Links','Description','Priority','url','url2'])

#make some unique wtk urls 
makeurl = dftemp_DB['url'].astype(str)
dftemp_DB['wtkURL'] = "https://www.wanttoknow.info/a-" + makeurl
  
#make tags column and populate with dictionary v
dftemp_DB['tags'] = dftemp_DB['ArticleId'].map(v)  

#check it with something like dftemp_DB.head(12)

#remove rows without valid Title or tags values
dftemp_DB = dftemp_DB[dftemp_DB['Title'].notna()]
dftemp_DB = dftemp_DB[dftemp_DB['tags'].notna()]

#check it with something like dftemp_DB.tail(12)

#send to file
dftemp_DB.to_csv("C:\\datasources\\WTKpreprocessed.csv", sep='|', index=False, encoding='utf-8')

This code takes three input files and combines them into one big pandas dataframe, with each article's list of category tags in a single column. It also removes all rows without titles or tags, which begins preparing the data for more rigorous scrubbing. But before I get there, I'd like to simplify the publication data in the frame.

Mediacondense

Every news article in our database has a publisher, and there are over 12k article summaries in our collection. Sometimes the same or similar news source is listed in more than one way. That's fine on our website, but it doesn't work for the purposes of my project. "ABC," "ABC News," "ABC News Australia," and so on may technically be different, but for my purposes they're the same source. So I had to figure out a way to make them identical in the data.

What I ended up doing was creating a dictionary that maps dirty sources to their simplified clean counterparts. The dictionary doesn't cover all edge cases, but it covers most of them. I wrote it 5 years ago and it still works perfectly. WantToKnow.info probably isn't the only website out there with this issue. Please feel free to use my script in your own projects.

#the 'mediacondense' list of dictionaries is built for media sources that appear under multiple variations

mediacondense = []

abc = ['ABC News Australia', 'ABC News blog', 'ABC6', 'abcnews.com', 'WCPO - Cincinnatis ABC Affiliate', 'ABC News', 'ABC News blogs', 'ABC News Blog', 'ABC News Good Morning America', 'ABC New', 'ABC News Nightline', 'ABC15', 'ABC News 20', 'abc4.com', 'ABC Action News', 'ABCs Arizona Affiliate','WXYZ - Detroits ABC News Affiliate']
mcat1 = "ABC"
m1 = dict.fromkeys(abc, mcat1)
mediacondense.append(m1)

ode = ['Ode Magazine, June 2005 Issue', 'Ode Magazine, July 2005 Issue', 'Ode magazine']
mcat2 = "Ode Magazine"
m2 = dict.fromkeys(ode, mcat2)
mediacondense.append(m2)

nbc = ['NBC Milwaukee Affiliate', 'NBC Chicago', 'NBC Miami', 'NBC Washington', 'NBC Los Angeles', 'NBC Oklahoma City', 'NBC News']
mcat3 = "NBC"
m3 = dict.fromkeys(nbc, mcat3)
mediacondense.append(m3)

vfr = ['Vanity Fair August 2006 Issue', 'Vanity Fair September 2005 Issue', 'Vanity Fair magazine']
mcat4 = "Vanity Fair"
m4 = dict.fromkeys(vfr, mcat4)
mediacondense.append(m4)

nyt = ['The New York Times', 'New York Times Blog', 'New York Times blog']
mcat5 = "New York Times"
m5 = dict.fromkeys(nyt, mcat5)
mediacondense.append(m5)

unw = ['U.S. News and World Report', 'U.S. News & World Report', 'U.S. News & World Report blog', 'US News & World Report magazine', 'US News & World Report']
mcat6 = "US News and World Report"
m6 = dict.fromkeys(unw, mcat6)
mediacondense.append(m6)

nwk = ['Newsweek magazine', 'Newsweek blog', 'Newsweek Magazine', 'Newsweek magazine blog']
mcat7 = "Newsweek"
m7 = dict.fromkeys(nwk, mcat7)
mediacondense.append(m7)

bbc = ['BBC Radio', 'BBC News blog', 'BBC Blogs', 'BBC News']
mcat8 = "BBC"
m8 = dict.fromkeys(bbc, mcat8)
mediacondense.append(m8)

fop = ['Foreign Policy Magazine May', 'Foreign Policy Journal']
mcat9 = "Foreign Policy"
m9 = dict.fromkeys(fop, mcat9)
mediacondense.append(m9)

wap = ['Washington Post blog', 'washingtonpost.com', 'Washingon Post', 'Washginton Post', 'The Washington Post']
mcat10 = "Washington Post"
m10 = dict.fromkeys(wap, mcat10)
mediacondense.append(m10)

tlg = ['The Telegraph blogs', 'Daily Telegraph', 'Telegraph']
mcat11 = "The Telegraph"
m11 = dict.fromkeys(tlg, mcat11)
mediacondense.append(m11)

nsa = ['U.S. National Security Agency Website', 'National Security Agency  Website', 'NSA Technical Journal, Vol. XI', 'National Security Agency  Website, NSA Technical Journal, Vol. XI']
mcat12 = "NSA Website"
m12 = dict.fromkeys(nsa, mcat12)
mediacondense.append(m12)

msn = ['MSN Money', 'MSN of Australia', 'MSN Canada', 'MSN']
mcat13 = "MSN News"
m13 = dict.fromkeys(msn, mcat13)
mediacondense.append(m13)

tim = ['Time magazine', 'Time Magazine', 'Time Magazine blog']
mcat14 = "Time"
m14 = dict.fromkeys(tim, mcat14)
mediacondense.append(m14)

psc = ['Popular Science - March 2007 Issue', 'Popular Science Magazine', 'Popular Science magazine']
mcat15 = "Popular Science"
m15 = dict.fromkeys(psc, mcat15)
mediacondense.append(m15)

cnn = ['CNN blog', 'CNN Money', 'CNN International', 'CNN The Situation Room', 'CNN Lou Dobbs Tonight', 'CNN Video Clip', 'CNN Larry King Live', 'CNN News']
mcat16 = "CNN"
m16 = dict.fromkeys(cnn, mcat16)
mediacondense.append(m16)

cbs = ['CBS Las Vegas Affiliate', 'CBS Philly', 'CBS News', 'KCBS', 'CBS Atlanta', 'CBS Affiliate KUTV', 'CBS News Chicago, Associated Press', 'CBS News, Sacramento Affiliate', 'WCBS News - New York CBS Affiliate', 'CBS News 60 Minutes', 'CBS News 60 Minutes Overtime', 'CBS Los Angeles', 'CBS 60 Minutes', 'CBS News blog', 'CBS News, Stockton Affiliate']
mcat17 = "CBS"
m17 = dict.fromkeys(cbs, mcat17)
mediacondense.append(m17)

yho = ['Yahoo! News', 'Yahoo!', 'Yahoo! Finance', 'Yahoo! News Australia', 'Yahoo News', 'Yahoo Finance']
mcat18 = "Yahoo"
m18 = dict.fromkeys(yho, mcat18)
mediacondense.append(m18)

wsj = ['The Wall Street Journal', 'Wall Street Journal blog', 'Full Page Ad in Wall Street Journal', 'Wall Street Journal Article by Former FBI Director Louis Freeh', 'Wall Street Journal Blog']
mcat19 = "Wall Street Journal"
m19 = dict.fromkeys(wsj, mcat19)
mediacondense.append(m19)

fox = ['Fox News Chicago', 'Fox News video clip', 'FOX News', 'Fox 19', 'Fox News Affiliate']
mcat20 = "FOX"
m20 = dict.fromkeys(fox, mcat20)
mediacondense.append(m20)

icp = ['The Intercept With Glenn Greenwald', 'The Intercept with Glenn Greenwald']
mcat21 = "The Intercept"
m21 = dict.fromkeys(icp, mcat21)
mediacondense.append(m21)

lat = ['Los Angeles Times blog', 'The Los Angeles Times', 'LA Times']
mcat22 = "Los Angeles Times"
m22 = dict.fromkeys(lat, mcat22)
mediacondense.append(m22)

pbs = ['PBS Nova Program', 'PBS Frontline', 'PBS, CBS, Fox compilation', 'PBS News', 'PBS Bill Moyers Journal', 'PBS Newshour', 'PBS Blog']
mcat23 = "PBS"
m23 = dict.fromkeys(pbs, mcat23)
mediacondense.append(m23)

ecn = ['The Economist blog', 'The Economist Magazine', 'The Economist magazine']
mcat24 = "The Economist"
m24 = dict.fromkeys(ecn, mcat24)
mediacondense.append(m24)

npr = ['NPR All Things Considered', 'National Public Radio', 'NPR News', 'NPR blog', 'NPR Blog']
mcat25 = "NPR"
m25 = dict.fromkeys(npr, mcat25)
mediacondense.append(m25)

sfc = ['The San Francisco Chronicle', 'San Francisco Chronicle SFs leading newspaper)']
mcat26 = "San Francisco Chronicle"
m26 = dict.fromkeys(sfc, mcat26)
mediacondense.append(m26)

cbc = ['Canadian Broadcasting Corporation', 'CBC News', 'CBC News [Canadas Public Broadcasting System]']
mcat27 = "CBC"
m27 = dict.fromkeys(cbc, mcat27)
mediacondense.append(m27)
    
frb = ['Forbes Magazine', 'Forbes blog', 'Forbes magazine', 'Forbes.com', 'Forbes.com blog', 'Forbes India Magazine']
mcat28 = "Forbes"
m28 = dict.fromkeys(frb, mcat28)
mediacondense.append(m28)
    
rst = ['Rolling Stone blog', 'Rolling Stone magazine']
mcat29 = "Rolling Stone"
m29 = dict.fromkeys(rst, mcat29)
mediacondense.append(m29)
    
grd = ['A Guardian blog', 'The Guardian blog', 'Guardian']
mcat30 = "The Guardian"
m30 = dict.fromkeys(grd, mcat30)
mediacondense.append(m30)
    
ngc = ['NationalGeographic.com', 'National Geographic October 2004 Issue', 'National Geographic News', 'NationalGeographic.com blog']
mcat31 = "National Geographic"
m31 = dict.fromkeys(ngc, mcat31)
mediacondense.append(m31)
    
mbc = ['MSNBC News', 'MSNBC Today', 'MSNBC: Keith Olbermann blog', 'MSNBC The Rachel Maddow Show']
mcat32 = "MSNBC"
m32 = dict.fromkeys(mbc, mcat32)
mediacondense.append(m32)
    
rut = ['Reuters News Agency', 'Reuters News', 'Reuters Health', 'Reuters blog']
mcat33 = "Reuters"
m33 = dict.fromkeys(rut, mcat33)
mediacondense.append(m33)

blb = ['Bloomberg News Service', 'Businessweek', 'BusinessWeek', 'Bloomberg Businessweek', 'Bloomberg News']
mcat34 = "Bloomberg"
m34 = dict.fromkeys(blb, mcat34)
mediacondense.append(m34)

#the list of dictionaries is then turned into a new little dataframe

key = []
val = []
for i in mediacondense:
    for k,v in i.items():
        key.append(k)
        val.append(v)

mediakeys = pd.DataFrame({'asis': key,
                          'clean': val})

#index by the as-is name so the clean replacement values can be mapped onto the main dataframe later
mediakeys.set_index('asis', inplace=True)

mediakeys.to_csv("C:\\datasources\\mediakeys.csv", sep='|', index=True, encoding='utf-8')
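
The script above only builds and saves the lookup table; applying it to the main dataframe could look something like this sketch, assuming the Publication column from the preprocessing script holds the as-is source names:

#sketch: condense publication names in the main dataframe
clean_map = mediakeys['clean'].to_dict()  # {'ABC News Australia': 'ABC', ...}

#replace any publication name found in the map; everything else is left untouched
dftemp_DB['Publication'] = dftemp_DB['Publication'].replace(clean_map)

#quick check: how many distinct publications remain after condensing
print(dftemp_DB['Publication'].nunique())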

Next Steps

With my data preprocessed, the next step is careful scrubbing. Article summaries and their notes have to be separated. Links have to be extracted and handled. Dates have to be formatted. A bunch of small things have to be find/replaced. After that, the real fun will begin.
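
To give a sense of what that scrubbing might involve, here's a rough sketch of two of those steps: normalizing the publication dates and pulling URLs out of the Links column. The real data's date format and link markup may differ, so treat this as an illustration rather than the finished script:

import re
import pandas as pd

df = pd.read_csv("C:/datasources/WTKpreprocessed.csv", sep='|')

#normalize publication dates; values pandas can't parse become NaT
df['PublicationDate'] = pd.to_datetime(df['PublicationDate'], errors='coerce')

#pull raw URLs out of the Links column with a simple regex
url_pattern = re.compile(r'https?://\S+')
df['extracted_links'] = df['Links'].fillna('').apply(url_pattern.findall)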

On our website, articles can already be sorted by upload date, publication date, category, and point-system ranking. The database also has a publisher field that could theoretically be used to list and sort articles (with help from my mediacondense dictionary).

Every article has its own unique page, so we've got a ready-made place to drop a recommendation widget. The simplest version of this widget could just contain the most recent articles with the highest number of matching categories. But I think it would be interesting to try recommending based on more complex pattern matching.
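
As a baseline, that simple version could be sketched in a few lines: count how many category tags every other article shares with the current one, then sort by overlap and recency. This assumes the tags column holds flat lists of tag strings and PublicationDate has already been converted to datetime:

def recommend_by_tags(df, article_id, n=5):
    #tags of the article being viewed
    target_tags = set(df.loc[df['ArticleId'] == article_id, 'tags'].iloc[0])
    others = df[df['ArticleId'] != article_id].copy()
    #number of shared category tags with the target article
    others['overlap'] = others['tags'].apply(lambda t: len(target_tags & set(t)))
    others = others[others['overlap'] > 0]
    #most shared tags first, then most recent
    return others.sort_values(['overlap', 'PublicationDate'], ascending=[False, False]).head(n)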

Toward that end, I'm planning to use a couple of basic natural language processing techniques on the article summary text. Instead of counting category matches, I want to experiment with sorting the returned lists of articles by the number of trigram matches. I may also play around with formulas that combine point-system ranking with publication date to return lists of articles that are both recent and highly rated. A first sketch of the trigram approach is below. I'll keep you posted on my progress.
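
A minimal version of the trigram matching might look like this: split each summary into word trigrams and count how many trigrams two articles share. I'm assuming here that the summary text lives in the Description column, and plain Python is enough for a first pass before reaching for an NLP library:

import re

def word_trigrams(text):
    #lowercase, strip punctuation, and return the set of word trigrams
    words = re.findall(r'[a-z]+', str(text).lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def trigram_overlap(text_a, text_b):
    #number of word trigrams two summaries have in common
    return len(word_trigrams(text_a) & word_trigrams(text_b))

#e.g. score every article against a chosen summary held in target_description
#df['trigram_score'] = df['Description'].apply(lambda d: trigram_overlap(target_description, d))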


Read my novels:

See my NFTs:

  • Small Gods of Time Travel is a 41 piece Tezos NFT collection on Objkt that goes with my book by the same name.
  • History and the Machine is a 20 piece Tezos NFT collection on Objkt based on my series of oil paintings of interesting people from history.
  • Artifacts of Mind Control is a 15 piece Tezos NFT collection on Objkt based on declassified CIA documents from the MKULTRA program.