Data Science Madness 4: Scrubbing and Sorting

in #programming • last year


The header image for this post was made with Stable Diffusion using the prompt 'sorting colored boxes into rows.'

I want to start playing around with news article recommendation algorithms. I'm also interested in using machine intelligence to identify and map out connections between news reports to see what larger stories emerge. This is related to my work for WantToKnow.info, but I'm just doing it for fun on my own time.

My first post on the project described my initial encounter with the data. My second post described the trials and tribulations of preprocessing. My third post showed how I finally got all of the data into a pandas dataframe. Here, you'll find the code I used to scrub and standardize the data to make deeper analysis possible.

The code below does many things. It expands the category tag lists into new columns, separates article summaries from notes, and extracts links from both the summaries and the notes. It also simplifies publication data using both one-off anonymous functions and the mediacondense dictionary I built in the previous step. Here it is:

from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np
import os
import re

#read in preprocessed file
dfP = pd.read_csv("C:\\datasources\\WTKpreprocessed.csv", sep='|', encoding='utf-8')

#split raw story into summary and note. 
dfP[['Summary', 'Note']] = dfP['Description'].str.split('Note:', n=1, expand=True)

#create unique tag columns to make values quickly mappable for graph generation
dfP['tags'] = dfP['tags'].astype(str).apply(lambda v: v.replace('\'', ''))
dfP['tags'] = dfP['tags'].astype(str).apply(lambda v: v.replace('[[', '['))
dfP['tags'] = dfP['tags'].astype(str).apply(lambda v: v.replace(']]', ']'))
dfP.tags = dfP.tags.str[1:-1].str.split(',').tolist()
#expand each tag list into its own set of numbered columns (tag_1, tag_2, ...)
form = lambda f: 'tag_{}'.format(f + 1)
dfT = pd.DataFrame(
    dfP.tags.values.tolist(),
    index=dfP.index, dtype=object
).fillna('').rename(columns=form)

df = pd.concat([dfP, dfT], axis=1)

#add tagcount column for convenient advanced indexing
df['tagcount'] = df.tags.apply(lambda l: len(l))

#initialize some lists for data extraction and cleaning
Quotes = []
Summaries = []
Note_text = []
Adtl_ref_links = []
Note_ref_links = []
Media_sources = []

#set variables up for parsing markup
stories = df['Summary'].astype(str).values.tolist()
notes = df['Note'].astype(str).values.tolist()
medias = df['Publication'].astype(str).values.tolist()

#some loops to extract desired text and links
for i in stories:
    c = bs(i,'lxml')
    try:
        quote = c.strong.text
    except (AttributeError):
        quote = c.text.split('. ')[0]
    summary = c.text
    ref_links = []
    for n in c.find_all('a'):
        xlinks = n.get('href')
        ref_links.append(xlinks)
    Quotes.append(quote)
    Summaries.append(summary)
    Adtl_ref_links.append(ref_links)

#switch from lxml to html.parser for the notes, since different parsers handle different markup quirks better
for i in notes:
    c = bs(i,'html.parser')
    note = c.text    
    note_links = []
    for n in c.find_all('a'):
        nlinks = n.get('href')
        note_links.append(nlinks)
    Note_text.append(note)
    Note_ref_links.append(note_links)

#initial cleanup of pub. a bs4 UserWarning may show up because some media sources are urls. no biggie    
for i in medias:
    c = bs(i,'html.parser')
    media = c.text
    Media_sources.append(media)
    
#make new dataframe from extracted data    
df3 = pd.DataFrame({'quote': Quotes,
                  'description': Summaries,
                  'note': Note_text,
                  'pub': Media_sources,   
                  'Summary_ref_links': Adtl_ref_links,
                  'Note_links': Note_ref_links})    

#join new dataframe to big raw dataframe
dfR = pd.concat([df, df3], axis=1)

dfR.drop(['Publication', 'Summary', 'url', 'url2'], axis=1, inplace=True)

#begin cleanup of media sources
dfR[['pub', 'pub2']] = dfR['pub'].str.split('/', n=1, expand=True)
dfR[['pub', 'pubdetail']] = dfR['pub'].str.split('(', n=1, expand=True)
dfR['pubdetail'] = dfR['pubdetail'].astype(str).apply(lambda u: u.strip(')'))
dfR['pub'] = dfR['pub'].str.strip()
dfR['pub'] = dfR['pub'].astype(str).apply(lambda u: u.strip('"'))
dfR['pub'] = dfR['pub'].astype(str).apply(lambda r: r.replace('\'',''))

#use dfR.pub.unique() with pd.set_option('display.max_colwidth', 1000) to view raw media sources list
#dfR['pub'].value_counts()

#combine variations of media source into single item

#one-off translations are easy, but should eventually be moved to mediacondense list
dfR['pub'] = dfR['pub'].astype(str).apply(lambda r: r.replace('Seattle times', 'Seattle Times'))
dfR['pub'] = dfR['pub'].astype(str).apply(lambda r: r.replace('Scientific American Blog', 'Scientific American'))
dfR['pub'] = dfR['pub'].astype(str).apply(lambda r: r.replace('The Sacramento Bee', 'Sacramento Bee'))
dfR['pub'] = dfR['pub'].astype(str).apply(lambda r: r.replace('Mother Jones Magazine', 'Mother Jones'))
dfR['pub'] = dfR['pub'].astype(str).apply(lambda r: r.replace('Sydney Mountain Herald', 'Sydney Morning Herald'))
dfR['pub'] = dfR['pub'].astype(str).apply(lambda r: r.replace('The Nation magazine', 'The Nation'))
dfR['pub'] = dfR['pub'].astype(str).apply(lambda r: r.replace('Chicago Sun-Times News Group', 'Chicago Sun-Times'))
dfR['pub'] = dfR['pub'].astype(str).apply(lambda r: r.replace('Wired magazine', 'Wired'))
dfR['pub'] = dfR['pub'].astype(str).apply(lambda r: r.replace('The New Yorker magazine', 'The New Yorker'))
dfR['pub'] = dfR['pub'].astype(str).apply(lambda r: r.replace('Fortune magazine', 'Fortune'))
dfR['pub'] = dfR['pub'].astype(str).apply(lambda r: r.replace('The Atlantic Monthly', 'The Atlantic'))
dfR['pub'] = dfR['pub'].astype(str).apply(lambda r: r.replace('Tikkun Magazine - March', 'Tikkun Magazine'))
dfR['pub'] = dfR['pub'].astype(str).apply(lambda r: r.replace('The Daily Mail', 'Daily Mail'))

#dfR.pub.head(3)

#read in preprocessed mediacondense file
mkeys = pd.read_csv("C:\\datasources\\mediakeys.csv", sep='|', usecols=['asis','clean'], encoding='utf-8')

mkeys.set_index('asis', inplace=True)

dx = pd.Series(mkeys.clean.values,index=mkeys.index).to_dict()

dfR['cpub'] = dfR['pub'].map(dx).fillna(dfR['pub'])

dfR.head(3)

#save full raw dataframe
dfR.to_csv("C:\\datasources\\WTKfullraw.csv", sep='|', index=False, encoding='utf-8')

Next Steps

Now that the data has been scrubbed and standardized, it's time to take a step back and think deeply about the algos I want to create. I have a pretty good idea of what I want to do, but I'm feeling like I should be thinking more about the links contained in our summaries and notes. The note links in particular could in many cases already recommend related stories in our archive, but the use of such links is by no means standardized.
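
To make that idea a bit more concrete, here's a minimal sketch of how the Note_links extracted above could be cross-referenced against the url column that came in with the preprocessed file, to flag notes that already point at stories we have. I'm assuming url holds one link per story and that a crude normalize helper is enough to make the links comparable; the helper and the related_by_note column are placeholders of my own, so treat this as a starting point rather than a finished recommender:

#minimal sketch: match note links back to the article urls in the archive
#df still carries the 'url' column here, since only dfR dropped it

def normalize(link):
    #placeholder normalization: lowercase, drop the protocol and any trailing slash
    if not isinstance(link, str):
        return ''
    return link.lower().replace('https://', '').replace('http://', '').rstrip('/')

#map each normalized archive url to its row index
url_to_index = {normalize(u): i for i, u in df['url'].astype(str).items()}

def related_from_notes(note_links):
    #return the row indices of archive stories that a note links to directly
    matches = []
    for link in (note_links if isinstance(note_links, list) else []):
        hit = url_to_index.get(normalize(link))
        if hit is not None:
            matches.append(hit)
    return matches

dfR['related_by_note'] = dfR['Note_links'].apply(related_from_notes)

Even something this crude would show how often our notes already point at other stories in the archive, which should help decide how much weight links deserve in the eventual algorithm.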

I also need to think more about the end use. When I started this project, I was sort of picturing just translating any algo I come up with into JavaScript, but something tells me it won't be that easy, especially if I end up doing more advanced natural language processing. So there's a lot to think about. In the meantime, I'll probably get back to my usual blogging subjects soon.


See my NFTs:

  • Small Gods of Time Travel is a 41 piece Tezos NFT collection on Objkt that goes with my book by the same name.
  • History and the Machine is a 20 piece Tezos NFT collection on Objkt based on my series of oil paintings of interesting people from history.
  • Artifacts of Mind Control is a 15 piece Tezos NFT collection on Objkt based on declassified CIA documents from the MKULTRA program.

Sounds like you've made some excellent progress! Interesting thought about the links in the notes and summaries. Are you thinking you can create code that would even scan the links in the notes and summaries themselves?

On another note, I was thinking about how the HAARP and UFO Cover-Up pages are some of the top visited, and we could even have recommended/related news articles on those pages.

I'm totally impressed by you. We're so blessed to have you as part of our PEERS team!

For sure! Scanning links is uncomplicated; the hard part is creating procedures to do something meaningful with the information. I can also extract the bolded text, which might somehow be useful.

I'll think about the HAARP and UFO Cover-Up pages. Super excited to see them with the new look.