Data Science Madness 2: Preprocessing

in #programming · last year


The above image was made with Stable Diffusion using the prompt 'python in a forest of code.'

I want to start playing around with news article recommendation algorithms. I'm also interested in using machine intelligence to identify and map out connections between news reports to see what larger stories emerge. This is related to my work for WantToKnow.info, but I'm just doing it for fun on my own time. My first post on the project describes my initial encounter with the data. Here's what happened next.

Encoding Errors

Our tech guy sent me a copy of WantToKnow.info's news database and I started preparing the files. Two of the tables I needed were in an SQL file, while the main file was a csv that arrived with tens of thousands of character encoding errors. These arose from the mismatch between the database's latin1 encoding and the utf8 encoding used nearly everywhere else. Fortunately, most of these errors involved characters that could be copy/pasted into Notepad++'s find and replace. I ran across a handy Stack Overflow answer that showed what to replace with what:

$final = str_replace("Â", "", $final);
$final = str_replace("’", "'", $final);
$final = str_replace("“", '"', $final);
$final = str_replace('–', '-', $final);
$final = str_replace('â€', '"', $final);
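
Since the root cause is utf8 bytes being decoded as latin1/cp1252, the same repairs can be done in Python rather than by hand. Here's a rough sketch of that approach (the filenames are made up, and the ftfy fallback is a third-party package, not something I actually ran):

import ftfy  # third-party: pip install ftfy

def fix_line(line: str) -> str:
    try:
        # reverse the original mistake: utf8 bytes were decoded as cp1252
        return line.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # ftfy copes with lines that are only partially mangled
        return ftfy.fix_text(line)

with open('articlesRaw.csv', encoding='utf-8') as src, \
     open('articlesFixed.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(fix_line(line))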

Beyond these and other garbage characters, the document contained thousands of little black boxes labeled 'OSC', scattered seemingly at random. These are the U+009D 'operating system command' control character, most likely a leftover of the same encoding mismatch. The boxes couldn't be copied, so I had to hunt down a regex pattern to use in the find/replace menu. This page told me to search for "\x{009D}" in regex mode. I tried it and it worked, replacing over 2,000 of the unwanted characters. Overall, there were about 40k characters to fix or remove.
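
For anyone who'd rather skip the editor gymnastics, the same removal takes a couple of lines of Python (a sketch; the filename is hypothetical):

import re

with open('articles.csv', encoding='utf-8') as f:
    text = f.read()

print(text.count('\u009d'))          # how many 'OSC' boxes are present
text = re.sub('\u009d', '', text)    # same effect as Notepad++'s \x{009D} search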

All of this find/replace may have damaged some of the news stories in the file. Once I get everything into pandas I'll have a better sense of any problems. That's also the point where I'll do a bunch of find/replace in Python to correct double quotes and other issues, along the lines sketched below.
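
To give a flavor of that future cleanup pass, here's how it might look in pandas (a sketch only; the replacement map is hypothetical, and the real one will come from inspecting the data):

import pandas as pd

df = pd.read_csv('C:/datasources/articlesScrubbed.csv', sep='|')

# hypothetical replacement map, to be filled in after inspecting the data
fixes = {'""': '"', '\u2019': "'", '\u201c': '"', '\u201d': '"'}

for col in ('Title', 'Description'):
    for bad, good in fixes.items():
        df[col] = df[col].str.replace(bad, good, regex=False)

Before getting to any of that, however, I had to create a couple of new files.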

Combining Files

The articles database I'm working with is stored in multiple tables: a big table with all of the articles, a table with all of the article categories, and a junction table connecting articles to their categories. I turned the latter two tables into csv files suitable for loading into a pandas dataframe. Manually creating csv files from SQL tables was a little tedious, but a simple series of find/replace operations got them formatted correctly.
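
In rough outline, the conversion amounted to something like this (a sketch only: it assumes simple two-column INSERT value tuples, and the file and column names are illustrative rather than the real schema):

import csv
import re

with open('dump.sql', encoding='utf-8') as f:
    # assumes simple (id,'url') value tuples; messier dumps need a real parser
    rows = re.findall(r"\((\d+)\s*,\s*'([^']*)'\)", f.read())

with open('Category.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter='|')
    writer.writerow(['cid', 'url'])
    writer.writerows(rows)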

Once I had all of the files, I put them all together with the following script:

import pandas as pd
import numpy as np
import os
import re

#initial data obtained from dump.sql, syntax converted to sqlite for easy .sqlite export to .csv
#read in category tables data 
dftemp_A = pd.read_csv("C:/datasources/CategoryArticle.csv", sep='|')
dftemp_B = pd.read_csv("C:/datasources/Category.csv", sep='|', usecols=['cid','url'])

#combine key tables
#  d = pd.Series(dftemp_B.url.values,index=dftemp_B.cid).to_dict()
d = pd.Series(dftemp_B.url.values.astype(str),index=dftemp_B.cid).to_dict()
dftemp_A['tag'] = dftemp_A['cid'].map(d)

#create multiindex hierarchy
dftemp_A.set_index(['ArticleId','cid'], inplace=True)
dftemp_A.sort_index(inplace=True)

#transform dataframe to dictionary where an ArticleId input returns correct tags list
v = dftemp_A.groupby('ArticleId')['tag'].apply(list).to_dict()

#check it by inserting ArticleId for XXXXX in "print(v[XXXXX])" 

#create file for building category to store edges in graph
dftemp_A.to_csv("C:\\datasources\\tagedges.csv", sep='|', index=True, encoding='utf-8')

#read in raw articles table data
dftemp_DB = pd.read_csv("C:\\datasources\\articlesScrubbed.csv", sep='|', usecols=['ArticleId','Title','PublicationDate','Publication','Links','Description','Priority','url','url2'])

#make some unique wtk urls 
makeurl = dftemp_DB['url'].astype(str)
dftemp_DB['wtkURL'] = "https://www.wanttoknow.info/a-" + makeurl
  
#make tags column and populate with dictionary v
dftemp_DB['tags'] = dftemp_DB['ArticleId'].map(v)  

#check it with something like dftemp_DB.head(12)

# dftemp_DB.to_csv("C:\\datasources\\WTKpreprocessed.csv", sep='|', index=False, encoding='utf-8')

I wrote this script five years ago for this exact purpose, and it executed without error, producing a well-formed preprocessed file. I was overjoyed, but my joy was short-lived. When I inspected the data, I found that it had been corrupted. There should have been a tags column populated by a list of tags associated with each story. Instead, the column was populated by lists of NaNs. Something must have changed in Python or pandas that prevented the string data from being preserved.

The issue arose in these lines:

d = pd.Series(dftemp_B.url.values.astype(str),index=dftemp_B.cid).to_dict()
dftemp_A['tag'] = dftemp_A['cid'].map(d)

Somehow, the data in the "url" column of "dftemp_B" seems to be losing its string dtype along the way. After spending an hour with ChatGPT trying to fix it, I still don't know exactly what the problem is or how to find a solution. If you can see what I'm doing wrong, please do let me know. Until then, I guess I'll be reading documentation.
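
One possibility I haven't ruled out is a plain dtype mismatch between the two cid columns: if pandas reads cid as int64 from one file but as object or float (which happens when NaNs are present) from the other, every map() lookup misses and returns NaN. A quick check along these lines, run before the set_index step, should confirm or eliminate that (a sketch, not a verified fix):

# do the key dtypes even agree?
print(dftemp_A['cid'].dtype, dftemp_B['cid'].dtype)

# force both sides to one type before building and applying the mapping
d = pd.Series(dftemp_B['url'].astype(str).values,
              index=dftemp_B['cid'].astype('int64')).to_dict()
dftemp_A['tag'] = dftemp_A['cid'].astype('int64').map(d)
print(dftemp_A['tag'].isna().sum())  # drops to zero if dtypes were the problem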


See my NFTs:

  • Small Gods of Time Travel is a 41 piece Tezos NFT collection on Objkt that goes with my book by the same name.
  • History and the Machine is a 20 piece Tezos NFT collection on Objkt based on my series of oil paintings of interesting people from history.
  • Artifacts of Mind Control is a 15 piece Tezos NFT collection on Objkt based on declassified CIA documents from the MKULTRA program.

Fascinating to know that there's a data science term called Pandas. This sounds complicated yet very intriguing. Maybe this can be a casual topic for tomorrow's meeting :)

I'm still cheering you on from the sidelines. Thanks for putting in all this work. I'm hoping this will be the start to an awesome system for our site :)

Thanks for the encouragement. Pandas is like the Python language version of Excel, but with vastly greater capabilities. I'm happy to mention it at the meeting, but feel like it might be a little off topic :)

Thanks for your contribution to the STEMsocial community. Feel free to join us on discord to get to know the rest of us!

Please consider delegating to the @stemsocial account (85% of the curation rewards are returned).

You may also include @stemsocial as a beneficiary of the rewards of this post to get stronger support.