Data Science Madness (Updated)



[Image made with Stable Diffusion using the prompt 'a shower of book pages.']

Today I'm going a little nuts over a stupid problem. I'm trying to convert an SQL dump file into a pipe-delimited CSV and can't seem to find a reasonable way to do it. So far, I've tried Google, Stack Exchange, ChatGPT, and all of my friends. No solutions have been forthcoming.

UPDATE: Our tech guy was able to export a pipe-separated CSV directly from the database.

Context

The file in question is WantToKnow.info's news database. Five years ago, I wrote a bunch of code for a news mapping project using this database. At that time, I was able to convert the SQL dump to SQLite using an online service, then export the SQLite database to a pipe-delimited CSV so I could play around with the data in Pandas.

Now, five years later, our database has grown to over 12k entries, and I want to tinker with news story recommendation algorithms. My thinking is that trigram vector proximity might make for better recommendations than simple keyword matching. And maybe entity recognition has improved to the point where it alone could work. Either way, I've already written most of the code necessary to start trying stuff out. But before I can even start tinkering, I need the data in a usable form.
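As a rough sketch of what I mean by trigram proximity, here's the idea in pure standard-library Python. The headlines are made up, just stand-ins for real database entries:

```python
from collections import Counter
from math import sqrt

def trigrams(text: str) -> Counter:
    """Count character trigrams in a lowercased, space-padded string."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Made-up placeholder headlines
s1 = trigrams("Declassified documents reveal surveillance program")
s2 = trigrams("New surveillance documents declassified by agency")
s3 = trigrams("Local bakery wins regional bread competition")

# The two surveillance stories share many trigrams, so they score
# closer to each other than either does to the bakery story, even
# though the word order differs completely.
print(cosine(s1, s2) > cosine(s1, s3))  # True
```

Because trigrams operate below the word level, this approach is also somewhat forgiving of the kind of character encoding noise that's scattered through my data.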

Issues

This go-around, the first issue I encountered was that there was no longer an easy way to convert the SQL dump into SQLite. My file is 30 MB, which was too big for any of the free converters I could find. The most popular conversion tool online now charges $89 for the service. There is a program called DB Browser that claims to be able to import SQL files and export CSV, but it didn't work for me.

Whenever I tried to import the database, the software failed on syntax errors it detected in the file, which came from MariaDB 5.5.68 with latin1 encoding. I'll keep looking for a program that fits my situation, but I'm not hopeful.

Python

The obvious next step is to use Python to convert the file. This would be easy if my data were simpler. Some data can simply be split into rows by parentheses and into columns by commas. Not mine. My data includes parentheses and commas within the column text, as well as character encoding errors throughout.

So I've been trying to use regular expressions to transform the SQL into CSV, and it feels like a road to madness. Here's as far as I've gotten:

import csv
import re

def convert_sql_to_csv(sql_file, csv_file):
    with open(sql_file, 'r', encoding='utf-8') as f:
        sql_data = f.read()

    # Grab every backticked identifier as a column name (naive: this
    # also catches the table name and any backticks inside the data)
    column_pattern = r"`(.*?)`"
    column_names = re.findall(column_pattern, sql_data)

    # Treat every parenthesized group as a data row (naive: breaks
    # wherever parentheses appear inside the column text)
    data_pattern = r"\((.*?)\)"
    data_rows = re.findall(data_pattern, sql_data)

    # Turn the quote-comma-quote field separators into pipes, then
    # split on the pipes (breaks when a field contains "', '" itself,
    # or when fields aren't quoted strings)
    data_rows = [row.replace("', '", "|") for row in data_rows]
    data_rows = [row.split("|") for row in data_rows]

    with open(csv_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='|')
        writer.writerow(column_names)
        writer.writerows(data_rows)

    print(f'Successfully converted {sql_file} to {csv_file}!')

This seems like the kind of problem ChatGPT could easily solve, but I went back and forth a dozen times with the AI, and no solution was forthcoming. Even so, the exchange was helpful and provided me with an easy structure to work with. Now I'm examining the data more and more closely, in hopes of identifying character patterns that could be used instead of parentheses and commas to correctly arrange the data for a CSV file.
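One character pattern I keep coming back to is the quotes themselves: commas and parentheses only cause trouble when they sit inside a quoted string. Here's a rough sketch of that idea, a splitter that walks one VALUES tuple character by character and only breaks fields on commas outside single quotes. The sample row is made up, and this ignores NULLs and other SQL details:

```python
def split_values(row: str) -> list[str]:
    """Split one SQL VALUES tuple on commas that sit outside quotes."""
    fields, current, in_quotes, i = [], [], False, 0
    while i < len(row):
        ch = row[i]
        if in_quotes:
            if ch == "\\" and i + 1 < len(row):  # backslash escape, e.g. \'
                current.append(row[i + 1])
                i += 2
                continue
            if ch == "'":
                if i + 1 < len(row) and row[i + 1] == "'":  # doubled '' escape
                    current.append("'")
                    i += 2
                    continue
                in_quotes = False  # closing quote
            else:
                current.append(ch)  # commas/parens in here are kept as text
        else:
            if ch == "'":
                in_quotes = True  # opening quote
            elif ch == ",":
                fields.append("".join(current).strip())
                current = []
            else:
                current.append(ch)
        i += 1
    fields.append("".join(current).strip())
    return fields

# A made-up row with a comma and parentheses inside the quoted text:
print(split_values("1, 'Smith, John (editor)', 'A title'"))
# ['1', 'Smith, John (editor)', 'A title']
```

Unlike a bare split on commas, the embedded comma and parentheses survive intact, which is exactly where my regex approach falls apart.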

All I want to do is get the news archive into a Pandas dataframe, and instead I'm stuck in preprocessing. My rational mind tells me to be patient until a solution presents itself. Yet in my experience, technical issues this particular don't just fix themselves. Regardless, I'll keep working on it.
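For what it's worth, once a clean pipe-delimited CSV exists, the Pandas step really is the easy part. Here it is with a tiny inline sample standing in for the real file:

```python
import io

import pandas as pd

# A made-up three-column sample standing in for the real export;
# commas and parentheses inside a field are harmless with a pipe delimiter
sample = (
    "id|title|summary\n"
    "1|First story|Text with, commas (and parens)\n"
    "2|Second story|More text\n"
)

# Reading the real file would be pd.read_csv('archive.csv', sep='|')
df = pd.read_csv(io.StringIO(sample), sep="|")
print(df.shape)  # (2, 3)
```

All the madness lives in producing that clean input, not in consuming it.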


See my NFTs:

  • Small Gods of Time Travel is a 41 piece Tezos NFT collection on Objkt that goes with my book by the same name.
  • History and the Machine is a 20 piece Tezos NFT collection on Objkt based on my series of oil paintings of interesting people from history.
  • Artifacts of Mind Control is a 15 piece Tezos NFT collection on Objkt based on declassified CIA documents from the MKULTRA program.

This sounds complicated and way above my understanding of reality. I'm cheering you on from the sidelines :)

On another note, thanks for working on this! It'll be soooo awesome to have articles and links people can explore while they traverse our site. Sounds like you're staying patient and diligent. Hope you're also having fun and remembering to hydrate and eat :)

Haha I'm definitely having fun with this. And I plan to keep posting my progress as it's made.