A Pyscript App Backed with a Pandas Dataframe

in #programming • 11 months ago

[Header image: made with stable diffusion using the prompt 'python with code in the background.']

For fun on my own time, I've been playing around with a copy of the WantToKnow.info news archive, exploring relationships between articles based on article text, documenting my progress as I go.

My first post on the project described my initial encounter with the data. My second post described the trials and tribulations of preprocessing. My third post showed how I finally got all of the data into a pandas dataframe. My fourth post showed how I scrubbed and standardized the data to make deeper analysis possible. My fifth post described how I used Scikit Learn to turn article text into TF-IDF (Term Frequency - Inverse Document Frequency) vectors, which made it possible to compute the cosine similarity for all of the article vectors and generate a list of the ten closest matches for a given article.
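In case it's useful to anyone, here's a rough sketch of that step (not the exact notebook code, which differs in the details; it assumes df is the cleaned dataframe and borrows the 'Description' column name from the app code below):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Turn each article summary into a TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df['Description'])

# Cosine similarity between every pair of article vectors.
similarity = cosine_similarity(tfidf_matrix)

# The ten closest matches for article i, excluding the article itself.
def ten_closest(i):
    ranked = np.argsort(similarity[i])[::-1]
    return [j for j in ranked if j != i][:10]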

Enter Pyscript

Having done all of that, I had to figure out a way to share the article recommendations with my colleagues. So I made a Pyscript app! Pyscript is a relatively new Anaconda project that makes it possible to use python and many of its scientific computing packages right in the html of a webpage. Because the project is new, it's changing fast, and there aren't yet many good resources available for learning how it all works.

For my project, I kept things as simple as I possibly could, using Pyscript's built-in REPL as the user interface instead of standard HTML form inputs. With this REPL, users can pick a target article and generate/print a list of the 10 most closely related articles by amending and executing the code that prepopulates the REPL. Users can also write and execute their own python queries against the dataset.

The app I made is very simple. It's a one-pager that I've pasted in its entirety below. But getting it to this point was a bumpy road. It took me forever to get my configuration right. Eventually Stack Overflow came to the rescue. It turned out that I was trying to install packages that are part of Python's standard library and already available by default, namely re and ast.
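For anyone hitting the same wall: since re and ast ship with the standard library and come with Pyodide out of the box, the only thing the config actually needed to declare was pandas. Something like this (a trimmed version of the config in the full source below):

<py-config>
    # re and ast are standard-library modules; listing them here breaks the install
    packages = [
    "pandas"
    ]
</py-config>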

Once my config was working, I still had no luck accessing my CSV data file. It's a 30 MB file, which is too big for GitHub to host. I put it on Google Drive, but requesting the file from there ended in network errors. Then I put it on my server space with 1&1 and encountered similar errors. Ultimately I put the file on IPFS and requested it with Pyodide's open_url. Then, like magic, I had an app backed with a pandas dataframe.

Most of the action happens in a single function that I'd prototyped in a Jupyter notebook. That code uses iteritems, which has apparently been deprecated, according to warning text that appeared alongside the desired output when I first hooked everything up. Getting rid of those warnings was easy enough: just a single command. After that, I changed some print statements to Pyscript display statements, hid the default terminal, added some text, and applied some styling.
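The single command in question was the warnings.filterwarnings call near the top of the source below. The more future-proof fix, for what it's worth, would be swapping iteritems for its replacement, since Series.iteritems was deprecated in pandas 1.5 and removed in 2.0:

# instead of suppressing the warning, use items(), which behaves identically
for column, value in row.items():
    display(f"{column}: {value}", target="mainstory")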

The result was a minimally sufficient demonstration of news article recommendations based on TF-IDF vector cosine similarity. You can check it out here.

<!DOCTYPE html>
<html lang="en">
<head>
    <title>WantToKnow Archive Interface</title>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <link rel="stylesheet" href="https://pyscript.net/releases/2023.05.1/pyscript.css" />
    <script defer src="https://pyscript.net/releases/2023.05.1/pyscript.js"></script>
    
<style>
    body {
        margin-left: 20%;
        margin-right: 20%;
    }

    #mainstory {
        color: white;
        background-color: black;
        padding: 10px;
    }
</style>
</head>
<body>
    <py-config>
        packages = [
        "pandas"
        ]

        terminal = false
    </py-config>    
    <script type="pyscript">
        import pandas as pd
        import ast
        import re
        from pyodide.http import open_url
        import warnings
        # silence the pandas iteritems deprecation warnings described above
        warnings.filterwarnings("ignore")
        
        url = 'URL of my .csv file on IPFS'
        df = pd.read_csv(open_url(url), sep='|')

        # collapse runs of whitespace into a single space
        df['Description'] = df['Description'].apply(lambda x: re.sub(r'\s+', ' ', str(x)))
        # replace doubled double-quotes with single ones
        df['Description'] = df['Description'].apply(lambda r: r.replace('""', '"'))
        # strip the inline paragraph styling
        df['Description'] = df['Description'].apply(lambda r: r.replace('<p style="text-align: justify;font-size: 11pt; font-family: Arial;margin: 0 0 11pt 0">', '<p>'))
        df['Description'] = df['Description'].apply(lambda r: r.replace('<p style="text-align: justify;font-size: 11pt; font-family: Arial;margin: 0 0 10pt 0">', '<p>'))
        # remove the remaining paragraph tags (the closing tags carry an escaped slash in the data)
        df['Description'] = df['Description'].apply(lambda r: r.replace('<p>', ''))
        df['Description'] = df['Description'].apply(lambda r: r.replace('<\\/p>', ''))

        def get_related_articles(input_value, input_type):
            # Parse the stringified lists of related ids into lists of ints.
            # literal_eval only accepts strings, so this works once per page load,
            # which is one reason the page must be refreshed between searches.
            df['Related'] = df['Related'].apply(lambda x: [int(i) for i in ast.literal_eval(x) if isinstance(i, int)])

            if input_type == "Title":
                filtered_df = df.loc[df['Title'] == input_value]
            elif input_type == "ArticleId":
                filtered_df = df.loc[df['ArticleId'] == input_value]
            else:
                raise ValueError("Invalid input_type. Please choose either 'Title' or 'ArticleId'.")

            if filtered_df.empty:
                return pd.DataFrame()  # Return an empty DataFrame if no rows match the input value

            related_ids = filtered_df['Related'].values[0]
            dfselected = df[df['ArticleId'].isin(related_ids)]
            # Display every field of the target article
            for _, row in filtered_df.iterrows():
                for column, value in row.iteritems():  # deprecated in pandas 1.5; items() is the replacement
                    display(f"{column}: {value}", target="mainstory")

            # Display every field of the ten related articles
            for _, row in dfselected.iterrows():
                for column, value in row.iteritems():
                    display(f"{column}: {value}", target="relatedstories")
            return dfselected
    </script>
    <h1>WantToKnow.info Archive Interface</h1>
    <p>Find news article recommendations based on term frequency-inverse document frequency (TF-IDF) vector cosine similarities. A search for a given article summary returns the target summary and the 10 most closely related summaries.</p>
    <p><strong>Instructions:</strong> there are two valid input types, 'Title' and 'ArticleId'. input_type must be in quotes. An exact article title must be entered in quotes; an exact article id must be entered as a number without quotes. Once your query is correctly entered in the text box below, press the green triangle button that appears when you hover over the bottom right corner of the text box. Press this button only once and wait for the data to be crunched. If nothing happens, the query was incorrectly formed or the target article didn't make it into the dataset.</p>
    <p><strong>Notes:</strong> The page must be refreshed between searches. In addition to generating recommendations, the text area also accepts arbitrary Python queries of the Pandas dataframe 'df'.</p>
    <py-repl>
        input_value = 'Ex-Air Force Personnel: UFOs Deactivated Nukes'
        input_type = 'Title'
        dfselected = get_related_articles(input_value, input_type)
    </py-repl>
    <div id="mainstory"></div>
    <div id="relatedstories"></div>
</body>
</html>

Read my novels:

See my NFTs:

  • Small Gods of Time Travel is a 41 piece Tezos NFT collection on Objkt that goes with my book by the same name.
  • History and the Machine is a 20 piece Tezos NFT collection on Objkt based on my series of oil paintings of interesting people from history.
  • Artifacts of Mind Control is a 15 piece Tezos NFT collection on Objkt based on declassified CIA documents from the MKULTRA program.

Congratulations on getting your project to a functional state! It was interesting to watch your (at times complicated) process over the past few weeks. You didn't give up, even when you were on your own dealing with things within a rapidly changing and evolving data science world. Now you have more experience and knowledge in your toolbelt it seems!

Thanks! It was a challenging project, but I feel good about how it came out. The app probably won't interest the general public at all, but it's a fun toy for our team to play with :)

Thanks for your contribution to the STEMsocial community. Feel free to join us on discord to get to know the rest of us!

Please consider delegating to the @stemsocial account (85% of the curation rewards are returned).

You may also include @stemsocial as a beneficiary of the rewards of this post to receive stronger support.