Vector Search App MVP

in #programming · 17 days ago


The above image was made by @amberjyang with Midjourney using the prompt 'a blue python slithering through computer coding numbers.'

Background

Last year, I wrote a news recommendation algorithm for WantToKnow.info. You can read about the project here and test out the recommendations by clicking on any article title in our archive. The recommendations are based on TF-IDF (term frequency-inverse document frequency) vector cosine similarity, which is to say, the mathematical relationships between news stories.

More recently I was inspired to expand the underlying tech to vector search. WantToKnow has good search already, but it's keyword based. My thinking is that vectorizing search queries and then comparing query vectors with news article vectors could potentially surface good stories in situations where keywords alone aren't cutting it.
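The core idea can be sketched in a few lines: fit a TF-IDF vectorizer on the article texts, project the query into the same vector space, and rank by cosine similarity. This is a slight variation on the app's approach, which concatenates the query into the corpus before fitting; the article texts here are hypothetical placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for article summaries (hypothetical placeholders)
articles = [
    "Declassified files reveal a decades-long surveillance program.",
    "New study links pesticide exposure to declining bee populations.",
    "Whistleblower describes hidden budget for intelligence projects.",
]

vectorizer = TfidfVectorizer()
article_vectors = vectorizer.fit_transform(articles)

# Transform the query with the same vocabulary, then rank by similarity
query_vector = vectorizer.transform(["secret government surveillance of citizens"])
similarities = cosine_similarity(query_vector, article_vectors).flatten()
ranked = similarities.argsort()[::-1]  # most similar article first
```

Unlike keyword search, nothing here requires an exact phrase match: any shared vocabulary between query and article contributes to the similarity score, weighted by how distinctive each term is across the corpus.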

Success

Today I got a vector search app to the minimum viable product stage. I made a web page that takes any detailed question or description about any conspiracy-related topic as input and outputs a list of the 20 most relevant news article summaries. All of the logic is Python, glued to the HTML with PyScript, with a CSV file stored on IPFS instead of a database.

<!DOCTYPE html>
<html lang="en">
<head>
    <title>WantToKnow Archive Vector Search</title>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <link rel="stylesheet" href="https://pyscript.net/releases/2023.05.1/pyscript.css" />
    <script defer src="https://pyscript.net/releases/2023.05.1/pyscript.js"></script>
    <style>
        body {
            margin-left: 20%;
            margin-right: 20%;
        }

        #mainstory {
            color: white;
            background-color: black;
            padding: 10px
        }

        textarea {
            width: 100%;
            height: 150px;
            padding: 12px 20px;
            box-sizing: border-box;
            border: 5px solid black;
            background-color: #f8f8f8;
            font-size: 16px;
            resize: none;
        }

        button {
            width: 100%;
            color: white;
            background-color: black;
            font-size: 24px;
            text-align: center;
            padding: 12px;
        }

        button:hover {
            color: black;
            background-color: white;
        }
    </style>
</head>
<body>
<py-config>
    packages = [
        "pandas",
        "scikit-learn"
    ]
    terminal = false
</py-config>
    
    <h1>WantToKnow.info Archive Vector Search</h1>
    <p>Find news article recommendations based on term frequency-inverse document frequency (TF-IDF) vector cosine similarities. A search returns the 20 most closely related summaries.</p>
    <p><strong>Instructions:</strong> enter a question or statement. When it comes to conspiracies and cover-ups, what do you most want to know? Be as detailed as possible. Five or six sentences is optimal. Press the submit button only once and wait for the data to be crunched.</p>
    
    <textarea id="askit">What do you want to know?</textarea>
    <button id="submit-btn">Submit Query for Processing</button>
    <div id="mainstory"></div>
    <div id="relatedstories"></div>
    
<script type="pyscript">
import pandas as pd
import re
from js import console
from pyscript import when, display, Element
from pyodide.http import open_url
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

@when('click', '#submit-btn')
def query():
    question = Element('askit').element.value
    Element('mainstory').write(question)
    url = 'the IPFS url of my csv file'
    df = pd.read_csv(open_url(url), sep='|', usecols=['ArticleId','Title','PublicationDate','Publication','Links','Description','Priority','url'])
    
    # Deduplication and NaN cleanup
    df = df.drop_duplicates('Title')
    df = df[df['Priority'].notna()]

    # Collapse runs of whitespace to a single space (str() also
    # coerces any NaN descriptions to the string 'nan')
    df['Description'] = df['Description'].apply(lambda x: re.sub(r'\s+', ' ', str(x)))

    # Collapse doubled quotes left over from CSV escaping
    df['Description'] = df['Description'].str.replace('""', '"', regex=False)

    # Strip inline paragraph styling, then the paragraph tags themselves
    df['Description'] = (
        df['Description']
        .str.replace('<p style="text-align: justify;font-size: 11pt; font-family: Arial;margin: 0 0 11pt 0">', '<p>', regex=False)
        .str.replace('<p style="text-align: justify;font-size: 11pt; font-family: Arial;margin: 0 0 10pt 0">', '<p>', regex=False)
        .str.replace('<p>', '', regex=False)
        .str.replace('</p>', '', regex=False)
    )

    # Prepend the query as row 0 so it is vectorized alongside the articles
    query_row = pd.DataFrame({'ArticleId': '54321', 'Title': 'Search Terms', 'PublicationDate': '', 'Publication': '', 'Links': '', 'Description': question, 'Priority': '', 'url': ''}, index=[0])
    df = pd.concat([query_row, df]).reset_index(drop=True)

    # Compute TF-IDF vectors and cosine similarities
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(df['Description'])
    cosine_similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix).flatten()

    # Find the 20 most similar articles; the final argsort position is
    # the query matching itself, so the slice stops at -1 to skip it
    similar_indices = cosine_similarities.argsort()[-21:-1][::-1]
    similar_items = df.iloc[similar_indices]

    # Display the results in the specified format
    result_html = ""
    for _, row in similar_items.iterrows():
        for col in df.columns:
            result_html += f"<b>{col}:</b> {row[col]}<br>"
        result_html += "<br>"

    display(result_html, target="relatedstories")

</script>

</body>
</html>

As of now, the results display needs work, but the thing is basically operational. Calling the main function with an event-listening decorator still seems weird to me, but it was the only way I could get it to work. I ended up using GPT-4 to get the cosine similarities computed efficiently and was surprised by how much better GPT-4 is than GPT-3.5.

When I first started this project, my plan was to pre-compute the vectors to conserve browser resources. But storing the vectors in the CSV made its size balloon from 27MB to 3.5GB. So I went with browser-computed vectors instead, and that actually seems fine. A search takes well under a minute, and the results are highly relevant.
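For what it's worth, the blow-up likely came from writing dense vectors as text in the CSV. TF-IDF matrices are overwhelmingly zeros, and SciPy can serialize them in a sparse binary format that stores only the nonzero entries, so if pre-computing ever becomes attractive again, something like this sketch (with hypothetical toy documents and a hypothetical filename) would keep the file small:

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the article descriptions
docs = ["alpha beta gamma", "beta gamma delta", "gamma delta epsilon"]
tfidf_matrix = TfidfVectorizer().fit_transform(docs)  # CSR sparse matrix

# Only the nonzero values and their indices are written to disk
sparse.save_npz("article_vectors.npz", tfidf_matrix)
loaded = sparse.load_npz("article_vectors.npz")
```

The vectorizer's fitted vocabulary would need to be saved alongside the matrix so that new queries could be transformed into the same vector space later.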

As for next steps, after cleaning up the display, there are a few directions I could take the project. I'd like to embed a Telegram group discussion in the page, but the available embed widget doesn't work, so I could try to do something with their API. I'm also looking at sending search results to GPT to generate a 500-word summary brief of the material. That might be pretty cool.
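The GPT brief could start from something as simple as a prompt builder that concatenates the top summaries. Everything below is a hypothetical sketch: the function name, model choice, and word limit are placeholders, and the API call itself is left in comments since it requires an OpenAI key.

```python
def build_brief_prompt(summaries, word_limit=500):
    """Combine search-result summaries into a single summarization prompt."""
    joined = "\n\n".join(f"- {s}" for s in summaries)
    return (
        f"Write a roughly {word_limit}-word brief synthesizing these "
        f"news article summaries:\n\n{joined}"
    )

# Hypothetical call via the OpenAI Python client (requires an API key):
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-4",
#     messages=[{"role": "user", "content": build_brief_prompt(top_summaries)}],
# )
# print(response.choices[0].message.content)
```

One wrinkle to watch: 20 full summaries may push against the model's context window, so the brief might need to work from a truncated slice of each summary.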


Read Free Mind Gazette on Substack

See my NFTs:

  • Small Gods of Time Travel is a 41 piece Tezos NFT collection on Objkt that goes with my book by the same name.
  • History and the Machine is a 20 piece Tezos NFT collection on Objkt based on my series of oil paintings of interesting people from history.
  • Artifacts of Mind Control is a 15 piece Tezos NFT collection on Objkt based on declassified CIA documents from the MKULTRA program.

Kudos to you on making that web page
You’re amazing

Hey thanks!

It’s been a week or so, and you’re thriving with this project!! I’d love to try it out :) So even with browser-computed vectors, does it take up tons of space, or can anyone experiment with it regardless of how much RAM they have?

After a little cleanup anyone will be able to try it out. The RAM requirements aren't outrageous with the way it works now. I'll send you a copy soon :)
