Contemplating Vector Search

in #programming · 19 days ago

[Image: Stable Diffusion render of a blue python and a magnifying glass]

The above image was made with Stable Diffusion using the prompt 'closeup blue python and a magnifying glass.'

Last year, I wrote a news recommendation algorithm for WantToKnow.info. You can read about the project here and test out the recommendations by clicking on any article title in our archive. The recommendations are based on something called TF-IDF vector cosine similarity, which is to say the mathematical relationships between news stories.
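To make that concrete, here's a toy sketch of the idea using scikit-learn. The three article summaries are made up for illustration, not from our archive:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical article summaries, for illustration only
docs = [
    "Declassified documents reveal CIA surveillance program",
    "New CIA files show decades of domestic surveillance",
    "Local bakery wins award for best sourdough bread",
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Pairwise cosine similarity: rows and columns correspond to documents
sim = cosine_similarity(vectors)
print(sim.round(2))  # docs 0 and 1 score much higher with each other than with doc 2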

The recommendations generated by my code are great, so I started wondering about expanding the underlying tech to vector search. WantToKnow already has good search, but it's keyword based. My thinking is that vectorizing search queries and then comparing the query vectors against the news article vectors could surface good stories in situations where keywords alone aren't cutting it.

Yesterday, I decided to get into vector search for real. My goal is to make a web page that takes any detailed question or description about any conspiracy-related topic as input and outputs a list of the 20 most relevant news article summaries. As a fun added constraint, I want to make this search app front-end only, writing all of the logic in Python inside the HTML with PyScript. For this project, instead of a database, I plan to use a CSV file stored on IPFS.
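The core search logic I have in mind would look roughly like this. This is a sketch with hypothetical article texts standing in for the real archive; with only three documents the top-20 slice is trivial, but the shape of the thing is the same:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the real article summaries
articles = [
    "Senate report details mass surveillance of citizens",
    "Whistleblower exposes pharmaceutical trial cover-up",
    "Study links pesticide exposure to declining bee populations",
]

vectorizer = TfidfVectorizer()
article_matrix = vectorizer.fit_transform(articles)

query = "government spying on ordinary people"
query_vector = vectorizer.transform([query])  # reuse the fitted vocabulary

# Rank articles by cosine similarity to the query, best first
scores = cosine_similarity(query_vector, article_matrix).ravel()
top = scores.argsort()[::-1][:20]  # top 20 (here only 3 exist)
for i in top:
    print(f"{scores[i]:.2f}  {articles[i]}")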

Diving in, I created a pandas dataframe of our news archive, then computed TF-IDF vectors for all articles, storing these vectors in a new column. Here's the code I started with:

import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# Load the pipe-delimited archive export
df = pd.read_csv("C:\\datasources\\ArticleScrubbed.csv", sep='|', usecols=['ArticleId', 'Title', 'PublicationDate', 'Publication', 'Links', 'Description', 'Priority', 'url'])

# Deduplication and NaN cleanup
# drop_duplicates returns a new DataFrame, so the result must be assigned back
df = df.drop_duplicates(subset='Title')
df = df[df['Priority'].notna()]

# Substitute runs of whitespace with a single space
df['Description'] = df['Description'].apply(lambda x: re.sub(r'\s+', ' ', str(x)))

# Compute the TF-IDF vectors for the preprocessed article text
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['Description'])

# toarray() densifies the sparse matrix: one float per vocabulary term per article
df['Vector'] = tfidf_matrix.toarray().tolist()
df.tail(6)

The code executed without incident, but when I wrote the DataFrame to a new CSV file, it took forever. Then I looked at the size of the file: 3.5GB. Without the vectors, the entire news archive is only about 27MB. And suddenly it became clear why not everybody uses vector search.
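In hindsight, the blow-up makes sense: toarray() expands the sparse TF-IDF matrix into a dense one with a float for every vocabulary term in every article, and writing all of those floats out as text compounds the problem. One option I'm considering is keeping the matrix sparse and saving it in a binary format instead. A rough sketch, with toy texts and a placeholder file name:

import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts standing in for the archive's Description column
texts = ["dense storage wastes space", "sparse storage keeps only nonzero entries"]
tfidf_matrix = TfidfVectorizer().fit_transform(texts)

# save_npz stores only the nonzero entries, so the file stays close to the
# true information content instead of one float per vocabulary term per row
scipy.sparse.save_npz("tfidf_vectors.npz", tfidf_matrix)
loaded = scipy.sparse.load_npz("tfidf_vectors.npz")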

Now I'm left wondering whether it would be better to proceed with the pre-computed vectors or to use browser resources to calculate and compare vectors on the fly. Neither path is optimal, as both could add gigabytes of load to user systems. All I really want is to get to the point where I can test vector search and see whether it might be worth implementing on our site. I'm sure I'll get there, but it might be a bumpy ride.
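Before committing to either path, I can at least measure how big each option actually is. Something like this sketch, again with toy texts standing in for the archive, compares the dense and sparse footprints of the same matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins; the real run would use the archive's Description column
texts = [
    "placeholder article text one",
    "placeholder article text two",
    "placeholder article text three",
]
m = TfidfVectorizer().fit_transform(texts)

n_rows, n_cols = m.shape
dense_bytes = n_rows * n_cols * 8  # float64 for every cell after toarray()
sparse_bytes = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes  # CSR internals
print(f"dense: {dense_bytes:,} bytes vs sparse: {sparse_bytes:,} bytes")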


Read Free Mind Gazette on Substack

Read my novels:

See my NFTs:

  • Small Gods of Time Travel is a 41 piece Tezos NFT collection on Objkt that goes with my book by the same name.
  • History and the Machine is a 20 piece Tezos NFT collection on Objkt based on my series of oil paintings of interesting people from history.
  • Artifacts of Mind Control is a 15 piece Tezos NFT collection on Objkt based on declassified CIA documents from the MKULTRA program.

Wow, I can’t imagine any technical work like this NOT being a bumpy ride. And your patience, perseverance, and cognitive prowess will lead you to meaningful insights, wherever this complex project takes you. I’m in awe of the potential of this project of yours! And rooting for you every step of the way :)

Your kudos are well received :) At this point I'm pretty close to a minimum viable product. Should be fun!

Please, how is code created? Do you just form the code in your head, or how do you go about it?

Usually I start with some data, map out the logic involved in transforming the data into what I need, and then write the code. Sometimes I know enough to just write the code straight from my head, but I also use a combination of Stack Overflow and GPT to get the code right.
