Part 3/13:
The backbone of the engine is the arXiv dataset, a colossal repository of academic preprints and papers. With over 1.7 million articles, the dataset provides rich metadata (titles, authors, publication dates, and abstracts) as well as full PDFs, which together form the core input for the engine.
Shapiro downloaded and parsed the dataset, which comprises 2.1 million JSON entries, each representing a paper. By indexing the essential metadata first, especially titles and abstracts, he laid the groundwork for semantic search without immediately needing the full texts, which are more cumbersome to process because of their size.
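The parsing step described above can be sketched roughly as follows. This is a minimal illustration, not Shapiro's actual code: it assumes the metadata arrives as one JSON object per line with `id`, `title`, and `abstract` fields, and the names `iter_papers` and `build_index` are hypothetical helpers introduced only for this example.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_papers(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield a slim record (id, title, abstract) for each JSON line.

    Assumption: one JSON object per line, with "id", "title" and
    "abstract" keys. Other fields (authors, dates, ...) are dropped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {
            "id": record.get("id") or "",
            # Collapse line breaks and repeated spaces inside the text.
            "title": " ".join((record.get("title") or "").split()),
            "abstract": " ".join((record.get("abstract") or "").split()),
        }


def build_index(lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Map paper id -> slim record, reading the input one line at a time."""
    return {paper["id"]: paper for paper in iter_papers(lines)}


# Tiny in-memory sample standing in for the real metadata file.
sample = [
    '{"id": "0704.0001", "title": "A  sample\\n title", '
    '"abstract": "  First abstract. ", "authors": "A. Author"}',
    "",
    '{"id": "0704.0002", "title": "Another title", '
    '"abstract": "Second abstract."}',
]
index = build_index(sample)

# With a real file (the path is a placeholder), the same call streams
# the entries without loading the whole file into memory at once:
#     with open("path/to/metadata.json", encoding="utf-8") as f:
#         index = build_index(f)
```

Reading line by line keeps memory use tied to the slim records rather than the raw file, which matters at the scale of millions of entries. Only titles and abstracts are kept here because those are the fields the text singles out for semantic search.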