Part 3/10:
The project harnesses the Simple English Wikipedia as its foundational data source. At roughly 1GB uncompressed, it is far smaller than the full English Wikipedia (~81GB uncompressed) and much easier to process, making it ideal for initial development. For testing and iteration, a 4.2MB subset is used to streamline experimentation.
The data comes from Wikipedia's XML exports, which are parsed and cleaned into an accessible format: plain text free of formatting marks, hyperlinks, and unnecessary markup. Scripts strip away the wiki markup, HTML tags, references, citations, and category tags, leaving only the core article content, as in the sketch below.
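A minimal sketch of this parse-and-clean step, assuming a MediaWiki XML export file (the file name and export namespace version here are placeholders) and the third-party mwparserfromhell library for stripping wiki markup; the original scripts may differ.

```python
import xml.etree.ElementTree as ET
import mwparserfromhell

# Namespace embedded in the dump's XML tags; the version suffix varies by export.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_clean_articles(dump_path):
    """Stream <page> elements from the XML dump and yield (title, plain_text)."""
    for _, elem in ET.iterparse(dump_path, events=("end",)):
        if elem.tag != NS + "page":
            continue
        title = elem.findtext(NS + "title", default="")
        wikitext = elem.findtext(f"{NS}revision/{NS}text", default="") or ""
        # strip_code() removes templates, links, refs, and other wiki markup,
        # leaving roughly the plain article text.
        plain = mwparserfromhell.parse(wikitext).strip_code()
        yield title, plain.strip()
        elem.clear()  # free memory as we stream through the dump

if __name__ == "__main__":
    # Hypothetical subset file used for quick iteration.
    for title, body in iter_clean_articles("simplewiki-subset.xml"):
        print(title, len(body))
```

Streaming with iterparse keeps memory use flat even on the full dump, since each page element is cleared as soon as it has been processed.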
From Raw Data to Cleaned, Searchable Text
The processing pipeline includes several key steps: