Part 5/8:
Running this script on large documents, such as technical manuals or textbooks, turns cumbersome PDFs into clean, manageable text files. For example, a 2.3 MB PDF of a technical guide was converted into a 353 KB text file that preserves page delineations and is suitable for LLM ingestion.
Extracting Text from Word Documents (DOCX)
Why Microsoft Word is Superior
When dealing with Word files, David advocates using Microsoft Word itself for conversion, especially when installed locally. PowerShell scripts automate this process effectively.
PowerShell Solution
He provides a PowerShell script to:
Loop through all DOCX files in a source directory.
Open each file invisibly (headless mode).
Save it as a plain text file in a destination directory.
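The three steps above can be sketched roughly as follows. This is a minimal illustration and not David's actual script: the folder paths are hypothetical placeholders, and it assumes Microsoft Word is installed locally so that the Word.Application COM object is available. The value 2 is wdFormatText from Word's WdSaveFormat enumeration. How PowerShell binds the SaveAs arguments can differ between PowerShell and Office versions, so the [ref] casts may need adjusting on your machine.

```powershell
# Hypothetical paths for illustration; change them to match your setup.
$sourceDir = 'C:\Docs\Source'
$destDir   = 'C:\Docs\Text'

# wdFormatText (plain text) in Word's WdSaveFormat enumeration.
$wdFormatText = 2

# Create the destination directory if it does not exist yet.
New-Item -ItemType Directory -Path $destDir -Force | Out-Null

# Start Word through COM and keep its window hidden ("headless").
$word = New-Object -ComObject Word.Application
$word.Visible = $false
$word.DisplayAlerts = 0   # wdAlertsNone: suppress dialogs

try {
    # Loop through every DOCX file in the source directory.
    Get-ChildItem -Path $sourceDir -Filter '*.docx' -File | ForEach-Object {
        $target = Join-Path $destDir ($_.BaseName + '.txt')

        # Open(FileName, ConfirmConversions, ReadOnly)
        $doc = $word.Documents.Open($_.FullName, $false, $true)
        try {
            # Save a plain text copy in the destination directory.
            $doc.SaveAs([ref][object]$target, [ref][object]$wdFormatText)
        }
        finally {
            # Close without saving changes to the original document.
            $doc.Close($false)
        }
    }
}
finally {
    # Always shut Word down, even if a file fails to convert.
    $word.Quit()
    [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($word)
}
```

The try/finally blocks are a design choice of this sketch: they keep a hidden WINWORD.EXE process from lingering in the background when a single document fails to open or save.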