Turning a Physical Book/Doc into an ebook/doc with OCR

in #tech · 6 months ago

Here I go into how to scan physical books and documents, run optical character recognition (OCR) using Tesseract to make those scans searchable, and optionally dump that text into a text file that can be shared or converted into another format (such as an EPUB). There are a few ways to go about this, and I go over a few alternatives, but I personally prefer the ocrmypdf tool, which uses Tesseract to run OCR on a PDF, followed by the Poppler utilities to convert the PDF into a text document.

Usually I would do a post in both video and written form, but in this case I think most of the value is in me recording the process. If this is something that interests you, the video is probably the best place to start.

That said, to summarize: we can use Tesseract, bundled with a few other tools, to convert a PDF of a scanned book or document into a PDF that has searchable text and is readable by TTS and other accessibility software. Optionally, we can also dump the text from that PDF to save space or to turn it into something like a web page or an EPUB ebook.

There are a few ways to go about this. The best way (IMO) is to use the OCRmyPDF and Poppler Utils software; however, they are easiest to set up on Linux. If you're on Windows or Mac, there are a few options to achieve the same results:

James Villemaret wrote a Python script (Convert.py) that does a very good job of getting very similar results on Windows using Tesseract and a few other pieces of software. It works on Windows only, though, and is a tad more complicated to set up.

There is also the Windows Subsystem for Linux, which can run Linux software from within Windows as if you're running it on Linux. This is probably the best way, assuming you're fine with the learning curve that comes with it. Lastly, on any operating system, a virtual machine with shared folders can accomplish these tasks just fine as well.
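Whichever route you pick (native Linux, the Subsystem, or a VM), it can help to confirm that both tools are actually on your PATH before trying to OCR anything. Here is a minimal sketch of such a check; the helper names have_tool and check_tools are made up for illustration and are not part of either tool.

```shell
#!/bin/sh
# Sketch: check that the tools this post relies on are on the PATH.
# have_tool and check_tools are hypothetical helper names.

# Succeeds (exit status 0) if the named command can be found.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Print one status line per tool name passed in.
check_tools() {
    for tool in "$@"; do
        if have_tool "$tool"; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

# ocrmypdf comes from the ocrmypdf package,
# pdftotext comes from the poppler-utils package.
check_tools ocrmypdf pdftotext
```

If either line says "missing", the install commands in the Commands list are the place to start.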

If you're already well versed with this sort of stuff, this quick summary might be all you need to jump in, but if you're not and are still interested, the resources section and video above might be worth checking out.

Convert.py Tutorial
Setting up a Debian VM*
*you only need ~1-2 GB of RAM and ~8-12 GB of storage
LibreOffice

Commands (requires Debian or a Debian-based OS, VM, or Subsystem)
OCR tool install command: apt install ocrmypdf
Poppler Utils install command: apt install poppler-utils
OCR a PDF: ocrmypdf [filelocation, eg: ./input.pdf] [outputfilename]
PDF to text: pdftotext [filelocation, eg: ./OCRed_input.pdf] [outputfilename]
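To tie the two steps together, here is a rough sketch of a small script that OCRs a scanned PDF and then dumps its text. The "_ocr" suffix, the function names, and the idea of deriving the output names from the input name are all my own assumptions for illustration. The only real commands used are ocrmypdf and pdftotext, each called with an input file followed by an output file.

```shell
#!/bin/sh
# Sketch: OCR a scanned PDF, then dump its text layer to a .txt file.
# The "_ocr" suffix and all function names are illustrative assumptions.

# Derive "book_ocr.pdf" from "book.pdf".
ocr_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Derive "book.txt" from "book.pdf".
txt_name() {
    printf '%s.txt\n' "${1%.pdf}"
}

# Run both steps on one input file; skip the text dump if OCR fails.
convert_scan() {
    input=$1
    ocred=$(ocr_name "$input")
    ocrmypdf "$input" "$ocred" && pdftotext "$ocred" "$(txt_name "$input")"
}

# Only do any work when a file name is passed on the command line.
if [ "$#" -ge 1 ]; then
    convert_scan "$1"
fi
```

Saved as, say, scan2text.sh (a hypothetical name), running `sh scan2text.sh ./input.pdf` should leave ./input_ocr.pdf and ./input.txt next to the original scan, assuming both tools are installed.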