I finally took my bookkeeping paperless and now scan paper invoices with the Microsoft Lens mobile app. This works, but Lens doesn’t OCR the photo, so the text of the scans is not searchable. It would be nice to find a scanned PDF by searching for a word it contains.
Scanned PDFs can be made searchable by adding a layer of text over the actual image. The text layer positions the characters over the image so it looks like you can select the words in the scan by highlighting them. To extract the text from an image we need OCR software. The OCR program will guess the characters found in the image. Tesseract is free and open-source software that can detect text in an image.
Tesseract cannot read PDFs, nor does it clean the images before attempting to extract the text. Scanned PDFs are often skewed and have black bands on the sides, which makes it harder for Tesseract to interpret the text. pdfsandwich is command-line software that combines unpaper (to straighten and clean the PDFs), ImageMagick (for image manipulation), and Tesseract in a single package. The end result is a “sandwiched PDF” with the scanned image as the bottom layer and the OCRed text on the top. All without touching Adobe Acrobat!
It’s important to specify the language of the document as it greatly improves the text recognition. Run pdfsandwich with the -list_langs flag to list the available language codes (see the Tesseract documentation to install extra languages). Pass these three-letter codes to pdfsandwich with the -lang flag. If a document mixes languages, join the codes with a + sign.
$ pdfsandwich -lang eng+nld document.pdf
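Before starting a long OCR run it can help to check that every language you ask for is actually installed. This is a minimal sketch: the function name langs_available is made up for this example, and it assumes that `tesseract --list-langs` prints one language code per line after a header line (older Tesseract versions print the list on stderr, hence the `2>&1`).

```bash
# Hypothetical helper: succeed only if every code in a spec like "eng+nld"
# appears in the output of `tesseract --list-langs`.
langs_available() {
    local spec="$1"
    local installed code
    # Capture stdout and stderr, since older versions list languages on stderr.
    installed=$(tesseract --list-langs 2>&1)
    # Split the spec on "+" and look for each code as a whole line.
    for code in $(printf '%s' "$spec" | tr '+' ' '); do
        if ! printf '%s\n' "$installed" | grep -qx -- "$code"; then
            echo "missing Tesseract language: $code" >&2
            return 1
        fi
    done
}
```

Something like `langs_available eng+nld && pdfsandwich -lang eng+nld document.pdf` would then only start the OCR when both languages are present.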
The result is stored in a new ‘document_ocr.pdf’ file (pass the -o flag to choose a different name). The -rgb flag preserves color (careful though, the documentation mentions it could cause problems with some color spaces). In my case the -rgb option would sometimes create white boxes in the resulting PDF. I solved this by normalizing the image: the -coo flag passes extra options on to ImageMagick’s convert, so a simple -coo "-normalize" did the trick.
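Put together, the flags above could be wrapped in a small shell function. This is only a sketch: the function name ocr_color_scan, the `_searchable.pdf` suffix, and the default language eng are my own choices for illustration. Only the -lang, -rgb, -coo and -o flags come from pdfsandwich.

```bash
# Hypothetical helper: OCR a color scan and write <name>_searchable.pdf next to it.
ocr_color_scan() {
    local input="$1"
    local langs="${2:-eng}"                      # assumed default language
    local output="${input%.pdf}_searchable.pdf"  # made-up naming scheme
    # -rgb keeps color, -coo hands "-normalize" to ImageMagick's convert,
    # and -o overrides the default <name>_ocr.pdf output file.
    pdfsandwich -lang "$langs" -rgb -coo "-normalize" -o "$output" "$input"
}
```

Calling `ocr_color_scan invoice.pdf eng+nld` would then produce `invoice_searchable.pdf` in the same folder.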
Happy Ctrl+F’ing! You now have a machine-readable PDF.
This works fine for a single file, but we can do better. This little bash script converts all PDF files in a folder. Since we don’t want to OCR PDF files we already converted, we check whether the PDF file includes fonts. If it does, we know the PDF contains some sort of text and we don’t spend time converting it again.
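A minimal sketch of that approach follows. It makes several assumptions of my own: the font check uses pdffonts from poppler-utils, the names has_text_layer and ocr_folder are invented for this example, the language defaults to eng, and a successfully OCRed file replaces the original scan so that the font check skips it on the next run. Keep a backup of the scans if you try this.

```bash
#!/usr/bin/env bash
# Sketch: OCR every PDF in a folder that does not contain any fonts yet.
# Assumes pdffonts (poppler-utils) and pdfsandwich are on the PATH.

# Succeeds when pdffonts lists at least one font, i.e. the PDF already has text.
has_text_layer() {
    # pdffonts prints two header lines, then one line per font it finds.
    pdffonts "$1" 2>/dev/null | tail -n +3 | grep -q .
}

# OCR all PDFs in the given folder; the language spec defaults to eng (assumption).
ocr_folder() {
    local dir="$1"
    local langs="${2:-eng}"
    local pdf out
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue   # no PDFs: the glob pattern stays literal
        if has_text_layer "$pdf"; then
            echo "skip (already has text): $pdf"
            continue
        fi
        out="${pdf%.pdf}_ocr.pdf"
        # Replace the scan only when pdfsandwich reports success.
        if pdfsandwich -lang "$langs" -o "$out" "$pdf"; then
            mv -- "$out" "$pdf"
        else
            echo "OCR failed: $pdf" >&2
        fi
    done
}

# Run only when a folder is passed, e.g. ./ocr-folder.sh ~/invoices eng+nld
if [ "$#" -gt 0 ]; then
    ocr_folder "$@"
fi
```

Replacing the scan in place is a design choice of this sketch: it keeps the folder free of duplicates and makes the font check a reliable "already done" marker. If you would rather keep the untouched scans, drop the `mv` and skip any file that already has an `_ocr.pdf` sibling instead.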
This can now easily be run as a cron job to periodically OCR all scanned PDF invoices.
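As an illustration, a crontab entry (added with `crontab -e`) along these lines would run the script at the top of every hour. The script path, its arguments, the invoice folder, and the log file are all placeholders I made up. Adjust them to wherever you saved the script and to whatever arguments it expects.

```
# m h dom mon dow  command
0 * * * * /home/me/bin/ocr-folder.sh /home/me/invoices eng+nld >> /home/me/ocr-folder.log 2>&1
```

Redirecting the output to a log file makes it easier to spot failed conversions later, since cron jobs run without a terminal.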
This post is open source. Did you spot a mistake? Ideas for improvements? Contribute to this post via GitHub. Thank you!