26 January 2012

Document Scanning in Linux Using Perl

I decided I wanted to take all the old bank statements, credit card bills, and paystubs that we keep in a file cabinet, and scan them into digital format - PDF, to be precise. This is more secure, because we can easily make backup copies and encrypt anything sensitive. And it's less clutter, because even a small hard drive can hold many lifetimes of statements, records, and bills. It was a pretty straightforward plan:
  1. Buy a scanner
  2. Scan all the documents
  3. Shred all the documents
Of course, being me, I had to improve the process a little.

First off, we chose the Fujitsu ScanSnap S1500. There is also an identical S1500M, the only difference being whether the bundled software is written for Windows (S1500) or Mac (S1500M). Notice there is no S1500L (L is for Linux, boys and girls!). No big deal, though: a few minutes of Googling left me pretty certain that my real operating system already had drivers that would handle it. The scanner was a bit more expensive than I would have liked, but having used it for a while now, I couldn't be happier with it. It scans fast, it scans well, and it does multiple pages and full duplex like a champ. Oh, and it opens like a frickin' Transformer. Awesome!

It uses USB, and I use Linux, so I politely ignored all the step-by-step instructions and software, and instead powered it on and jammed the USB cable into my machine to see what Ubuntu thought of it. Ubuntu thought it looked like a scanner, and it might like to do some scanning with it. No downloads, no crapware, no driver hell, and no problems. Nice. After a few test scans to get my resolution, orientation, and whatnot the way I liked them, I grabbed a handful of monthly bank statements and started scanning.

OK, that's tedious, and I hated having to re-type the damn file name every time. Chase_2007-01.pdf, Chase_2007-02.pdf, Chase_2007-kill-me-now.pdf. It would be nice to simply tap a key on the keyboard every time I get a document ready. I don't need a GUI, so maybe scripting is the right answer... more Googling.

To take input from a scanner and turn it into a multi-page PDF, (at least) 3 steps are necessary:

Step 1: Grab Images From the Scanner
On the Linux command-line, there's a great command called scanadf that will scan all the pages in an automatic document feeder (ADF) and store them as PNM files. The command for doing that for the February 2007 Chase bank statement is:

scanadf -o Chase_2007-02_%d.pnm --source "ADF Duplex" --mode Lineart --resolution 150

Notice the "%d" in the file name. If the statement has multiple pages, each page becomes a separate PNM file, with the "%d" replaced by a sequential page number (Chase_2007-02_1.pnm, Chase_2007-02_2.pnm, and so on). PNM files are simple, uncompressed bitmap images.

Step 2: Convert the PNM Files to PostScript
This is a dumb intermediate step, in my opinion. I don't see why no one has simply written a direct PNM-to-PDF conversion utility. But whatever, another command called pnmtops gets this done:

pnmtops -noturn -rle Chase_2007-02_1.pnm > Chase_2007-02_1.eps

This command gets repeated once for each page, varying the file name like scanadf did. The result is a bunch of Encapsulated PostScript files, one for each page.
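Here is a minimal sketch of that per-page loop in plain sh. The function names page_files and convert_pages are made up for illustration and are not taken from the published script. It assumes the pages are numbered from 1 with no gaps, as in the file names above.

```shell
# Hypothetical helpers; the names are illustrative only.

# List a statement's page files in numeric page order:
# base_1.ext, base_2.ext, ... stopping at the first missing number.
# Counting up from 1 avoids the lexical ordering of a shell glob,
# which would put page 10 ahead of page 2.
page_files() (
    base=$1 ext=$2 n=1
    while [ -e "${base}_${n}.${ext}" ]; do
        printf '%s\n' "${base}_${n}.${ext}"
        n=$((n + 1))
    done
)

# Convert every scanned page of one statement from PNM to EPS,
# using the same pnmtops flags as the command above.
convert_pages() (
    base=$1
    page_files "$base" pnm | while IFS= read -r pnm; do
        pnmtops -noturn -rle "$pnm" > "${pnm%.pnm}.eps"
    done
)

# Usage: convert_pages Chase_2007-02
```

With three scanned pages on disk, convert_pages Chase_2007-02 would leave Chase_2007-02_1.eps through Chase_2007-02_3.eps next to the PNM files.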

Step 3: Convert the EPS Files to a Single PDF
I had never used Ghostscript before, but I had certainly heard of it. It always sounded kind of glamorous and mysterious - it's definitely mysterious. I tried to read its man page and got lost almost immediately (Adobe's docs always do this to me, too), so eventually I just found a recipe for doing what I wanted and stopped asking questions:

gs -q -dSAFER -dNOPAUSE -dBATCH -sOutputFile=Chase_2007-02.pdf -sDEVICE=pdfwrite Chase_2007-02_*.eps
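One thing to watch with the command above: the shell expands the glob in lexical order, so for a statement with ten or more pages, page 10 would be merged ahead of page 2. The sketch below is one way around that, building the page list by counting up from 1. The function name merge_pages is hypothetical, and it assumes the pages are numbered from 1 with no gaps. The gs flags are exactly the ones from the recipe.

```shell
# Hypothetical helper; the name is illustrative only.
# Merge base_1.eps, base_2.eps, ... into base.pdf in numeric page order.
merge_pages() (
    base=$1 n=1
    set --                                   # start with an empty argument list
    while [ -e "${base}_${n}.eps" ]; do
        set -- "$@" "${base}_${n}.eps"       # append the next page
        n=$((n + 1))
    done
    [ $# -gt 0 ] || { echo "no pages found for $base" >&2; exit 1; }
    gs -q -dSAFER -dNOPAUSE -dBATCH \
       -sOutputFile="${base}.pdf" -sDEVICE=pdfwrite "$@"
)

# Usage: merge_pages Chase_2007-02
```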

With all these new-found powers at my disposal, I wrote a Perl script. I had it bump the month up by one and scan in all the pages every time I hit the enter key. I gave it the ability to pass in the starting month/year and the base file name ("Chase") on the command-line. Then I found I was missing the odd statement now and then, so I gave myself the ability to type "skip 1" instead of just hitting enter - this skips a month (guess what "skip 4" does) and continues scanning. Then I came across a monthly statement that was in color and on a single side of each page, so I added more command-line arguments to switch to "Photo" mode at 300dpi and to use "ADF Front" instead of duplex. Then I started scanning statements from a bank that for some reason liked to end its months on the 15th. Being a precise kind of guy, I added the ability to optionally include the day in the statement date: Chase_2007-02-15.pdf.
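The real script is Perl, but the month bookkeeping is easy to sketch in sh. Everything below is an illustration under my own assumptions: the names parse_skip, next_month, and statement_name are hypothetical, and the published script may well do it differently.

```shell
# Hypothetical helpers; the names are illustrative only.

# Turn one line of keyboard input into a number of months to skip:
# an empty line (just Enter) means 0, "skip N" means N.
parse_skip() (
    case $1 in
        "")       echo 0 ;;
        "skip "*) n=${1#"skip "}
                  case $n in ""|*[!0-9]*) exit 1 ;; esac
                  echo "$n" ;;
        *)        exit 1 ;;
    esac
)

# Advance a year and month by (1 + skip) months and print YYYY-MM.
# ${2#0} strips a leading zero so that 08 and 09 are not read as octal.
next_month() (
    year=$1 month=${2#0} skip=${3:-0}
    total=$(( year * 12 + (month - 1) + 1 + skip ))
    printf '%04d-%02d\n' $(( total / 12 )) $(( total % 12 + 1 ))
)

# Build the output name, with the day appended only when one is given.
statement_name() (
    base=$1 period=$2 day=$3
    if [ -n "$day" ]; then
        printf '%s_%s-%02d\n' "$base" "$period" "${day#0}"
    else
        printf '%s_%s\n' "$base" "$period"
    fi
)

# Usage: statement_name Chase "$(next_month 2007 01)" 15
```

Working in "months since year 0" keeps the December-to-January rollover out of the caller's way, which matters once "skip 4" can cross a year boundary.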

Then I ran into a brokerage statement that wanted to be landscape.

Step 1.5: Rotate the Damn Page 90 Degrees
Believe it or not, you need another program for this, called unpaper. This is a powerful and feature-rich utility, but I only use it to rotate the page. So the script only executes the following command if rotation is selected:

unpaper --pre-rotate -90 --no-processing 1 Chase_2007-02_1.x.pnm Chase_2007-02_1.pnm

Notice that the input file has that extra ".x." in it - I add that to the output of scanadf if rotation is selected.
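A rough sketch of how the scan and the optional rotation could fit together, using only the commands and flags shown above. The function name scan_statement and its yes/no rotate argument are hypothetical. It also assumes that "--no-processing 1" is right for every page, since each unpaper call here handles a single file.

```shell
# Hypothetical helper; the name and the yes/no argument are illustrative only.
# Scan one statement. When rotate is "yes", scanadf writes base_N.x.pnm and
# unpaper rotates each page into base_N.pnm; otherwise scanadf writes
# base_N.pnm directly.
scan_statement() (
    base=$1 rotate=$2
    if [ "$rotate" = yes ]; then suffix=.x.pnm; else suffix=.pnm; fi

    scanadf -o "${base}_%d${suffix}" --source "ADF Duplex" \
            --mode Lineart --resolution 150

    [ "$rotate" = yes ] || exit 0            # nothing more to do

    n=1
    while [ -e "${base}_${n}.x.pnm" ]; do
        unpaper --pre-rotate -90 --no-processing 1 \
                "${base}_${n}.x.pnm" "${base}_${n}.pnm"
        n=$((n + 1))
    done
)

# Usage: scan_statement Chase_2007-02 yes
```

Either way, the later steps see the same base_N.pnm names, so they do not need to know whether rotation happened.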

What else could this amazing script possibly need? Well, I didn't like that the title displayed in my PDF Viewer was "Chase_2007-02_1.eps" - that seemed a little amateur. So the script also creates and then uses a pdfmark file (link is a PDF) that sets the Title to "Chase_2007-02", and also sets the CreationDate property to February 15, 2007, because really, why not.
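For the curious, a file like that is only a few lines of PostScript. Below is a minimal sketch that writes one; the function name write_pdfmarks is hypothetical, the midnight timestamp is my own choice, and it assumes the title contains no parentheses or backslashes (those would need escaping).

```shell
# Hypothetical helper; the name is illustrative only.
# Print a DOCINFO pdfmark that sets the PDF's Title and CreationDate.
# PDF dates take the form D:YYYYMMDDHHmmSS; the time is fixed at midnight.
# ${3#0} and ${4#0} strip a leading zero so 08 and 09 are not read as octal.
write_pdfmarks() (
    title=$1 year=$2 month=${3#0} day=${4#0}
    printf '[ /Title (%s)\n' "$title"
    printf '  /CreationDate (D:%04d%02d%02d000000)\n' "$year" "$month" "$day"
    printf '  /DOCINFO pdfmark\n'
)

# Usage: write_pdfmarks Chase_2007-02 2007 02 15 > Chase_2007-02.pdfmarks
```

The usual recipe, as far as I know, is to list the resulting file on the gs command line after the EPS files, so that its values replace the ones picked up from the first page.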

Perfect. I will never need to change this script again. Now let's scan some bi-weekly paystubs. Oops.

The Wrong Way
OK, I'm a professional developer, and I'm pretty good at what I do. I know the right thing at this point would have been to change the script to be able to specify the period. But at that moment, for some reason, I chose to copy the entire script, and change that copy into a bi-weekly one. Ugh. 

Later that same day I realized that manually scanning even the annual statements from retirement accounts and whatnot was kind of cumbersome. So of course that's when I fixed my mistake, combined the two scripts, and added annual periodicity, right? No. I cloned the script again. I'm so ashamed.

I lived with this abomination for nearly a week before it finally became too much for me to handle. It was the software equivalent of screaming obscenities in a silent church. Never mind that it was perfectly functional - that's not the point. It needed refactoring. Since it is just a little utility script, after all, I compromised on perfection. There are still three separate scripts, but each now handles only its own period (monthly, yearly, or bi-weekly). They share a common module that does the actual scanning and the common option handling, so while the design is still terrible, at least I have some code reuse.
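For comparison, here is a sketch of the road not taken: one date-stepping routine that takes the period as an argument. This is not how the published scripts work. The function names and period keywords are hypothetical, and the monthly, quarterly, and yearly steps assume the day exists in the target month (a statement day of 28 or lower is always safe).

```shell
# Hypothetical helpers; names and period keywords are illustrative only.

# Number of days in a given month, including leap-year Februaries.
days_in_month() (
    year=$1 month=$2
    case $month in
        4|6|9|11) echo 30 ;;
        2) if [ $((year % 4)) -eq 0 ] &&
              { [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; }; then
               echo 29
           else
               echo 28
           fi ;;
        *) echo 31 ;;
    esac
)

# Print the date one period after the given year, month, and day.
next_date() (
    period=$1 year=$2 month=${3#0} day=${4#0}
    case $period in
        yearly)    year=$((year + 1)) ;;
        quarterly) month=$((month + 3)) ;;
        monthly)   month=$((month + 1)) ;;
        biweekly)
            dim=$(days_in_month "$year" "$month")
            day=$((day + 14))
            # 14 days can cross at most one month boundary.
            if [ "$day" -gt "$dim" ]; then
                day=$((day - dim)); month=$((month + 1))
            fi ;;
        *) echo "unknown period: $period" >&2; exit 1 ;;
    esac
    if [ "$month" -gt 12 ]; then month=$((month - 12)); year=$((year + 1)); fi
    printf '%04d-%02d-%02d\n' "$year" "$month" "$day"
)

# Usage: next_date biweekly 2007 02 23
```

Adding a new period then means adding one case branch rather than cloning a script.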

I'm publishing the full source for it right here (zip) in case anyone wants to use it, adapt it, or improve it. Let me say one thing up-front, though: I am a terrible Perl programmer. My grasp of the syntax is nearly non-existent, requiring many trips to perldoc to figure out how to do the simplest things. And I realize that my Perl code looks like a C++ developer wrote it - there's a good reason for that. So if you've stumbled on this blog looking for scanning info, and you're a Perl master, have a good laugh at my expense. But please don't tell me about it.

Now if you'll excuse me, I have some quarterly (oh crap!) bills to scan.

4 comments:

  1. I'm going to spin class, but I did read this.
  2. This comment has been removed by a blog administrator.
  3. Noticed the link to the script is dead. Any chance you would fix it? Thanks.
  4. Link is fixed, thanks for letting me know.