Disclaimer: I am not an investment advisor. When I describe my own trading activities, it is not intended as advice or solicitation of any kind.

26 January 2012

Document Scanning in Linux Using Perl

I decided I wanted to take all the old bank statements, credit card bills, and paystubs that we keep in a file cabinet, and scan them into digital format - PDF, to be precise. This is more secure, because we can easily make backup copies and encrypt anything sensitive. And it's less clutter, because even a small hard drive can hold many lifetimes of statements, records, and bills. It was a pretty straightforward plan:
  1. Buy a scanner
  2. Scan all the documents
  3. Shred all the documents
Of course, being me, I had to improve the process a little.

First off, we chose the Fujitsu ScanSnap S1500. There is also an identical S1500M, the only difference being whether the bundled software is written for Windows (S1500) or Mac (S1500M). Notice there is no S1500L (L is for Linux, boys and girls!). No big deal, though, because after only a few minutes of Googling I was able to be pretty certain there were drivers on my real operating system that would handle it. It was a bit more expensive than I would have liked, but having used it for a while now, I couldn't be happier with it. It scans fast, it scans well, and it does multiple pages and full duplex like a champ. Oh, and it opens like a frickin' Transformer. Awesome!

It uses USB, and I use Linux, so I politely ignored all the step-by-step instructions and software, and instead powered it on and jammed the USB cable into my machine to see what Ubuntu thought of it. Ubuntu thought it looked like a scanner, and it might like to do some scanning with it. No downloads, no crapware, no driver hell, and no problems. Nice. After a few test scans to get my resolution, orientation, and whatnot the way I liked them, I grabbed a handful of monthly bank statements and started scanning.

OK, that's tedious, and I hated having to re-type the damn file name every time. Chase_2007-01.pdf, Chase_2007-02.pdf, Chase_2007-kill-me-now.pdf. It would be nice to simply tap a key on the keyboard every time I get a document ready. I don't need a GUI, so maybe scripting is the right answer... more Googling.

To take input from a scanner and turn it into a multi-page PDF, (at least) 3 steps are necessary:

Step 1: Grab Images From the Scanner
On the Linux command-line, there's a great command called scanadf that will scan all the pages in an automatic document feeder (ADF) and store them as PNM files. The command for doing that for the February 2007 Chase bank statement is:

scanadf -o Chase_2007-02_%d.pnm --source "ADF Duplex" --mode Lineart --resolution 150

Notice the "%d" in the file name. If it's a multiple page statement, each page will be a separate PNM file, and each file name will replace the "%d" with sequential page numbers. PNM is the simple, uncompressed bitmap image format used by the Netpbm tools.

Step 2: Convert the PNM Files to PostScript
This is a dumb intermediate step, in my opinion. I don't see why no one has simply written a direct PNM-to-PDF conversion utility. But whatever, another command called pnmtops gets this done:

pnmtops -noturn -rle Chase_2007-02_1.pnm > Chase_2007-02_1.eps

This command gets repeated once for each page, varying the file name like scanadf did. The result is a bunch of Encapsulated PostScript files, one for each page.
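The "repeated once for each page" part is just a loop. Here is a minimal sketch in plain shell (not the Perl from the actual script); the function names eps_name and convert_pages are made up for illustration, and it assumes the pages were scanned with the base name shown above.

```shell
# Hypothetical helper: turn a page's PNM file name into its EPS file name.
eps_name() {
    printf '%s\n' "${1%.pnm}.eps"
}

# Hypothetical helper: convert every scanned page of one statement.
# Argument: the base name, e.g. Chase_2007-02
convert_pages() {
    base=$1
    for pnm in "${base}"_*.pnm; do
        [ -e "$pnm" ] || continue          # the glob matched nothing
        case $pnm in
            *.x.pnm) continue ;;           # skip raw ".x." files left by the rotation step
        esac
        pnmtops -noturn -rle "$pnm" > "$(eps_name "$pnm")"
    done
}
```

Calling convert_pages Chase_2007-02 would then write one EPS file next to each PNM file.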

Step 3: Convert the EPS Files to a Single PDF
I had never used Ghostscript before, but I had certainly heard of it. It always sounded kind of glamorous and mysterious - it's definitely mysterious. I tried to read its man-page and got lost nearly immediately (Adobe's docs always do this to me, too), so eventually I just found a recipe for doing what I wanted and stopped asking questions:

gs -q -dSAFER -dNOPAUSE -dBATCH -sOutputFile=Chase_2007-02.pdf -sDEVICE=pdfwrite Chase_2007-02_*.eps
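One thing to watch with the "*" in that command: the shell expands it in lexical order, so on a statement with ten or more page images, page 10 sorts ahead of page 2. The sketch below is one possible way around that, counting pages upward instead of globbing. It assumes pages are numbered from 1 with no gaps, and the names ordered_pages and merge_pages are hypothetical.

```shell
# Hypothetical helper: print the EPS page files for one statement in numeric
# page order (1, 2, ..., 10, 11), stopping at the first missing page number.
ordered_pages() {
    base=$1
    n=1
    while [ -e "${base}_${n}.eps" ]; do
        printf '%s\n' "${base}_${n}.eps"
        n=$((n + 1))
    done
}

# Hypothetical helper: merge the pages into BASE.pdf using the same gs recipe.
merge_pages() {
    base=$1
    set --                                  # reuse "$@" as the page list
    n=1
    while [ -e "${base}_${n}.eps" ]; do
        set -- "$@" "${base}_${n}.eps"
        n=$((n + 1))
    done
    [ "$#" -gt 0 ] || return 1              # nothing to merge
    gs -q -dSAFER -dNOPAUSE -dBATCH \
       -sOutputFile="${base}.pdf" -sDEVICE=pdfwrite "$@"
}
```

For statements under ten page images the glob and the counter give the same order, so this only matters for the fat ones.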

With all these new-found powers at my disposal, I wrote a Perl script. I had it bump the month up by one and scan in all the pages every time I hit the enter key. I gave it the ability to pass in the starting month/year and the base file name ("Chase") on the command-line. Then I found I was missing the odd statement now and then, so I gave myself the ability to type "skip 1" instead of just hitting enter - this skips a month (guess what "skip 4" does) and continues scanning. Then I came across a monthly statement that was in color and on a single side of each page, so I added more command-line arguments to switch to "Photo" mode at 300dpi and to use "ADF Front" instead of duplex. Then I started scanning statements from a bank that for some reason liked to end its months on the 15th. Being a precise kind of guy, I added the ability to optionally include the day in the statement date: Chase_2007-02-15.pdf.
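The month bumping and the "skip" command boil down to a bit of arithmetic. Here is a rough sketch of that logic in shell rather than Perl. The function names are invented, and it reflects my reading of "skip 1" as "jump over one month, then scan the next one", which may not match the real script exactly.

```shell
# Hypothetical helper: print the month that is STEP months after YEAR-MONTH,
# formatted as YYYY-MM. STEP defaults to 1.
advance_month() {
    year=$1
    month=${2#0}                 # drop a leading zero so "08" is not read as octal
    step=${3:-1}
    total=$(( year * 12 + (month - 1) + step ))
    printf '%04d-%02d\n' $(( total / 12 )) $(( total % 12 + 1 ))
}

# Hypothetical helper: turn one line of keyboard input into a month step.
# An empty line (just Enter) means 1; "skip N" means N + 1.
step_for_input() {
    case $1 in
        '')
            echo 1 ;;
        'skip '*)
            count=${1#skip }
            echo $(( $count + 1 )) ;;
        *)
            return 1 ;;          # unrecognised input
    esac
}
```

With these, advance_month 2007 12 prints 2008-01, and after a "skip 1" the step is 2, so a run that just scanned 2007-11 would move on to 2008-01.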

Then I ran into a brokerage statement that wanted to be landscape.

Step 1.5: Rotate the Damn Page 90 Degrees
Believe it or not, you need another program for this, called unpaper. This is a powerful and feature-rich utility, but I only use it to rotate the page. So the script only executes the following command if rotation is selected:

unpaper --pre-rotate -90 --no-processing 1 Chase_2007-02_1.x.pnm Chase_2007-02_1.pnm

Notice that the input file has that extra ".x." in it - I add that to the output of scanadf if rotation is selected.
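As a sketch of how the ".x." naming could drive the rotation step, here is a small shell loop that runs the same unpaper command over every raw page. The function names are hypothetical, and this is an illustration rather than the code in the actual script.

```shell
# Hypothetical helper: map a raw scan name (with ".x.") to the final page name.
rotated_name() {
    printf '%s\n' "${1%.x.pnm}.pnm"
}

# Hypothetical helper: rotate every raw page of one statement.
# Argument: the base name, e.g. Chase_2007-02
rotate_pages() {
    base=$1
    for raw in "${base}"_*.x.pnm; do
        [ -e "$raw" ] || continue           # the glob matched nothing
        unpaper --pre-rotate -90 --no-processing 1 "$raw" "$(rotated_name "$raw")"
    done
}
```

Because the rotated output lands on the normal file name, the later steps do not need to know whether rotation happened.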

What else could this amazing script possibly need? Well, I didn't like that the title displayed in my PDF Viewer was "Chase_2007-02_1.eps" - that seemed a little amateur. So the script also creates and then uses a pdfmark (link is a PDF) file that sets the Title to "Chase_2007-02", and also sets the CreationDate property to February 15, 2007, because really, why not.
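For the curious, a file like that is only a few lines of PostScript. Below is a minimal sketch of generating one from the shell. The function name make_pdfmark and its argument order are made up, and it assumes the title contains no parentheses or backslashes, which would need escaping.

```shell
# Hypothetical helper: print pdfmark text that sets Title and CreationDate.
# Arguments: TITLE YEAR MONTH DAY, e.g. Chase_2007-02 2007 02 15
make_pdfmark() {
    title=$1
    year=$2
    month=${3#0}                 # drop leading zeros so printf does not see octal
    day=${4#0}
    printf '[ /Title (%s)\n' "$title"
    printf '  /CreationDate (D:%s%02d%02d000000)\n' "$year" "$month" "$day"
    printf '  /DOCINFO pdfmark\n'
}
```

One way to use it would be to redirect the output to a file, say Chase_2007-02.pdfmark, and list that file after the EPS files on the gs command line so Ghostscript reads it as one more input.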

Perfect. I will never need to change this script again. Now let's scan some bi-weekly paystubs. Oops.

The Wrong Way
OK, I'm a professional developer, and I'm pretty good at what I do. I know the right thing at this point would have been to change the script to be able to specify the period. But at that moment, for some reason, I chose to copy the entire script, and change that copy into a bi-weekly one. Ugh. 

Later that same day I realized that manually scanning even the annual statements from retirement accounts and whatnot was kind of cumbersome. So of course that's when I fixed my mistake, combined the two scripts, and added annual periodicity, right? No. I cloned the script again. I'm so ashamed.

I lived with this abomination for nearly a week before it became the software equivalent of obscenities screamed in a silent church, and that was finally too much for me to handle. Never mind that it was perfectly functional - that's not the point. It needed refactoring. Since it is just a little utility script, after all, I compromised on the perfection. There are still three separate scripts, but they now only do the monthly/yearly/bi-weekly work. They use a common module that does the actual scanning and common option management, so while the design is still terrible, at least I have some code reuse.

I'm publishing the full source for it right here (zip) in case anyone wants to use it, adapt it, or improve it. Let me say one thing up-front, though: I am a terrible Perl programmer. My grasp of the syntax is nearly non-existent, requiring many trips to perldoc to figure out how to do the simplest things. And I realize that my Perl code looks like a C++ developer wrote it - there's a good reason for that. So if you've stumbled on this blog looking for scanning info, and you're a Perl master, have a good laugh at my expense. But please don't tell me about it.

Now if you'll excuse me, I have some quarterly (oh crap!) bills to scan.

02 January 2012

Happy 1986, Part 1

And Happy New Year! It's 1986 in the Year-a-Month project, and there are so many albums needed to flesh out my collection that I decided to split 1986 into two separate months to control my costs. 1986 is a tough year, because so many hard-rock bands from the 1970s and early 1980s started selling out and turning into pop-rock bands (I'm looking at you, Ted Nugent).

Black Sabbath: Seventh Star - Wikipedia says that this album was a major, intentional departure from the classic Sabbath sound. As usual, Wikipedia is absolutely right. I like the change of pace, though. The style is a lot more groove-metal than previous Sabbath albums, and using (yet another) different lead singer - in this case Glenn Hughes - lets Sabbath get away with sounding like an entirely different band. Sort of a vacation from themselves.

Joe Satriani: Not of This Earth - This one was a tough decision, because some of the track previews were pretty promising. But overall it was a little too electronica-ish for me, and I decided with all the other music that needed buying this month, I could afford to miss it. Does anyone else think he looks a lot like Adam Sandler in this photo?

Judas Priest: Turbo - In the spring of 1989, I was finishing up my freshman year of college, and my school had a compressed-schedule "Spring Term" in which students took only a single class, but that class was a full day every day for a month. Students in other majors went to cool places like France or the Galapagos, but I opted to take a 400-level Compiler Construction class. We were using Turbo C to build our compilers, and every time I invoked the "turbo" command on my development machine, I caught myself humming the title song to this album. I did very little partying that term.

Motorhead: Orgasmatron - This might get me blasted by die-hard Motorhead fans, but bear with me. I'm starting to notice that all Motorhead songs pretty much sound the same. Normally I would slam a band for this transgression (I'm looking at you, AC/DC). But for some reason I love the way Lemmy belts these songs out like he's just finished vomiting up the pills he popped a few minutes ago, and is thinking about popping some more.

Ozzy Osbourne: The Ultimate Sin - Ozzy is not the most exciting heavy metal act in the world, but this is a solid album. He seems to be a little more in the game than on the last one, and I caught myself tapping my foot from time to time, which is more than I could say for Bark At The Moon.

Ted Nugent: Little Miss Dangerous - As alluded to above, I have very little patience for Ted these days. It only took a few samples sporting electronic drums and prominent keyboards from this album to make me move on. Nugent's next album comes in 1988, and we'll see what he's up to then.

Next month I will conclude 1986 with Accept, Iron Maiden, Megadeth, and a bonus catch-up Motorhead.