Going paperless
Oct 10, 2008 Filed in: Mac
I recently bought a scanner and have “gone paperless” both at home and at the office. The way I have things set up, I have full-text-searchable PDF files of every paper document that passes through my hands. It’s working out pretty well, and I wanted to share how I do it.
I scan each document as it comes in, using the Fujitsu ScanSnap S300M. The ScanSnap creates a PDF image file in a location I set up ahead of time, so each time I scan a document I only have to press one button. Periodically, I run a batch script in Adobe Acrobat Pro to recognize the text inside these files and create fully text-searchable PDFs in another location. Finally, I use Yep, a shareware PDF “shoebox” program, to tag the files and ultimately to view them. Since the search function in Yep makes use of Apple’s Spotlight, searching the full text of thousands of PDF files is very fast.
Scanning
The scanner I use is the Fujitsu ScanSnap S300M, which is a tiny sheetfed scanner. It folds up to about the size of a footlong submarine sandwich. It is a duplex scanner, which means even if you are scanning a double-sided document you only have to feed it through once. The ScanSnap Manager software, although it has some annoying aspects that I’ll describe below, is set up for “one touch” operation. You set your preferences once: how you want your documents scanned and where you want them saved. Then for each document all you have to do is feed it into the scanner and press the button.
It’s great that the one-touch feature lets you set everything up just once, and then scan through documents at the scanner without having to fiddle with things on the computer. Unfortunately, the ScanSnap Manager insists on jumping to the foreground for every single page. So, for example, if you are scanning a 15-page document and want to browse the web or get some other work done in the meantime, the ScanSnap Manager window will pop up 15 times, blocking your view of your work and intercepting your mouse clicks (since the foreground window has mouse focus).
Here’s some free advice for the folks developing ScanSnap Manager. Keep ScanSnap Manager in the background! Since I have already told the program what I want it to do, I don’t need further notification unless something goes wrong. Even when something goes wrong, don’t make ScanSnap Manager jump to the foreground. In the Mac world, when your software program has an issue that needs my attention, it should bounce its Dock icon, and then wait for me to call it to the foreground.
Reading
ScanSnap turns my documents into PDF image files. In order to be able to search the text inside one of these files, I use Adobe Acrobat Pro to recognize the text and embed it into the file. Specifically, I use the “Batch Processing” feature to run a macro with the following steps: “Recognize Text Using OCR,” with the options “PDF Output Style: Searchable Image” and “Downsample: Lowest (600 dpi)”; and then “Embed All Page Thumbnails.” The files created by this batch script look just like the originals, except they contain fully searchable and selectable text.
I think Acrobat Pro is a poorly designed program in general, but that is a blog post for another day. However, I am very happy with the performance of its optical character recognition. For augmenting scanned documents with searchable text, Acrobat Pro is more than satisfactory.
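Since the scans and the searchable PDFs live in two different locations, it can help to see which scans still need a pass through the batch script. Here is a minimal Python sketch of that check. The function name is made up for illustration, and it assumes the batch script saves each output file under the same name as its input, which may not match your Acrobat settings.

```python
from pathlib import Path


def pending_scans(inbox, searchable):
    """Return names of PDFs in `inbox` that have no same-named file in `searchable`.

    Assumption: the OCR batch step keeps the original file name.
    """
    done = {p.name for p in Path(searchable).glob("*.pdf")}
    return sorted(p.name for p in Path(inbox).glob("*.pdf") if p.name not in done)
```

Calling it with the scan folder and the output folder returns a sorted list of file names that still need text recognition.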
Organizing
To keep track of these files, I use Yep, from Ironic Software. Yep can handle thousands of files (I have over 2700 right now), and organizes them by tags. Given such a large number of documents, there’s no way I could possibly tag them by hand, so I use Yep’s Auto-Tag feature. It generates a set of tags for each document based on the contents. It displays the most common tags in a blog-like “tag cloud.” Not surprisingly, my last name and my wife’s last name are the largest tags in the cloud for my scanned documents. When you click on one tag in the cloud, focus is restricted to documents with that tag, and the entire cloud is recalculated based on the focus set.
Auto-tagging is a great idea, but the implementation leaves a lot of room for improvement. The tags often have little to do with the document contents, and look more like a random sample of words than words that appear often or occupy important positions. It is rather mysterious that Yep generates only a few tags for each document, even long documents. For example, the 87-page Declaration of Restrictions for the condominium complex where I live has just thirteen tags, including “reference,” “bill,” and “Smith,” but not including “San Diego,” “condominium,” “restrictions,” “building,” or “plan.” Whatever algorithm Yep is using (I could not find any documentation on it), it is not very helpful. It would be much more useful for Yep to treat every word in the document as a potential tag, unless it’s on the excluded list. Sure, each document could have many tags, but the tag cloud helps keep the user’s focus on the most common tags.
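To make the suggestion concrete, here is a rough Python sketch of the kind of tagging I have in mind: every word is a candidate tag unless it is on an excluded list, and the cloud is recomputed over whichever documents carry the tag in focus. This is not Yep's algorithm, which I could not find documented. The function names and the tiny excluded list are made up for illustration, and the simple word pattern would miss two-word tags like “San Diego.”

```python
import re
from collections import Counter

# Hypothetical excluded list; a real one would be much longer.
EXCLUDED = {"the", "and", "of", "a", "to", "in", "is", "for"}


def candidate_tags(text, excluded=EXCLUDED):
    """Count every word in the text that is not on the excluded list."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in excluded)


def tag_cloud(docs, focus=None, top=10):
    """Return the most common tags as (tag, document count) pairs.

    `docs` maps a file name to its recognized text. If `focus` is given,
    only documents that carry that tag contribute to the cloud.
    """
    cloud = Counter()
    for text in docs.values():
        tags = candidate_tags(text)
        if focus is None or focus in tags:
            cloud.update(tags.keys())  # count documents, not occurrences
    return cloud.most_common(top)
```

With this approach a long document ends up with many tags, but the `top` cutoff in the cloud keeps attention on the ones shared by the most documents.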
So tags hold a lot of potential, but for now I mainly use the full-text search. If I want to find my dog’s rabies vaccination certificate, I just search for “rabies vaccination,” and Yep almost instantaneously finds 6 documents, all of them relevant. Note that Yep does not recognize “rabies” and “vaccination” as tags in any of these documents.
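Because the searchable PDFs are indexed by Spotlight, the same kind of full-text search can also be run outside Yep with the `mdfind` command-line tool that ships with Mac OS X. The Python sketch below wraps it. The function names and the folder path are placeholders, and this is only my guess at an equivalent query, not a description of how Yep talks to Spotlight.

```python
import subprocess


def build_mdfind_command(query, folder):
    """Build an mdfind invocation that limits the search to one folder."""
    return ["mdfind", "-onlyin", folder, query]


def spotlight_search(query, folder):
    """Run the Spotlight query (macOS only) and return matching PDF paths."""
    result = subprocess.run(
        build_mdfind_command(query, folder),
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines()
            if line.lower().endswith(".pdf")]
```

For example, `spotlight_search("rabies vaccination", "/path/to/searchable")` would list the matching PDFs in that folder, provided Spotlight has indexed it.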
Summary
- The Fujitsu ScanSnap S300M is a great little scanner, but its software needs more work
- Adobe Acrobat Pro does a good job of recognizing text in a scanned document, and making text-searchable PDF files
- Yep is a convenient way to keep track of scanned documents, but it needs more sensible auto-tagging





