Going paperless
Oct 10, 2008
I recently bought a scanner and have “gone paperless” both at home and at the office. The way I have things set up, I have full-text-searchable PDF files of every paper document that passes through my hands. It’s working out pretty well, and I wanted to share how I do it. I scan each document as it comes in, using the Fujitsu ScanSnap S300M. The ScanSnap creates a PDF image file in a location I set up ahead of time, so each time I scan a document I only have to press one button. Periodically, I run a batch script in Adobe Acrobat Pro to recognize the text inside these files and create fully text-searchable PDFs in another location. Finally, I use Yep, a shareware PDF “shoebox” program, to tag the files and ultimately to view them. Since the search function in Yep makes use of Apple’s Spotlight, searching the full text of thousands of PDF files is very fast.
Scanning
The scanner I use is the Fujitsu ScanSnap S300M, which is a tiny sheetfed scanner. It folds up to about the size of a footlong submarine sandwich. It is a duplex scanner, which means even if you are scanning a double-sided document you only have to feed it through once. The ScanSnap Manager software, although it has some annoying aspects that I’ll describe below, is set up for “one touch” operation. You set up your preferences once, of how you want your documents scanned and where you want them saved. Then for each document all you have to do is feed it into the scanner and press the button.
It’s great that the one-touch feature lets you set everything up just once, and then scan through documents at the scanner without having to fiddle with things on the computer. Unfortunately, the ScanSnap Manager insists on jumping to the foreground for every single page. So, for example, if you are scanning a 15-page document and want to browse the web or get some other work done in the meantime, the ScanSnap Manager window will pop up 15 times, blocking your view of your work and intercepting your mouse clicks (since the foreground window has mouse focus).
Here’s some free advice for the folks developing ScanSnap Manager. Keep ScanSnap Manager in the background! Since I have already told the program what I want it to do, I don’t need further notification unless something goes wrong. Even when something goes wrong, don’t make ScanSnap Manager jump to the foreground. In the Mac world, when your software program has an issue that needs my attention, it should bounce its Dock icon, and then wait for me to call it to the foreground.
Reading
ScanSnap turns my documents into PDF image files. In order to be able to search the text inside one of these files, I use Adobe Acrobat Pro to recognize the text and embed it into the file. Specifically, I use the “Batch Processing” feature to run a macro with the following steps: “Recognize Text Using OCR,” with the options “PDF Output Style: Searchable Image” and “Downsample: Lowest (600 dpi);” and then “Embed All Page Thumbnails.” The files created by this batch script look just like the originals, except they contain fully searchable and selectable text.
I think Acrobat Pro is a poorly-designed program in general, but that is a blog post for another day. However, I am very happy with the performance of its optical character recognition. For augmenting scanned documents with searchable text, Acrobat Pro is more than satisfactory.
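The periodic batch run only needs to touch scans that do not yet have a searchable counterpart. Here is a rough sketch in Python of how those pending files could be listed. It assumes (hypothetically) that the scanner saves into one folder, that the batch script writes into a second folder, and that each output file keeps the name of its input. The folder names below are placeholders, not my actual setup, and the text recognition itself still happens in Acrobat Pro.

```python
from pathlib import Path


def pending_scans(inbox, searchable):
    """Return scanned PDFs in `inbox` that have no same-named file in `searchable`."""
    inbox, searchable = Path(inbox), Path(searchable)
    # Names of files the batch step has already produced.
    done = {p.name for p in searchable.glob("*.pdf")}
    # Anything scanned but not yet present in the output folder is pending.
    return sorted(p for p in inbox.glob("*.pdf") if p.name not in done)


if __name__ == "__main__":
    # Hypothetical folder names, used only for illustration.
    scans = Path.home() / "Scans" / "Inbox"
    output = Path.home() / "Scans" / "Searchable"
    for pdf in pending_scans(scans, output):
        print(pdf.name)
```

Matching by file name keeps the check simple, but it would miss a scan that was renamed after processing.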
Organizing
To keep track of these files, I use Yep, from Ironic Software. Yep can handle thousands of files (I have over 2700 right now), and organizes them by tags. Given such a large number of documents, there’s no way I could possibly tag them by hand, so I use Yep’s Auto-Tag feature. It generates a set of tags for each document based on the contents. It displays the most common tags in a blog-like “tag cloud.” Not surprisingly, my last name and my wife’s last name are the largest tags in the cloud for my scanned documents. When you click on one tag in the cloud, focus is restricted to documents with that tag, and the entire cloud is recalculated based on the focus set.
Auto-tagging is a great idea, but the implementation leaves a lot of room for improvement. The tags often have little to do with the document contents, and look more like a random sample of words rather than words that appear often or occupy important positions. It is rather mysterious that Yep generates only a few tags for each document, even long documents. For example, the 87-page Declaration of Restrictions for the condominium complex where I live has just thirteen tags, including “reference,” “bill,” and “Smith,” but not including “San Diego,” “condominium,” “restrictions,” “building,” or “plan.” Whatever algorithm Yep is using (I could not find any documentation on it), it is not very helpful. It would be much more useful for Yep to treat every word in the document as a potential tag, unless it’s on the excluded list. Sure, each document could have many tags, but the tag cloud helps keep the user’s focus on the most common tags.
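To make the suggestion concrete, here is a minimal sketch in Python of the "every word is a potential tag" idea, together with a tag cloud that is recalculated when focus is restricted to one tag. This is not Yep's algorithm, which I could not find documented. The excluded list, the function names, and the sample documents are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical excluded list; a real one would be much longer.
EXCLUDED = {"a", "an", "and", "for", "in", "is", "of", "or", "that", "the", "to", "with"}


def extract_tags(text, excluded=EXCLUDED):
    """Treat every distinct word as a tag unless it is on the excluded list."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in excluded}


def tag_cloud(doc_tags, focus=None, top=10):
    """Count how many documents carry each tag.

    If `focus` is given, only documents that have that tag are counted,
    so the cloud is recalculated for the focus set.
    """
    counts = Counter()
    for tags in doc_tags.values():
        if focus is None or focus in tags:
            counts.update(tags)
    return counts.most_common(top)


# Made-up sample documents (file name -> recognized text).
docs = {
    "declaration.pdf": "Declaration of Restrictions for the condominium building plan",
    "vet.pdf": "Rabies vaccination certificate for the dog",
    "hoa.pdf": "Condominium association bill",
}
doc_tags = {name: extract_tags(text) for name, text in docs.items()}

print(tag_cloud(doc_tags, top=3))           # most common tags overall
print(tag_cloud(doc_tags, focus="rabies"))  # cloud restricted to one tag
```

Each document ends up with many tags this way, but limiting the cloud to the top few keeps the display manageable, which is the trade-off described above.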
So tags hold a lot of potential, but for now I mainly use the full text search. If I want to find my dog’s rabies vaccination certificate, I just search for “rabies vaccination,” and Yep almost instantaneously finds 6 documents—all of them relevant. Note that Yep does not recognize “rabies” and “vaccination” as tags in any of these documents.
Summary
- The Fujitsu ScanSnap S300M is a great little scanner, but its software needs more work
- Adobe Acrobat Pro does a good job of recognizing text in a scanned document, and making text-searchable PDF files
- Yep is a convenient way to keep track of scanned documents, but it needs more sensible auto-tagging
How I made this site
Jun 22, 2008
I created this website using RapidWeaver 4, from Realmac Software, and CSSEdit, from MacRabbit software. I previously used GoLive, from Adobe, but had grown frustrated with it. I tried Dreamweaver, also from Adobe, but found its interface too complicated. Later on I tried iWeb, from Apple, but found it too limited. RapidWeaver offers a great combination of handholding and flexibility, and generates web pages automatically from individual blog entries. Although the blog format is not exactly what I want, it works well enough and offers lots of benefits. Most importantly, using CSSEdit I was able to modify the site template (included with RapidWeaver) to suit my own needs.
When I went on the econ job market in 2003-2004, I created my online resume website using Adobe GoLive. I wanted a simple design that looked professional without being slick or pretentious. It also had to be easy for people to find and download my papers. I had used GoLive for a few years at that time for some personal projects that were definitely more pretentious and hopefully more slick, but I found that GoLive didn’t work particularly well for my text-heavy resume. I ended up using the source code editor view almost exclusively, since the page layout view made it too difficult to move different pieces around without messing up the div tags.
I stuck with GoLive until recently just because I didn’t have any other good alternatives. About a year ago, Adobe announced that GoLive would be deprecated in favor of Dreamweaver. I tried Dreamweaver out, but found its interface completely impenetrable. What’s more, both GoLive and Dreamweaver are complicated development environments where the user is supposed to have a complete and detailed understanding of every aspect of web page design. I am not a real web designer, nor do I have the time to become one. I just want a website that looks nice and does what I need it to. (Real web designers may want to look at Coda, from Panic.)
Having given up on Dreamweaver, I took a look at Apple’s iWeb program. iWeb comes with lots of templates, and just needs you to fill in the blanks with your content. I thought the blog module was particularly nice, since all you need to do is write your blog posts and store them in iWeb’s database; iWeb will then generate the blog website for you. I began to realize that what I really need is a way to automatically generate a web site from a database of my papers. In this paradigm, I would design the website once, and then just add or update each paper as I progressed in my research. I created an entire website offline in iWeb in just a day; this was a revelation compared to GoLive.
Unfortunately, iWeb turned out to be far too limited for my purposes. The blog posts look ugly if the post title runs over to a second line, unless I retouch each post by hand. I also can’t edit the templates. There is no way to organize blog posts by category.
I started looking for an alternative, and found good reviews for RapidWeaver. Since version 4 would soon be released, I bided my time. When version 4 finally came out about a month ago, I downloaded it and got to work. RapidWeaver works a lot like iWeb, but with a lot more flexibility. Like iWeb, RapidWeaver comes with a variety of themes, but there are lots more options to play with. It handles both categories and tags, which I think will be really helpful when I have more papers. Even better, I can edit the themes directly (I already had to learn the rudiments of HTML and CSS when I used GoLive).
I played around with the themes for a while, and settled on this one, which is my own modified version of “Caribou.” Based on the recommendation of Realmac Software, I bought CSSEdit to make these modifications. It’s not that CSS code is hard to write; it’s just nice to have an editor that is aware of the syntax, as well as a user interface that makes all of the CSS properties available so I don’t have to look them up on the web.
In the end, RapidWeaver still isn’t quite ideal for what I want to do, but I think I’ve managed to fit most of what I wanted to do into RapidWeaver’s blog paradigm. I highly recommend RapidWeaver for anyone who can fit their web page into the paradigm of a blog, is willing to work with preformatted templates, and doesn’t want to spend a lot of time.
Summary
- RapidWeaver 4 is a great way to generate a powerful, good looking website, especially in a blog format
- CSSEdit is a convenient way to modify RapidWeaver templates
- iWeb is fun but not very flexible
- GoLive is outdated, complicated to use, and makes it too easy to write bad code
- Dreamweaver is poorly designed and complicated to use
Software I use
Jun 21, 2008
I’m planning to blog occasionally about Mac software that I use in the course of my work. To get started, though, I’ll simply list the main software programs that I use professionally, divided by whether I use them for research purposes or only for general purposes. Within each category, they are organized from greatest use to least use.
Research
- Safari (Apple)
- TeXShop (open source)
- TeXLive (open source)
- Preview (Apple)
- Keynote (Apple)
- BibDesk (open source)
- Mathematica (Wolfram)
- OmniGraffle Pro (Omni Group)
- Illustrator (Adobe)
- Mail (Apple)
- iCal (Apple)
- Things (Cultured Code)
- Address Book (Apple)
- RapidWeaver (Realmac)
- CSSEdit (MacRabbit)
- Transmit (Panic)
- Acrobat Pro (Adobe)
- Numbers (Apple)
- OmniOutliner Pro (Omni Group)
- Pages (Apple)





