I discovered gocr the other day. Conveniently there’s a fink package for it (albeit version 0.39-11 versus the current 0.41). It’s pretty easy to use (for a command-line app), and it does actually manage to do some recognition. Unfortunately the output seems to contain lots of stray whitespace and frequent misrecognition of lookalike glyph pairs.
At pretty much the same time there was an announcement on Slashdot that Google had open-sourced Tesseract, a technology originally from HP dating back to around 1993. Unfortunately my attempt to compile it under Mac OS X failed because it wanted Linux’s limits.h file. At that point I gave up. Interesting to note that Google are hunting for OCR experts though. More food for their search engine.
Then it’s over to SpamAssassin, where there seem to be several different options for OCR plugins. I’ve had a go at installing one but am not convinced it’s working. I’m currently collecting GIFs from spam messages to experiment with. Some of them even have the same checksum, which should make it pretty easy to detect them via a lookup.
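The checksum lookup idea could be sketched roughly like this in Python. This is only an illustration under my own assumptions: SHA-256 stands in for whatever checksum you prefer, the function names are made up for the example, and none of it comes from an actual SpamAssassin plugin.

```python
import hashlib
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 hex digest of a file (the checksum choice is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large attachments need not fit in memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_known_checksums(directory):
    """Checksum every .gif in a directory of collected spam images (hypothetical layout)."""
    return {file_checksum(p) for p in Path(directory).glob("*.gif")}


def is_known_spam_image(path, known_checksums):
    """True if this file is byte-for-byte identical to an already collected image."""
    return file_checksum(path) in known_checksums
```

A lookup like this only catches exact byte-for-byte repeats. An image that differs by even one byte gets a different checksum, so it would complement OCR rather than replace it.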