ergophobe - 6:14 am on Dec 19, 2010 (gmt 0)
>>(Bach, Mozart, Beethoven) - start 1700
Good one - the peak due to the spread of radio?
>>"long" or "medial s" character
That's one I was thinking of, but lots of serif fonts are very thin at the tops of the circles, so "o" can become "ii".
I have to say that 15 years ago I needed to publish an edition of some older articles and, to save typing, ran OCR on them. In some cases I got entire pages of "iiiii"s.
Recently, I had to do something similar and ended up with something around 95% accuracy or higher. OCR has improved dramatically. Even within Google Books, I think it has gotten noticeably better over the course of the project.
So I'm not knocking it. I can now find things in an hour that, as recently as five years ago, would have taken me months of sending query letters to leading experts in the hope that someone could answer (a recent mystery solved thanks to Google Books: the meaning of the "mal de saint Eloi"). It is phenomenal, and it's great to finally have something like this in English.
As I say, though, despite the much smaller set of texts, the French project still has some great advantages. For example, I can search for, to take something from your last post, all occurrences where "language" and "consciousness" occur within four words of each other.
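For anyone wondering what a proximity search like that amounts to, here is a rough Python sketch. It is only an illustration, not how the French project actually implements it: the function name `within_n_words` is made up, the sample sentence is invented, and I am assuming "within four words" means the two word positions differ by at most four.

```python
import re

def within_n_words(text, a, b, n=4):
    """Return (pos_a, pos_b) word-index pairs where a and b sit at most n words apart.

    Hypothetical helper for illustration; "within n words" is assumed to mean
    that the word positions differ by at most n.
    """
    # Crude tokenizer: lowercase the text and keep runs of word characters.
    words = re.findall(r"\w+", text.lower())
    a, b = a.lower(), b.lower()
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    # Compare every occurrence of a against every occurrence of b.
    return [(i, j) for i in pos_a for j in pos_b if 0 < abs(i - j) <= n]

# Invented sample text, not taken from either corpus.
sample = ("Language shapes thought, and some argue that "
          "consciousness depends on language itself.")
print(within_n_words(sample, "language", "consciousness"))  # → [(10, 7)]
```

Only the second "language" counts here: it sits three words after "consciousness", while the first one is seven words before it. A real system would work from a positional index rather than rescanning the text, but the matching rule is the same idea.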
I expect to see this sort of thing get better and better, combining the best features of existing systems and adding in intelligence you and I have perhaps not even thought of. In 15-20 years, people will look at this discussion and marvel at how primitive and brutish our methods are.