In my multimedia gallery, some of the image exhibits contain text that might be helpful to extract and index for searches (internal and external).
To that end I've been looking for a while for a pure-Java OCR package that is not too expensive or slow, but can do a half-reasonable job (it does not have to be anything like perfect!) of extracting mainly-English text in a mixture of layouts and orientations from photographs (eg of road signs, posters, etc).
Does any such thing exist? I cannot see it if so... There might be an academic/student/masters project that I've missed as I bet this is still something of a research topic!
PLEASE DO NOT drop URLs or product names into the thread or it will be nuked (SM me if you have such details!), but I am interested in general experiences if any.
Rgds
Damon
Yes, it more-or-less has to be Java because the site is distributed in a WAR and runs on three different platforms (at once) ie Windows/Intel, Linux/Intel and Solaris/SPARC.
Now a Java interface to equivalent native libraries (eg DLLs, shared objects, etc) for the appropriate platforms would be OK, though less safe and more of a pain to manage.
At a pinch I *could* extract the information just once on the master (Solaris) server, but that would be less good for various reasons.
Performance and accuracy are less critical than safety (ie I can't have an OCR failure bring the whole site down).
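To make the safety point concrete, here is a minimal sketch of one way to fence off OCR calls inside the webapp. Everything in it is an assumption for illustration: `SafeOcr` and `extractOrEmpty` are hypothetical names, and the `Callable` stands in for whatever OCR call ends up being used. It runs the task on a daemon worker thread, gives up after a timeout, and turns any failure into an empty string so the page still renders without the extracted text.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: not part of any real OCR package.
final class SafeOcr {
    private SafeOcr() { }

    /**
     * Runs the supplied OCR task on a worker thread and waits at most
     * timeoutMillis for it. Returns "" on timeout, interruption, or any
     * exception or error thrown by the task.
     */
    static String extractOrEmpty(Callable<String> ocrTask, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "ocr-worker");
            t.setDaemon(true); // a stuck OCR thread must not block JVM shutdown
            return t;
        });
        Future<String> result = worker.submit(ocrTask);
        try {
            String text = result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return text == null ? "" : text;
        } catch (Exception e) {
            // Covers TimeoutException, ExecutionException (which wraps anything
            // the task threw) and InterruptedException.
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            }
            result.cancel(true); // ask the worker to stop if it is still running
            return "";
        } finally {
            worker.shutdownNow();
        }
    }
}
```

Usage would look something like `String text = SafeOcr.extractOrEmpty(() -> myOcr.read(image), 5000);`, where `myOcr.read` is a placeholder. Note the limits: this only contains slow or exception-throwing Java code. A crash inside a native library called through JNI would still take the JVM down with it, and a task that ignores interruption keeps its thread busy after the timeout, so native code would more likely need a separate process.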
But in any case, thanks for the input so far!
Rgds
Damon