Msg#: 4496488 posted 1:25 am on Sep 18, 2012 (gmt 0)
IP: 22.214.171.124/25 within 126.96.36.199/19 (France, Jaguar. I assume this is jaguar-network dot com and not the car)
UA: HttpComponents/1.1
I wouldn't normally notice a robot I've only met twice,* but this one's got an oddity: it only eats PDFs. First saw it in May when it ate all three PDFs from an extremely obscure directory (took it a month to find them). Just showed up again to eat a brand-new PDF that's only been indexed a few days.
What do you suppose it does with them?
And, for that matter, how does it find them? Does it hit up the search engines every 24 hours, or possibly every five minutes, for a list of newly indexed PDFs?
* Found one recently that had been slipping under the radar almost daily for several months.
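For anyone who wants to check their own logs for the same pattern, here is a rough sketch in Python. It assumes the standard Apache/nginx "combined" log format, and the names `LOG_LINE` and `pdf_only_visitors` are made up for this example; adjust the regex if your log layout differs. It groups requests by IP and user-agent and keeps only the visitors whose every request was for a PDF.

```python
import re
from collections import defaultdict

# Assumption: logs are in the "combined" format
# (ip, ident, user, [time], "request", status, bytes, "referer", "agent").
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def pdf_only_visitors(lines):
    """Return {(ip, agent): [paths]} for visitors that requested nothing but PDFs."""
    seen = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line)
        if m:  # silently skip lines that do not fit the assumed format
            seen[(m["ip"], m["agent"])].append(m["path"])
    return {
        who: paths
        for who, paths in seen.items()
        # Ignore any query string when checking the file extension.
        if all(p.split("?", 1)[0].lower().endswith(".pdf") for p in paths)
    }
```

Feed it the lines of an access log (for example `pdf_only_visitors(open("access.log"))`, where the file name is whatever your server uses) and look at what comes back. A human who followed a direct link to one PDF will show up too, so the list is a starting point for eyeballing rather than proof of a robot.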
Msg#: 4496488 posted 5:30 am on Sep 18, 2012 (gmt 0)
FWIW, Google indexes PDFs on a separate crawl.
Yes, they use a special PDFbot that wears fewer clothes than the usual Googlebot, presumably because PDFs tend to be heavy. In fact Google indexed this particular PDF before it finished the three fat pages of which the PDF is only a snippet.
Msg#: 4496488 posted 7:43 am on Sep 18, 2012 (gmt 0)
The Apache HttpComponents™ project is responsible for creating and maintaining a toolset of low level Java components focused on HTTP and associated protocols. This project functions under the Apache Software Foundation (http://www.apache.org), and is part of a larger community of developers and users.