
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
do jaguars eat pdfs?
lucy24




msg:4496490
 1:25 am on Sep 18, 2012 (gmt 0)

IP: 85.31.219.0/25 within 85.31.192.0/19 (France, Jaguar. I assume this is jaguar-network dot com and not the car)
UA: HttpComponents/1.1
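
For anyone who wants to test their own log entries against the range and UA above, here is a minimal sketch using Python's standard ipaddress module. The function name and the sample addresses are made up for illustration; only the two CIDR ranges and the UA string come from this post.

```python
import ipaddress

# Ranges quoted in the post above.
JAGUAR_BLOCK = ipaddress.ip_network("85.31.192.0/19")
OBSERVED_RANGE = ipaddress.ip_network("85.31.219.0/25")


def looks_like_pdf_eater(ip, user_agent):
    """Hypothetical helper: True when the IP is in the observed /25
    and the UA starts with the HttpComponents token."""
    in_range = ipaddress.ip_address(ip) in OBSERVED_RANGE
    return in_range and user_agent.startswith("HttpComponents/")


# The /25 really is nested inside the /19 (subnet_of needs Python 3.7+).
print(OBSERVED_RANGE.subnet_of(JAGUAR_BLOCK))
# Sample address chosen for illustration only.
print(looks_like_pdf_eater("85.31.219.10", "HttpComponents/1.1"))
```

Matching on the UA alone would be unreliable, since HttpComponents is a general-purpose Java library that any client can use, so the sketch requires the IP range as well.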

I wouldn't normally notice a robot I've only met twice,* but this one's got an oddity: it only eats PDFs. First saw it in May when it ate all three PDFs from an extremely obscure directory (took it a month to find them). Just showed up again to eat a brand-new PDF that's only been indexed a few days.

What do you suppose it does with them?

And, for that matter, how does it find them? Does it hit up the search engines every 24 hours, or possibly every five minutes, for a list of newly indexed PDFs?


* Found one recently that had been slipping under the radar almost daily for several months.
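
One way to spot visitors like this without reading logs by eye is to group requests by client and keep only the clients that never asked for anything but PDFs. The sketch below assumes the Apache "combined" log format; the regular expression, the function name, and the sample lines are illustrative assumptions, not taken from any real log.

```python
import re
from collections import defaultdict

# Assumes Apache "combined" log format; adjust the pattern for other formats.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)


def pdf_only_clients(log_lines):
    """Return {(ip, ua): [paths]} for clients whose every request was a PDF."""
    requests = defaultdict(list)
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match:
            requests[(match["ip"], match["ua"])].append(match["path"])
    return {
        client: paths
        for client, paths in requests.items()
        # Ignore any query string before checking the extension.
        if all(p.split("?", 1)[0].lower().endswith(".pdf") for p in paths)
    }


# Made-up sample lines for illustration.
sample = [
    '85.31.219.10 - - [18/Sep/2012:01:00:00 +0000] "GET /obscure/a.pdf HTTP/1.1" 200 1234 "-" "HttpComponents/1.1"',
    '85.31.219.10 - - [18/Sep/2012:01:00:05 +0000] "GET /obscure/b.pdf HTTP/1.1" 200 2345 "-" "HttpComponents/1.1"',
    '192.0.2.5 - - [18/Sep/2012:01:01:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.5 - - [18/Sep/2012:01:01:02 +0000] "GET /obscure/a.pdf HTTP/1.1" 200 1234 "-" "Mozilla/5.0"',
]
for client, paths in pdf_only_clients(sample).items():
    print(client, paths)
```

This only reports what a client fetched. It says nothing about how the client discovered the URLs, which is the open question here.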

 

wilderness




msg:4496492
 1:38 am on Sep 18, 2012 (gmt 0)

FWIW, Google indexes PDFs on a separate crawl. Furthermore, when crawling PDFs, Google is not robots.txt compliant.

I store my PDFs (used to have many) in image folders that robots.txt disallows for all bots, and Google wouldn't even slow down when entering such a directory for a PDF.

Thus I'm assuming your Jaguar is simply following the leader.
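
To double-check that a robots.txt rule covers a given PDF path as intended, Python's standard urllib.robotparser can evaluate it offline. This is only a sketch: the /images/ directory, the file names, and the helper name are assumptions, since the actual robots.txt is not shown in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /images/ path is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""


def is_allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if the rules permit user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print(is_allowed("HttpComponents/1.1", "/images/manual.pdf"))
print(is_allowed("HttpComponents/1.1", "/index.html"))
```

robots.txt is advisory, so a check like this only confirms what the file asks for. Whether a crawler honors it still has to be read from the access logs.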

lucy24




msg:4496552
 5:30 am on Sep 18, 2012 (gmt 0)

FWIW, Google indexes PDF's on a separate crawl.

Yes, they use a special PDFbot that wears fewer clothes than the usual Googlebot, presumably because PDFs tend to be heavy. In fact, Google indexed this particular PDF before it had finished crawling the three fat pages of which the PDF is only a snippet.

keyplyr




msg:4496583
 7:43 am on Sep 18, 2012 (gmt 0)



The Apache HttpComponents™ project is responsible for creating and maintaining a toolset of low level Java components focused on HTTP and associated protocols. This project functions under the Apache Software Foundation (http://www.apache.org), and is part of a larger community of developers and users.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved