
Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

meet the pdfbot
New Variation of Googlebot?

lucy24 (WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month)

Msg#: 4451193 posted 7:26 am on May 9, 2012 (gmt 0)

Look what I found while (stop me if you've heard this one) looking for something else. I'm quoting this in full because you have to look closely.

- - [05/May/2012:20:59:01 -0700] "GET /directory/subdirectory/subsubdirectory/file_two.pdf HTTP/1.1" 200 125835 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
...
- - [05/May/2012:22:43:12 -0700] "GET /directory/subdirectory/subsubdirectory/file_one.pdf HTTP/1.1" 200 109385 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
- - [05/May/2012:22:49:50 -0700] "GET /directory/subdirectory/subsubdirectory/file_four.pdf HTTP/1.1" 200 329390 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
- - [05/May/2012:22:49:53 -0700] "GET /directory/subdirectory/subsubdirectory/file_three.pdf HTTP/1.1" 200 193709 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
- - [05/May/2012:22:54:29 -0700] "GET /directory/subdirectory/subsubdirectory/frontfile.pdf HTTP/1.1" 200 19849 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

Note the IP. This is not a spoofer.
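For anyone who wants to repeat the "not a spoofer" check: Google's documented way to verify its crawler is a reverse DNS lookup on the requesting IP, a check that the hostname falls under googlebot.com or google.com, and then a forward lookup that must return the same IP. Below is a minimal Python sketch of that procedure, not necessarily how the check was done in this post. The function names and the injectable resolver parameters are hypothetical, and the default forward lookup only covers IPv4.

```python
import socket

# Domains Google documents for genuine crawler hostnames.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip):
    """Default reverse lookup: IP -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    """Default forward lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_real_googlebot(ip, reverse=_reverse, forward=_forward):
    """Return True only if ip reverse-resolves to a Google crawler host
    and that host forward-resolves back to the same ip.

    reverse and forward are hypothetical hooks so the logic can be
    exercised without touching the network.
    """
    try:
        host = reverse(ip)
    except OSError:  # no PTR record, lookup failure, etc.
        return False
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False
```

The user-agent string is never consulted here, which is the point: anyone can send "Googlebot/2.1", but a spoofer cannot make both DNS lookups agree.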

You can tell the pdfbot got its shopping list from somewhere else, because it didn't start with the front file-- the only one that's linked from an html page.

So go back a day:

- - [04/May/2012:09:05:07 -0700] "GET /directory/subdirectory/subsubdirectory/frontfile.pdf HTTP/1.1" 200 19849 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
- - [04/May/2012:09:32:06 -0700] "GET /directory/subdirectory/subsubdirectory/file_three.pdf HTTP/1.1" 200 193709 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
- - [04/May/2012:09:32:07 -0700] "GET /directory/subdirectory/subsubdirectory/file_four.pdf HTTP/1.1" 200 329390 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
- - [04/May/2012:09:32:09 -0700] "GET /directory/subdirectory/subsubdirectory/file_one.pdf HTTP/1.1" 200 109385 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
- - [04/May/2012:09:47:16 -0700] "GET /directory/subdirectory/subsubdirectory/file_two.pdf HTTP/1.1" 200 125835 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

That UA is familiar. I don't have a lot of pdfs, but the Googlebot-- and also the Googlebot-Mobile-- will occasionally pick up one that's attached to an e-book.

This particular batch of pdfs is new to the site. They used to live somewhere else, with only a nofollow link pointing to the top one ("frontfile.pdf"). The link is in such an obscure location, it took Google almost two weeks to discover the change. Followed, apparently, by 27 minutes to find and process the links within the pdf ... and then a day and a half to make contact with the pdfbot.
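The "27 minutes" and "a day and a half" figures can be reproduced from the log timestamps quoted in this post. A small Python sketch of the arithmetic follows. The variable names are mine, and taking the last regular Googlebot fetch as the starting point for the second interval is an assumption about how the gap was measured. It also assumes an English or C locale so that "May" parses with %b.

```python
from datetime import datetime

# Apache-style timestamp as it appears between the square brackets.
LOG_TIME = "%d/%b/%Y:%H:%M:%S %z"


def parse(ts):
    """Parse one bracketed log timestamp into an aware datetime."""
    return datetime.strptime(ts, LOG_TIME)


front_fetched = parse("04/May/2012:09:05:07 -0700")   # regular Googlebot gets frontfile.pdf
first_linked = parse("04/May/2012:09:32:06 -0700")    # first linked PDF fetched (file_three.pdf)
last_regular = parse("04/May/2012:09:47:16 -0700")    # last regular Googlebot fetch (file_two.pdf)
first_stripped = parse("05/May/2012:20:59:01 -0700")  # first fetch with the stripped-down UA

link_lag = first_linked - front_fetched
pdfbot_lag = first_stripped - last_regular

print(link_lag)    # 0:26:59
print(pdfbot_lag)  # 1 day, 11:11:45
```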

I have never seen that stripped-down Googlebot UA before. Granted, I didn't fine-tooth-comb raw logs going back to the dawn of time. But I looked at some pretty representative chunks. Never anything but the normal Mozilla-clad Googlebot. Or Googlebot-Mobile, which wears even more clothes. Only spoofers use the minimalist version.
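For anyone who wants to comb their own raw logs for the same thing, here is a rough Python sketch that parses a combined-format log line and sorts the user agent into "stripped-down" versus "Mozilla-clad" Googlebot. The category labels and function names are made up for illustration, the regex assumes the standard combined log layout, and the IP in the sample line is a documentation placeholder (192.0.2.10), not the address from the logs above.

```python
import re

# Combined Log Format: host ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# The two UA strings quoted in this thread.
STRIPPED_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"
FULL_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def classify_agent(agent):
    """Map a user-agent string to a hypothetical category label."""
    if agent == STRIPPED_UA:
        return "googlebot-stripped"
    if agent == FULL_UA:
        return "googlebot-full"
    if "Googlebot-Mobile" in agent:
        return "googlebot-mobile"
    if "Googlebot" in agent:
        return "googlebot-other"
    return "other"


def classify_line(line):
    """Return (request, category) for a log line, or None if it does not parse."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return m.group("request"), classify_agent(m.group("agent"))


# Sample line modelled on the first excerpt; the IP is a placeholder.
sample = (
    '192.0.2.10 - - [05/May/2012:20:59:01 -0700] '
    '"GET /directory/subdirectory/subsubdirectory/file_two.pdf HTTP/1.1" '
    '200 125835 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"'
)
print(classify_line(sample))
```

A label of "googlebot-stripped" only says what the request claims to be. Whether it is the real thing or a spoofer still comes down to the IP.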




5+ Year Member

Msg#: 4451193 posted 12:30 pm on May 9, 2012 (gmt 0)

I've recorded encounters with precisely that stripped-down G-bot UA.

First, 132 hits on 05 and 07/Aug/2011, various IPs (none real G IPs), and all were RFI or traversal exploit attempts.

Then, 8 hits on 05/Mar/2012 from (server2.fightstrike.com), all were zeroboard exploit attempts.


dstiles (WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member)

Msg#: 4451193 posted 7:48 pm on May 9, 2012 (gmt 0)

Thanks for the heads-up, Lucy.

I've had that UA enabled for a long time without following how it worked. Trawling through my security logs, I see the UA only hits a few pages of one site: pages that have PDFs.

Now blocked. I also added
to the blocked list. We don't use adsense anywhere, although G does visit to check (surely it should know?!) and we do not (usually) allow image-cutting.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved