
Forum Moderators: Ocean10000 & incrediBILL & keyplyr


meet the pdfbot

New Variation of Googlebot?

7:26 am on May 9, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
votes: 459

Look what I found while (stop me if you've heard this one) looking for something else. I'm quoting this in full because you have to look closely.

- - [05/May/2012:20:59:01 -0700] "GET /directory/subdirectory/subsubdirectory/file_two.pdf HTTP/1.1" 200 125835 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
...
- - [05/May/2012:22:43:12 -0700] "GET /directory/subdirectory/subsubdirectory/file_one.pdf HTTP/1.1" 200 109385 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
- - [05/May/2012:22:49:50 -0700] "GET /directory/subdirectory/subsubdirectory/file_four.pdf HTTP/1.1" 200 329390 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
- - [05/May/2012:22:49:53 -0700] "GET /directory/subdirectory/subsubdirectory/file_three.pdf HTTP/1.1" 200 193709 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
- - [05/May/2012:22:54:29 -0700] "GET /directory/subdirectory/subsubdirectory/frontfile.pdf HTTP/1.1" 200 19849 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

Note the IP. This is not a spoofer.
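
For anyone who wants to run the spoofer check on their own logs: the usual approach is a reverse DNS lookup on the IP, then a forward lookup on the hostname that comes back. Here is a minimal Python sketch of that idea, not anything taken from the logs above. The function name and the injectable `reverse`/`forward` parameters are hypothetical, and the hostname suffixes are an assumption based on what Google documents for its crawlers.

```python
import socket

# Assumption: hostname suffixes Google documents for its crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip, reverse=None, forward=None):
    """Return True if ip reverse-resolves to a Google host that resolves back to ip.

    reverse and forward are hypothetical hooks so the check can be tried offline;
    by default they use the system resolver (IPv4 only for the forward lookup).
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)
        # The leading dot in each suffix rejects names like "fakegooglebot.com".
        if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
            return False
        # Forward-confirm: the hostname must map back to the same address.
        return ip in forward(host)
    except OSError:
        # Covers socket.herror / socket.gaierror (no PTR record, lookup failure).
        return False
```

A UA string alone proves nothing either way; a check along these lines is what separates the real thing from a spoofer wearing the same name.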

You can tell the pdfbot got its shopping list from somewhere else, because it didn't start with the front file-- the only one that's linked from an html page.

So go back a day:

- - [04/May/2012:09:05:07 -0700] "GET /directory/subdirectory/subsubdirectory/frontfile.pdf HTTP/1.1" 200 19849 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
- - [04/May/2012:09:32:06 -0700] "GET /directory/subdirectory/subsubdirectory/file_three.pdf HTTP/1.1" 200 193709 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
- - [04/May/2012:09:32:07 -0700] "GET /directory/subdirectory/subsubdirectory/file_four.pdf HTTP/1.1" 200 329390 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
- - [04/May/2012:09:32:09 -0700] "GET /directory/subdirectory/subsubdirectory/file_one.pdf HTTP/1.1" 200 109385 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
- - [04/May/2012:09:47:16 -0700] "GET /directory/subdirectory/subsubdirectory/file_two.pdf HTTP/1.1" 200 125835 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

That UA is familiar. I don't have a lot of pdfs, but the Googlebot-- and also the Googlebot-Mobile-- will occasionally pick up one that's attached to an e-book.

This particular batch of pdfs is new to the site. They used to live somewhere else, with only a nofollow link pointing to the top one ("frontfile.pdf"). The link is in such an obscure location, it took Google almost two weeks to discover the change. Followed, apparently, by 27 minutes to find and process the links within the pdf ... and then a day and a half to make contact with the pdfbot.
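
The "27 minutes" and "a day and a half" figures can be checked against the log timestamps quoted above. A small Python sketch; the helper name `gap` and the variable names are made up for illustration, and the four timestamps are copied from the excerpts.

```python
from datetime import datetime

LOG_TIME = "%d/%b/%Y:%H:%M:%S %z"  # timestamp layout used in the log excerpts

def gap(start, end):
    """Return the timedelta between two access-log timestamps."""
    return datetime.strptime(end, LOG_TIME) - datetime.strptime(start, LOG_TIME)

# frontfile.pdf fetched -> first linked pdf fetched (regular Googlebot)
discovery = gap("04/May/2012:09:05:07 -0700", "04/May/2012:09:32:06 -0700")
# last regular Googlebot fetch -> first fetch by the stripped-down UA
handoff = gap("04/May/2012:09:47:16 -0700", "05/May/2012:20:59:01 -0700")

print(discovery)  # 0:26:59
print(handoff)    # 1 day, 11:11:45
```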

I have never seen that stripped-down Googlebot UA before. Granted, I didn't fine-tooth-comb raw logs going back to the dawn of time. But I looked at some pretty representative chunks. Never anything but the normal Mozilla-clad Googlebot. Or Googlebot-Mobile, which wears even more clothes. Only spoofers use the minimalist version.
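
If you want to comb your own logs for the same thing, something like the following Python sketch would sort Googlebot requests by which outfit the UA is wearing. It assumes combined-format log lines like the ones quoted above; the labels "bare", "full" and "mobile" and the function names are mine, not anything official.

```python
import re

# Matches the timestamp, request, status, size, referer and user-agent fields of a
# combined-format log line. Whatever precedes the timestamp (host, identity) is ignored.
LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def googlebot_variant(ua):
    """Label a user-agent string as 'mobile', 'bare', 'full', or None if not Googlebot."""
    if "Googlebot-Mobile" in ua:
        return "mobile"
    if ua.startswith("Googlebot/"):
        return "bare"   # the stripped-down form: no Mozilla/5.0 wrapper
    if "Googlebot/" in ua:
        return "full"   # the usual Mozilla/5.0 (compatible; ...) form
    return None

def classify(line):
    """Return (path, variant) for a log line, or None if the line does not parse."""
    m = LINE.search(line)
    return (m["path"], googlebot_variant(m["ua"])) if m else None
```

A "bare" label only says what the UA string looks like. It does not say whether the request really came from Google, so it still needs to be paired with an IP check.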

12:30 pm on May 9, 2012 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 1, 2006
posts: 66
votes: 0

I've recorded encounters with precisely that stripped-down G-bot UA.

First, 132 hits on 05 and 07/Aug/2011, various IPs (none real G IPs), and all were RFI or traversal exploit attempts.

Then, 8 hits on 05/Mar/2012 from (server2.fightstrike.com), all were zeroboard exploit attempts.

7:48 pm on May 9, 2012 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
votes: 4

Thanks for the heads-up, Lucy.

I've had that UA enabled for a long time without following how it worked. Trawling through my security logs, I see the UA only hits a few pages of one site - pages that have PDFs.

Now blocked. I also added
to the blocked list. We don't use adsense anywhere, although G does visit to check (surely it should know?!) and we do not (usually) allow image-cutting.