Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
when robots get mad
lucy24




msg:4428756
 7:54 pm on Mar 13, 2012 (gmt 0)

[Moderators: I waffled between robots.txt and SSID. There's some of both.]

Just the other day I said something somewhere about new ways for the googlebot to get mad at you. I thought I was kidding.

Backtrack: Over the weekend I roboted-out a few more directories. This feels a little suicidal when you're as small as I am, but honestly, I don't need them. They're the images belonging to assorted public-domain e-books. People can find them anywhere-- same surrounding text, same alt text, so there should be a whole slew of them in any batch of search results.

All they do is look at the picture. But, because of the way Image Search works, the whole page gets loaded up. Along with anywhere from 10 to 60 images, because they're mostly picture books. The searcher never notices. Worst case, they're just looking for hotlink fodder. I blocked the two worst ones ages ago. Nobody ever goes on to look at other books.

Each e-book has its own images, so it had to go like this.

User-Agent: *
Disallow: /ebooks/title1/images
Disallow: /ebooks/title2/images
Disallow: /ebooks/title3/images
(et cetera, four in all)

and

User-Agent: Googlebot-Image
Disallow: /ebooks
et cetera.

That's on the assumption that Google is accurate when it says "Googlebot-Image" is used for Image Search. (They say so in GWT.) The regular googlebot gets images, but I can't remember the imagebot ever getting text.

Waited 24 hours and then checked in Crawler Access to make sure it was working as intended. Check #1, using plain googlebot plus Googlebot-Image: ask for an image in /ebooks/blind/. Both blocked. Check #2, same two: ask for an image elsewhere in /ebooks. Googlebot gets in, Image doesn't.
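For anyone who wants to repeat those two checks offline, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt text is a hypothetical stand-in modeled on the rules quoted above (only the /ebooks/blind/images directory is spelled out), and the `/ebooks/other/` path is invented for illustration. Python's parser is not Google's, so treat the result as a sanity check rather than proof of what the real crawlers will do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt modeled on the rules in the post; only one of the
# four image directories is listed here.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /ebooks/blind/images

User-Agent: Googlebot-Image
Disallow: /ebooks
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# (user agent, path) pairs mirroring Check #1 and Check #2.
# "/ebooks/other/" is an invented directory standing in for "elsewhere in /ebooks".
CHECKS = [
    ("Googlebot", "/ebooks/blind/images/music.png"),
    ("Googlebot-Image", "/ebooks/blind/images/music.png"),
    ("Googlebot", "/ebooks/other/images/cover.png"),
    ("Googlebot-Image", "/ebooks/other/images/cover.png"),
]

for agent, path in CHECKS:
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    print(f"{agent:16} {path:36} {verdict}")
```

A crawler that has its own User-Agent group is matched against that group only, which is why the Googlebot-Image section has to cover everything it should stay out of.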


But now the computer has noticed some hanky-panky. Logs of the day show:

nn.nn.aa.bb - - [12/Mar/2012:16:14:13 -0700] "GET /ebooks/blind/ThreeBlindMice.html HTTP/1.1" 200 7111 "-" "Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0"

immediately followed by

nn.nn.cc.dd - - [12/Mar/2012:16:14:14 -0700] "GET /ebooks/blind/images/music.png HTTP/1.1" 403 988 "-" "-"
nn.nn.ee.ff - - [12/Mar/2012:16:14:14 -0700] "GET /ebooks/blind/images/page18.jpg HTTP/1.1" 403 988 "-" "-"

and so on, for a total of 30 routine blocks for blank UA. The book actually has 60 images. No discernible pattern. They also got the favicon, which everyone is allowed to get, but forgot the stylesheet and midi.*
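A minimal sketch of how blocks like these could be pulled out of a log, assuming the usual combined log format that the quoted lines appear to follow. The regular expression and the function names are my own illustration, and the sample lines are the three quoted above with the same obfuscated addresses.

```python
import re
from collections import Counter

# Combined log format: host ident user [time] "request" status size "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The three log lines quoted in the post, addresses still obfuscated.
SAMPLE_LINES = [
    'nn.nn.aa.bb - - [12/Mar/2012:16:14:13 -0700] "GET /ebooks/blind/ThreeBlindMice.html HTTP/1.1" 200 7111 "-" "Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0"',
    'nn.nn.cc.dd - - [12/Mar/2012:16:14:14 -0700] "GET /ebooks/blind/images/music.png HTTP/1.1" 403 988 "-" "-"',
    'nn.nn.ee.ff - - [12/Mar/2012:16:14:14 -0700] "GET /ebooks/blind/images/page18.jpg HTTP/1.1" 403 988 "-" "-"',
]


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def blocked_blank_agent(lines):
    """Collect entries that were refused (403) and sent no user agent."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry and entry["status"] == "403" and entry["agent"] in ("", "-"):
            hits.append(entry)
    return hits


hits = blocked_blank_agent(SAMPLE_LINES)

# Group by the first two octets, since the rest of the address varies.
by_prefix = Counter(".".join(h["host"].split(".")[:2]) for h in hits)

for prefix, count in by_prefix.items():
    print(prefix, count)
```

Grouping on the first two octets is only a convenience for eyeballing a log; it says nothing about who actually owns the range.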

I obfuscated the nn.nn. just to build suspense. It's

:: drumroll ::

74.125. The rest varies. I have never met this IP doing ordinary Search before. I have never seen it without a UA when asking for anything but the favicon. And, afaik, I have never seen the Firefox UA anywhere. (I mostly block by IP but ignore by UA, so it would have jumped up and hit me.)
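One way to double-check an unfamiliar address is a reverse DNS lookup followed by a forward lookup, which is the procedure Google documents for verifying Googlebot. The sketch below is a hedged illustration of that idea: the function name and the suffix list are my own assumptions, and other Google services (Feedfetcher, previews and so on) may not resolve to these hostnames, so a "no" here is not conclusive.

```python
import socket

# Assumed hostname suffixes; Google documents these for Googlebot specifically.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def looks_like_google(ip,
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the hostname suffix, then forward-confirm.

    The two lookup functions are parameters so they can be swapped for
    stubs when testing without network access.
    """
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        # No reverse record, or the lookup failed.
        return False

    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False

    # The forward lookup must point back at the address we started with.
    return ip in addresses
```

Calling `looks_like_google("74.125.aa.bb")` with the real address filled in would do live DNS lookups, so it is best run by hand rather than on every request.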

Two minutes later, Googlebot swings by to pick up Three Blind Mice in its own name. Text only.

Over the next 15 seconds, three more pickups:

209.85.aa.bb - - [12/Mar/2012:16:16:08 -0700] "GET /ebooks/blind/ThreeBlindMice.html HTTP/1.1" 200 7167 "-" "Mozilla/5.0 (compatible) Feedfetcher-Google;(+http://www.google.com/feedfetcher.html)"

Documentation says that Feedfetcher-- like Preview and Translate-- is not a robot. Detour to raw logs confirms that I've never seen it before. Amazing that three separate people requested the same file within fifteen seconds of each other-- without the images, stylesheet and midi that make the book worth reading.

An hour goes by. Things are quiet. Apparent human, with complete set of images:

67.183.nn.nn - - [12/Mar/2012:17:13:52 -0700] "GET /ebooks/blind/ThreeBlindMice.html HTTP/1.1" 200 7167 "http://plus.url.google.com/url? {search string}" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.79 Safari/535.11"

The search string contains no q-- not even an empty one-- but the results magically come up with a page title all the same.

Wait, there's one more. Again a flawless imitation of a human, apart from the UA-triggered piwik block:

67.183.nn.nn - - [12/Mar/2012:19:16:16 -0700] "GET /ebooks/blind/ThreeBlindMice.html HTTP/1.1" 200 7167 "-" "Mozilla/5.0 (iPhone4CDMA; U; CPU iPhone OS 5_1 like Mac OS X; en_US) com.google.GooglePlus/4565 (KHTML, like Gecko) Mobile/N92AP (gzip)"

That IP looks awfully familiar, doesn't it? (The final nn.nn are also identical.) It's not every day you see an ordinary browser sharing an IP with a cell phone. Comcast, which could mean absolutely anything.

Anyone know what com.google.GooglePlus is? No surprise, the single word "GooglePlus" turns out to be flatly impossible to search for. All I can say is, I've never seen it before. Innocuous add-on or something more nefarious?


* Just as well. The midi is probably still exhausted from a couple of days last week when some British robot unaccountably picked up several dozen copies, always in sets of 5-- and then just as unaccountably stopped.

 


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved