Forum Moderators: open
And Now Google's Doing It. JS Stats Show GoogleBot
In a nutshell
Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.
It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /
The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
If you have a User-agent: * section and a User-agent: GoogleBot section, Google completely IGNORES the User-agent: * section. You need to copy the directives from the User-agent: * section into the User-agent: GoogleBot section if you want Google to see them.
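A minimal robots.txt sketch of that behavior: when a Googlebot-specific section exists, shared rules must be repeated inside it or Google never sees them (the disallowed paths here are hypothetical examples):

```
# Googlebot matches only the most specific User-agent section,
# so rules meant for it must be duplicated there.
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: Googlebot
Disallow: /private/
Disallow: /tmp/
```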
It's their web preview ... I guess they think it's cool to break protocol and standards when it comes to making their visitors happy and possibly keeping them on their site rather than just saying 'this site does not allow previews' and sending them to the site in the results.
They're actually making the request too ... The system checks for an X-Forwarded-For so if they were 'proxy requesting' for visitors that click on the preview it should show the visitor's IP Address, not theirs.
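That X-Forwarded-For check could be sketched roughly like this (a hypothetical helper, not tied to any particular server framework):

```python
def client_ip(remote_addr, xff_header=None):
    """Return the originating client IP for a request.

    If the request came through a proxy that sets X-Forwarded-For,
    the first address in that comma-separated list is the original
    client; otherwise the socket's remote address is the client.
    """
    if xff_header:
        # Header format: "client, proxy1, proxy2"
        return xff_header.split(",")[0].strip()
    return remote_addr

# A direct request from the preview fetcher shows Google's own IP:
print(client_ip("66.249.81.5"))
# A true proxy-style request would carry the visitor's IP in the header:
print(client_ip("66.249.81.5", "203.0.113.9, 66.249.81.5"))
```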
The first thing to establish is whether it is a genuine GoogleBot.
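Establishing that can be done with forward-confirmed reverse DNS, the method Google itself documents: reverse-resolve the IP, check that the hostname ends in googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch (the live DNS lookups will obviously depend on your environment):

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_hostname(host):
    # A genuine crawler's reverse-DNS name ends in googlebot.com or
    # google.com (strip any trailing dot from the PTR record first).
    return host.rstrip(".").endswith(GOOGLE_SUFFIXES)

def is_genuine_googlebot(ip):
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]       # reverse lookup
    except socket.herror:
        return False
    if not is_google_hostname(host):
        return False
    try:
        return socket.gethostbyname(host) == ip  # forward-confirm
    except socket.gaierror:
        return False
```

The forward-confirmation step matters because anyone can point a PTR record at a googlebot.com name; only Google can make the forward lookup resolve back to the same IP.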
Far as I can tell, 64.233.x.x, 72.14.x.x, 74.125.x.x are strictly Google Web Preview
72.14.x.x includes Google Wireless Transcoder, Google Translate
I still haven't seen GoogleBot disregard robots.txt
66.249.71.109 - - [10/May/2011:13:35:00 -0700] "GET /robots.txt HTTP/1.1" 200 480 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
66.249.71.218 - - [10/May/2011:23:45:29 -0700] "GET /{off-limits directory}/{filename1}.html HTTP/1.1" 200 3485 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.218 - - [11/May/2011:01:18:02 -0700] "GET /{off-limits directory}/{filename2}.html HTTP/1.1" 200 2693 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.218 - - [11/May/2011:02:28:06 -0700] "GET /{off-limits directory}/{filename3}.html HTTP/1.1" 200 4324 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.218 - - [11/May/2011:03:38:07 -0700] "GET /{off-limits directory}/{filename4}.html HTTP/1.1" 200 3841 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Googlebots doing their stuff without reference to robots.txt.
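Log entries like the ones above can be scanned mechanically. A rough sketch, assuming an Apache combined-log format like the lines quoted here (the /private/ prefix stands in for whatever directory your robots.txt disallows):

```python
import re

# Minimal pattern for combined-log lines: IP, timestamp, method, path, status.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3})')

def disallowed_hits(lines, disallowed_prefix, ua_token="Googlebot"):
    """Yield (ip, timestamp, path) for crawler requests into a disallowed path."""
    for line in lines:
        m = LOG_RE.match(line)
        if m and ua_token in line and m.group(4).startswith(disallowed_prefix):
            yield m.group(1), m.group(2), m.group(4)

sample = ('66.249.71.218 - - [10/May/2011:23:45:29 -0700] '
          '"GET /private/page.html HTTP/1.1" 200 3485 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
print(list(disallowed_hits([sample], "/private/")))
# [('66.249.71.218', '10/May/2011:23:45:29 -0700', '/private/page.html')]
```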
I'm talking real world. G, Y, M$ all crawl disallowed files.
There are worse bots to worry about - ones that offer no potential benefits.
Googlebots doing their stuff without reference to robots.txt
Or using a previously cached copy, perhaps.
IDK how they decide which page to fetch or when and I really don't want to take the time to research it and find out
As on-the-fly rendering is only done based on a user request (when a user activates previews), it’s possible that it will include embedded content which may be blocked from Googlebot using a robots.txt file.
In order for images to be embedded in previews, it is important that they are not disallowed by your robots.txt file. In order to block crawlable images from being indexed, you can use a "noindex" directive in an X-Robots-Tag HTTP header.
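One way to send that header is in the server config. A sketch for Apache (assumes mod_headers is enabled; the matched extensions are just examples):

```
# Keep images crawlable (no robots.txt Disallow) but out of the index.
<FilesMatch "\.(png|jpe?g|gif)$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```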
If yours are not being indexed, it's probably because there aren't enough links to them for them to be indexed without the content being known, but it's got nothing to do with the robots.txt block, because those pages can still be indexed as URL-only listings ... Check out the Google forum if you haven't ... This is one of the biggest points of confusion ... If you have seen the Google forum, then you should know locations blocked in robots.txt are easily (and often) indexed.