Forum Moderators: open
And Now Google's Doing It. JS Stats Show GoogleBot
'Hey, this user-agent completely disregards robots.txt, because it helps us out and we feel like it...'
WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.
I haven't decided yet, and it's not on my site, so I don't get to make the final decision ... It's likely I'll end up re-working the stat system to account for their visitor accommodation, since they may not treat the site(s) blocking their non-compliant POS kindly.
It's likely I'll end up re-working the stat system
is not "recursively retrieving linked pages"
...since they may not treat the site(s) blocking their non-compliant POS kindly.
That's not entirely accurate!
Can I show different content in the preview?
A: No. You must show Googlebot and the Google Web Preview the same content that users from that region would see (see our Help Center article on cloaking).
How can I block previews from being shown?
A: You can block previews using the "nosnippet" robots meta tag or x-robots-tag HTTP header.
[edited by: incrediBILL at 7:09 am (utc) on May 16, 2011]
[edit reason] thread clean up [/edit]
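For reference, the "nosnippet" answer quoted above corresponds to markup along these lines — a minimal sketch of the two mechanisms mentioned (a robots meta tag for a single page, or an X-Robots-Tag response header set server-side):

```html
<!-- Per-page: in the document <head> -->
<meta name="robots" content="nosnippet">

<!-- Or for any response type, via an HTTP response header
     (configured on the server, shown here as a comment):
     X-Robots-Tag: nosnippet -->
```

The header form is useful for non-HTML files (PDFs, images) where a meta tag isn't possible.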
I swear I once read on a google page that good robots wait a second or more between each hit. Anyone got a calculator? 100 pickups (exactly!) ÷ 6 seconds = ...
lucy24 (previously in this thread)
[edited by: incrediBILL at 7:23 am (utc) on May 16, 2011]
[edit reason] thread clean up [/edit]
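Running lucy24's numbers: 100 requests in 6 seconds works out to roughly 16.7 requests per second, or about 0.06 s between hits — far tighter spacing than the "a second or more between each hit" guideline she recalls. A quick sketch of the arithmetic:

```javascript
// Rate check: 100 pickups in a 6-second window, compared against a
// "wait at least one second between hits" rule of thumb.
const requests = 100;
const windowSeconds = 6;

const requestsPerSecond = requests / windowSeconds;  // ≈ 16.67
const secondsBetweenHits = windowSeconds / requests; // 0.06

console.log(requestsPerSecond.toFixed(2) + " req/s"); // "16.67 req/s"
console.log(secondsBetweenHits.toFixed(2) + " s between hits"); // "0.06 s between hits"
```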
It now seems that your problem was actually caused by your obsolete stat system.
I really don't want to take the time to research it and find out
Are the regulars in here always this nice to new visitors to the Forum?
[edited by: incrediBILL at 7:15 am (utc) on May 16, 2011]
[edit reason] thread clean up [/edit]
Gbot regularly causes 404 errors in my stats
The launch of Google Web Preview clearly made some methods of stats analysis obsolete, unless they were upgraded to take the new bot into account.
[edited by: TheMadScientist at 3:41 am (utc) on May 17, 2011]
the Google Web Preview bot is considered a pre-fetcher which is not subject to robots.txt instructions
Considered by whom?
Can someone sum up whether there is actually an issue with GoogleBot?
Considered by whom?
A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.
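The Robots Exclusion Protocol that definition belongs to is expressed through a plain-text robots.txt file at the site root, which compliant robots fetch and honor before crawling. A minimal illustrative example ("ExampleBot" is a placeholder name, not a real crawler):

```
# A specific compliant robot: keep out of one directory
User-agent: ExampleBot
Disallow: /private/

# All other robots: nothing disallowed
User-agent: *
Disallow:
```

Whether a given fetcher honors this file at all is, of course, exactly what's in dispute in this thread.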
Usually, the 'best sources' I've found take the time to cite, reference, and explain their position rather than simply posting as 'the authority' on a subject. Plus, we all 'make mistakes' or 'jump to a conclusion' from time to time, so it's always wise to follow up, imo.
erroneous claims
the robotstxt.org site
erroneous claims
It is unfortunate that your stat system cannot identify a user-agent correctly and was not updated to take Google Web Preview into account six months ago.
It's funny, you keep talking about my stat system like it's old. I only installed this version 2 or so weeks ago, and my guess is anyone who writes a brand new one misses a few at the beginning. (Google's Web Preview didn't even show up right away, which I find interesting, because if it had I wouldn't have 'just let it run' for a couple of weeks without posting.)
...
System Live: Sun, 01 May 2011 06:52:27 -0600 GMT
First Google Web Preview Request: Fri, 06 May 2011 19:12:31 -0600 GMT
...
So far, M$ was first (got them in the first couple of days, and they were much harder to detect than Google's), and now Google. Everything else I have seems to be 'real visitors' from what I'm seeing. So to date, to my knowledge, to keep your jQuery-based JS stats from being 'obsolete' you need to account for two bots: one from M$ IP addresses claiming to be a browser, and the other, Google's Web Preview.
...
The only way to 'see' these visits is to pop open the database, because I don't 'add in' visitors without exit times or 'other missing entries', except on the overall count, so they're not apparent in the live version. I watched that version and compared it to the DB for about 8 hours a day during the first 3 days after launch. When I dropped back by to compare again, I saw the G requests as soon as I opened the DB. They were so obvious compared to M$ it's not even funny, and honestly, I wish they would do as good a job of running jQuery as M$ does. They're way behind there, because M$ sends me the HTML variable I'm supposed to have. My guess is this is a new addition for G.
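For a JS-based stats system, the Google side of this is the easier half: Google Web Preview's requests can be flagged by a user-agent substring before a visit is recorded. A hedged sketch, assuming the "Google Web Preview" marker reported in this thread (`looksLikePreviewBot` and `BOT_MARKERS` are illustrative names, not from any library):

```javascript
// Flag hits whose user-agent looks like a known preview/prefetch bot
// before counting them as visitors. The marker list is an assumption
// based on this thread, not an official registry; extend as needed.
const BOT_MARKERS = ["Google Web Preview"];

function looksLikePreviewBot(userAgent) {
  const ua = String(userAgent || "");
  return BOT_MARKERS.some(function (marker) {
    return ua.indexOf(marker) !== -1;
  });
}

// In a jQuery-based logger, gate the tracking call:
// if (!looksLikePreviewBot(navigator.userAgent)) { recordVisit(); }
```

Note that this does nothing for the M$ bot described above, which claims to be a browser: a spoofed user-agent can't be caught by UA matching, so that one would need server-side detection (e.g. by IP range) instead.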
One person here has claimed that the GoogleBot user-agent was used to fetch disallowed files.
One person here has claimed that a "web standard" was violated.
Apologies if I have misunderstood, but you seem to have initially claimed that your problem was caused by GoogleBot misbehaviour (title of the thread) and then changed to say it was caused by Google Web Preview (which is a rather different animal).
The Google Web Preview bot is considered exempt from the Robots Exclusion Protocol by just about everyone except those belatedly complaining here.
My guess is this is a new addition for G.
we'll have to agree to disagree about what is considered to be standard
This document is an Internet-Draft... Internet-Drafts are draft documents valid for a maximum of six months
Can you point me in the direction of the 'exemption list' or even the 'exemption protocol' so I know what bots are considered to be 'exempt' from the exclusion protocol in the future? I can't find it...
[edited by: incrediBILL at 6:52 pm (utc) on May 17, 2011]
[edit reason] thread clean up [/edit]