New Googlebot User-Agent Identification

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Critter

3:56 am on Mar 3, 2004 (gmt 0)

Just noticed this tonight. A new identification for the Googlebot in my logs.

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Verified that the IPs were Google's, so it seems legit. Funny that this new identification mimics Yahoo's crawler bot identification.

kaled

2:29 pm on Mar 5, 2004 (gmt 0)

Presumably this is a significantly more powerful robot rather than just a new name.

GG, you mentioned frames and javascript - well I don't envy your engineers the task of dealing with javascript (so I guess that may be down the line) but since I use frames extensively I would certainly be interested to know what changes are coming.

I imagine that CSS files will also be scanned in the future. I guess we'd all like the heads-up on that.

Also here's a suggestion I posted a couple of weeks ago.

From comments by GoogleGuy, I think it is safe to assume that it is possible to create plain HTML links that will not be followed by Googlebot. All you need to do is add something that looks like a session ID to the URL.
However, this is untidy, so I propose a very simple exclusion protocol: just add
?...&robots=nofollow
to the URL.
When a robot sees this parameter in a URL, it should not follow that link.
The standard should allow other fields, and fields in any order, so that the following would be legal:
?...&robots=newparam,nofollow,anotherparam
This would make it easy for webmasters to avoid setting spider traps. It would allow creators of shopping cart software to ensure that their products don't set spider traps. Since it is probably the existence of such problems that has caused some hosts to ban Googlebot (amongst others) it would help to solve this problem over time.
A standard such as this should have been agreed years ago. However, if we wait for a standards organisation to ratify this it'll take years. On the other hand, if Google were to unilaterally adopt such a standard, other robots would adopt it too.

Kaled.
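
As an illustration of kaled's proposal only (it is a suggestion from this thread, not a protocol any crawler is known to implement), a crawler-side check might look like the following minimal Python sketch. The function name should_follow and the example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def should_follow(url: str) -> bool:
    """Return False if the URL carries the proposed robots=nofollow field.

    Hypothetical helper: it implements the query-parameter convention
    proposed in the post above, which is not an adopted standard.
    """
    query = urlsplit(url).query
    # parse_qs returns a list of values per parameter name.
    for value in parse_qs(query).get("robots", []):
        # The proposal allows several comma-separated fields in any order.
        directives = {d.strip().lower() for d in value.split(",")}
        if "nofollow" in directives:
            return False
    return True


if __name__ == "__main__":
    for link in (
        "http://www.example.com/cart?item=3&robots=nofollow",
        "http://www.example.com/cart?robots=newparam,nofollow,anotherparam",
        "http://www.example.com/page?item=3",
    ):
        print(link, "->", "follow" if should_follow(link) else "skip")
```

Unknown fields such as newparam are simply ignored, which is what lets the list be extended later without breaking older robots.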

GoogleGuy

5:08 pm on Mar 5, 2004 (gmt 0)

Interesting suggestion, kaled--I'll pass it on. You could also imagine allowing a div that lets you block out links or sections of a page not to index/follow. Thanks for the feedback, everybody.

Dayo_UK, it's more the latter. It's not like the new user-agent bot will be some brand-new "superbot" that can understand everything that webservers will offer. But it does lay the groundwork, so that if in the future we want to add a new superbot-like feature, things will be smoother for everyone (both webmasters and us).

Stefan

2:38 am on Mar 6, 2004 (gmt 0)

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Fine and dandy, but what is this?

2004-03-05 03:37:30 64.68.89.144 GET /b*rrows_icc_030510.htm 406 4085 134 www.site.org Googlebot/Test -

Did I miss something in the thread or did it address it? What's with the 406?

ScottM

2:48 am on Mar 6, 2004 (gmt 0)

GG, a lot of us are running session 'killers' ("cloaking" for a good reason) for Googlebot on our forums.

Will this affect us?

If so, should we change our code to the new name?

[edited by: ScottM at 2:50 am (utc) on Mar. 6, 2004]

jbgilbert

2:49 am on Mar 6, 2004 (gmt 0)

If that 406 is the response to the user agent it means:
Client Error - Not Acceptable.

Very interesting log entry you have there...

Stefan

2:57 am on Mar 6, 2004 (gmt 0)

It's the first 406 I've ever seen in the logs. I don't know what the test is, but it's not working with my site.

jdMorgan

3:17 am on Mar 6, 2004 (gmt 0)

We've had some recent discussion of the 406 error over in the Apache forum. It seems there may have been some change recently that affected Googlebot's ability to participate in content negotiation. If you have content negotiation enabled but are not actually using it, you can turn it off to fix this problem.

On Apache, the fix can be as simple as putting

 Options -MultiViews

in your Web root .htaccess file.

Jim

Stefan

3:23 am on Mar 6, 2004 (gmt 0)

JD, it's as simple an html page as you can get, the exact same as a hundred others on the site, many of which were crawled by the normal bot during the same 24 hr period and all got 200's.

I'll sticky you the log files if you want.

Interesting that GG didn't respond to the several posts on Googlebot/Test

sblake

3:30 am on Mar 6, 2004 (gmt 0)

"This bot that a few people noticed was a test crawl with the new user agent."

It seemed pretty clear to me.

Stefan

3:34 am on Mar 6, 2004 (gmt 0)

The test crawl he referred to wasn't Googlebot/Test and it wasn't getting 406's from servers. Read through the thread.

Added: this is what he was talking about

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

[edited by: Stefan at 3:39 am (utc) on Mar. 6, 2004]

jdMorgan

3:38 am on Mar 6, 2004 (gmt 0)

stefan,

MultiViews is a server-level setting, independent of (well, above) individual pages. You may have to ask your host if it's enabled. Basically, your server and Googlebot could not agree on a MIME type that was acceptable to both. You might also want to check your server headers [webmasterworld.com] and make sure they're correct. You should get a MIME type of text/html for a plain-vanilla HTML page.

The problem with Googlebot and MultiViews started about 01/Feb/2004 according to this thread [webmasterworld.com].
More recent discussion [webmasterworld.com].

Jim
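
To make the "could not agree on a MIME type" explanation concrete, here is a minimal Python sketch of how a 406 can fall out of content negotiation. It is a deliberately simplified model of HTTP Accept handling, not Apache's actual algorithm, and it ignores the precedence of more specific media ranges. The Accept values in the usage lines are hypothetical; the thread does not record what Googlebot/Test actually sent.

```python
def parse_accept(header):
    """Split an Accept header into (media_range, q) pairs."""
    ranges = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        pieces = [p.strip() for p in part.split(";")]
        q = 1.0  # default quality when no q parameter is given
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        ranges.append((pieces[0].lower(), q))
    return ranges


def matches(media_range, mime_type):
    """True if a media range such as text/* or */* covers mime_type."""
    if media_range == "*/*":
        return True
    rtype, _, rsub = media_range.partition("/")
    mtype, _, msub = mime_type.partition("/")
    return rtype == mtype and rsub in ("*", msub)


def negotiate(accept_header, available):
    """Return (status, mime_type): 200 with the best match, or 406."""
    if not accept_header.strip():
        # No Accept header: the client is taken to accept anything.
        return 200, available[0]
    best, best_q = None, 0.0
    for mime in available:
        for media_range, q in parse_accept(accept_header):
            if q > best_q and matches(media_range, mime):
                best, best_q = mime, q
    if best is None:
        return 406, None
    return 200, best


if __name__ == "__main__":
    print(negotiate("text/html,*/*;q=0.5", ["text/html"]))
    print(negotiate("application/pdf", ["text/html"]))
```

In this model the 406 appears only when negotiation is actually performed and nothing the server offers is covered by the client's Accept header, which is consistent with the advice to switch negotiation off when it is not being used.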

Stefan

3:46 am on Mar 6, 2004 (gmt 0)

Thanks JD

Our site gets crawled by the normal bot very well every day. The 406 only showed up on two attempts by that particular bot with that particular IP#. Whatever it is, it's not our site or server that caused the 406.

lunas

4:22 pm on Mar 6, 2004 (gmt 0)

So, I for one am unclear, what is this GoogleBot/Test? It hit my site and received 200 meaning all is okay, but curious as to what it is.

GoogleGuy

5:01 pm on Mar 7, 2004 (gmt 0)

I hadn't heard of GoogleBot/Test, but I'll ask about the 406 issue. I wouldn't be surprised to see different Googlebots with slightly different code in some ways--we're always trying new things. ScottM, I'd say it's best to prepare to recognize Googlebot by looking for either the old or the new user agent.
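
For anyone updating session-handling code along these lines, one possible approach is to match the "Googlebot" token rather than the whole string. The Python sketch below assumes that the older identification has the form "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"; that string is not quoted in this thread and is an assumption here, and the function name is hypothetical.

```python
import re

# Match the Googlebot token anywhere in the header, so that the old
# form, the new Mozilla/5.0-style form, and Googlebot/Test all qualify.
GOOGLEBOT_TOKEN = re.compile(r"\bGooglebot\b", re.IGNORECASE)


def looks_like_googlebot(user_agent):
    """Return True if the User-Agent header claims to be Googlebot.

    This checks the claim only. A User-Agent header can be set to
    anything, so it does not prove the request came from Google.
    """
    return bool(GOOGLEBOT_TOKEN.search(user_agent or ""))


if __name__ == "__main__":
    samples = [
        # New identification, as quoted at the top of this thread.
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        # Assumed form of the older identification (not quoted in the thread).
        "Googlebot/2.1 (+http://www.googlebot.com/bot.html)",
        "Googlebot/Test",
        "msnbot/0.11+(+http://search.msn.com/msnbot.htm)",
    ]
    for ua in samples:
        print(looks_like_googlebot(ua), ua)
```

Because the header is trivially spoofed, pairing this with an IP check, as the first post did, is the safer design.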

volatilegx

9:28 pm on Mar 7, 2004 (gmt 0)

FYI, I have already seen the "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" user agent and was seeing it as early as late February. I too am curious about Googlebot/Test.

g1smd

9:55 pm on Mar 7, 2004 (gmt 0)

I have seen 406 errors from another bot in the last few days.

2004-03-06 11:06:59 wfp2.almaden.ibm.com - W3xxnxnnn Wxxnnn ip.232.131.ip 80
GET /pdfs/Some_File.pdf - 406 0 4066 195 HTTP/1.0
[almaden.ibm.com...] -

Heh, even found someone browsing the site from a mobile phone.

2004-03-06 11:34:21 216.239.39.5 - W3xxnxnnn Wxxnnn ip.232.131.ip 80
GET /index.htm - 200 0 1069 565 HTTP/1.0
Nokia3510i/1.0+(04.01)+Profile/MIDP-1.0+Configuration/CLDC-1.0+(Google+WAP+Proxy/1.0) -

edit_g

10:26 pm on Mar 7, 2004 (gmt 0)

Now, what does the + do? :)

Stefan

10:28 pm on Mar 7, 2004 (gmt 0)

Interesting, g1smd. I also had them from the msnbot but it was on .doc's and .txt's, (but maybe I'm noticing them now because I'm looking for them).

2004-03-05 06:11:01 65.54.188.8 GET /c***-r****-a**essment.doc 406 4066 216 www.site.org msnbot/0.11+(+http://search.msn.com/msnbot.htm) -

The G ones were .htm.
The testbot and the 406's haven't shown up again, so no problem.

Thanks for the input, GG.

mr_strong

2:02 pm on Mar 8, 2004 (gmt 0)

Googleguy - Are these changes to Googlebot the reason why many new sites aren't being deepcrawled at the moment?

Thanks :)

seofreak

4:36 pm on Mar 8, 2004 (gmt 0)

One of my sites has been getting deep crawled daily.

lstrand

7:41 pm on Apr 20, 2004 (gmt 0)

Any more ideas if the user agent "Googlebot/Test (+http://www.googlebot.com/bot.html)" is a real Google bot or a fake? There didn't appear to be a definitive answer in this thread.

This user agent really upped the page views starting last night and it is going for our regular HTML pages. In the past it had a few crawls and now it is far more aggressive for its visit cycle. It is also visiting new pages it has not crawled before.

I'm concerned about non-Google scraping of pages. If a fake, I'd like to ban it.

Here are the IPs and page view counts (past month) for requests identifying themselves as the "Googlebot/Test" bot.

IP address      Page views
64.68.91.131    106
64.68.91.53     54
64.68.91.164    53
64.68.81.18     50
64.68.91.54     34
64.68.89.141    14
64.68.91.33     12
64.68.91.59     12
64.68.83.188    7
64.68.89.167    6
64.68.91.64     6
64.68.83.79     5
64.68.91.188    5
64.68.81.154    3
64.68.89.154    3
64.68.83.132    2
64.68.83.182    2
64.68.88.18     2
64.68.88.191    2
64.68.89.173    2
64.68.89.180    2
64.68.91.39     2
64.68.83.144    1
64.68.89.179    1

DNS lookup results are unclear. I see Exodus and Google for most IPs. Is it really Google?

Cable & Wireless SC3-1 (NET-64-68-64-0-1)
64.68.64.0 - 64.68.95.255
Google Inc. EC12-1-GOOGLE (NET-64-68-80-0-1)
64.68.80.0 - 64.68.87.255

Who has answers? The help would be most appreciated.
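
One way to see why the lookup looks mixed is to test each logged IP against the two ranges quoted in the post. The Python sketch below does only that: the ranges are copied from the lookup output above (a 2004 snapshot), the labels and the function name are hypothetical, and falling in only the wider block does not by itself show an address is or is not Google's.

```python
import ipaddress


def networks(first, last):
    """Convert a first-last address range into a list of CIDR networks."""
    return list(
        ipaddress.summarize_address_range(
            ipaddress.ip_address(first), ipaddress.ip_address(last)
        )
    )


# Ranges exactly as quoted in the post above.
GOOGLE_BLOCK = networks("64.68.80.0", "64.68.87.255")  # "Google Inc."
PARENT_BLOCK = networks("64.68.64.0", "64.68.95.255")  # "Cable & Wireless"


def classify(ip):
    """Label an IP by the narrowest quoted range that contains it."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in GOOGLE_BLOCK):
        return "google-block"
    if any(addr in net for net in PARENT_BLOCK):
        return "parent-block-only"
    return "outside-both"


if __name__ == "__main__":
    for ip in ("64.68.81.18", "64.68.83.188", "64.68.91.131", "64.68.89.141"):
        print(ip, classify(ip))
```

On these quoted ranges, the 64.68.81.x and 64.68.83.x addresses sit inside the block registered to Google, while the 64.68.88.x, 64.68.89.x and 64.68.91.x addresses fall only within the wider parent allocation, which may explain the mixed results.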

GoogleGuy

7:53 pm on Apr 20, 2004 (gmt 0)

My hunch is that it's a real Google bot.

Spine

9:49 pm on Apr 20, 2004 (gmt 0)

What's the best way to ban all googlebots old and new from a site using robots.txt?

Dayo_UK

9:55 pm on Apr 20, 2004 (gmt 0)

If you really want to:-

User-agent: Googlebot
Disallow: /

BTW, Googlebot test is going mad at the moment on my site too - perhaps the word test will be dropped soon :)
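
The two-line rule above can be sanity-checked offline with Python's standard urllib.robotparser, as in the minimal sketch below. This shows how Python's parser reads the rule, which may differ from how Googlebot itself interprets robots.txt; the path /page.html and the function name are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The rule exactly as given in the post above.
RULES = [
    "User-agent: Googlebot",
    "Disallow: /",
]


def is_allowed(agent_token, path="/page.html"):
    """Ask Python's robots.txt parser whether agent_token may fetch path."""
    parser = RobotFileParser()
    parser.parse(RULES)
    return parser.can_fetch(agent_token, path)


if __name__ == "__main__":
    # Pass the short agent token, not the full User-Agent header.
    for token in ("Googlebot", "msnbot"):
        print(token, "allowed" if is_allowed(token) else "blocked")
```

The rule names only Googlebot, so other robots remain unaffected unless they get their own record.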
