
Google News Archive Forum

This 54 message thread spans 2 pages.
New Googlebot User-Agent Identification
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Critter




msg:42542
 3:56 am on Mar 3, 2004 (gmt 0)

Just noticed this tonight. A new identification for the Googlebot in my logs.

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Verified that the IPs were Google's, so it seems legit. Funny that this new identification mimics the identification used by Yahoo's crawler.

 

closed




msg:42543
 6:12 am on Mar 3, 2004 (gmt 0)

You're quick, Critter. I was looking for new information about it, too.

The most recent visit by Googlebot/2.1 had this UA:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

But about two minutes before that, and all other visits before that, the UA was:

Googlebot/2.1 (+http://www.googlebot.com/bot.html)

Oh well. Something to investigate tomorrow. Nighty night.

I don't mind the new format. Although it does look a lot like Slurp's:

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

mlemos




msg:42544
 6:36 am on Mar 3, 2004 (gmt 0)

I believe that is Google checking if you are cloaking, i.e., showing different things to the crawler depending on the user agent identification.

markus007




msg:42545
 7:26 am on Mar 3, 2004 (gmt 0)

If people were cloaking, they would do it by just scanning for the word google.

mlemos




msg:42546
 7:49 am on Mar 3, 2004 (gmt 0)

In theory, yes, but in practice not everybody knew that Google would be checking pages with a different agent that does not start its name with Googlebot.
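The difference between the two matching strategies discussed above can be sketched in Python. This is only an illustration: the function names are hypothetical, and the two user-agent strings are the ones quoted earlier in this thread. A check that expects the string to start with "Googlebot" misses the new format, while a scan for the word anywhere in the string catches both.

```python
# User-agent strings as quoted in this thread.
OLD_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
NEW_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def matches_by_prefix(user_agent: str) -> bool:
    """Hypothetical naive check: true only if the string starts with 'Googlebot'."""
    return user_agent.startswith("Googlebot")


def matches_by_substring(user_agent: str) -> bool:
    """Hypothetical looser check: true if 'googlebot' appears anywhere, ignoring case."""
    return "googlebot" in user_agent.lower()


if __name__ == "__main__":
    for ua in (OLD_UA, NEW_UA):
        print(matches_by_prefix(ua), matches_by_substring(ua), ua)
```

Either check only inspects a header that any client can set, so it says nothing about whether the request really came from Google.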

AthlonInside




msg:42547
 7:51 am on Mar 3, 2004 (gmt 0)

Why detect cloaking? It is only meaningful if you have 100% different content. Sites like news.google.com change every minute without cloaking because they have FRESH data.

mlemos




msg:42548
 8:07 am on Mar 3, 2004 (gmt 0)

Sites that employ cloaking are banned from Google index.

[google.com...]

Google can determine if a site is cloaking by looking twice at the page with Googlebot user agent and in between with the Mozilla pretender.

If in the two times with Googlebot it looks like the same but with the Mozilla it looks different, you're busted.

closed




msg:42549
 12:48 pm on Mar 3, 2004 (gmt 0)

I believe that is Google checking if you are cloaking

I doubt it. From what I've seen today, all of the UAs for the regular Googlebot (not Mediapartners) have the new string.

BTW, does anyone know why Google puts a + before the URL for their bot information page?

kaled




msg:42550
 2:20 pm on Mar 3, 2004 (gmt 0)

Sites that employ cloaking are banned from Google index.

Nonsense. There are legit uses for cloaking and some huge sites that use it. [msdn.microsoft.com...] detects the browser and delivers different pages accordingly. Try browsing it with Opera identified as Opera (instead of IE) and you'll see some huge differences. Probably other MS sites are the same, but I almost never visit them.

Kaled.

internetheaven




msg:42551
 2:40 pm on Mar 3, 2004 (gmt 0)

"Google can determine if a site is cloaking by looking twice at the page with Googlebot user agent and in between with the Mozilla pretender. If in the two times with Googlebot it looks like the same but with the Mozilla it looks different, you're busted."

Whoa! Whoa! That can't possibly be true. All my pages have server side includes that throw a piece of text in with a random quote at two points on the page. This means that the page changes every time it is loaded and I've never been removed from the Google Index.
What about all the news sites that change constantly throughout the day? Google would have to remove itself ...

g1smd




msg:42552
 2:47 pm on Mar 3, 2004 (gmt 0)

I am sure that "minimal changes" are allowed between visits.

I expect that it looks to see if on one visit it gets three paragraphs of normal looking text, and on the other visit it gets 20 000 repetitive keywords in no particular order.
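The idea of tolerating "minimal changes" while flagging a completely different page can be sketched as a similarity threshold. This is a guess at the general shape of such a check, not Google's method: the function names, the sample pages, and the 0.5 threshold are all assumptions made for illustration, and the comparison uses Python's standard `difflib`.

```python
from difflib import SequenceMatcher


def similarity(text_a: str, text_b: str) -> float:
    """Word-level similarity between two page bodies, from 0.0 to 1.0."""
    return SequenceMatcher(None, text_a.split(), text_b.split()).ratio()


def looks_very_different(body_one: str, body_two: str, threshold: float = 0.5) -> bool:
    """Hypothetical rule: flag the pair when similarity falls below the threshold."""
    return similarity(body_one, body_two) < threshold


if __name__ == "__main__":
    # Made-up page bodies: two differ only in a rotating quote, one is keyword spam.
    normal = "Welcome to our shop. We sell hand made widgets and ship them worldwide. Quote: carpe diem"
    variant = "Welcome to our shop. We sell hand made widgets and ship them worldwide. Quote: tempus fugit"
    spam = "cheap widgets buy widgets best widgets " * 50

    print(looks_very_different(normal, variant))
    print(looks_very_different(normal, spam))
```

With these sample strings, a page whose only change is a random quote stays well above the threshold, while a page of repeated keywords falls far below it.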

mlemos




msg:42553
 2:31 am on Mar 4, 2004 (gmt 0)

I don't think Google bans whole sites for cloaking but rather each of the pages that employ cloaking. Banning may not necessarily mean removing the page from the index but rather not counting it toward the PageRank of the pages it links to. Therefore, internal pages may get PR0 if they do not have any other links pointing to them.

edit_g




msg:42554
 2:35 am on Mar 4, 2004 (gmt 0)

BTW, does anyone know why Google puts a + before the URL for their bot information page?

Just a bump. I've wondered about that too - and I know that others here have as well.

bull




msg:42555
 12:23 pm on Mar 4, 2004 (gmt 0)

Strange things.
Yesterday the "new" Googlebot did a full deep crawl of one of my sites. At first I did not recognize it. Today, the "old" one came, too, picking up all the old 301 redirects (with no external links) I had to put there three months ago due to a large-scale page rename. Hopefully the bots will not interpret this as a cloaking attempt.

closed




msg:42556
 2:00 pm on Mar 4, 2004 (gmt 0)

Yeah, I see that too, bull. Today, Googlebot with the old UA got a 404 from my site because it requested a file I deleted a few months ago, as well as the links to it.

Net_Wizard




msg:42557
 2:22 pm on Mar 4, 2004 (gmt 0)

Deep crawling has started.

seofreak




msg:42558
 3:46 pm on Mar 4, 2004 (gmt 0)

One of my sites has been getting deep crawled every day since March 1. About 20 pages. <Simpsons mode: woohoo>

Anyone else also experiencing this?

plasma




msg:42559
 4:11 pm on Mar 4, 2004 (gmt 0)

@critter:

Is the request really coming from Google's IP range?

@seofreak:

We have been getting fully crawled daily for a long time, even our PR4 sites.

Critter




msg:42560
 5:46 pm on Mar 4, 2004 (gmt 0)

Hi Plasma:

Yes, I've verified that the IPs are from Google's network allocation.

Furthermore, it's kind of obvious that the new bot is not a bogus crawler because it comes from different IPs within the same (Google) netblock, something that would be almost impossible for a scammer to pull off unless they had their own netblock (and they ain't givin' out Class Cs like they used to) :)

nilloc




msg:42561
 6:27 am on Mar 5, 2004 (gmt 0)

Hi,

Just found a very strange entry in my logs.
I am certain this is NOT GOOGLE
Has anyone else seen this one?

Host: 64.68.88.152
Url: /index.html
Http Code : 200
Date: Mar 05 13:05:37
Http Version: HTTP/1.1
Size in Bytes: 16491
Referer: -
Agent: Googlebot/Test

Regards,

bull




msg:42562
 6:30 am on Mar 5, 2004 (gmt 0)

Yes, "Googlebot/Test" got two files.
The IP was 64.68.89.144.

Nilloc, the IP you mention IS Google, as well as the one I posted:

[edit] marcs was a little faster with the whois... :) [edit]

Any statement, Googleguy? Would be appreciated.

[edited by: bull at 6:35 am (utc) on Mar. 5, 2004]

marcs




msg:42563
 6:32 am on Mar 5, 2004 (gmt 0)

>I am certain this is NOT GOOGLE
Maybe it is :

[rwhois.exodus.net]
network:Class-Name:network
network:Auth-Area:0.0.0.0/0
network:Network-Name:64.68.88.0
network:IP-Network:64.68.88.0/21
network:Organization;I:Google Inc.-BGPconfig-SC3DC3
network:Name;I:Google Inc.
network:Email;I:dns-admin@GOOGLE.COM
network:Street;I:2400 E. Bayshore Pkwy
network:City;I:Mountain View , CA 94043
...
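The netblock check described in this thread can be reproduced with Python's standard `ipaddress` module. This is a minimal sketch: the `64.68.88.0/21` range is taken from the 2004 rwhois output quoted above and may not reflect Google's current allocations, and the constant and function names are hypothetical.

```python
import ipaddress

# Range copied from the rwhois output in this thread (2004); treat it as historical.
GOOGLE_NETBLOCK_2004 = ipaddress.ip_network("64.68.88.0/21")


def in_netblock(ip: str, network: ipaddress.IPv4Network = GOOGLE_NETBLOCK_2004) -> bool:
    """Return True if the dotted-quad address falls inside the given network."""
    return ipaddress.ip_address(ip) in network


if __name__ == "__main__":
    # The two addresses reported for "Googlebot/Test" in this thread.
    for ip in ("64.68.88.152", "64.68.89.144"):
        print(ip, in_netblock(ip))
```

A /21 starting at 64.68.88.0 spans 64.68.88.0 through 64.68.95.255, so both addresses reported for "Googlebot/Test" fall inside it.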

nilloc




msg:42564
 6:36 am on Mar 5, 2004 (gmt 0)

Hi,

Thanks guys.

Changing my .htaccess file back again.
Ten minutes ago I had blocked the agent Googlebot/Test.

Letting it in again now.

Regards,

GoogleGuy




msg:42565
 9:05 am on Mar 5, 2004 (gmt 0)

"Any statement, Googleguy? Would be appreciated."

Absolutely, bull. Normally I post by the seat of my pants, but I wrote this up earlier today and hadn't posted it yet:

Hey everybody, I wanted to give you a heads-up about a potential change in our user-agent name. Currently we use the user-agent
Googlebot/2.1 (+http://www.googlebot.com/bot.html)

but we're considering changing our user-agent to something like
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

The primary reason for this is that some web servers assume that unless a user agent is IE, Netscape/Mozilla, or maybe Opera, that your browser won't support JavaScript, frames, etc. As Googlebot gets better over time, it gets closer to a regular user and browser in our ability to handle features like that. This user-agent change would let vanilla webservers assume Googlebot is more like a regular user, while still allowing site owners to recognize that Google is visiting and still providing clear contact info in case of questions or issues.

We would still respect "Googlebot" in robots.txt, so the overwhelming majority of webmasters wouldn't have to change anything at all; most site owners probably wouldn't even notice the difference.

This bot that a few people noticed was a test crawl with the new user agent. It looks like things went very smoothly, but we'll still be conservative when making this change. If people have any positive or negative comments or questions, just post here. If we don't hear any showstopping objections, we'll look at making this name change on the user-agent when we're convinced everything would be a smooth transition.

thanks,
GoogleGuy
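The robots.txt point can be illustrated with Python's standard `urllib.robotparser`. This is only a sketch under assumptions: the robots.txt content and the function name are hypothetical, and it shows how a rule keyed on the token "Googlebot" is evaluated by this particular library, not how Google's crawler parses the file. The bare token is passed to the parser rather than the full user-agent header, because this library derives the name to match from the text before the first "/".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a rule addressed to the "Googlebot" token.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/
"""


def allowed_for_googlebot(path: str) -> bool:
    """Evaluate the sample rules for the 'Googlebot' token using the stdlib parser."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch("Googlebot", path)


if __name__ == "__main__":
    for path in ("/index.html", "/private/report.html"):
        print(path, allowed_for_googlebot(path))
```

Because the rule names only the token, it reads the same whichever of the two header formats the crawler sends.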

Dayo_UK




msg:42566
 9:12 am on Mar 5, 2004 (gmt 0)

Will this have an effect on how Googlebot treats Javascript and Frames?

Or is it purely a case of how the web server treats Googlebot?

Bit above me this type of talk :(

steveb




msg:42567
 9:54 am on Mar 5, 2004 (gmt 0)

"allowing site owners to recognize that Google is visiting"

Not if they have Urchin stats.

seofreak




msg:42568
 11:50 am on Mar 5, 2004 (gmt 0)

whoops

Stefan




msg:42569
 12:53 pm on Mar 5, 2004 (gmt 0)

But what is "Googlebot/Test"?

I have that in yesterday's logs from 64.68.89.144. It was getting a 406 from the server.

superba




msg:42570
 12:59 pm on Mar 5, 2004 (gmt 0)

I'm also getting 406 from 'Googlebot/Test' on pages that are fine (and already indexed for weeks.) Worrying. We haven't done a thing to them.

Ledfish




msg:42571
 1:08 pm on Mar 5, 2004 (gmt 0)

The only problem I see, GG, is that my stats vendor will have to update their software to recognize the new Googlebot identification, but hey, that's not a big deal, and I can't imagine that the negatives would outweigh the good.

So the only thing I'd say is, if it is all good, make a formal press release so that all the stats vendors will update our software so that we can know when Googlebot is visiting and can see what you have looked at without having to resort to opening our log files in something like Excel and doing a "find" to pull up the Googlebot entries.
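Until a stats package recognizes the new string, the manual "find" step can be done with a short script instead of a spreadsheet. This is a minimal sketch: the function name and the sample log lines are made up for illustration, and it simply keeps every line that mentions "googlebot" in any letter case, which covers both user-agent formats quoted in this thread.

```python
from typing import Iterable, Iterator


def googlebot_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield log lines that mention 'googlebot' anywhere, ignoring case."""
    for line in lines:
        if "googlebot" in line.lower():
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Hypothetical log lines; real logs will differ in layout.
    sample = [
        '64.68.88.152 - - "GET /index.html HTTP/1.1" 200 16491 "-" "Googlebot/Test"\n',
        '192.0.2.10 - - "GET /about.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"\n',
        '64.68.89.144 - - "GET /a.html HTTP/1.1" 200 900 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"\n',
    ]
    for hit in googlebot_lines(sample):
        print(hit)
```

An open file object can be passed in place of the sample list, since iterating over a file yields its lines.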

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved