Forum Moderators: Robert Charlton & goodroi
What's the significance of this? Googlebot *and* Mozilla/5.0?
You should check out this thread [webmasterworld.com] for more info on that UA (including comments from GoogleGuy on the subject). If that's not enough, use Google to search WW for "Mozilla/5.0" & "Googlebot" [google.com].
Talking about spiders, I noticed that Google did a deep crawl of my site yesterday. Based on my experience and the time of month, I would expect some kind of update over the weekend - fourth of July and all here in the US...
Bourbon is the wild card. Things just settled down. Should be interesting. That's why I'm here, it's certainly not for the money - yet.
I must say my site was hijacked and I have been trying to get it back, but nothing has changed yet.
The Mozilla/5.0 version does seem to have some problems with 302 redirects. A page that is part of a webring is getting hit multiple times a day and getting a 200 response rather than a 304.
In fact I've not seen Mozilla/5.0 googlebot get a 304 response.
Has anyone seen this bot get a 304 response?
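For anyone checking their own logs: a server normally answers 304 only when the request carries a validator such as If-Modified-Since and the page has not changed since that date. Whether the Mozilla/5.0 Googlebot sends that header is not established in this thread, so the following is only a minimal sketch of the usual server-side decision. The function name and its parameters are made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def status_for_request(last_modified, if_modified_since=None):
    """Return 304 if the client's cached copy is still current, else 200.

    last_modified: timezone-aware datetime of the page's last change.
    if_modified_since: raw If-Modified-Since header value, or None.
    """
    # No validator in the request: the server has to send the full page.
    if if_modified_since is None:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # Unparseable date: ignore the header and serve the page.
        return 200
    if since.tzinfo is None:
        # Treat a date without an offset as UTC.
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200
```

If a bot never sends the header, a server following this logic would return 200 every time, which would match what is described above, but that is only one possible explanation.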
He he - and being able to add pages to the index would be a good one to sort out.
Edit:-
Just to add to my thoughts on this bot - I am starting to think it might be the brains behind the normal Googlebot.
I think it might look for 404s, new urls, might be even the bot that works out the backlinks, page rank - but it does not seem to add pages.
I also see only more supplemental result pages from 2004 showing up, and no real pages like the ones the normal Googlebot crawls.
I posted a couple of days ago about this bot grabbing .js files (the first time I've seen this).
This bot has been crawling my site on a regular basis since Dec 2004.
My gut feeling tells me this bot looks for CSS spamming techniques such as negative positioning, display:none attributes, etc.
I also think it looks for similar JavaScript techniques, or the combination of both.
To put the cat among the pigeons, I'd say it may have contributed to some of the 302 hijack problems.
"That spam penalty causes the PageRank of a site to decrease. Since one of the heuristics to pick a canonical site was to take PageRank into account, the declining PageRank of a site was usually the root cause of the problem" googleguy.
As well as others.
>I am starting to think it might be the brains behind the normal Googlebot
I think you are right dayo.
Dijkgraaf
I can't answer your 304 question but thanks for answering my post on the other forum. :)
Everything above IMHO
I don't think it's a special JavaScript/CSS bot. I think it has more to do with the supplemental results, because I had NEVER seen it before I got hijacked. Back then there was only Googlebot, which I don't see that often now, only the Mozilla Googlebot.
Just to clear things up. I wasn't directly referencing your 302 prob.
The fact that I said this bot is fetching JS files might be misleading too.
"The primary reason for this is that some web servers assume that unless a user agent is IE, Netscape/Mozilla, or maybe Opera, that your browser won't support JavaScript, frames, etc. As Googlebot gets better over time, it gets closer to a regular user and browser in our ability to handle features like that" Google guy.
As a simple example, what happens if you have this on your page:
document.write("hello") or document.write(varFromJSFile)
and therefore break Google's golden rule against presenting Google with different content from what you present the user?
There have been other discussions about this bot but no-one seems to know for sure what it does.
I did a longish thread in supporters but no-one seemed interested.
If I could have gone to New Orleans I would have asked a G engineer about this bot.
Is the bot only active on filtered sites?
Is the bot only active when a site has a lot of supplemental results? It did add a lot of supplemental results to my site:mydomain.com, but not lately.
I'm just trying to figure out what this bot does beyond fetching JS, which does not show in IE or whatever.
It may be that google are staggering a full implementation of the bot.
I noticed it Nov/Dec 2004 but others had noticed it before that. Take a look at this thread [webmasterworld.com], particularly pipster2004's post.
"I did a longish thread in supporters but no-one seemed interested"
I find it odd too, and it's not confined to WebmasterWorld: the rest of the SEO community doesn't seem to care beyond a passing interest. It just seems very significant to me.
Going back to what you said about it possibly being "the brains behind the normal Googlebot"... I always had the feeling that Mozilla fetched a page before the normal Googlebot. So I checked my logs for this month on a deep page:
04/06/2005 23:15:33 GET /example.htm Googlebot/2.1 ( [google.com...]
11/06/2005 12:02:18 GET /example.htm Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...]
14/06/2005 08:17:33 GET /example.htm Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...]
15/06/2005 20:48:31 GET /example.htm Googlebot/2.1 ( [google.com...]
17/06/2005 22:41:10 GET /example.htm Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...]
21/06/2005 20:55:16 GET /example.htm Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...]
22/06/2005 14:09:09 GET /example.htm Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...]
23/06/2005 10:34:28 GET /example.htm Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...]
27/06/2005 20:32:03 GET /example.htm Googlebot/2.1 ( [google.com...]
I don't know if some kind of comparison is involved or if this just signifies the demise of Googlebot/2.1 :o
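For anyone wanting to run the same comparison over a whole log, here is a minimal sketch that tells the two variants apart by the start of the User-Agent string and tallies them. The function names and labels are made up for illustration, and the sample strings are only the leading parts visible in the truncated log lines above.

```python
from collections import Counter

def classify_googlebot(user_agent):
    """Label a User-Agent string as one of the two Googlebot variants."""
    ua = user_agent.strip()
    if "Googlebot" not in ua:
        return "other"
    # The newer variant wraps the Googlebot token in a browser-style prefix.
    if ua.startswith("Mozilla/5.0"):
        return "mozilla-googlebot"
    return "classic-googlebot"

def tally(user_agents):
    """Count hits per variant."""
    return Counter(classify_googlebot(ua) for ua in user_agents)

# User-Agent prefixes in the order of the nine log lines above
# (the originals are truncated, so only the leading part is used).
CLASSIC = "Googlebot/2.1 ("
MOZILLA = "Mozilla/5.0 (compatible; Googlebot/2.1;"
hits = [CLASSIC, MOZILLA, MOZILLA, CLASSIC, MOZILLA,
        MOZILLA, MOZILLA, MOZILLA, CLASSIC]
counts = tally(hits)
print(counts["mozilla-googlebot"], counts["classic-googlebot"])  # 6 3
```

On that one deep page, that is six Mozilla fetches against three from the normal bot for the month. It is a small sample, so it does not prove anything about fetch order.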
On a side note I also remember google guy in response to someone asking if they should ban the bot saying he didn't think this was a good idea.
It has hit the site 25,088 times so far this month (comparative figures for the normal G-bot are 1,730 times).
I call it the snippet-eater, since it adds nothing to mysite-SERPs (unlike the normal G-bot) but crawls all over my URL-only pages.
Finally, it is part-responsible for a swollen 404-file, requesting the most fantastic URLs.
If there was a way to discriminate, I would ban this b*stard in the robots.txt.
He he - yep that was me :)
Trying to get some info from GG
The damn thing has been hitting my site at up to 3 times/sec.
So, be warned: either let G hit your site as hard and as fast as it wants, or suffer a drought.
I guess I should have expected it. After all, G is very young, and you can hardly expect responsible, measured actions from the young, can you?
I think that you are being a little harsh.
Many, many years ago when I had a 64kbps leased line to the Net the Googlebot managed to crush my site by having huge numbers of simultaneous connections open for image downloads so that even I could not get in locally, and it was using nearly every drop of Net bandwidth!
I wrote them a polite email asking them to limit the load but NOT stop spidering and they did so almost immediately and I don't think that there has ever been a problem with their bot since. And I have stayed in the SERPS, though there is a much longer lag on images getting in than pages.
Rgds
Damon
I think that you are being a little harsh ... the Googlebot managed to crush my site ... so that even I could not get in locally
There is an unwritten contract between webmasters and the Search-Engines: we let them run their bots all over our sites (which costs us money) and they give us fresh visitors from their SERPs. My own research (msg#7) [webmasterworld.com] (also msg#59+60 [webmasterworld.com]) indicates that the Mozzie-bot adds nothing to the SERPs, which breaks that contract. Because of that, I'm fine with no visits from that particular bot, yet really pissed-off that the "standard" bot has also reduced its rate. It has actually slowed even more. Here are comparative figures for the first 2 days of July:
Bot visits from 01-Jul-2005 00:00 to 03-Jul-2005 04:03:-
Inktomi Slurp : 2082+95
Google AdSense : 867+2
OmniExplorer : 516+1
GigaBot : 292+94
MSNBot : 369+10
Grub.org : 51+1
BecomeBot : 31+6
Googlebot HTTP/1.0 : 32+4
findlinks : 31
BSpider : 0+8
Others : 15+12
(Numbers after + are successful hits on "robots.txt" files.)
<headers>
<IP_ADDRESS>66.249.65.162</IP_ADDRESS>
<TIME>20050705092734</TIME>
<PATH_INFO>**REMOVED**</PATH_INFO>
<HTTP_CONNECTION>Keep-alive</HTTP_CONNECTION>
<HTTP_ACCEPT>*/*</HTTP_ACCEPT>
<HTTP_ACCEPT_ENCODING>gzip</HTTP_ACCEPT_ENCODING>
<HTTP_FROM>googlebot(at)googlebot.com</HTTP_FROM>
<HTTP_HOST>**REMOVED**</HTTP_HOST>
<HTTP_USER_AGENT>Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)</HTTP_USER_AGENT>
</headers>
<headers>
<IP_ADDRESS>66.249.65.162</IP_ADDRESS>
<TIME>20050705103219</TIME>
<PATH_INFO>**REMOVED (but same as above)**</PATH_INFO>
<HTTP_CONNECTION>Keep-alive</HTTP_CONNECTION>
<HTTP_ACCEPT>*/*</HTTP_ACCEPT>
<HTTP_ACCEPT_ENCODING>gzip</HTTP_ACCEPT_ENCODING>
<HTTP_FROM>googlebot(at)googlebot.com</HTTP_FROM>
<HTTP_HOST>**REMOVED**</HTTP_HOST>
<HTTP_USER_AGENT>Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)</HTTP_USER_AGENT>
</headers>
---
The time is in YYYYMMDDHHMMSS format. Any clues as to why the same bot from the same IP will grab the same URL twice in just over an hour? Perhaps it's checking for dynamically changing content?
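To put an exact figure on "just over an hour", here is a small sketch that parses the two 14-digit TIME values from the header dumps above and subtracts them. The helper name is made up for illustration.

```python
from datetime import datetime

def parse_log_time(stamp):
    """Parse a 14-digit YYYYMMDDHHMMSS stamp into a datetime."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")

# The two TIME values from the header dumps above.
first = parse_log_time("20050705092734")
second = parse_log_time("20050705103219")

gap = second - first
print(gap)                  # 1:04:45
print(gap.total_seconds())  # 3885.0
```

So the second fetch came 1 hour, 4 minutes and 45 seconds after the first.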
I am starting to think it might be the brains behind the normal Googlebot ... it might look for 404s, new urls, might be even the bot that works out the backlinks, page rank - but it does not seem to add pages.
I have pages that were only visited by the Mozilla Googlebot ... they are appearing as "Supplemental Result"
Extra for #5:
After the M_Bot stopped hitting my site (28 Jun, msg 19) (June avg: 836/day) the G_Bot has also started to dry up:
After removing the noarchive meta, things gradually wound down to normal, and the Mozilla version is now only an occasional visitor. I assumed that my adding noarchive suggested in some way that I was cloaking, and the Mozilla version was sniffing for obvious discrepancies.
Re #2 No, I have supplemental results showing both title and snippets
G_Bot only visits URLs with single parameters at this time
The G_Bot does seem to choke at 3 parameters but, interestingly, the M_Bot will handle more than 2:
Re #4 I hope not, that would cause it to bounce back and forwards....which is exactly my experience on a site:my-site.com search.
Ok, I might not have noticed G_Bot visiting two parameter pages as I might not have any, and as you say it seems to be a fairly new development.
It doesn't sound good that M_Bot and G_Bot are overriding each other's results. It sounds like a recipe for chaos; in fact, that might explain a lot of people's problems.