I know there have been a few topics about this one, but I can't figure out what the Mozilla/5.0 Google spider actually does.
Pages it spiders aren't getting indexed, so it's not a regular spider; it's meant specifically for something else. Cloaking detection could be one purpose, but I can't see why they would build a whole new spider just to detect cloaked pages. Besides, even if SEOs cloak pages, they would know by now that there is a spider going around detecting it, so the cloaking would be done differently, undetectably.
Maybe the spider is used for analysing page content? That could be for future use (LSA?), to detect duplicate content, or maybe to see whether pages in a link structure contain mostly the same content.
I wonder what your thoughts and findings are, because this spider seems smarter to me than we might think it is...
I'm pretty sure that I read something from Google (possibly GoogleGuy) saying that it had something to do with Javascript. His explanation sounded very dubious as I remember, like it was some kind of cover story. Unfortunately I can't find the link.
I don't believe that it grabs .js files, so I can't see it being a fully Javascript-compatible web crawler.
I suspect it has something to do with gathering data for a forthcoming major update to Google.
Could well be. I also haven't seen it requesting .js files, so I'm not worried about that. It follows links between <script></script> tags, but the normal Googlebot does that too, so that's nothing to worry about.
But I also suspect it's analysing content much more than the normal Googlebot does. I have a few pages on my site which don't have much content on them but are used as SEO pages. Nothing bad in them, just optimised for some keywords.
The Mozilla spider visited those pages in January, and that's actually the last visit I've had from Google. Maybe it determines the 'usefulness' of a page for the actual spider, based on the content it analyses?
- It follows most redirects (at least on-site 301s) almost immediately
- It doesn't seem to remember those redirects (e.g. there are a number of redirected URLs on our site that it keeps re-visiting almost daily; see the log-parsing sketch after this list)
- It's not feeding the live index/cache
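If you want to see whether the bot behaves the same way in your own logs, here's a minimal sketch. It assumes an Apache combined-format access log and that the bot identifies itself with a Mozilla/5.0 user agent containing "Googlebot"; the log path and both patterns are assumptions you'd adjust for your own server.

```python
import re
from collections import Counter

# Assumed Apache combined-format access log; adjust for your server.
LOG_FILE = "access.log"

# Assumed user-agent signature of the new crawler.
UA_PATTERN = re.compile(r"Mozilla/5\.0.*Googlebot", re.IGNORECASE)

# Combined log format: IP - - [time] "METHOD URL PROTO" STATUS SIZE "REFERER" "UA"
LINE_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)

hits = Counter()        # how often each URL is requested by this bot
redirected = Counter()  # URLs answering 301/302 that it keeps re-crawling anyway

with open(LOG_FILE) as log:
    for line in log:
        m = LINE_PATTERN.match(line)
        if not m or not UA_PATTERN.search(m.group("ua")):
            continue
        hits[m.group("url")] += 1
        if m.group("status") in ("301", "302"):
            redirected[m.group("url")] += 1

print("Most-requested URLs by the Mozilla/5.0 bot:")
for url, count in hits.most_common(10):
    print(f"{count:5d}  {url}")

print("\nRedirected URLs it keeps re-visiting:")
for url, count in redirected.most_common(10):
    print(f"{count:5d}  {url}")
```

If the second list keeps growing day after day, that matches the "doesn't remember redirects" observation above.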
I half wonder if part of my problem is related to this little bot picking up the constant changes I had been making, perhaps kicking in some kind of cloaking penalty for the site. Not sure, but I have not read a good explanation for its visits.
I too have a dynamic homepage which changes as content is added.
It has no presence in Google SERPs, even though I'm linked to from a PR5 site, have AdSense on the site, and have active AdWords campaigns running. My site has been online in earnest and constantly worked on for about 5 months.
I have a feeling that this particular Bot is crawling my site because of the changing links on the homepage and the ongoing refinement.
In the last couple of days, I've made some changes in how those links are archived. I hope that'll help it get crawled and listed more effectively.
It not only ignores robots.txt, it actually goes after everything listed as Disallow first.
What? Any other confirmations of this?
Does this not go against the spirit of "do no evil"?
Almost always when someone posts that Googlebot isn't obeying robots.txt, it's one of the following:
1. The robots.txt is incorrect in some way.
2. It wasn't Google; it was a forged user agent, and the site owner didn't bother to reverse-DNS the IP.
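Both of those are easy to check for yourself. Below is a minimal Python sketch: the first function does the forward-confirmed reverse-DNS test (reverse-resolve the IP, require a googlebot.com or google.com hostname, then forward-resolve it to confirm it maps back to the same IP), and the second shows how a standard parser reads your robots.txt. The IP and URLs are placeholders; substitute values from your own access log.

```python
import socket
from urllib.robotparser import RobotFileParser

def is_real_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS check for a visitor claiming to be Googlebot."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # 1. reverse-resolve the IP
    except socket.herror:
        return False  # no reverse record at all -> treat as forged
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False  # reverse record isn't in Google's crawler domains
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)  # 2. forward-resolve
    except socket.gaierror:
        return False
    return ip in forward_ips  # 3. forward lookup must point back at the same IP

def robots_allows(robots_url: str, user_agent: str, page_url: str) -> bool:
    """See how a standard parser interprets your robots.txt (check #1 above)."""
    rp = RobotFileParser(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, page_url)

# Placeholder values -- swap in an IP and URL taken from your own log:
print(is_real_googlebot("66.249.66.1"))
print(robots_allows("https://www.example.com/robots.txt",
                    "Googlebot",
                    "https://www.example.com/some/disallowed/page.html"))
```

If the reverse-DNS check fails, the "Googlebot" in your log was somebody else wearing its user agent, and nothing it did says anything about Google.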