Forum Moderators: Robert Charlton & goodroi
Anyway, this new bot (66.249.66.130) seems broken to me. It limps along taking a page or two every hour or so. Most worrying is that it seems to stall whenever it hits a 301 redirect. It will follow the redirect, then disappear for long periods.
Has anyone else seen similar behaviour from this robot? Previous Googlebots, at various other IP addresses, have always whizzed through 301s without breaking stride.
From the number of other posts reporting broadly the same problem, it looks like Google have introduced yet another problem to compound the huge problems already introduced by Big Daddy.
Not only have they thrown away countless thousands of pages by reverting to an ancient index, but they have now further sealed our fate by ensuring that affected sites will take a couple of years to crawl.
I, and I assume all of the other people reporting a problem, wouldn't be mentioning it if the new Googlebot behaviour were not markedly different from its previous behaviour.
I know how often, and at what rate, Google spiders my site (from experience garnered over several years). The point is, all of a sudden it has pretty much ground to a halt.
This didn't correspond to the roll-out of Big-Cruddy; it all started a few days ago.
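For anyone who wants to put numbers on "ground to a halt" rather than eyeball the logs, here is a rough Python sketch that counts requests per hour from a single IP and tallies the status codes (handy for spotting the 301s). It assumes an Apache-style common/combined log format; the function name and the sample log lines are made up for illustration and would need adapting to your own server's format.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the start of a common/combined log line:
# IP, timestamp, quoted request line, status code.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3})'
)

def hits_per_hour(lines, bot_ip="66.249.66.130"):
    """Count requests from one IP per hour, plus a tally of status codes."""
    per_hour = Counter()
    statuses = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m.group("ip") != bot_ip:
            continue  # skip unparseable lines and other visitors
        # Timestamps look like: 09/Apr/2006:06:05:12 +0000
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        per_hour[ts.strftime("%Y-%m-%d %H:00")] += 1
        statuses[m.group("status")] += 1
    return per_hour, statuses

# Hypothetical log lines, for illustration only.
sample_log = [
    '66.249.66.130 - - [09/Apr/2006:06:05:12 +0000] "GET /old.html HTTP/1.1" 301 0',
    '10.0.0.7 - - [09/Apr/2006:06:10:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '66.249.66.130 - - [09/Apr/2006:07:40:03 +0000] "GET /new.html HTTP/1.1" 200 2048',
]

per_hour, statuses = hits_per_hour(sample_log)
for hour, count in sorted(per_hour.items()):
    print(hour, count)
```

Running the same count over a log from a few weeks earlier gives a baseline to compare the new bot against.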
PS: Most of my pages are not in the index. Big Daddy's roll-out from an August 2005 cache resulted in tens of thousands of pages of lost content.
Same thing noticed here as well. One site that seemed to have disappeared last summer has mysteriously reappeared, complete with links from pages that have long since disappeared as well.
I can't make any sense of any of this.
<lastmod/>
<changefreq/>
I think Daemestro means that those two values in conjunction could tell Google when it's time to spider your site again.
I doubt it though. As I suspect many others do, I set changefreq to daily for a lot of my pages and skip lastmod. Something will have changed on the pages, but I don't change every bit of content every day. It's more like adding links as new pages are added and such.
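To make the two tags concrete, here is a minimal Python sketch that writes sitemap entries with an optional lastmod and changefreq. The function name, the lastmod date and the choice of the sitemaps.org 0.9 namespace are my assumptions (sitemaps generated in 2006 may declare a different namespace); the URLs are the example.com ones used in this thread.

```python
import xml.etree.ElementTree as ET

# The changefreq values the Sitemaps documentation lists as valid.
VALID_CHANGEFREQ = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}

# Assumption: the sitemaps.org 0.9 namespace.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Build sitemap XML from (loc, lastmod, changefreq) tuples.

    lastmod and changefreq may be None, in which case the tag is omitted.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, changefreq in entries:
        if changefreq is not None and changefreq not in VALID_CHANGEFREQ:
            raise ValueError(f"invalid changefreq: {changefreq!r}")
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # '&' is escaped on output
        if lastmod is not None:
            ET.SubElement(url, "lastmod").text = lastmod
        if changefreq is not None:
            ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")

sitemap_xml = build_sitemap([
    # changefreq only, lastmod skipped (as described in the post above)
    ("http://www.example.com/large/blue/index.html", None, "daily"),
    # both tags set; the date is a made-up example
    ("http://www.example.com/index.php?size=large&color=blue&id=12",
     "2006-04-09", "weekly"),
])
print(sitemap_xml)
```

Either way, both tags are only hints to the crawler, so this shows the format rather than a way to control crawl timing.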
For example, one of my sites has a combined linking structure. It includes many static urls, and the script that powers my product catalog creates dynamic urls. See below:
Static Url
http://www.example.com/large/blue/index.html
Dynamic Url
http://www.example.com/index.php?size=large&color=blue&id=12
This actually goes back to a problem that I posted about some time back. All of the dynamic urls are linked from static urls. But after complete log file analysis, I can find absolutely no reference to any Googlebot (either old, or Mozilla) visiting any of the static urls. So how on earth did Mozilla Googlebot get a listing of the dynamic urls?
At this point, Mozilla Googlebot has crawled almost my entire catalog (all dynamic urls; in no particular order that I can see), and has yet to crawl even 1 static url ...
Example of how site is being crawled:
http://www.example.com/index.php?size=large&color=blue&id=117
http://www.example.com/index.php?size=large&color=blue&id=1
http://www.example.com/index.php?size=large&color=blue&id=57
http://www.example.com/index.php?size=large&color=blue&id=12
...
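If you want to repeat that check on your own logs, a rough Python sketch along these lines splits the crawled URLs into static and dynamic and pulls out the id values in crawl order. It treats any URL with a query string as dynamic and assumes the id parameter is numeric; the function names and the sample list are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

def split_static_dynamic(urls):
    """Return (static, dynamic), treating any URL with a query string as dynamic."""
    static, dynamic = [], []
    for url in urls:
        (dynamic if urlsplit(url).query else static).append(url)
    return static, dynamic

def crawled_ids(urls, param="id"):
    """Return the numeric value of `param` for each URL that has one, in crawl order."""
    ids = []
    for url in urls:
        values = parse_qs(urlsplit(url).query).get(param)
        if values:
            ids.append(int(values[0]))  # assumes the id is an integer
    return ids

# Hypothetical list of URLs requested by the bot, in log order.
crawled = [
    "http://www.example.com/index.php?size=large&color=blue&id=117",
    "http://www.example.com/index.php?size=large&color=blue&id=1",
    "http://www.example.com/large/blue/index.html",
    "http://www.example.com/index.php?size=large&color=blue&id=57",
    "http://www.example.com/index.php?size=large&color=blue&id=12",
]

static, dynamic = split_static_dynamic(crawled)
ids = crawled_ids(crawled)
print(len(static), "static,", len(dynamic), "dynamic; ids in crawl order:", ids)
```

An empty static list combined with ids in no obvious order would match what is described above.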
BTW, I have manually checked all static urls to make sure they are live. And the catalog script is custom, so I do not believe that the spiders are following any sort of footprint. As a matter of fact, in the time that this has been going on, the spiders have not hit a single "404" dynamic page (meaning that it can't be using a random list of numbers). As well, I have confirmed (via reverse IP tracing) that all spiders are genuine Mozilla Googlebot (all with the same IP).
[edited by: tedster at 6:05 am (utc) on April 9, 2006]
[edit reason] use example.com [/edit]
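On confirming that the visitor really is Googlebot: one common way is a forward-confirmed reverse DNS lookup, sketched below in Python. This is not necessarily the exact method used in the post above. It assumes genuine crawler hostnames end in googlebot.com or google.com, and the function name is made up. The lookups are passed in as parameters so the logic can be tried without network access.

```python
import socket

# Assumption: genuine Googlebot hosts resolve to names under these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_genuine_googlebot(ip,
                         reverse_lookup=socket.gethostbyaddr,
                         forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP.

    1. Reverse-resolve the IP to a hostname.
    2. Check the hostname is under an expected Google domain.
    3. Forward-resolve that hostname and confirm it maps back to the IP.
    """
    try:
        host = reverse_lookup(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward_lookup(host)[2]
    except OSError:
        return False
    return ip in addresses

if __name__ == "__main__":
    # Performs live DNS lookups; the result depends on current DNS records.
    print(is_genuine_googlebot("66.249.66.130"))
```

The forward step matters because anyone who controls the reverse DNS for their own IP range can make it claim a Google hostname.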
Static pages on the site are doing all right and are ranking well; it is only the dynamic Perl-based CMS content that is displaying the issues I mentioned.
I suspect that the CMS may have had some server issues at the time of the crawler visit, as I noticed some error messages when I checked the cached version of the affected pages on Google (some of them, although not most, also have an old cache date going back to mid-2005).
Can someone shed some light on what might have just happened? Assuming it was a server issue at the time of the crawler visit, how long would it take for Google to fully re-index the site from scratch? Mind you, the site was doing perfectly fine up to only two weeks ago, with both static and dynamic pages indexed fine (titles, body snippets etc.).
So many questions, I know. I just wish someone with some experience of this kind of issue could comment. I will also be able to sticky the URL if needed.
P.S.
Site is doing fine in Yahoo and MSN. It is only Google that seems to display the crawling issues I mentioned.
How frequently the page is likely to change. This value provides general information to search engines and may not correlate exactly to how often they crawl the page. Valid values are:
always
hourly
daily
weekly
monthly
yearly
never
The value "always" should be used to describe documents that change each time they are accessed. The value "never" should be used to describe archived URLs.
Please note that the value of this tag is considered a hint and not a command...
[google.com...]
So <changefreq>hourly</changefreq> would be a clue to Googlebot, but using it wouldn't actually force any particular crawl frequency.