Faulty Googlebot?

ClintFC

1:56 am on Apr 7, 2006 (gmt 0)

A few days ago a new Googlebot (i.e. one I've not seen before) started crawling my site. Looks like this was some kind of handover, because the Googlebot that used to crawl it stopped at the same time.

Anyway, this new bot (66.249.66.130) seems broken to me. It limps along, taking a page or two every hour or so. Most worrying is that it seems to stall whenever it hits a 301 redirect. It will follow the redirect, then disappear for long periods.

Has anyone else seen similar behaviour from this robot? Previous Googlebots, at various other IP addresses, have always whizzed through 301s without breaking stride.

ClintFC

1:47 pm on Apr 7, 2006 (gmt 0)

Looks like this problem is not related to 301s after all. I removed the 301s as an experiment, and this particular Googlebot still stalls after each and every page. So, I am now being crawled at a rate of 1 page every 30 minutes or so.
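
For anyone wanting to put a number on a crawl rate like this, here is a minimal sketch that measures the gaps between one bot's requests in an access log. It assumes the common Apache "combined" log format, which is not stated in this thread; the function name, the regex, and the sample lines are made up for illustration.

```python
import re
from datetime import datetime

# Assumption: Apache/NGINX "combined" log format. Adjust the pattern otherwise.
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|HEAD) (\S+)[^"]*" (\d{3})'
)

def crawl_gaps(lines, bot_ip):
    """Return the gaps, in minutes, between consecutive requests from bot_ip."""
    times = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group(1) == bot_ip:
            # Timestamp looks like 07/Apr/2006:01:00:00 +0000
            times.append(datetime.strptime(m.group(2), "%d/%b/%Y:%H:%M:%S %z"))
    times.sort()
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]

# Hypothetical sample lines, not taken from any real log.
sample = [
    '66.249.66.130 - - [07/Apr/2006:01:00:00 +0000] "GET /a.html HTTP/1.1" 200 512 "-" "Googlebot"',
    '10.0.0.1 - - [07/Apr/2006:01:10:00 +0000] "GET /b.html HTTP/1.1" 200 512 "-" "Other"',
    '66.249.66.130 - - [07/Apr/2006:01:30:00 +0000] "GET /c.html HTTP/1.1" 301 0 "-" "Googlebot"',
    '66.249.66.130 - - [07/Apr/2006:02:15:00 +0000] "GET /d.html HTTP/1.1" 200 512 "-" "Googlebot"',
]
gaps = crawl_gaps(sample, "66.249.66.130")
print(gaps)
```

Comparing the gaps after a 301 response with the gaps after a 200 would show whether redirects really make a difference.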

From the number of other posts reporting broadly the same problem, it looks like Google have introduced yet another problem to compound the huge problems already introduced by Big Daddy.

Not only have they thrown away countless thousands of pages by reverting to an ancient index, but they have now further sealed our fate by ensuring that affected sites will take a couple of years to crawl.

trinorthlighting

1:51 pm on Apr 7, 2006 (gmt 0)

Does the content on these pages change often? If it only changes once a month, then Googlebot only needs to visit once a month. If it changes daily, then you might have an issue. As long as the pages are in the index and the content rarely changes, I would not worry about it.

ClintFC

2:43 pm on Apr 7, 2006 (gmt 0)

The content changes every day.

I, and I assume all of the other people reporting a problem, wouldn't be mentioning it if the new Googlebot behaviour was not markedly different from its previous behaviour.

I know how often, and at what rate, Google spiders my site (from experience garnered over several years). The point is, all of a sudden it has pretty much ground to a halt.

This didn't correspond to the roll-out of Big-Cruddy, it all started a few days ago.

ClintFC

2:45 pm on Apr 7, 2006 (gmt 0)

PS: Most of my pages are not in the index. Big Daddy's roll-out from an August 2005 cache, resulted in tens of thousands of pages of lost content.

angiolo

2:59 pm on Apr 7, 2006 (gmt 0)

I see a few sites that went supplemental (and recovered) and are now getting fewer Googlebot visits...

bumpski

6:57 pm on Apr 7, 2006 (gmt 0)

One of the parameters in sitemap.xml is the target crawl rate. Maybe the new bot is taking this seriously, defaulting to a large value, if you did not specify a value, but do have a sitemap.xml file? Just a farfetched possibility. Something to revise "just in case"?

catch2948

7:08 pm on Apr 7, 2006 (gmt 0)

"PS: Most of my pages are not in the index. Big Daddy's roll-out from an August 2005 cache, resulted in tens of thousands of pages of lost content."

Same thing noticed here as well. One site that seemed to have disappeared last summer has mysteriously reappeared, complete with links from pages that have long since disappeared as well.

I can't make any sense of any of this.

b2net

7:33 pm on Apr 7, 2006 (gmt 0)

Since the new Mozilla bot took over and the old Googlebot was retired, new pages and new domains take forever to get crawled and indexed.

I'm seeing Googlebot visit 90% less than before on all my domains.

ClintFC

10:36 pm on Apr 7, 2006 (gmt 0)

"One of the parameters in sitemap.xml is the target crawl rate."

Are you sure? I know MSN have something like this, but I've never seen any mention of any kind of target crawl rate for Google.

Demaestro

10:56 pm on Apr 7, 2006 (gmt 0)

"One of the parameters in sitemap.xml is the target crawl rate."

"Are you sure?"

It is there in the Google sitemap XML file that you can submit using the Google Sitemaps tool.

ClintFC

12:20 am on Apr 8, 2006 (gmt 0)

Could someone please elaborate on this.

The only extra tags that Google mention are:

Where is this mythical "target crawl rate"? What, specifically, is it called? And where is it mentioned?

Nikke

12:36 am on Apr 8, 2006 (gmt 0)

<lastmod/>
<changefreq/>

I think Demaestro means that those two values in conjunction could tell Google when it's time to spider your site again.

I doubt it though. As I suspect many others do, I set changefreq to daily for a lot of my pages and skip lastmod. Something will have changed on the pages, but I don't change every bit of content every day. It's more like adding links as new pages are added and such.

catch2948

2:21 am on Apr 8, 2006 (gmt 0)

Has anyone else noticed Mozilla Googlebot ignoring static urls, and going through dynamic urls only?

For example, one of my sites has a combined linking structure. It includes many static urls, and the script that powers my product catalog creates dynamic urls. See below:

Static Url
http://www.example.com/large/blue/index.html

Dynamic Url
http://www.example.com/index.php?size=large&color=blue&id=12

This actually goes back to a problem that I posted about some time back. All of the dynamic urls are linked from static urls. But after complete log file analysis, I can find absolutely no reference to any Googlebot (either old, or Mozilla) visiting any of the static urls. So how on earth did Mozilla Googlebot get a listing of the dynamic urls?

At this point, Mozilla Googlebot has crawled almost my entire catalog (all dynamic urls; in no particular order that I can see), and has yet to crawl even 1 static url ...

Example of how site is being crawled:

http://www.example.com/index.php?size=large&color=blue&id=117
http://www.example.com/index.php?size=large&color=blue&id=1
http://www.example.com/index.php?size=large&color=blue&id=57
http://www.example.com/index.php?size=large&color=blue&id=12
...
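
A rough sketch of how crawled URLs like these could be split into static and dynamic, in case anyone wants to repeat the log check. Treating "has a query string" as "dynamic" is a simplifying assumption, and the function names are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

def classify(url):
    """Label a URL 'dynamic' if it carries a query string, else 'static'."""
    return "dynamic" if urlsplit(url).query else "static"

def summarize(urls):
    """Count static vs dynamic URLs and collect the 'id' values that were crawled."""
    counts = {"static": 0, "dynamic": 0}
    ids = []
    for url in urls:
        kind = classify(url)
        counts[kind] += 1
        if kind == "dynamic":
            # parse_qs maps each parameter name to a list of string values
            ids.extend(parse_qs(urlsplit(url).query).get("id", []))
    return counts, ids

crawled = [
    "http://www.example.com/index.php?size=large&color=blue&id=117",
    "http://www.example.com/index.php?size=large&color=blue&id=1",
    "http://www.example.com/large/blue/index.html",
]
counts, ids = summarize(crawled)
print(counts, ids)
```

Feeding it only the URLs requested by the bot's IP would show at a glance whether any static page was ever fetched.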

BTW, I have manually checked all static urls to make sure they are live. And the catalog script is custom, so I do not believe that the spiders are following any sort of footprint. As a matter of fact, in the time that this has been going on, the spiders have not hit a single "404" dynamic page (meaning that it can't be using a random list of numbers). As well, I have confirmed (via reverse IP tracing) that all spiders are genuine Mozilla Googlebot (all with the same IP).
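
For reference, the reverse lookup check mentioned above can be sketched roughly like this: reverse-resolve the IP, check the host name's domain, then resolve that host name forward and confirm it maps back to the same IP. The accepted domain suffixes are an assumption based on Google's general guidance, not something stated in this thread, and the function name is hypothetical. The resolvers are passed in as parameters so the logic can be tried without touching the network.

```python
import socket

def is_genuine_googlebot(ip,
                         reverse=socket.gethostbyaddr,
                         forward=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the host's domain, then forward-confirm it."""
    try:
        host = reverse(ip)[0]           # (hostname, aliases, addresses)
    except OSError:                     # covers socket.herror / socket.gaierror
        return False
    # Assumed suffixes; adjust to whatever Google currently documents.
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in forward(host)[2]   # forward lookup must include the same IP
    except OSError:
        return False

# Live usage (needs DNS access): is_genuine_googlebot("66.249.66.130")
```

The forward confirmation matters because anyone who controls the reverse DNS for their own IP range can make it claim to be a googlebot.com host.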

[edited by: tedster at 6:05 am (utc) on April 9, 2006]
[edit reason] use example.com [/edit]

Web_speed

3:58 am on Apr 8, 2006 (gmt 0)

One of my clients has a Perl-based CMS. It is a niche (biblical type) site with a wealth of same-topic articles and information. The crawler visited this site around mid March and for some reason was unable to grab the document titles or any snippets from the documents' bodies. In short, more than 350 unique articles, found almost nowhere else, just went belly up in the index. The pages are listed with the domain name as the title and a snippet from the meta tags (same snippet for all). They are also listed as supplemental now.

Static pages on the site are doing all right and are ranking well; it is only the dynamic Perl-based CMS content that is displaying the issues I mentioned.

I suspect that the CMS may have had some server issues at the time of the crawler visit, as I noticed some error messages when I checked the cached version of the affected pages on Google (some of them, although not most, also have an old cache date going back to mid 2005).

Can someone shed some light on what might have just happened? Assuming it was a server issue at the time of the crawler visit, how long would it take for Google to fully re-index the site from scratch? Mind you, the site was doing perfectly fine up to only two weeks ago, with both static and dynamic pages indexed fine (titles, body snippets etc.).

So many questions... I know. I just wish someone with some experience of such an issue could comment. I will also be able to sticky the URL if needed.

P.S.
Site is doing fine in Yahoo and MSN. It is only Google that seems to display the crawling issues I mentioned.

tedster

6:15 am on Apr 9, 2006 (gmt 0)

Google doesn't use the exact term "target crawl rate" -- but the idea is what is behind <changefreq>

How frequently the page is likely to change. This value provides general information to search engines and may not correlate exactly to how often they crawl the page. Valid values are:
always
hourly
daily
weekly
monthly
yearly
never
The value "always" should be used to describe documents that change each time they are accessed. The value "never" should be used to describe archived URLs.
Please note that the value of this tag is considered a hint and not a command...
[google.com...]
So <changefreq>hourly</changefreq> would be a clue to Googlebot -- but using it wouldn't actually force any particular crawl rate frequency.
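
To make that concrete, here is a minimal sketch that writes a sitemap with <changefreq> (and optionally <lastmod>) per URL. The namespace used is the sitemaps.org 0.9 one, which is an assumption on my part: the Google Sitemaps schema in use at the time of this thread may differ. The function name and the example entries are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: sitemaps.org 0.9 namespace, not necessarily the schema of this thread's era.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}

def build_sitemap(entries):
    """entries: iterable of (loc, changefreq, lastmod_or_None) tuples."""
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, changefreq, lastmod in entries:
        if changefreq not in VALID_CHANGEFREQ:
            raise ValueError(f"invalid changefreq: {changefreq!r}")
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")

xml_text = build_sitemap([
    ("http://www.example.com/large/blue/index.html", "daily", None),
    ("http://www.example.com/index.php?size=large&color=blue&id=12", "weekly", "2006-04-07"),
])
print(xml_text)
```

Whatever values go in, they remain a hint to the crawler, as the quoted documentation says.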