Faulty Googlebot?

1:56 am on Apr 7, 2006 (gmt 0)

Junior Member

5+ Year Member

joined:Mar 23, 2006
posts:88
votes: 0


A few days ago a new Googlebot (i.e. one I've not seen before) started crawling my site. Looks like this was some kind of handover, because the Googlebot that used to crawl it stopped at the same time.

Anyway, this new bot (66.249.66.130) seems broken to me. It limps along, taking a page or two every hour or so. Most worrying is that it seems to stall whenever it hits a 301 redirect: it will follow the redirect, then disappear for long periods.

Has anyone else seen similar behaviour from this robot? Previous Googlebots, at various other IP addresses, have always whizzed through 301s without breaking stride.

1:47 pm on Apr 7, 2006 (gmt 0)

Junior Member

5+ Year Member

joined:Mar 23, 2006
posts:88
votes: 0


Looks like this problem is not related to 301s after all. I removed the 301s as an experiment, and this particular Googlebot still stalls after each and every page. So, I am now being crawled at a rate of 1 page every 30 minutes or so.

From the number of other posts reporting broadly the same problem, it looks like Google have introduced yet another problem on top of the huge ones Big Daddy already caused.

Not only have they thrown away countless thousands of pages by reverting to an ancient index, but they have now further sealed our fate by ensuring that affected sites will take a couple of years to crawl.

1:51 pm on Apr 7, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Jan 5, 2006
posts:2094
votes: 2


Does the content on these pages change often? If it only changes once every month, then Googlebot only needs to visit once a month. If it changes daily, then you might have an issue. As long as the pages are in the index and the content rarely changes, I would not worry about it.

2:43 pm on Apr 7, 2006 (gmt 0)

Junior Member

5+ Year Member

joined:Mar 23, 2006
posts:88
votes: 0


The content changes every day.

I, and I assume all of the other people reporting a problem, wouldn't be mentioning it if the new Googlebot behaviour was not markedly different from its previous behaviour.

I know how often, and at what rate, Google spiders my site (from experience garnered over several years). The point is, all of a sudden it has pretty much ground to a halt.

This didn't correspond to the roll-out of Big-Cruddy; it all started a few days ago.

2:45 pm on Apr 7, 2006 (gmt 0)

Junior Member

5+ Year Member

joined:Mar 23, 2006
posts:88
votes: 0


PS: Most of my pages are not in the index. Big Daddy's roll-out from an August 2005 cache resulted in tens of thousands of pages of lost content.

2:59 pm on Apr 7, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 23, 2000
posts:1186
votes: 0


I see a few sites that went supplemental (and recovered) and are now getting fewer Googlebot visits...

6:57 pm on Apr 7, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 13, 2004
posts:801
votes: 2


One of the parameters in sitemap.xml is the target crawl rate. Maybe the new bot is taking this seriously and defaulting to a large interval if you have a sitemap.xml file but did not specify a value? Just a far-fetched possibility, but something to check "just in case"?

7:08 pm on Apr 7, 2006 (gmt 0)

Full Member

10+ Year Member

joined:Apr 25, 2003
posts:204
votes: 0


[quote] PS: Most of my pages are not in the index. Big Daddy's roll-out from an August 2005 cache resulted in tens of thousands of pages of lost content. [/quote]

Same thing noticed here as well. One site that seemed to have disappeared last summer has mysteriously reappeared, complete with links from pages that have long since disappeared as well.

I can't make any sense of any of this.

7:33 pm on Apr 7, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 9, 2006
posts:103
votes: 0


Since the new Mozilla bot took over and the old Googlebot was retired, new pages and new domains now take forever to get crawled and indexed.

I'm seeing Googlebot visit 90% less than before on all my domains.

10:36 pm on Apr 7, 2006 (gmt 0)

Junior Member

5+ Year Member

joined:Mar 23, 2006
posts:88
votes: 0


"One of the parameters in sitemap.xml is the target crawl rate."

Are you sure? I know MSN have something like this, but I've never seen any mention of any kind of target crawl rate for Google.

10:56 pm on Apr 7, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 15, 2003
posts:2606
votes: 0


[quote] "One of the parameters in sitemap.xml is the target crawl rate."

Are you sure? [/quote]

It is there in the Google sitemap XML file that you can submit using the Google Sitemaps tool.

12:20 am on Apr 8, 2006 (gmt 0)

Junior Member

5+ Year Member

joined:Mar 23, 2006
posts:88
votes: 0


Could someone please elaborate on this?

The only extra tags that Google mention are:

<lastmod/>
<changefreq/>
<priority/>

Where is this mythical "target crawl rate"? What, specifically, is it called? And where is it mentioned?

12:36 am on Apr 8, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:May 13, 2003
posts:442
votes: 0


<lastmod/>
<changefreq/>

I think Daemestro means that those two values in conjunction could tell Google when it's time to spider your site again.

I doubt it though. As I suspect many others do, I set changefreq to daily for a lot of my pages and skip lastmod. Something will have changed on the pages, but I don't change every bit of content every day. It's more like adding links as new pages are added and such.
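
To make that concrete, a typical entry in one of my sitemaps looks something like this (example.com standing in for a real domain, per forum convention):

<url>
  <loc>http://www.example.com/index.html</loc>
  <changefreq>daily</changefreq>
</url>

<loc> is the only required child of <url>; <lastmod>, <changefreq> and <priority> are all optional, which is why skipping lastmod is perfectly legal.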

2:21 am on Apr 8, 2006 (gmt 0)

Full Member

10+ Year Member

joined:Apr 25, 2003
posts:204
votes: 0


Has anyone else noticed Mozilla Googlebot ignoring static urls and going through dynamic urls only?

For example, one of my sites has a combined linking structure. It includes many static urls, and the script that powers my product catalog creates dynamic urls. See below:

Static Url
http://www.example.com/large/blue/index.html

Dynamic Url
http://www.example.com/index.php?size=large&color=blue&id=12

This actually goes back to a problem that I posted about some time back. All of the dynamic urls are linked from static urls. But after complete log file analysis, I can find absolutely no reference to any Googlebot (either old, or Mozilla) visiting any of the static urls. So how on earth did Mozilla Googlebot get a listing of the dynamic urls?

At this point, Mozilla Googlebot has crawled almost my entire catalog (all dynamic urls; in no particular order that I can see), and has yet to crawl even 1 static url ...

Example of how site is being crawled:

http://www.example.com/index.php?size=large&color=blue&id=117
http://www.example.com/index.php?size=large&color=blue&id=1
http://www.example.com/index.php?size=large&color=blue&id=57
http://www.example.com/index.php?size=large&color=blue&id=12
...

BTW, I have manually checked all static urls to make sure they are live. And the catalog script is custom, so I do not believe that the spiders are following any sort of footprint. As a matter of fact, in the time that this has been going on, the spiders have not hit a single "404" dynamic page (meaning that it can't be using a random list of numbers). As well, I have confirmed (via reverse IP tracing) that all spiders are genuine Mozilla Googlebot (all with the same IP).
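
For anyone who wants to run the same check, here is a rough Python sketch of that reverse-plus-forward DNS confirmation (illustrative only; the IP is the one reported earlier in this thread):

import socket

def is_genuine_googlebot(ip):
    # Reverse lookup: a real Googlebot IP has a PTR record
    # under googlebot.com (or google.com).
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith(('.googlebot.com', '.google.com')):
        return False
    # Forward-confirm: the hostname must resolve back to the same
    # IP, otherwise the PTR record could have been forged.
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False

print(is_genuine_googlebot('66.249.66.130'))

A fake bot can forge its User-Agent string, but it cannot make Google's reverse DNS point at its own IP, which is why the forward confirmation step matters.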

[edited by: tedster at 6:05 am (utc) on April 9, 2006]
[edit reason] use example.com [/edit]

3:58 am on Apr 8, 2006 (gmt 0)

Preferred Member

joined:Dec 28, 2005
posts:605
votes: 0


One of my clients has a Perl-based CMS. It is a niche (biblical type) site with a wealth of articles and information on the same topic. The crawler visited this site around mid-March and for some reason was unable to grab the document titles or any snippets from the document bodies. In short, more than 350 unique articles, found almost nowhere else, just went belly up in the index. The pages are listed with the domain name as the title and a snippet from the meta tags (the same snippet for all). They are also listed as supplemental now.

Static pages on the site are doing all right and are ranking well; it is only the dynamic Perl-based CMS content that is displaying the issues I mentioned.

I suspect that the CMS may have had some server issues at the time of the crawler's visit, as I noticed some error messages when I checked the cached versions of the affected pages on Google (some of them, although not most, also have an old cache date going back to mid-2005).

Can someone shed some light on what might have just happened? Assuming it was a server issue at the time of the crawler's visit, how long would it take for Google to fully re-index the site from scratch? Mind you, the site was doing perfectly fine up to only two weeks ago, with both static and dynamic pages indexed fine (titles, body snippets, etc.).

So many questions, I know. I just wish someone with experience of this kind of issue could comment. I can also sticky the URL if needed.

P.S.
Site is doing fine in Yahoo and MSN. It is only Google that seems to display the crawling issues I mentioned.

6:15 am on Apr 9, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Google doesn't use the exact term "target crawl rate" -- but the idea is what's behind <changefreq>. From the Google Sitemaps documentation:

How frequently the page is likely to change. This value provides general information to search engines and may not correlate exactly to how often they crawl the page. Valid values are:
always
hourly
daily
weekly
monthly
yearly
never

The value "always" should be used to describe documents that change each time they are accessed. The value "never" should be used to describe archived URLs.

Please note that the value of this tag is considered a hint and not a command...

[google.com...]

So <changefreq>hourly</changefreq> would be a clue to Googlebot -- but using it wouldn't actually force any particular crawl rate frequency.
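
In a sitemap, that hint sits inside a <url> entry, for example (placeholder URL and date):

<url>
  <loc>http://www.example.com/index.html</loc>
  <lastmod>2006-04-07</lastmod>
  <changefreq>hourly</changefreq>
</url>

Googlebot may take the hint into account, but as the documentation says, it won't be bound by it -- the bot still sets its own schedule.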