Google pestering home page

Downloading several times a day

Phil_Payne

2:19 pm on Sep 8, 2006 (gmt 0)

Small site - 50 or so pages. Low update frequency, nice unique keywords.

Google XML sitemaps submitted, up to date and clean - priority, lastmod, updatefreq all specified correctly for each page. HTML sitemap derived from Xenu output.

The most stable page on the site is the home page. It changes three or four times a year. But Google spiders it up to seven times a day. Each request is a GET that gets a 200 and, judging by the byte count in the log, the whole page is usually downloaded.

Nothing I am doing anywhere - or ever have done anywhere - would imply that the home page is updated frequently. It never has been, and the Google sitemap has updatefreq set to "monthly". There are about thirteen pages on the site - listed in the Google and HTML sitemaps and linked organically - that Google has never crawled.

What is the Googlebot looking for on the home page? Surely if it's checking for changes I should be seeing 304s.

Brett_Tabke

2:47 pm on Sep 8, 2006 (gmt 0)

it will clear itself out in about 30 days as gbot learns your update frequency.

Are you producing *anything* that changes from page view to page view?

auto updating advertising code?
random headlines?
changing links?
different menu or flash?

any aspect of the page that changes at all? Even so much as one character change?

Take an accurate byte count of the source code (view from browser) and then compare that to a page view 2-3 hrs later. Any changes at all? Anything you are overlooking that would make the byte count or the character positions/code even a little bit different?
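The comparison suggested above can be sketched in a few lines of Python. This is only an illustration: `compare_snapshots` is a hypothetical helper name, and it assumes you have already saved two copies of the page source as bytes a few hours apart.

```python
import hashlib


def compare_snapshots(first: bytes, second: bytes) -> dict:
    """Compare two saved copies of a page's source, byte for byte.

    Hypothetical helper for illustration; not part of any tool in this thread.
    """
    # Offset of the first differing byte within the overlapping part, if any.
    first_diff = next(
        (i for i, (a, b) in enumerate(zip(first, second)) if a != b),
        None,
    )
    # If one copy is a strict prefix of the other, they differ where the
    # shorter one ends.
    if first_diff is None and len(first) != len(second):
        first_diff = min(len(first), len(second))
    return {
        "identical": first == second,
        "sizes": (len(first), len(second)),
        "sha256": (
            hashlib.sha256(first).hexdigest(),
            hashlib.sha256(second).hexdigest(),
        ),
        "first_difference": first_diff,
    }
```

If `identical` is false, `first_difference` gives the byte offset to look at in the source, which is usually enough to spot a rotating ad tag or a timestamp.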

Phil_Payne

3:11 pm on Sep 8, 2006 (gmt 0)

> it will clear itself out in about 30 days as gbot learns your update frequency.

If that were true it would have cleared itself up just after Big Daddy. This is permanent activity.

The page is HTML 4.01 written using the Crimson editor. Static HTML barely begins to describe it - fossilized HTML would be closer.

motorhaven

6:07 pm on Sep 8, 2006 (gmt 0)

Is your server properly handling dates in the header and set up to use if-modified-since requests?
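For readers unfamiliar with the mechanism being asked about, here is a minimal sketch of how a server might decide between 200 and 304 for a request carrying If-Modified-Since. The function name `conditional_get_status` is hypothetical, and this is not how IIS or any particular server implements it; it only illustrates the date comparison.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_get_status(last_modified: datetime,
                           if_modified_since: Optional[str]) -> int:
    """Return 304 if the client's copy is still current, otherwise 200.

    `last_modified` must be timezone-aware (hypothetical helper, for
    illustration only).
    """
    if not if_modified_since:
        return 200  # unconditional request: send the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparsable date: ignore the header
    if since.tzinfo is None:
        # Treat a date without zone information as UTC.
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200
```

A crawler that sends no If-Modified-Since header at all will always get 200 from this logic, however stable the page is.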

Phil_Payne

8:17 pm on Sep 8, 2006 (gmt 0)

> Is your server properly handling dates in the header and set up to use if-modified-since requests?

Boring IIS 5. Google accesses the XML sitemap using If-Modified-Since and gets 304s, just like it should.

IIS log extract:

2006-09-06 14:18:59 66.249.65.18 GET 200 /index.html
2006-09-06 14:21:27 66.249.65.18 GET 200 /index.html
2006-09-06 15:18:31 66.249.65.18 GET 200 /index.html
2006-09-07 04:04:07 66.249.66.34 GET 304 /sitemap.xml
2006-09-07 04:27:07 66.249.66.34 GET 200 /index.html
2006-09-07 05:08:47 66.249.66.34 GET 200 /index.html
2006-09-07 05:13:08 66.249.66.34 GET 200 /index.html
2006-09-07 05:51:26 66.249.66.34 GET 200 /index.html
2006-09-07 06:10:28 66.249.66.34 GET 200 /index.html
2006-09-07 06:48:00 66.249.66.34 GET 200 /index.html
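A rough way to summarize an extract like this is to tally requests per day, URL and status. The sketch below assumes the six-field layout shown above (date, time, client IP, method, status, URL); real IIS logs are configurable and often carry more fields, and `tally_requests` is a hypothetical name.

```python
from collections import Counter

# The extract quoted above, copied verbatim.
LOG_EXTRACT = """\
2006-09-06 14:18:59 66.249.65.18 GET 200 /index.html
2006-09-06 14:21:27 66.249.65.18 GET 200 /index.html
2006-09-06 15:18:31 66.249.65.18 GET 200 /index.html
2006-09-07 04:04:07 66.249.66.34 GET 304 /sitemap.xml
2006-09-07 04:27:07 66.249.66.34 GET 200 /index.html
2006-09-07 05:08:47 66.249.66.34 GET 200 /index.html
2006-09-07 05:13:08 66.249.66.34 GET 200 /index.html
2006-09-07 05:51:26 66.249.66.34 GET 200 /index.html
2006-09-07 06:10:28 66.249.66.34 GET 200 /index.html
2006-09-07 06:48:00 66.249.66.34 GET 200 /index.html
"""


def tally_requests(log_text: str) -> Counter:
    """Count requests per (date, URL, status) for the layout shown above."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or differently shaped lines
        date, _time, _ip, _method, status, url = fields
        counts[(date, url, int(status))] += 1
    return counts


for key, count in sorted(tally_requests(LOG_EXTRACT).items()):
    print(key, count)
```

On this extract every /index.html fetch is a full 200, while the only 304 is for /sitemap.xml.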

jomaxx

8:48 pm on Sep 8, 2006 (gmt 0)

Keep it in perspective, you're talking about "up to" about 7 pageviews per day. This amounts to some tiny fraction of a penny.

Phil_Payne

9:09 pm on Sep 8, 2006 (gmt 0)

> .. perspective ..

I'm not complaining, just trying to understand.

Results of the GSiteCrawler Server-Test
Tested at 9/8/2006 9:05:15 PM / from 82.3.81.13:

URL=http://www.mysite.com
Result code: 200 (OK / OK)
Server: Microsoft-IIS/5.0
Content-Location: [mysite.com...]
Date: Fri, 08 Sep 2006 20:58:12 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Wed, 06 Sep 2006 12:20:34 GMT
ETag: "6a5577daaed1c61:c3a"
Content-Length: 4163

So "Last-Modified" is being returned correctly.
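To illustrate what a client could do with those validators, here is a small sketch that turns the Last-Modified and ETag values from a previous response into the headers of a conditional request. `conditional_headers` is a hypothetical helper, and nothing here claims to describe what Googlebot actually sends.

```python
def conditional_headers(response_headers: dict) -> dict:
    """Build revalidation headers from a previous response's validators.

    Hypothetical helper for illustration only.
    """
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in response_headers.items()}
    request = {}
    if "last-modified" in lowered:
        request["If-Modified-Since"] = lowered["last-modified"]
    if "etag" in lowered:
        request["If-None-Match"] = lowered["etag"]
    return request


# Values taken from the server test output above.
previous = {
    "Last-Modified": "Wed, 06 Sep 2006 12:20:34 GMT",
    "ETag": '"6a5577daaed1c61:c3a"',
}
print(conditional_headers(previous))
```

A request carrying these headers gives the server the chance to answer 304; a plain GET without them does not.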

g1smd

9:18 pm on Sep 8, 2006 (gmt 0)

It should be requesting http://www.domain.com/ and not http://www.domain.com/index.html I think.

I suspect that fact may turn out to be important.

ronburk

10:12 pm on Sep 9, 2006 (gmt 0)

Interesting.

> It should be requesting [domain.com...] and not [domain.com...] I think.

What made you think it was requesting /index.html? Looked like the request was for the (illegal, but universally accepted) URL of [something.com,...] and the server used the Content-Location header to inform the User-Agent where the actual resource resides.

------

Although you describe this as fossilized HTML, the server says it was modified recently -- which is correct?

g1smd

10:16 pm on Sep 9, 2006 (gmt 0)

What made me think that? The fact that I have seen several hundreds of sites with that exact same problem in recent months.

Does the call for http://www.domain.com respond with a 302 or a 301 response? That short URL was my other, much more unlikely, guess.

Phil_Payne

8:39 am on Sep 12, 2006 (gmt 0)

Results of the GSiteCrawler Server-Test
Tested at 9/12/2006 8:36:31 AM / from 82.2.113.108:

URL=http://www.mysite.com
Result code: 200 (OK / OK)
Server: Microsoft-IIS/5.0
Content-Location: [mysite.com...]
Date: Tue, 12 Sep 2006 08:29:22 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Wed, 06 Sep 2006 12:20:34 GMT
ETag: "6a5577daaed1c61:c3a"
Content-Length: 4163

What I don't understand is:

a) Why is this "bad"? I have the same "problem" on many other sites that are performing very well.

b) Why is Google downloading the index.html page repetitively - when it almost never changes - and NOT downloading the pages that do change?