Forum Moderators: Robert Charlton & goodroi
Google XML sitemaps submitted, up to date and clean - priority, lastmod and changefreq all specified correctly for each page. HTML sitemap derived from Xenu output.
The most stable page on the site is the home page. It changes three or four times a year. But Google spiders it up to seven times a day. Each request is a GET that gets a 200 and, judging by the byte count in the log, usually downloads the whole page.
Nothing I am doing anywhere - or ever have done anywhere - would imply that the home page is updated frequently. It never has been, and the Google sitemap has changefreq set to "monthly". There are about thirteen pages on the site - listed in the Google and HTML sitemaps and linked organically - that Google has never crawled.
What is the Googlebot looking for on the home page? Surely if it's checking for changes I should be seeing 304s.
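For anyone following along, the 200-versus-304 decision being asked about can be sketched roughly as below. This is only an illustration of how a conditional GET is normally evaluated, not IIS's actual code; the function name `conditional_status` and the dates are made up for the example.

```python
from email.utils import parsedate_to_datetime


def conditional_status(last_modified, if_modified_since=None):
    """Return 304 if the client's copy is still current, otherwise 200.

    Both arguments are HTTP date strings. The function name and the
    sample dates below are illustrative assumptions.
    """
    if if_modified_since is None:
        # Plain (unconditional) GET: the server always sends the full page.
        return 200
    resource_time = parsedate_to_datetime(last_modified)
    client_time = parsedate_to_datetime(if_modified_since)
    # Unchanged since the client's copy -> 304 Not Modified, no body sent.
    return 304 if resource_time <= client_time else 200


page_date = "Fri, 01 Sep 2006 10:00:00 GMT"  # illustrative date only
print(conditional_status(page_date))             # no If-Modified-Since sent
print(conditional_status(page_date, page_date))  # client copy is current
```

The point of the sketch is that a 304 is only possible when the request carries a validator such as If-Modified-Since. A plain GET gets a 200 however rarely the page changes.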
Are you producing *anything* that changes from page view to page view?
auto updating advertising code?
random headlines?
changing links?
different menu or flash.
any aspect of the page that changes at all? Even so much as one character change?
Take an accurate byte count of the source code (view source from the browser) and then compare that to a page view 2-3 hrs later. Any changes at all? Anything you are overlooking that would make the byte count or the character positions/code even a little bit different?
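The byte-count comparison suggested above could be done along these lines. It is a minimal sketch that assumes you have saved the page source twice, a few hours apart, and read each copy in as bytes; `fingerprint` and `first_difference` are hypothetical helper names.

```python
import hashlib


def fingerprint(body: bytes):
    """Return (byte count, SHA-256 hex digest) for one saved copy of the page."""
    return len(body), hashlib.sha256(body).hexdigest()


def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One copy is a prefix of the other; they diverge where the shorter ends.
        return min(len(a), len(b))
    return None


# Stand-in snapshots; in practice these would be the two saved page sources.
earlier = b"<html><body>Welcome</body></html>"
later = b"<html><body>Welcome</body></html>"
print(fingerprint(earlier) == fingerprint(later))
print(first_difference(earlier, later))
```

Matching byte counts alone would not rule out a same-length change (a rotated headline, say), which is why the sketch also compares a hash and reports the first differing offset.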
If that were true it would have cleared itself up just after Big Daddy. This is permanent activity.
The page is HTML 4.01 written using the Crimson editor. Static HTML barely begins to describe it - fossilized HTML would be closer.
Boring IIS 5. Google accesses the XML sitemap using If-Modified-Since and gets 304s, just like it should.
IIS log extract:
2006-09-06 14:18:59 66.249.65.18 GET 200 /index.html
2006-09-06 14:21:27 66.249.65.18 GET 200 /index.html
2006-09-06 15:18:31 66.249.65.18 GET 200 /index.html
2006-09-07 04:04:07 66.249.66.34 GET 304 /sitemap.xml
2006-09-07 04:27:07 66.249.66.34 GET 200 /index.html
2006-09-07 05:08:47 66.249.66.34 GET 200 /index.html
2006-09-07 05:13:08 66.249.66.34 GET 200 /index.html
2006-09-07 05:51:26 66.249.66.34 GET 200 /index.html
2006-09-07 06:10:28 66.249.66.34 GET 200 /index.html
2006-09-07 06:48:00 66.249.66.34 GET 200 /index.html
I'm not complaining, just trying to understand.
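To make the pattern in the log extract above easier to see, the lines can be tallied by URL and status code. This is a rough sketch that assumes the six-field layout shown in the extract (date, time, client IP, method, status, URL); a real IIS log may have a different field order, so adjust accordingly.

```python
from collections import Counter

# The log lines exactly as posted above.
SAMPLE = """\
2006-09-06 14:18:59 66.249.65.18 GET 200 /index.html
2006-09-06 14:21:27 66.249.65.18 GET 200 /index.html
2006-09-06 15:18:31 66.249.65.18 GET 200 /index.html
2006-09-07 04:04:07 66.249.66.34 GET 304 /sitemap.xml
2006-09-07 04:27:07 66.249.66.34 GET 200 /index.html
2006-09-07 05:08:47 66.249.66.34 GET 200 /index.html
2006-09-07 05:13:08 66.249.66.34 GET 200 /index.html
2006-09-07 05:51:26 66.249.66.34 GET 200 /index.html
2006-09-07 06:10:28 66.249.66.34 GET 200 /index.html
2006-09-07 06:48:00 66.249.66.34 GET 200 /index.html
"""


def tally(log_lines):
    """Count requests per (URL, status) pair, skipping lines that do not fit."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # not in the assumed six-field layout
        _date, _time, _ip, _method, status, path = fields
        counts[(path, status)] += 1
    return counts


for (path, status), n in sorted(tally(SAMPLE.splitlines()).items()):
    print(path, status, n)
```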
Results of the GSiteCrawler Server-Test
Tested at 9/8/2006 9:05:15 PM / from 82.3.81.13:
URL=http://www.mysite.com
Result code: 200 (OK / OK)
Server: Microsoft-IIS/5.0
Content-Location: [mysite.com...]
Date: Fri, 08 Sep 2006 20:58:12 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Wed, 06 Sep 2006 12:20:34 GMT
ETag: "6a5577daaed1c61:c3a"
Content-Length: 4163
So "Last-Modified" is being returned correctly.
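One way to take that check a step further would be to replay the validators from the response in a follow-up request and see whether the server answers 304. The sketch below only builds the request headers from the values in the test output above; `conditional_headers` is a hypothetical helper, and nothing here actually contacts a server.

```python
def conditional_headers(response_headers):
    """Turn a previous response's validators into conditional request headers."""
    validators = {}
    if "Last-Modified" in response_headers:
        validators["If-Modified-Since"] = response_headers["Last-Modified"]
    if "ETag" in response_headers:
        validators["If-None-Match"] = response_headers["ETag"]
    return validators


# Values copied from the server test output above.
previous = {
    "Last-Modified": "Wed, 06 Sep 2006 12:20:34 GMT",
    "ETag": '"6a5577daaed1c61:c3a"',
}
print(conditional_headers(previous))
```

Sending a GET with those two headers should get a 304 with no body from a server that honors them, as long as the page has not changed in the meantime.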
It should be requesting [domain.com...] and not [domain.com...] I think.
What made you think it was requesting /index.html? Looked like the request was for the (illegal, but universally accepted) URL of [something.com,...] and the server used the Content-Location header to inform the User-Agent where the actual resource resides.
------
Although you describe this as fossilized HTML, the server says it was modified recently -- which is correct?
URL=http://www.mysite.com
Result code: 200 (OK / OK)
Server: Microsoft-IIS/5.0
Content-Location: [mysite.com...]
Date: Tue, 12 Sep 2006 08:29:22 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Wed, 06 Sep 2006 12:20:34 GMT
ETag: "6a5577daaed1c61:c3a"
Content-Length: 4163
What I don't understand is:
a) Why this is "bad"? I have the same "problem" on many other sites that are performing very well.
b) Why is Google downloading the index.html page repeatedly - when it almost never changes - and NOT downloading the pages that do change?