I started a new site two months ago and decided to test If-Modified-Since (IMS) on it to see whether it really saves bandwidth by making Google fetch only files that have been updated. At first, before I tried IMS, Google got the files. Then I turned IMS on. Google came, hit one file, and left. So after a few weeks I got rid of If-Modified-Since, and BANG! Within a day Google crawled the whole site, so I dumped IMS. That was a month ago, and since then Google has only fetched the index.

I started thinking it was just luck that I got crawled right after stopping IMS, but yesterday the next deepcrawl of the site started, and of course the best time to test this is while you're getting deepcrawled. So over the last hour I've been watching Google deepcrawl the site while I change the permission settings to try to get IMS working.

It looks like IMS CAN keep you from being crawled by Google, at least for .shtml files (I haven't tested this on .html files). You have to have the permission setting exactly correct. I've only found one setting where Google will crawl the site with IMS on. I tried changing the files in a directory to different permissions, and here's what I got.
Permissions are Read-Write-Search for Owner, Group, and Everyone.
Chmod 777: Has IMS, but Google stopped crawling the site (even files without this permission).
Chmod 755: Has IMS, but Google stopped crawling (even files without this permission).
Chmod 454 (original permission setting): No IMS, but all directories get crawled.
Chmod 644: Has IMS, and Google also crawls the site.
First off, your permission description is messed up. The third permission setting is 'execute', not 'search'.
Secondly, your permission settings are all wrong. As an example 644 is:
Owner: Read - Write
Group: Read
Everyone: Read
Not:
X
X X Has IMS, and also crawls the site. (Chmod 644)
X
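To make the octal numbers in this thread easier to compare, here is a minimal Python sketch that expands each mode into the familiar ls-style string. The helper name `describe_mode` is made up for this illustration, and it assumes the modes apply to regular files.

```python
import stat


def describe_mode(mode: int) -> str:
    """Return the ls-style permission string for an octal mode.

    Assumes a regular file, so the file-type bits are added here.
    """
    return stat.filemode(stat.S_IFREG | mode)


# The four modes mentioned in this thread.
for mode in (0o777, 0o755, 0o454, 0o644):
    print(f"{mode:o}: {describe_mode(mode)}")
```

Reading the string in groups of three gives owner, group, and everyone, each as read, write, execute.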
Thirdly, the whole point of using IMS is so Google will only spider modified pages. You don't say whether the pages Google isn't spidering have been modified or not. If they haven't, then skipping them is exactly what IMS is supposed to do.
I'm sure it's clear to you because you know what you're talking about. For me though a better description would be helpful.
Mike
Looks like I got one number wrong. Should be....
X
X X Has IMS, and also crawls the site. (Chmod 454)
X
All can read, and Group can Search/Execute.
This is the chmod I changed all the files to in order to get crawled with If-Modified-Since.
:::Thirdly, the whole point of using IMS is so Google will only spider modified pages. You don't say whether the pages google isn't spidering have been modified or not. If they haven't then that's the whole point of using IMS.
I had to give it time to do more crawling before making that guess. Looking at the last 10 hours of the crawl, it looks like it is skipping the files that haven't been updated. I only see a few files from sections that were crawled a month ago getting hit this time around, and yes, I did edit a few of the files since the last crawl. Before I did this, it was recrawling the sections that were crawled last time around, and that stopped after getting IMS set up. So it looks like it's now doing exactly what it's supposed to do with IMS.
If you have If-Modified-Since working, you will see a header like this in the response:
Last-Modified: Thu, 06 Nov 2003 00:41:40 GMT
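For anyone wondering how that header ties into the skipping, here is a rough Python sketch of the decision a server makes on a conditional GET. It is only an illustration of the general rule (304 if the file is not newer than the date the crawler sends), not how Apache actually implements it; the function name `response_status` is made up for this example.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def response_status(last_modified: str, if_modified_since: Optional[str]) -> int:
    """Return 304 if the file is unchanged since the client's date, else 200."""
    if if_modified_since is None:
        # No conditional header: always send the full page.
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # Unparseable date: ignore the header and send the full page.
        return 200
    modified = parsedate_to_datetime(last_modified)
    return 304 if modified <= since else 200


page_date = "Thu, 06 Nov 2003 00:41:40 GMT"
print(response_status(page_date, None))       # first visit, no header sent
print(response_status(page_date, page_date))  # revisit, file unchanged
```

A 304 has no body, which is where the bandwidth saving comes from. It also means an unmodified page showing up as "not crawled" in a quick look at the logs may just be a 304.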
One more thing, you might have to have a .htaccess file with
XBitHack Full
to get IMS with .shtml files.
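As I understand the Apache mod_include documentation, XBitHack full only sends Last-Modified for a parsed file when the group-execute bit is set, which would fit with 454 being the setting that worked. Treat that as my reading of the docs, not something tested in this thread. Under that assumption, a small Python sketch for adding the group-execute bit without touching the other bits (the helper name `set_group_execute` and the file name are made up for the example):

```python
import os
import stat
import tempfile


def set_group_execute(path: str) -> int:
    """Add the group-execute bit to path and return the new permission bits."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, current | stat.S_IXGRP)
    return stat.S_IMODE(os.stat(path).st_mode)


# Demo on a throwaway file: start at 444, end at 454.
with tempfile.TemporaryDirectory() as tmp:
    page = os.path.join(tmp, "example.shtml")  # hypothetical file name
    with open(page, "w") as fh:
        fh.write("<!--#echo var=\"DATE_LOCAL\" -->\n")
    os.chmod(page, 0o444)
    print(f"{set_group_execute(page):o}")
```

On a real site you would run this over your .shtml files instead of a temp file, and check the response headers afterward to confirm Last-Modified is being sent.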