Not getting deepcrawled? It could be If-Modified-Since and SSI.
Here's how to get rid of the IMS Google Death penalty once and for all.
Is Google only hitting one page and leaving? Do your files end with .shtml? If you answered yes to those questions, it could be If-Modified-Since giving you the Google Death Penalty.
I started a new site two months ago and decided to test If-Modified-Since on it to see whether it saves bandwidth by making Google fetch only files that have been updated. Before I tried IMS, Google got the files. Then I turned IMS on: Google came, hit one file, and left. After a few weeks I got rid of If-Modified-Since, and BANG! Within a day Google crawled the whole site, so I dumped IMS. That was a month ago, and since then Google has only fetched the index. I started thinking it was just luck that I got crawled right after stopping IMS, but yesterday my next deepcrawl started, and of course the best time to test this is while you're getting deepcrawled. So over the last hour I've been watching Google deepcrawl the site while I change the permission settings to try to get IMS working.
It looks like IMS CAN keep you from being crawled by Google, at least with .shtml files (I haven't tested this on .html files). You have to have the permission setting exactly right. I've only found one setting where Google will crawl the site with IMS on. I tried changing the files in a directory to different permissions, and here's what I got.
XXX Has IMS, but stopped crawling site (even files not with this permission) (Chmod 777)
X Has IMS, but stopped crawling (even files not with this permission) (Chmod 755)
X No IMS, but all directories get crawled. (Original permission setting) (Chmod 454)
X X Has IMS, and also crawls the site. (Chmod 644)
So I changed the permission setting on every file except the section indexes to Chmod 644, and Google is now crawling the site with IMS also set up. At the next deepcrawl I'll see if it really does make Google only get pages that are new or have been updated.
I'm not sure what to make of this post;
First off, your permission description is messed up. The third permission setting is 'execute', not 'search'.
Secondly, your permission settings are all wrong. As an example 644 is:
Owner: Read - Write
X X Has IMS, and also crawls the site. (Chmod 644)
Thirdly, the whole point of using IMS is so Google will only spider modified pages. You don't say whether the pages google isn't spidering have been modified or not. If they haven't then that's the whole point of using IMS.
I'm sure it's clear to you because you know what you're talking about. For me though a better description would be helpful.
My FTP client is on Mac, so it's a little different. That's why I included the chmod numbers. On Mac, Fetch says Search/Execute where you make the permission settings. If you're on Windows, only look at the (Chmod ###) part.
Looks like I got one number wrong. Should be....
X X Has IMS, and also crawls the site. (Chmod 454)
All can read, and Group can Search/Execute.
This is the chmod I set on all the files to get crawled with If-Modified-Since.
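For anyone who finds the chmod numbers easier to read as rwx flags, here is a minimal Python sketch that decodes the four numbers mentioned in this thread. The function name `describe_mode` is made up for illustration. Note that on a file the third bit means execute, while on a directory the same bit means search, which is presumably why Fetch labels it Search/Execute.

```python
import stat


def describe_mode(octal_text):
    """Turn a chmod number such as "454" into an rwx string and a per-class summary."""
    mode = int(octal_text, 8)
    # stat.filemode expects a full st_mode, so mark it as a regular file
    # and drop the leading file-type character from the result.
    symbolic = stat.filemode(stat.S_IFREG | mode)[1:]
    labels = {"r": "read", "w": "write", "x": "execute"}
    summary = {}
    for i, who in enumerate(("owner", "group", "other")):
        bits = symbolic[i * 3:(i + 1) * 3]
        summary[who] = [labels[b] for b in bits if b in labels]
    return symbolic, summary


# Decode the settings discussed in this thread.
for number in ("777", "755", "644", "454"):
    symbolic, summary = describe_mode(number)
    print(number, symbolic, summary)
```

Read this way, 454 is the only one of the four where the owner cannot execute but the group can.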
:::Thirdly, the whole point of using IMS is so Google will only spider modified pages. You don't say whether the pages google isn't spidering have been modified or not. If they haven't then that's the whole point of using IMS.
Had to give it time to do more crawling before making that guess. Looking at the last 10 hours of the crawl, it looks like it is skipping the files that haven't been updated. I only see a few files from sections that were crawled a month ago getting hit this time around, and yes, I did edit a few of the files since the last crawl. Before doing this, it was recrawling the sections that were crawled last time around, and it stopped after I got IMS set up. So it looks like it's now doing exactly what it's supposed to do with IMS.
Does anybody know of any tools for finding out what a web server returns when a request uses If-Modified-Since?
Use this form. [webmasterworld.com]
If you've got If-Modified-Since, you will see something like
Last-Modified: Thu, 06 Nov 2003 00:41:40 GMT
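If you would rather check from your own machine, a rough Python sketch of the same test follows. It requests the page once, reads the Last-Modified header, then asks again with that date in If-Modified-Since and reports the status. The names `probe` and `verdict` are made up for this example, and it only shows how one server answers one request, not what Googlebot will actually do.

```python
import urllib.error
import urllib.request


def verdict(last_modified, second_status):
    """Summarise what the two responses say about If-Modified-Since support."""
    if last_modified is None:
        return "no Last-Modified header, so a crawler has no date to send back"
    if second_status == 304:
        return "304 Not Modified: the conditional request was honoured"
    if second_status == 200:
        return "200 OK: Last-Modified is sent, but the full page came back again"
    return "unexpected status %s on the conditional request" % second_status


def probe(url):
    """Fetch url plainly, then again with If-Modified-Since set to its Last-Modified."""
    with urllib.request.urlopen(url) as first:
        last_modified = first.headers.get("Last-Modified")
    if last_modified is None:
        return verdict(None, None)
    request = urllib.request.Request(url, headers={"If-Modified-Since": last_modified})
    try:
        with urllib.request.urlopen(request) as second:
            status = second.status
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError, so take the code from the exception.
        status = err.code
    return verdict(last_modified, status)
```

Call it as `print(probe(url))` with the address of one of your own .shtml pages.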
How do permissions affect spidering? Google cannot see the permissions of your files... It can only either gain access to them or get an error; that is the only effect I can see.
This is with .shtml files. Some permission settings turn on If-Modified-Since support and others don't. (With .html, it already has IMS.) Some of the settings that do also keep you from getting crawled. What Google does with If-Modified-Since is look at the date the page was last edited. If the page hasn't changed, Google skips it, which saves you bandwidth and lets it get more of your files that have been updated or are new. I don't know why it stopped crawling even though you could access the page. Then right when I changed the permission back, it started crawling the site again. Google did that every time until I found the permission that gets the site crawled and has IMS at the same time.
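The date comparison described above can be sketched in a few lines of Python. This is only an illustration of the general If-Modified-Since logic, not the code any particular web server runs, and the function name `respond` is made up. The crawler sends back the Last-Modified date it saw last time, and the server answers 304 (no body) if the file is no newer than that date, or 200 with the full page otherwise.

```python
from datetime import datetime, timezone
from email.utils import formatdate, parsedate_to_datetime


def respond(if_modified_since, file_mtime):
    """Return (status, headers) for a file last changed at file_mtime (Unix seconds)."""
    headers = {"Last-Modified": formatdate(file_mtime, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable date: fall through to a full response
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution, so drop any fraction.
            modified = datetime.fromtimestamp(int(file_mtime), timezone.utc)
            if modified <= since:
                return 304, headers  # unchanged: no page body is sent
    return 200, headers  # new or updated: send the whole page
```

A page that was edited after the date in the request gets a 200 and is fetched again; everything else costs only the headers.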
One more thing, you might have to have a .htaccess file with
to get IMS with .shtml files.