
Google News Archive Forum

    
Not getting deepcrawled? It could be If-Modified-Since and SSI.
Here's how to get rid of the IMS Google Death penalty once and for all.
Jesse_Smith




msg:142180
 5:05 am on Nov 5, 2003 (gmt 0)

Is Google only hitting one page and leaving? Do your files end with .shtml? If you answer yes to both questions, it could be If-Modified-Since giving you the Google Death Penalty.

I started a new site two months ago and decided to test If-Modified-Since on it to see whether it saves bandwidth by making Google fetch only files that have been updated. At first, before I tried IMS, Google got the files. Then I turned IMS on: Google came, hit one file, and left. After a few weeks I got rid of If-Modified-Since, and BANG! Within a day Google crawled the whole site, so I dumped IMS. That was a month ago, and since then Google had only fetched the index. I started thinking it was just luck that I got crawled right after stopping IMS, but yesterday my next deepcrawl of the site began. The best time to test this is while you're getting deepcrawled, so over the last hour I've been watching Google deepcrawl the site while I changed the permission settings to try to get IMS working. It looks like IMS CAN keep you from being crawled by Google, at least with .shtml files (I haven't tested this on .html files). You have to have the permission setting exactly right. I've found only one setting where Google will crawl the site with IMS on. I tried changing the files in a directory to different permissions, and here's what I got.

Owner: Read-Write-Search
Group: Read-Write-Search
Everyone: Read-Write-Search

XXX
XXX Has IMS, but stopped crawling site (even files not with this permission) (Chmod 777)
XXX

XXX
X Has IMS, but stopped crawling (even files not with this permission) (Chmod 755)
XXX

XX
X No IMS, but all directories get crawled. (Original permission setting) (Chmod 454)
X

X
X X Has IMS, and also crawls the site. (Chmod 644)
X

So I changed the permission setting on every file except the section indexes to Chmod 644, and Google is now crawling the site with IMS also set up. At the next deepcrawl I'll see if it really does make Google only get pages that are new or have been updated.

 

olderscot




msg:142181
 9:51 am on Nov 5, 2003 (gmt 0)

I'm not sure what to make of this post.

First off, your permission description is messed up. The third permission setting is 'execute', not 'search'.

Secondly, your permission settings are all wrong. As an example, 644 is:

Owner: Read - Write
Group: Read
Everyone: Read

Not:

X
X X Has IMS, and also crawls the site. (Chmod 644)
X
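
For anyone who wants to double-check what a chmod number actually grants, here is a minimal Python sketch that decodes the three octal digits into the Owner/Group/Everyone labels used above. The function name `describe_mode` is made up for this illustration; it is not part of any tool mentioned in this thread.

```python
def describe_mode(mode):
    """Decode a chmod-style octal mode into per-class permission labels."""
    classes = ("Owner", "Group", "Everyone")
    flags = (("Read", 4), ("Write", 2), ("Execute", 1))
    result = {}
    for index, who in enumerate(classes):
        # Owner is the highest octal digit, Everyone the lowest.
        digit = (mode >> (3 * (2 - index))) & 0o7
        granted = [label for label, bit in flags if digit & bit]
        result[who] = " - ".join(granted) if granted else "(none)"
    return result
```

Calling `describe_mode(0o644)` gives Owner "Read - Write" and "Read" for the other two, which is the breakdown listed above.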

Thirdly, the whole point of using IMS is so Google will only spider modified pages. You don't say whether the pages Google isn't spidering have been modified or not. If they haven't, then skipping them is exactly what IMS is for.

I'm sure it's clear to you because you know what you're talking about. For me, though, a better description would be helpful.

Mike

Jesse_Smith




msg:142182
 6:32 pm on Nov 5, 2003 (gmt 0)

My FTP client is on a Mac, so it's a little different. That's why I included the chmod numbers. On the Mac, Fetch labels the setting Search/Execute where you set the permissions. If you're on Windows, only look at the (Chmod ###) part.

Looks like I got one number wrong. Should be....

X
X X Has IMS, and also crawls the site. (Chmod 454)
X

All can read, and Group can Search/Execute.

This is the chmod I changed all the files to in order to get crawled with If-Modified-Since.

:::Thirdly, the whole point of using IMS is so Google will only spider modified pages. You don't say whether the pages Google isn't spidering have been modified or not. If they haven't, then skipping them is exactly what IMS is for.

I had to give it time to do more crawling before making that guess. Looking at the last 10 hours of the crawl, it looks like it is skipping the files that haven't been updated. I only see a few files from sections that were crawled a month ago getting hit this time around, and yes, I did edit a few of the files since the last crawl. Before I did this, it was recrawling the sections that were crawled last time around, and it stopped doing that once IMS was set up. So it looks like it's now doing exactly what it's supposed to do with IMS.
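
If you would rather count than eyeball the log, here is a rough Python sketch that tallies the HTTP status codes Googlebot received. It assumes the access log is in Apache's combined log format and that the crawler's user agent contains the string "Googlebot"; the function name `googlebot_status_counts` is invented for this example.

```python
import re
from collections import Counter

# Matches the quoted request line and the status code that follows it,
# e.g. "GET /page.shtml HTTP/1.0" 304
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3}) ')


def googlebot_status_counts(log_lines):
    """Count response status codes on log lines that mention Googlebot."""
    counts = Counter()
    for line in log_lines:
        if "Googlebot" not in line:
            continue
        match = REQUEST_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts
```

A large share of 304 responses for pages you haven't edited would suggest the conditional requests are working; mostly 200 responses would suggest they are not.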

MyWifeSays




msg:142183
 10:44 pm on Nov 5, 2003 (gmt 0)

Does anybody know if there are any tools for finding out what a web server returns when it is sent an If-Modified-Since header?

Jesse_Smith




msg:142184
 12:42 am on Nov 6, 2003 (gmt 0)

Use this form. [webmasterworld.com]

If you have If-Modified-Since working, you will see something like

Last-Modified: Thu, 06 Nov 2003 00:41:40 GMT
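
Seeing a Last-Modified header only tells you the server reports a date. To check how the server reacts to the conditional request itself, you can send an If-Modified-Since header and look at the status code: 304 means "not modified", 200 means the full page was sent again. Below is a minimal Python sketch using only the standard library. The function names `http_date` and `conditional_get` are made up for this example, and the URL you pass in is your own.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def http_date(epoch_seconds):
    """Format a Unix timestamp as an HTTP date, e.g. for If-Modified-Since."""
    return formatdate(epoch_seconds, usegmt=True)


def conditional_get(url, since_epoch):
    """Send a GET with If-Modified-Since; return (status, Last-Modified)."""
    request = urllib.request.Request(url)
    request.add_header("If-Modified-Since", http_date(since_epoch))
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError; treat it as a result.
        if err.code == 304:
            return 304, err.headers.get("Last-Modified")
        raise
```

For example, `conditional_get("http://www.example.com/page.shtml", time.time())` (after `import time`) asks whether the page changed since right now; a server that honors IMS should answer 304.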

moltar




msg:142185
 12:47 am on Nov 6, 2003 (gmt 0)

How do permissions affect spidering? Google cannot see the permissions of your files. It can only either gain access to them or get an error; that is the only effect I can see.

Jesse_Smith




msg:142186
 2:33 am on Nov 6, 2003 (gmt 0)

This is with .shtml files. Some permission settings enable If-Modified-Since and others don't. (With .html, it already has IMS.) Some of the settings that do enable it also keep you from getting crawled. What Google does with If-Modified-Since is look at the date the page was last edited. If the page hasn't changed, Google skips it, which saves you bandwidth and lets it get more of your files that are new or have been updated. I don't know why it stopped crawling even though the page was still accessible. Right when I changed the permission back, it started crawling the site again. Google did that every time until I found the right permission to get crawled and have IMS at the same time.
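
To make the date check concrete, here is a minimal Python sketch of the decision a server makes when a request carries If-Modified-Since: compare the date in the header with the file's last-modified time and answer 304 if the file is not newer. This is an illustration of the general mechanism, not how Apache or Google actually implement it, and the function name `is_not_modified` is invented here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_not_modified(if_modified_since, file_mtime_epoch):
    """Return True when a 304 Not Modified would be the appropriate answer."""
    if not if_modified_since:
        return False  # No conditional header: send the full page.
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return False  # Unparseable date: ignore the header.
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop fractional seconds.
    last_modified = datetime.fromtimestamp(int(file_mtime_epoch), tz=timezone.utc)
    return last_modified <= since
```

When this returns True the server can send headers only, which is where the bandwidth saving comes from.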

One more thing, you might have to have a .htaccess file with

XBitHack Full

to get IMS with .shtml files.
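
As I understand the Apache mod_include documentation, XBitHack Full also tests the group-execute bit of a file and, when that bit is set, sends a Last-Modified date taken from the file itself. That would be one possible reason the permission setting matters here, though this reading is my assumption and has not been confirmed in this thread. The Python sketch below just reports the two execute bits for a given chmod value; the function name `xbithack_bits` is hypothetical.

```python
import stat


def xbithack_bits(mode):
    """Report the owner-execute and group-execute bits of a chmod value."""
    return {
        "owner_execute": bool(mode & stat.S_IXUSR),
        "group_execute": bool(mode & stat.S_IXGRP),
    }
```

For instance, `xbithack_bits(0o454)` reports the group-execute bit as set, while `xbithack_bits(0o644)` reports neither bit set.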

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved