Google crawl - not always sending IF_MODIFIED_SINCE

Thousands of requests each day

     
10:00 am on Mar 8, 2007 (gmt 0)

New User

5+ Year Member

joined:Mar 8, 2007
posts:1
votes: 0


This is what I tried:

I've tried: Checking for the HTTP_IF_MODIFIED_SINCE header and returning "304 Not Modified" when the page has not changed.

Problem: Googlebot doesn't always send this header. Even when it already knows about a page, it doesn't always send the header.

I've tried: Using the Expires header to tell Google that each page expires one month after the request.

Problem: Googlebot keeps requesting the pages. It seems to ignore this header.

I've tried: Lowering the crawl rate to "Slow" in Google Webmaster Tools.
Problem: This doesn't seem to have any significant effect.

Are there other solutions to this problem? I don't want to ban Googlebot since we get a lot of visitors from Google.

1:46 pm on Oct 8, 2007 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38047
votes: 11


Ran into this thread from a G search on the same topic.

Does anyone know the answer to this? I am running into the same thing on lower-PR sites.

2:45 pm on Oct 8, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Check that the 'Expires' header is relative -- the page should expire a set amount of time after each request, rather than at one fixed time.

You should check your Cache-control server response headers as well.
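One way to read the suggestion above, as a rough sketch: compute Expires from the time of each request and pair it with a matching Cache-Control max-age. The function name freshness_headers and the 30-day default are assumptions made for illustration; nothing in this thread prescribes them.

```php
<?php
// Hypothetical helper: builds freshness headers relative to the request time.
// $now is a Unix timestamp; $lifetime is in seconds (30 days assumed here).
function freshness_headers(int $now, int $lifetime = 2592000): array
{
    return [
        // Expires is an absolute HTTP-date, computed as "now + lifetime".
        'Expires: ' . gmdate('D, d M Y H:i:s', $now + $lifetime) . ' GMT',
        // max-age expresses the same lifetime in seconds.
        'Cache-Control: max-age=' . $lifetime,
    ];
}
```

Each returned string can be passed to header() before any output is sent, for example by looping over freshness_headers(time()).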

Jim

8:20 am on Oct 10, 2007 (gmt 0)

New User

5+ Year Member

joined:June 14, 2006
posts: 31
votes: 0


Should the "Expires" be past or future, or doesn't that matter?

[edited by: Webnauts at 8:22 am (utc) on Oct. 10, 2007]

8:35 pm on Oct 10, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 9, 2005
posts:1509
votes: 0


I usually set a 'future expires' and check for both an HTTP_IF_MODIFIED_SINCE and HTTP_IF_NONE_MATCH.

header("Expires: " . gmdate("D, d M Y H:i:s", time()+24*60*60) . " GMT");

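// $date and $etag are the Last-Modified and ETag values generated for this page.
if ((isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) && $_SERVER['HTTP_IF_MODIFIED_SINCE'] === $date)
|| (isset($_SERVER['HTTP_IF_NONE_MATCH']) && $_SERVER['HTTP_IF_NONE_MATCH'] === $etag)) {
header('HTTP/1.1 304 Not Modified'); exit();
}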

I use a simple md5 hash based on URL / filemtime() to create custom ETag headers and ensure if I update the file a new request is made.

In the actual file the order should be reversed: the check for the two request headers comes first.
The Expires header is created and set only if neither header matches, and it needs to be set before any other output.
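The ETag approach described above could look roughly like this. It is only a sketch: the function names make_etag and etag_matches, and the "|" separator between URL and modification time, are assumptions and not the poster's actual code.

```php
<?php
// Hypothetical helper: derives a quoted ETag from the request URL and the
// file's modification time, so updating the file changes the ETag.
function make_etag(string $url, int $mtime): string
{
    return '"' . md5($url . '|' . $mtime) . '"';
}

// Returns true when the client's If-None-Match value matches our ETag.
// Handles "*" and comma-separated lists; a weak "W/" prefix is ignored.
function etag_matches(?string $ifNoneMatch, string $etag): bool
{
    if ($ifNoneMatch === null || $ifNoneMatch === '') {
        return false;
    }
    if (trim($ifNoneMatch) === '*') {
        return true;
    }
    foreach (explode(',', $ifNoneMatch) as $candidate) {
        $candidate = trim($candidate);
        if (strncmp($candidate, 'W/', 2) === 0) {
            $candidate = substr($candidate, 2);
        }
        if ($candidate === $etag) {
            return true;
        }
    }
    return false;
}
```

In a page script, the first argument to etag_matches would typically be $_SERVER['HTTP_IF_NONE_MATCH'] when it is set, and null otherwise.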

Justin

5:36 am on Oct 11, 2007 (gmt 0)

New User

5+ Year Member

joined:June 14, 2006
posts: 31
votes: 0


jd1 thanks for the cool tip.

I have this header so far:

<?php
// $file contains the file name of the page being displayed (the actual
// content, not any templates you may be using). We take the last modified
// date of this file.
$file = __FILE__;
$mtime = filemtime($file);
// Create a HTTP conformant date, example 'Mon, 22 Dec 2003 14:16:16 GMT'
$gmt_mtime = gmdate('D, d M Y H:i:s', $mtime).' GMT';
// send a unique 'strong' identifier. This is always the same for this
// particular file while the file itself remains the same.
header('ETag: "'.md5($mtime.$file).'"');
// check if the last modified date sent by the client is the same as
// the last modified date of the requested file. If so, return 304 header
// and exit.
if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']))
{
    if ($_SERVER['HTTP_IF_MODIFIED_SINCE'] == $gmt_mtime)
    {
        header('HTTP/1.1 304 Not Modified');
        exit();
    }
}
// check if the Etag sent by the client is the same as the Etag of the
// requested file. If so, return 304 header and exit.
if (isset($_SERVER['HTTP_IF_NONE_MATCH']))
{
    if (str_replace('"', '', stripslashes($_SERVER['HTTP_IF_NONE_MATCH'])) == md5($mtime.$file))
    {
        header("HTTP/1.1 304 Not Modified");
        // abort processing and exit
        exit();
    }
}
// output last modified header using the last modified date of the file.
header('Last-Modified: '.$gmt_mtime);
// tell all caches that this resource is publicly cacheable.
header('Cache-Control: must-revalidate');
// this resource expires one day from now.
header("Expires: " . gmdate("D, d M Y H:i:s", time()+24*60*60) . " GMT");
// set the content-type
if (isset($_SERVER["HTTP_ACCEPT"]) && stristr($_SERVER["HTTP_ACCEPT"], "application/xhtml+xml")) {
    header("Content-type: application/xhtml+xml; charset=utf-8");
} else {
    header("Content-type: text/html; charset=utf-8");
}
// start output.
// Note that no output can precede the headers unless you call ob_start().
// You don't have to use gzip, but it greatly saves on bandwidth (for text)
// at the cost of a little more processing.
ob_start ("ob_gzhandler");
?>

What do you think about that?

[edited by: Webnauts at 6:21 am (utc) on Oct. 11, 2007]

7:17 pm on Oct 11, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 9, 2005
posts:1509
votes: 0


Looks like a good start...

There are a few things I notice at a glance:
1. You need to move the ETag header down to where the rest of the headers are set, so the checks are first.

2. I don't usually set the Cache-Control header, but it overrides the Expires setting if a max-age is set. So if you are still having issues, you might set it to cache for a day (or the time period of your choice; 24*60*60 = 86400 seconds = 1 day) using max-age=86400.

3. The comment for the cache control header says you want to let the file be cached publicly, but you are running 'must-revalidate', so the remote cache will always be compared to the original source. Are you sure that's what you want?
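Taken together, the three points above might be sketched as follows. This is an illustration under stated assumptions, not a drop-in replacement for the posted script: the function name conditional_response and its return shape are made up for the example, and the timestamp comparison (instead of the exact string match used in the thread) is a substitution.

```php
<?php
// Hypothetical helper: decides between 304 and 200 before building headers.
// $server is expected to look like $_SERVER; $maxAge defaults to one day.
function conditional_response(array $server, int $mtime, string $etag, int $maxAge = 86400): array
{
    $lastModified = gmdate('D, d M Y H:i:s', $mtime) . ' GMT';

    // 1. Run the checks first. If-None-Match, when present, is used instead
    //    of If-Modified-Since (HTTP/1.1 gives it precedence).
    if (isset($server['HTTP_IF_NONE_MATCH'])) {
        $notModified = trim($server['HTTP_IF_NONE_MATCH']) === $etag;
    } elseif (isset($server['HTTP_IF_MODIFIED_SINCE'])) {
        // Compare parsed timestamps rather than raw strings.
        $since = strtotime($server['HTTP_IF_MODIFIED_SINCE']);
        $notModified = $since !== false && $since >= $mtime;
    } else {
        $notModified = false;
    }

    // 2. Validators and freshness headers are built only after the checks.
    //    max-age is used here in place of must-revalidate.
    $headers = [
        'ETag: ' . $etag,
        'Last-Modified: ' . $lastModified,
        'Cache-Control: max-age=' . $maxAge,
    ];

    return ['status' => $notModified ? 304 : 200, 'headers' => $headers];
}
```

A page script would call this before any output, send each entry of 'headers' with header(), and exit without a body when 'status' is 304.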

Justin

7:22 pm on Oct 11, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:Feb 18, 2004
posts:68
votes: 0


Try searching for "HTTP conditional requests in PHP" and Alexandre Alapetite for some code that does all this and works well.

7:34 pm on Oct 11, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


We don't normally want to see search terms in the Google Search forum, but in this case the reference has also been mentioned in our PHP forum [webmasterworld.com], so I trust that it is a solid authority and in line with our Charter.

Try [alexandre.alapetite.net...]

7:37 pm on Oct 11, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 9, 2005
posts:1509
votes: 0


Just to clarify the way an Expires and Cache-Control 'max-age' header should be set for future visitors:

[w3.org...]

Expires:

The format is an absolute date and time as defined by HTTP-date in section 3.3.1; it MUST be in RFC 1123 date format.

max-age:

Indicates that the client is willing to accept a response whose age is no greater than the specified time in seconds.
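To make the relationship between the two headers concrete, here is a small sketch of how a cache might work out a response's freshness lifetime, with max-age taking precedence over Expires. The function name freshness_lifetime and the lower-case header array are assumptions for the example; real caches, and Googlebot in particular, may behave differently.

```php
<?php
// Hypothetical helper: how long (in seconds) a cache may treat a response as
// fresh. $headers maps lower-case header names to values; $date is the
// response's Date as a Unix timestamp. Returns null if neither header is set.
function freshness_lifetime(array $headers, int $date): ?int
{
    // max-age, when present, takes precedence over Expires.
    if (isset($headers['cache-control'])
        && preg_match('/\bmax-age=(\d+)/i', $headers['cache-control'], $m)) {
        return (int) $m[1];
    }
    if (isset($headers['expires'])) {
        $expires = strtotime($headers['expires']);
        // Unparseable dates are treated as already expired, and an Expires
        // date in the past also yields a lifetime of zero.
        return $expires === false ? 0 : max(0, $expires - $date);
    }
    return null;
}
```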

Justin