Google crawl - not always sending IF_MODIFIED_SINCE

Thousands of requests each day

     

expiresnow

10:00 am on Mar 8, 2007 (gmt 0)

5+ Year Member



This is what I tried:

I've tried: Checking for the HTTP_IF_MODIFIED_SINCE header and returning "304 Not Modified" if possible.

Problem: Googlebot doesn't always send this header. Even when it already knows about a page, it doesn't always send the header.

I've tried: Using the Expires header to tell Google that each page should expire a month after the request.

Problem: Googlebot keeps requesting the pages. It seems to ignore this header.

I've tried: Lowering the crawl rate to "Slow" in Google Webmaster Tools.
Problem: This doesn't seem to have any significant effect.

Are there other solutions to this problem? I don't want to ban Googlebot since we get a lot of visitors from Google.

Brett_Tabke

1:46 pm on Oct 8, 2007 (gmt 0)

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Ran into this thread from a G search on the same topic.

Does anyone know the answer to this? I am running into the same thing on lower-PR sites.

jdMorgan

2:45 pm on Oct 8, 2007 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Check that the 'expires' header is relative -- Expires after so much time, rather than Expires at a certain time.

You should check your Cache-Control server response headers as well.

Jim

Webnauts

8:20 am on Oct 10, 2007 (gmt 0)

5+ Year Member



Should the "Expires" be past or future, or doesn't that matter?

[edited by: Webnauts at 8:22 am (utc) on Oct. 10, 2007]

jd01

8:35 pm on Oct 10, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I usually set a 'future expires' and check for both an HTTP_IF_MODIFIED_SINCE and HTTP_IF_NONE_MATCH.

header("Expires: " . gmdate("D, d M Y H:i:s", time()+24*60*60) . " GMT");

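// $date and $etag are the Last-Modified and ETag values for this page.
if ((isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) && $_SERVER['HTTP_IF_MODIFIED_SINCE'] === $date)
|| (isset($_SERVER['HTTP_IF_NONE_MATCH']) && $_SERVER['HTTP_IF_NONE_MATCH'] === $etag)) {
header('HTTP/1.1 304 Not Modified');
exit();
}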

I use a simple md5 hash based on URL / filemtime() to create custom ETag headers and ensure if I update the file a new request is made.

The order in the file should be reversed: the 304 check comes first.
The Expires header is created and set only if neither request header matches, and it needs to be sent before any other output.
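As a rough sketch of the approach described above (an md5-based ETag plus a check of both request headers), something like the following could work. The helper names make_etag() and is_not_modified() are made up for illustration and are not from the post. Letting If-None-Match win when both headers are present is an assumption on top of what the post says, and the comparison is a plain exact match that ignores weak validators and lists of ETags.

```php
<?php
// Hypothetical helper: quoted ETag built from an md5 of the URL and file mtime.
function make_etag(string $url, int $mtime): string
{
    return '"' . md5($url . '|' . $mtime) . '"';
}

// Hypothetical helper: decide whether a 304 can be sent.
// $server is normally $_SERVER; passing it in keeps the function testable.
function is_not_modified(array $server, string $lastModified, string $etag): bool
{
    if (isset($server['HTTP_IF_NONE_MATCH'])) {
        // Assumption: when the client sends an ETag, it takes precedence
        // over the date check.
        return trim($server['HTTP_IF_NONE_MATCH']) === $etag;
    }
    return isset($server['HTTP_IF_MODIFIED_SINCE'])
        && $server['HTTP_IF_MODIFIED_SINCE'] === $lastModified;
}

// Usage at the very top of a page, before any output:
//   $mtime        = filemtime(__FILE__);
//   $lastModified = gmdate('D, d M Y H:i:s', $mtime) . ' GMT';
//   $etag         = make_etag($_SERVER['REQUEST_URI'], $mtime);
//   if (is_not_modified($_SERVER, $lastModified, $etag)) {
//       header('HTTP/1.1 304 Not Modified');
//       exit();
//   }
//   header('Last-Modified: ' . $lastModified);
//   header('ETag: ' . $etag);
```

Because the ETag changes whenever the file's mtime changes, an updated file no longer matches what the client holds and a full response is sent.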

Justin

Webnauts

5:36 am on Oct 11, 2007 (gmt 0)

5+ Year Member



jd01, thanks for the cool tip.

I have this header so far:

<?php
// $file contains the file name of the page being displayed (the actual
// content, not any templates you may be using). We take the last modified
// date of this file.
$file = __FILE__;
$mtime = filemtime($file);
// Create a HTTP conformant date, example 'Mon, 22 Dec 2003 14:16:16 GMT'
$gmt_mtime = gmdate('D, d M Y H:i:s', $mtime).' GMT';
// send a unique 'strong' identifier. This is always the same for this
// particular file while the file itself remains the same.
header('ETag: "'.md5($mtime.$file).'"');
// check if the last modified date sent by the client is the same as
// the last modified date of the requested file. If so, return 304 header
// and exit.
if(isset($_SERVER['HTTP_IF_MODIFIED_SINCE']))
{
if ($_SERVER['HTTP_IF_MODIFIED_SINCE'] == $gmt_mtime)
{
header('HTTP/1.1 304 Not Modified');
exit();
}
}
// check if the Etag sent by the client is the same as the Etag of the
// requested file. If so, return 304 header and exit.
if (isset($_SERVER['HTTP_IF_NONE_MATCH']))
{
if (str_replace('"', '', stripslashes($_SERVER['HTTP_IF_NONE_MATCH'])) == md5($mtime.$file))
{
header("HTTP/1.1 304 Not Modified");
// abort processing and exit
exit();
}
}
// output last modified header using the last modified date of the file.
header('Last-Modified: '.$gmt_mtime);
// tell all caches that this resource is publicly cacheable.
header('Cache-Control: must-revalidate');
// this resource expires one day from now.
header("Expires: " . gmdate("D, d M Y H:i:s", time()+24*60*60) . " GMT");
// set the content-type
if (isset($_SERVER["HTTP_ACCEPT"]) && stristr( $_SERVER["HTTP_ACCEPT"], "application/xhtml+xml") ) {
header ("Content-type: application/xhtml+xml; charset=utf-8");
} else {
header ("Content-type: text/html; charset=utf-8");
}
// start output.
// Note that no output can precede the headers unless you call ob_start().
// You don't have to use gzip, but it greatly saves on bandwidth (for text)
// at the cost of a little more processing.
ob_start ("ob_gzhandler");
?>

What do you think about that?

[edited by: Webnauts at 6:21 am (utc) on Oct. 11, 2007]

jd01

7:17 pm on Oct 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Looks like a good start...

There are a few things I notice at a glance:
1. You need to move the ETag header down to where the rest of the headers are set, so the checks are first.

2. I don't usually set the Cache-Control header, but it overrides the Expires setting if a max-age is set, so if you are still having issues, you might set it to cache for a day (or the time period of your choice; 24*60*60 = 1 day) using max-age=86400.

3. The comment for the cache control header says you want to let the file be cached publicly, but you are running 'must-revalidate', so the remote cache will always be compared to the original source. Are you sure that's what you want?
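To make points 2 and 3 concrete, here is a minimal sketch of a header set that allows public caching for one day. The function name build_cache_headers() is hypothetical, and using 'public' in place of 'must-revalidate' assumes that public caching really is what you want.

```php
<?php
// Hypothetical helper (not from the thread): build the caching headers for a
// response that caches may reuse for $maxAge seconds.
function build_cache_headers(int $mtime, int $maxAge = 86400, ?int $now = null): array
{
    $now = $now ?? time();
    return [
        'Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT',
        // max-age overrides Expires for caches that understand Cache-Control.
        'Cache-Control: public, max-age=' . $maxAge,
        // Expires is kept for older caches that ignore Cache-Control.
        'Expires: ' . gmdate('D, d M Y H:i:s', $now + $maxAge) . ' GMT',
    ];
}

// Usage, after the 304 checks and before any output:
//   foreach (build_cache_headers(filemtime(__FILE__)) as $line) {
//       header($line);
//   }
```

Both values come from the same $maxAge, so Expires and max-age cannot disagree.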

Justin

4specs

7:22 pm on Oct 11, 2007 (gmt 0)

10+ Year Member



Try searching for "HTTP conditional requests in PHP" and Alexandre Alapetite for some code that does all this and works well.

tedster

7:34 pm on Oct 11, 2007 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



We don't normally want to see search terms in the Google Search forum, but in this case the reference has also been mentioned in our PHP forum [webmasterworld.com], so I trust that it is a solid authority and in line with our Charter.

Try [alexandre.alapetite.net...]

jd01

7:37 pm on Oct 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just to clarify the way an Expires and Cache-Control 'max-age' header should be set for future visitors:

[w3.org...]

Expires:

The format is an absolute date and time as defined by HTTP-date in section 3.3.1; it MUST be in RFC 1123 date format.

max-age:

Indicates that the client is willing to accept a response whose age is no greater than the specified time in seconds.
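Put another way, Expires is a point in time and max-age is a number of seconds, so on a response the two can be converted into each other relative to the time the response is sent. A small sketch, with hypothetical helper names and the current time passed in explicitly:

```php
<?php
// Hypothetical helper: convert an Expires value (an absolute HTTP date) into
// the equivalent max-age in seconds, measured from $now.
// Returns null when the date cannot be parsed.
function expires_to_max_age(string $expires, int $now): ?int
{
    $timestamp = strtotime($expires);
    if ($timestamp === false) {
        return null;
    }
    // A date in the past means "already expired", i.e. max-age of 0.
    return max(0, $timestamp - $now);
}

// Hypothetical helper: the RFC 1123 date that lies $maxAge seconds after $now.
function max_age_to_expires(int $maxAge, int $now): string
{
    return gmdate('D, d M Y H:i:s', $now + $maxAge) . ' GMT';
}

// Usage:
//   header('Expires: ' . max_age_to_expires(86400, time()));
//   header('Cache-Control: max-age=86400');
```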

Justin

 
