Google SEO News and Discussion Forum

    
Google crawl - not always sending IF_MODIFIED_SINCE
Thousands of requests each day
expiresnow




msg:3275020
 10:00 am on Mar 8, 2007 (gmt 0)

This is what I've tried so far:

I've tried: Checking for the HTTP_IF_MODIFIED_SINCE header and returning "304 Not Modified" when possible.

Problem: Googlebot doesn't always send this header. Even when it already knows about a page, it doesn't always send the header.

I've tried: Using the Expires header to tell Google that each page expires one month after the request.

Problem: Googlebot keeps requesting the pages. It seems to ignore this header.

I've tried: Lowering the crawl rate to "Slow" in Google Webmaster Tools.
Problem: This doesn't seem to have any significant effect.

Are there other solutions to this problem? I don't want to ban Googlebot, since we get a lot of visitors from Google.
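
A minimal sketch of the first check described above, assuming a PHP page. The helper name `is_not_modified()` is hypothetical. It parses the date instead of comparing strings, and when the header is missing (as reported for Googlebot) it returns false, so the full page is served.

```php
<?php
// Hypothetical helper: true when the client's If-Modified-Since date is
// at or after the page's last modification time.
function is_not_modified(array $server, int $mtime): bool
{
    // Without the request header there is nothing to compare, so no 304.
    if (empty($server['HTTP_IF_MODIFIED_SINCE'])) {
        return false;
    }
    // Parse the HTTP date rather than comparing raw strings.
    $since = strtotime($server['HTTP_IF_MODIFIED_SINCE']);
    return $since !== false && $since >= $mtime;
}

// Usage sketch (must run before any output is sent):
// if (is_not_modified($_SERVER, filemtime(__FILE__))) {
//     header('HTTP/1.1 304 Not Modified');
//     exit();
// }
```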

 

Brett_Tabke




msg:3471788
 1:46 pm on Oct 8, 2007 (gmt 0)

Ran into this thread from a G search on the same topic.

Does anyone know the answer to this? I am running into the same thing on lower-PR sites.

jdMorgan




msg:3471842
 2:45 pm on Oct 8, 2007 (gmt 0)

Check that the 'expires' header is relative -- Expires after so much time, rather than Expires at a certain time.

You should check your Cache-control server response headers as well.

Jim

Webnauts




msg:3473576
 8:20 am on Oct 10, 2007 (gmt 0)

Should the "Expires" be in the past or the future, or doesn't that matter?

[edited by: Webnauts at 8:22 am (utc) on Oct. 10, 2007]

jd01




msg:3474168
 8:35 pm on Oct 10, 2007 (gmt 0)

I usually set a 'future expires' and check for both an HTTP_IF_MODIFIED_SINCE and HTTP_IF_NONE_MATCH.

header("Expires: " . gmdate("D, d M Y H:i:s", time()+24*60*60) . " GMT");

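// $date and $etag are assumed to hold this page's Last-Modified date string and ETag.
if ((isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) && $_SERVER['HTTP_IF_MODIFIED_SINCE'] === $date)
    || (isset($_SERVER['HTTP_IF_NONE_MATCH']) && $_SERVER['HTTP_IF_NONE_MATCH'] === $etag)) {
    header('HTTP/1.1 304 Not Modified'); exit();
}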

I use a simple md5 hash based on URL / filemtime() to create custom ETag headers and ensure if I update the file a new request is made.

In the actual file the order should be reversed: run the check first, then create and send the Expires header only if neither request header matches. Either way, headers have to be sent before any other output.

Justin
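
The ETag scheme described above could look roughly like this. `make_etag()` and `etag_matches()` are hypothetical names, and the exact string that gets hashed is an assumption; the post only says the hash is based on the URL and `filemtime()`.

```php
<?php
// Hypothetical helper: build a quoted ETag from the URL and the file's
// modification time, so that editing the file changes the tag.
function make_etag(string $url, int $mtime): string
{
    return '"' . md5($url . $mtime) . '"';
}

// Hypothetical helper: true when the If-None-Match request header
// contains the current ETag. The header may list several tags.
function etag_matches(?string $ifNoneMatch, string $etag): bool
{
    if ($ifNoneMatch === null || $ifNoneMatch === '') {
        return false;
    }
    foreach (explode(',', $ifNoneMatch) as $candidate) {
        if (trim($candidate) === $etag) {
            return true;
        }
    }
    return false;
}

// Usage sketch:
// $etag = make_etag($_SERVER['REQUEST_URI'], filemtime(__FILE__));
// if (etag_matches($_SERVER['HTTP_IF_NONE_MATCH'] ?? null, $etag)) { /* send 304 */ }
```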

Webnauts




msg:3474553
 5:36 am on Oct 11, 2007 (gmt 0)

jd01, thanks for the cool tip.

I have this header so far:

<?php
// $file contains the file name of the page being displayed (the actual
// content, not any templates you may be using). We take the last modified
// date of this file.
$file = __FILE__;
$mtime = filemtime($file);
// Create an HTTP-conformant date, example 'Mon, 22 Dec 2003 14:16:16 GMT'
$gmt_mtime = gmdate('D, d M Y H:i:s', $mtime).' GMT';
// send a unique 'strong' identifier. This is always the same for this
// particular file while the file itself remains the same.
header('ETag: "'.md5($mtime.$file).'"');
// check if the last modified date sent by the client is the same as
// the last modified date of the requested file. If so, return 304 header
// and exit.
if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']))
{
    if ($_SERVER['HTTP_IF_MODIFIED_SINCE'] == $gmt_mtime)
    {
        header('HTTP/1.1 304 Not Modified');
        exit();
    }
}
// check if the ETag sent by the client is the same as the ETag of the
// requested file. If so, return 304 header and exit.
if (isset($_SERVER['HTTP_IF_NONE_MATCH']))
{
    if (str_replace('"', '', stripslashes($_SERVER['HTTP_IF_NONE_MATCH'])) == md5($mtime.$file))
    {
        header('HTTP/1.1 304 Not Modified');
        // abort processing and exit
        exit();
    }
}
// output last modified header using the last modified date of the file.
header('Last-Modified: '.$gmt_mtime);
// tell all caches that this resource is publicly cacheable.
header('Cache-Control: must-revalidate');
// this resource expires one day from now.
header('Expires: ' . gmdate('D, d M Y H:i:s', time()+24*60*60) . ' GMT');
// set the content-type
if (isset($_SERVER['HTTP_ACCEPT']) && stristr($_SERVER['HTTP_ACCEPT'], 'application/xhtml+xml')) {
    header('Content-type: application/xhtml+xml; charset=utf-8');
} else {
    header('Content-type: text/html; charset=utf-8');
}
// start output.
// Note that no output can precede the headers unless you call ob_start().
// You don't have to use gzip, but it greatly saves on bandwidth (for text)
// at the cost of a little more processing.
ob_start('ob_gzhandler');
?>

What do you think about that?

[edited by: Webnauts at 6:21 am (utc) on Oct. 11, 2007]

jd01




msg:3475245
 7:17 pm on Oct 11, 2007 (gmt 0)

Looks like a good start...

There are a few things I notice at a glance:
1. You need to move the ETag header down to where the rest of the headers are set, so the checks come first.

2. I don't usually set the Cache-Control header, but it overrides the Expires setting if a max-age is set. So if you are still having issues, you might set it to cache for a day (or the time period of your choice; 24*60*60 = 1 day) using max-age=86400.

3. The comment for the cache control header says you want to let the file be cached publicly, but you are running 'must-revalidate', so the remote cache will always be compared to the original source. Are you sure that's what you want?

Justin
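
Points 1 and 2 above could be combined roughly as follows: run both checks first, then send the ETag together with the other headers, with a `max-age` of one day. The function name `build_response()` and its return shape are hypothetical, chosen so the logic can be tried without sending real headers. The Expires and Content-type headers from the posted script are left out for brevity.

```php
<?php
// Hypothetical helper: decide the status and headers for one request.
// $server is a $_SERVER-style array; $file and $mtime describe the page.
function build_response(array $server, string $file, int $mtime): array
{
    $lastModified = gmdate('D, d M Y H:i:s', $mtime) . ' GMT';
    $etag = '"' . md5($mtime . $file) . '"';

    // Checks first: if either validator matches, the client copy is current.
    $sameDate = isset($server['HTTP_IF_MODIFIED_SINCE'])
        && $server['HTTP_IF_MODIFIED_SINCE'] === $lastModified;
    $sameTag = isset($server['HTTP_IF_NONE_MATCH'])
        && $server['HTTP_IF_NONE_MATCH'] === $etag;
    if ($sameDate || $sameTag) {
        return ['status' => 304, 'headers' => []];
    }

    // No match: send the validators and the caching header together.
    return ['status' => 200, 'headers' => [
        'ETag: ' . $etag,
        'Last-Modified: ' . $lastModified,
        'Cache-Control: max-age=86400', // 24*60*60 seconds = 1 day
    ]];
}
```

On a 200 result, each header string would be passed to `header()` before any output; on a 304, only the status line is sent.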

4specs




msg:3475251
 7:22 pm on Oct 11, 2007 (gmt 0)

Try searching for "HTTP conditional requests in PHP" and Alexandre Alapetite for some code that does all this and works well.

tedster




msg:3475260
 7:34 pm on Oct 11, 2007 (gmt 0)

We don't normally want to see search terms in the Google Search forum, but in this case the reference has also been mentioned in our PHP forum [webmasterworld.com], so I trust that it is a solid authority and in line with our Charter.

Try [alexandre.alapetite.net...]

jd01




msg:3475262
 7:37 pm on Oct 11, 2007 (gmt 0)

Just to clarify the way an Expires and Cache-Control 'max-age' header should be set for future visitors:

[w3.org...]

Expires:
The format is an absolute date and time as defined by HTTP-date in section 3.3.1; it MUST be in RFC 1123 date format.

max-age:
Indicates that the client is willing to accept a response whose age is no greater than the specified time in seconds.

Justin
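
To make the difference concrete, here is a small sketch that expresses the same lifetime in both forms: `Expires` carries an absolute GMT date, while `max-age` carries a number of seconds. The function name `freshness_headers()` is hypothetical.

```php
<?php
// Hypothetical helper: express one cache lifetime in both header forms.
function freshness_headers(int $now, int $lifetimeSeconds): array
{
    return [
        // Absolute date and time in RFC 1123 format, always in GMT.
        'Expires: ' . gmdate('D, d M Y H:i:s', $now + $lifetimeSeconds) . ' GMT',
        // The same lifetime as a relative number of seconds.
        'Cache-Control: max-age=' . $lifetimeSeconds,
    ];
}

// Usage sketch (before any output):
// foreach (freshness_headers(time(), 24 * 60 * 60) as $h) { header($h); }
```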


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved