The user agent sends an "If-Modified-Since" request header along with the date of its last visit to that page.
The server responds according to whether the date matches or not. If the date matches, it will respond with a 304 (Not Modified) and the user agent will use the version of the page it already has and go on to its next request. If the date connected with the page is newer, the server responds with a 200 (OK) and the user agent grabs the page to update its cache.
If using a server with Apache, "If-Modified-Since" is set by default, but Apache cannot / does not respond to the "If-Modified-Since" request on PHP files and will return a status 200 (OK). Therefore, I must include code in my PHP files to make the server respond to the request (because I do not have access to Apache's configuration). Just adding <?PHP header("Last-Modified: " . gmdate("D, d M Y H:i:s", getlastmod()) . " GMT");?> to the file does not work on its own; I must also tell the user agent how to react to the status that it receives.
With a status of 200 on every PHP file, the user agent thinks that the page is always "fresh", and therefore grabs the "newer" version. This causes unneeded bandwidth usage and slows down load times.
Once I make sure I understand how "If-Modified-Since" works, my next step will be determining the best way for me to tell the user agents how to respond.
Questions that I still have:
If I am using a Google sitemap, and the last-modified date and the change frequency don't match, will / can this confuse the bot and delay, stop or interrupt crawling? For example, the sitemap lists <lastmod>2006-01-05</lastmod> <changefreq>weekly</changefreq>. The bot checks the page and gets a 200, which says this page has changed. The bot checks again immediately and gets another 200 when it is expecting a 304 (Not Modified). Now the bot is confused because I told it that the page only changes weekly when in fact the server has told it that the page changes each time it is called. (Does this look like blackhat to the bot?) Obviously, it must tell it something, even if it only says, "this webmaster doesn't know what he's doing".
Is it possible that bots are looking for a certain percentage of 304's and 200's when crawling a site (especially new sites) and if the ratio is not within a certain range, crawling will be affected? One bot gets the page and says to the second bot, "this is what I have and when you go back, it should be the same and you should get a 304". If it's not, the second bot says, I got a 200, you need to go back and try again. (The circle theory, as I have decided to call it, not that "S" word.)
Do the bots also check page size against their last cached copy and compare it alongside the status code to see if the page really has changed?
Are PHP files the only ones for which Apache does not send a last-modified date? What about XML, CSS, etc.?
When I check the Apache configuration with info.php, the directive for last_modified is set at 0 for the local value and the master value. This confused me because I do see last modified dates on html files. If it were set at something other than 0, would this send last modified dates for php files? XbitHack is also set at 0, (I have not read about XbitHack yet, I'm saving that for another day.)
This is the 'default' setting of what you are asking:
Your server will not send last-modified headers for pages processed server side (dynamic), because there is no way for the server to know if the content of the page has changed from the last request or not. So, php, pl, asp, etc. which require server side processing will (and should) not have a last modified date.
Static pages are served with Last-Modified headers, because the modified date is set when the file is saved.
The only real way to correct this is as you assumed, within your php (or other dynamic) file itself... (I have seen occasions where a Server will set headers on dynamic pages, but have not seen a last-modified set.) I usually wrap everything with if(!headers_sent()) { do stuff }
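As a rough illustration of the headers_sent() wrapper mentioned above, here is a minimal sketch. The helper names httpDate() and sendLastModified() are made up for this example, and the type declarations assume PHP 7 or later; getlastmod(), gmdate(), header() and headers_sent() are standard PHP functions.

```php
<?php
// Hypothetical helper: formats a Unix timestamp as an HTTP date (always GMT).
function httpDate(int $timestamp): string
{
    return gmdate('D, d M Y H:i:s', $timestamp) . ' GMT';
}

// Hypothetical helper: sends Last-Modified only while headers can still be sent.
function sendLastModified(int $timestamp): bool
{
    if (headers_sent()) {
        return false; // output has already started, too late to add headers
    }
    header('Last-Modified: ' . httpDate($timestamp));
    return true;
}

// getlastmod() returns the modification time of the main script, or false.
$mtime = getlastmod();
if ($mtime !== false) {
    sendLastModified($mtime);
}
```

This only advertises a date; it does not yet answer a conditional request with a 304, which needs a separate check of the incoming If-Modified-Since header.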
I cannot speak to all search engines, but Google does compare actual page content from the most recent cache to previous cache(s), and will adjust accordingly. I do not know if there is some type of penalty for this, but it would seem silly and petty for there to be one. AFAIK the dates in the sitemap are there as a guide, and should not invoke a penalty.
If you need some help with the actual php code, I am sure you will find quite a bit of help in the php forum.
Hope this helps.
Justin
This is really more of a php question than an Apache question...
When I check the Apache configuration with info.php, the directive for last_modified is set at 0 for the local value and the master value. This confused me because I do see last modified dates on html files. If it were set at something other than 0, would this send last modified dates for php files?
Still wondering about this...Anyone know?
suzie250:
With a status of 200 on every PHP file, the user agent thinks that the page is always "fresh", and therefore grabs the "newer" version.
When a server sends a "304" that is all that the UA gets (a header - there is no body to the "page"; that is where the bandwidth savings come from). When a server sends a "200" it sends both header+page. (sorry that I am being completely anal-retentive about this!)
The difference between static and dynamic pages:
Static:
Apache does all Content-Negotiation (exactly what is decided by the settings within httpd.conf).
Dynamic:
All Content-Negotiation has to be performed within the scripts. Using "<?PHP header("Last-Modified: " . gmdate("D, d M Y H:i:s", getlastmod()) . " GMT");?>" is fine, it is just that that is not Content-Negotiation (look through the Class linked above to get that point). Unfortunately, I cannot answer the meat of your principal questions, other than to say that on my own site all pages send (and respond to) Last-Modified and If-Modified-Since, with unchanging dates for some pages, yet *every* page changes on every single request (the pages contain hit stats, etc). AFAIK that does *not* affect their spidering adversely, and *does* reduce bandwidth immensely.
$last_modified = 1;
$original_photo_date = $row_Recordset1['stamp'];
$modified_photo_date = $row_Recordset1['modifiedstamp'];
if($original_photo_date != "" && $modified_photo_date == "") { $last_modified = $original_photo_date; }
if($original_photo_date != "" && $modified_photo_date != "") { $last_modified = $modified_photo_date; }
else { $last_modified = $original_photo_date; }
$photo_timestamp = strtotime($last_modified);
$date_mod_photo_gmt = gmdate('D, d M Y H:i:s', $photo_timestamp) . ' GMT';
$if_modified_since = preg_replace('/;.*$/', '', $HTTP_IF_MODIFIED_SINCE);
$mtime = filemtime($SCRIPT_FILENAME);
$gmdate_mod = gmdate('D, d M Y H:i:s', $mtime) . ' GMT';
if ($if_modified_since == $date_mod_photo_gmt) {
header("HTTP/1.0 304 Not Modified");
exit;
}
header("Last-Modified: $date_mod_photo_gmt");
sends the following...
HTTP/1.1 200 OK =>
Date => Sun, 24 Feb 2008 18:26:51 GMT
Server => Apache/1.3.33 (Darwin) PHP/5.0.4 mod_perl/1.26
Cache-Control => max-age=60
Expires => Sun, 24 Feb 2008 18:27:51 GMT
X-Powered-By => PHP/5.0.4
Last-Modified => Sat, 01 Dec 2007 17:57:18 GMT
Connection => close
Content-Type => text/html
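The Cache-Control and Expires values in the response above agree with each other: Expires is the Date plus the 60-second max-age. While a response is still fresh by these values, a browser may reuse its cached copy without contacting the server at all, which could look like a 304 even though every real request gets a 200. The sketch below just reproduces that arithmetic; freshnessHeaders() is a made-up helper name, not part of PHP, and the code assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: builds matching Cache-Control and Expires values.
function freshnessHeaders(int $now, int $maxAge): array
{
    return [
        'Cache-Control' => 'max-age=' . $maxAge,
        'Expires'       => gmdate('D, d M Y H:i:s', $now + $maxAge) . ' GMT',
    ];
}

// Same instant as the Date header in the response above.
$now = gmmktime(18, 26, 51, 2, 24, 2008);

foreach (freshnessHeaders($now, 60) as $name => $value) {
    echo $name, ' => ', $value, "\n";
}
// Prints:
// Cache-Control => max-age=60
// Expires => Sun, 24 Feb 2008 18:27:51 GMT
```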
Caching behavior can also be affected by several other headers from the HTTP specification, such as Expires, Cache-Control, ETag, Age and some more.
If-Modified-Since: date
This request header is used with GET method to make it conditional: if the requested document has not changed since the time specified in this field the document will not be sent, but instead a Not Modified 304 reply.
If-Modified-Since: date
[w3.org...]
Not Modified 304
If the client has done a conditional GET and access is allowed, but the document has not been modified since the date and time specified in If-Modified-Since field, the server responds with a 304 status code and does not send the document body to the client. Response headers are as if the client had sent a HEAD request, but limited to only those headers which make sense in this context. This means only headers that are relevant to cache managers and which may have changed independently of the document's Last-Modified date. Examples include Date, Server and Expires.
The purpose of this feature is to allow efficient updates of local cache information (including relevant metainformation) without requiring the overhead of multiple HTTP requests (e.g. a HEAD followed by a GET) and minimizing the transmittal of information already known by the requesting client (usually a caching proxy).
Not Modified 304
[w3.org...]
About the headers; in a nutshell:
The Expires response header tells the user agent when the resource expires. Before that time the user agent should not request the resource at all, not even with conditional GET requests. PHP can set this value in the past (it does so by default when a session is started), meaning the resource has already expired, so the user agent will request it anyway.
The Cache-Control header is used to explicitly define some aspects of caching; for example, it can simply disable caching altogether.
The ETag header can more or less be interpreted as a unique identifier of the resource's content. That is, if you are generating dynamic content on the fly, you can generate an identifier for this content, and user agents can use this identifier (entity tag) to make conditional requests with If-None-Match (or If-Match).
There are several other headers which have some relevance to, or effect on, caching, so if you really want to get into this, you should read the protocol specification. But keep in mind that this is just a recommendation; the behavior of user agents may be (slightly) different, or they might not have all of the features implemented.
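To make the entity-tag idea above a little more concrete, here is a minimal sketch for dynamically generated content. The helpers makeEtag() and etagMatches() are invented for this example, hashing the body with md5() is just one possible way to derive a tag, and the code assumes PHP 7.1 or later. Splitting the header on commas is a simplification that works for these hash-based tags but not for arbitrary tags containing commas.

```php
<?php
// Hypothetical helper: derives an entity tag from the generated body.
function makeEtag(string $body): string
{
    return '"' . md5($body) . '"';
}

// Hypothetical helper: true when the client's If-None-Match lists our tag.
function etagMatches(?string $ifNoneMatch, string $etag): bool
{
    if ($ifNoneMatch === null || trim($ifNoneMatch) === '') {
        return false; // no conditional header was sent
    }
    if (trim($ifNoneMatch) === '*') {
        return true; // "*" matches any current representation
    }
    foreach (explode(',', $ifNoneMatch) as $candidate) {
        $candidate = trim($candidate);
        // Ignore the weak-validator prefix for this comparison.
        if (strncmp($candidate, 'W/', 2) === 0) {
            $candidate = substr($candidate, 2);
        }
        if ($candidate === $etag) {
            return true;
        }
    }
    return false;
}

// Placeholder body standing in for the page the script generates.
$body = '<html><body>Hello</body></html>';
$etag = makeEtag($body);
$clientTag = $_SERVER['HTTP_IF_NONE_MATCH'] ?? null;

if (etagMatches($clientTag, $etag)) {
    header('HTTP/1.1 304 Not Modified');
    header('ETag: ' . $etag);
    exit; // a 304 carries no body
}
header('ETag: ' . $etag);
echo $body;
```

Note that the body still has to be generated before it can be hashed, so this approach saves bandwidth rather than server work.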
Additionally, I have a few include pages as part of the main page and wonder if these might somehow be affecting the interpretation of the If-Modified-Since request header.
***ADDED Sat, 01 Mar 2008 17:00:44 GMT***
Is it possible that the check header tool I'm using isn't reporting what is actually happening? I've tried a number of different tools freely available on the Internet that report a Response Code: HTTP/1.1 200, but the caching behavior of the user agent seems to indicate that the Response Code is being sent correctly as a 304. Maybe it IS working the way it should be but I'm being led to believe it's not because I'm putting too much trust in these tools.
I'm then checking to see if the photo's modified date is equal to a variable populated from the $HTTP_IF_MODIFIED_SINCE header. If it is, the server should respond with the 304 Not Modified header. If not, it should respond with the 200 OK header.
I am printing these variables out on my page to see what is happening. The last-modified variable taken from my photo's last modified timestamp is working just fine.
However, my if-modified-since variable is returning as an empty value UNLESS I remove the if/then condition evaluating whether the photo's modified date is equal to the $HTTP_IF_MODIFIED_SINCE response. In these instances the if-modified-since variable IS NOT empty and is in fact EQUAL to my photo's last modified date like I expect it to be. So, I'm perplexed as to why, if reinstated, my if/then condition construct is not catching this equality and serving the 304 like it should.
If I reinstate the if/then condition which is evaluating if $HTTP_IF_MODIFIED_SINCE is equal to my photo's last modified date, the $HTTP_IF_MODIFIED_SINCE variable is again returning as empty. The strange part is that the page is acting as if it is being given the 304 header response because it is definitely being served from the user agent's cache and not being downloaded again from the server. In this case I assume it is working but if I use one of the check header tools on the Internet it reports that the 200 OK header is being sent, not the 304 header.
I'm sorry to be so verbose but I'm stumped and I need someone to tell me where my logic has failed.
Thank you.
1) The If-Modified-Since header is sent by the client (browser or robot), and may in fact be empty.
2) You should not be comparing the file timestamp and the If-Modified-Since header for equality, but rather checking to see if the timestamp on the file is later than the date in the IMS header. There is likely to be a PHP library available to implement the needed date-comparison function(s), similar to Perl's "Date::Calc" library.
Jim
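The two points above could be sketched roughly as follows: treat a missing header as an ordinary request, and compare timestamps rather than strings. The helper isNotModified() is invented for this example, and the code assumes PHP 7.1 or later. It also reads the header from $_SERVER rather than from the bare $HTTP_IF_MODIFIED_SINCE variable, which is only populated when register_globals is enabled and may be one reason the value shows up empty. The timestamp below is a placeholder standing in for the photo date from the database.

```php
<?php
// Hypothetical helper: true when the client's copy is at least as new as ours.
function isNotModified(?string $ifModifiedSince, int $lastModified): bool
{
    if ($ifModifiedSince === null || $ifModifiedSince === '') {
        return false; // no conditional header: send the full page
    }
    // Some old clients append "; length=..." to the date; strip it.
    $dateOnly = preg_replace('/;.*$/', '', $ifModifiedSince);
    $clientTime = strtotime($dateOnly);
    if ($clientTime === false) {
        return false; // unparseable date: treat the request as unconditional
    }
    // Not modified unless our timestamp is later than the client's date.
    return $lastModified <= $clientTime;
}

// Placeholder timestamp standing in for the photo's last-modified date.
$lastModified = gmmktime(17, 57, 18, 12, 1, 2007);
$header = $_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? null;

if (isNotModified($header, $lastModified)) {
    header('HTTP/1.1 304 Not Modified');
    exit; // a 304 carries no body
}
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
```

PHP's built-in strtotime() handles the HTTP date format here, so no extra date library is needed for this comparison.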