Understanding If-Modified-Since

Do I have it right, and how might responses affect bot activity?

         

suzie250

8:57 pm on Jan 18, 2006 (gmt 0)

10+ Year Member



I'm on information overload and need to make sure I understand "if-modified-since" and have questions on how responses from the server might affect spidering.

The user agent sends an "If-Modified-Since" request header along with the date of its last visit to that page.
The server responds according to whether the date matches or not. If the date matches, it will respond with a 304 (Not Modified) and the user agent will use the version of the page it already has and go on to its next request. If the date connected with the page is newer, the server responds with a 200 (OK) and the user agent grabs the page to update its cache.

If using a server with Apache, "If-Modified-Since" handling is on by default, but Apache cannot / does not respond to the "If-Modified-Since" request on PHP files and will return a status 200 (OK). Therefore, I must include code in my PHP files to make the server respond to the request (because I do not have access to Apache's configuration). Just adding <?PHP header("Last-Modified: " . gmdate("D, d M Y H:i:s", getlastmod()) . " GMT");?> to the file does not work on its own; I must also tell the user agent how to react to the status that it receives.

With a status of 200 on every PHP file, the user agent thinks that the page is always "fresh", and therefore grabs the "newer" version. This causes unneeded bandwidth usage and slows down load times.

Once I make sure I am understanding how "If-Modified-Since" works, my next step will be determining which is the best way for me to tell the user agents how to respond.

Questions that I still have:
If I am using a Google sitemap, and the last-modified date and the change frequency don't match, will / can this confuse the bot and delay, stop or interrupt crawling? For example, the sitemap lists <lastmod>2006-01-05</lastmod>
<changefreq>weekly</changefreq>. The bot checks the page and gets a 200, which says this page has changed. The bot checks again immediately and gets another 200 when it is expecting a 304 (Not Modified). Now the bot is confused because I told it that page only changes weekly when in fact the server has told it that the page changes each time it is called. (Does this look like blackhat to the bot?) Obviously, it must tell it something, even if it only says, "this webmaster doesn't know what he's doing".

Is it possible that bots are looking for a certain percentage of 304's and 200's when crawling a site (especially new sites) and if the ratio is not within a certain range, crawling will be affected? One bot gets the page and says to the second bot, "this is what I have and when you go back, it should be the same and you should get a 304". If it's not, the second bot says, I got a 200, you need to go back and try again. (The circle theory, as I have decided to call it, not that "S" word.)

Do the bots also check page size against their last cached copy and compare it alongside the status code to see if the page really has changed?

Are PHP files the only ones for which Apache does not send a last-modified date? What about XML, CSS, etc.?

When I check the Apache configuration with info.php, the directive for last_modified is set at 0 for the local value and the master value. This confused me because I do see last modified dates on html files. If it were set at something other than 0, would this send last modified dates for php files? XbitHack is also set at 0, (I have not read about XbitHack yet, I'm saving that for another day.)

jd01

12:01 am on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is really more of a php question than an Apache question...

This is the 'default' setting of what you are asking:

Your server will not send last-modified headers for pages processed server side (dynamic), because there is no way for the server to know if the content of the page has changed from the last request or not. So, php, pl, asp, etc. which require server side processing will (and should) not have a last modified date.

Static pages are served with last-modified headers, because the modified date is set when the file is saved.

The only real way to correct this is as you assumed, within your php (or other dynamic) file itself... (I have seen occasions where a Server will set headers on dynamic pages, but have not seen a last-modified set.) I usually wrap everything with if(!headers_sent()) { do stuff }
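The guard described above might look something like the following. This is only a minimal sketch, assuming PHP 7 or later; `lastModifiedHeader` is a hypothetical helper name, and using the script's own modification time (via `getlastmod()`) is an assumption that only makes sense when the page content changes with the file.

```php
<?php
// Hypothetical helper: format a Unix timestamp as a Last-Modified header line.
function lastModifiedHeader(int $timestamp): string
{
    return 'Last-Modified: ' . gmdate('D, d M Y H:i:s', $timestamp) . ' GMT';
}

// getlastmod() returns the modification time of the main script,
// or false if it cannot be determined.
$mtime = getlastmod();

// Only touch headers if nothing has been sent to the client yet.
if ($mtime !== false && !headers_sent()) {
    header(lastModifiedHeader($mtime));
}
```

On its own this only advertises a date. The script still has to look at the incoming If-Modified-Since header before a 304 can ever be returned.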

I cannot speak to all search engines, but Google does compare actual page content from the most recent cache to previous cache(s), and will adjust accordingly. I do not know if there is some type of penalty for this, but it would seem silly and petty for there to be one. AFAIK the dates in the sitemaps are there as a guide, and should not invoke a penalty.

If you need some help with the actual php code, I am sure you will find quite a bit of help in the php forum.

Hope this helps.

Justin

suzie250

8:06 pm on Jan 20, 2006 (gmt 0)

10+ Year Member



This is really more of a php question than an Apache question...

Wasn't sure where to put this post.

When I check the Apache configuration with info.php, the directive for last_modified is set at 0 for the local value and the master value. This confused me because I do see last modified dates on html files. If it were set at something other than 0, would this send last modified dates for php files?

Still wondering about this...Anyone know?

AlexK

8:58 pm on Jan 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This link will give a PHP-Class [webmasterworld.com] to enact all content-negotiation (not just If-Modified-Since).

suzie250:

With a status of 200 on every php file, the user agent thinks that the page is always "fresh", and therefore grabs the "newer" version.

Not quite accurate.

When a server sends a "304" that is all that the UA gets (a header - there is no body to the "page"; that is where the bandwidth savings come from). When a server sends a "200" it sends both header+page. (sorry that I am being completely anal-retentive about this!)
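To make the difference concrete, here is a rough sketch (PHP 7 or later assumed) of what goes on the wire in each case. `buildResponse` is a hypothetical helper used purely for illustration, and the decision about whether the page counts as unmodified is assumed to have been made elsewhere.

```php
<?php
// Hypothetical helper: describe the response for each outcome.
// $notModified is assumed to have been decided by the caller.
function buildResponse(bool $notModified, string $body): array
{
    if ($notModified) {
        // 304: status line and headers only, never a body.
        return ['status' => 'HTTP/1.1 304 Not Modified', 'body' => ''];
    }
    // 200: status line, headers, and the full page.
    return ['status' => 'HTTP/1.1 200 OK', 'body' => $body];
}

$page = '<html><body>Hello</body></html>';
$cached = buildResponse(true, $page);   // nothing to transfer beyond headers
$fresh  = buildResponse(false, $page);  // the whole page is transferred
```

The bandwidth saving is exactly the size of the body that the 304 case leaves out.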

The difference between static and dynamic pages:

Static:
Apache does all Content-Negotiation (exactly what is decided by the settings within httpd.conf).

Dynamic:
All Content-Negotiation has to be performed within the scripts. Using "<?PHP header("Last-Modified: " . gmdate("D, d M Y H:i:s", getlastmod()) . " GMT");?>" is fine, it is just that that is not Content-Negotiation (look through the Class linked above to get that point).

Unfortunately, I cannot answer the meat of your principal questions, other than to say that on my own site all pages send (and respond to) Last-Modified and If-Modified-Since, with unchanging-dates for some pages, yet *every* page changes on every single request (the pages contain hit stats, etc). AFAIK that does *not* affect their spidering adversely, and *does* reduce bandwidth immensely.

jepler

6:31 pm on Feb 24, 2008 (gmt 0)

10+ Year Member



I, too, am having difficulty understanding why $HTTP_IF_MODIFIED_SINCE seems to be inconsistently sent with my PHP dynamic pages. I am plugging in a "Last-Modified:" header based on the last time the content was updated in the database tables, not the last time the page script was updated. A header check reveals that the last modified header seems to be sent correctly, but logic I built into the page to send a "304 Not Modified" doesn't always work because the $HTTP_IF_MODIFIED_SINCE variable is often returned with an empty value. I'm also using mod rewrite to rewrite my dynamic URLs to static urls which include an .html extension.

$last_modified = 1;
$original_photo_date = $row_Recordset1['stamp'];
$modified_photo_date = $row_Recordset1['modifiedstamp'];
if($original_photo_date != "" && $modified_photo_date == "") { $last_modified = $original_photo_date; }
if($original_photo_date != "" && $modified_photo_date != "") { $last_modified = $modified_photo_date; }
else { $last_modified = $original_photo_date; }

$photo_timestamp = strtotime($last_modified);
$date_mod_photo_gmt = gmdate('D, d M Y H:i:s', $photo_timestamp) . ' GMT';

$if_modified_since = preg_replace('/;.*$/', '', $HTTP_IF_MODIFIED_SINCE);

$mtime = filemtime($SCRIPT_FILENAME);
$gmdate_mod = gmdate('D, d M Y H:i:s', $mtime) . ' GMT';

if ($if_modified_since == $date_mod_photo_gmt) {
    header("HTTP/1.0 304 Not Modified");
    exit;
}
header("Last-Modified: $date_mod_photo_gmt");

sends the following...

HTTP/1.1 200 OK =>
Date => Sun, 24 Feb 2008 18:26:51 GMT
Server => Apache/1.3.33 (Darwin) PHP/5.0.4 mod_perl/1.26
Cache-Control => max-age=60
Expires => Sun, 24 Feb 2008 18:27:51 GMT
X-Powered-By => PHP/5.0.4
Last-Modified => Sat, 01 Dec 2007 17:57:18 GMT
Connection => close
Content-Type => text/html

gergoe

1:11 am on Feb 25, 2008 (gmt 0)

10+ Year Member



There are cases when the user agent (browser) is not sending the if-modified-since request header, for example when it's the first time it loads that uri, the caching is disabled in the user agent (or is full), or the user agent is a proxy, and it wants to have a recent copy of the requested uri. In such a case, the server should respond with 200 OK, and send the whole content.
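Because the header can legitimately be missing, a script has to treat "absent" as a normal case. A minimal sketch, assuming PHP 7.1 or later and reading from `$_SERVER` rather than a bare `$HTTP_IF_MODIFIED_SINCE` global (which only exists when register_globals is enabled); `ifModifiedSince` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: return the If-Modified-Since value, or null if the
// client did not send one.
function ifModifiedSince(array $server): ?string
{
    $value = $server['HTTP_IF_MODIFIED_SINCE'] ?? '';
    // Drop anything after a semicolon, as the preg_replace in the
    // code posted earlier in this thread does.
    $value = trim(preg_replace('/;.*$/', '', $value));
    return $value === '' ? null : $value;
}

// null here means: no conditional request, answer with 200 and the page.
$ims = ifModifiedSince($_SERVER);
```

A null result is not an error; it simply means the full page should be sent.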

Caching behavior can also be affected by several other headers from the HTTP specification, like Expires, Cache-Control, ETag, Age and some more.

jepler

1:25 am on Feb 25, 2008 (gmt 0)

10+ Year Member



I've tested on a variety of platforms and browsers including resetting user agent cache, etc.

I'm interested in knowing more about how Expires, Cache-Control and ETag could potentially affect when, how and why a browser would send the if-modified-since header, though.

pageoneresults

1:52 am on Feb 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just for clarification, from the protocol...

If-Modified-Since: date
This request header is used with GET method to make it conditional: if the requested document has not changed since the time specified in this field the document will not be sent, but instead a Not Modified 304 reply.

If-Modified-Since: date
[w3.org...]

Not Modified 304
If the client has done a conditional GET and access is allowed, but the document has not been modified since the date and time specified in If-Modified-Since field, the server responds with a 304 status code and does not send the document body to the client.

Response headers are as if the client had sent a HEAD request, but limited to only those headers which make sense in this context. This means only headers that are relevant to cache managers and which may have changed independently of the document's Last-Modified date. Examples include Date, Server and Expires.

The purpose of this feature is to allow efficient updates of local cache information (including relevant metainformation) without requiring the overhead of multiple HTTP requests (e.g. a HEAD followed by a GET) and minimizing the transmittal of information already known by the requesting client (usually a caching proxy).

Not Modified 304
[w3.org...]

gergoe

11:23 am on Feb 25, 2008 (gmt 0)

10+ Year Member



First of all, you should check the HTTP protocol specification, [google.com ].

About the headers; in a nutshell:
The Expires response header tells the user agent when the resource expires; before that time the user agent need not request the resource at all, not even with a conditional GET. By default PHP (when sessions are in use) sets this value in the past, meaning the resource has already expired (so the user agent will request it anyway).

The Cache-control header is used to explicitly define some aspect of the caching, for example it can simply disable any caching.
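As a rough illustration of these two headers working together, the sketch below builds a matching Cache-Control / Expires pair for a given lifetime. It assumes PHP 7 or later; `freshnessHeaders` is a hypothetical helper, and the 60-second lifetime is only an example value.

```php
<?php
// Hypothetical helper: headers that let a client reuse a response for
// $maxAge seconds without contacting the server again.
function freshnessHeaders(int $maxAge, int $now): array
{
    return [
        'Cache-Control: max-age=' . $maxAge,
        'Expires: ' . gmdate('D, d M Y H:i:s', $now + $maxAge) . ' GMT',
    ];
}

// Send the headers only if output has not started yet.
if (!headers_sent()) {
    foreach (freshnessHeaders(60, time()) as $line) {
        header($line);
    }
}
```

While a response is still fresh by these headers, a browser may serve it from cache without sending any request, conditional or otherwise.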

The ETag header can more or less be interpreted as a unique identifier of the resource's content. That is, if you are generating dynamic content on the fly, you can generate an identifier for this content, and user agents can use this identifier (entity tag) to make If-None-Match (or If-Match) conditional requests.
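A minimal sketch of the entity-tag idea, assuming PHP 7.1 or later. Hashing the generated content with md5 is just one possible way to derive a tag, and `makeEtag` and `etagMatches` are hypothetical helper names.

```php
<?php
// Hypothetical helper: derive an entity tag from the generated content.
function makeEtag(string $content): string
{
    return '"' . md5($content) . '"';
}

// Hypothetical helper: true if the client's If-None-Match value lists our tag.
function etagMatches(?string $ifNoneMatch, string $etag): bool
{
    if ($ifNoneMatch === null) {
        return false; // header absent: not a conditional request
    }
    if (trim($ifNoneMatch) === '*') {
        return true;  // "*" matches any current representation
    }
    foreach (explode(',', $ifNoneMatch) as $candidate) {
        $candidate = trim($candidate);
        // If-None-Match compares weakly, so ignore a leading W/ marker.
        if (strpos($candidate, 'W/') === 0) {
            $candidate = substr($candidate, 2);
        }
        if ($candidate === $etag) {
            return true;
        }
    }
    return false;
}

$etag = makeEtag('<html>generated page</html>');
```

When `etagMatches` returns true the script can answer with a 304 and skip the body, in the same way as for a date-based check.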

There are several other headers which have some relevance to, or effect on, caching, so if you really want to get into this, you should read the protocol specification. But keep in mind that this is just a recommendation; the behavior of the user agents may be (slightly) different, or they might not have all of the features implemented.

jepler

4:08 pm on Mar 1, 2008 (gmt 0)

10+ Year Member



I appreciate everyone's responses. I am not sending any expires response or cache-control information on the aforementioned pages myself. So, by default, this information is being generated by the server which is what I expected. And I'm familiar with the protocol spec and understand how it is "supposed to work" in theory. From what I'm gathering here the problem is not server related but browser (user agent) related. So, if I'm not receiving the expected if-modified-since request header then I need to figure out why my user agent isn't sending it. I've tested on a variety of platforms and browser versions and haven't been able to get it to work consistently. I haven't used the e-tag and wonder if doing so might make a difference.

Additionally, I have a few include pages as part of the main page and wonder if these might be somehow affecting the interpretation of the if-modified-since request header.

***ADDED Sat, 01 Mar 2008 17:00:44 GMT***
Is it possible that the check header tool I'm using isn't reporting what is actually happening? I've tried a number of different tools freely available on the Internet that report a Response Code: HTTP/1.1 200, but the caching behavior of the user agent seems to indicate that the Response Code is being sent correctly as a 304. Maybe it IS working the way it should be but I'm being led to believe it's not because I'm putting too much trust into these tools.

jepler

6:35 pm on Mar 1, 2008 (gmt 0)

10+ Year Member



Maybe if I explain in pseudocode what I'm attempting to do it will be more obvious as to what I'm doing wrong. This is a details page for a photography gallery that I'm serving from a MySQL database. I'm attempting to use the last modified stamp of the photo's properties as the variable which will be used as the last-modified page header field. So far so good.

I'm then checking to see if the photo's modified date is equal to a variable populated from the $HTTP_IF_MODIFIED_SINCE header. If it is, the server should respond with the 304 Not Modified header. If not, it should respond with the 200 OK header.

I am printing these variables out on my page to see what is happening. The last-modified variable taken from my photo's last modified timestamp is working just fine.

However, my if-modified-since variable is returning as an empty value UNLESS I remove the if/then condition evaluating whether the photo's modified date is equal to the $HTTP_IF_MODIFIED_SINCE response. In these instances the if-modified-since variable IS NOT empty and is in fact EQUAL to my photo's last modified date like I expect it to be. So, I'm perplexed as to why, if reinstated, my if/then condition construct is not catching this equality and serving the 304 like it should.

If I reinstate the if/then condition which is evaluating if $HTTP_IF_MODIFIED_SINCE is equal to my photo's last modified date, the $HTTP_IF_MODIFIED_SINCE variable is again returning as empty. The strange part is that the page is acting as if it is being given the 304 header response because it is definitely being served from the user agent's cache and not being downloaded again from the server. In this case I assume it is working but if I use one of the check header tools on the Internet it reports that the 200 OK header is being sent, not the 304 header.

I'm sorry to be so verbose but I'm stumped and I need someone to tell me where my logic has failed.

Thank you.

jdMorgan

8:35 pm on Mar 2, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Since this is really a PHP coding problem rather than strictly Apache-specific, I won't comment on the code except to point out two general points that may need clarification:

1) The If-Modified-Since header is sent by the client (browser or robot), and may in fact be empty.

2) You should not be comparing the file timestamp and the If-Modified-Since header for equality, but rather checking to see if the timestamp on the file is later than the date in the IMS header. There is likely to be a PHP library available to implement the needed date-comparison function(s), similar to Perl's "Date::Calc" library.
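The two points above could be sketched as follows, assuming PHP 7.1 or later and using the built-in `strtotime()` to parse the client's date rather than a separate library. `notModifiedSince` is a hypothetical helper name, not code from this thread.

```php
<?php
// Hypothetical helper: true when the resource has NOT changed since the
// date the client sent, i.e. when a 304 is appropriate.
function notModifiedSince(?string $ifModifiedSince, int $lastModified): bool
{
    if ($ifModifiedSince === null || $ifModifiedSince === '') {
        return false; // header absent or empty: send the full page
    }
    $since = strtotime($ifModifiedSince);
    if ($since === false) {
        return false; // unparseable date: treat it as absent
    }
    // The page counts as modified only if its timestamp is later than
    // the client's date, so compare timestamps instead of strings.
    return $lastModified <= $since;
}

// Possible use inside a page script (left commented out in this sketch):
// if (notModifiedSince($_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? null, $photoTimestamp)) {
//     header('HTTP/1.1 304 Not Modified');
//     exit;
// }
```

Comparing parsed timestamps also avoids false mismatches when the client sends the same moment in a slightly different date format.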

Jim

jepler

4:15 am on Mar 3, 2008 (gmt 0)

10+ Year Member



Understanding this is an Apache forum, I'm not really soliciting help on the PHP coding of the problem. Rather, I'm asking for help regarding the logic I'm applying to If-Modified-Since and how it is handled between the user agent and the server. I appreciate your point that I should not be looking at a comparison of equality, which is what I've seen done in a variety of examples I've pulled from a handful of sources, but instead at whether the timestamp is later than the header date. This gives me a different perspective, which could prove to be the eventual solution I'm seeking. Thank you.