Forum Moderators: Robert Charlton & goodroi
I wonder how this header works. Does it check the Last-Modified dates of the physical files? If so, Google will never know when I update my website, because I insert updated content into a database and have PHP pages retrieve that content and display it to users.
The bandwidth overhead is not a big deal, but I've heard that Google ranks sites according to how often they are updated. Is there any way to get around this?
Status: HTTP/1.1 200 OK
Date: Mon, 07 Nov 2005 19:56:57 GMT
Server: Apache
Set-Cookie: csuv=visitor; expires=Tue, 08-Nov-2005 19:56:57 GMT
Set-Cookie: PHPSESSID=3cf4b9914*******0ea2531257dcbd; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
I have noticed that Google only adds pages to the index when it takes the full file. For some reason I lost most of my pages in Google's index, and would like to speed up the process.
My pages were very well linked (within the site), the pages have no dupe problems at all, and I have Google Sitemaps.
OK, the only straw I can clutch at for you is that historically some servers (e.g. Tomcat 4) have had problems with corrupting (well, sort-of double-compressing) gzipped output (which is why I don't use it) due to an ambiguity in the (servlet) specs.
Have you tried turning off the compression to see if the problem goes away?
Rgds
Damon
PS. Is that enough braket(t)ing for you? B^>
on edit: yes, deflate is the gzip one. Disabled it and tested, and we'll see if gbot likes me better now.
thanks ;)
not (even) one (or more) () in this post :)
I don't fully understand how the header works yet, but my guess is that the server checks the "last modified date" of the requested file and returns a 304 if there has been no update since that date.
If that is the case, the server may return a 304 without even invoking the PHP script.
Just a guess; anyone who knows exactly how it works, please clarify.
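That guess is basically right for static files: the server compares the date the client sends against the file's modification time. A rough illustration of the exchange (the URL and dates here are made up for the example):

```
GET /page.html HTTP/1.1
Host: www.example.com
If-Modified-Since: Mon, 07 Nov 2005 19:56:57 GMT

HTTP/1.1 304 Not Modified
Date: Tue, 08 Nov 2005 10:00:00 GMT
Server: Apache
```

For a static file, Apache answers the 304 itself with no body. For a PHP page there is no file date to compare, which is the problem discussed below.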
Every time it got the full file, the page appeared in the index. The pages are unique, entered manually, and they are not copied elsewhere. Plus I have a "Related Products" thingie that pulls 2-5 different products, about 10 sentences in total, from the same category, just in case someone scrapes my content.
I had to try this, and see.
I meant it differently: I use the raw logs for analysis, and search for "Googlebot," so I can see when Google comes and what it pulls, including file size. For example, GB would come, pull page.html gzipped, and the page would not appear in the index. Days later, I would see GB come back, pull the same file (full size), and then that page would appear.
I'm not sure if GB adds the pages it gets with gzip to the index; that's what I was trying to say.
User-Agents and Servers use If-None-Match and If-Modified-Since in many transactions to determine if a page should be reloaded, or should be served from the cache of the page the requesting agent has.
For static pages, which are cached, the requesting client will send either one or both of these headers. The If-Modified-Since will contain the last modified date, and the If-None-Match will contain the ETag set at the time of last access by the agent. (An ETag is a unique value set to a page, generated at the time the page is last modified (or uploaded) -- usually looks like a hash.)
If either or both of these headers match the server's values for the requested page, the page is considered 'fresh', a 304 is served, and the contents can be loaded from the requesting client's cache -- the page body is never actually sent from the server when a 304 is returned. If they do not match, the page is considered 'stale' and should be reloaded from the server.
A PHP or other dynamic page will not get these headers from the server, because the server is creating the HTML (etc.) page 'on-the-fly', so it will not 'know' if the contents of the page have been modified. In order to serve a 304 response from a dynamic page, the headers must be retrieved from the requesting client and compared within the script manually.
If the headers retrieved are the same as those sent at the last access, a 304 header should be served and the script should 'exit' without serving any other information. If the headers retrieved do not match, a new set of Last-Modified and ETag headers should be created and sent, then the page should continue to load as usual.
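The manual comparison described above can be sketched in PHP roughly like this. This is not anyone's actual production script; the timestamp and the 'page-id-123' salt for the ETag are made-up examples, and on a real site you would derive both from the database row's last-update time:

```php
<?php
// Decide whether the page is still 'fresh' for this client.
// Per HTTP/1.1, a matching ETag (If-None-Match) takes precedence;
// otherwise fall back to the If-Modified-Since date comparison.
function isFresh($ifNoneMatch, $ifModifiedSince, $etag, $lastModified) {
    if ($ifNoneMatch !== null) {
        return $ifNoneMatch === $etag;
    }
    return $ifModifiedSince !== null && $ifModifiedSince >= $lastModified;
}

// Example validators -- in practice, pull these from your database.
$lastModified = strtotime('Mon, 07 Nov 2005 19:56:57 GMT');
$etag = '"' . md5($lastModified . 'page-id-123') . '"';

// Read the validators the client sent, if any.
$ims = isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
    ? strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) : null;
$inm = isset($_SERVER['HTTP_IF_NONE_MATCH'])
    ? $_SERVER['HTTP_IF_NONE_MATCH'] : null;

if (isFresh($inm, $ims, $etag, $lastModified)) {
    header('HTTP/1.1 304 Not Modified');
    exit; // stop before generating any body
}

// Stale (or first visit): send fresh validators, then build the page.
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
header('ETag: ' . $etag);
// ...query the database and echo the page as usual...
```

The `exit` after the 304 is the important part: it is what actually saves the bandwidth, since nothing after it runs.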
Supporting these headers will significantly cut unnecessary bandwidth on dynamic sites where the results of a page are always the same for a given set of criteria.
The use of these headers may or may not make a difference in the indexing of a site... G states they will index a lower portion of pages from dynamic sites to keep from overloading a server, but some sites I know of are indexed very well, despite being 90% dynamic. (Personally, I use them, because the lower the number of requests to my server, the faster my sites are...)
Hope this helps.
Justin
w3c Header Information [w3.org] see: 14.25 and 14.26
I use PHP to gzip, as mod_gzip caused problems with image caching by serving a variable header with the image.
Getting the headers right with PHP, including the If-Modified-Since request header and the 304 Not Modified response, is tough. I moved to a script, called as a prepend to our HTML, that handles all the headers correctly and handles gzipping transparently for me and the user.
Search Google for:
HTTP conditional requests in PHP
it was written in France and is a slick solution.
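I can't vouch for that particular script, but the simplest way to gzip from PHP itself (rather than mod_gzip) is the built-in output-buffer handler, which checks the client's Accept-Encoding header on its own and only compresses when the client says it can cope. A minimal sketch:

```php
<?php
// ob_gzhandler inspects Accept-Encoding itself: clients that don't
// advertise gzip/deflate support get the page uncompressed.
ob_start('ob_gzhandler');

echo '<html><body>...page built from the database...</body></html>';

ob_end_flush(); // send the (possibly compressed) buffer
```

Dropping `ob_start('ob_gzhandler');` into an auto_prepend_file gets you site-wide compression without touching each page, though it does not handle the 304 logic; that still needs the header comparison discussed above.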
walkman:
the problem is that Googlebot ... pulls the gzipped version of the file 200+ pages a day, but only takes the full file for about a dozen pages. I have noticed that Google only adds them to the index when it takes the full file.
Thus, the behaviour you describe seems to be the normal activity of the 2 different G-Bots. I doubt that compressed/not compressed will have anything to do with it, but would be interested to have your feedback on this.