Forum Moderators: Robert Charlton & goodroi
I wonder how this header works. Does it check the Last-Modified dates of the physical files? If so, Google will never know when I update my website, because I insert updated content into a database and have PHP pages retrieve that content and display it to users.
The bandwidth overhead is not a big deal, but I've heard that Google ranks sites according to how often they are updated. Is there any way to get around this?
Status: HTTP/1.1 200 OK
Date: Mon, 07 Nov 2005 19:56:57 GMT
Server: Apache
Set-Cookie: csuv=visitor; expires=Tue, 08-Nov-2005 19:56:57 GMT
Set-Cookie: PHPSESSID=3cf4b9914*******0ea2531257dcbd; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
I have noticed that Google only adds pages to the index when it takes the full file. For some reason I lost most of my pages in Google's index, and would like to speed up the process.
My pages were very well linked (within the site), the pages have no dupe problems at all, and I have Google Sitemaps.
OK, the only straw I can clutch at for you is that historically some servers (e.g. Tomcat 4) have had problems with corrupting (well, sort-of double-compressing) gzipped output (which is why I don't use it) due to an ambiguity in the (servlet) specs.
Have you tried turning off the compression to see if the problem goes away?
Rgds
Damon
PS. Is that enough braket(t)ing for you? B^>
on edit: yes, deflate is the gzip one. Disabled it and tested, and we'll see if gbot likes me better now.
thanks ;)
not (even) one (or more) () in this post :)
I don't fully understand how the header works yet, but my guess is that the server checks the "last modified date" of the requested file and returns a 304 if there has been no update since that date.
If that is the case, the server may return a 304 without even invoking the PHP script.
Just a guess; anyone who knows exactly how it works, please clarify.
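That guess is basically right for static files: the server compares the date the client sends against the file's modification time. A rough illustration of the exchange (the URL and dates here are made up for the example):

```
GET /page.html HTTP/1.1
Host: www.example.com
If-Modified-Since: Mon, 07 Nov 2005 19:56:57 GMT

HTTP/1.1 304 Not Modified
Date: Tue, 08 Nov 2005 10:00:00 GMT
Server: Apache
```

For a static file, Apache answers the 304 itself with no body. For a PHP page there is no file date to compare, which is the problem discussed below.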
Every time it got the full file, the page appeared in the index. The pages are unique, entered manually, and they are not copied elsewhere. Plus I have a "Related Products" thingie that pulls 2-5 different products, about 10 sentences in total, from the same category, just in case someone scrapes my content.
I had to try this, and see.
I meant it differently: I use the raw logs for analysis, and search for "Googlebot," so I can see when Google comes and what it pulls, including file size. For example, GB would come, pull page.html gzipped, and the page would not appear in the index. Days later, I would see GB come back, pull the same file (full size), and then that page would appear.
I'm not sure if GB adds the pages it gets with gzip to the index; that's what I was trying to say.
User-Agents and Servers use If-None-Match and If-Modified-Since in many transactions to determine if a page should be reloaded, or should be served from the cache of the page the requesting agent has.
For static pages, which are cached, the requesting client will send either one or both of these headers. The If-Modified-Since will contain the last modified date, and the If-None-Match will contain the ETag set at the time of last access by the agent. (An ETag is a unique value set to a page, generated at the time the page is last modified (or uploaded) -- usually looks like a hash.)
If either or both of these headers match the server's values for the requested page, the page is considered 'fresh', a 304 is served, and the contents can be loaded from the requesting client's cache -- the page body is never actually sent from the server when a 304 is returned. If they do not match, the page is considered 'stale' and should be reloaded from the server.
A PHP or other dynamic page will not get these headers from the server, because the server is creating the HTML (etc.) page 'on-the-fly', so it will not 'know' if the contents of the page have been modified. In order to serve a 304 response from a dynamic page, the headers must be retrieved from the requesting client and compared within the script manually.
If the headers retrieved are the same as those sent at the last access, a 304 header should be served and the script should 'exit' without serving any other information. If the headers retrieved do not match, a new set of Last-Modified and ETag headers should be created and sent, then the page should continue to load as usual.
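The manual comparison described above can be sketched in PHP roughly like this. This is not anyone's actual production script; the timestamp and the 'page-id-123' salt for the ETag are made-up examples, and on a real site you would derive both from the database row's last-update time:

```php
<?php
// Decide whether the page is still 'fresh' for this client.
// Per HTTP/1.1, a matching ETag (If-None-Match) takes precedence;
// otherwise fall back to the If-Modified-Since date comparison.
function isFresh($ifNoneMatch, $ifModifiedSince, $etag, $lastModified) {
    if ($ifNoneMatch !== null) {
        return $ifNoneMatch === $etag;
    }
    return $ifModifiedSince !== null && $ifModifiedSince >= $lastModified;
}

// Example validators -- in practice, pull these from your database.
$lastModified = strtotime('Mon, 07 Nov 2005 19:56:57 GMT');
$etag = '"' . md5($lastModified . 'page-id-123') . '"';

// Read the validators the client sent, if any.
$ims = isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
    ? strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) : null;
$inm = isset($_SERVER['HTTP_IF_NONE_MATCH'])
    ? $_SERVER['HTTP_IF_NONE_MATCH'] : null;

if (isFresh($inm, $ims, $etag, $lastModified)) {
    header('HTTP/1.1 304 Not Modified');
    exit; // stop before generating any body
}

// Stale (or first visit): send fresh validators, then build the page.
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
header('ETag: ' . $etag);
// ...query the database and echo the page as usual...
```

The `exit` after the 304 is the important part: it is what actually saves the bandwidth, since nothing after it runs.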
Supporting these headers will significantly cut unnecessary bandwidth on dynamic sites where the results of a page are always the same for a given set of criteria.
The use of these headers may or may not make a difference in the indexing of a site... G states they will index a lower portion of pages from dynamic sites to keep from overloading a server, but some sites I know of are indexed very well, despite being 90% dynamic. (Personally, I use them, because the lower the number of requests to my server, the faster my sites are...)
Hope this helps.
Justin
w3c Header Information [w3.org] see: 14.25 and 14.26
I use PHP to gzip, as mod_gzip caused problems with image caching by serving a variable header with the image.
Getting the headers right with PHP, including the If-Modified-Since request header and the 304 Not Modified response, is tough. I moved to a script, called as a prepend to our HTML, that handles all the headers correctly and handles gzipping transparently for me and the user.
Search Google for:
HTTP conditional requests in PHP
it was written in France and is a slick solution.
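I can't vouch for that particular script, but the simplest way to gzip from PHP itself (rather than mod_gzip) is the built-in output-buffer handler, which checks the client's Accept-Encoding header on its own and only compresses when the client says it can cope. A minimal sketch:

```php
<?php
// ob_gzhandler inspects Accept-Encoding itself: clients that don't
// advertise gzip/deflate support get the page uncompressed.
ob_start('ob_gzhandler');

echo '<html><body>...page built from the database...</body></html>';

ob_end_flush(); // send the (possibly compressed) buffer
```

Dropping `ob_start('ob_gzhandler');` into an auto_prepend_file gets you site-wide compression without touching each page, though it does not handle the 304 logic; that still needs the header comparison discussed above.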
walkman:
the problem is that Googlebot ... pulls the gzipped version of the file 200+ pages a day, but only takes the full file for about a dozen pages. I have noticed that Google only adds them to the index when it takes the full file.
Thus, the behaviour you describe seems to be the normal activity of the 2 different G-Bots. I doubt that compressed/not compressed will have anything to do with it, but would be interested to have your feedback on this.