Forum Moderators: phranque
The first saving grace is that the WA (Google Web Accelerator) will not prefetch GET URLs that contain a query string.
Secondly, GoogleGuy has gone on to say in the thread:
We do add an "X-moz: prefetch" header to all prefetch requests. That way, webmasters can choose to just ignore prefetch requests if they so choose.
Fine, but how do you "ignore" a request in such a way that the accelerating proxy server does not treat your attempt at ignoring it as the originally intended response, and hence return that to the user in response to their subsequent genuine click?
Claus said:
403 Access Denied
...but that surely is making an assumption about the design of the accelerating proxy server; in that it won't cache a 403 response and instead re-request the page in response to a genuine click.
jd01 said:
So, since G is sending the 'prefetch' header only with requests it is prefetching and *not* with links that are being clicked by the user, you can effectively block the prefetch request at the server level
...I don't see how this works. The whole purpose of the accelerating proxy server is that links that are being clicked by the user never make it to my server - the user gets whatever was returned when the prefetch request came in.
Apologies to Claus and jd01 if I'm being really stupid here, but I still don't understand how you "ignore" a prefetch request, as GoogleGuy puts it!
The only 'ignore' option I can think of comes later, in log-file processing: filter out the "X-moz: prefetch" flagged records (grep -v) in a step prior to generating the hit statistics - assuming you flagged them in the first instance.
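The filtering step can be sketched as follows. This is a minimal sketch assuming you have extended your Apache LogFormat with "%{X-moz}i" so that each access-log record ends with the header value ("prefetch" for prefetch requests, "-" otherwise) - the log lines below are made-up examples:

```python
# Sketch: strip prefetch-flagged records before computing hit stats.
# Assumes the log format was extended to record the X-moz header value
# as the last quoted field of each line.
def real_hits(log_lines):
    """Keep only records that were not flagged as prefetch requests."""
    return [line for line in log_lines
            if not line.rstrip().endswith('"prefetch"')]

log = [
    '1.2.3.4 - - [08/May/2005:11:55:45 -0500] "GET /page.html HTTP/1.1" 200 2850 "-" "Mozilla/5.0" "prefetch"',
    '1.2.3.4 - - [08/May/2005:11:56:02 -0500] "GET /page.html HTTP/1.1" 200 2850 "-" "Mozilla/5.0" "-"',
]
print(len(real_hits(log)))  # only the genuine hit survives
```

The equivalent command-line step would be a `grep -v` over the raw log before it is fed to the statistics generator.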
Regards,
R.
Logically, you should be sending back a 501 ("Not Implemented") because you don't support the request.
I thought about that, having gone away to read RFC 2616 - but it comes full circle back to my original argument! The WA is a proxy, so strictly speaking it will just proxy your 501 back to the client when they really do click the link!
This is an example of a server log file:
00.000.00.00 - - [03/May/2005:22:28:25 -0500] "GET /page.html HTTP/1.1" 200 2850 "http://www.other-site.com" "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1"
This is an example of the actual exchange that took place to generate the above log:
Initial Request from G:
Host: www.anysite.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3)
Gecko/20041001 Firefox/0.10.1
Accept: text/html,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://www.other-site.com/
Your Server Response:
HTTP/1.1 200 OK
Date: Sun, 8 May 2005 11:55:45 GMT
Server: Apache/1.3.31
Cache-Control: max-age=90000
Expires: Mon, 9 May 2005 11:55:45 GMT
Last-Modified: Sat, 19 Jun 2004 15:25:10 GMT
ETag: "7b80d9-891-40d52ad7"
Accept-Ranges: bytes
Content-Length: 2850
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: text/html
What G is saying is that when they send the initial request (top), they append the "X-moz: prefetch" header to prefetch requests only (GG: '...We do add an "X-moz: prefetch" header to all prefetch requests...'). By adding this header to prefetch requests rather than to all requests, they allow the prefetch to be blocked on a site-by-site or directory-by-directory basis, while still allowing full navigation of the site.
By doing this, they are adding a custom header, and their request will now look something like this:
Appended G Request:
Host: www.anysite.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3)
Gecko/20041001 Firefox/0.10.1
Accept: text/html,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://www.other-site.com/
X-moz: prefetch
For a 'regular' click on a link (https, a URL with a query string, or any case where no prefetch has occurred) their headers will still match the original example:
Host: www.anysite.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3)
Gecko/20041001 Firefox/0.10.1
Accept: text/html,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://www.other-site.com/
Without the 'prefetch' string appended.
Therefore, even though it will not appear in the logs (because logs are a summary of the connection, not the full header exchange), the prefetch can be blocked with the following lines:
RewriteEngine On
# Set HAS_X_MOZ=prefetch when the request carries "X-moz: prefetch"
# (without "=prefetch", SetEnvIf would set the variable to "1" and
# the RewriteCond below would never match)
SetEnvIf X-moz prefetch HAS_X_MOZ=prefetch
RewriteCond %{ENV:HAS_X_MOZ} prefetch
# RewriteRule needs a substitution argument; "-" means "no change"
RewriteRule .* - [F,L]
This ruleset checks the full, original request headers for the 'X-moz: prefetch' string and, if it is present, access is forbidden. The user experience is unchanged (other than no prefetch), because G prefetches pages from the page preceding the entry; if the WA does not have the page in its cache, it then requests the page with a 'regular' (non-appended) header, which causes the RewriteCond to fail, and the content is served as usual.
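The decision the ruleset makes can be modelled as a simple predicate on the request headers (a toy sketch, not actual Apache internals):

```python
# Toy model of the ruleset above: forbid the request only when the
# custom "X-moz: prefetch" header is present.
def respond(headers):
    if headers.get("X-moz") == "prefetch":
        return 403  # prefetch request: Forbidden
    return 200      # genuine click (no X-moz header): serve as usual

print(respond({"X-moz": "prefetch"}))  # prefetch attempt is blocked
print(respond({}))                     # real click is served
```

The open question discussed above is, of course, whether the proxy caches that 403 or re-requests the page on the genuine click.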
I hope this helps. I have tested this, and it is effective. If there are questions about the function or process I will answer to the best of my ability.
Justin
When I tested it I noticed no difference at all... NONE... with either the WA turned on or off, or by blocking/not blocking the prefetching.
My sites have fairly small pages (in kb), so results may vary, but I did not see a noticeable difference in any function - except that with the WA turned on and the prefetch allowed, it told me I saved time. With the WA off, or while blocking the prefetch, I saw no difference.
To add a little clarity for those who think that blocking the prefetch will 'hurt' your site or rankings: the connection still runs through the G proxy, so any information passed is still passed; it's just not prefetching pages. (IOW, any click/collecting data G is accumulating will still be accumulated, because the connection still runs through their server - just like a regular proxy.)
Justin
Thanks for your time taken to help out with this; and also the fact that you have tested it and found it to work OK under the current WA functionality.
However, I would still like to understand why you are so confident that the initial 403 (that you have generated by detecting X-moz: prefetch) will not be the response that is returned to the user on their subsequent genuine click.
Your answer keeps talking about how the genuine click won't contain the prefetch header - I understand that - but the whole point of an accelerating proxy server is that the genuine click will never make it to your server. My worry, therefore, is that testing for it is dangerous because there is no way to tell the WA that the response is only 403 because of the prefetch header.
Is it stated in the HTTP RFCs somewhere that error responses should not be cached and that a fresh request should be made every time? If so, then I'm happy with your answer - but I have not been able to find any reference to this.
Or are you simply assuming that the WA will not cache the 403?
Actually, I am very confident it works, because I actually installed the solution I posted on two separate sites, and tested them with the WA. I had no issues at all when retrieving the pages, either static or dynamic served as static, when I followed links. (This was also verified by someone, almost immediately after my last post in the WA thread.)
The second solution posted in this thread might need to contain 'private' in the Cache-Control line, like this (I have not personally tested this one):
Header unset Cache-Control
Header append Cache-Control "private, no-cache, must-revalidate"
The documentation is more obscure for the first solution, because it is a custom solution to a custom header. The documentation at the W3C (and possibly Apache) should confirm that an HTTP-compliant proxy should follow the directives in the second solution and not serve a cached version of the page.
I will be posting a solution for the X-Forwarded-For header sent in the near future... This one may prove to be much more critical to geo-targeting and logging than the X-moz solution is.
Justin
A response received with a status code of 200, 203, 206, 300, 301 or 410 MAY be stored by a cache and used in reply to a subsequent request, subject to the expiration mechanism, unless a cache-control directive prohibits caching. However, a cache that does not support the Range and Content-Range headers MUST NOT cache 206 (Partial Content) responses.

A response received with any other status code (e.g. status codes 302 and 307) MUST NOT be returned in a reply to a subsequent request unless there are cache-control directives or another header(s) that explicitly allow it. [...]
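The rule quoted above (RFC 2616 section 13.4) can be expressed as a small predicate. This is a simplified sketch - the directive handling is illustrative, not a full Cache-Control parser:

```python
# Status codes a cache MAY store by default, per RFC 2616 sec. 13.4.
CACHEABLE_BY_DEFAULT = {200, 203, 206, 300, 301, 410}

def may_cache(status, cache_control="", supports_ranges=True):
    """Very simplified: may a shared cache store this response?"""
    cc = [d.strip().lower() for d in cache_control.split(",") if d.strip()]
    if any(d in ("no-store", "no-cache", "private") for d in cc):
        return False
    if status == 206 and not supports_ranges:
        return False
    if status in CACHEABLE_BY_DEFAULT:
        return True
    # Any other status (403, 501, 302, 307, ...) needs explicit permission.
    return any(d == "public" or d.startswith(("max-age=", "s-maxage="))
               for d in cc)

print(may_cache(200))  # cacheable by default
print(may_cache(403))  # not cacheable without explicit permission
```

So under the quoted rule, a 403 generated in response to a prefetch request carries no explicit permission to cache, which is why a compliant proxy should make a fresh request when the user genuinely clicks the link.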
Jim