When I change a page on my site and upload it to the web server, I often find that I get an old version of the page if I use www in the URL.
For example
[mysite.com...] leads to my recently updated home page
Whereas
[mysite.com...] leads to the previous version of the home page
This problem sometimes persists for hours.
I have cleared my internet cache in IE.
I was wondering if my ISP is caching my site and sending me pages from the cache.
This has puzzled me for a while so I'm hoping someone can explain the cause. I frequently waste time double checking if pages have uploaded.
Thanks
There are two things you may not be doing that you should consider doing. The first is to take control of your site's caching behaviour, and the second is to take control of your domains and subdomains. These two issues cause an untold number of problems for Webmasters who leave them to fate -- from user confusion to search engine ranking problems due to duplicate or stale content.
Use Apache mod_expires [httpd.apache.org] and mod_headers [httpd.apache.org] to create appropriate cache-control headers for your pages, etc. Then, pick either example.com or www.example.com as your 'preferred domain name' and redirect the alternate domain name to that one with a 301 external redirect. From that time on, link to your site only with the preferred domain name, and try to get all links to the alternate domain name changed to point to the preferred domain name.
This comes up often, but I'll repeat it: www.example.com is a subdomain of example.com. It is not the same thing, and these two URLs can and often do lead to separate sites. As a "courtesy" to webmasters who are not technically oriented, search engines and other online entities often will check to see if they are the same, and 'merge' them if so. But they are not the same URL, and no-one should expect (as in "demand") that they be treated this way.
I'll give a couple of examples on how to cure these problems, but you'll have to look up the cited Apache modules and adapt these examples to suit your needs.
First, get everyone pointed to a single domain for your site... I'll assume you like to use http://example.com/ to reach your home page, and that you can use mod_rewrite [httpd.apache.org].
In mod_rewrite, for use in your top-level .htaccess file:
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]
RewriteRule (.*) http://example.com/$1 [R=301,L]
RewriteCond %{HTTP_HOST} !^example\.com
RewriteRule (.*) http://example.com/$1 [R=301,L]
Now, to set up cache-control:
ExpiresActive On
# Default - Set http header to expire everything 1 week (604800 seconds) from last access, set must-revalidate
ExpiresDefault A604800
Header append Cache-Control: "must-revalidate"
#
# Apply a customized Cache-Control header to frequently-updated files
<FilesMatch "^(hlplocat¦test[23]?¦404¦410¦403i?)\.html$">
ExpiresDefault A1
Header unset Cache-Control:
Header append Cache-Control: "no-cache, must-revalidate"
</FilesMatch>
#
<FilesMatch "^robots\.txt$">
ExpiresDefault A14400
</FilesMatch>
#
<FilesMatch "^index\.htm">
ExpiresDefault A7200
</FilesMatch>
#
<FilesMatch "^(calendar¦event_sched)\.html$">
ExpiresDefault A14400
</FilesMatch>
Note that posting on this board changes solid vertical pipes into broken "¦" pipes -- you'll need to edit those characters manually to change them back to solid pipes before using any examples shown here.
Refs:
Web caching tutorial [mnot.net]
Cacheability checker [ircache.net]
Apache URL Rewriting Guide [httpd.apache.org]
Server Headers checker [webmasterworld.com]
Regular expressions tutorial [etext.lib.virginia.edu]
Jim
I think I get the gist of how the redirect works but would appreciate help understanding some of the syntax. (I've read Jim's links above but couldn't find explanations for all the syntax. I might have overlooked it as it was complicated!)
I've noted my understanding of the script in the comments below. It would be great if someone could explain the bits I've commented 'not sure' on.
Thanks a lot
-------------------
# I want to redirect [mysite.com...] to [mysite.com...]
# The option +FollowSymLinks and RewriteEngine On is
# required to enable the rewriting engine for
# per-directory configuration files (i.e. .htaccess)
Options +FollowSymLinks
RewriteEngine on
# The next 2 lines mean if the referring host begins with example.com rewrite it so it begins [example.com...]
RewriteCond %{HTTP_HOST} ^example.com
RewriteRule (.*) [example.com...] [R=301,L]
# Not sure what % symbol does
# ^ means the string starts here
# $ means The string ends here
# Not sure what the 1 means as in $1
# Not sure what [R=301,L] does
> # Not sure what % symbol does
%{something} resolves to the value of the server variable named "something". %1 is a back-reference -- see below.
> # Not sure what the 1 means as in $1
$1 is a back-reference and refers to the characters matching the first parenthesized subpattern in the RewriteRule pattern. You can also use %1 to refer back to the first parenthesized subpattern in the immediately-preceding RewriteCond pattern. Either of these back-references can be numbered from 1 to 9 inclusive (i.e. $1 through $9), so up to nine RewriteRules subpattern matches and nine RewriteCond subpattern matches can be back-referenced - that is, copied from the RewriteRule and RewriteCond subpattern matches into the substitution string.
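As a rough illustration of both kinds of back-reference, here is a sketch for a top-level .htaccess file that redirects any www-prefixed hostname to the same hostname without the www. It assumes that every hostname pointed at this server should lose its www prefix; it is not the code posted earlier in this thread, just an example of %1 and $1 working together.

```apache
Options +FollowSymLinks
RewriteEngine on
# The parentheses in the RewriteCond pattern capture the hostname
# without its leading "www." and make it available as %1
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
# The parentheses in the RewriteRule pattern capture the requested
# path and make it available as $1
RewriteRule ^(.*)$ http://%1/$1 [R=301,L]
```

With this in place, a request for www.example.com/widgets/blue.html would give %1 the value example.com and $1 the value widgets/blue.html, so the client is sent to http://example.com/widgets/blue.html.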
> # Not sure what [R=301,L] does
R=301 specifies that the server generate a 301-Moved Permanently response to the request, and give this new URL to the client. This tells the client to re-request the resource using the provided new URL. This closes the current HTTP transaction, and it is up to the client to start a new one with the URL you provide.
You may also specify R=302 to generate a 302 response ("Moved Temporarily" in HTTP/1.0, renamed "Found" in HTTP/1.1).
The difference between 301 and 302 is that a 301 tells the client that the change is permanent, to use the new URL from now on, and to update any records it may have to replace the old URL with the new URL. In theory, a browser could use this information to automatically update old "favorites" or "bookmarks," but this is not widely implemented. The only common clients that pay attention to this function are search engine robots.
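To make the 301/302 distinction concrete, here is a small sketch with one rule of each kind. The file names (old-page.html, new-page.html, sale.html, spring-sale.html) are made up for illustration and would need to be replaced with your own.

```apache
RewriteEngine on
# Permanent move: clients and robots should replace the old URL with the new one
RewriteRule ^old-page\.html$ http://example.com/new-page.html [R=301,L]
# Temporary move: clients should keep using the original URL for future requests
RewriteRule ^sale\.html$ http://example.com/spring-sale.html [R=302,L]
```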
A third "redirect" option is to use a "silent redirect" or server-internal rewrite, which is simply providing an alternate filepath for the requested URL. This takes place entirely within the server, and is not visible to the client in any way. This capability demonstrates that URLs and local server filepaths are two different ways of referring to a resource, and that the two need not be similar in any way. You can set up your server so that all requests for the URL www.example.com/trolleys/small/crimson.html are served from c:/Apache/users/yourname/site1/wheeled-vehicles/carts/tiny/red_ones.php, and the user will never know it.
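A minimal sketch of such an internal rewrite for the trolley example might look like the following. It assumes the rule lives in the top-level .htaccess file and that the site's document root is c:/Apache/users/yourname/site1; adjust the paths if your layout differs.

```apache
RewriteEngine on
# No R flag and no http:// in the substitution, so no redirect is sent
# to the client; the server quietly serves the other file instead
RewriteRule ^trolleys/small/crimson\.html$ /wheeled-vehicles/carts/tiny/red_ones.php [L]
```

The address shown in the visitor's browser stays www.example.com/trolleys/small/crimson.html throughout.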
In addition to these three common functions, mod_rewrite can also generate 403-Forbidden and 410-Gone error responses. There are many more esoteric capabilities in there as well.
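For example, the F and G flags produce those two responses. This is only a sketch; the private/ directory and discontinued.html are hypothetical names used for illustration.

```apache
RewriteEngine on
# "-" means "no substitution"; the F flag returns 403 Forbidden
RewriteRule ^private/ - [F]
# The G flag returns 410 Gone
RewriteRule ^discontinued\.html$ - [G]
```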
The [L] flag specifies that if the current rule matches and the rewrite is invoked, the rewrite engine should stop processing, and not attempt to apply any further RewriteRules. In most cases, this is what you want, unless you need to change a URL step-by-step using multiple RewriteRules due to some complex URL-rewriting requirement. Use [L] unless you have a reason not to.
All of this is concisely documented in the Apache mod_rewrite documentation [httpd.apache.org] and in the accompanying URL Rewriting Guide.
Jim
Welcome to WebmasterWorld [webmasterworld.com]!
> Can I include this code in my httpd.conf file on Apache?
Yes, but you'll likely need to change it to suit your needs.
> So every site that I have will be cache-controlled.
You should carefully arrange to control your files by type, by file extension, or by subdirectory. Some files need to be cached long-term, some need moderate-term caching, and some, you may not want cached at all. If you cache for too short a term or require revalidation on every file, then your server load goes way up, and your site will seem much slower to users. If you cache for too long a time, then your frequent visitors may see stale pages/scripts/images for a long time after you update these.
I'd recommend a carefully-planned site-by-site file-by-file approach to get the maximum benefit from caching. If you want the benefit, it'll take some work. Otherwise, you may be better off using default cache-control.
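As one possible starting point for controlling files by type, mod_expires also offers the ExpiresByType directive. The MIME types and lifetimes below are illustrative assumptions only, not recommendations for any particular site.

```apache
ExpiresActive On
# Fallback for anything not listed below: 1 hour
ExpiresDefault A3600
# Images rarely change: 30 days
ExpiresByType image/gif A2592000
ExpiresByType image/jpeg A2592000
# Stylesheets change occasionally: 1 week
ExpiresByType text/css A604800
# HTML pages change more often: 2 hours
ExpiresByType text/html A7200
```

Per-file exceptions can then be layered on top with FilesMatch sections like the ones shown earlier in this thread.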
> Or somehow when I create a site that .htaccess file will be automatically created in the home directory?
No. .htaccess is a file you (or a script you write) must create.
Check out the links I cited above for more info.
Jim
Thanks for the quick reply.
I have the smartsearch script installed on one of my sites. I want to make its urls search engine friendly.
Here is what I did:
---------------------------------
AddHandler server-parsed .html
#
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{HTTP_HOST} !^www.(.*)$
RewriteRule ^(.*)$ [%{HTTP_HOST}...] [R=301,L]
#
RewriteEngine on
RewriteBase /cgi-bin/search/
RewriteRule ^cats([^.]+).html$ smartsearch.cgi?keywords=$1 [T=application/x-httpd-cgi]
------------------------------------------------
And that rewrites /cats/web_hosting.html to smartsearch.cgi?keywords=$1
Now on the bottom of the search results I have
<<Next>> which is smartsearch.cgi?keywords=web_hosting&s=50&bt=&crawlsite=&c=0&db=&e=8587370&f=0&username=&c3=&c5=&c99=
where:
&s=50 [# of links]
&e=8587370 [# of links found]
How can I make this URL search engine friendly, like
/cats/web_hosting/50/8587370 or /cats/web_hosting_50_8587370.html
Thanks
Jim
SITUATION
- My home page includes an image that is randomly selected from a database
- I want this image to be refreshed each time a visitor returns to the home page
- So I don't want my home page to be cached
- I have set up basic cache control (pasted below) in .htaccess
PROBLEM
- When accessing my site via my normal ISP, I am shown a cached version of the page as the random image stays the same
- When accessing my site via another ISP the random image is refreshed every time I return to the home page, as expected
QUESTION
- Why is my ISP caching my home page?
Here is the .htaccess file:
# If the file name ends in .html, scan it for PHP code
AddType application/x-httpd-php .php .php3 .phtml .html
#----------------------
# Redirect [mysite.com...] to [mysite.com...]
# The option +FollowSymLinks and RewriteEngine On is
# required to enable the rewriting engine for
# per-directory configuration files (i.e. .htaccess)
Options +FollowSymLinks
RewriteEngine on
# Next 2 lines mean if referring host begins example.com
# rewrite it so it begins [example.com...]
RewriteCond %{HTTP_HOST} ^mysite.com
RewriteRule (.*) [mysite.com...] [R=301,L]
# %{HTTP_HOST} means the value of server variable HTTP_HOST
# ^ means the string starts here
# $ means The string ends here
# $1 is a back reference to the characters matching those in (.*)
# [R=301,L] means generate a 301-Moved Permanently and give the new URL to the client
# [L] means if the rule matches and rewrite is invoked,
# the rewrite engine should stop and not attempt to apply any further RewriteRules
#----------------------
# Set cache control
ExpiresActive On
# Default - Set http header to expire everything 1 week from last access, set must-revalidate
ExpiresDefault A604800
Header append Cache-Control: "must-revalidate"
# Apply a customized Cache-Control header to frequently-updated files
# I only added this section in last 24 hours
<FilesMatch "^index\.html">
ExpiresDefault A1
Header unset Cache-Control:
Header append Cache-Control: "no-cache, must-revalidate"
</FilesMatch>
The most likely cause is that since you only added the special treatment for index.html in the past 24 hours, and you had previously made all files cacheable for a week, your ISP cached a copy, and may consider it valid for up to a week. Although this won't usually happen (most cached pages get replaced before they expire), you have told them it is OK to keep the page for that period of time.
The "must-revalidate" header should take care of this problem, but you must make sure that the last-modified date on your index.html page has been updated in order to notify the cache that it should update.
If you force a reload, it will *usually* cause any intervening caches to update.
Jim
Thanks for your explanation, I suspected that might be the case but despite fiddling around for hours trying to resolve this situation I am still stuck.
I'd greatly appreciate a bit of guidance:
1. How do I set the last-modified date? Is it taken from the date the file was last saved on the web server? Or do I need to hand code it in the header of the page or something else? I assume it must be the latter as I have updated and resaved the page but am still seeing a stale version of it
2. Why am I seeing stale versions of my pages ending with .php or .html? I have this in my htaccess file:
#Set cache control
ExpiresActive On
# Default - Set http header to expire everything 1 day from last access, set must-revalidate
ExpiresDefault A86400
Header append Cache-Control: "must-revalidate"
#Apply a customized Cache-Control header to frequently-updated files
<FilesMatch "^\.php">
ExpiresDefault A1
Header unset Cache-Control:
Header append Cache-Control: "no-cache, must-revalidate"
</FilesMatch>
<FilesMatch "^\.html">
ExpiresDefault A1
Header unset Cache-Control:
Header append Cache-Control: "no-cache, must-revalidate"
</FilesMatch>
I was wondering if my pattern matching is wrong: when I run the cacheability check on a .html page, it appears to indicate that the page will be cached, even though the .htaccess file says not to cache pages ending in .html
Expires 1 day from now (Sat, 15 May 2004 00:31:59 GMT)
Cache-Control max-age=86400, must-revalidate
Last-Modified -
ETag -
Content-Length - (actual size: 4485)
Server Apache/1.3.26 (Unix) PHP/4.2.3
Cheers
Nubbin
> How do I set the last-modified date? Is it taken from the date the file was last saved on the web server?
Yes, unless the file includes other files. In that case, you must use mod_include [httpd.apache.org]'s XBitHack full directive to tell the server to look only at the date that the main file was updated in order to determine Last-Modified. Otherwise, it won't know which of the files -- the main one, or the includes -- to take the date from, and as a result, it won't send a Last-Modified header.
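A minimal sketch of the XBitHack setup in .htaccess follows. It assumes mod_include is loaded and that your host allows Options to be overridden in .htaccess; check the mod_include documentation for the drawbacks before relying on it.

```apache
# Allow server-side includes in this directory
Options +Includes
# Parse files that have the user-execute bit set; for files that also
# have the group-execute bit set, send a Last-Modified header based on
# the main file's own modification time
XBitHack full
```

On a Unix server, each page to be treated this way would then need its user-execute and group-execute bits set, for example with chmod ug+x on the file.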
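> Why am I seeing stale versions of my pages ending with .php or .html? I have this in my htaccess file:
The pattern in
<FilesMatch "^\.php">
is anchored to the start of the filename, so it matches only files whose names begin with ".php". To match filenames that end with ".php", anchor the pattern to the end instead:
<FilesMatch "\.php$">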
Ref: [etext.lib.virginia.edu...]
Same problem with
<FilesMatch "^\.html">
As a result of this, these filetypes are taking the default setting you previously defined, which is a 24-hour expiry time.
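Put together, the corrected section might look something like this sketch. It merges the two blocks into a single pattern, which is my own shortcut rather than something required; two separate FilesMatch sections with end-anchored patterns would work the same way.

```apache
# Apply a customized Cache-Control header to all .php and .html files.
# The "$" anchors the match to the end of the filename.
<FilesMatch "\.(php|html)$">
ExpiresDefault A1
Header unset Cache-Control:
Header append Cache-Control: "no-cache, must-revalidate"
</FilesMatch>
```

After uploading the change, re-running the cacheability checker on a .html page is a quick way to confirm that the section is now being applied.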
Jim
Thanks for the advice. I've fixed the regular expressions. My site seems to be performing much faster now that I've implemented basic caching.
Here's a tip for anyone else who is implementing caching in .htaccess for the first time:
Make sure you set ExpiresDefault to be a small number - the equivalent of hours rather than days. If you make it too big and you make a mistake in your caching instructions, then your site will get cached for a long time and you will see stale pages until the cache expires.
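For instance, a cautious default while testing could look like this. The one-hour and one-day figures are just example values.

```apache
ExpiresActive On
# While testing: cache everything for 1 hour (3600 seconds) after access
ExpiresDefault A3600
# Once the rules are confirmed to work, a longer default could be used, e.g.:
# ExpiresDefault A86400
```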
XBitHack looks very useful but quite complicated... and not without a few drawbacks. I'll need to do more reading before implementing that.
Thanks once again Jim