Forum Moderators: phranque
Options -Indexes
Options +FollowSymLinks
RewriteEngine On
RewriteBase /
# Internally rewrite SEO-friendly URLs to scripts with query strings (all fine)
RewriteRule ^(.*)-s.html$ index.php?section=$1&%{QUERY_STRING}
RewriteRule ^(.*)-s-(.*)-p-(.*)-d.html$ index.php?section=$1&page=$2&%{QUERY_STRING}
RewriteRule ^(.*)-s-(.*)-a-(.*)-t.html$ index.php?section=$1&advertid=$2&%{QUERY_STRING}
# Externally Redirect direct client requests for non-canonical URLs to canonical URL (fine)
RewriteCond %{HTTP_HOST} ^example\.co\.uk
RewriteRule ^(.*) http://www.example.co.uk/$1 [R=301,L]
# Externally Redirect for a particular script URL to a page in root, to allow unique title and meta data (fine)
RewriteCond %{QUERY_STRING} ^type=adv$
RewriteRule ^index.php$ /Property+Search.html? [R=301,L]
# Externally Redirect direct client requests for 'escaped' script URLs (from above 3 internal rewrites) back to SEO-friendly URLs (they do no harm but don't work - please help!)
RewriteCond %(THE_REQUEST) ^section=$1&%{QUERY_STRING}
RewriteRule ^index.php$ http://www.example.co.uk/(.*)-s.html? [R=301,L]
RewriteCond %(THE_REQUEST) ^section=$1&page=$2&%{QUERY_STRING}
RewriteRule ^index.php$ http://www.example.co.uk/(.*)-s-(.*)-p-(.*)-d.html? [R=301,L]
RewriteCond %(THE_REQUEST) ^section=$1&advertid=$2&%{QUERY_STRING}
RewriteRule ^index.php$ http://www.example.co.uk/(.*)-s-(.*)-a-(.*)-t.html? [R=301,L]
Thanks, Dave
The last of the new rules should probably look something like this, although I can't be sure without seeing example requested URLs and intended substitution URLs, and one of your query string variable names seems to be missing (I substituted "tango"):
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.php\?section=([^&]+)&advertid=([^&]+)&tango=([^&\ ]+)\ HTTP/
RewriteRule ^index\.php$ http://www.example.com/%1-s-%2-a-%3-t.html? [R=301,L]
GET /index.php?section=top&advertid=abc123&tango=xyz HTTP/1.1
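Since Apache's regex flavour is close to Python's, the RewriteCond pattern can be sanity-checked outside the server. This is just an illustrative sketch, using the sample request line above and the placeholder variable name "tango" (Apache escapes literal spaces with backslashes; Python does not need that, and `%1`–`%3` become ordinary capture groups):

```python
import re

# The RewriteCond pattern above, translated to Python regex syntax
pattern = re.compile(
    r'^[A-Z]+ /index\.php\?section=([^&]+)&advertid=([^&]+)&tango=([^& ]+) HTTP/'
)

# THE_REQUEST holds the raw request line sent by the client
the_request = 'GET /index.php?section=top&advertid=abc123&tango=xyz HTTP/1.1'

m = pattern.match(the_request)
section, advertid, tango = m.groups()

# Build the substitution URL the same way %1, %2, %3 are used in the RewriteRule
redirect_url = f'http://www.example.com/{section}-s-{advertid}-a-{tango}-t.html'
print(redirect_url)  # http://www.example.com/top-s-abc123-a-xyz-t.html
```

Testing the pattern this way before deploying it to .htaccess can save a lot of trial-and-error against a live server.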
Also, the internal rewrite which corresponds to this redirect could be coded for much more efficient operation by making the patterns more-specific -- again in this case using negative-match subpatterns:
RewriteRule ^([^\-]+)-s-([^\-]+)-a-([^\-]+)-t\.html$ index.php?section=$1&advertid=$2&tango=$3 [L]
Since these two examples represent your most-complex rules, I trust you can derive the simpler ones from them... :)
Two rules of thumb:
1) Never use multiple ".*" subpatterns in a pattern when it can be avoided. ".*" is a greedy pattern meaning "match anything and everything," and therefore the first such subpattern will 'consume' the entire string, leaving the rest of the pattern to 'starve'. The matching engine then has to back off from the end of the string one, then two, then three ... characters per pass, trying to leave enough characters to find a match for the other ".*" subpatterns and exact-match characters. This can result in dozens, hundreds or even thousands of matching attempts for just this one rule, and can actually slow down a busy server.
It is far better to use the negative-match subpatterns when appropriate, since such a pattern can be evaluated in a single left-to-right pass. In case it is not clear, "([^&]+)" means "match one or more characters not an ampersand," or equivalently, "match one or more characters until you find an ampersand and then stop."
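The difference in capture behaviour is easy to see in any PCRE-style engine. A small illustrative example (Python regex, but the same logic applies to Apache's patterns): against a query string with several ampersands, a greedy ".*" backtracks to the last delimiter, while "[^&]+" stops at the first one in a single pass.

```python
import re

s = 'section=top&advertid=abc123&tango=xyz'

# Greedy: the first .* swallows as much as possible, then backs off
# until the pattern can complete -- so the LAST '&' wins
greedy = re.match(r'section=(.*)&(.*)', s)
print(greedy.groups())   # ('top&advertid=abc123', 'tango=xyz')

# Negated class: stops at the FIRST '&', matched in one left-to-right pass
negated = re.match(r'section=([^&]+)&(.*)', s)
print(negated.groups())  # ('top', 'advertid=abc123&tango=xyz')
```

Besides the wasted CPU cycles, the greedy version can silently capture the wrong substring, which is often the harder bug to spot.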
2) Put your rules in order with external redirects first, starting from most-specific pattern/least URLs affected to least-specific pattern/most URLs affected, then your internal rewrites, again in order from most-specific to least-specific.
You can also combine your Options into one line as "Options -Indexes +FollowSymLinks", and get rid of the "RewriteBase /" directive entirely; it is most likely not needed, since "RewriteBase /" is the default behavior.
(Hopefully, no more typos -- Been bad for that today...)
Jim
[edited by: jdMorgan at 1:15 am (utc) on Jan. 22, 2009]
The 3 internal rewrites were inherited when I took over the site; they do the job they’re intended to do, without an [L] or any server problems, so I would need to be very careful if changing them. It is php4. An example of an external with its internal address (from the second rewrite of the 3) is :
#SE friendly url
/Travel+Services-s-17-p-Ferries-d.html
#internal url
/index.php?section=Travel+Services&page=17
Another version is internally generated for each internal page, but I won't complicate matters here! (It confuses me, and both can be addressed by clients - hence my concern that Google can see each page 3 ways, and the need for external redirects for these.)
So if the internal rewrites are working correctly, the apparently missing query string (I agree with you) is a mystery, to do with what happens in the internal code. The above example comprises a Section (Travel+Services) with an Article (Ferries), i.e. 's' and 'a', so what is 'd'? (There's nothing below Articles, and only the Travel+Services Section has Articles.) So I've tried removing the 'd', but that stopped me going to the Article page. I want to use or adapt your excellent code, but I first need to understand what's happening here. My PHP knowledge is not clever enough yet to analyse the code in the files. Maybe I shouldn't worry too much about Google penalising me for duplicate content?
Any ideas? Thanks again, Dave
The 3 internal rewrites were inherited when I took over the site; they do the job they’re intended to do, without an [L] or any server problems...
Except for a lot of wasted CPU cycles, and an attendant loss of performance. Always use [L] unless you have a very good reason not to.
I don't know what 'd' is either. All I can speak of (and think in terms of) is input URL-path, and either output URL (external redirect syntax and function) or output filepath (internal rewrite syntax and function).
I also can't help solve the "missing t=" problem, as I have no idea what it means or how it needs to be handled.
Start with a click on a URL appearing on your page, and work through the process that way. Starting anywhere else in the process is "starting in the middle" and leads to confusion and difficulty.
1) A link is clicked.
2) An HTTP request for that URL arrives at your server.
3) The requested URL-path is passed to mod_rewrite.
4) mod_rewrite compares the requested URL-path against the RewriteRule patterns.
5) If a match is found, that rule's RewriteConds are checked.
6) If the rule's RewriteConds match, the redirect or rewrite is invoked and the URL-path is updated.
7) If an [L] flag is found on the invoked rule, processing stops for this pass and (in .htaccess) rule processing re-starts from the top.
8) If no [L] flag is present, all subsequent rules are processed (usually a total waste of time).
Once all matching rules have been invoked and no further rule-matches are found, control is passed to the next Apache module, and once all modules have executed, the content-handler phase is invoked to either serve a static page or invoke a script to generate and serve a page.
(This is vastly simplified, but is intended only to illustrate the most basic procedure).
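The pass-and-restart behaviour described above can be sketched as a toy loop. This is an assumption-laden model for illustration only (real mod_rewrite keeps the query string separate from the matched URL-path, among many other differences); the rule shown is Jim's corrected advert rule with the placeholder "tango":

```python
import re

# (pattern, substitution-template, has_L_flag) -- a toy rule set
rules = [
    (r'^([^\-]+)-s-([^\-]+)-a-([^\-]+)-t\.html$',
     r'index.php?section=\1&advertid=\2&tango=\3', True),
]

def process(path, rules, max_passes=10):
    """Crude model of per-directory (.htaccess) rewrite processing:
    rules run in order; an [L]-flagged match ends the current pass,
    and processing restarts from the top until a pass changes nothing."""
    for _ in range(max_passes):          # the restart loop
        changed = False
        for pattern, subst, l_flag in rules:
            new_path, n = re.subn(pattern, subst, path)
            if n:
                path, changed = new_path, True
                if l_flag:
                    break                # [L]: end this pass, restart from top
        if not changed:
            return path                  # a clean pass: processing is done
    return path

print(process('Top-s-abc123-a-xyz-t.html', rules))
# index.php?section=Top&advertid=abc123&tango=xyz
```

The second pass finds no match against the rewritten path, so processing terminates -- which is exactly why the redirect rules need a guard (THE_REQUEST) to avoid matching on that second pass.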
I just spotted a problem with the internal rewrite rule I posted above. It should appear as
RewriteRule ^([^\-]+)-s-([^\-]+)-a-([^\-]+)-t\.html$ index.php?section=$1&advertid=$2&tango=$3 [QSA,L]
Jim
Thanks Jim. I'll remember the [L] and add it, and I'll also research where the 'd' and 't' come from (the 1st rewrite only has the 's', though, so I also need to search for the %{QUERY_STRING} - can't work out the code for this either). For this I've just tried (without success):
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.php\?section=([^&\ ]+)\ HTTP/
RewriteRule ^index\.php$ http://www.example.com/%1-s.html? [R=301,L]
Thanks for the insight into how Apache works - I understood it perfectly, where official tutorials blind you with science and quickly lose novices like me. I am a technical writer and understand the need to lead from the top instead of leaping straight into the nitty gritty. Maybe we should write a winning manual along those lines together?
I'll carry on with this problem now (and keep in mind your new internal rewrite code)!
Cheers, Dave
The redirect is used to establish the "correct" URLs in search engines -- to force them to recognize the new URLs and tell them to ascribe the link-popularity and PageRank of the old URLs to the new. It also serves as a heads-up to people using bookmarks (if they notice that the URL in their address bar changes). Remember, URLs are used on the Web -- they define the "Web view" of your site.
You need to do this because although you update your entire site to link only to the new URLs, there are undoubtedly many links out there on the Web to the old URLs, and you need to indicate to the search engines that the new ones are the "right" ones.
So, we redirect the client to the new URL, and the client then re-requests the page using the new URL. When this request arrives at the server, it is then internally rewritten to the script filepath -- and no indication of this rewrite is given to the client. So the internal filepath is now only "associated" with the URL, and no longer resembles it.
If you don't include the RewriteCond checking THE_REQUEST in the redirect rules, you will indeed get an infinite rewrite/redirect loop if your code is in .htaccess. This is because mod_rewrite in an .htaccess context behaves recursively, and is re-started if any rule is invoked. The RewriteCond checking THE_REQUEST ensures that the redirect is only invoked if the old URL/filepath is requested directly by a browser or search engine robot, and will not be invoked by an internal script filepath request resulting from the action of the internal rewrite rule.
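The loop-prevention mechanism rests on one fact: THE_REQUEST holds the original request line from the client and never changes, no matter how many internal rewrites occur. A hypothetical sketch of the second rewrite pass (names and URLs are illustrative, not from a real server):

```python
import re

# THE_REQUEST is frozen at what the client actually sent
the_request = 'GET /Top-s-abc123-a-xyz-t.html HTTP/1.1'

# After the internal rewrite, later passes see index.php as the URL-path
rewritten_path = 'index.php'

# The redirect rule's pattern DOES match the rewritten path...
rule_matches = re.match(r'^index\.php$', rewritten_path) is not None

# ...but its RewriteCond on THE_REQUEST does not, because the client
# never requested /index.php?section=... directly
cond_matches = re.search(r'^[A-Z]+ /index\.php\?section=',
                         the_request) is not None

print(rule_matches, cond_matches)  # True False -> no redirect, no loop
```

Without the condition, the rule would fire on the internally rewritten path, redirect the client to a fresh URL, and start the cycle again.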
If you're not seeing any redirects, then it is because the requested URL-path doesn't exactly-match the pattern of any of the redirect rules. That and you need to be sure to completely flush your browser cache after changing any server-side code. Otherwise, your browser will cache the page contents and the server response code, and use the cached data instead of making a request to your server. If you are doing a very long test session, consider disabling your browser cache (set the cache size or time to zero), but be sure to re-enable it when done!
Jim
On the other hand, an internal rewrite takes place entirely within the context of the current HTTP transaction.
This doesn't necessarily answer any of the questions above, but making sense of redirects and rewrites is almost impossible without some idea of how the whole process works.
Jim
Just a small point - when I referred to SE-friendly URLs I really meant the ONLY URL a SE is allowed to see (not ALSO the other internal ones that somehow have been revealed, causing duplicated content). I can live with 'Travel+Services-s-19-p-Language+and+Travel+Guides-d.html' (probably the longest one) FOR NOW. My immediate concern is any duplicate-content penalties. This recently inherited website has 60% editable Sections, with the rest all having their Titles and meta data set the same as the Homepage, so I have stripped the 2 main ones out and placed them as .html files in the root directory, and added the following simple (successful) redirects in .htaccess :
# Externally Redirects for script URLs to pages in root, to allow unique title and meta data (fine)
#Before internal redirects
RewriteCond %{QUERY_STRING} ^type=adv$
RewriteRule ^index.php$ /Property+Search.html? [R=301,L]
#Necessary AFTER internal redirects (maybe because there is no RewriteCond)
RewriteRule ^Register-s.html$ /Register.html [R=301,L]
Anyway, it's a very professional website, but it's a very competitive business, we are getting next to no visitors and I'm trying desperately to satisfy Google and get some sort of ranking (without things like artificial link exchanges, etc).
As I said, I will now try to analyse the strange internal rewrites we have.
Many thanks, Dave
But what if they split the ranking credit 50-50? Both URLs on page 3 (or page 10) of the search results isn't so good. And it only gets worse if you have multiple factors causing even more 'mirror' URLs: www- versus non-www, http versus https, etc...
Then there is a temporal effect, in that you can 'confuse and befuddle' the search engines when they are trying to figure out which URL to list. This requires a back-end process that is hugely expensive in terms of computing power, and which wouldn't be needed if Webmasters didn't allow canonicalization problems to occur in the first place. So, you throw yourself at the mercy of the search engines' back-end de-duplication process, and hope that they have time to figure out your site's quirks before they roll out a new index... If they don't get around to de-duplicating your site, then what happens? I don't know and don't care to find out.
If you haven't checked out the Google Search forum's Library here at WebmasterWorld, I'll commend it to you; Lots of tutorials --some quite old (in Web years) but still 99% true-- on how to get ranked in Google.
I won't say that these duplicate-content problems are the main cause of your ranking trouble. They're probably a factor, but likely not the main problem. However, it is wise to get these basic server configuration and linking issues resolved; at the very least, you'll have a much more solid foundation going forward to build your improvements on.
Jim
This is because the "-s"-format URL is rewritten to the script filepath, but then gets passed into the domain-canonicalization redirect rule. So the filepath is taken as the URL-path, gets tacked onto the corrected hostname, and is sent back to the client as the redirect URL... Bingo! Everyone now knows the internal filepath, and can use that as a valid (but non-canonical) URL.
Jim
Note: The time zone difference (US-UK) is such that I am often sleeping when you post. Really appreciate the time you have taken to help.
Dave
STOP!
Revealing the internal filepath is another cause of Duplicate Content.
jd was saying you should NOT reveal them.
That is, list all redirects before any rewrites to stop that happening.
The whole idea of the redirect is to force users to use one URL to access your content - asking for the "wrong" version of the URL simply makes the server tell you to go to the "right" URL for the content.
The whole idea of the rewrite is to translate the requested URL into an internal server filepath location, and fetch the content from it, without revealing what location actually is.
If you do the rewrite first, and then issue a redirect, users do get to see that internal filepath that you wanted to hide. That is bad news and should be avoided.
# Internally rewrite external URLs (with regular expressions) to scripts with query strings (variables) more efficiently than originally (all fine)
RewriteRule ^([^\-]+)-s\.html$ index.php?section=$1 [QSA,L]
RewriteRule ^([^\-]+)-s-([^\-]+)-p-([^\-]+)-d\.html$ index.php?section=$1&page=$2 [QSA,L]
RewriteRule ^([^\-]+)-s-([^\-]+)-a-([^\-]+)-t\.html$ index.php?section=$1&advertid=$2&tango=$3 [QSA,L]
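The [QSA,L] flag Jim recommends appends the incoming query string to the one built by the substitution, replacing the older `&%{QUERY_STRING}` idiom. A rough illustrative model of that merge (an assumption about the observable behaviour, not Apache internals; `rewrite_with_qsa` is a hypothetical helper):

```python
import re
from urllib.parse import parse_qsl, urlencode

def rewrite_with_qsa(path, query):
    """Model the simple '-s' rule with [QSA]: build the new query string
    from the substitution, then append whatever query the client sent."""
    m = re.match(r'^([^\-]+)-s\.html$', path)
    if not m:
        return None                      # rule does not apply
    new_query = [('section', m.group(1))]
    new_query += parse_qsl(query)        # [QSA]: append the incoming query
    return 'index.php?' + urlencode(new_query)

print(rewrite_with_qsa('Top-s.html', 'foo=bar'))
# index.php?section=Top&foo=bar
```

With no incoming query string the result is simply index.php?section=Top, so the same rule serves both cases without string-pasting ampersands by hand.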