Yet another .htaccess post - Apache Web Server forum at WebmasterWorld - WebmasterWorld

Forum Moderators: phranque

Message Too Old, No Replies

Yet another .htaccess post

eclipsetbs

10:52 pm on Jan 21, 2009 (gmt 0)

10+ Year Member

Sorry about this, but I'm still only part-way learning the huge number of different aspects of using .htaccess, particularly for redirects and rewrites. I have my .htaccess almost fully working as I want it to, except for 3 redirects for 'escaped' script urls, which are baffling me. Maybe an obvious answer? This is the 'examplified' top of my .htaccess file; the 3 problem redirects are at the bottom :

Options -Indexes
Options +FollowSymLinks
RewriteEngine On
RewriteBase /

# Internally rewrite SEO-friendly URLs to scripts with query strings (all fine)

RewriteRule ^(.*)-s.html$ index.php?section=$1&%{QUERY_STRING}
RewriteRule ^(.*)-s-(.*)-p-(.*)-d.html$ index.php?section=$1&page=$2&%{QUERY_STRING}
RewriteRule ^(.*)-s-(.*)-a-(.*)-t.html$ index.php?section=$1&advertid=$2&%{QUERY_STRING}

# Externally Redirect direct client requests for non-canonical URLs to canonical URL (fine)

RewriteCond %{HTTP_HOST} ^example\.co.uk
RewriteRule ^(.*) http://www.example.co.uk/$1 [R=301,L]

# Externally Redirect for a particular script URL to a page in root, to allow unique title and meta data (fine)

RewriteCond %{QUERY_STRING} ^type=adv$
RewriteRule ^index.php$ /Property+Search.html? [R=301,L]

# Externally Redirect direct client requests for 'escaped' script URLs (from above 3 internal rewrites) back to SEO-friendly URLs (they do no harm but don't work - please help!)

RewriteCond %(THE_REQUEST) ^section=$1&%{QUERY_STRING}
RewriteRule ^index.php$ http://www.example.co.uk/(.*)-s.html? [R=301,L]

RewriteCond %(THE_REQUEST) ^section=$1&page=$2&%{QUERY_STRING}
RewriteRule ^index.php$ http://www.example.co.uk/(.*)-s-(.*)-p-(.*)-d.html? [R=301,L]

RewriteCond %(THE_REQUEST) ^section=$1&advertid=$2&%{QUERY_STRING}
RewriteRule ^index.php$ http://www.example.co.uk/(.*)-s-(.*)-a-(.*)-t.html? [R=301,L]

Thanks, Dave

jdMorgan

12:27 am on Jan 22, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

You've got variables in RewriteCond patterns, and regular-expressions patterns in substitution URLs, neither of which will work.

The last of the new rules should probably look something like this, although I can't be sure without seeing example requested URLs and intended substitution URLs, and one of your query string variable names seems to be missing (I substituted "tango"):


RewriteCond %(THE_REQUEST) ^[A-Z]+\ /index\.php\?section=([^&]+)&advertid=([^&]+)&tango=([^&\ ]+)\ HTTP/ 
RewriteRule ^index\.php$ http://www.example.com/%1-s-%2-a-%3-t.html? [R=301,L]

This matches the exact request sent by the client (e.g. browser) and logged in your raw server access log, for example:

GET /index.php?section=top&advertid=abc123&tango=xyz HTTP/1.1

Also, the internal rewrite which corresponds to this redirect could be coded for much more efficient operation by making the patterns more-specific -- again in this case using negative-match subpatterns:


RewriteRule ^([^\-]+)-s-([^\-]+)-a-([^\-]+)-t\.html$ index.php?section=$1&advertid=$2&tango=$3 [L]

Note also that the literal period has been escaped, as required to match a period rather than serving as a regex token meaning "match any single character."

Since these two examples represent your most-complex rules, I trust you can derive the simpler ones from them... :)

Two rules of thumb:

1) Never use multiple ".*" subpatterns in a pattern when it can be avoided. ".*" is a greedy pattern meaning "match anything and everything" and therefore, the first such subpattern will 'consume' the entire string, leaving rest of the pattern to 'starve'. The matching engine will then have to back off from the end of the string one, then two, then three ... characters per pass, trying to leave enough characters to find a match for the other ".*" subpatterns and exact-match characters. This can result in dozens, hundreds or even thousands of matching attempts for just this one rule, and can actually slow down a busy server.

It is far better to use the negative-match subpatterns when appropriate, since such a pattern can be evaluated in a single left-to-right pass. In case it is not clear, "([^&]+)" means "match one or more characters not an ampersand," or equivalently, "match one or more characters until you find an ampersand and then stop."

2) Put your rules in order with external redirects first, starting from most-specific pattern/least URLs affected to least-specific pattern/most URLs affected, then your internal rewrites, again in order from most-specific to least-specific.

You can also combine your Options into one line as "Options -Indexes +FollowSymLinks", and get rid of the "RewriteBase /" directive entirely; It is most likely not needed, since "RewriteBase /" is the default behavior.

(Hopefully, no more typos -- Been bad for that today...)

Jim

[edited by: jdMorgan at 1:15 am (utc) on Jan. 22, 2009]

g1smd

2:02 am on Jan 22, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

List all redirects before the rewrites.

Additionally, add [L] the end of all of your rewrites.

eclipsetbs

5:40 pm on Jan 22, 2009 (gmt 0)

10+ Year Member

Thanks Jim (and glsmd). May I say you are prolific with superfast detailed answers to many posts (must be a full-time passion for you!). Sorry for my delay (been to hospital).
I am taking care of all you said, except for (I need to clarify this) :

The 3 internal rewrites were inherited when I took over the site; they do the job they�re intended to do, without an [L] or any server problems, so I would need to be very careful if changing them. It is php4. An example of an external with its internal address (from the second rewrite of the 3) is :

#SE friendly url

/Travel+Services-s-17-p-Ferries-d.html

#internal url

/index.php?section=Travel+Services&page=17

Another version is internally generated for each internal page but I won�t complicate matters here! (confuses me and both can be addressed by clients, hence my concern that Google can see each page 3 ways and the need for external redirects for these).

So if the internal rewrites are working correctly, the apparently missing query string (I agree with you) is a mystery, to do with what happens in the internal codes. The above example comprises a Section (Travel +Services) with an Article (Ferries), ie �s�and �a� so what is �d� ? (There�s nothing below Articles and only the Travel+Services Section has Articles). So I�ve tried removing the �d�, but that stopped me going to the Article page. I want to use or adapt your excellent code but I first need to understand what�s happening here. My php knowledge is not clever enough yet to analyse the code in the files. Maybe I shouldn�t worry too much about Google penalising me for duplicate content?

Any ideas? Thanks again, Dave

jdMorgan

10:20 pm on Jan 22, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

The 3 internal rewrites were inherited when I took over the site; they do the job they�re intended to do, without an [L] or any server problems...

Except for a lot of wasted CPU cycles, and an attendant loss of performance. Always use [L] unless you have a very good reason not to.

I don't know what 'd' is either. All I can speak of (and think in terms of) is input URL-path, and either output URL (external redirect syntax and function) or output filepath (internal rewrite syntax and function).

I also can't help solve the "missing t=" problem, as I have no idea what it means or how it needs to be handled.

Start with a click on a URL appearing on your page, and work through the process that way. Starting anywhere else in the process is "starting in the middle" and leads to confusion and difficulty.

A link is clicked An HTTP request for that URL arrives at your server. The requested URL-path is passed to mod_rewrite, mod_rewrite compares the requested URL-path against the RewriteRule patterns. If a match is found, that rule's RewriteConds are checked. If the rule's RewriteConds match, the redirect or rewrite is invoked and the URL-path is updated. If an [L] flag is found on the invoked rule, stop processing for this pass and (in .htaccess) re-start rule processing. If no [L] flag is present, then process all subsequent rules (usually a total waste of time).

Once all matching rules have been invoked and no further rule-matches are found, control is passed to the next Apache module, and once all modules have executed, the content-handler phase is invoked to either serve a static page or invoke a script to generate and serve a page.

(This is vastly simplified, but is intended only to illustrate the most basic procedure).

I just spotted a problem with the internal rewrite rule I posted above. It should appear as


RewriteRule ^([^\-]+)-s-([^\-]+)-a-([^\-]+)-t\.html$ index.php?section=$1&advertid=$2&tango=$3 [QSA,L]

if it is necessary to preserve query strings appended to "html" URLs. There are two problems, though. First, putting query strings on html URLs rather defeats the purpose of 'SEO friendly links," and secondly, now the internal rewrite and the external redirect rules are no longer complimentary. Without identifying all of the query string variables, there can be no solution to this problem. In fact, you cannot do the external redirect without resolving this question, because mod_rewrite has no idea of how to 'generate' the missing parameters. It has no knowledge outside of the variables it can examine, so it certainly cannot 'make up' something to put into the redirection URL... :(

Jim

eclipsetbs

10:21 pm on Jan 22, 2009 (gmt 0)

10+ Year Member

I just had a thought - Wouldn't my redirects be immediately converted back by the internal rewrites?!

I've tried various permutations without success (redirects always just ignored - no loops though!)

regards, Dave

eclipsetbs

10:58 pm on Jan 22, 2009 (gmt 0)

10+ Year Member

Our last posts crossed within seconds of each other1

Thanks Jim. I'll remember the [L] and add it and I'll also research where the 'd' and 't' come from (the 1st rewrite only has the 's' though, so I also need to search for the %{QUERY_STRING} - can't work out the code for this either. For this I've just tried (without success) :

RewriteCond %(THE_REQUEST) ^[A-Z]+\ /index\.php\?section=([^&\ ]+)\ HTTP/
RewriteRule ^index\.php$ http://www.example.com/%1-s.html? [R=301,L]

Thanks for the insight into how Apache works - I understood it perfectly, where official tutorials blind you with science and quickly lose novices like me. I am a technical writer and understand the need to lead from the top instead of leaping straight into the nitty gritty. Maybe we should write a winning manual along those lines together?

I'll carry on with this problem now (and keep in mind your new internal rewrite code)!

Cheers, Dave

jdMorgan

11:15 pm on Jan 22, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Yes and no.

The redirect is used to establish the "correct" URLs in search engines -- to force them to recognize the new URLs and tell them to ascribe the link-popularity and PageRank of the old URLs to the new. It also serves as a heads-up to people using bookmarks (if they notice that the URL in their address bar changes). Remember, URLs are used on the Web -- they define the "Web view" of your site.

You need to do this because although you update your entire site to link only to the new URLs, there are undoubtedly many links out there on the Web to the old URLs, and you need to indicate to the search engines that the new ones are the "right" ones.

So, we redirect the client to the new URL, and the client then re-requests the page using the new URL. When this request arrives at the server, it is then internally rewritten to the script filepath -- and no indication of this rewrite is given to the client. So the internal filepath is now only "associated" with the URL, and no longer resembles it.

If you don't include the RewriteCond checking THE_REQUEST inthe redirect rules, you will indeed get an infinite rewrite/redirect loop if your code is in .htaccess. This is because mod_rewrite in an .htaccess context behaves recursively, and is re-started if any rule is invoked. The RewriteCond checking THE_REQUEST ensures that the redirect is only invoked if the old URL/filepath is requested directly by a browser or search engine robot, and will not be invoked by an internal script filepath request resulting from the action of the internal rewrite rule.

If you're not seeing any redirects, then it is because the requested URL-path doesn't exactly-match the pattern of any of the redirect rules. That and you need to be sure to completely flush your browser cache after changing any server-side code. Otherwise, your browser will cache the page contents and the server response code, and use the cached data instead of making a request to your server. If you are doing a very long test session, consider disabling your browser cache (set the cache size or time to zero), but be sure to re-enable it when done!

Jim

g1smd

11:17 pm on Jan 22, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

*** Wouldn't my redirects be immediately converted back by the internal rewrites? ***

Yes, but one is an external URL and the other is an internal server filepath. :)

jdMorgan

11:33 pm on Jan 22, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I should also point out that when the server returns a redirect response, that terminates the current HTTP transaction, and the client has to start a new HTTP transaction if it wishes to 'follow' the redirect. When this new HTTP request arrives at the server, the server does not associate it in any way with the previous HTTP request that invoked the redirect; The server has no memory of that previous request, because HTTP is a stateless protocol.

On the other hand, an internal rewrite takes place entirely within the context of the current HTTP transaction.

This doesn't necessarily answer any of the questions above, but making sense of redirects and rewrites is almost impossible without some idea of how the whole process works.

Jim

eclipsetbs

12:44 am on Jan 23, 2009 (gmt 0)

10+ Year Member

Wow! On the question re any redirect/rewrite conflicts, thanks for the concise advice glsmd, but mostly for the thorough and eye-opening analysis Jim. I'm sure many others will also learn from that (and all your other posts). I have certainly taken it all onboard and will use it's advice.

Just a small point - when I referred to SE-friendly URLs I really meant the ONLY url a SE is allowed to see (not ALSO the other internal ones that somehow have been revealed, causing duplicated content). I can live with 'Travel+Services-s-19-p-Language+and+Travel+Guides-d.html' (probably the longest one) FOR NOW. My immediate concern is for any duplicated content penalties. This recently inherited website has 60% editable Sections, with the others all with their Titles and meta data set as the Homepage, so I have stripped the 2 main ones out and placed them as .html files in the root directory, and added the following simple (successful) redirects in .htaccess :

# Externally Redirects for script URLs to pages in root, to allow unique title and meta data (fine)

#Before internal redirects
RewriteCond %{QUERY_STRING} ^type=adv$
RewriteRule ^index.php$ /Property+Search.html? [R=301,L]

#Necessary AFTER internal redirects (maybe because there is no RewriteCond)
RewriteRule ^Register-s.html$ /Register.html [R=301,L]

Anyway, it's a very professional website, but it's a very competitive business, we are getting next to no visitors and I'm trying desperately to satisfy Google and get some sort of ranking (without things like artificial link exchanges, etc).

As I said, I will now try to analyse the strange internal rewrites we have.

Many thanks, Dave

jdMorgan

2:43 am on Jan 23, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

There's no such thing as a duplicate-content penalty; Whenever I refer to it, I try to always put the "penalty" word in quotes. It is simply a matter of two or more URLs (the only "unit" that search engines deal with) with identical content competing against each other for ranking. One URL might get 99% of the ranking credit and the other only 1%, but then at least the one with the 99% share has a chance to be on the first page of search results.

But what if they split the ranking credit 50-50? Both URLs on page 3 (or page 10) of the search results isn't so good. And it only gets worse if you have muliple factors causing even more 'mirror' URLs: www- versus non-www, http versus https, etc...

Then there is a temporal effect, in that you can 'confuse and befuddle' the search engines when they are trying to figure out which URL to list. This requires a back-end process that is hugely expensive in terms of computing power, and which wouldn't be needed if Webmasters didn't allow canonicalization problems to occur in the first place. So, you throw yourself at the mercy of the search engines' back-end de-duplication process, and hope that they have time to figure our your site's quirks before they roll out a new index... If they don't get around to de-duplicating your site, then what happens? I don't don't know and don't care to find out.

If you haven't checked out the Google Search forum's Library here at WebmasterWorld, I'll commend it to you; Lots of tutorials --some quite old (in Web years) but still 99% true-- on how to get ranked in Google.

I won't say that these duplicate-content problems are the main cause of your ranking trouble. They're probably a factor, but likely not the main problem. However, it is wise to get these basic server configuration and linking issues resolved; At the very least, you'll have a much more solid foundation going forward to build your improvements on.

Jim

jdMorgan

2:53 am on Jan 23, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

BTW, the easiest way to 'reveal' an internal filepath is to put external redirects after your internal rewrites. For example, the code in your original post is guaranteed to reveal an internal filepath/URL if the original request is for any of the "-s"-format URLs, but using the non-canonical domain...

This is because the "-s"-format URL is rewritten to the script filepath, but then gets passed into the domain-canonicalization redirect rule. So the filepath is taken as the URL-path, gets tacked onto the corrected hostname, and is sent back to the client as the redirect URL... Bingo! Everyone now knows the internal filepath, and can use that as a valid (but non-canonical) URL.

Jim

eclipsetbs

10:49 am on Jan 23, 2009 (gmt 0)

10+ Year Member

Thanks for that duplicated content insight Jim; it is good to know the real implications. I will also avidly read posts in the Google Search forum's Library. Now I will try to reveal the internal filepaths for all my pages using your method. Incidentally, that will have been how these were exposed in the first place (because until now the redirect for canonical urls has always been AFTER the internal rewrites). Some of us learn the hard way after confusing the SEs!

Note: The time zone difference (US-UK) is such that I am often sleeping when you post. Really appreciate the time you have taken to help.

Dave

g1smd

11:01 am on Jan 23, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

*** Now I will try to reveal the internal filepaths for all my pages using your method. ***

STOP!

Revealing the internal filepath is another cause of Duplicate Content.

jd was saying you should NOT reveal them.

That is, list all redirects before any rewrites to stop that happening.

The whole idea of the redirect is to force users to use one URL to access your content - asking for the "wrong" version of the URL simply makes the server tell you to go to the "right" URL for the content.

The whole idea of the rewrite is to translate the requested URL into an internal server filepath location, and fetch the content from it, without revealing what location actually is.

If you do the rewrite first, and then issue a redirect, users do get to see that internal filepath that you wanted to hide. That is bad news and should be avoided.

eclipsetbs

11:53 am on Jan 23, 2009 (gmt 0)

10+ Year Member

Thanks glsmd. But I plan only to reveal the internal filepaths (with the canonical redirect below the internal rewrites) to see myself what they are. Of course I will then replace the canonical one above the internal rewrites. As I said in my previous post, I recognise why the internal filepaths have already been revealed due my previous ignorance.
Regards, Dave

eclipsetbs

12:00 pm on Jan 23, 2009 (gmt 0)

10+ Year Member

Sorry glsmd. I'm so impulsive! On second thoughts, I will not do what I said - I think I already know what the internal filepaths are anyway and I will just try to se how these are derived in the php code (as I planned).

Dave

g1smd

12:43 pm on Jan 23, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

You should already know what the "real" internal filepaths are, because you will have had to code them in as the target internal filepaths of your internal rewrites when you set those up.

eclipsetbs

1:28 am on Jan 25, 2009 (gmt 0)

10+ Year Member

Just a quick update. I managed to change the 3 internal Rewrites successfully based on your post Jim and a day experimenting (will advise if I manage the converse later) :

# Internally rewrite external URLs (with regular expressions) to scripts with query strings (variables) more efficiently than originally (all fine)

RewriteRule ^([^\-]+)-s\.html$ index.php?section=$1&%{QUERY_STRING}=$2 [L]
RewriteRule ^([^\-]+)-s-([^\-]+)-p-([^\-]+)-d\.html$ index.php?section=$1&page=$2&%{QUERY_STRING}=$3 [L]
RewriteRule ^([^\-]+)-s-([^\-]+)-a-([^\-]+)-t\.html$ index.php?section=$1&advertid=$2&%{QUERY_STRING}=$3 [L]