Forum Moderators: phranque

Rewrite rule question

Variables separated by dashes, but one variable contains 4 dashes itself.

         

gcan

9:03 pm on Sep 28, 2009 (gmt 0)

10+ Year Member



Hello,

This rewrite rule:


RewriteRule ^([0-9a-z-]+)-([a-z]+)-([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)-([a-z]+)-([0-9]+).html$ multimedia.php?description=$1&action=$2&city=$3&cid=$4&pid=$5&album=$6&idm=$7&start=$8 [NC,L]

works fine for:


http://www.example.com/hotels-apartaments-hostels-united-kingdom-cat-0-13-0-0-en-0.html

At the same time I wonder how it can work. All variables in my rewrite rule are separated by dashes ("-"), but the first variable, "description", contains four dashes itself: "hotels-apartaments-hostels-united-kingdom".

So, I can't understand how Apache knows where the first variable ends. Is Apache reading rewrite rules from right to left?

I hope that you can understand what I mean.

As I said, this rule works fine and I don't want to change it. I opened this thread just to make sure that there is nothing wrong with this rewrite rule and I can leave this rewrite rule as it is.

[edited by: jdMorgan at 12:52 pm (utc) on Sep. 29, 2009]
[edit reason] example.com [/edit]

jdMorgan

10:16 pm on Sep 28, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It "knows" because the hyphens are allowed by your first subpattern. However, it's also inefficient, because the pattern-matching engine has to 'try' hundreds of times to get a match on your first subpattern while also getting a match on the subsequent seven subpatterns.

It will initially match the entire request-URL into the first subpattern, then fail to get a match on the second subpattern. So it will 'back off' one character and try again, fail, back off and fail again, etc. until it gets a match on both the first and second subpatterns. But then it will fail on the third subpattern, so it will again start the back-off-and-retry, one character at a time, until it can match the first through third subpatterns. But then it will fail on the fourth subpattern... I trust you can see how this continues until all subpatterns are matched, and that you also realize how horribly inefficient it is...
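The back-off behaviour is easy to observe by translating the rule's pattern into a regular expression in a scripting language. A minimal sketch (in Python here, purely for illustration; mod_rewrite's regex engine backtracks the same way):

```python
import re

# The RewriteRule pattern above, with the trailing period escaped.
# re.IGNORECASE stands in for the [NC] flag.
pattern = re.compile(
    r'^([0-9a-z-]+)-([a-z]+)-([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)'
    r'-([a-z]+)-([0-9]+)\.html$',
    re.IGNORECASE,
)

url = 'hotels-apartaments-hostels-united-kingdom-cat-0-13-0-0-en-0.html'
m = pattern.match(url)

# The greedy first group initially swallows everything it can,
# then backtracks one character at a time until the remaining
# seven groups can also match:
print(m.group(1))  # hotels-apartaments-hostels-united-kingdom
print(m.group(2))  # cat
print(m.group(8))  # 0
```
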

For best performance and to avoid an early server upgrade, you should avoid using any character as your parameter-delimiter that might also appear in the parameters themselves.

I would not recommend leaving this situation as-is.

You should also escape the period preceding 'html' with a backslash, as otherwise it means "match any single character" and is therefore ambiguous. Use "\.html"

Jim

gcan

11:02 pm on Sep 28, 2009 (gmt 0)

10+ Year Member



jdMorgan, thank you for your reply. OK, I changed everything. Can you take a look again please?

Thank you.


RewriteRule ^multimedia/([0-9a-z-]+)/([0-9a-z-]+)/([a-z]+)-([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)-([a-z]+)-([0-9]+)\.html$ multimedia.php?description=$1&country=$2&action=$3&city=$4&cid=$5&pid=$6&album=$7&idm=$8&start=$9 [NC,L]

----------------------------

[mydomain.com...]

jd01

5:15 am on Sep 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



RewriteRule ^multimedia/([0-9a-z-]+)/([0-9a-z-]+)/([a-z]+)-([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)-([a-z]+)-([0-9]+)\.html$ multimedia.php?description=$1&country=$2&action=$3&city=$4&cid=$5&pid=$6&album=$7&idm=$8&start=$9 [NC,L]

It looks much better and the following depends on your exact application, but you might be able to shorten and speed it up a bit, by matching anything except a /, followed by a / EG

RewriteRule ^multimedia/([^/]+)/([^/]+)/([a-z]+)-([0-9]+)-([0-9]+)-([0-9]+)-([0-9]+)-([a-z]+)-([0-9]+)\.html$ multimedia.php?description=$1&country=$2&action=$3&city=$4&cid=$5&pid=$6&album=$7&idm=$8&start=$9 [NC,L]

The pattern [^/] means:
^ = "not", when it is the first character inside the brackets
/ = the character that should not be matched.

If you do not need to check whether the input is all letters / numbers, you might be a bit faster using [^-] in the hyphenated section, but it really depends on your exact application and the ambiguity you are willing to allow in URL input.
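The difference is easy to see in a quick sketch (Python, illustration only, using a shortened version of the rule with fewer numeric fields): with `/` as the delimiter and `[^/]+` as the subpattern, each group stops at the next slash, so the engine never has to back off across segment boundaries.

```python
import re

# Slash-delimited version: each [^/]+ group cannot cross a '/',
# so the segment boundaries are found without cross-segment backtracking.
pattern = re.compile(r'^multimedia/([^/]+)/([^/]+)/([a-z]+)-([0-9]+)\.html$',
                     re.IGNORECASE)

m = pattern.match('multimedia/hotels-apartaments/united-kingdom/cat-13.html')
print(m.group(1))  # hotels-apartaments
print(m.group(2))  # united-kingdom
print(m.group(3))  # cat
print(m.group(4))  # 13
```
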

gcan

9:07 am on Sep 29, 2009 (gmt 0)

10+ Year Member



jd01, thank you for your reply.

Both variables, "description" and "country", are just descriptions for search engines and don't affect my application in any way.

I could use ([^/]+) but the question is about security. Is it safe? I don't know a lot about htaccess files/rewriting.

jd01

10:10 am on Sep 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, it should be... It simply says to match 'any character that is not a /', store that as a back-reference, and then reference it in the Query_String you are sending to your PHP. As long as you check your $_GET variables within your PHP script (which should be done with all passed variables anyway), there should be no issue at all.

.htaccess is not like PHP, where you have to worry too much about what it receives. Either the pattern matches or doesn't, and since you are probably checking all information being passed to PHP within the PHP itself, there's no need for the 'double check'. If you are not checking all variables passed into PHP, then I highly recommend you do, because it's where the security issue is most likely to arise if there is one.

You cannot 'break into' a site through mod_rewrite like you can with a 'scripting language'.

gcan

10:28 am on Sep 29, 2009 (gmt 0)

10+ Year Member



Yes, of course I check $_GET variables within my php scripts.

I asked this question just to be sure that it's OK not to check the variables in htaccess file (if they are checked in php file).

Thank you very much.

jd01

10:31 am on Sep 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not a problem... Glad I could help.

Yes, of course I check $_GET variables within my php scripts.

You might think that would be standard procedure for everyone, but unfortunately for some it's not... Which is too bad.

jdMorgan

1:02 pm on Sep 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Both variables - "description" and "country" are just descriptions for search engines and don't affect my application in any way.

Expanding on the "security" aspect a little, I should point out that if the above is true, then you have a "competitive security problem."

Assuming that I'm the Webmaster for your most serious competitor, what is to stop me from putting up a bunch of links around the web to "www.your-domain.com/sleazy-rat-infested-hotels-apartaments-hostels-united-kingdom-cat-0-13-0-0-en-0.html" and "www.your-domain.com/lice-infestations-at-hotels-apartaments-hostels-united-kingdom-cat-0-13-0-0-en-0.html" with apt descriptive text in the links, making these show up as valid links to your site in search results?

You must pass and check *all* of the parameters against your database. If the description and country do not *exactly* match (character-for-character) the expected values given the action, city, cid, pid, album, idm, and start values (as applicable), then pull the correct values from your database, and force a 301-Moved Permanently redirect to the correct URL.
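The check-and-redirect logic described above might be sketched like this (Python, illustration only; `lookup_expected_slug` and its data are hypothetical stand-ins for the real database query):

```python
def lookup_expected_slug(cid):
    # Hypothetical database lookup: the canonical description text
    # for this content id.
    canonical = {13: 'hotels-apartaments-hostels-united-kingdom'}
    return canonical.get(cid)

def resolve(requested_slug, cid):
    """Return (status, location): 200 if the requested slug matches the
    canonical one character-for-character, 301 plus the corrected URL
    if it doesn't, and 404 if the id is unknown."""
    expected = lookup_expected_slug(cid)
    if expected is None:
        return 404, None
    if requested_slug != expected:
        return 301, '/%s-cat-0-%d-0-0-en-0.html' % (expected, cid)
    return 200, None

print(resolve('sleazy-rat-infested-hotels', 13))
```
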

That handles malicious linking. You may also wish to apply some level of "intelligent analysis" to requested URLs which do not resolve, keying off the most-important combinations of parameters to try to find a match if the URL isn't quite correct. For example, it's quite common to get requests for URLs which are followed by periods, commas, quote marks, spaces, exclamation points, and the HTML tag closing character ">" due to faulty links or faulty auto-linking code in forums, blogs, etc. And of course, there are always typos and common spelling errors if a link is manually entered.

As a simple example, you might get a request for www.example.com/hotels-apartaments-hostels-united-kingdom-cat-0-13-0-0-en-0.html." because a forum auto-linked a URL, wrongly including the period at the end of a sentence.

Sometimes, the "security hole" isn't where you most expect it to be...

Jim

gcan

2:56 pm on Sep 29, 2009 (gmt 0)

10+ Year Member



jdMorgan, thank you very much for this message.

Some time ago I was thinking about this problem. I checked many websites, including large and serious ones. Many of them have descriptive texts in URLs which can be changed to any text, and the website still displays the same content because these descriptions are not checked against the database. So, I decided then that it's not a problem at all. But you are right, competitors may create a bunch of links which will show the same content.

Now I am going to change my scripts so that the php script checks all the parameters. If that is not possible (some texts come from language files, not the database), I will remove the descriptions from all URLs.

Is it a big plus in the eyes of search engines to have some descriptive text in URLs if I have Titles and Descriptions in the <head>?

I have checked my website, and it's not possible to open any content URL that is followed by additional characters like "html¦" or "html...". The browser displays the 404 page.

All other parameters (except 2 descriptions) are checked by my php scripts.

jdMorgan

3:07 pm on Sep 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Now I am going to change my scripts so that the php script checks all the parameters. If that is not possible (some texts come from language files, not the database), I will remove the descriptions from all URLs.

If it is possible to put the link on one of your pages, then it is possible to check that link-text...

> Is it a big plus in the eyes of search engines to have some descriptive text in URLs if I have Titles and Descriptions in the <head>?

Yes, the text in the URL is important as both a ranking factor and as an 'eyeball-catcher' in the search results -- remember that words matching the search terms will be bolded in the search results.

I *would not* remove the descriptive text from your links -- Find a way to check it.

Jim

gcan

3:18 pm on Sep 29, 2009 (gmt 0)

10+ Year Member



And one more question. What about customized 404 pages?
Can they be considered duplicate content?

jdMorgan

11:25 pm on Sep 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No, because when served, they are properly accompanied by a server response code of 404-Not Found, and URLs resulting in that response are not indexed by search engines.

This assumes that the server is properly configured; it is easy to make a mistake when declaring custom error documents that results in a 302-Found response, and this happens quite frequently to Webmasters who don't read the ErrorDocument directive documentation carefully and/or who do not test their error responses with a server headers checker. It's the kind of mistake where not spending 3 minutes reading (or not understanding what was read) can cost millions of dollars and hundreds of jobs, and unfortunately, it happens fairly often.
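The usual mistake is pointing ErrorDocument at a full URL instead of a local path, which makes Apache send the client an external redirect (a minimal sketch; the file path is only an example):

```apache
# Correct: a local path -- the 404 status code is preserved.
ErrorDocument 404 /errors/404.php

# Wrong: a full URL makes Apache issue a 302 redirect to the client,
# so search engines see "302 Found" instead of "404 Not Found".
# ErrorDocument 404 http://www.example.com/errors/404.php
```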

Jim

gcan

11:48 am on Sep 30, 2009 (gmt 0)

10+ Year Member



Just one example that even big websites contain descriptions in their URLs which are not checked against DB:

[games.yahoo.com...]
[games.yahoo.com...]
[games.yahoo.com...]
[games.yahoo.com...]

About 404 pages: my question was not correct. I wanted to ask about the errors which scripts return.

For example:
[domain.com...]

So, if id #5 is deleted, the script will show some info saying that this id doesn't exist. It will not be a real 404 page.

Caterham

12:58 pm on Sep 30, 2009 (gmt 0)

10+ Year Member



So, if id #5 is deleted, the script will show some info saying that this id doesn't exist. It will not be a real 404 page.

It is not that difficult to tell PHP to output a "404 Not Found" status header.
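In PHP that's a header('HTTP/1.0 404 Not Found') call made before any output is sent. The surrounding decision logic, sketched here in Python purely for illustration (the record data is a hypothetical stand-in for the database):

```python
# CGI-style sketch: choose the status line before printing the body,
# so a missing record produces a real 404 response rather than a
# "not found" page served with status 200.
RECORDS = {1: 'first album', 2: 'second album'}

def respond(record_id):
    record = RECORDS.get(record_id)
    if record is None:
        return 'Status: 404 Not Found', 'This id does not exist.'
    return 'Status: 200 OK', record

status, body = respond(5)
print(status)  # Status: 404 Not Found
```
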

g1smd

2:10 pm on Sep 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If it doesn't exist, the script must send a 'HEADER 404' message.

jdMorgan

6:25 pm on Sep 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> even big websites contain descriptions in their URLs which are not checked against DB...

Yes, well when your pages out-rank the Yahoo Games pages in search, then you won't need to worry about wasted ranking factors due to duplicate-content problems. How many hundreds of PR8, PR9, and PR10 inbound links do you have? Got any to spare? :)

Sometimes, looking at 'big Web sites' is not an appropriate thing to do, unless your site is also a 'big Web site'...

Jim

gcan

12:54 pm on Oct 2, 2009 (gmt 0)

10+ Year Member



Thank you very much to all. I changed my script and all variables are checked now. A 404 header is now sent if $id doesn't exist.

jdMorgan

1:39 pm on Oct 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Cool! You may never know if this has actually helped you, but at least you know that bogus (or mis-typed) descriptions in links now can't hurt you.

Jim