mod rewrite help needed please


Stuart Wright

12:31 am on Feb 11, 2008 (gmt 0)

10+ Year Member



Hello folks,
I need to create search engine friendly urls for my PHP generated web pages.
I would like, for example:

http://avplay.example.com/The Dirty Dozen Blu-ray Review/8741.html
to be rewritten to
http://avplay.example.com/index.php?showreview=8741

How would I do this, please?
Thanks in advance

[edited by: jdMorgan at 12:55 am (utc) on Feb. 11, 2008]
[edit reason] example.com [/edit]

g1smd

12:47 am on Feb 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What you are calling searchengine-friendly is anything but. Avoid spaces and underscores in URLs. Use hyphens or dots instead. Avoid Capitalisation Of Words Too. Use all lower-case.

As for the rewrite, this question comes up almost every day. Have a look at some earlier threads here for some pointers, as well as looking at the sticky posts pinned at the top of the forum.

Your example is very dangerous because I guess that it will accept www.domain.com/any-randoms-words-i-care-to-insert-here/8741.html and that is a major Duplicate Content issue waiting to happen. You are almost inviting people to abuse your site by making bogus URLs indexable, unless your scripting also checks the requested URL against the actual title of the page and then generates a 404 error if it does not match. That's a couple of extra lines in your script, rather than in the .htaccess file.

Stuart Wright

9:16 am on Feb 11, 2008 (gmt 0)

10+ Year Member



Thanks for the advice.
I can't see any Sticky threads at the top of the Apache Web Server forum. This site would be so much better if it used vBulletin.

The numeric bit at the end is the unique review ID which is the only bit of the URL which is used. The review title will be completely ignored and is for search engine use only.
I'll have a hunt around but I'm clueless at the ReWrite bit and I frankly would rather someone just told me what to put as I need to know this kind of stuff too rarely to make it worth learning it (I'm not lazy, just have a massive forum to run).

We could have anything in the url really. E.g. something like
http://avplay.example.com/the-dirty-dozen-blu-ray-review/showreview/1234.html
using the showreview bit to identify that we want to run the showreview script
and the 1234 numeric bit is the unique review id.
So it gets translated to
http://avplay.example.com/index.php?showreview=1234

[edited by: Stuart_Wright at 9:23 am (utc) on Feb. 11, 2008]

[edited by: jdMorgan at 3:49 pm (utc) on Feb. 11, 2008]
[edit reason] No URLs, please. Please see Terms of Service. [/edit]

wilderness

11:44 am on Feb 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> I can't see any Sticky threads at the top of the Apache Web Server forum. This site would be so much better if it used vBulletin.

Perhaps you should read the FAQ?
Requires cookies and being logged in.

jdMorgan

4:17 pm on Feb 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This thread is untenable from the standpoint of our forum charter [webmasterworld.com] and our Terms of Service [webmasterworld.com]. If you do not wish to learn, then this is likely not the right forum for you. I'd suggest you look elsewhere or hire somebody. You've got a big forum to run, true, but many of us who volunteer here have our own clients or sites to look after.

Such "I can't be bothered" statements don't exactly motivate members to help you here.

And if you like vBulletin, compare a vBulletin forum with this one on dial-up. Now try it on a PDA or cell phone. Be sure you do these tests while in Europe, or in a region where you must pay by the kilobyte, too. This site is fast and efficient, and forgoes the kind of 'fluff' that bloats up other forums.

The rule you seek is trivial.


RewriteRule ^[^/]+/([0-9]+)\.html$ /index.php?showreview=$1 [L]

but of course, it comes with the duplicate-content threat that g1smd warned of, and has some side-effects (you must use server-relative or canonical links on your pages). In order to prevent your competitors from creating 'junk' links to your site so as to drive it out of the SERPs, you'll need to validate the 'text' part of the link within your script, and return a 404-Not Found for all invalid URLs.

Jim

[edit] Amended reference to duplicate-content warning. [/edit]

[edited by: jdMorgan at 10:21 pm (utc) on Feb. 11, 2008]

Stuart Wright

6:44 pm on Feb 11, 2008 (gmt 0)

10+ Year Member



JD, thanks. I guessed that I might need to pay someone if specific help were not freely available.
It's not that I can't be bothered, JD, it's that I'm working a 60 hour week and have a family to entertain, so, though I would certainly enjoy learning this stuff, I unfortunately don't have time to do so.
I did have a look through existing threads but, unless there is an instance of someone replicating exactly my needs, then I'm not going to find something appropriate, and this stuff is way too far over my head for me to modify something close.

Perhaps you could explain this duplicate content threat you refer to as I don't understand it.
I'll go look up 'canonical' in a dictionary now as it's not a term I am familiar with.
As I say, the text part of the link is irrelevant and it is the numeric code prior to the '.html' which is the important bit.
I'm baffled that the text part of the link is at all relevant and that someone could or would want to drive our site out of SERPs (another term I'll go look up).

Your help is appreciated. Thank you.

[edited by: Stuart_Wright at 6:52 pm (utc) on Feb. 11, 2008]

jdMorgan

7:20 pm on Feb 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Only 60 hours -- Luxury!

> the text part of the link is irrelevant.

It's irrelevant to you, but not to search engines... Let's say I'm a competitor, and I want to cause you grief. I can link to your /<irrelevant>/<idNumber>.html URLs like this:
example.com/real-junk-on-this-site/1234.html
example.com/more-defective-products/1234.html
example.com/OSHA-recalled-products/1234.html
example.com/major-online-scammers/1234.html

Any and all of those links will return the "showreview=1234" page. So not only do you get the "benefit" of all those "nice" keywords in the linked URL, you also get the same content showing up at all four of them -- And do realize that since the first part of the URL-path is ignored, the combinations for potentially-malicious links (or even just misspelled URLs) are endless.
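
To see the effect concretely, the pattern from the rule above can be exercised outside Apache. The following is only a sketch in Python, not part of the original posts: it assumes the rule lives in a .htaccess file, where the path is matched without its leading slash, and the helper name rewrite is made up for illustration.

```python
import re

# The pattern from the RewriteRule earlier in the thread. In a .htaccess
# file Apache matches it against the URL-path without its leading slash.
PATTERN = re.compile(r"^[^/]+/([0-9]+)\.html$")


def rewrite(path):
    """Return the internal target for a requested path, or None if no match."""
    match = PATTERN.match(path)
    if match is None:
        return None
    return "/index.php?showreview=" + match.group(1)


paths = [
    "real-junk-on-this-site/1234.html",
    "more-defective-products/1234.html",
    "OSHA-recalled-products/1234.html",
    "major-online-scammers/1234.html",
]

# Collect the distinct targets produced by the four different URLs.
targets = {rewrite(p) for p in paths}
print(targets)  # {'/index.php?showreview=1234'}
```

All four requests collapse to the same internal target, which is why the text part has to be checked somewhere.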

Now Google hates duplicate content; they don't want the same content to appear in their index under more than one "canonical" URL. They will remove all duplicate URL listings, and *they* get to pick which ones they remove, taking the choice out of your control. They *are* influenced however by incoming links and all other rank-weighting factors, so the "control" may actually go to your malicious competitor if he tries hard enough. This is, as we say in the business, "non-optimal."

Therefore, I recommend you check these 'irrelevant' URL-path-parts in your script, because they are in fact quite important.

...

If you rewrite URLs from a /subdirectory/pageID form to a /page?names=args form, the rewrite happens inside the server and the browser never sees it. The browser therefore resolves all page-relative included-object links on the page against the domain/directory/subdirectory/ path currently shown in its address bar, appending the relative link to it. Search engine spiders use the same rule, although they don't have an address bar per se.

Therefore a link such as
<img src="images/logo.gif"> appearing on your page at example.com/blue-ray/1234.html will be requested from the canonical URL example.com/blue-ray/images/logo.gif -- Probably not what you want.

The solution is to use server-relative links, such as
<img src="/images/logo.gif">
or canonical links, such as
<img src="http://example.com/images/logo.gif">
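
The resolution rule can be checked with urljoin from Python's standard library, which resolves relative references the way a browser does. This is an illustrative sketch only; the page address is the example.com/blue-ray/1234.html example used above, with an http:// scheme assumed.

```python
from urllib.parse import urljoin

# Address of the page as the browser sees it (before any server-side rewrite).
page = "http://example.com/blue-ray/1234.html"

# Page-relative link: resolved against the "directory" part of the page URL.
page_relative = urljoin(page, "images/logo.gif")

# Server-relative link: resolved against the site root.
server_relative = urljoin(page, "/images/logo.gif")

# Full (canonical) link: used exactly as written.
absolute = urljoin(page, "http://example.com/images/logo.gif")

print(page_relative)    # http://example.com/blue-ray/images/logo.gif
print(server_relative)  # http://example.com/images/logo.gif
print(absolute)         # http://example.com/images/logo.gif
```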

I've used the term "canonical" in several ways here. Like many words, the intended meaning or "focus" varies according to context. But as used here, the meaning revolves around the concepts of "right and usual," "complete," "orthodox," or "conforming to the rules."

So a canonical URL is one that is the "one and only" URL that can be used to reach any given content.
A canonical URL may also mean the "full, formal URL" as in http://www.example.com/blue-ray/1234.html
Definition [google.com]

[added] SERPs: Search Engine Results Pages [/added]

Jim

[edited by: jdMorgan at 7:23 pm (utc) on Feb. 11, 2008]

Stuart Wright

8:00 pm on Feb 11, 2008 (gmt 0)

10+ Year Member



Ah very interesting. Thank you for both those nuggets of wisdom.
To combat the first issue, I should really pass the text element of the url through as a query string as well as the page id. Since this text will always be the title of the review (in lower case with spaces replaced with dashes), when I read the title from the database and it doesn't match the given text, I can force a 404 error with php.

But I guess I'm going to need a $2?
One last favour - how do you pass the contents between one pair of slashes as $1 (the text element) and the number before the '.html' (the review id) as the $2?

All my scripts use canonical links, so I'm safe there.

Many thanks for your generous advice.

jdMorgan

9:35 pm on Feb 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The URL-path-part matching any parenthesized sub-pattern is assigned to a $n variable (where n is 1 to 9) in the order indicated by the left parentheses, i.e. nesting is allowed. So in this case, add parentheses around the first subpattern to capture the "page title", and then back-reference it in the substitution path as $1. The "page" number will then be $2.

Jim
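
As a rough sketch of that description, here is the two-capture pattern in Python, whose re module numbers groups by their left parenthesis in the same way. The title query parameter name is an assumption, not something specified in the thread, and the Apache line in the comment is an untested guess at the equivalent rule.

```python
import re

# Assumed Apache equivalent (the "title" parameter name is hypothetical):
#   RewriteRule ^([^/]+)/([0-9]+)\.html$ /index.php?showreview=$2&title=$1 [L]
PATTERN = re.compile(r"^([^/]+)/([0-9]+)\.html$")


def rewrite_with_title(path):
    """Return the internal target carrying both captures, or None if no match."""
    match = PATTERN.match(path)
    if match is None:
        return None
    title = match.group(1)      # first left parenthesis  -> $1
    review_id = match.group(2)  # second left parenthesis -> $2
    return "/index.php?showreview=" + review_id + "&title=" + title


# Nested groups are also numbered by the position of their opening parenthesis.
nested = re.match(r"^((the)-[^/]+)/([0-9]+)\.html$", "the-dirty-dozen/8741.html")
print(nested.groups())  # ('the-dirty-dozen', 'the', '8741')
```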

g1smd

11:11 pm on Feb 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, you need to compare the hyphenated part of the URL with the page title as stored in the database, and reject anything that does not match it with a 404 error. Otherwise, every page of your site will have an infinite number of URLs that could work.
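
A minimal sketch of that check, written in Python for illustration rather than the PHP the site actually uses. The REVIEWS dictionary stands in for the database lookup and the function names are hypothetical; the slug rule (lower case, spaces replaced with hyphens) is the one described earlier in the thread.

```python
# Hypothetical stand-in for the reviews table: review id -> stored title.
REVIEWS = {8741: "The Dirty Dozen Blu-ray Review"}


def slugify(title):
    """Lower-case the title and replace spaces with hyphens."""
    return "-".join(title.lower().split())


def status_for(url_title, review_id):
    """Return the HTTP status the script should send for this request."""
    title = REVIEWS.get(review_id)
    # Unknown id, or the text part does not match the stored title: 404.
    if title is None or slugify(title) != url_title:
        return 404
    return 200


print(status_for("the-dirty-dozen-blu-ray-review", 8741))  # 200
print(status_for("major-online-scammers", 8741))           # 404
```

With this in place only one text-plus-id combination per review returns content, so bogus links end in a 404 instead of a duplicate page.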