
Apache Web Server Forum

This 58-message thread spans 2 pages; this is page 2.
301'ing over 50 thousand pages
Tom_Cash




msg:4278369
 3:57 pm on Mar 8, 2011 (gmt 0)

Hey guys,
Where we work, we have a product index of over 50 thousand items.

I generated some PHP to automatically write 55k worth of rewrite rules for me, because I don't know Apache server configuration directives and regex well enough.

This method worked well; however, it almost crashed the server.

In reality, we can only cope with 10k worth of rewrites.

This is an issue, however, as we have 35k worth of pages in Google's index.

How can I get around this?

Any advice would be super!

Kind regards,
Tom.

 

Tom_Cash




msg:4283694
 2:46 pm on Mar 18, 2011 (gmt 0)

Thanks for your reply.

"Change all the Redirect instructions to use RewriteRule syntax with the [R=301,L] flags. The target must also include the protocol and domain name."

Okay, cool. Why would you recommend this? I tried it, and it stopped working. I'm really confused. :| Do you think we have an unusual server configuration or something?

"The internal rewrites must be moved to the very end of the file. Currently the first internal rewrite grabs all the requests, and the longer rule at the end never gets to run. The "catch all" rule must be the very last rule of all."

By "catch all" do you mean this statement:

# REMOVE PHP EXTENSIONS
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ $1.php [L,QSA]


"1. Is it right that + and - in the old URL become _ in the new URL?"

Yeah.

"2. Is the 7ua11 in the old URL, still 7ua11 in the new URL, or is it 7ub21? If the latter, then RewriteRule has NO WAY to know how to change the numbers."

Yeah, 7ua11 is the same in both URLs.

g1smd




msg:4283846
 8:01 pm on Mar 18, 2011 (gmt 0)

The fix is merely getting the rules in the right order:
- all of the external redirects listed before any of the internal rewrites,
- within the list of external redirects, the most specific first and the more general last,
- within the list of internal rewrites, the most specific first and the more general last,
AND
- getting the correct syntax for all of the code.

Use RewriteRule syntax for all of the rules; otherwise you risk rules running out of order, which introduces unwanted multiple-step redirection chains and/or external redirects that expose previously rewritten requests back out on to the web as URLs. Directives within .htaccess are evaluated in "per-module" order, not in the exact order they are listed in the file.

Yes, the one with .* pattern is the "catch all" rule and must be last.
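
For illustration only, here is a rough sketch of what converting a mod_alias Redirect into RewriteRule syntax can look like, followed by the suggested layout as comments. The www.example.com host name and the exact pattern are placeholders adapted from rules posted in this thread, not a tested configuration.

```apache
# mod_alias directive: handled by a different module from the RewriteRules
# around it, so it does not run in the order it appears in the file.
# Redirect 301 /enquire.php http://www.example.com/general_enquiry

# Roughly equivalent mod_rewrite rule, processed in file order with the rest.
# (Redirect matches on a path prefix, so the two are not strictly identical.)
RewriteRule ^enquire\.php$ http://www.example.com/general_enquiry [R=301,L]

# Suggested layout of the whole file:
#   1. external redirects, most specific pattern first
#   2. internal rewrites, most specific pattern first
#   3. the (.*) catch-all rewrite, last of all
```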

Tom_Cash




msg:4284999
 12:13 pm on Mar 21, 2011 (gmt 0)

Thanks for another reply. :)

I'm not sure exactly what makes the difference between an internal and external rewrite... I followed your instructions otherwise, but I'm still getting errors all over the shop.

Here's the new file:

# ERROR DOCUMENTS
ErrorDocument 400 http://website.com/error
ErrorDocument 401 http://website.com/error
ErrorDocument 403 http://website.com/error
ErrorDocument 404 http://website.com/error
ErrorDocument 500 http://website.com/error

RewriteEngine On

# HOME
RewriteRule /index http://website.com [R=301,L]

# REDIRECT OLD CONTENT
# PRODUCT GROUPS
RewriteRule /search/+/all+manufacturers/+/cnc/2/10/1/ /products/cncs/1 [R=301,L]
RewriteRule /search/+/all+manufacturers/+/drive/4/10/1/ /products/drives/1 [R=301,L]
RewriteRule /search/+/all+manufacturers/+/servo+drive/18/10/1/ /products/servos/1 [R=301,L]
RewriteRule /search/+/all+manufacturers/+/encoder+%26+resolver/5/10/1/ /products/motors_and_encoders/1 [R=301,L]
RewriteRule /search/+/all+manufacturers/+/plc/14/10/1/ /products/plcs_and_software/1 [R=301,L]
RewriteRule /search/+/all+manufacturers/+/computer/3/10/1/ /products/indnettrial_pcs_and_hmis/1 [R=301,L]
RewriteRule /search/+/all+manufacturers/+/monitor/10/10/1/ /products/monitors/1 [R=301,L]
RewriteRule /search/+/all+manufacturers/+/robot/16/10/1/ /products/robots/1 [R=301,L]
RewriteRule /search/+/all+manufacturers/+/power+supply/15/10/1/ /products/power_supplies/1 [R=301,L]
RewriteRule /search/+/all+manufacturers/+/safety+equipment/17/10/1/ /products/safety_equipment/1 [R=301,L]
RewriteRule /search/+/all+manufacturers/+/comms/34/10/1/ /products/communications/1 [R=301,L]

# CONTACT
RewriteRule /enquire.php /general_enquiry [R=301,L]

# HELP
RewriteRule /help_policies.php /help/policies [R=301,L]

# SITE MAP
RewriteRule /visual_sitemap.php /site_map [R=301,L]

# REDIRECT OLD EQUIPMENT URLS TO NEW ONES
# Old Structure: equipment/<unique-id>/<brand>/<category>/<product-id>/<product-id-slug>/
# New Structure: equipment/<brand>-<product-id>/<unique-id>
RewriteRule ^equipment/([^/]+)/([^/]+)/([^/]+)/([^+]+)\+([^-]+)-([^/]+)/([^/]+)/$ http://website.com/equipment/$2-$4_$5_$6/$1 [R=301,L]
RewriteRule ^equipment/([a-zA-Z0-9_-]+)/([0-9]+)$ http://website.com/equipment/index.php?name=$1&id=$2 [L]

# REMOVE PHP EXTENSIONS
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ $1.php [L,QSA]

g1smd




msg:4285017
 12:45 pm on Mar 21, 2011 (gmt 0)

Rule order looks OK. RewriteRule pattern matching is localised "per-directory" and therefore the rule cannot "see" the leading slash of a request.

The 301 redirect target needs to include the protocol and the domain name, not just the path part of the redirected-to URL.

The internal rewrite (second from last rule) with the [L] flag should not include the domain name as that makes it into a 302 redirect. You need an internal rewrite here.

"I'm not sure exactly what makes the difference between an internal and external rewrite"
Ouch! That's a big problem.

A redirect tells the browser to make a new request for a different URL. The URL shown in the browser URL bar will change when that request is made. A redirect is a URL to URL translation. The rule target is a URL with domain name included. It uses the [R=301,L] flags for a 301 redirect. If you forget to add the R=301 flag, you get a 302 redirect.

An internal rewrite accepts an incoming URL request and silently translates it to get the content from some non-default location inside the server, without revealing to the outside world what that location is. A rewrite is a URL to filepath mapping. The rule uses the [L] flag and does not include the domain name. Including the domain name makes it a 302 redirect, not an internal rewrite.
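
Side by side, the two kinds of rule might look like the sketch below. Both rules are adapted from ones posted in this thread, with www.example.com as a placeholder host; treat them as an illustration of the difference rather than a drop-in fix.

```apache
# External redirect: URL-to-URL translation. The browser is told to request
# the new URL, so the address bar changes. The target includes the protocol
# and host name, and the flags include R=301.
RewriteRule ^visual_sitemap\.php$ http://www.example.com/site_map [R=301,L]

# Internal rewrite: URL-to-filepath mapping. The browser never sees the
# target. No protocol or host name in the target, and no R flag.
RewriteRule ^equipment/([a-zA-Z0-9_-]+)/([0-9]+)$ equipment/index.php?name=$1&id=$2 [L]
```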

Tom_Cash




msg:4285066
 2:25 pm on Mar 21, 2011 (gmt 0)

Thanks for such a comprehensive reply. Like I said, I'm very new to RegEx. I know what I want to do, and why I want to do it... I just don't have the RegEx know-how yet.

I've continued to follow your excellent advice, and a lot of it is now working. Here's my file so far with just the bits that still don't work.

# ERROR DOCUMENTS
...
RewriteEngine On

# HOME
...

# REDIRECT OLD CONTENT
# PRODUCT GROUPS
RewriteRule search/+/all+manufacturers/+/cnc/2/10/1/ http://website.com/products/cncs/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/drive/4/10/1/ http://website.com/products/drives/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/servo+drive/18/10/1/ http://website.com/products/servos/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/encoder+%26+resolver/5/10/1/ http://website.com/products/motors_and_encoders/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/plc/14/10/1/ http://website.com/products/plcs_and_software/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/computer/3/10/1/ http://website.com/products/indnettrial_pcs_and_hmis/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/monitor/10/10/1/ http://website.com/products/monitors/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/robot/16/10/1/ http://website.com/products/robots/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/power+supply/15/10/1/ http://website.com/products/power_supplies/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/safety+equipment/17/10/1/ http://website.com/products/safety_equipment/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/comms/34/10/1/ http://website.com/products/communications/1 [R=301,L]

# CONTACT
...

# HELP
...

# SITE MAP
...

# REDIRECT OLD EQUIPMENT URLS TO NEW ONES
# Old Structure: equipment/<unique-id>/<brand>/<category>/<product-id>/<product-id-slug>/
# New Structure: equipment/<brand>-<product-id>/<unique-id>
RewriteRule ^equipment/([^/]+)/([^/]+)/([^/]+)/([^+]+)\+([^-]+)-([^/]+)/([^/]+)/$ http://website.com/equipment/$2-$4_$5_$6/$1 [R=301,L]
RewriteRule ^equipment/([a-zA-Z0-9_-]+)/([0-9]+)$ http://website.com/equipment/index.php?name=$1&id=$2 [L]

# REMOVE PHP EXTENSIONS
...


Here's the full file:

# ERROR DOCUMENTS
ErrorDocument 400 http://website.com/error
ErrorDocument 401 http://website.com/error
ErrorDocument 403 http://website.com/error
ErrorDocument 404 http://website.com/error
ErrorDocument 500 http://website.com/error

RewriteEngine On

# HOME
RewriteRule index http://website.com [R=301,L]

# REDIRECT OLD CONTENT
# PRODUCT GROUPS
RewriteRule search/+/all+manufacturers/+/cnc/2/10/1/ http://website.com/products/cncs/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/drive/4/10/1/ http://website.com/products/drives/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/servo+drive/18/10/1/ http://website.com/products/servos/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/encoder+%26+resolver/5/10/1/ http://website.com/products/motors_and_encoders/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/plc/14/10/1/ http://website.com/products/plcs_and_software/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/computer/3/10/1/ http://website.com/products/indnettrial_pcs_and_hmis/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/monitor/10/10/1/ http://website.com/products/monitors/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/robot/16/10/1/ http://website.com/products/robots/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/power+supply/15/10/1/ http://website.com/products/power_supplies/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/safety+equipment/17/10/1/ http://website.com/products/safety_equipment/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/comms/34/10/1/ http://website.com/products/communications/1 [R=301,L]

# CONTACT
RewriteRule enquire.php http://website.com/general_enquiry [R=301,L]

# HELP
RewriteRule help_policies.php http://website.com/help/policies [R=301,L]

# SITE MAP
RewriteRule visual_sitemap.php http://website.com/site_map [R=301,L]

# REDIRECT OLD EQUIPMENT URLS TO NEW ONES
# Old Structure: equipment/<unique-id>/<brand>/<category>/<product-id>/<product-id-slug>/
# New Structure: equipment/<brand>-<product-id>/<unique-id>
RewriteRule ^equipment/([^/]+)/([^/]+)/([^/]+)/([^+]+)\+([^-]+)-([^/]+)/([^/]+)/$ http://website.com/equipment/$2-$4_$5_$6/$1 [R=301,L]
RewriteRule ^equipment/([a-zA-Z0-9_-]+)/([0-9]+)$ http://website.com/equipment/index.php?name=$1&id=$2 [L]

# REMOVE PHP EXTENSIONS
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ $1.php [L,QSA]

[edited by: Tom_Cash at 2:34 pm (utc) on Mar 21, 2011]

g1smd




msg:4285070
 2:30 pm on Mar 21, 2011 (gmt 0)

The ErrorDocument directives must NOT include the domain name otherwise they will all return a 302 Found response.

ErrorDocument 404 /error404.php or similar is what you need.


The pattern search/+/all+manufacturers/+/cnc/2/10/1/ is looking for a URL request like: example.com/search/////////////allllllllllllmanufacturers///////////////cnc/2/10/1/

If there is a literal "+" in the URL request, it needs to be escaped \+ with the backslash.

Literal periods in patterns also need to be escaped.

The second last rule presents as a 302 redirect. That should be an internal rewrite. Lose the domain name from the rule target. Retain the [L] flag.

[edited by: g1smd at 2:42 pm (utc) on Mar 21, 2011]
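
As a sketch of the escaping described above, the rules below assume the old URLs really do contain literal plus signs exactly as written in the original rules; www.example.com is a placeholder host. The last rule relies on the fact that RewriteRule matches against the %-decoded path, so a %26 in the request would be seen as "&"; whether the old URLs actually contain %26 is an assumption taken from the rule as posted.

```apache
# Unescaped, "/+" means "one or more slashes" and "l+" means "one or more l".
# Escaped, "\+" matches a literal plus sign and "\." a literal period.
RewriteRule ^search/\+/all\+manufacturers/\+/cnc/2/10/1/$ http://www.example.com/products/cncs/1 [R=301,L]
RewriteRule ^enquire\.php$ http://www.example.com/general_enquiry [R=301,L]

# If the request contains %26, the pattern has to match the decoded "&".
RewriteRule ^search/\+/all\+manufacturers/\+/encoder\+&\+resolver/5/10/1/$ http://www.example.com/products/motors_and_encoders/1 [R=301,L]
```

The ^ and $ anchors are an addition here; without them a pattern matches anywhere inside the requested path.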

Tom_Cash




msg:4285073
 2:38 pm on Mar 21, 2011 (gmt 0)

Thanks again mate.

"ErrorDocument 404 /error404.php or similar is what you need."

This didn't work for some reason. Nor did:
ErrorDocument 404 error404.php

Unusual...

I made the other change you recommended, still not much luck.

Here's the code:

# REDIRECT OLD EQUIPMENT URLS TO NEW ONES
# Old Structure: equipment/<unique-id>/<brand>/<category>/<product-id>/<product-id-slug>/
# New Structure: equipment/<brand>-<product-id>/<unique-id>
RewriteRule ^equipment/([^/]+)/([^/]+)/([^/]+)/([^+]+)\+([^-]+)-([^/]+)/([^/]+)/$ http://new.lektronix.net/equipment/$2-$4_$5_$6/$1 [R=301,L]
RewriteRule ^equipment/([a-zA-Z0-9_-]+)/([0-9]+)$ equipment/index.php?name=$1&id=$2 [L]

Tom_Cash




msg:4286251
 1:44 pm on Mar 23, 2011 (gmt 0)

Bumpity bump.

g1smd




msg:4286433
 7:10 pm on Mar 23, 2011 (gmt 0)

If you request the old URL, does the server send a redirect response?

If it does not, then the pattern is wrong. If it does, what is the URL you are redirected to? Is that the right URL for the new location? If not, in what way does it differ from what you expect?


If you request the new URL, does that result in the correct content being shown? If not, what is shown?

Tom_Cash




msg:4288505
 2:30 pm on Mar 28, 2011 (gmt 0)

Thanks for another reply mate. :)

Something really odd has happened. Since last looking at the code, things are working differently!

I swear it's the server, being weird...

Now, say you append equipment/16955/siemens/plc/6es5+322-0aa11/6es5322011/ to the url of the website, you get taken to the following:

http://www.example.com/?name=siemens-6es5_430_7la11&id=3403

Borderline perfect... How can I fix this?

The code so far...

# ERROR DOCUMENTS
ErrorDocument 400 http://example.com/error
ErrorDocument 401 http://example.com/error
ErrorDocument 403 http://example.com/error
ErrorDocument 404 http://example.com/error
ErrorDocument 500 http://example.com/error

RewriteEngine On

# HOME
RewriteRule index http://example.com [R=301,L]

# REDIRECT OLD CONTENT
# PRODUCT GROUPS
RewriteRule search/+/all+manufacturers/+/cnc/2/10/1/ http://example.com/products/cncs/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/drive/4/10/1/ http://example.com/products/drives/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/servo+drive/18/10/1/ http://example.com/products/servos/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/encoder+%26+resolver/5/10/1/ http://example.com/products/motors_and_encoders/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/plc/14/10/1/ http://example.com/products/plcs_and_software/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/computer/3/10/1/ http://example.com/products/indnettrial_pcs_and_hmis/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/monitor/10/10/1/ http://example.com/products/monitors/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/robot/16/10/1/ http://example.com/products/robots/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/power+supply/15/10/1/ http://example.com/products/power_supplies/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/safety+equipment/17/10/1/ http://example.com/products/safety_equipment/1 [R=301,L]
RewriteRule search/+/all+manufacturers/+/comms/34/10/1/ http://example.com/products/communications/1 [R=301,L]

# CONTACT
RewriteRule enquire.php http://example.com/general_enquiry [R=301,L]

# HELP
RewriteRule help_policies.php http://example.com/help/policies [R=301,L]

# SITE MAP
RewriteRule visual_sitemap.php http://example.com/site_map [R=301,L]

# REDIRECT OLD EQUIPMENT URLS TO NEW ONES
# Old Structure: equipment/<unique-id>/<brand>/<category>/<product-id>/<product-id-slug>/
# New Structure: equipment/<brand>-<product-id>/<unique-id>
RewriteRule ^equipment/([^/]+)/([^/]+)/([^/]+)/([^+]+)\+([^-]+)-([^/]+)/([^/]+)/$ http://example.com/equipment/$2-$4_$5_$6/$1 [R=301,L]
RewriteRule ^equipment/([a-zA-Z0-9_-]+)/([0-9]+)$ equipment/index.php?name=$1&id=$2 [L]

# REMOVE PHP EXTENSIONS
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ $1.php [L,QSA]


Cheers!

g1smd




msg:4288507
 2:36 pm on Mar 28, 2011 (gmt 0)

There appears to be an unwanted "index.php to / external redirect" that occurs AFTER the URL has been rewritten to the internal server filepath with attached parameters.

This unwanted redirect "exposes" the rewritten URL back out on to the web. This is often a problem when you have bits of mod_rewrite code in the root .htaccess file and other bits in separate .htaccess files in various folders.

In this case, however, it is probably this rule that affects things:
RewriteRule index http://example.com [R=301,L]

After the URL request has been rewritten to point to the internal server path at index.php?parameters (by the code near the very end of your ruleset), mod_rewrite processing starts again and this "index rule" unfortunately matches the current value of that internal pointer and the rule therefore redirects to strip the index part from that path. In doing so, it exposes the parameters back out on to the web as a new URL.

If you use the Live HTTP Headers extension for Firefox you will see the double redirect that this produces.

The fix is to only do the index redirect for Direct Client (i.e. external) Requests. Change it to this new code:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.php\ HTTP/
RewriteRule ^(([^/]+/)*)index\.php$ http://www.example.com/$1 [R=301,L]


Additionally, remove the domain name from all of your ErrorDocument directives. With a domain name in the directive, they each return an incorrect 302 status.

httpwebwitch




msg:4288519
 3:04 pm on Mar 28, 2011 (gmt 0)

one of my favourite examples of URL fluff:

www.amazon.com/Snape-Kills-Dumbledore-on-page-606/dp/0439785960/

It's a URL structuring problem that's been around as long as word-stuffing SEO tactics, and it's done wrongly more often than it's done right.

g1smd




msg:4288521
 3:05 pm on Mar 28, 2011 (gmt 0)

Lovely. If I get accepted to speak at SMX London in May, do you mind if I use that example?

I had others, but that's a classic.

chewy




msg:4288671
 7:50 pm on Mar 28, 2011 (gmt 0)

Just in case no one else says this...

You 2 wizards are better than any Dumbledore!

I sure hope I get to read the case somewhere with real world examples writ large.

And... if they don't accept one or both of you for SMX, there is something seriously wrong with the system.

jdMorgan




msg:4288727
 9:50 pm on Mar 28, 2011 (gmt 0)

Once the redirection and rewriting stuff is working, do not forget to re-address the ErrorDocument problem. Using a protocol and domain name in ErrorDocument directives results in a server status of 302 being returned for ALL erroneous requests. To be clear, when requesting a bogus URL, Googlebot will see a 302 redirect when you want it to see a 404-Not Found or 410-Gone. This is nothing short of SEO suicide...

Jim
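
To make the two forms concrete, here is a minimal sketch. It assumes the error handler is a file called error.php in the document root (the file name is an assumption for illustration) and uses www.example.com as a placeholder host.

```apache
# Local URL-path: Apache serves this document itself and the response keeps
# the original error status (404, 410, ...).
ErrorDocument 404 /error.php

# Full URL: Apache answers with a redirect to that address, so the client
# never receives the original error status.
# ErrorDocument 404 http://www.example.com/error
```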

Tom_Cash




msg:4288881
 8:45 am on Mar 29, 2011 (gmt 0)

Thanks again g1smd... This code has really come along thanks to you! I'm so close to getting this sorted now!

You were right about that index file redirect. I removed that and problem solved. It worked!

However... (why is there always a "however"?)

It only worked for URLs with a structure similar to below:

equipment/16955/siemens/plc/6es5+322-0aa11/6es5322011/

This is my fault for not making things clear earlier on. I can fully understand what you were asking, now that it works.

The part in bold (the product id, 6es5+322-0aa11 in the example above) does not always have a plus or minus sign. Other examples that cause a 404 are below:

equipment/23049/lust/servo+drive/mc7408/mc7408/
equipment/65217/mitsubishi/drive/fr-a520-3.7k/fra52037k/

So I thought I'd try and do it myself, with no luck. My theory was to make the product id (in bold) one parameter.

Here's my attempt:

# REDIRECT OLD EQUIPMENT URLS TO NEW ONES
# Old Structure: equipment/<unique-id>/<brand>/<category>/<product-id>/<product-id-slug>/
# New Structure: equipment/<brand>-<product-id>/<unique-id>
RewriteRule ^equipment/([^/]+)/([^/]+)/([^/]+)/([a-zA-Z0-9_-]+)/([^/]+)/$ http://new.lektronix.net/equipment/$2-$4/$1 [R=301,L]
RewriteRule ^equipment/([a-zA-Z0-9_-]+)/([0-9]+)$ equipment/index.php?name=$1&id=$2 [L]


... It didn't work.

Any advice?

As for this....

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.php\ HTTP/
RewriteRule ^(([^/]+/)*)index\.php$ http://www.example.com/$1 [R=301,L]


No luck. :\ Am I right in thinking this replaces my old index script?

"Once the redirection and rewriting stuff is working, do not forget to re-address the ErrorDocument problem. Using a protocol and domain name in ErrorDocument directives results in a server status of 302 being returned for ALL erroneous requests. To be clear, when requesting a bogus URL, Googlebot will see a 302 redirect when you want it to see a 404-Not Found or 410-Gone. This is nothing short of SEO suicide..."

Cheers for your input jdMorgan, much appreciated.

I'm really trying to get that sorted but to no avail. I've tried the following options, none of which work:

ErrorDocument 400 /error
ErrorDocument 400 error
ErrorDocument 400 /error.php
ErrorDocument 400 error.php


I'm a little confused about how it works without the main URL involved.

g1smd




msg:4288906
 8:58 am on Mar 29, 2011 (gmt 0)

What is the filename of the file that delivers your error messages? Refer to that filename in the ErrorDocument. You refer to it as a local file to serve in the event of an error, not as a URL.

As for your pattern. Now that the pattern with + and _ is working, you now need a separate pattern in a separate rule to deal with each of the other variants of the product code syntax. Do not amend the pattern that works. That is done and dusted for URLs with *that* format.

Construct new rules for URLs with *other* formats.

Tom_Cash




msg:4288944
 11:21 am on Mar 29, 2011 (gmt 0)

The filename that delivers the 404 is a PHP file called error.php... Unless this shouldn't/can't be done?

Regarding the pattern, I can see where you're coming from with writing other variants but we have 63 thousand products and the old developer just let anything get through in any order.

Coming up with a rule for each combination is going to be a lot of work... It doesn't seem feasible?

Is there any way of making product id (in bold) a generic string which allows a-zA-Z0-9-_+. through at once? I can't seem to make it stick.

equipment/16955/siemens/plc/6es5+322-0aa11/6es5322011/

This is my code so far:

# REDIRECT OLD EQUIPMENT URLS TO NEW ONES
# This one works for urls like this: equipment/16955/siemens/plc/6es5+322-0aa11/6es5322011/
# Old Structure: equipment/<unique-id>/<brand>/<category>/<product-id>/<product-id-slug>/
# New Structure: equipment/<brand>-<product-id>/<unique-id>
RewriteRule ^equipment/([^/]+)/([^/]+)/([^/]+)/([^+]+)\+([^-]+)-([^/]+)/([^/]+)/$ http://example.com/equipment/$2-$4_$5_$6/$1 [R=301,L]
RewriteRule ^equipment/([a-zA-Z0-9_-]+)/([0-9]+)$ equipment/index.php?name=$1&id=$2 [L]

# REDIRECT OLD EQUIPMENT URLS TO NEW ONES
# This one works for urls like this: equipment/23049/lust/servo+drive/mc7408/mc7408/
RewriteRule ^equipment/([^/]+)/([^/]+)/([^/]+)/([^+]+)/([^/]+)/$ http://example.com/equipment/$2-$4/$1 [R=301,L]
RewriteRule ^equipment/([a-zA-Z0-9_-]+)/([0-9]+)$ equipment/index.php?name=$1&id=$2 [L]


It works well for the following types of URL:

equipment/16955/siemens/plc/6es5+322-0aa11/6es5322011/
equipment/23049/lust/servo+drive/mc7408/mc7408/

I could really do with something more generic.

Thanks again for your help so far.

g1smd




msg:4289222
 7:59 pm on Mar 29, 2011 (gmt 0)

Although there are 63 000 products, there might only be a few rule formats required.

Maybe this is the time to tighten up exactly what you allow in a URL?

The problem comes from URLs where something "changes" e.g. where "-" becomes "+" or vice versa. That is why a single "generic" rule may be "too difficult".

Tom_Cash




msg:4289528
 3:19 pm on Mar 30, 2011 (gmt 0)

Thanks for a prompt reply.

"Although there are 63 000 products, there might only be a few rule formats required."

I'm not sure you know... We have over 3000 manufacturers who all have their own way of writing product numbers. There could be any number of combos.

"Maybe this is the time to tighten up exactly what you allow in a URL?"

I agree and I'm already trying. I've written a really explicit slug script in PHP to ensure no future URLs will have anything other than a-z0-9_- in them.

It's a very strict new structure. I know what I want to do, just struggle to execute it with RegEx.

I'm just trying to 301 all the old content so that our complete Google index of over 30,000 URLs doesn't become redundant on the changeover.

"The problem comes from URLs where something "changes" e.g. where "-" becomes "+" or vice versa. That is why a single "generic" rule may be "too difficult"."

Would this still be the case even though all the old URL structures will eventually disappear?

g1smd




msg:4289730
 7:49 pm on Mar 30, 2011 (gmt 0)

Can you identify a number of "patterns" in the stuff you want to redirect?

Can you write some rules for translating characters when redirecting: 0-9 stays as 0-9, a-z stays as a-z and retains case, but what happens to + - _ , ; and other characters? What are those mapped to?

It is a case of breaking the whole list of URLs into sets and writing a pattern (hence rule) for each set.

swa66




msg:4289971
 7:43 am on Mar 31, 2011 (gmt 0)

The best possible advice one could give you, in my opinion, is to learn regexps so you understand them fully for yourself.
It's not that hard, and considering how long this thread has gone already, you'd have that understanding by now.

The whole issue -given the full dataset- is to figure out for yourself if you're going to be able to come up with a simple enough set of rules in the end.

We can't decide that for you.

As an alternative, in case the problem is too complex for a "few" rewrite rules with a "few" regexps to handle, you could instead use a RewriteMap in mod_rewrite. A RewriteMap can be backed by several kinds of file; a dbm file is the one you want. This will give you a very efficient "database" that you can use in mod_rewrite to map old to new. A dbm file is a hashed lookup tree in a binary format and it is designed to be fast. It should handle 50K entries easily enough, but the "flavor" of dbm you're using can limit how big it will let you grow the database.
Note that both the supported set of flavors and the default flavor of the dbm file are compile-time choices. If you didn't compile Apache yourself, you might need to figure out how it was compiled if you're going to generate the dbm file elsewhere.

[httpd.apache.org...]

[click on the right on rewritemap for yourself, the redirect here at WebmasterWorld eats the #rewritemap unfortunately.]
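
A minimal sketch of the RewriteMap approach, for anyone following along. Everything specific in it is an assumption for illustration: the map name equipmentmap, the file path, the key/value layout and the www.example.com host. RewriteMap has to be declared in the main server or virtual-host configuration, not in .htaccess, so this only helps if you have access to that level of configuration.

```apache
# --- httpd.conf or <VirtualHost> (RewriteMap cannot be declared in .htaccess) ---
RewriteEngine On

# Hypothetical map: key = old path after "equipment/", value = new path.
# A plain-text source file would hold one "key value" pair per line, e.g.
#   16955/siemens/plc/6es5+322-0aa11/6es5322011/  siemens-6es5_322_0aa11/16955
# and can be converted to dbm with the httxt2dbm utility that ships with httpd.
RewriteMap equipmentmap dbm:/etc/apache2/equipment.map

# Redirect only when the lookup returns a value; unknown paths fall through
# to the remaining rules. In server or virtual-host context the pattern
# sees the leading slash, unlike in .htaccess.
RewriteCond ${equipmentmap:$1} !^$
RewriteRule ^/equipment/(.+)$ http://www.example.com/equipment/${equipmentmap:$1} [R=301,L]
```

The trade-off is that every old URL needs its own map entry, generated once by a script, in exchange for not having to find a regular expression that covers every product-code format.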

g1smd




msg:4289985
 8:18 am on Mar 31, 2011 (gmt 0)

There are several ways to solve this, but it is likely that a smallish number of RewriteRule directives can cover all possible URL patterns.

The problem isn't one of coding; it is merely a matter of the OP spending the time to identify the patterns and list them out.

Tom_Cash




msg:4308321
 3:40 pm on May 5, 2011 (gmt 0)

It's been ages since I've been here... I took the advice of swa66 and went and learnt, in more depth, Reg Ex:

I figured out what I wanted to do and I did it! Very very pleased. For anyone who is interested, here's my final code:

RewriteRule ^equipment/([^/]+)/([^/]+)/([^/]+)/(.+)/([^/]+)/$ http://www.example.com/equipment/$2-$4/$1 [R=301,L]
RewriteRule ^equipment/(.+)/([0-9]+)$ equipment/index.php?name=$1&id=$2 [L]


Because the old developer was using all sorts of characters in variable 4, I had to use the . followed by the + because of URLs similar to below:

equipment/71571/aaeon_technology/hmi-pc+based/aaeon+pc+based+touchscreen+unit/aaeonpcbasedtouchscreenunit/

On the new website, that more simply becomes:

equipment/aaeon+technology-aaeon+pc+based+touchscreen+unit/71571

Then I catch the variables in the PHP and cut out any duplicate URLs...

It's been a long thread, but I got there... *sigh*

Thanks for your help, EVERYONE! :D (Especially g1smd!)

g1smd




msg:4308461
 8:23 pm on May 5, 2011 (gmt 0)

Has this been fully tested? It looks a little too simplistic to work for all situations and URL combinations.

In particular, does it absolutely nail the conversions of + and - and other characters between old and new URL?

I am not sure why you think that (.+) would be a better choice than ([^/]+) here? What processes led to that decision?

Tom_Cash




msg:4308641
 8:32 am on May 6, 2011 (gmt 0)

I used .+ because it can deal with the '+' within the string, whereas [^/]+ couldn't...

I've not tested all 63000 URLs, but I have tested over 200 cases and counting...

Do you think .+ is a bad idea?

g1smd




msg:4308946
 8:12 pm on May 6, 2011 (gmt 0)

(.+) means "any character, one or more times" and is greedy. It reads to the end of the whole URL then has to back off and retry hundreds of times.

([^/]+)/ means "not a slash one or more times, followed by slash". It stops at the next slash.
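
Applied to the final rules posted above, the negated character class might look like the sketch below. It assumes the product id never contains a slash, and www.example.com is a placeholder host; it has not been tested against the full set of old URLs.

```apache
# Each ([^/]+) captures exactly one path segment and stops at the next slash.
# A literal "+" inside a segment is still matched, because "+" is not a slash.
RewriteRule ^equipment/([^/]+)/([^/]+)/([^/]+)/([^/]+)/([^/]+)/$ http://www.example.com/equipment/$2-$4/$1 [R=301,L]
RewriteRule ^equipment/([^/]+)/([0-9]+)$ equipment/index.php?name=$1&id=$2 [L]
```

For a request such as equipment/71571/aaeon_technology/hmi-pc+based/aaeon+pc+based+touchscreen+unit/aaeonpcbasedtouchscreenunit/, the fourth group captures aaeon+pc+based+touchscreen+unit in a single pass, with no backtracking from the end of the URL.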

Tom_Cash




msg:4309200
 7:51 am on May 7, 2011 (gmt 0)

I must not have been implementing it properly because it worked on the second attempt... How odd...

Thanks for that buddy. I really appreciate all the help you've given me here.

