
rewriting filenames that change name over time, but have a constant ID

3:38 pm on May 24, 2012 (gmt 0)

I maintain a web directory which houses about 3000 "user owned listings" which can be updated at any time by the listing owner.

The URL of the listing is established as follows:


The trailing "-ID.html" will always remain fixed and is permanent; however, the LISTING_OWNER_NAME will change on occasion.

I've been partially successful in setting up a RewriteCond/Rule set which will sometimes 301 the old LISTING_OWNER_NAME to the new URL which has an updated LISTING_OWNER_NAME:

RewriteRule ^(.*)ID.html$ /subdirectory/LISTING_OWNER_NAME-ID.html [R=301,L]

(.htaccess is located in /subdirectory)

The above works when the new LISTING_OWNER_NAME is of equal or lesser character length than the previous LISTING_OWNER_NAME, but fails (404s) when the new LISTING_OWNER_NAME is longer.

I've tried a number of different ways to account for this, but have been stumped. I was hoping someone might shed some light on this?
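One detail worth noting (a sketch of one contributing factor, not a full diagnosis of the length-dependent 404): the pattern `^(.*)ID.html$` also matches its own redirect target, so after the 301 the same rule can fire again on the new URL. This can be simulated with Python's `re` module, using a literal ID of 9999 and invented owner names for illustration:

```python
import re

# The shape of the pattern from the rule above, with a literal ID of 9999
pattern = re.compile(r'^(.*)9999\.html$')

# An old URL-path matches, as intended...
assert pattern.match('old-owner-name-9999.html') is not None

# ...but the redirect target matches too, so after the external 301 the
# rule can match again on its own output
assert pattern.match('new-owner-name-9999.html') is not None

# A guard RewriteCond (or a pattern that cannot match the canonical URL)
# is needed to stop the rule re-firing on its own target
```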
8:40 pm on May 24, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

You should do this redirect on the page itself.

If the site is database driven it's a trivial few lines of code added to the PHP script.

If it's static .html pages, and you can invoke the PHP parser for .html pages (or rename the files to have a .php extension), you can use htaccess to detect the ID number in the requested URL and rewrite that request to the internal file that will serve the content. Then, on the page itself, compare the requested URL with what the URL should have been (define it at the top of the page) and redirect to that canonical URL if it differs from the URL that was requested.

For this to work, you store the actual files on the hard drive with just the ID number as their name (and the .html or .php extension).

In htaccess:
RewriteRule ^([^-]+-)+([0-9]+)\.html$ /$2.php [L]
(or /$2.html [L] if the files keep the .html extension)
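The single pattern can be sanity-checked outside Apache; Python's `re` engine handles this expression with the same syntax (the file names below are illustrative):

```python
import re

# Same pattern as the RewriteRule above
pattern = re.compile(r'^([^-]+-)+([0-9]+)\.html$')

# Any owner name in front of the ID maps to the same numbered target
for url in ('old-name-44.html', 'a-much-longer-new-name-44.html'):
    m = pattern.match(url)
    assert m is not None
    assert m.group(2) == '44'   # $2 in the RewriteRule, i.e. the target 44.php

# A bare numbered request does not match (there is no "name-" part)
assert pattern.match('44.html') is None
```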

In the php or html file:
define('THIS_PAGE_CANONICAL', '/name-of-page-44.html');
$requestedPage = $_SERVER['REQUEST_URI'];

if ($requestedPage !== THIS_PAGE_CANONICAL) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.example.com' . THIS_PAGE_CANONICAL);
    exit;
}

The above code goes at the very top of the file.

Remember this. URLs are used "out there" on the web. Filenames are used "here" inside the server. They are not at all the same thing, merely related by the server configuration.

In this case, the key is in using filenames that consist only of the ID number, the file having the URL definition at the top of the file and then comparing the requested URL with what it should have been. The rewrite connects the requested URL with the actual numbered file that will serve the content.
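The whole request flow described above can be sketched as a small simulation (in Python rather than PHP, with an invented lookup table, purely to show the control flow):

```python
import re

# Invented data: ID -> current canonical URL-path
CANONICAL = {
    '44': '/new-owner-name-44.html',
    '9999': '/another-listing-9999.html',
}

URL_PATTERN = re.compile(r'^/(?:[^-/]+-)+([0-9]+)\.html$')

def handle_request(path):
    """Return ('404',), ('301', location) or ('200', internal_file)."""
    m = URL_PATTERN.match(path)
    if m is None or m.group(1) not in CANONICAL:
        return ('404',)
    canonical = CANONICAL[m.group(1)]
    if path != canonical:
        # Non-canonical name: one 301 hop to the current URL
        return ('301', canonical)
    # Canonical request: serve the numbered file that holds the content
    return ('200', '/%s.php' % m.group(1))

# Old or misspelled names redirect; the current name is served directly
assert handle_request('/old-owner-name-44.html') == ('301', '/new-owner-name-44.html')
assert handle_request('/new-owner-name-44.html') == ('200', '/44.php')
assert handle_request('/unknown-listing-123.html') == ('404',)
```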

Get the rewrite working first. Try adding the redirect code after that, starting with a single test file.

The code above does NOT take into account that the files may be located in folders on your site. That's specific implementation detail for you to work through.
11:19 pm on May 24, 2012 (gmt 0)

g1smd: thank you for taking the time to put forth an interesting alternative!

I'm not sure if you're saying it's not possible to do what I was seeking to do in particular, or rather that your method is better in some manner?

I'd be all for improving the overall configuration to be more efficient, but for the time being was kind of just looking for a solution for the current configuration (even though it might not be optimal) ... especially given my workload at the moment.

However, I would certainly be all for ultimately migrating into a more efficient paradigm as you may have just provided ... although I'm a little sketchy on the RewriteRule you've provided:

RewriteRule ^([^-]+-)+([0-9]+)\.html$ /$2.php [L] (or /$2.html [L])

... specifically, is this a single RewriteRule meant to account for all pages via one regular expression?

It seems so, since I did test it out with one page (and of course came up with a 404 on 9999.html until I renamed name-of-file-9999.html to 9999.html).

I can see how this would make for a much shorter & more efficient .htaccess file! I've actually been trying to evaluate the current overhead and scalability limitations with respect to my current configuration -- which has an .htaccess file which now has slightly over 6000 lines (3000 pairs) and is constantly growing.

I'm assuming that invoking the PHP parser would still be more efficient than a large .htaccess file.

If I understand your code block to be placed at the top of the individual pages correctly, it won't 301 unless the requested page does not match the "name-of-page-ID".

Yes, it is a database driven site, but the pages are automatically created as static .html pages upon addition or modification of any given record (or in batch if I elect to do so).

It would take a bit of doing to convert things over, which, as I alluded to above, I don't have time for at the moment (but will in another couple of months). So I'm still kind of looking for a band-aid for the moment, but certainly grateful for your relaying/illustrating what looks to be a much more scalable/elegant pathway. For that matter, I might even reconsider whether or not I actually need "static" pages to begin with and revamp at a much deeper level.

Thanks again.
5:40 am on May 25, 2012 (gmt 0)

g1smd

Yes. One RewriteRule does for the whole site.

If it was a pure database driven site, the bit of PHP at the top of the page would be very slightly different. It would look in the database for "name of page" rather than having it hard coded at the top of the page. I use this system a lot.

I couldn't imagine trying to redirect by having thousands of individual rules in the htaccess file. It looks unmanageable.

In the code above, each file defines the URL that is acceptable to access it. The URL isn't defined in the htaccess file; rather, all potential URLs for any file are rewritten ("mapped") to the appropriate file. The file content will be displayed only if the requested URL was 100% correct. If it was not, the file issues a redirect to the correct URL and the process starts over again.

This has the effect that links from other pages and backlinks from other sites will continue to work after the URL is changed. Additionally, you can run Xenu LinkSleuth over your site and it will quickly find any links you forgot to update - they will appear in the "list of redirected URLs" report.

I'll say this method is miles better (but I am biased).

If there was no way to turn on PHP, another way to do it would be to place a rel="canonical" tag at the top of the page and use javascript to perform a browser redirect to the canonical URL. This method is a long way from optimum as it's not a pure 301 redirect. This would still use the same htaccess file and require the filenames to be just the ID number.

I can't think why you get the current results that you get at the moment. Nothing springs to mind.
1:33 pm on May 25, 2012 (gmt 0)

g1smd

Typo: _SERVER should have been $_SERVER

Spelling: 'continute' should be 'continue'.
11:47 pm on May 25, 2012 (gmt 0)

My thousands of redirects aren't really too much of a management issue, since a script automatically rewrites the entire .htaccess file upon update/addition/deletion of a record. However, having one RewriteRule looks to be way more efficient, at least regarding the .htaccess processing overhead, and it also leaves less of a chance of a syntax error surfacing in the file (due to unexpected characters creeping into the rules) and crashing any part of the site which relies upon it. I'm still not sure if PHP parsing overhead would make it a "wash" or still allow for a gain in efficiency?

The scripts which make calls to the database are actually written in Perl, so I'm assuming I'd be able to come up with equivalent code (if I ultimately migrate to a dynamically driven, rather than static, record display page). Perhaps SSI could work if I keep them static, though SSI might be an even bigger resource hog than PHP. If all else fails, I could always have just the record page be written in PHP (if I migrate away from static pages), even if the rest of the scripts are Perl based.

My current RewriteRules already perform exactly as your method does with respect to still having "old" backlinks work which are slightly off (minus the case I've raised in this post, when a record is modified to have a longer "name-of-page"). I've learned to view this as essential insofar as it helps with Google and backlinks on the rest of the web. Still, on the face of it, I do like your method more! Again, I'm just pressed for time and try to be conservative about not changing a configuration such as this too hastily, for fear of unintended consequences.

Yeah, I don't believe I'd be into using the javascript rel="canonical" approach ... I'd actually prefer my existing bloated .htaccess file method, even with its current deficiency (which might be the smaller "hole").

Thanks again for sharing your method ... will probably give it a whirl in another couple/few months!
7:44 am on May 26, 2012 (gmt 0)

g1smd

With the code detailed above, canonicalisation is forced from the URL definition within each page file rather than from an entry in the htaccess file.

One other positive effect of the new code is that when a page file is missing from the server there is an immediate 404 response at the originally requested URL, whether this be the new URL for the page or any of the previous old URLs that might have been associated with it.

With your current system (and where the page file is missing but the htaccess rule for it is still in place), non-canonical requests for missing pages are first redirected to the last known canonical URL, and only after that second request for the page is a 404 response issued.

There's one more thing missing from the code above. The on-page PHP code would also need to test the requested hostname and if it were non-canonical (whether or not the correct path was requested) redirect to the canonical hostname and correct path. This avoids the multiple step redirection chain for non-www requests for an old page name when the site now uses www URLs that would happen if the non-www to www redirect were handled by the htaccess file.
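That hostname check can be folded into the same comparison. Here is a Python sketch of the logic that would live in the on-page PHP (the hostname and paths are invented):

```python
CANONICAL_HOST = 'www.example.com'   # assumed canonical hostname

def redirect_target(requested_host, requested_path, canonical_path):
    """Return the full URL for a single 301 hop, or None if already canonical."""
    if requested_host != CANONICAL_HOST or requested_path != canonical_path:
        # Fix hostname and path together in one redirect, avoiding a chain
        # of non-www -> www -> new-name hops
        return 'http://%s%s' % (CANONICAL_HOST, canonical_path)
    return None

# non-www request for an old name: one hop straight to the canonical URL
assert (redirect_target('example.com', '/old-name-44.html', '/new-name-44.html')
        == 'http://www.example.com/new-name-44.html')

# fully canonical request: no redirect issued
assert redirect_target('www.example.com', '/new-name-44.html', '/new-name-44.html') is None
```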

There are many ways to achieve what you want to do, covering a wide range of server efficiency and ease-of-maintenance factors. I use this stuff all the time, but usually it's a PHP and database driven system. It was an interesting exercise coming up with a way to adapt that for your situation.

Thanks for posting such an interesting question. It makes a change from the normal line of questioning here, most of which has been answered many hundreds and in some cases several thousand times before.
4:21 pm on May 26, 2012 (gmt 0)

I see what you're getting at re: allowing the page to 404 if it's deleted. To this end, I've actually built in a routine that creates a tailored "This listing is no longer active" page upon the deletion of a record, which still shows the name of the listing as well as links to the primary categories it once resided within.

I automatically handle these pages with the .htaccess file as well, by scripting and renaming the file accordingly:


When a listing is deleted, its record moves from the 'active' table into a 'deleted' table in the same database. The previous rewrite rules I mentioned only get created from the 'active' table, while the following rewrite rules (for inactive listings) only get created from the 'deleted' table and appear after all the active cond/rules:

RewriteCond %{REQUEST_FILENAME} !inactiveID.html$
RewriteRule ^(.*)ID.html$ /subdirectory/inactiveID.html [R=301,L]

Again, there's a cond/rule pair for each inactive listing, so this adds to the size of the .htaccess file. I realize this increases the bloat, but I think it might actually help guide visitors by providing positive confirmation that the listing they may have been seeking (via an old backlink) is no longer active, as well as providing internal links to related listings. For example, a point-and-shoot camera from 2003 might no longer be in production, but you could create an "inactive" page which provides spec info, links to support docs, etc., along with a link to the manufacturer's category and even the newer generation of the same product line.
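For what it's worth, if the files were renamed to just ID.html and inactiveID.html (as in the single-rule scheme above), the active/inactive split could in principle be handled by one rule pair instead of one pair per listing. This is an untested sketch, assuming the .htaccess sits at the document root and relying on Apache allowing $N back-references from the RewriteRule pattern inside a preceding RewriteCond test string:

```apache
# Serve the active numbered file when it exists on disk...
RewriteCond %{DOCUMENT_ROOT}/subdirectory/$2.html -f
RewriteRule ^subdirectory/([^-/]+-)+([0-9]+)\.html$ /subdirectory/$2.html [L]

# ...otherwise fall through to the "listing no longer active" page
RewriteRule ^subdirectory/([^-/]+-)+([0-9]+)\.html$ /subdirectory/inactive$2.html [L]
```

If neither file exists, the second rewrite simply targets a missing file and a 404 is returned at the originally requested URL, which matches the behaviour discussed earlier in the thread.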

Since I've been concerned with Google Panda updates recently, I've also placed a <meta name="robots" content="noindex" /> tag at the top of these pages. I'm not sure if it's necessary or if I'm being overly cautious about not creating an inordinate proportion of "thin content"?

Maybe it is somewhat nuanced, but I think all of this tends to help when you add it all up, especially for relatively larger sites. I'm normally able to find answers to typical questions via searching, so this is one of only a handful of posts I've made asking for help ... since it's something which is a bit difficult to find "by name" (I had a hard time coming up with a title).
12:00 am on May 27, 2012 (gmt 0)

g1smd

Pages that technically have "gone" should usually return a "410 Gone" HTTP status, but I'm not so sure whether I'd do that if the page still contains some useful info.

PHP can generate any HTTP status code you want and add it to the HTTP header sent before the HTML page. I use the full range of commonly available codes on the sites I deal with.
4:45 pm on May 28, 2012 (gmt 0)

I have to admit that I don't use all status codes, but I review them from time to time and try to abide by the specifications where possible. "410" seems an interesting one upon further consideration, but perhaps a bit narrow in its real-world application when a page could still provide humans with useful information about what it served in the past. Regardless, I can see how adhering to the specifications for status codes is a very good practice and will generally help your site!

Back to my original question: I had encountered a similar issue when trying to compose the RewriteCond/Rule for the "inactive" pages, which gave me a clue of sorts (though I couldn't quite "connect the dots") as to why the cond/rule wasn't working as I had hoped. The cond/rule below resulted in a 404 for the following request:


RewriteCond %{REQUEST_FILENAME} !ID.html$
RewriteRule ^(.*)ID.html$ /subdirectory/ID.html [R=301,L]

(the actual files were renamed to just "ID.html")

Via trial and error, I appended the "inactive" in front of the "ID" to cause it to 301 as intended.

RewriteCond %{REQUEST_FILENAME} !inactiveID.html$
RewriteRule ^(.*)ID.html$ /subdirectory/inactiveID.html [R=301,L]

Not looking to dwell on this, but I thought maybe this makes the issue more apparent for someone with better "eyes" than myself to spot what might be running through both examples? Maybe there's a need to explicitly define the allowable characters in the regex in front of "ID.html$"?
5:00 pm on May 28, 2012 (gmt 0)

g1smd

Never use (.*) at the beginning or in the middle of a RegEx pattern. It is greedy, promiscuous and ambiguous. It means "read everything to the very end and capture it". The problem is you then have ID.html "after the end", and the parser has to back off and retry ("trial matching") to work out what you meant. This is extremely inefficient.
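The back-off-and-retry behaviour is easy to observe with any regex engine; Python's `re` shows the same matching semantics (the path below is illustrative):

```python
import re

path = 'some-listing-name-9999.html'

# Greedy: (.*) first swallows the whole string, then the engine backtracks
# until '9999.html' can match at the end
greedy = re.match(r'^(.*)9999\.html$', path)
assert greedy.group(1) == 'some-listing-name-'

# Anchored alternative: the character class can never run past a hyphen,
# so the engine does far less trial-and-error
anchored = re.match(r'^([^-]+-)+([0-9]+)\.html$', path)
assert anchored.group(2) == '9999'
```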

If you're implementing the scheme I detailed above, the single RewriteRule does for both active and inactive pages all together. There's no way for htaccess to know which folder to target unless there's a clue in the requested URL.

As for HTTP status codes, searchengines look only at the number. In the HTTP header, "HTTP/1.1 200 Welcome" is just as valid as "HTTP/1.1 200 OK". The text following the number and the on-page wording are purely for humans.
5:39 am on May 29, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

"410" seems to be an interesting one upon further consideration, but also perhaps a bit narrow in scope in terms of it's real world application with respect to providing useful information to humans regarding any circumstantial information which such pages may have served in the past in relation to the current state of affairs.

Oh, it's not for humans. In fact you can serve 'em the same physical page they get for a 404. (Apache's default 410 page is even scarier for normal humans than their default 404 page.) It's for the search engines. If you feed them a steady diet of 410s, they will eventually stop trying to crawl the page every hour on the hour-- where "eventually" means "sooner than with a 404". Within the present geological era, at least.
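If the rewritten pages were retired one at a time rather than via the single-rule scheme, mod_alias could also mark an individual URL as gone while still showing humans a friendly page. A hypothetical sketch (the paths are invented):

```apache
# Serve this page body to humans whenever a 410 is issued
ErrorDocument 410 /listing-gone.html

# Answer "410 Gone" for a retired listing URL
Redirect gone /subdirectory/old-listing-9999.html
```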
