Help with 404ing Some Numbers in URL

Problems with Calendar-based URLs

         

mayest

9:30 pm on Jun 21, 2008 (gmt 0)




I'm using a calendar in my CMS to show blog entries by month. Clicking on a link in the calendar might generate a link such as:

http://www.example.com/blog/2007/06/

That would be valid in my case. However, the calendar seems to have no limits to how far back in time it will go (though I am working on stopping it from generating bad URLs in the first place). So, any tool that follows links deeply enough (e.g., Xenu or Google's bot) will find URLs that don't really exist. Unfortunately, the CMS will happily display a page anyway, even for something like this:

http://www.example.com/blog/1925/01/

I started this blog one year ago (certainly not in 1925!), so any URL that indicates a date prior to June 2007 should lead to a 404 error. After fooling around with this for a while, I finally got this to work:

RewriteRule ^blog/[1-2][0-9]{2}[0-6]/ /calculators/404/ [R=404,L]

At least it seems to redirect anything up to /blog/2006/whatever to my 404 page. It isn't perfect since it will still allow a few bad URLs (for early 2007).

Is this the best way to do this? Specifically, I want to know if there is a simpler regex that will catch /1900/ to /2006/ or anything that is before /2007/06/. Also, should I use a 410 instead of 404?

Thanks,

Tim

jdMorgan

3:37 pm on Jun 22, 2008 (gmt 0)




You can't "redirect" to an error page like this. I think you'll find the status code is wrong, and that this code generates an external 302-Found redirect.

Instead, simply rewrite to a file in "/calculators" which does not exist, put the 404 error page in that same directory, and declare the ErrorDocument 404 to point to it.
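
As a rough sketch of that setup (not tested against this particular CMS), the directive might look like the line below. The file name 404.html is a hypothetical placeholder; only the /calculators directory comes from this thread.

```apache
# Hypothetical file name; use whatever the real error page is called.
# Keep this a local URL-path: giving a full http:// URL makes Apache
# send a redirect, and the client never sees the 404 status.
ErrorDocument 404 /calculators/404.html
```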

As far as the regex goes, I'd tend toward something more restrictive:


# If year is 2007, January through May
RewriteCond $1 ^2007/0[1-5]$ [OR]
# or if year is NOT 2007-2029
RewriteCond $1 !^20(0[7-9]|[12][0-9])
# Rewrite to a non-existent filepath to trigger a 404
RewriteRule ^blog/([0-9]{4}/[0-9]{2})/$ /calculators/nonexistent_path [L]
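
On the 410 question from the first post: one possible variant uses the [G] flag, so Apache answers 410 Gone directly and no non-existent target path is needed. This is only a sketch, written with a solid pipe, and it assumes the rules live in the site root's .htaccess with mod_rewrite available. It has not been tested on the site in question.

```apache
# Assumption: per-directory (.htaccess) context in the site root
RewriteEngine On

# Year 2007, January through May
RewriteCond $1 ^2007/0[1-5]$ [OR]
# or any year outside 2007-2029
RewriteCond $1 !^20(0[7-9]|[12][0-9])
# "-" leaves the URL untouched; [G] returns 410 Gone and stops rewriting
RewriteRule ^blog/([0-9]{4}/[0-9]{2})/$ - [G]
```

Whether 410 or 404 is the better answer here is a judgment call. 410 says the page is permanently gone, while 404 only says it was not found.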

Jim

g1smd

4:41 pm on Jun 22, 2008 (gmt 0)




Open-ended calendars are spider traps.

I am aware of some sites with dates going backwards and forwards by many hundreds of years, with a page for each individual day.

I have even found one with a date several thousand YEARS into the future.

mayest

7:21 pm on Jun 22, 2008 (gmt 0)




Jim, that works great. I've finally worked out a way to stop the calendar from generating links to future and past months that don't exist, but I still need the 404 code because Google has already indexed a few of the non-existent pages (and I think that they try random URLs anyway). So, thank you for the help.

Tim