
Apache Web Server Forum

    
how to generate a 410 for a large number of urls at once
mrealty
msg:3912746
5:35 pm on May 13, 2009 (gmt 0)

I have a special situation and was wondering if someone could help.

I have about 10 thousand urls with the following naming conventions:

/product-random-stuff-here-4469.html

up to

/product-more-random-stuff-here-and-maybe-some-other-stuff-14351.html

Those urls have been deleted (ids 4469 through 14351). What would be the best way to generate a 410 for all of them in one statement, i.e. for every url with an id between and including 4469 and 14351?

For all of those who have suffered through this (trying to delete a large quantity of urls from Google's index), I know you feel my pain. Putting all of those in .htaccess one by one would take forever. There must be a way to do it more efficiently. Any ideas?

With kind regards.

 

jdMorgan
msg:3913031
10:46 pm on May 13, 2009 (gmt 0)


# Respond with 410-Gone to URL requests matching the template
# "/product-<hyphenated-string>-<numeric string lexically greater than 4468 and less than 14352>.html"
RewriteCond $2 >4468
RewriteCond $2 <14352
RewriteRule ^product-([^\-]+-)+([0-9]+)\.html$ - [G]

Be aware that the RewriteCond comparisons are lexical (string) comparisons, not numeric. That is, the "number characters" are treated only as text and not as a numeric value. Therefore "4460" and "04460" are not the same.
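
If your ids never carry a leading zero, one way to sidestep that caveat is to refuse a leading zero at the pattern level, so the captured id is always written the same way as the numbers in the conditions. A variant, purely as a sketch:

RewriteCond $2 >4468
RewriteCond $2 <14352
# Same 410 rule, but the id group refuses a leading zero
RewriteRule ^product-([^\-]+-)+([1-9][0-9]*)\.html$ - [G]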

I've assumed that you're putting this code into .htaccess, and that you already have other working RewriteRules.

Jim

[edit] Correction as noted below. [/edit]

[edited by: jdMorgan at 6:58 pm (utc) on May 20, 2009]

mrealty
msg:3913687
4:51 pm on May 14, 2009 (gmt 0)

Jim: That may be exactly what I'm looking for. I was reading this page when I saw your post.

[httpd.apache.org...]

and I may have to keep reading, because I don't quite understand what you have on the RewriteRule line. As long as that rule grabs all pages beginning with /product- and then only pays attention to the numeric string at the very end, right before the .html, then that sounds like it's exactly what I need.

I just wasn't sure about this part: "([^\-]+-)+". The template is: /product- is always at the beginning of the url, and, for example, 13255.html would always be at the end (never with leading zeros). So the in-between part could be ignored. That's where I got hung up: what you were matching in between.

Also, you don't think this will kill my rankings in any way, do you? I see googlebot keeps coming back and crawling hundreds if not thousands of the OLD urls, even as of right now. It's wasting time on those old urls when it could be crawling my new urls. So I figured it best to give a 410 on all of them and do a 301 on the most popular ones, just to catch a little traffic before I kill it altogether. What do you think?
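
For reference, my rough plan looks something like this (the URLs here are made up, just to show the idea): a handful of 301s for the most popular old pages placed above the blanket 410 rule, so the redirects win for those few URLs and everything else in the retired range falls through to the 410.

# 301 the few popular old pages to their replacements (example URLs are made up)
RewriteRule ^product-big-seller-5012\.html$ /new-big-seller.html [R=301,L]
RewriteRule ^product-other-favorite-7733\.html$ /other-favorite.html [R=301,L]
# ...the blanket 410 rule for the 4469-14351 range then follows these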

jdMorgan
msg:3914078
2:40 am on May 15, 2009 (gmt 0)

Study up on regular expressions to understand that pattern (see the resources cited in our Forum Charter). I tend to 'show no mercy' in my use of regex here in this forum -- I don't simplify it if doing so would introduce ambiguity or inefficiency. Just note that "^" at the start of an alternate-character [group] means "NOT", and that character-escaping rules inside and outside of groups are different.

Also, quantifiers apply to the preceding single character or to the preceding parenthesized sub-string and/or sub-pattern. Those hints should get you going... :)

If you are concerned about your ranking, I'd say that if you have indeed killed it, it was by removing all of those product pages which presumably had at least a few inbound links, and which 'voted' the PageRank gleaned from those inbound links to your home page, category pages, etc. Changing a 404 to a 410 after the fact won't make any difference to the ranking effects of having removed those pages.

Be advised that Google will likely continue to try to fetch those old URLs for a long, long time -- and may come back in ten years and check them again -- at least, according to one report I read here in the past 24 hours... Leave this 410-Gone code in place forever.

You may want to read my further comments on this subject in my second post in this recent thread [webmasterworld.com].

Jim

mrealty
msg:3917307
6:36 pm on May 20, 2009 (gmt 0)

I cut and pasted that code into my .htaccess and got a 500 error. Is there something not quite right with it? Was there supposed to be a $1 and then a $2? I was hoping to have something I could build upon but...it would need to work. Let me know if I need to change part of it. Thanks again.

P.S. there are other working rewrite rules in my .htaccess

jdMorgan
msg:3917318
7:02 pm on May 20, 2009 (gmt 0)

$1 is irrelevant, as it contains "random stuff" not pertaining to the 'id' number test.

The 500 error was likely caused by an unclosed parenthesis... Corrected in original post above.
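
For anyone comparing the two versions, the only difference is the closing parenthesis after the numeric group -- something like:

# Broken (the second capture group is never closed):
# RewriteRule ^product-([^\-]+-)+([0-9]+\.html$ - [G]
# Corrected:
RewriteRule ^product-([^\-]+-)+([0-9]+)\.html$ - [G]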

When you get a server error, go straight to your server error log... A very useful log file, that.

Jim

mrealty
msg:3917358
7:46 pm on May 20, 2009 (gmt 0)

Being a newb, I can't tell from the error how to fix it :-) But I will learn. I wanted to start with a good example; I'm not good enough yet to troubleshoot.

Thanks again. Your expert advice is worth paying for. Is there a PayPal donate button somewhere around here (without having to do a full-blown subscription)?

With kind regards.

http_error_log:
.htaccess: RewriteRule: cannot compile regular expression '^product-([^\\-]+-)+([0-9]+\\.html$'

jdMorgan
msg:3917474
1:00 am on May 21, 2009 (gmt 0)

If it wasn't clear, the error has been corrected in the code I posted above. This is to prevent others from copying bad code.

Jim

mrealty
msg:3917498
1:57 am on May 21, 2009 (gmt 0)

Yes, it was clear. I've applied it and it works for the tests I ran against it.

I've dusted off an old UNIX book from my shelf, turned to the "Regular Expressions" chapter, and am reading up. I'll be a master at it soon, just need some good examples. Thanks for providing one.

Cheers.

mrealty
msg:3917946
5:56 pm on May 21, 2009 (gmt 0)

But it didn't work for many urls, so I changed it and now it works for all of them. My example "random-stuff-here-and-maybe-some-other-stuff" meant the random part could be numeric, could be characters, could be anything. So your pattern wasn't matching certain strings that contained sets of digits within them. I modified yours to the below and now it works like a champ:

RewriteCond $2 >4468
RewriteCond $2 <14352
RewriteRule ^product-(.*)-([0-9]*).html$ - [G]

g1smd
msg:3917989
6:52 pm on May 21, 2009 (gmt 0)

While it might be "working", it probably doesn't work like you think it might.

So (.*) means "match everything to the end of the URL", then it has to back off and retry again and again to find the hyphen. This is very inefficient.

Additionally, since * means zero or more, a URL like /product--.html will match your pattern. In this instance that's likely not a problem, just be aware of it for next time.

jdMorgan
msg:3918006
7:10 pm on May 21, 2009 (gmt 0)

That should not have mattered, since the pattern I posted doesn't look for any specific characters. It looks for "one or more characters not a hyphen, followed by a hyphen, and all of the preceding one or more times, followed by one or more numbers, followed by a literal period, followed by 'html'."

The only condition that I can think of that would 'break' the code I posted would be two consecutive hyphens in the requested URL-path, and that's easy enough to fix using "^product-([^\-]+-+)+([0-9]+)\.html$"
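
Dropped back into the full ruleset, that would look something like this (the RewriteConds are unchanged from my earlier post):

RewriteCond $2 >4468
RewriteCond $2 <14352
# Same 410 rule, but tolerant of runs of consecutive hyphens in the middle
RewriteRule ^product-([^\-]+-+)+([0-9]+)\.html$ - [G]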

Your method will work (with a notable decrease in efficiency), but you should escape the literal period in any case:

^product-(.+)-([0-9]+)\.html$

Note also that neither the "any-stuff-here" part nor the "numbers" part of the URL-path is allowed to be empty here, whereas both are allowed to be empty with your code.

Jim

mrealty
msg:3918040
8:07 pm on May 21, 2009 (gmt 0)

Hi: Thanks again for the responses.

Re: a URL like /product--.html will match your pattern

According to my template, there will always be a product number, and it will be in the range stated, so there will never be a "-.html". I don't believe the pattern has any choice other than to be somewhat inefficient, given that there are only two fixed things for it to match on.

Re: jdMorgan:

Yes, there were several hundred urls that had double dashes. But I believe it was also not matching when a zip code appeared as the second-to-last set of numbers between the dashes. The urls really are random in between the constant parts; they don't necessarily follow perfect cases like product-stuff-stuff-stuff-alwaysstuff-5000.html

It is

/product-<anything can happen in between>-<number in the range 4469..14351>.html

g1smd
msg:3918045
8:13 pm on May 21, 2009 (gmt 0)

What I was saying is that your pattern matches and rewrites URLs that you consider non-valid; as coded, those URLs are valid for the rewrite, so if someone types one in, your site will respond. In some cases that would be undesirable, because it allows people to manipulate your listings by linking to non-valid URLs. In this case it isn't an issue, since the URL returns a 410. On another site, the same pattern might have fed a rewrite that produced content of sorts, and then it could very well be a problem.
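
To make that concrete, imagine a hypothetical site where the same loose pattern feeds a content-producing rewrite instead of a 410 (show-product.php is made up for the example):

# With the loose pattern, /product--.html or /product-garbage-.html gets
# rewritten to the script with an empty id, so outsiders can mint "valid" URLs
RewriteRule ^product-(.*)-([0-9]*)\.html$ /show-product.php?id=$2 [L]
# A tighter pattern only answers for URLs that really carry an id
# RewriteRule ^product-([^\-]+-)+([0-9]+)\.html$ /show-product.php?id=$2 [L]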
