Forum Moderators: phranque
I have about 10 thousand URLs with the following naming convention:
/product-random-stuff-here-4469.html
up to
/product-more-random-stuff-here-and-maybe-some-other-stuff-14351.html
Those URLs have been deleted (all numbers between 4469 and 14351). What would be the best way to return a 410 for all of those URLs in one statement, since I want a 410 for every one between and including 4469 and 14351?
For all of those who have suffered through this (trying to delete a large quantity of URLs from Google's index), I know you feel my pain. Putting all of those in .htaccess one by one would take forever. There must be a way to do it more efficiently. Any ideas?
With kind regards.
# Respond with 410-Gone to URL requests matching the template
# "/product-<hyphenated-string>-<number in the range 4469..14351>.html"
# Note: RewriteCond's > and < comparisons are lexical, so they only behave
# numerically when the digit strings have the same length; the range is
# therefore split into its 4-digit (4469-9999) and 5-digit (10000-14351) halves.
RewriteCond $2 ^[0-9]{4}$
RewriteCond $2 >4468
RewriteRule ^product-([^\-]+-)+([0-9]+)\.html$ - [G]
RewriteCond $2 ^1[0-9]{4}$
RewriteCond $2 <14352
RewriteRule ^product-([^\-]+-)+([0-9]+)\.html$ - [G]
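As a sanity check (my own sketch, not part of the thread), here is a short Python version of what the rule is meant to do, using Python's re module as a stand-in for Apache's regex engine and a plain numeric comparison for the range:

```python
import re

# Same pattern as the RewriteRule, applied to the URL-path with the
# leading slash stripped (which is how .htaccess patterns see it).
PATTERN = re.compile(r'^product-([^-]+-)+([0-9]+)\.html$')

def is_gone(path):
    """True if the path should get a 410 (trailing ID in 4469..14351)."""
    m = PATTERN.match(path.lstrip('/'))
    if not m:
        return False
    return 4469 <= int(m.group(2)) <= 14351

print(is_gone('/product-random-stuff-here-4469.html'))  # True
print(is_gone('/product-some-other-stuff-14351.html'))  # True
print(is_gone('/product-still-here-14352.html'))        # False: above range
print(is_gone('/product-stuff-4468.html'))              # False: below range
```

The file/path names are made up for illustration; only the pattern and the 4469..14351 range come from the thread.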
I've assumed that you're putting this code into .htaccess, and that you already have other working RewriteRules.
Jim
[edit] Correction as noted below. [/edit]
[edited by: jdMorgan at 6:58 pm (utc) on May 20, 2009]
[httpd.apache.org...]
I may have to keep reading, because I don't quite understand what you have on the RewriteRule line. As long as that rule grabs all pages beginning with /product- and then only looks at the numeric string at the very end, right before the .html, then that sounds like it.
I just wasn't sure about this part: "([^\-]+-)+". The template is: /product- is always at the beginning of the URL, and a number like 13255.html is always at the end (never with leading zeros). So everything in between can be ignored. That's where I got hung up: what were you matching in between?
Also, you don't think this will hurt my rankings in any way, do you? I see Googlebot keeps coming back and crawling hundreds if not thousands of the OLD URLs, even right now. It's wasting time on those old URLs when it could be crawling my new ones. So I figured it best to return a 410 on all of them and a 301 on the most popular ones, just to catch a little traffic before I kill it altogether. What do you think?
Also, quantifiers apply either to the preceding single character or to the preceding parenthesized sub-pattern. Those hints should get you going... :)
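To make that concrete, here is a small Python sketch (mine, using re as a stand-in for Apache's regex engine) of what "([^\-]+-)+" consumes: each repetition grabs one run of non-hyphen characters plus the hyphen after it, so the whole group eats the hyphenated middle of the URL and leaves the trailing number for the second group:

```python
import re

# The group repeats once per "word-" chunk of the hyphenated middle.
pattern = re.compile(r'^product-([^-]+-)+([0-9]+)\.html$')

m = pattern.match('product-more-random-stuff-14351.html')
# A repeated group only retains its LAST repetition...
print(m.group(1))  # 'stuff-'
# ...while the second group gets the trailing number.
print(m.group(2))  # '14351'
```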
If you are concerned about your ranking, I'd say that if you have indeed killed it, it was by removing all of those product pages which presumably had at least a few inbound links, and which 'voted' the PageRank gleaned from those inbound links to your home page, category pages, etc. Changing a 404 to a 410 after the fact won't make any difference to the ranking effects of having removed those pages.
Be advised that Google will likely continue to try to fetch those old URLs for a long, long time -- and may come back in ten years and check them again -- at least, according to one report I read here in the past 24 hours... Leave this 410-Gone code in place forever.
You may want to read my further comments on this subject in my second post in this recent thread [webmasterworld.com].
Jim
P.S. There are other working RewriteRules in my .htaccess.
Thanks again. Your expert advice is worth paying for. Is there a paypal donate button somewhere around here? (without having to do a full blown subscription?)
With kind regards.
http_error_log:
.htaccess: RewriteRule: cannot compile regular expression '^product-([^\\-]+-)+([0-9]+\\.html$'
RewriteCond $2 >4468
RewriteCond $2 <14352
RewriteRule ^product-(.*)-([0-9]*).html$ - [G]
So: (.*) means "match everything to the end of the URL", and then the regex engine has to back off and retry again and again to find the hyphen. This is very inefficient. Additionally, since * means zero or more, a URL like /product--.html will match your pattern. In this instance that's likely not a problem, just be aware of it for next time.
The only condition that I can think of that would 'break' the code I posted would be two consecutive hyphens in the requested URL-path, and that's easy enough to fix using "^product-([^\-]+-+)+([0-9]+)\.html$"
Your method will work (with a notable decrease in efficiency), but you should escape the literal period in any case:
^product-(.+)-([0-9]+)\.html$
Jim
Re: a URL like /product--.html will match your pattern
According to my template there will always be a product number, and always in the range stated, so there will never be a bare -.html. I don't believe the regex has any choice but to be inefficient, given there were only two things it could match.
Re: jdMorgan:
Yes, there were several hundred URLs that had double dashes. But I believe it was also failing to match when a zip code appeared as the second-to-last set of numbers between the dashes. The URLs were really random in between the constant parts; they don't necessarily follow perfect cases of product-stuff-stuff-stuff-alwaysstuff-5000.html.
It is:
/product-<anything can happen in between>-<number in the range 4469..14351>.html
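Putting the whole thread together, here is one last sanity-check sketch in Python (my own, with made-up sample URLs) for that template, using the double-hyphen-tolerant pattern suggested earlier plus a numeric check on the range:

```python
import re

# Double-hyphen-tolerant pattern, as suggested earlier in the thread.
pattern = re.compile(r'^product-([^-]+-+)+([0-9]+)\.html$')

def should_410(path):
    """True when the path fits the template and its ID is in 4469..14351."""
    m = pattern.match(path.lstrip('/'))
    return bool(m) and 4469 <= int(m.group(2)) <= 14351

samples = [
    '/product-double--dash-90210-5000.html',  # zip code in the middle: still fine
    '/product-plain-old-13255.html',
    '/product-kept-page-4468.html',           # below range: keep serving it
]
for s in samples:
    print(s, should_410(s))
```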