
how to generate a 410 for a large number of urls at once

     
5:35 pm on May 13, 2009 (gmt 0)

5+ Year Member



I have a special situation and was wondering if someone could help.

I have about 10,000 URLs with the following naming convention:

/product-random-stuff-here-4469.html

up to

/product-more-random-stuff-here-and-maybe-some-other-stuff-14351.html

Those URLs have been deleted (the numbers run from 4469 to 14351). What would be the best way to generate a 410 for all of them in one statement, covering everything between and including 4469 and 14351?

For all of those who have suffered through this (trying to delete a large quantity of URLs from Google's index), I know you feel my pain. Putting all of those in .htaccess one by one would take forever. There must be a way to do it more efficiently. Any ideas?

With kind regards.

10:46 pm on May 13, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member




# Respond with 410-Gone to URL requests matching the template
# "/product-<hyphenated-string>-<numeric string lexically greater than 4468 and less than 14352>.html"
RewriteCond $2 >4468
RewriteCond $2 <14352
RewriteRule ^product-([^\-]+-)+([0-9]+)\.html$ - [G]

Be aware that the RewriteCond compares are lexical compares and not numeric. That is, the "number characters" are treated only as text and not as a numeric value, so "4460" and "04460" are not the same. For example, "04469" compares lexically less than "4468" (because '0' < '4'), even though it is numerically greater.

I assumed that you're putting this code into .htaccess, and that you already have other working RewriteRules.

Jim

[edit] Correction as noted below. [/edit]

[edited by: jdMorgan at 6:58 pm (utc) on May 20, 2009]

4:51 pm on May 14, 2009 (gmt 0)

5+ Year Member



Jim: That may be exactly what I'm looking for. I was reading this page when I saw your post.

[httpd.apache.org...]

and I may have to keep reading, because I don't quite understand what you have on the RewriteRule line. As long as that rule will grab all pages beginning with /product- and then only pay attention to the numeric string at the very end, right before the .html, then that sounds like it's exactly it.

I just wasn't sure about this part: "([^\-]+-)+". The template is: /product- is always at the beginning of the URL, and something like 13255.html is always at the end (never with leading zeros), so the part in between can be ignored. That's where I got hung up: what are you matching in between?

Also, you don't think this will kill my rankings in any way, do you? I see Googlebot keeps coming back and crawling hundreds if not thousands of the OLD URLs, even right now. It's wasting time on those old URLs when it could be crawling my new URLs. So I figured it best to give a 410 on all of them and do a 301 on the most popular ones, just to catch a little traffic before I kill it altogether. What do you think?
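
Something like this, maybe, is what I have in mind (made-up URLs and targets, just to illustrate), with a few specific 301s placed above the blanket 410 rule so they get matched first:

# Hypothetical examples -- the real URLs and new destinations would differ
RewriteRule ^product-best-selling-widget-4500\.html$ http://www.example.com/widgets/best-selling-widget.html [R=301,L]
RewriteRule ^product-other-popular-widget-9876\.html$ http://www.example.com/widgets/other-popular-widget.html [R=301,L]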

2:40 am on May 15, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Study-up on regular-expressions to understand that pattern (see the resources cited in our Forum Charter). I tend to 'show no mercy' in my use of regex here in this forum -- I don't try to simplify it at the expense of ambiguity or inefficiency. Just note that "^" at the start of an alternate-character [group] means "NOT", and that character-escaping rules inside and outside of groups are different.

Also, quantifiers apply to the preceding single character or to the preceding parenthesized sub-string and/or sub-pattern. Those hints should get you going... :)
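
For reference, the pattern from my earlier post can be read piece by piece like this:

# ^product-([^\-]+-)+([0-9]+)\.html$  read piece by piece:
#   ^product-     literal "product-" at the start of the requested path
#   ([^\-]+-)+    one or more "chunk-" sections, each being one or more
#                 non-hyphen characters followed by a hyphen
#   ([0-9]+)      the trailing numeric ID, captured as $2 for the RewriteConds
#   \.html$       a literal ".html" at the very end of the path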

If you are concerned about your ranking, I'd say that if you have indeed killed it, it was by removing all of those product pages which presumably had at least a few inbound links, and which 'voted' the PageRank gleaned from those inbound links to your home page, category pages, etc. Changing a 404 to a 410 after the fact won't make any difference to the ranking effects of having removed those pages.

Be advised that Google will likely continue to try to fetch those old URLs for a long, long time -- and may come back in ten years and check them again -- at least, according to one report I read here in the past 24 hours... Leave this 410-Gone code in place forever.

You may want to read my further comments on this subject in my second post in this recent thread [webmasterworld.com].

Jim

6:36 pm on May 20, 2009 (gmt 0)

5+ Year Member



I cut and pasted that code into my .htaccess and got a 500 error. Is there something not quite right with it? Was there supposed to be a $1 and then a $2? I was hoping to have something I could build upon but...it would need to work. Let me know if I need to change part of it. Thanks again.

P.S. There are other working rewrite rules in my .htaccess.

7:02 pm on May 20, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



$1 is irrelevant, as it contains "random stuff" not pertaining to the 'id' number test.

The 500 error was likely caused by an unclosed parenthesis... Corrected in original post above.

When you get a server error, go straight to your server error log... A very useful log file, that.

Jim

7:46 pm on May 20, 2009 (gmt 0)

5+ Year Member



Being a newb, I can't tell from the error how to fix it :-) But I will learn. I wanted to start with a good example; I'm not good enough yet to troubleshoot.

Thanks again. Your expert advice is worth paying for. Is there a PayPal donate button somewhere around here (without having to do a full-blown subscription)?

With kind regards.

http_error_log:
.htaccess: RewriteRule: cannot compile regular expression '^product-([^\-]+-)+([0-9]+\.html$'

1:00 am on May 21, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



If it wasn't clear, the error has been corrected in the code I posted above. This is to prevent others from copying bad code.

Jim

1:57 am on May 21, 2009 (gmt 0)

5+ Year Member



Yes, it was clear. I've applied it and it works for the tests I ran against it.

I've dusted off an old UNIX book from my shelf, turned to the "Regular Expressions" chapter, and am reading up. I'll be a master at it soon, just need some good examples. Thanks for providing one.

Cheers.

5:56 pm on May 21, 2009 (gmt 0)

5+ Year Member



But it didn't work for many URLs, so I changed it and now it works for all of them. My example "random-stuff-here-and-maybe-some-other-stuff" meant the random part could be numeric, alphabetic, anything at all. So your pattern was not matching certain strings that contained sets of digits in the middle. I modified yours to the version below and now it works like a champ:

RewriteCond $2 >4468
RewriteCond $2 <14352
RewriteRule ^product-(.*)-([0-9]*).html$ - [G]

6:52 pm on May 21, 2009 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



While it might be "working", it probably doesn't work like you think it might.

So (.*) means "match everything to the end of the URL"; the engine then has to back off and retry again and again to find the hyphen. This is very inefficient.

Additionally, since * means "zero or more", a URL like /product--.html will match your pattern. In this instance that's likely not a problem, just be aware of it for next time.

7:10 pm on May 21, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



That should not have mattered, since the pattern I posted doesn't look for any specific characters. It looks for "one or more characters not a hyphen, followed by a hyphen, and all of the preceding one or more times, followed by one or more numbers, followed by a literal period, followed by 'html'."

The only condition that I can think of that would 'break' the code I posted would be two consecutive hyphens in the requested URL-path, and that's easy enough to fix using "^product-([^\-]+-+)+([0-9]+)\.html$"

Your method will work (with a notable decrease in efficiency), but you should escape the literal period in any case:


^product-(.+)-([0-9]+)\.html$

Note also that with this pattern, neither the "any-stuff-here" part nor the "numbers" part is allowed to be empty, as they are with your code.
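
Putting that together with the original conditions, a complete sketch (untested as posted, so check it against a few of your URLs first) would be:

RewriteCond $2 >4468
RewriteCond $2 <14352
RewriteRule ^product-([^\-]+-+)+([0-9]+)\.html$ - [G]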

Jim

8:07 pm on May 21, 2009 (gmt 0)

5+ Year Member



Hi: Thanks again for the responses.

Re: a URL like /product--.html will match your pattern

According to my template there will always be a product number in the range stated, so there will never be a /product--.html. And I don't think the pattern has much choice but to be somewhat inefficient, given that the only fixed things it can match are the two ends.

Re: jdMorgan:

Yes, there were several hundred URLs that had double hyphens. But I believe it was also not matching when there was a zip code as the second-to-last set of numbers between the hyphens. The URLs really are random in between the constant parts; they don't necessarily follow perfect cases like product-stuff-stuff-stuff-alwaysstuff-5000.html

It is:

product-<anything can happen in between>-<a number in the range 4469..14351>.html

8:13 pm on May 21, 2009 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



What I was saying is that your pattern responds and rewrites for URLs that you consider non-valid; as coded, those URLs are valid for the rewrite. If someone types in such a URL, your site will respond. In some cases that would be undesirable, because it allows people to manipulate your listings by linking to non-valid URLs. In this case it isn't an issue, since the URL returns a 410; on another site the same pattern might drive a rewrite that produces content of sorts, and then it could very well be a problem.
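
For illustration only, using a hypothetical product.php handler that doesn't exist on your site: if the same over-broad pattern drove a rewrite to content instead of a 410, made-up URLs such as /product--.html would produce pages, whereas requiring at least one character in each part closes that hole.

# Hypothetical content-producing rewrite -- the loose pattern accepts
# /product--.html and /product-whatever-.html as "valid":
RewriteRule ^product-(.*)-([0-9]*)\.html$ /product.php?id=$2 [L]

# Tightened version -- each part must be non-empty:
RewriteRule ^product-(.+)-([0-9]+)\.html$ /product.php?id=$2 [L]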
 
