Home / Forums Index / Code, Content, and Presentation / Apache Web Server
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

    
410 Gone for dynamic URLs
I need to return 410 Gone for URLs that shouldn't have been crawled
riatkstarley




msg:4622499
 11:10 am on Nov 11, 2013 (gmt 0)

Hi,

I need to return 410 Gone for a bunch of URLs that shouldn't have been crawled in the first place and are now 404ing. We've fixed the error that caused them initially, but there are around 3000 that have been indexed. These are all dynamic and follow patterns such as:

http://www.example.com/product-tag/nameoftag/page/2/?filter_region=92
http://www.example.com/product-category/nameofcategory/?filter_product_cat=22,269

I would like to use .htaccess to return 410 Gone for these (as suggested by a kind person on the SEO forum) but I'm unsure of how to implement regex to catch all of these. Any help would be massively appreciated.

Thanks,

Ria

 

lucy24




msg:4622514
 12:58 pm on Nov 11, 2013 (gmt 0)

You'll need more than two illustrations to make a RegEx pattern. I don't see any unifying theme, except for the parts you've obfuscated:
/product-
/nameof
?filter_
I don't suppose any of those are part of the real URLs.

If you can explain in English what the pattern is, we'll see about hammering out a RegEx.

Variables:
name of requested page(s)
name of parameter(s)
value or value range of parameter(s)

riatkstarley




msg:4622518
 1:07 pm on Nov 11, 2013 (gmt 0)

Hi Lucy,

The two examples were representative of a bunch of similar ones. In example one above, where the region filtered was 92, it could have been 58, or 24, or any other number. Similarly, in the category example, the ID of the category could be any numeric value.

Essentially, I'd like any filter parameters to be removed from the index. These are primarily tags and categories - the tags could be regions, colours, varieties etc. The range of each parameter will be from 0 to no more than 500.

In the examples, I obfuscated the domain and the /nameoftag/ and /nameofcategory/ - the other variables were as they are in the actual URLs. Name of tag could be, for example, chardonnay, and the category could be organic.

Thanks for your help!

phranque




msg:4622619
 9:08 pm on Nov 11, 2013 (gmt 0)

welcome to WebmasterWorld, riatkstarley!


I would use a RewriteCond to catch every QUERY_STRING value that starts with or contains a 'filter_' parameter, and follow that with a RewriteRule using the G flag.
It's possible that this is too simple and would catch too much.
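
A minimal sketch of how that suggestion might look in .htaccess. It assumes mod_rewrite is enabled and that no URL you want to keep uses a parameter whose name starts with filter_; both are assumptions, not something confirmed in this thread.

```apache
RewriteEngine On

# True when any parameter name in the query string begins with "filter_",
# whether it is the first parameter or follows an "&".
RewriteCond %{QUERY_STRING} (^|&)filter_

# "^" matches every request path; "-" means no substitution.
# [G] makes the server answer 410 Gone and stops further rewriting.
RewriteRule ^ - [G]
```

With this in place, a request such as /product-tag/nameoftag/page/2/?filter_region=92 would receive a 410, while the same path without the filter_ parameter would be served as usual.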

[edit]missing "or" in "start with or contain"[/edit]

[edited by: phranque at 4:08 am (utc) on Nov 12, 2013]

lucy24




msg:4622661
 3:59 am on Nov 12, 2013 (gmt 0)

How many different paths can carry the "filter_" parameter? Since the %{QUERY_STRING} part requires a Condition, you want to constrain the search as tightly as possible so the condition doesn't have to be evaluated on every single request. Both of your examples involve directory-index pages (either physical directory or URL made to look that way, doesn't matter). So at a minimum:

RewriteRule /$ et cetera


so you only evaluate the RewriteCond if the request was for a directory.
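
As a rough sketch, the directory-only rule combined with a tighter condition could look like the following. The value pattern (digits and commas only) is taken from the two example URLs; the character class [a-z_] is an assumption that lets names containing underscores, such as filter_product_cat, match.

```apache
RewriteEngine On

# Checked only when the RewriteRule pattern below has already matched.
# Requires a complete filter_name=digits,commas pair bounded by the
# start/end of the query string or by "&".
RewriteCond %{QUERY_STRING} (^|&)filter_[a-z_]+=[\d,]+(&|$)

# Only paths ending in a slash are considered, so requests for images,
# stylesheets and similar files never reach the condition.
RewriteRule /$ - [G]
```

Note that in .htaccess the leading slash is stripped before matching, so a request for the site root itself (an empty path) would not match /$. That seems acceptable here, since both example URLs sit in subdirectories.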

Do you want to discard all URLs that contain the "filter_blahblah" parameter, or do you want to discard the parameter and keep the rest of the query? If the latter, does "filter_" always come at the beginning of the query string? Can it be followed by other stuff? If yes to both, do any other parameter names begin with f?

Best case involves

^filter_[a-z_]+=[\d,]+&(more-stuff-here)

where more-stuff-here becomes %1 in a redirect. Worst case involves

(.*?|^)filter_[a-z_]+=[\d,]+($|&more-stuff-here)

with %1 and %2 in the redirect.
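
If the goal turns out to be redirecting to the same URL minus the filter parameter, rather than returning 410, the worst case might be sketched as below. This is an illustration under assumptions, not a tested rule set: it assumes .htaccess context, a 301 redirect, and parameter names made of lowercase letters and underscores. The captures are arranged slightly differently from the pattern above so that %1 and %2 can be joined directly.

```apache
RewriteEngine On

# %1 = everything before the filter parameter, including its trailing "&"
# %2 = everything after the filter parameter, without the leading "&"
# Either capture may be empty.
RewriteCond %{QUERY_STRING} ^(.*&)?filter_[a-z_]+=[\d,]+(?:&(.*))?$

# Redirect to the same path with the remaining parameters. Supplying a
# "?" in the substitution replaces the original query string; if both
# captures are empty, the bare trailing "?" clears it entirely.
RewriteRule /$ %{REQUEST_URI}?%1%2 [R=301,L]
```

Two limitations of this sketch: when the filter parameter is last, the new query string keeps a harmless trailing "&", and a URL carrying several filter_ parameters is cleaned one parameter per redirect.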


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved