Long invalid urls creating dupe content in WMT

Suspect: a possible missing slash in htaccess

         

MelissaLB

2:18 pm on Jul 22, 2011 (gmt 0)

10+ Year Member



Hi there everyone,

A few weeks back I noticed a thread here from someone with almost exactly the same issue as ours: [webmasterworld.com...]

We started noticing a large number of errors in WMT reporting duplicate content, created by categories being improperly associated with other categories. We did our best to correct this with 301s and rel canonical.

The more I dig, the more I suspect that the original source of this problem lies somewhere in our .htaccess file. I have a hunch that a missing slash somewhere caused a flurry of invalid URLs.

I was hoping to get some input. This is a language that I don't really speak, and the members here were able to help the person in the above-mentioned thread, so I thought I'd post the main body of our .htaccess file to see if anyone can spot something we overlooked that could cause such a problem.

OK, here goes:


----------------------------------------------


SetEnv TZ America/Halifax
AddDefaultCharset ISO-8859-1

php_flag display_errors Off
php_flag zlib.output_compression On
php_value zlib.output_compression_level 5

RewriteEngine On

RewriteCond %{HTTP_HOST} !^(www\.example\.com|example\.intranet\.nvi) [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

#anti-hotlink codes were asked to remove
#RewriteCond %{HTTP_REFERER} !^$
#RewriteCond %{HTTP_REFERER} !^(http(s)?://(www\.)?example.com|http(s)?://example\.intranet\.nvi)/ [NC]
#RewriteCond %{HTTP_REFERER} !search\?q=cache [NC]
#RewriteCond %{HTTP_REFERER} !google\. [NC]
#RewriteCond %{HTTP_REFERER} !bing\. [NC]
#RewriteCond %{HTTP_REFERER} !yahoo\. [NC]
#RewriteRule \.(jpg|jpeg|png|gif)$ - [NC,F,L]

#Rewrite any CMS pages
RewriteRule ^content/(.*) /cms.php?key=$1 [NC,L]

#Rewrite the wrong licen bands with license name pages
RewriteRule ^License/Bands/([^/]+)/pCat/(.*) /License/Bands/pCat/$2 [R=301,L]

#RewriteRule ^(.*)/store/(.*) - [R=404,L,NC]
RewriteRule ^(.*)/store/(.*) http://www.example.com [R=301,L]

#Rewrite any store pages
RewriteRule ^store/SMOKE_SHOP_19_ONLY(.*) / [NC,L]
RewriteRule ^store/(.*) /products.php?key=$1 [NC,L]
RewriteRule ^store(.*) /products.php [NC,L]
RewriteRule ^License/(.*) /products.php?key=$1 [NC,L]
RewriteRule ^License(.*) /products.php [NC,L]

#ExpiresByType application/x-Shockwave-Flash A2592000
#ExpiresByType image/gif A2592000
#ExpiresByType image/png A2592000
#ExpiresByType image/jpg A2592000
#ExpiresByType image/jpeg A2592000

<Files 403.shtml>
order allow,deny
allow from all
</Files>

deny from 147.84.200.85
deny from 165.98.225.68

ReWriteBase /
ReWriteCond %{HTTP:accept-encoding} (gzip.*)
ReWriteCond %{REQUEST_FILENAME}.gz -f
ReWriteRule ^(.*) $1.gz [L]

<FilesMatch "\.(css.gz)$">
AddEncoding x-gzip .gz
ForceType text/css
</FilesMatch>

<FilesMatch "\.(html.gz)$">
AddEncoding x-gzip .gz
ForceType text/html
</FilesMatch>

<FilesMatch "\.(js.gz)$">
AddEncoding x-gzip .gz
ForceType text/javascript
</FilesMatch>

<IfModule mod_expires.c>
ExpiresActive on
ExpiresByType image/jpeg A604800
ExpiresByType image/gif A604800
ExpiresByType image/png A604800
ExpiresByType application/x-shockwave-flash A604800
ExpiresByType audio/mpeg A604800
</IfModule>

g1smd

4:34 pm on Jul 22, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There are a lot of issues in there, the same ones that come up several times every week.

First though, you have code where the comment says "rewrite" but the code actually performs a "redirect". That tells me you are not clear on what each one does.

It's also not clear which one you actually want in each place.

A redirect tells the browser to request a different URL in a new HTTP transaction. The user sees the URL in the browser address bar change.

A rewrite maps the external URL request to the actual place inside the server where the content resides and directly serves that content.
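
To make the distinction concrete, here is a minimal sketch with one rule of each kind. The content rule is taken from the posted file; the "old-store" path is a made-up name used only for illustration.

```apache
RewriteEngine On

# External redirect: the R flag sends a 301 response and the browser makes a
# new request, so the address bar changes. "old-store" is a hypothetical path.
RewriteRule ^old-store/(.*)$ http://www.example.com/store/$1 [R=301,L]

# Internal rewrite: no R flag and a local target, so the server serves
# cms.php while the visitor still sees /content/... in the address bar.
RewriteRule ^content/(.*)$ /cms.php?key=$1 [NC,L]
```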

You should organise your .htaccess file as follows:

- block unwanted requests
- redirects to new URLs (with non-www to www rule last)
- rewrites to fetch content

and then all the other stuff unrelated to mod_rewrite.
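
As a rough sketch of that ordering, here are some of the mod_rewrite rules from the posted file rearranged. The rules themselves are copied unchanged and the list is abridged; the blocking rule in step 1 is a placeholder that is not in the original file.

```apache
RewriteEngine On

# 1. Block unwanted requests first (F returns 403 Forbidden).
#    "unwanted-path" is a placeholder, not taken from the posted file.
RewriteRule ^unwanted-path/ - [F]

# 2. External redirects to new URLs, most specific first...
RewriteRule ^License/Bands/([^/]+)/pCat/(.*) /License/Bands/pCat/$2 [R=301,L]
RewriteRule ^(.*)/store/(.*) http://www.example.com [R=301,L]

#    ...with the non-www to www redirect last among the redirects.
RewriteCond %{HTTP_HOST} !^(www\.example\.com|example\.intranet\.nvi) [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# 3. Internal rewrites that fetch the content.
RewriteRule ^content/(.*) /cms.php?key=$1 [NC,L]
RewriteRule ^store/(.*) /products.php?key=$1 [NC,L]
RewriteRule ^License/(.*) /products.php?key=$1 [NC,L]
```

With the hostname redirect last among the redirects, a request that needs both a path fix and a hostname fix is more likely to be corrected in a single 301 than in a chain of two.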

lucy24

8:30 pm on Jul 22, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is a language that I don't really speak

It is good to start from this position :) You can seriously injure yourself with a Regular Expression-- even in a text editor, never mind your .htaccess file.

RewriteRule ^(.*)/store/(.*) http://www.example.com [R=301,L]

The element ^(.*) is never necessary unless you need to capture. It simply means "there may or may not be stuff at the beginning of the request". Combined with the greedy matching built into RegEx, it means "capture as much as you can-- and then backtrack to make room for the last occurrence of /store/". And then you've set up another problem because the mandatory part of the pattern begins with a slash-- which it would only do if there is a preceding part, because in .htaccess a RewriteRule sees the path relative to www.example.com/, with that leading slash already removed. The following set of rules has it right, assuming you are only talking about a top-level /store/ directory. If you've got more than one directory with the same name, things get messier.

#Rewrite any store pages
RewriteRule ^store/SMOKE_SHOP_19_ONLY(.*) / [NC,L]

Here too the (.*) isn't needed because you aren't doing anything with it.
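
A sketch of both rules with the unused captures dropped, assuming the .htaccess file sits in the document root. Neither target uses $1 or $2, so the set of matching requests should be unchanged. The trailing slash on the redirect target is an addition, not part of the posted rule.

```apache
# Matches any path containing "/store/" below the top level, just as
# ^(.*)/store/(.*) did. In a document-root .htaccess the top-level path
# arrives as "store/..." with no leading slash, so it is not caught here.
RewriteRule /store/ http://www.example.com/ [R=301,L]

# A prefix match is enough: anything starting with store/SMOKE_SHOP_19_ONLY.
RewriteRule ^store/SMOKE_SHOP_19_ONLY / [NC,L]
```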

RewriteRule ^store/(.*) /products.php?key=$1 [NC,L]
RewriteRule ^store(.*) /products.php [NC,L]
RewriteRule ^License/(.*) /products.php?key=$1 [NC,L]
RewriteRule ^License(.*) /products.php [NC,L]

These two pairs of rules are probably not doing what you intended them to do. Are you distinguishing between directories and subdirectories, or between things that have and don't have a query string? If you inadvertently swapped rules 2 and 1, or rules 4 and 3, then 1 and 3 would be swallowed up into 2 and 4 since .* includes absolutely anything including a slash. This is perilous. You probably want something involving [^/] (anything other than a slash). And, since you're capturing it in order to feed it into the query string, you almost certainly want a + rather than a * because there has to be something to get captured.
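
One possible shape for those four rules, as a sketch only. It assumes the key is always a single path segment; if keys can contain slashes, the capture would need to be (.+) rather than ([^/]+).

```apache
# Exactly one non-empty path segment after store/, optional trailing slash.
RewriteRule ^store/([^/]+)/?$ /products.php?key=$1 [NC,L]
# The bare directory, with or without the trailing slash.
RewriteRule ^store/?$ /products.php [NC,L]

# Same pair for License.
RewriteRule ^License/([^/]+)/?$ /products.php?key=$1 [NC,L]
RewriteRule ^License/?$ /products.php [NC,L]
```

Because each pattern is anchored at both ends, no rule can swallow another, and the order within each pair stops mattering. The trade-off is that anything the loose originals used to catch, such as a request for /storefront, is no longer rewritten.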

MelissaLB

5:31 pm on Jul 25, 2011 (gmt 0)

10+ Year Member



Thanks for the input guys. I've shown this to my developer to see what his opinion is on this but I really feel like I'm on the right track with this.

I have a very strong suspicion that some of our issues with the long invalid URLs with extraneous parameters could be coming from the /store/ issue mentioned by lucy24. Although I originally thought the problem was caused by a missing slash rather than an additional one, this looks like it could be a major clue.

We've been seeing a lot of these types of urls:
http://www.example.com/store/License_Category_Example/Produc Example/pCat/Top_Level_Category_Example/pProd/11

This is the correct url that should show:
http://www.example.com/store/License_Category_Example/Produc Example/
or:
http://www.example.com/store/Top_Level_Category_Example/pProd/11

It looks as though, as the bot crawls through, it's not dropping the parameters of the previous page. We see this happening just about everywhere on the site, but now more prominently on categories with pagination, as well as on products and categories that have recently been deleted. (These have a redirect, I think, that points to a general page of the site when that particular product is removed.) So when we delete them from the site, we start finding that product URL becoming associated with unrelated categories in WMT.

Anyway, thanks again for the input. I can't comment on much of it since, as I mentioned, I don't speak the language; I'm just directing others. But I have a feeling our answer is somewhere in here!
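
For illustration only, a redirect modelled on the existing License/Bands rule could fold the long form back into the short one. This sketch assumes the unwanted URLs always have the shape store/license/product/pCat/category/pProd/number and that the first two segments form the correct URL, as in the example above. The real URL scheme may differ.

```apache
# Hypothetical clean-up rule: strip the trailing pCat/.../pProd/... part.
# It would need to sit above the internal store rewrites.
RewriteRule ^store/([^/]+)/([^/]+)/pCat/[^/]+/pProd/[0-9]+/?$ http://www.example.com/store/$1/$2/ [R=301,NC,L]
```

A rule like this would only tidy up URLs that have already been crawled; it would not stop the site's own links from generating them.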