Forum Moderators: phranque

Need example of complete htaccess to fix all duplicate content issues

Confused about all the variables

AndyA

3:05 pm on Dec 10, 2006 (gmt 0)

While there is a ton of information on this subject at WebmasterWorld, it is spread about in different threads, often dealing with specific issues that don't always apply to other sites. I thought it would be nice to have it all together in one place, especially for those of us who do not understand how to make these changes.

There are so many issues with Google indexing/crawling strange URLs that often don't exist on a site. I have added code to my htaccess file to prevent many of these situations, yet I still keep coming across new ones that need to be addressed. I just added one (to remove a "?" from the URL when one shouldn't exist), only to find that another issue (a double // in the URL), which had been fixed, is now "unfixed."

So, obviously something is conflicting with something else. Could someone post an example of an htaccess file that prevents all of these strange things from happening:

- Add extra slashes in URL (http://example.com/folder//page.html)
- Remove dots that don't belong in URL (http://example.com./)
- Remove index.html from URL (http://example.com/folder/index.html to http://example.com/folder/)
- Remove question marks that don't belong (http://example.com/page.html?)
- 301 Redirect from www. to non www. or vice versa.

And there may be others I'm overlooking as well, along with anything else that could normally be expected. I know the order of the rules matters: I believe the one that removes index.html from the URL has to come before the www-to-non-www one, or it can cause a server issue.
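For concreteness, here is one possible sketch of how those fix-ups might be ordered in a single .htaccess. This is untested and only illustrative, with example.com standing in for the real canonical host; the hostname rule is placed last among the redirects so the earlier fix-ups can redirect straight to the canonical host:

```apache
Options +FollowSymLinks -MultiViews
RewriteEngine on

# 1. Remove index.html (checked against THE_REQUEST so only
#    client-requested URLs redirect, not internal rewrites)
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]*/)*index\.html?\ HTTP/
RewriteRule ^(([^/]*/)*)index\.html?$ http://example.com/$1 [R=301,L]

# 2. Strip a stray query string ("page.html?")
RewriteCond %{THE_REQUEST} \?
RewriteRule ^(.*)$ http://example.com/$1? [R=301,L]

# 3. Collapse a doubled slash in the path (repeats once per
#    request, so a deeply malformed URL may chain redirects)
RewriteCond %{REQUEST_URI} ^/(.*)//(.*)$
RewriteRule ^ http://example.com/%1/%2 [R=301,L]

# 4. Canonicalize the hostname last: this also catches "www.",
#    "example.com." (trailing dot), and requests by IP address
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^example\.com$
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```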

I've searched, but haven't found everything all in one place. If there is such a thread, I'd appreciate being directed to it.

I'm trying to fix my site, but as fast as I fix something, something else pops up that could cause a potential problem.

Thanks in advance. Here is my current htaccess:

Options +FollowSymlinks
Options -MultiViews
RewriteEngine on

# Remove index.html from directly requested URLs
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html?
RewriteRule ^(([^/]*/)*)index\.html?$ http://example.com/$1 [R=301,L]

# Redirect www to non-www
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]

# Strip query strings
RewriteCond %{THE_REQUEST} \?
RewriteRule .? http://example.com%{REQUEST_URI}? [R=301,L]

# Redirect any other non-canonical hostname (note: the "!" must be
# separated from the variable by a space)
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^example\.com$
RewriteRule (.*) http://example.com/$1 [R=301,L]

# Collapse doubled slashes in the path
RewriteCond %{REQUEST_URI} ^/(.*)//+(.*)
RewriteRule .* http://example.com/%1/$2 [R=301,L]

# Forbid library user-agents unless from the listed IP ranges
RewriteCond %{HTTP_USER_AGENT} ^(Python[-.]?urllib|Java/?[1-9]\.[0-9]) [NC]
RewriteCond %{REMOTE_ADDR} !^64\.233\.172\.
RewriteCond %{REMOTE_ADDR} !^207\.126\.2(2[4-9]|3[0-9])\.
RewriteCond %{REMOTE_ADDR} !^216\.239\.(3[2-9]|[45][0-9]|6[0-3])\.
RewriteRule .* - [F,L]

# Strip a trailing slash appended after a filename
RewriteRule ^([^.]+\.[^/]+)/ http://example.com/$1 [R=301,L]

# Forbid image hotlinking from these referrers
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?myspace\.com/ [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?blogspot\.com/ [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?suddenlaunch\.com/ [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?ebay\.com/ [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?livejournal\.com/ [NC]
RewriteRule .*\.(jpe?g|gif|bmp|png)$ - [F,L]

jdMorgan

6:47 am on Dec 27, 2006 (gmt 0)

I just wanted to post and let you know that I'm actually testing some code for URL canonicalization and to 'fix' typos in incoming links to prevent duplicate content problems.

However, the code is complicated because I've set several stringent requirements: First, the code has to work in .htaccess on Apache 1.3.x despite the documented mod_rewrite bug [archive.apache.org] in that version. And second, all corrections and canonicalizations must be accomplished with a single external redirect (I want to avoid multiple chained redirects on any given request).

First-pass coding and testing is done and results look good, but like you, every time I think I've got all of the URL-errors covered, something else comes up. So, it'll be a few more days, at least...
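To illustrate the "single external redirect" requirement Jim describes (his actual code is not shown in this thread), one common approach is to correct the URL with internal rewrites first and only issue the 301 at the end, if the client's original request needed any correction at all. A rough, untested sketch, assuming example.com is the canonical host and that query strings are unwanted, as in the original poster's rules:

```apache
RewriteEngine on

# Internal fix-ups first: no R flag, so no redirect is issued yet.
# In per-directory (.htaccess) context the ruleset re-runs after
# each rewrite, so these iterate until the URL is stable.
RewriteRule ^(([^/]*/)*)index\.html?$ $1
RewriteRule ^(.*)//+(.*)$ $1/$2

# Then one external 301 if the request as the client sent it
# (THE_REQUEST) contained anything non-canonical: wrong host,
# index.html, a doubled slash, or a stray "?". The trailing "?"
# on the substitution drops the query string.
RewriteCond %{HTTP_HOST} !^example\.com$ [OR]
RewriteCond %{THE_REQUEST} index\.html?|//|\?
RewriteRule ^(.*)$ http://example.com/$1? [R=301,L]
```

Whatever combination of defects the incoming URL had, the client sees at most one redirect, to the fully corrected URL.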

Jim

CainIV

2:33 am on Dec 29, 2006 (gmt 0)

I would be interested in this as well, Jim. If and when you complete testing, please mention it in the Google forum as well, so that others can test and use it.

Thanks,
Todd

AndyA

9:39 pm on Jan 2, 2007 (gmt 0)

Thanks for the update, Jim.

I keep seeing strange requests for URLs to which my server is returning a 200 OK response, and I don't like that! I thought I had everything taken care of, but apparently not.

When this is done, I hope it gets pinned, because it will save lots of people the frustration of dealing with duplicate content penalties due to their server environment not being Google friendly. Bits and pieces of effective code for htaccess are spread out all over the Internet, but no one has a fix for all of the most common problems.

I'll be watching this thread for updates! And thanks again, Jim.

jdMorgan

2:29 am on Jan 3, 2007 (gmt 0)

I just spent the better part of today polishing and testing. Still have some work to do, though; it's rather complex.

If you're seeing 200-OK responses for strange URLs in your logs, you should get after those URLs with a server headers checker, and see what's going on.

One common problem is that some hosts enable content-negotiation and MultiViews by default. This causes Apache to look for a 'best-fit' file if the requested URL does not resolve to an existing file. You'll rarely see a 404-Not Found on the server if MultiViews are enabled.

If you see symptoms like this, change the existing Options directive in your .htaccess file (or add one):


Options +FollowSymLinks -MultiViews

Jim

coopster

7:29 pm on Jan 3, 2007 (gmt 0)

"You'll rarely see a 404-Not Found on the server if MultiViews are enabled."

Could you explain your thoughts here a bit more, jd? I have a properly configured site currently running with MultiViews enabled. I can request the following pages where only the first one is a real document and I get the corresponding response codes ...

http://www.example.com/directory/actualfilename    200
http://www.example.com/directory/actualfilenam     404
http://www.example.com/directory/actualfilenames   404

... and the content delivered is the proper page for the 200 and my ErrorDocument for the 404.

jdMorgan

8:31 pm on Jan 3, 2007 (gmt 0)

My comment is perhaps overly-general if taken out of the context of this thread. My head is definitely wrapped around the code I've been working on, and so I'm stuck in that rut. From the documentation:

"A MultiViews search is enabled by the MultiViews Option. If the server receives a request for /some/dir/foo and /some/dir/foo does not exist, then the server reads the directory looking for all files named foo.*, and effectively fakes up a type map which names all those files, assigning them the same media types and content-encodings it would have if the client had asked for one of them by name. It then chooses the best match to the client's requirements, and returns that document."

So, MultiViews will partially handle many of the URL fix-ups we've discussed in the past, with the critical distinction that it does an internal rewrite to the 'best-fit' document, rather than an external redirect. As such, it prevents many 404-Not Found errors, but does not help at all with domain or URL-canonicalization in search results. Actually, it hinders such canonicalization, because it prevents a 404 from being returned in many cases.
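In mod_rewrite terms, the internal-rewrite behavior described here can be sketched roughly as follows (for illustration only; the real MultiViews negotiation also weighs media types and encodings, and the .html fallback is an assumption):

```apache
# Roughly what MultiViews does for /foo when only foo.html exists:
# an internal rewrite, so the client still sees /foo and a 200.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(.+)$ $1.html [L]

# What it does NOT do is issue an external redirect (R=301) that
# would expose a single canonical URL to the client and to
# search engines.
```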

As such, MultiViews are best used for sites which can return pages in several languages and/or formats, rather than for implementing 'friendly' URLs or for other applications beyond multiple language, encoding, and MIME-type support.

Jim

coopster

2:01 pm on Jan 4, 2007 (gmt 0)

"My comment is perhaps overly-general if taken out of the context of this thread."

I kind of thought that might be the direction but wanted to confirm instead of assume.

I agree with your last statement regarding the intent and purpose of the Content Negotiation module. At the same time, I would add that a properly configured server with either a type-map or MultiViews does indeed lend itself nicely to extensionless URIs ('friendly' URLs). I've used this practice effectively on more than one occasion.
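For readers curious what that looks like in practice, a minimal sketch of the MultiViews variant (hypothetical filenames; not coopster's actual configuration):

```apache
# With content negotiation enabled, a request for /about is
# negotiated to about.html (or about.php, etc.) when no file
# named exactly "about" exists on disk.
Options +MultiViews
```

The trade-off is the one discussed above: the mapping is an internal rewrite, so both /about and /about.html remain reachable unless the extensioned form is explicitly redirected.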

I didn't intend to get this thread off-topic, just wanted to firm up some of the MultiViews part of the discussion. Thanks for the comments.

jdMorgan

8:55 pm on Jan 4, 2007 (gmt 0)

A solution for the original poster's question has been posted here [webmasterworld.com].

Jim