MultiViews and duplicate content

Avoiding penalties

coopster

1:42 pm on Jul 24, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



In a recent discussion regarding mod_negotiation, MultiViews and type maps [webmasterworld.com], member encyclo stated:


mod_negotiation and MultiViews are powerful tools when used correctly, but they do come with a penalty - they open the door to widespread duplicate content on the site as the same resource resolves under many different URIs. Therefore it is usually best to disable it unless required.

Can you expound on this a bit? I understand what you are saying regarding the same resource resolving under many URIs. For example, with MultiViews on, the resource index.htm in the home directory could be found using endless URIs:
http://example.com/ 
http://example.com/index
http://example.com/index.htm
http://example.com/index/fakefile
http://example.com/index/fakedir/
http://example.com/index/fakedir/fakefile

... all returning the same content if the server side isn't parsing the path and serving different pages accordingly. However, if the links on the site are all written as I designed them, how could I get penalized for duplicate content? I'm trying to think this through ... even if I as the developer never used those URIs on my pages, that would not stop somebody else from using them in an external link and causing me problems, would it? That would certainly be a malicious act, but possible nonetheless, I guess.
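
For reference, encyclo's suggested remedy -- disabling MultiViews where it isn't required -- is a one-line Options change. A minimal .htaccess sketch, assuming AllowOverride permits Options in that directory:

# Stop negotiation from mapping /index, /index/fakefile, etc. onto index.htm
Options -MultiViews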

lammert

2:19 pm on Jul 31, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



One thing you can do is compare the actual document URL (stored in the environment variable DOCUMENT_URI) with the requested URL. I have set up a website with MultiViews and SSI which contains the following code in the head of every shtml file:

<!--#if expr="$DOCUMENT_URI = $REQUEST_URI" -->
<meta name="robots" content="index,follow,noarchive">
<!--#else -->
<meta name="robots" content="noindex,follow">
<!--#endif -->

You can fine-tune this by replacing the $DOCUMENT_URI variable with the preferred version of your URL, for example:

<!--#if expr="/index.html = $REQUEST_URI" -->
...
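
Spelled out in full, that fine-tuned version would presumably read like this (same meta lines as before, with /index.html standing in for whatever URL you prefer):

<!--#if expr="/index.html = $REQUEST_URI" -->
<meta name="robots" content="index,follow,noarchive">
<!--#else -->
<meta name="robots" content="noindex,follow">
<!--#endif -->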

coopster

2:42 am on Aug 2, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I was thinking along the same lines, lammert. Except DOCUMENT_URI isn't always available -- yes, it seems to be readily available as a var in mod_include [httpd.apache.org], so you could indeed check it using SSI, but I most often use PHP for server-side work. There are other ways to check the requested document, though, so there are indeed workarounds.

I was thinking that returning a 404, as opposed to the "noindex,follow", would be more appropriate -- your thoughts?
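
In PHP, a rough sketch of both options might look like this (the /index.html canonical path is just a placeholder, not anything from the setup discussed above):

<?php
// Hypothetical canonical path for this resource.
$canonical = '/index.html';

// Compare the requested path (query string stripped) against the canonical URL.
$requested = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

if ($requested !== $canonical) {
    // Option 1: refuse negotiated aliases outright with a 404.
    header($_SERVER['SERVER_PROTOCOL'] . ' 404 Not Found');
    exit;

    // Option 2 (instead of the 404): serve the page but keep it out of the index.
    // echo '<meta name="robots" content="noindex,follow">' . "\n";
}
?>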

lammert

11:14 am on Aug 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have a special reason for the "noindex,follow" instead of a 404. It is a website with only one page, but in several languages. I wanted people visiting the site via www.example.com to see the version best matching their language settings, but I also wanted all language versions to be indexed by the search engines. The site contains various pages like index.en.html, index.de.html, index.nl.html, etc.

httpd.conf contains:

<Directory "/var/www/html">
    Options +MultiViews +Includes
    DirectoryIndex index
    AddHandler server-parsed .html
</Directory>

<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/html
    Redirect 301 /index.html http://www.example.com/
</VirtualHost>

First, some explanation of the httpd.conf. I preferred index.LANG.html over index.html.LANG, so I set index as the default directory index and use a 301 redirect from /index.html for those visitors who are used to typing www.example.com/index.html. (Note that Options directives must be all relative or all absolute, hence the + signs.) I added the SSI handler to files with the .html extension because I prefer that the user sees file.html instead of file.shtml.
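
A side note, an assumption on my part rather than part of the setup above: on Apache 2.x the documented way to activate SSI is the INCLUDES output filter, so the AddHandler line may need to be replaced by, or accompanied by, something like:

AddOutputFilter INCLUDES .html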

Search engines do not use language negotiation AFAIK, so the contents of index.en.html will be sent to them by default when they request www.example.com/. For this file, DOCUMENT_URI (/index.en.html) and REQUEST_URI (/) are different and the "noindex,follow" robots meta tag is sent. Search engines won't display this page in the SERPs because of the "noindex".

On the index.en.html page are links to all the other language pages. Because of the "follow" in the robots meta tag, these pages are crawled. They are crawled under their real names, such as /index.de.html, so REQUEST_URI equals DOCUMENT_URI and the "index,follow,noarchive" meta tag is added. The individual language pages are placed in the index because of this.

Because every language page contains a link to index.en.html, the crawler will eventually also request index.en.html with a REQUEST_URI of /index.en.html. This version of the page will be indexed.
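
To make the two crawl cases concrete (values as described above):

Request for /
  REQUEST_URI  = /
  DOCUMENT_URI = /index.en.html  (different, so "noindex,follow")

Request for /index.en.html
  REQUEST_URI  = /index.en.html
  DOCUMENT_URI = /index.en.html  (equal, so "index,follow,noarchive")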

Net effect: all the index.LANG.html pages are indexed, and I have no duplicate content problems with www.example.com/ being equal to www.example.com/index.en.html.