mod_negotiation and MultiViews are powerful tools when used correctly, but they come with a penalty: they open the door to widespread duplicate content on the site, because the same resource resolves under many different URIs. It is therefore usually best to disable them unless they are required.
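For reference, turning MultiViews off for a directory tree might look like the sketch below. The directory path is an assumption used for illustration; adjust it to the actual document root.

```apache
# Assumed path; use the real document root here.
<Directory "/var/www/html">
    # Remove MultiViews from whatever options are inherited,
    # so /index no longer resolves to index.htm, index.html, etc.
    Options -MultiViews
</Directory>
```

With this in place, a request for a name that has no exact file match should get a 404 rather than a negotiated variant.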
Can you expound on this a bit? I understand what you are saying regarding resolving the same resource. For example, with MultiViews on, the resource index.htm in the home directory could be found using endless URIs:
http://example.com/
http://example.com/index
http://example.com/index.htm
http://example.com/index/fakefile
http://example.com/index/fakedir/
http://example.com/index/fakedir/fakefile
<!--#IF EXPR="$DOCUMENT_URI=$REQUEST_URI" -->
<meta name="robots" content="index,follow,noarchive">
<!--#ELSE -->
<meta name="robots" content="noindex,follow">
<!--#ENDIF -->
You can fine-tune this by replacing the $DOCUMENT_URI variable with the preferred version of your URL, for example:
<!--#IF EXPR="/index.html=$REQUEST_URI" -->
...
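A side note on server versions, offered as a sketch rather than a tested setup: the unquoted `$A=$B` comparison used above is the older mod_include expression syntax. As far as I know, Apache 2.4 and later evaluate `expr` with a newer expression parser by default, so the older syntax has to be switched back on. The directory path is an assumption.

```apache
# Assumed path; use the directory that holds the SSI-parsed pages.
<Directory "/var/www/html">
    # Apache 2.4+: evaluate SSI conditions with the pre-2.4 expression syntax.
    SSILegacyExprParser on
</Directory>
```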
DOCUMENT_URI isn't always available -- yes, it seems to be readily available as a var in mod_include [httpd.apache.org], so you could indeed check it using SSI, but I most often use PHP for server-side work. There are other ways to check the requested document, though, so there are workarounds.
I was thinking that returning a 404, as opposed to the "noindex,follow", would be more appropriate -- your thoughts?
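If a 404 is the goal for the padded URLs (such as /index/fakefile), one option that avoids scripting is to have Apache reject trailing path information outright. This is a sketch under the assumption that nothing on the site relies on PATH_INFO; the directory path is also an assumption.

```apache
# Assumed path; use the real document root here.
<Directory "/var/www/html">
    # Reject requests that carry extra path segments after a real file,
    # e.g. /index.htm/fakefile, with a 404 Not Found.
    AcceptPathInfo Off
</Directory>
```

This only covers the trailing-path cases. The extensionless /index form is still served by MultiViews, so it would need the meta tag approach or a redirect.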
httpd.conf contains:
<Directory "/var/www/html">
Options MultiViews Includes
DirectoryIndex index
AddHandler server-parsed .html
</Directory>
<VirtualHost *:80>
ServerName www.example.com
DocumentRoot /var/www/html
Redirect 301 /index.html http://www.example.com/
</VirtualHost>
First, some explanation of the httpd.conf. I prefer index.LANG.html to index.html.LANG, so I have set index as the default directory index and use a 301 redirect from index.html for those visitors used to typing in www.example.com/index.html. I added the SSI handler to files with the .html extension because I prefer that the user sees file.html instead of file.shtml.
Search engines do not use language negotiation AFAIK, so the contents of index.en.html will be sent to them by default when they request www.example.com/. For this request, DOCUMENT_URI (/index.en.html) and REQUEST_URI (/) are different, so the robots meta tag "noindex,follow" is sent. Search engines won't display this page in the SERPs because of the "noindex".
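Which variant a client without an Accept-Language header receives is decided by the negotiation settings, so it can help to spell them out. The following is a minimal sketch, assuming only English and German variants exist and that English should be the default; the language list is hypothetical beyond the en and de files mentioned in this post.

```apache
# Map the .en / .de filename parts to languages (mod_mime).
AddLanguage en .en
AddLanguage de .de

# Order used to break ties when the client expresses no preference.
LanguagePriority en de

# Serve the first LanguagePriority match instead of a 300 or 406 response.
ForceLanguagePriority Prefer Fallback
```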
On the index.en.html page are links to all the other language pages. Because of the "follow" in the robots meta tag, these pages are crawled. They are crawled under their real names, such as "/index.de.html", so REQUEST_URI equals DOCUMENT_URI and the "index,follow,noarchive" meta tag is added. The individual language pages will therefore be placed in the index.
Because every language page contains a link to index.en.html, the crawler will eventually also request index.en.html with a REQUEST_URI of "/index.en.html". This version of the page will be indexed.
The net effect is that all the "index.LANG.html" pages are indexed, and that I have no duplicate content problems with www.example.com/ being equal to www.example.com/index.en.html.