Microsoft versus Apache - Microsoft wins?

Is there an alias design fault in Apache?

         

bcrbcr

7:17 pm on Apr 15, 2007 (gmt 0)

10+ Year Member



I am struggling with an alias or canonical domain issue these last few days. Some of my websites run on Apache/Linux/Unix, others run on Windows IIS. The Apache/Linux/Unix sites appear to have a duplicate set of files - which creates a duplicate content issue in the SEs (I think...). If I am right then this creates a PR reduction problem with the likes of Google.

I have solved the www versus non-www redirect with the mod_rewrite section of my .htaccess file, so it is not that problem.

My sitemap generator program on one of my sites found 30 pages, where only 15 exist. But on further investigation, 30 do exist, but 15 are straight duplicates.

My principal site - and preferred domain - in this case is
www.example.co.uk/ - a small site with 15 pages.
This site has a PR of 3 on the home page

There is also a site out there called
www.example.co.uk./ - with an extra dot after the UK and before the last forward slash. I did not create this site

This "extra" site has another 15 files, and the index page has a PR of zero.

Both sites are visible on a Google search, and both sites come up cleanly within a browser, with identical content.

My understanding of the Google-Voodoo is that this represents duplicate content, and the content filter would be applied - thus reducing PR etc.

I have checked quite a few sites now, and those that appear to be hosted on Apache servers all appear to have the same problem.

Those sites hosted on windows servers don't display the same characteristic.

Example in my local area www.quux-foo.com (Windows based)
When you enter www.quux-foo.com./ (with an extra dot) a clean site is served as www.quux-foo.com (without the offending dot)

I have raised this question on Google's webmaster forum but no-one seems to want to take the issue or discussion on.

I have read that Apache and other servers create various aliases for internal purposes and shorthand processing work. This is one of the reasons that www.example.com and example.com exist side by side, and we need to adjust through the .htaccess file for one preferred domain, as I understand it. On another site I am told that the HELM management system creates aliases by default.

I have assumed that this extra set of files with the extra dot has come from the same server source for technical purposes ...

So here we are in an Apache forum - where I assume people have a more specific Apache server knowledge and experience than on a general Google forum. So if anyone wants to respond, some questions -

- does anyone know why this happens?
- Is it a feature of Apache servers? (maybe I'm wrong)
- Am I right in saying that a penalty exists because there IS duplicate content created?
- How do I fix it? - and
- do I need to fix it - or are the extra files irrelevant, a mirage, and SEs REALLY don't take them into account?

[edited by: jdMorgan at 8:35 pm (utc) on April 15, 2007]
[edit reason] No URLs or specifics, please. See TOS. [/edit]

jdMorgan

8:30 pm on Apr 15, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



- does anyone know why this happens?
Because of the DNS and server configuration -- all done by humans.

- Is it a feature of Apache servers? (maybe I'm wrong)
No, it's a feature of DNS and server configuration options. It has nothing to do with Apache vs. IIS.

Hosting companies typically configure the domain and its www subdomain to resolve to the same resources (files, scripts, etc.) on the server. Left like that, you run the risk of duplicate content. Why do they do it? Because some of their customers want to use the www.example.com subdomain, some want to use the example.com domain, and the hosting companies don't want to be bothered with this once the account is activated, so they set up both and leave it to you to deal with (if you're even aware of it).

- Am I right in saying that a penalty exists because there IS duplicate content created?
Not a penalty, but rather a diluting effect; You spread your link popularity and PageRank across two or more URLs.

- How do I fix it? - and
Do a search here on WebmasterWorld for "canonical domains" and "canonical URLs", and implement the suggestions found on an as-needed basis. Redirect all non-canonical URLs to the canonical equivalent. Examples are:

example.com/
www.example.com/
xyz.example.com/
example.com/index.html
www.example.com/index.html
xyz.example.com/index.html

That's six URLs -- all pointing to the same file. Most SEO-savvy members will recommend redirecting them all to either example.com/ or www.example.com/ -- Your choice, but pick one, link to it consistently, and redirect all of the others.

- do I need to fix it - or are the extra files irrelevant, a mirage, and SEs REALLY don't take them into account?
Yes. Search engines will be happiest (and you will, too) if every resource in your domain has one and only one URL. If alternates exist, they should be 301-redirected to the proper URL.

The good news is that for new, simple sites, most of these issues can be precluded/handled with only four directives -- all previously posted here.
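
As a rough illustration of how small such a rule set can be, here is a minimal sketch for a root .htaccess file. It assumes www.example.com is the chosen canonical host and that index.html is the directory index; both are placeholders, and this is not necessarily the exact set of four directives referred to above.

```apache
# Sketch only: www.example.com and index.html are assumed placeholders.
RewriteEngine On

# Redirect direct client requests for .../index.html to the bare directory URL.
# THE_REQUEST is checked so that internal DirectoryIndex lookups do not loop.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html$ http://www.example.com/$1 [R=301,L]

# Redirect every hostname variant that is not exactly the canonical one.
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

With rules like these, all six example URLs listed above should end up at www.example.com/ after at most two redirects. Test on a non-production copy first, since other rules or modules on the account may interact with them.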

Jim

bcrbcr

10:25 pm on Apr 15, 2007 (gmt 0)

10+ Year Member



Jim
Thanks for the reply - and apologies for the specific domains.

Most of what you have said seems to tie up with my own suppositions.

All the aliases that you have listed I have covered with my standard .htaccess treatment.

I still have a problem with this "trailing dot" domain, ie

www.domain.com./ (final dot before the slash)

Can you direct me to a source to explain the syntax to redirect this one?

I can do the front end redirects, and the /index.html's etc. This dot before the slash is giving me problems

Mod rewrite seems to have special rules with the main host name up to the forward slash - the folders and files after that are relatively easy to deal with.

But a dot after the ".com" (ie .com./) or a dot after the ".co.uk" (.co.uk./) is very confusing

I really appreciate your input

bryan

Achernar

10:34 pm on Apr 15, 2007 (gmt 0)

10+ Year Member Top Contributors Of The Month



My sitemap generator program on one of my sites found 30 pages, where only 15 exist. But on further investigation, 30 do exist, but 15 are straight duplicates.

My principal site - and preferred domain - in this case is
www.example.co.uk/ - a small site with 15 pages.
This site has a PR of 3 on the home page

There is also a site out there called
www.example.co.uk./ - with an extra dot after the UK and before the last forward slash. I did not create this site


If by this you mean that your sitemap program has analyzed your site and found 15 links with hostname www.example.co.uk/ and 15 with www.example.co.uk./ , it means that somewhere on your pages there is one link that points to www.example.co.uk./ . There is no other way the program could have come up with this hostname.

Example in my local area www.quux-foo.com (Windows based)
When you enter www.quux-foo.com./ (with an extra dot) a clean site is served as www.quux-foo.com (without the offending dot)

[microsoft.com....] works perfectly from here. And they run it on Microsoft-IIS/6.0. ;)
The problem is not specific to a type of webserver. A hostname with a final dot is valid, in the sense that it resolves (when a DNS server is queried), and there is nothing wrong with it.

Now, for a solution to your situation. First, check your pages for any links that use a hostname with a final dot.
Second, you can configure Apache to redirect www.example.co.uk./ to www.example.co.uk/

RewriteCond %{HTTP_HOST} ^.*example\.co\.uk\.$ 
RewriteRule ^(.*)$ http://www.example.co.uk/$1 [R]

jdMorgan

10:43 pm on Apr 15, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In root .htaccess:
# Redirect to remove trailing port number or period (or both) from hostname
RewriteCond %{HTTP_HOST} ^www\.example\.co\.uk(:[0-9]+|\.|\.:[0-9]+)$ [NC]
RewriteRule (.*) http://www.example.co.uk/$1 [R=301,L]

or, alternately:

# Redirect all non-canonical domain variants to canonical domain
RewriteCond %{HTTP_HOST} !^www\.example\.co\.uk$
RewriteRule (.*) http://www.example.co.uk/$1 [R=301,L]

Jim

jdMorgan

10:46 pm on Apr 15, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The most common source of 'dotted' domains is forum software that auto-links URLs. When a user types in a URL, and it is at the end of a sentence, then the period is included in the auto-linked URL. For example, [webmasterworld.com....]

Jim

bcrbcr

10:58 pm on Apr 15, 2007 (gmt 0)

10+ Year Member



Jim, Achernar
Appreciate your replies again.
It's late here in Spain so I'll look at this tomorrow.
I've just found some of your reference documents, Jim, from January and December, so I'm trawling through that also.
I'll also double check links
Thanks
Bryan

g1smd

11:10 pm on Apr 15, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



IIS includes a "flaw" that Apache does not have.

IIS is case insensitive, so your index page on an IIS server could be accessed using... index.html, index.htmL, index.htMl, index.htML, index.hTml, index.hTmL, index.hTMl, ... and all other permutations up to INDEX.HtML, INDEX.HTml, INDEX.HTmL, INDEX.HTMl, and INDEX.HTML.

Now that is a problem!

bcrbcr

6:41 am on Apr 16, 2007 (gmt 0)

10+ Year Member



OK guys
Found 1 link on site with .co.uk. (in my site map of all places)
So that will be removed, then I get back to repairing the problem with some code.

Thank you so much for your help
B

bcrbcr

9:40 am on Apr 16, 2007 (gmt 0)

10+ Year Member



Jim
You're a genius.

I have fixed the bad link which was causing mysitemapbuilder to recognise the extra dot series of files (as Achernar pointed out), then used your second rule suggestion

RewriteCond %{HTTP_HOST} !^www\.domain\.co\.uk$
RewriteRule (.*) [domain.co.uk...] [R=301,L]

which I hadn't thought of doing - ie if the domain is NOT written like this, then re-write it - much cleaner and covers nearly every situation I was worried about. (I think I understood that right didn't I?)

Thanks loads
B

bcrbcr

4:25 pm on Apr 16, 2007 (gmt 0)

10+ Year Member



Jim
Final part of this - I forgot to mention I have FrontPage on this site, and the solution locked me out.
So I have found the following solution from another site which deals with FrontPage and Mod Rewrite

In the FrontPage directory of _vti_bin, plus the subdirectories of _vti_adm and _vti_aut I have added a line with
Options +FollowSymlinks

And all works well (I hope).

No technical issues with this are there?

Again thanks for your help
B

jdMorgan

4:44 pm on Apr 16, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No, no issues.

I'm surprised you found that "on another site," since it was apparently first-reported by WebmasterWorld member "chopin2256" here [webmasterworld.com], and credited to member "Bumpski". :)

See post #1496336 in that thread for details.

Jim

bcrbcr

4:51 pm on Apr 16, 2007 (gmt 0)

10+ Year Member



Thanks again Jim
page was (if you can accept specifics - apologies)
[wordpress.org...]

I'll spend more time looking through webmaster world next time
B

jtara

4:58 pm on Apr 16, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As pointed out by another poster, the www/non-www duplication has nothing to do with technical internals of web servers, but with historical conventions. When the web was new, sites almost exclusively used "www". There has been a slow move in the direction of dropping the "www", and even sites that prefer "www" usually alias the bare domain as well, in order to catch user errors in type-ins.

As far as the trailing dot - what a great find, which I am surprised has not been discussed here before. (At least I hadn't noticed it.) It points out how important it is to keep-up with changes that may at first glance appear not to be related to your site.

Auto-linking has become popular, and I'd never imagined that this flaw in auto-linking software would exist so widely and have this impact.

Add "trailing dot removal" to the list of "must have" rewrites!

I can offer a bit of insight as to why the "." is being accepted by browsers in the first place: a "." is, indeed, legal at the end of a domain name. By adding the final ".", the domain name becomes a "fully-qualified domain name", or FQDN. This indicates (normally, to the local operating system and/or network) that the domain name should not be further suffixed with a default domain name.

It's a bit of arcana that is unknown to and ignored by 99+% of Internet users. There are very few situations where the average user would ever need to add the "." for the domain name to properly-resolve. (Although the people at news.com - which is really news.com.com - might have a use internally... :) )

I really think this is a browser flaw, as well as an obvious flaw in the auto-linking software. I think browsers should remove the trailing dot from the "host" header that they send to the website.
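
Until browsers do that, the "trailing dot removal" rewrite can be written without hard-coding the domain. The following is a minimal sketch for a root .htaccess file, assuming mod_rewrite is enabled and the site is served over plain http; it simply reuses whatever hostname the client sent, minus the final dot, and drops any explicit port.

```apache
# Sketch only: strips a trailing dot (and any port) from the requested hostname.
RewriteEngine On

# %1 captures the hostname without its final dot; an optional :port is discarded.
RewriteCond %{HTTP_HOST} ^([^:]+)\.(:[0-9]+)?$
RewriteRule (.*) http://%1/$1 [R=301,L]
```

Because the rule redirects to the same name without the dot, it does not choose between www and non-www; a separate canonical-host rule is still needed for that.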

scraulb

5:11 pm on Apr 16, 2007 (gmt 0)

10+ Year Member



I tried this on mysite but had to change it to this:

RewriteEngine on
RewriteCond %{HTTP_HOST} !^www\.mysite\.com$
RewriteRule (.*) [mysite\.com$1...] [R=301,L]

Note the no / before $1 on RewriteRule.

I was getting in IE [mysite.com...] until I got rid of the /

Does this seem correct?

P.S. [webmasterworld.com....] does not redirect
to [webmasterworld.com...]

P.P.S. If you look it up in Google, it does not seem to be indexing site:http://www.webmasterworld.com./

jdMorgan

5:38 pm on Apr 16, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Did you add that code to httpd.conf or conf.d, instead of .htaccess? If so, then the leading slash will be present in the URL-path examined by RewriteRule, and will need to be accounted for.

For use in server config files, you can also use:


RewriteRule ^/(.*)$ http://www.example.com/$1 [R=301,L]

which acts the same, but explicitly shows the difference between URL-path-patterns in server-config versus .htaccess files.

Or, to make the code "portable" between the config files and the top-level .htaccess file,


RewriteRule ^/?(.*)$ http://www.example.com/$1 [R=301,L]

(I don't recommend doing that unless you truly need it, though.)
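
For context, here is a sketch of how the server-config form might sit inside a virtual host in httpd.conf. The ServerName, ServerAlias and port are assumptions for illustration only; the point is that in this context the URL-path seen by RewriteRule starts with a slash, so the pattern consumes it and the substitution supplies exactly one.

```apache
# Sketch only: hostnames and port are assumed placeholders.
<VirtualHost *:80>
    ServerName www.example.com
    ServerAlias example.com

    RewriteEngine On

    # In server or virtual-host context the URL-path begins with "/",
    # so match it in the pattern to avoid a doubled slash in the redirect.
    RewriteCond %{HTTP_HOST} !^www\.example\.com$
    RewriteRule ^/(.*)$ http://www.example.com/$1 [R=301,L]
</VirtualHost>
```

Using the .htaccess form (.*) here instead would capture the leading slash as part of $1, which is one way to end up with a double slash after the hostname in the redirect target.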

> P.S. [webmasterworld.com....] does not redirect
...which is why I used it as an example... :)

Jim

scraulb

6:14 pm on Apr 16, 2007 (gmt 0)

10+ Year Member



Aaah.. Yes we put it in the httpd.conf. That would explain it!

Thanks