Below I'm showing that Google sometimes intentionally ignores the content of the robots.txt file: it shows prohibited pages in its index even though it knows it mustn't.
All pages of my site http://www.example.com have a standard "mailto:" link named "kontakt".
I noticed that if I ask Google for site:www.example.com, it returns "about 6140 results". But if I ask for site:www.example.com kontakt, it reports only 2190 results.
Where's the rest?
Let's see what happens if I ask for site:www.example.com -kontakt - we obtain "Results 1 - 3 of about 4,380" <snip>
Apart from 1 (one) RTF file and one forgotten file that is NOT LINKED TO anywhere, the rest are pages that are PROHIBITED BY THE ROBOTS.TXT FILE!
The clue is here: the prohibited pages live on port 2317, which serves a different robots.txt file: http://www.example.com:2317/robots.txt
User-Agent: *
Disallow: /

Google does NOT index the content of these pages, which means it DOES know what is disallowed. But it makes use of all the data it has from other sources: the URL and the anchor text, which can carry a lot of information.
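To confirm that this robots.txt really does forbid every URL on the port-2317 host to every crawler, here is a minimal check using Python's standard-library robots.txt parser. This is just an illustration of how a well-behaved crawler would interpret the file quoted above; the URLs are the example ones from this post.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt served on port 2317, exactly as quoted above.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /
"""

# Parse it the way a compliant crawler would.
rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# "Disallow: /" under "User-Agent: *" blocks every path for every bot,
# Googlebot included.
print(rp.can_fetch("Googlebot", "http://www.example.com:2317/"))          # False
print(rp.can_fetch("Googlebot", "http://www.example.com:2317/any/page"))  # False
```

So a crawler honoring the file may not fetch (and therefore should not index) anything on that host - which is exactly why the indexed entries can only have been built from URLs and anchor text gathered elsewhere.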
I can prove that I did not modify the ":2317/robots.txt" file: port 2317 is used by "GeneWeb 4.10", a specialized web server for drawing genealogical trees that is part of the standard Debian Linux distribution. The robots.txt is produced by the server itself and cannot be edited or modified.
Sad, isn't it?
[edited by: goodroi at 1:29 pm (utc) on Mar. 26, 2007] [edit reason] Please no specific links [/edit]