G. ignores robots.txt at other ports, gives false page count

Below I show that Google sometimes deliberately ignores the content of a robots.txt file: it lists prohibited pages in its index even though it knows it must not.

All pages of my site http://www.example.com have a standard "mailto:" link named "kontakt".

I noticed that if I ask Google for:
site:www.example.com
it returns "about 6140 results". But if I ask for
site:www.example.com kontakt
it reports only 2190 results.

Where's the rest?

Let's see what happens if I ask for
site:www.example.com -kontakt
- we obtain "Results 1 - 3 of about 4,380"
<snip>

Apart from one RTF file and one forgotten file that nothing links to, the rest are pages that are PROHIBITED BY THE ROBOTS.TXT FILE!

The clue is this: the prohibited pages are on port 2317, which has a different robots.txt file: http://www.example.com:2317/robots.txt

User-Agent: *
Disallow: /
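To make the per-port point concrete, here is a minimal sketch using Python's standard urllib.robotparser. It assumes (as the robots.txt convention has it) that each scheme, host and port combination has its own robots.txt. The helper name robots_url and the page path are made up for illustration, and nothing is fetched over the network.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url):
    """Return the robots.txt URL that governs page_url (same scheme, host and port)."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# The two rules quoted above, parsed offline instead of fetched from the server.
port_2317_rules = RobotFileParser()
port_2317_rules.parse(["User-Agent: *", "Disallow: /"])

# Hypothetical page on the GeneWeb port; the path is an assumption.
page = "http://www.example.com:2317/some/page"

print(robots_url(page))
print(port_2317_rules.can_fetch("Googlebot", page))
```

The first line printed is the robots.txt on port 2317, not the one on the default port, and can_fetch reports that no compliant crawler may request the page.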

Google DOES NOT index the content of the pages, which means that it DOES KNOW what IS DISALLOWED. But it makes use of all the data it has from other sources: the URL and the anchor text, which can carry a lot of information.
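A rough sketch of how such URL-only entries could arise, in Python. This is only an illustration of the idea, not Google's actual pipeline: a crawler reads a page it IS allowed to fetch, records every link's URL and anchor text, and so ends up knowing about disallowed pages without ever requesting them. The class name LinkCollector, the sample HTML and the link target are all invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, anchor_text) pairs
        self._href = None    # target of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page on the default port, which the crawler may fetch.
page_url = "http://www.example.com/index.html"
page_html = '<a href="http://www.example.com:2317/tree?p=x">Family tree of person X</a>'

# Rules served on port 2317, as quoted in the post.
port_2317_rules = RobotFileParser()
port_2317_rules.parse(["User-Agent: *", "Disallow: /"])

collector = LinkCollector(page_url)
collector.feed(page_html)

# Links whose targets may not be crawled: only URL and anchor text are known.
url_only_entries = [
    (url, text)
    for url, text in collector.links
    if urlsplit(url).port == 2317 and not port_2317_rules.can_fetch("Googlebot", url)
]
print(url_only_entries)
```

The disallowed target is never requested, yet its URL and anchor text are already in hand, which is enough to list it in results.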

I can prove that I did not modify the ":2317/robots.txt" file: port 2317 is used by "GeneWeb 4.10", a specialized web server for drawing genealogical trees, which is part of the standard Debian Linux distribution. The robots.txt is produced by the server itself and cannot be edited.

So, if you have content that should not appear in Google results, creating a correct robots.txt is not enough. You should probably think about cloaking the links to the disallowed pages on the pages that contain them (with JavaScript or similar).

Sad, isn't it?

[edited by: goodroi at 1:29 pm (utc) on Mar. 26, 2007]
[edit reason] Please no specific links [/edit]
