Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
G. ignores robots.txt at other ports, gives false page count
Google shows in its index pages that it never saw, even prohibited
OnetSzukaj




msg:3289744
 2:58 pm on Mar 22, 2007 (gmt 0)

Below I show that Google sometimes intentionally ignores the content of the robots.txt file: it lists prohibited pages in its index even though it knows it must not.

All pages of my site http://www.example.com have a standard "mailto:" link named "kontakt".

I noticed that if I ask Google for:
site:www.example.com
it returns "about 6140 results". But if I ask for
site:www.example.com kontakt
it reports only 2190 results.

Where's the rest?

Let's see what happens if I ask for
site:www.example.com -kontakt
We obtain "Results 1 - 3 of about 4,380"
<snip>

Apart from one RTF file and one forgotten file that nothing links to, the rest are pages that are PROHIBITED BY THE ROBOTS.TXT FILE!

The clue is this: the prohibited pages are served on port 2317, which has a different robots.txt file: http://www.example.com:2317/robots.txt
User-Agent: *
Disallow: /

Google DOES NOT index the content of these pages, which means that it DOES KNOW what IS DISALLOWED. But it makes use of all the data it has from other sources: the URL and the anchor text, which can carry a lot of information.
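
To make the per-port point concrete, here is a minimal Python sketch using the standard library's urllib.robotparser. It assumes that a crawler keeps a separate rule set for each host:port pair. The port-2317 rules are the ones quoted above; the port-80 rules are NOT shown in this post, so the "allow everything" file below is only an assumption, and the helper names (make_parser, may_crawl) and the sample paths are made up for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# One rule set per host:port, because each port serves its own robots.txt.
RULES = {
    # ASSUMPTION: the default-port file is not quoted in the post.
    # An empty Disallow line means "nothing is disallowed".
    "www.example.com": make_parser("User-agent: *\nDisallow:\n"),
    # Quoted in the post: everything on port 2317 is disallowed.
    "www.example.com:2317": make_parser("User-Agent: *\nDisallow: /\n"),
}


def may_crawl(url: str, agent: str = "Googlebot") -> bool:
    """Return True if the rules for this URL's host:port allow fetching it."""
    netloc = urlsplit(url).netloc
    return RULES[netloc].can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical paths, used only to exercise the two rule sets.
    print(may_crawl("http://www.example.com/index.html"))
    print(may_crawl("http://www.example.com:2317/tree"))
```

Note that this only answers "may the page be fetched?". It says nothing about whether a URL learned from links on other pages may be listed in results, which is exactly the gap described above.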

I can prove that I did not modify the ":2317/robots.txt" file: port 2317 is used by "GeneWeb 4.10", a specialized Web server for drawing genealogical trees, which is part of the standard Debian Linux distribution. The robots.txt is produced by the server itself and cannot be edited.

So, if you have content that should not appear in Google's results, creating a correct robots.txt is not enough. You should probably also think about cloaking the links on pages that point to the disallowed pages (with JavaScript or similar).

Sad, isn't it?

[edited by: goodroi at 1:29 pm (utc) on Mar. 26, 2007]
[edit reason] Please no specific links [/edit]

 

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved