Google: Net Hacker Tool du Jour [wired.com] (-Wired)
The link I cited describes the problem and the fix for it. There is no way to keep Google from listing a link to a page simply by disallowing that page in robots.txt. The link will appear in the Google results with no title or description, because the page has not been spidered (as requested by robots.txt). But the URL itself will still be displayed, simply because it has been found - for example, as a link on any other "allowed" page.
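To make that concrete, here is a minimal robots.txt sketch (the path is hypothetical):

  User-agent: *
  Disallow: /contact-form.html

This tells spiders not to fetch /contact-form.html, but if any crawlable page links to it, Google can still show the bare URL in its results - just without a title or snippet, since the page itself was never retrieved.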
I understand the behaviour based on the definition of "indexing", but wish this behaviour were otherwise. I do not have any "private" pages on the web, but I have plenty of pages where direct entry from a search engine may provide a confusing or "non-optimal" user experience. Also, I really would prefer to keep my "contacts" forms behind the page which describes their terms of use, rather than wave their URLs around and make them easier for harvesters to spot. For these pages, I am now using the method in the post that I cited above, at the cost of decentralized spider control and a bit of extra bandwidth.
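In outline, a workaround of this kind - assuming the on-page robots meta tag approach, which matches the trade-offs just described - is to leave the page un-disallowed in robots.txt and put the directive in the page itself:

  <meta name="robots" content="noindex">

The directive then lives in each page rather than in one central robots.txt (decentralized spider control), and the spider has to download the page before it can see the instruction not to index it (the extra bandwidth).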
As far as I know, Google and Ask are the only SEs which display this behaviour; others interpret a disallow as "don't mention it." As a result, I have special cases for them in my robots.txt.
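For example, the special-casing looks roughly like this (the path is hypothetical, and the user-agent tokens should be checked against each engine's documentation):

  # Google lists disallowed URLs anyway, so nothing is disallowed
  # here; those pages carry the on-page fix instead.
  User-agent: Googlebot
  Disallow:

  # Same idea for Ask (Teoma was its crawler at the time).
  User-agent: Teoma
  Disallow:

  # Everyone else treats a disallow as "don't mention it at all."
  User-agent: *
  Disallow: /contact-form.html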
I don't spend much time complaining about things beyond my control; I just find work-arounds. If Google decides not to display links to disallowed pages in the future, that would be great; if not, I'll live with the fix I found. There are bigger problems to be dealt with on the Web, and this one is pretty minor.
Pragmatically, :)
Jim
It's not up to Google to be responsible for server admin mistakes. If they are low-tech enough to run OSes and servers prone to hackers, it's not Google's job to cover their mistake. It's up to the admins to be as good at running their own server as Google is at running a search engine.
If you publish it on the web - they will find it.