Forum Moderators: Robert Charlton & goodroi
At any rate, robots.txt will serve to block Gbot (and other compliant 'bots) from fetching resources from a server with only these two lines:
User-agent: *
Disallow: /
Jim
Just because you don't have an "http:..." link pointing to something doesn't mean your server isn't serving up a directory listing. Remember, you also have to provide information to someone in order for a domain to resolve, and (insert search engine name here) has access to that information too.
Adding -Indexes to the Options directive in Apache can prevent this. I'd bet there's something similar for the IIS folks.
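For the Apache folks, a minimal sketch of that directive (where you put it depends on your setup, and .htaccess only works if AllowOverride permits Options):

```apache
# In httpd.conf or a per-directory .htaccess:
# turn off auto-generated directory listings, so a folder
# without an index file returns 403 instead of a file list.
Options -Indexes
```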
Also, some of the folks you are allowing to look at things might have posted links somewhere, or may be running less-than-trustworthy software on their systems.
Oh and then there are the proxies.
[edited by: theBear at 9:09 pm (utc) on July 18, 2007]
Once I tested a fetching script called "Wiki Reflection." It is a PHP script that fetches wiki content for user-defined titles; for example, "fetchingScript.php?title=UNIX" fetches the Wiki page on Unix and displays it.
Last week I happened to check site:mysite.com, and voila! Google had indexed 700+ pages with the pattern fetchingScript.php?title=X.
There is no link to this script page, and it is absolutely isolated. So there is no possibility for Google to reach it by following a link pointing to that page.
The only possibility is that Google fetched the whole root directory (the script resides on the root directory).
Some of these pages I have sent email links to people to view - pages with pics for the family. When I redid my old website, I saved the old files under a different name, like contactold.html instead of contact.html. All these old pages are now indexed. I thought Google didn't like duplicate content? When I make updates on clients' pages, I usually rename the old page to something else, because sometimes they change their mind and want to go back, and because I like to keep a backup. Google is out of control!
I know for a fact there are no links. This has happened on more than one server. One is a brand new site, on a new server. I have a front page up with no links that says "under construction." I sent my client a link to get in the back door so they could view the site....
mumbles - As several people have tried to tell you in several ways, when you let pages sit on your server and don't block them, they're liable to get spidered.
Jim Morgan's scenario is likely enough that I'm going to repeat it and emphasize the relevant text...
...Or you visited *their* site from one of your pages, and ended up in their publicly-available "stats" page as a referrer link.
I always have developers I'm working with either use a password protected directory, or put the meta robots noindex tag in all temporary or development pages that are in a public directory...
<meta name="robots" content="noindex, nofollow">
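For the password-protected-directory option, a rough Apache sketch (the directives are standard Basic Auth; the paths and realm name here are just placeholders):

```apache
# .htaccess in the development directory - requires a login
# before anything (including search engine bots) can fetch pages.
AuthType Basic
AuthName "Dev area"
AuthUserFile /path/to/.htpasswd
Require valid-user
```

Basic Auth keeps crawlers out entirely, which is stronger than noindex: the pages never get fetched in the first place.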
Google indirectly suggests that it's not the Toolbar in this "Google Information for Webmasters FAQ"...
[google.com...]
Why is Googlebot downloading information from our "secret" web server?
It's almost impossible to keep a web server secret by not publishing any links to it. As soon as someone follows a link from your "secret" server to another web server, your "secret" URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log. So, if there's a link to your "secret" web server or page anywhere on the web, it's likely that Googlebot and other web crawlers will find it.
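The referrer leak Google describes can be sketched in a few lines of Python (the example.com URLs are placeholders): when someone clicks a link on your "secret" page, the browser sends that page's URL in the Referer header, and the other site's public logs or stats page can then expose it.

```python
# Sketch of the referrer leak: a browser-style request from a
# "secret" page carries that page's URL in the Referer header.
import urllib.request

secret_url = "http://secret.example.com/private/page.html"
req = urllib.request.Request(
    "http://other-site.example.com/",
    headers={"Referer": secret_url},  # what a browser would send
)

# The other server's access log now contains secret_url; if its
# stats/referrer page is public, crawlers can pick the URL up there.
print(req.get_header("Referer"))
```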
So, I don't think it's Google that's out of control. When you leave old files in a public area on your webserver, they're public. At the least, create a folder, call it "old", block it somehow, and put your old stuff in there.
Note:
Google has apparently moved its relevant explanation on this topic. New location here...
Why is Googlebot downloading information from our "secret" web server?
[scholar.google.com...]
[edited by: Robert_Charlton at 3:22 am (utc) on April 20, 2008]
The robots.txt "block" still allows the URLs to appear as URL-only entries in the SERPs. The noindex meta tag, or proper password protection, is much more robust.
Yes, the noindex meta tag is much better than robots.txt to keep references to your URLs from being indexed.
One further point on this...
robots.txt is to prevent spidering. The noindex meta tag is to prevent indexing. Don't use robots.txt in addition to the noindex meta tag to prevent indexing of a particular page or pages.
Because the use of robots.txt will prevent Googlebot from spidering the page with the noindex meta, Google will never see the noindex meta and therefore might end up indexing the URL to the robots.txt-blocked page if it finds it somewhere on the web.
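Put concretely, to get a specific page out of the index the setup would look something like this (the page path is just an example):

```
# robots.txt - do NOT Disallow the page you want de-indexed,
# or Googlebot can never fetch it and see the meta tag:
User-agent: *
Disallow:

# in the <head> of the page itself (e.g. /old/contact.html):
<meta name="robots" content="noindex, nofollow">
```

The crawler has to be allowed in so it can read the noindex instruction; blocking it in robots.txt defeats the purpose.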