If anyone visits a URL with a browser that has a Google toolbar installed - or with the Google Chrome browser - Google knows the URL exists.
Also if you have AdSense integrated on the page.
Thanks tedster, that was it then - I've checked the page in Chrome. I had no idea they used that to gather URLs.
It can also be a link from one of those #*$!rapers that isn't listed yet.
The same thing happened to me and it turned out that my hosting company published a page of newly created domains. Google crawled that page and ended up indexing my test site. Now when I create a new site I noindex it until I *want* it in the index.
Domains don't need your hosting company's help to be discovered. It's public record. The original question was about an unlinked page within a domain.
Hmm. Surprised g### doesn't do what all those bad robots do and run around with a shopping list of likely directory names.
/admin/ ? Check.
/images/ ? Check.
/styles/ ? Check.
And so on.
Now that they've indexed everyone's sitemap.xml and robots.txt, you'd think they would be getting hungry again.
Don't "noindex" a dev site, block access with htpasswd rules.
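For anyone who hasn't set this up before, here's a minimal sketch of what htpasswd protection looks like on Apache (assuming mod_auth_basic is enabled; the file paths and username below are placeholders, not anything from this thread):

```apache
# First, create the password file OUTSIDE the web root (run once in a shell):
#   htpasswd -c /home/example/.htpasswd devuser

# Then put this in an .htaccess file in the dev site's document root
# (Apache 2.4 syntax):
AuthType Basic
AuthName "Development site"
AuthUserFile /home/example/.htpasswd
Require valid-user
```

With this in place, every request without valid credentials gets a 401 response, so crawlers never see the page content at all - which is exactly why it's safer than relying on noindex.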
My guess would be not the toolbar, but publicly available server logs. In the past, Google has denied that the Toolbar was involved, and currently it appears that (by omission) they would perhaps be denying the same thing about Chrome. I'm including dates in the references below to indicate the history. We had a discussion about a similar problem here. Note the Google FAQ quote I posted about the "secret" web server...
Why is Google indexing my entire web server?
That quote has gone 404 several times, but was quoted specifically by Matt Cutts on his blog (note the title of the article)...
Generic Toolbar Indexing Debunk Post
by Matt Cutts on July 19, 2008
The Google FAQ link is also broken in Matt's post. I'll quote it here....
|Why is Googlebot downloading information from our "secret" web server? |
It's almost impossible to keep a web server secret by not publishing any links to it. As soon as someone follows a link from your "secret" server to another web server, your "secret" URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log. So, if there's a link to your "secret" web server or page on the web anywhere, it's likely that Googlebot and other web crawlers will find it.
Part of this quote is now incorporated into current Google WMT documentation...
Googlebot - Webmaster Tools Help
Chrome isn't mentioned. Can't say whether they would use it for Googlebot discovery or not.
I have very mixed thoughts about Google's aggressive indexing, btw. As a web professional who knows what I want indexed and what I don't, Google's aggressiveness in indexing has been a PITA. As a searcher looking for important information where webmasters have been too inept to make it visible, I can understand what Google's doing, and occasionally I've been glad they've done it.
Bottom line, if you don't want something indexed, use noindex and/or password protection.
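For the noindex option, there are two common ways to send it - a sketch (the Apache directive assumes mod_headers is enabled):

```apache
# Option 1: in the HTML <head> of each page you want kept out of the index:
#   <meta name="robots" content="noindex">

# Option 2: send it as an HTTP header from Apache, which also works for
# non-HTML files (images, PDFs, includes) that can't carry a meta tag:
Header set X-Robots-Tag "noindex"
```

One caveat worth remembering: the crawler has to be able to fetch the page to see either signal, so don't combine noindex with a robots.txt Disallow for the same URL - the Disallow prevents Google from ever reading the noindex.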
Didn't I hear somewhere recently that G is now scanning non-linking text on a page for possible new sites to index?
I've also just noticed that G has started indexing some random files in our disallowed directories. They're in cgi-bin and show up in supplemental results for a site: example.com search. We've had /cgi-bin disallowed for all robots for many years, AND I made sure nothing there is included in the sitemap.xml, yet I can still see at least 3 files indexed with just the URL and no description: two .htm's and one called x.out. I can't put noindex in them because they aren't complete .htm files - they're used as template includes for other pages - and x.out is just a plain old database!
I did a fetch as googlebot in WMT and they all came back as "Denied by robots.txt", but you can't submit them in that case, so hopefully G will get a clue anyway.
|G is now scanning non-linking text |
They do, and it leads to a lot of incomplete and truncated URLs showing as 404 errors in WMT.
If you have files that are used only as server-side includes, you should use htaccess to block access to the folder from the web.
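A minimal sketch of that blocking, assuming Apache (the folder name is a placeholder):

```apache
# .htaccess placed inside the includes folder, e.g. /includes/
# Apache 2.4 syntax:
Require all denied

# Apache 2.2 equivalent:
#   Order allow,deny
#   Deny from all
```

The server can still read the files for its own includes; only direct HTTP requests get a 403, so there's nothing for a crawler to fetch or index.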
There are also some sites that monitor every domain that has been sold and when they find a website hosted there, they publish a link to it.
The best way to stop a bot from both crawling AND indexing a requested resource is to respond with a 401 Unauthorized status code.
|10.4.2 401 Unauthorized |
The request requires user authentication. The response MUST include a WWW-Authenticate header field (section 14.47) containing a challenge applicable to the requested resource. The client MAY repeat the request with a suitable Authorization header field (section 14.8). If the request already included Authorization credentials, then the 401 response indicates that authorization has been refused for those credentials. If the 401 response contains the same challenge as the prior response, and the user agent has already attempted authentication at least once, then the user SHOULD be presented the entity that was given in the response, since that entity might include relevant diagnostic information. HTTP access authentication is explained in "HTTP Authentication: Basic and Digest Access Authentication"
HTTP Authentication: Basic and Digest Access Authentication:
here's the Authentication and Authorization How-To for the Apache HTTP Server:
and the Windows Server documentation to Configure Basic Authentication (IIS 7):
Basic Authentication <basicAuthentication> : Configuration Reference : The Official Microsoft IIS Site: