g1smd - 9:44 am on Oct 5, 2012 (gmt 0)
Someone asked this in another thread:
"This page has been blocked by robots.txt but is still indexed?"
I thought the answer was important enough to copy over to this thread.
What do you mean by "indexed"?
Google records the fact that a URL exists as soon as it sees a link to it. It immediately adds the URL to its database for later crawling.
A URL "exists" as soon as a link is created pointing to a web resource - even if it is subsequently found that the hostname doesn't respond, or there's no page by that name on that hostname, or that crawling of the page is blocked by a robots.txt rule. The URL itself still "exists" for as long as a link containing that URL can be found somewhere on the web.
If the hostname responds but the resource is blocked by an entry in the robots.txt file, Googlebot will not fetch it (though page preview might), but Google will still keep a note that the URL "exists".
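To illustrate, a blocking rule looks something like this (the /private/ path is just an example):

    User-agent: *
    Disallow: /private/

Any URL under /private/ will not be crawled, but the bare URL itself can still be recorded - and can still show up in search results as a URL-only listing if enough links point to it.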
In order to determine the HTTP status for the URL and index the content on the page, Googlebot has to fetch it and will only do so if it is not blocked by robots.txt.
The page might return a 301, 404, 403 or other non-content status code. Only if the page returns 200 OK is the on-page content indexed. However, if the page itself contains a meta robots noindex directive, the content will not appear in any search results.
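Note that this also means a noindex directive only works if Googlebot is allowed to fetch the page: it cannot obey a directive it never sees because robots.txt blocked the fetch. The directive goes in the page's head section:

    <meta name="robots" content="noindex">

or, equivalently, can be sent as an HTTP response header:

    X-Robots-Tag: noindex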