TheMadScientist - 6:54 pm on May 30, 2010 (gmt 0)
When a bot, particularly Googlebot, crawls a robots.txt file, what is it doing with the Disallow entries?
Making a note to NOT visit those locations on your site.
The noindex meta tag, by contrast, still allows the page to be crawled. It is an explicit note to SEs not to index the page the tag is present on. (The index is what gets searched to generate results, so noindex removes a page from searches by removing it from the index.)
If you want to confirm this is what they are doing, put a noindex meta tag on a page that is disallowed in robots.txt and currently shows as a URL-only listing. You already know noindex keeps pages from being returned in the results or listed as part of the site.
The URL-only listing will remain, because Google is doing what your robots.txt file tells it to: NOT visiting (crawling, spidering) the page(s) listed as disallowed in robots.txt, so it never finds the noindex tag.
If they were crawling (spidering) the page(s) disallowed in robots.txt, then the noindex tag on those pages would work and you would not see the URL-only listing...
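Here is a rough sketch of that logic in Python, using the standard library's urllib.robotparser. The robots.txt text, the example.com URLs, the "ExampleBot" user agent and the build_listings function are all made up for illustration. This is not how Google actually implements it, only the behavior described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical site: URL -> whether the page carries a noindex meta tag.
SITE = {
    "http://example.com/public/a.html": {"noindex": False},
    "http://example.com/public/b.html": {"noindex": True},
    "http://example.com/private/c.html": {"noindex": True},
}

def build_listings(robots_txt, site, user_agent="ExampleBot"):
    """Return URL -> kind of listing a well-behaved bot could end up with."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    listings = {}
    for url, page in site.items():
        if not rules.can_fetch(user_agent, url):
            # Disallowed: the page is never fetched, so its noindex tag
            # is never seen. Only the bare URL can be listed.
            listings[url] = "url-only"
        elif page["noindex"]:
            # Crawled, noindex found: left out of the index entirely.
            continue
        else:
            listings[url] = "full"
    return listings

listings = build_listings(ROBOTS_TXT, SITE)
```

Note that /private/c.html has a noindex tag and still ends up as a URL-only listing, because the Disallow stops the bot before it can read the tag.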
* This note from Mu seems like it could explain some of the complaints about G crawling (spidering) pages disallowed in robots.txt, doesn't it?
Crawling = Spidering, Visiting, Accessing
Parsing = Processing, Analyzing, Scanning
Indexing = Listing a reference to, Adding to Possible Results
They crawl a URL, then parse the information to determine whether the page should be returned in the results and where. If it should be shown to searchers, it is indexed.
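To make the 'parse' step concrete, here is a minimal sketch of checking an already-crawled page for a noindex directive with Python's html.parser. The RobotsMetaParser class and should_index function are names I made up, and a real search engine does far more than this when parsing.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Looks for <meta name="robots" content="... noindex ..."> in a page."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        # Assumption: only the generic "robots" name and "googlebot" are checked.
        if name in ("robots", "googlebot"):
            directives = [d.strip() for d in content.split(",")]
            if "noindex" in directives:
                self.noindex = True

def should_index(html):
    """Parse a crawled page and decide whether it may be indexed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not parser.noindex

blocked = should_index('<head><meta name="robots" content="noindex, follow"></head>')
allowed = should_index('<head><title>Widgets</title></head>')
```

The point is that should_index can only run on HTML that was actually fetched. A page disallowed in robots.txt never gets this far.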
To understand the direct relationship between the word index and results as used by search, think database... In databases an index is used as a reference or 'key' to make the information more quickly accessible from storage. It's basically a 'note' or 'short reference' that says: for 'blah', look here in the storage system (disk, memory, etc.). So, to make the results more searchable and quicker to return, they keep an index of the possible choices to be returned for a search. They search that index for the 'key' or 'reference' to the location of the actual information, then use that information to generate the results they show visitors. Hence: Index = Short reference to the storage location of the actual information shown in the results. Another way to think of it is as a 'catalog' of the possibilities the results are generated from.
ADDED: The best example of an index is probably a map...
You look for a city name in an alphabetized list and it says: A2. So you go to col A, row 2 on the actual map and only have to look at a small portion of the possibilities to find the city you were looking for, rather than having to review the entire map until you find it.
If a reference is not in the index of a search engine the page does not show in the results, but this does not mean they don't use or have the information... It simply means when someone searches for it they say: 'Sorry, can't find it. Try again.'
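The same idea as a toy Python sketch: a dictionary used as the index, mapping a word to storage locations. STORAGE, build_index and search are hypothetical names and data for illustration, nothing like a real search engine's index.

```python
# Hypothetical storage: location -> stored page content.
STORAGE = {
    0: "Guide to robots.txt and crawling",
    1: "How the noindex meta tag works",
    2: "Private notes on crawling",  # stored, but left out of the index
}

def build_index(storage, indexable):
    """Map each word to the storage locations of the pages containing it."""
    index = {}
    for location in indexable:
        for word in storage[location].lower().split():
            locations = index.setdefault(word, [])
            if location not in locations:
                locations.append(location)
    return index

def search(index, storage, word):
    """Look the word up in the index, then fetch the pages from storage."""
    locations = index.get(word.lower(), [])
    return [storage[loc] for loc in locations]

# Only locations 0 and 1 are indexed; location 2 is 'noindexed'.
INDEX = build_index(STORAGE, indexable=[0, 1])

hits = search(INDEX, STORAGE, "crawling")
misses = search(INDEX, STORAGE, "private")
```

Location 2 is still sitting in storage, but since nothing in the index points to it, a search for 'private' comes back empty.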