What my robots.txt does is Disallow all pages but Allow a few URLs. However, those allowed pages contain links to other parts of the site. Would the SEs follow those links and still crawl the linked pages?
Thank you very much.
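First, a note on syntax: the original robots exclusion standard defines only Disallow; the Allow directive is an extension honored by Google and some other major engines. A setup like the one you describe might look something like this (the paths here are just placeholders):

User-agent: *
Allow: /public-page.html
Allow: /contact.html
Disallow: /

Engines that support Allow will crawl the two listed URLs and skip everything else; a robot that understands only the original standard will see just the blanket Disallow.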
Ignoring that level of detail, if a URL is not Disallowed, it will be fetched, and all links on it will be extracted. The URLs in those links will then be checked against the robots.txt file, and again, those not Disallowed will be fetched, and the process repeats.
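As a rough sketch of that loop (Python purely for illustration; "MyBot" and example.com are placeholders, and error handling is omitted), a compliant crawler does roughly this:

import urllib.robotparser
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

# Collects href values from anchor tags on a fetched page.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

queue = ["https://example.com/"]  # a start URL that is not Disallowed
seen = set()
while queue:
    url = queue.pop()
    if url in seen or not url.startswith("https://example.com/"):
        continue  # simplified: stay on one host, skip repeats
    if not rp.can_fetch("MyBot", url):
        continue  # a compliant robot never fetches Disallowed URLs
    seen.add(url)
    page = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    extractor = LinkExtractor()
    extractor.feed(page)
    for link in extractor.links:
        queue.append(urljoin(url, link))  # new URLs are re-checked next pass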
URLs *not* Disallowed will be fetched, and URLs that are Disallowed will not be. Note, though, that this latter assertion assumes the robot complies with the Standard for Robot Exclusion: badly-broken or malicious robots may fetch any URL they find.
Also be aware that Disallowed URLs may still appear in the search engines' results as "URL-only" listings if links to those URLs are found on non-Disallowed pages elsewhere on the Web. Even though their robots won't fetch these URLs, the search engines will either list them as URL-only entries or construct a "page title" from the link text of the links they find elsewhere.
To prevent a URL from being listed in search results, it is necessary to *allow* it to be fetched, but to include an on-page <meta name="robots" content="noindex"> tag in the <head> section of the (HTML) page. If that can't be done, then the best remaining option is to cloak the URL. I'm sorry to have to recommend cloaking, but the search engines have forced it upon us as a way to keep URLs out of search, given their recent "Hidden Web" and "Deep Web" spidering endeavors.
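In the page itself, that looks like this (everything other than the meta tag is placeholder):

<html>
<head>
<title>Page you want crawled but not listed</title>
<meta name="robots" content="noindex">
</head>
<body>
... page content ...
</body>
</html>

The robot must be able to fetch the page to see the tag, which is exactly why the URL cannot also be Disallowed in robots.txt.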