
Forum Moderators: mademetop


Does Disallow in robots.txt Stop Crawling?

If I use Disallow in robots.txt, will it stop pages from being crawled?



3:15 am on Jan 2, 2008 (gmt 0)

5+ Year Member

The situation is that one site has content duplicated from another domain (both under the same company; the duplicate site is to be updated at a later stage).

What the robots.txt does is disallow all pages but allow a few URLs. However, those allowed pages contain links to other parts of the site -- would search engines follow those links and still crawl the disallowed pages?

Thank you very much.


3:56 am on Jan 2, 2008 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

robots.txt is URL-based. More specifically, it is URL-prefix based. A compliant spider will not fetch a URL that matches a URL-prefix given in the robots.txt file. I'm intentionally glossing over the various proprietary and semi-proprietary 'extensions' to the robots.txt protocol here -- the ones that allow wildcards and disallowing certain filetypes. More information about those must be gathered from the individual search engines' robots help pages.
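That prefix check can be sketched in a few lines of Python. This is only an illustration of the original standard's behavior (no wildcard extensions); the rule paths and test URLs here are made up:

```python
# Sketch of plain URL-prefix matching per the original robots.txt
# standard -- no wildcards, no Allow directives.

def is_allowed(path, disallow_prefixes):
    """Return True if a compliant robot may fetch this URL path."""
    # A path is blocked if it starts with ANY Disallow prefix.
    return not any(path.startswith(p) for p in disallow_prefixes if p)

rules = ["/private/", "/tmp/"]                       # hypothetical Disallow lines
print(is_allowed("/private/page.html", rules))       # False -- prefix matches
print(is_allowed("/public/page.html", rules))        # True -- no prefix matches
```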

Ignoring that level of detail, if a URL is not Disallowed, it will be fetched, and all links on it will be extracted. The URLs in those links will then be checked against the robots.txt file, and again, those not Disallowed will be fetched, and the process repeats.
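The fetch/extract/check cycle described above can be sketched with Python's standard-library robots.txt parser. The rules and URLs are invented for illustration; a real crawler would fetch robots.txt over HTTP first:

```python
# Sketch of a compliant crawl loop: every URL is checked against
# robots.txt before fetching; links from fetched pages rejoin the queue.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])   # hypothetical rules

frontier = ["https://example.com/", "https://example.com/private/a"]
for url in frontier:
    if rp.can_fetch("*", url):
        # fetch the page, extract its links, append new URLs to frontier
        pass
    # Disallowed URLs are simply skipped, never fetched
```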

URLs *not* Disallowed will be fetched, and those which are Disallowed will not be fetched, but this latter assertion assumes that the robot is compliant with the Standard for Robot Exclusion -- Badly-broken or malicious robots may fetch any URL they find.

Also be aware that Disallowed URLs may still appear in the search engines' results as "URL-only" listings if links to those URLs are found on non-Disallowed pages elsewhere on the Web. Although their robots won't fetch those URLs, the search engines may list them as URL-only entries, or may construct a "page title" from the link text of the links they find elsewhere.

To prevent a URL from being listed in search results, it is necessary to *allow* it to be fetched, but include an on-page <meta name="robots" content="noindex"> tag in the <head> section of the (HTML) page. If that can't be done, then it's best to cloak the URL. I'm sorry to have to recommend cloaking, but the search engines have forced it upon us in order to keep URLs out of search with their recent "Hidden Web" and "Deep Web" spidering endeavors.



6:09 am on Jan 2, 2008 (gmt 0)

5+ Year Member

Thanks Jim, very helpful. Have a wonderful new year!


