General Search Engine Marketing Issues Forum

Does Disallow in robots.txt Stop Crawling?
If I use Disallow in robots.txt, will that stop pages from being crawled?
rockyfp
3:15 am on Jan 2, 2008 (gmt 0)

The situation is this: one site has duplicate content from another domain (both under the same company; to be updated at a later stage).

What the robots.txt does is disallow all pages but allow a few URLs. However, those allowed pages contain links to other parts of the site. Would search engines follow those links and still crawl the disallowed pages?
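For illustration, the robots.txt is along these lines (the paths here are hypothetical, and the Allow directive is a search-engine extension rather than part of the original standard):

    User-agent: *
    # Allow only a handful of entry pages
    Allow: /index.html
    Allow: /contact.html
    # Everything else is blocked by URL prefix
    Disallow: /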

Thank you very much.

 

jdMorgan
3:56 am on Jan 2, 2008 (gmt 0)

robots.txt is URL-based. More specifically, it is URL-prefix based: a compliant spider will not fetch a URL that matches a URL prefix given in the robots.txt file. I'm intentionally glossing over the various proprietary and semi-proprietary 'extensions' to the robots.txt protocol here -- the ones that allow wildcards and the disallowing of certain filetypes. More information about those must be gathered from the individual search engines' robots help pages.
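You can see the prefix matching in action with Python's standard-library parser (a minimal sketch; the rules and URLs are made up for illustration):

    from urllib import robotparser

    # A made-up robots.txt; each Disallow value is a URL prefix
    rules = [
        "User-agent: *",
        "Disallow: /private",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    # Prefix match: any path starting with /private is blocked
    print(rp.can_fetch("*", "http://example.com/private"))              # False
    print(rp.can_fetch("*", "http://example.com/private-area/a.html"))  # False
    print(rp.can_fetch("*", "http://example.com/public/a.html"))        # True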

Ignoring that level of detail, if a URL is not Disallowed, it will be fetched, and all links on it will be extracted. The URLs in those links will then be checked against the robots.txt file, and again, those not Disallowed will be fetched, and the process repeats.
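In rough pseudocode, that fetch/extract/check cycle looks like this (a sketch only, with fetch() and extract_links() as assumed helper functions, not any engine's real API):

    from collections import deque

    def crawl(start_url, rp, fetch, extract_links):
        # rp is a parsed robots.txt, e.g. urllib.robotparser.RobotFileParser
        queue = deque([start_url])
        seen = {start_url}
        while queue:
            url = queue.popleft()
            if not rp.can_fetch("*", url):
                continue                      # Disallowed: never fetched
            page = fetch(url)                 # fetched only because it was allowed
            for link in extract_links(page):  # links come from allowed pages only
                if link not in seen:
                    seen.add(link)            # each new URL is checked in its turn
                    queue.append(link)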

URLs *not* Disallowed will be fetched, and those which are Disallowed will not be, but this latter assertion assumes that the robot complies with the Standard for Robot Exclusion -- badly broken or malicious robots may fetch any URL they find.

Also be aware that Disallowed URLs may still appear in the search engines' results as "URL-only" listings if links to those URLs are found on non-Disallowed pages elsewhere on the Web. Even though their robots won't fetch these URLs, the search engines will either list them as URL-only entries or construct a "page title" from the link text of the links they find elsewhere.

To prevent a URL from being listed in search results, it is necessary to *allow* it to be fetched, but to include an on-page <meta name="robots" content="noindex"> tag in the <head> section of the (HTML) page. If that can't be done, then it's best to cloak the URL. I'm sorry to have to recommend cloaking, but the search engines have forced it upon us with their recent "Hidden Web" and "Deep Web" spidering endeavors; it's the only way left to keep such URLs out of the search results.
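Concretely, the page has to be left fetchable in robots.txt so the spider can actually read the tag (a minimal sketch of the markup):

    <!-- This URL must NOT be Disallowed in robots.txt; otherwise the
         spider never fetches the page and never sees the noindex tag. -->
    <html>
    <head>
      <title>Example page</title>
      <meta name="robots" content="noindex">
    </head>
    <body>
      ...
    </body>
    </html>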

Jim

rockyfp
6:09 am on Jan 2, 2008 (gmt 0)

Thanks Jim, very helpful. Have a wonderful new year!

Rocky
