Forum Moderators: phranque


How to correctly block robots that obey robots.txt?


toplisek

7:31 pm on Jan 6, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I would like to block search engines that read the robots.txt file:

User-agent: *
Disallow: /subDomainName/
Disallow: /testingArea/

What is the technical meaning of testingArea here, and what of subDomainName?

jdMorgan

12:45 am on Jan 7, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The string following the Disallow directive is the prefix that search robots match against a URL-path when deciding whether they may fetch a particular URL. In the given example, robots would consider it OK to fetch the following URLs:

/subDomainName (no slash)
/Subdomainname/ (different upper/lowercasing)
/subDomainNames/ ("s" does not match)
/subDomainNamz/ ("z" does not match)
... and anything else that did not start with *exactly* "/subDomainName/"

Search engines would not fetch any URL-path that starts with *exactly* "/subDomainName/", so all files in all directories below that directory would not be fetched.
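The prefix matching described above can be sketched in a few lines of Python. This is only an illustration of the original robots.txt convention (no wildcards or longest-match extensions modeled); the function name and prefix list are made up for the example:

```python
# Sketch of robots.txt Disallow prefix matching (original standard only).
DISALLOWED_PREFIXES = ["/subDomainName/", "/testingArea/"]

def may_fetch(url_path: str) -> bool:
    """True if no Disallow prefix is an exact, case-sensitive prefix of the path."""
    return not any(url_path.startswith(p) for p in DISALLOWED_PREFIXES)

print(may_fetch("/subDomainName"))         # True  - no trailing slash
print(may_fetch("/Subdomainname/"))        # True  - different upper/lowercasing
print(may_fetch("/subDomainNames/"))       # True  - "s" breaks the match
print(may_fetch("/subDomainName/a.html"))  # False - prefix matches exactly
```

Note that the comparison is a plain case-sensitive string prefix test, which is why all the near-miss URLs above remain fetchable.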

More information here [robotstxt.org].

Note that many search engines will list any URL that they find a link to anywhere on the Web. So you can end up with URLs listed in search results even if the robot never fetched that URL. Instead, it may appear as a "URL-only listing", or they may construct a title for the listing by using the link text found on the page that links to your URL.

If this is a problem, then you must allow the URL in robots.txt (do not "Disallow" it), and then put a "noindex" meta-robots tag in the HTML of the page itself. If the URL does not resolve to an HTML page, then you may also signal this "noindex" request by configuring your server to send the "X-Robots-Tag" HTTP header in response to requests for that URL. However, this HTTP header is fairly new and is not recognized by *all* robots, just the major ones.
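As a rough sketch, the two "noindex" signals look like this. The meta tag goes in the page's HTML head; the server-side example assumes Apache with mod_headers enabled and uses a hypothetical PDF pattern:

```apache
# In the HTML <head> of the page itself:
#   <meta name="robots" content="noindex">

# Apache sketch (mod_headers) for non-HTML resources, e.g. PDFs:
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```

Remember that either signal only works if the robot is *allowed* to fetch the URL: a robots.txt Disallow prevents the robot from ever seeing the tag or header.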

If you have more robots.txt questions, I commend our Sitemaps, Meta Data, and robots.txt [webmasterworld.com] forum to you. :)

Jim