Forum Moderators: goodroi


General questions

General robots.txt queries


makzan

10:39 am on May 6, 2005 (gmt 0)

10+ Year Member



What is the purpose of blocking robots from certain files?
Can using robots.txt help search engine rankings?
Are there any robots that crawl a site looking for email addresses to collect for spam?

jatar_k

11:36 pm on May 6, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Welcome to WebmasterWorld, makzan.

1. To keep files or directories from being spidered and listed in the search engines. This isn't foolproof, though; anything you really don't want spidered should be protected by other means as well.

2. No.

3. Tons, though I would refer to them as 'email harvesters'.
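For point 1, the directive itself is just a Disallow line. A minimal sketch (the directory name here is made up for illustration):

```
User-agent: *
Disallow: /private/
```

Remember this only asks compliant spiders to stay out; it does not actually protect the files.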

makzan

2:56 pm on May 7, 2005 (gmt 0)

10+ Year Member



So, I assume it's better to let in the genuine spiders rather than trying to block the email harvesters?
If so, which should I let in?

ThomasB

10:14 pm on May 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



makzan, nobody can give you a clear answer to that. There might be niche search engines, non-English search engines, and so on that you want to target. The email harvesters are usually bad bots and don't obey robots.txt anyway. It's like telling a thief to leave the money.

Reid

4:08 am on May 17, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



robots.txt is used to control good bots; it is not a security tool by any means.

robots.txt is handy for blocking certain directories, or for blocking search engines you don't want or care to be listed in.
If a bot uses 100 MB of bandwidth each month to crawl and only sends you 2 referrals, you might want to block it.
It is also a good tool to tell the bots not to index certain sections of the site:
If you have a members-only area, you don't want that in the SERPs.
If you use dynamic content (a page has more than one URL), you could block all but the proper URL for that page.
Block click-tracking scripts.
Block pages that are nothing but JavaScript.
Maybe there is a page that you just don't want in the SERPs for some reason.
If you have pages that are so similar (red-widgets and reddish-widgets) that you are afraid of a duplicate content penalty, but for the user you really want both pages, you could simply block one to avoid a possible penalty.
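A robots.txt covering a few of those cases might look like this (all paths here are made-up examples):

```
User-agent: *
# keep the members-only area out of the SERPs
Disallow: /members/
# click-tracking script
Disallow: /cgi-bin/click.cgi
# block the near-duplicate page, keeping red-widgets indexable
Disallow: /reddish-widgets.html
```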

So, used on the good robots (the major search engines), IMHO robots.txt can indeed help improve rankings.

Bad bots: there are also lots of bad bots (or bots simply undesirable in your niche) that do obey robots.txt.
There are so many of these that it's best to just deal with them as they come. Each time a new bot shows up, do a little search to see what it is and decide whether to block it. Some bots you may want to block from certain directories only, while others may be a different case.
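Per-bot rules look like this; the bot names below are hypothetical stand-ins for whatever shows up in your logs:

```
# block this one entirely
User-agent: SomeBadBot
Disallow: /

# only keep this one out of one directory
User-agent: SomeOtherBot
Disallow: /images/
```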

Dijkgraaf

11:13 pm on Jun 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For question 3, see [projecthoneypot.org...]

StupidScript

11:49 pm on Jun 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



MSN has a pretty good overview of robots.txt with examples here [search.msn.com].

Basically (note that robots.txt comments start with #, not //):

# Instructions for all robots who care
User-agent: *

or

# Instructions for one particular robot who cares
User-agent: msnbot

# Do not crawl this folder
Disallow: /scripts/

# Do not crawl these types of files (wildcard syntax is an extension supported by MSN, not part of the original standard)
Disallow: /*.js$

# Limit crawl frequency (in seconds) if they're hitting too hard (MSNBot)
Crawl-delay: 120
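If you want to check how a compliant crawler would read rules like these, Python's standard library ships a parser. This is just my own sketch, not anything from the MSN page, and the stdlib parser only understands simple prefix rules, not the /*.js$ wildcard:

```python
from urllib.robotparser import RobotFileParser

# The prefix rules from the example above, fed in as lines
rules = [
    "User-agent: *",
    "Disallow: /scripts/",
    "Crawl-delay: 120",
]

rp = RobotFileParser()
rp.parse(rules)

# A URL under /scripts/ is disallowed; everything else is allowed
print(rp.can_fetch("*", "https://example.com/scripts/menu.js"))  # False
print(rp.can_fetch("*", "https://example.com/index.html"))       # True
print(rp.crawl_delay("*"))                                       # 120
```

Handy for sanity-checking a robots.txt before you upload it.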