Block all Nutch variants. Possible?

morags

12:57 pm on Jun 4, 2007 (gmt 0)

I've just been crawled by another Nutch variant. Rather than having to add the user agent to robots.txt for each variant, is it possible to use a wildcard to disallow all spiders which have 'Nutch' anywhere in the user agent?

e.g.

User-agent: *Nutch*
Disallow: /

Ta.

goodroi

7:33 pm on Jun 4, 2007 (gmt 0)

According to the Nutch website:
Different installations of the Nutch software may specify different agent names, but all should respond to the agent name "Nutch". Thus to ban all Nutch-based crawlers from your site, place the following in your robots.txt file:

User-agent: Nutch
Disallow: /

Wildcards shouldn't be needed anyway - the robots.txt standard calls for a case-insensitive substring match on the User-agent line, so a plain "Nutch" already covers every variant that bothers to honor robots.txt.

good luck

morags

8:28 pm on Jun 5, 2007 (gmt 0)

Thanks for the pointer. I've had this specified in robots.txt for some time - and still they appear. Maybe some variants do obey this instruction, but sadly not all.

Will add the wildcards too - and see if it makes any difference.

incrediBILL

8:37 pm on Jun 5, 2007 (gmt 0)

The only way to truly block all the noise, including all the Nutch variants, is to write your robots.txt and .htaccess files in WHITELIST format so everything else goes away. Tell 'em nicely in robots.txt and keep 'em out by force in .htaccess.

Example:

# allowed bots here
User-agent: Googlebot
User-agent: Slurp
User-agent: Teoma
Crawl-delay: 2
Disallow: /cgi-bin

# everyone else jump off a cliff
User-agent: *
Disallow: /
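
And a rough sketch of the same whitelist on the .htaccess side (assuming Apache 2.x with mod_setenvif - the agent strings are just the ones from the robots.txt above, and you have to let real browsers in too or your visitors get 403'd):

# allowed bots, same list as the robots.txt
SetEnvIfNoCase User-Agent "Googlebot" allowed_bot
SetEnvIfNoCase User-Agent "Slurp" allowed_bot
SetEnvIfNoCase User-Agent "Teoma" allowed_bot
# normal browsers all claim to be Mozilla
SetEnvIfNoCase User-Agent "Mozilla" allowed_bot

# everyone else gets a 403
Order Deny,Allow
Deny from all
Allow from env=allowed_bot

Of course anything that fakes a browser user agent sails right through that Mozilla line - more on that below.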

Conard

8:37 pm on Jun 5, 2007 (gmt 0)

Trying to block all the Nutch bots with robots.txt will not stop them.

I've caught several variants in directories they were "asked" not to crawl, so I use .htaccess to block them - and all other bots - from the places they shouldn't be.
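
If you'd rather just target the Nutch agents than whitelist everything, a minimal mod_rewrite sketch (assuming mod_rewrite is enabled) looks something like:

RewriteEngine On
# match "Nutch" anywhere in the user agent, case-insensitive
RewriteCond %{HTTP_USER_AGENT} Nutch [NC]
# return 403 Forbidden
RewriteRule .* - [F]

That catches every variant with "Nutch" in the UA string whether it reads robots.txt or not - but only the ones that don't fake their user agent.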

incrediBILL

9:02 pm on Jun 5, 2007 (gmt 0)

I block them and all bots with .htaccess

You block the ones that want to be seen - the rest are having a party on your server right now, as I'm typing this, zipping right past .htaccess in stealth mode.

However, it's the best you can do with the tools Apache gives you.
