Block all Nutch variants. Possible?

12:57 pm on June 4, 2007 (gmt 0)

New User

10+ Year Member

joined:May 27, 2005
posts:24
votes: 0


I've just been crawled by another Nutch variant. Rather than having to add the user agent to robots.txt for each variant, is it possible to use a wildcard to disallow all spiders which have 'Nutch' anywhere in the user agent?

e.g.

User-agent: *Nutch*
Disallow: /

Ta.

7:33 pm on June 4, 2007 (gmt 0)

Administrator from US 

WebmasterWorld Administrator goodroi

joined:June 21, 2004
posts:3156
votes: 130


According to the Nutch website:
Different installations of the Nutch software may specify different agent names, but all should respond to the agent name "Nutch". Thus to ban all Nutch-based crawlers from your site, place the following in your robots.txt file:

User-agent: Nutch
Disallow: /

Good luck.

8:28 pm on June 5, 2007 (gmt 0)

New User

10+ Year Member

joined:May 27, 2005
posts:24
votes: 0


Thanks for the pointer. I've had this specified in robots.txt for some time - and still they appear. Maybe some variants do obey this instruction, but sadly not all.

Will add the wildcards too - and see if it makes any difference.

8:37 pm on June 5, 2007 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill

joined:Jan 25, 2005
posts:14650
votes: 94


The only way to truly block all the noise, including all the Nutch variants, is to write your robots.txt and .htaccess files in WHITELIST format so everything else goes away. Tell 'em nicely in robots.txt and keep 'em out by force in .htaccess.

Example:

# allowed bots here
User-agent: Googlebot
User-agent: Slurp
User-agent: Teoma
Crawl-delay: 2
Disallow: /cgi-bin

# everyone else jump off a cliff
User-agent: *
Disallow: /
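
And a rough .htaccess counterpart (a minimal sketch assuming mod_rewrite is enabled; the allowed-agent pattern below is just an illustration, build it from your own whitelist):

# whitelist sketch - anything not matching an allowed token gets a 403
RewriteEngine On
# Mozilla and Opera let ordinary browsers through; Googlebot, Slurp and
# Teoma match the bots allowed in robots.txt above
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|Teoma|Mozilla|Opera) [NC]
RewriteRule .* - [F]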

8:37 pm on June 5, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 14, 2001
posts:616
votes: 0


Trying to block all the Nutch bots with robots.txt will not stop them.

I've caught several variants in directories they are "asked" not to crawl. I block them and all bots with .htaccess from places they shouldn't be.
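
For example, to flag anything with "Nutch" anywhere in the user agent (a minimal sketch using Apache's 2.2-era Allow/Deny syntax; it assumes every variant keeps the "Nutch" token):

# flag any user agent containing "nutch", case-insensitively
SetEnvIfNoCase User-Agent "nutch" bad_bot
# refuse flagged requests
Order Allow,Deny
Allow from all
Deny from env=bad_bot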

9:02 pm on June 5, 2007 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill

joined:Jan 25, 2005
posts:14650
votes: 94


I block them and all bots with .htaccess

You block the ones that want to be seen; the rest are having a party on your server right now, as I type this, zipping right past .htaccess in stealth mode because they send fake browser user agents that no user-agent rule will ever match.

However, it's the best you can do with the tools Apache gives you out of the box.

 
