Forum Moderators: goodroi

Message Too Old, No Replies

NimbleCrawler is a bad bot, well worth banning.

NimbleCrawler is a bad bot. It never reads the robots.txt file.

         

d_mot

5:39 pm on Feb 9, 2006 (gmt 0)

10+ Year Member



Hey All --

NimbleCrawler is an annoying bot that never bothers to access my robots.txt file before it helps itself to my content. I kept getting popped by it on a few separate health care sites I admin, so I banned the owner's IP block outright until they got it fixed.

It's an especially "in your face" type of bot because it claims to follow protocol right in its UA string ---> "obeys UserAgent NimbleCrawler" (see below). pffffft.

I banned the bot a while back and also took the time to email its owners about it, asking them to stop letting it crawl until they had a better working prototype. That was when I first noticed it. I repealed the ban a few days ago to give it a second chance, and bang, it came right back, still accessing files without reading the robots.txt at all.

Well worth a permanent ban IMO. Info below.

IP block: 72.5.115.0 - 72.5.115.127
UA: NimbleCrawler
OWNER: healthline.com
STRING: HTTP/1.1 Mozilla/5.0+(Windows;)+NimbleCrawler+1.12+obeys+UserAgent+NimbleCrawler+For+problems+contact:+crawler@healthline.com
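
For reference, an IP-block ban like the one described can be written in an Apache .htaccess file. This is a sketch using Apache 2.2-era mod_authz_host directives; 72.5.115.0/25 is the CIDR form of the 72.5.115.0 - 72.5.115.127 range given above:

```apache
# Deny the reported NimbleCrawler address block, allow everyone else.
Order Allow,Deny
Allow from all
Deny from 72.5.115.0/25
```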

d

AtariGuide

5:13 am on Feb 16, 2006 (gmt 0)



There can be a delay between the robots.txt fetch and the actual site crawl, perhaps several days. Please go back further in your logs; you should see a robots.txt fetch.

NimbleCrawler obeys the robots.txt directives for

User-agent: NimbleCrawler

If there is no User-agent: NimbleCrawler line, it will obey the directives for

User-agent: *

If you wish to disallow NimbleCrawler from your site, use Notepad or vi to create a file called robots.txt in the root of your domain (e.g. virt.somesite.com/robots.txt) with the text:

User-agent: NimbleCrawler
Disallow: /

For more information, please visit [robotstxt.org...]

NimbleCrawler first performs a run to fetch robots.txt and then begins the actual crawl. Based on the IP address you give for the crawler, you were part of a crawl in which fewer than 10 pages from any domain should have been fetched, with a gap of hours or days between fetches. The crawl you were in does not perform a fresh fetch of robots.txt during this period; the 1.13 crawl refreshes robots.txt at regular intervals. If you email crawler@healthline.com with the domains you do not wish crawled, we would be happy to remove you from the crawl immediately, before our next robots.txt fetch.

We have tried to make this robot extremely well behaved and we will quickly follow up on any problems which are reported and try to fix them.

d_mot

6:36 pm on Feb 16, 2006 (gmt 0)

10+ Year Member



1. I just grepped through all the raw logs for the 30 days prior to your bot's visit. Not *once* did NimbleCrawler request, or get, the robots.txt file at my site. None. Nada. No GETs.

2. Your bot did, however, help itself to a user login page which was explicitly disallowed by the following robots.txt entry:

User-agent: *
Disallow: /memberlogin.htm

This, of course, is exactly the behavior you'd expect from a bot that didn't bother to read the robots.txt file to begin with.

3. Before I posted here at WebmasterWorld, I did some research and saw that other admins had the same issue with your bot -- they also complained it failed to read the robots.txt.
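
The log check in point 1 can be reproduced with a quick grep. The filename and sample log entries below are made up for illustration; point it at your real access log:

```shell
# Reproducing the check in point 1: count NimbleCrawler fetches of robots.txt
# in a combined-format access log. Filename and sample entries are hypothetical.
cat > /tmp/access.log <<'EOF'
72.5.115.20 - - [09/Feb/2006:17:39:01 +0000] "GET /memberlogin.htm HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows;) NimbleCrawler 1.12 obeys UserAgent NimbleCrawler For problems contact: crawler@healthline.com"
66.249.66.1 - - [09/Feb/2006:17:40:02 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
EOF

# All NimbleCrawler hits, narrowed to robots.txt requests; -c prints the count.
grep -i 'NimbleCrawler' /tmp/access.log | grep -c 'GET /robots\.txt' || true
# prints 0 -- the bot fetched a page, but never robots.txt
```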

Therefore, I still stand by my original post and assertion that NimbleCrawler, as it currently functions, is a bad bot and worth banning.

d

FYI -- I can assure you that I don't enjoy spending time debugging your bot's issues with you. It's your bot; you get it working correctly. Until then, I highly recommend everyone ban it unless they want explicitly disallowed pages spidered and indexed.

Staffa

8:19 pm on Feb 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I also banned the bot a while ago for the same reasons.

incrediBILL

8:02 am on Feb 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Robots.txt is just about meaningless these days, as ill-behaved bots do as they please. You're in a no-win scenario: the only way to stop this is to literally block everything and whitelist allowed crawlers instead.

You should allow Google, Yahoo, MSN, etc. by IP range only, as their user agents are faked all over the place.

Next, block anything whose user agent doesn't start with Mozilla. Then you'll still need to filter any user agent starting with Mozilla that contains one of the other crawler names, and boot those as well.

Only then do you have a supposed visitor rather than a bot, and even then it's still not safe.
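
The user-agent filter described above can be sketched in .htaccess with Apache mod_rewrite. The crawler names in the pattern are examples only, and this sketch omits the prior step of whitelisting Google/Yahoo/MSN by IP range:

```apache
# Sketch of the two-step UA filter (Apache mod_rewrite, example patterns).
RewriteEngine On
# Block anything whose user agent doesn't start with "Mozilla"...
RewriteCond %{HTTP_USER_AGENT} !^Mozilla [OR]
# ...or a "Mozilla" UA that embeds a known crawler name.
RewriteCond %{HTTP_USER_AGENT} (NimbleCrawler|Slurp|Googlebot) [NC]
# Matching requests get a 403 Forbidden.
RewriteRule .* - [F]
```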

Install Alex K's script to stop high speed or low speed scrapers:
[webmasterworld.com...]

You still aren't 100% safe but it's a start and you'll probably whack 90% of it.

Rani

6:44 am on Feb 20, 2006 (gmt 0)

10+ Year Member



Can anyone give me one good reason to allow Nimblecrawler into my site?

Mokita

7:07 am on Feb 20, 2006 (gmt 0)

10+ Year Member



I banned NimbleCrawler from our sites yesterday.

When it first appeared in our logs a month or two ago, I thought it would go away, as none of our sites are even remotely health-related. But it just keeps coming back, and delving deeper and deeper.

I can't see any benefit in it for us - so decided to ban it. Even if the bandwidth is minimal - a principle is involved.

Their website says:

Unlike general purpose search engines, Healthline only searches the top health sites on the Web, so users receive precise and relevant health information without having to sift through pages of unnecessary and unrelated results.

If they find sites that are NOT remotely health related, why would they spider them again and again? Seems very suss to me.

Mokita

9:19 pm on Feb 22, 2006 (gmt 0)

10+ Year Member



Further to my last message:

I initially banned Nimblebot via robots.txt, as they strongly claim to honour it. Then I watched carefully...

For a couple of days it didn't request robots.txt, but last night it did. One hour later it was back crawling pages again.

Now that I've banned it via .htaccess, it will only feed off 403 error pages ;)
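
A ban like this can match on the user agent rather than IP ranges. A minimal .htaccess sketch, assuming Apache 2.2-style mod_setenvif and mod_authz_host:

```apache
# Tag any request whose User-Agent contains "NimbleCrawler", then deny it (403).
BrowserMatchNoCase NimbleCrawler bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```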

incrediBILL

1:12 am on Feb 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



See, told you robots.txt is a waste ;)

Key_Master

1:24 am on Feb 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It respects and follows robots.txt on my sites.