Forum Moderators: goodroi
NimbleCrawler is an annoying bot that never bothers to access my robots.txt file before helping itself to my content. I kept getting hit by it on a few separate health care sites I admin, so I banned the owner's IP block outright until they got it fixed.
It's an especially "in your face" type of bot because it claims to follow protocol right in its UA string ---> "obeys UserAgent NimbleCrawler" (see below). pffffft.
I banned the bot a while back, when I first noticed it, and also took the time to email its owners, asking them to stop letting it crawl until they had a better working prototype. I repealed the ban a few days ago to give it a second chance, and bang, it came right back, still accessing files without reading the robots.txt at all.
Well worth a permanent ban IMO. Info below.
IP block: 72.5.115.0 - 72.5.115.127
UA: NimbleCrawler
OWNER: healthline.com
STRING: HTTP/1.1 Mozilla/5.0+(Windows;)+NimbleCrawler+1.12+obeys+UserAgent+NimbleCrawler+For+problems+contact:+crawler@healthline.com
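For reference, the reported range 72.5.115.0 - 72.5.115.127 is the /25 network 72.5.115.0/25. A quick sketch using Python's ipaddress module (the sample client IPs are made up) to test whether a request falls inside it:

```python
import ipaddress

# The IP block reported above: 72.5.115.0 - 72.5.115.127 == 72.5.115.0/25
NIMBLE_BLOCK = ipaddress.ip_network("72.5.115.0/25")

def from_nimble_block(client_ip: str) -> bool:
    """Return True if the client IP falls inside the reported NimbleCrawler range."""
    return ipaddress.ip_address(client_ip) in NIMBLE_BLOCK

print(from_nimble_block("72.5.115.64"))  # inside the range -> True
print(from_nimble_block("72.5.116.1"))   # outside the range -> False
```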
UserAgent: NimbleCrawler
If there is no UserAgent: NimbleCrawler line, it will obey directives for
UserAgent: *
If you wish to disallow NimbleCrawler from your site, use Notepad or vi to create a file called robots.txt in the root of your domain (e.g. virt.somesite.com/robots.txt) with the text:
UserAgent: NimbleCrawler
Disallow: /
For more information, please visit [robotstxt.org...]
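Worth noting: the robots.txt standard spells the field "User-agent:", whereas the instructions above write "UserAgent:". A quick sketch with Python's urllib.robotparser (parsing the rules from a string rather than fetching them) shows how a well-behaved crawler would evaluate that disallow rule; the standard spelling is used here because robotparser does not recognize the "UserAgent:" variant:

```python
import urllib.robotparser

# The rules quoted above, written with the standard "User-agent:" field name.
rules = """\
User-agent: NimbleCrawler
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("NimbleCrawler", "/index.html"))  # False - disallowed
print(rp.can_fetch("SomeOtherBot", "/index.html"))   # True - no rule applies
```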
NimbleCrawler first performs a run for robots.txt and then begins the actual crawl. Based on the IP address you are giving for the crawler, you were part of a crawl in which fewer than 10 pages from any domain should have been fetched, with a gap of hours or days between fetches. The crawl you were in does not perform a fresh fetch of robots.txt during this period; the 1.13 crawl refreshes robots.txt at regular intervals. If you email crawler@healthline.com with the domains you do not wish crawled, we would be happy to immediately remove you from the crawl before our next robots.txt fetch.
We have tried to make this robot extremely well behaved and we will quickly follow up on any problems which are reported and try to fix them.
2. Your bot did, however, help itself to a user login page that was explicitly disallowed by the following robots.txt entry:
User-agent: *
Disallow: /memberlogin.htm
This, of course, is exactly the behavior you'd expect from a bot that didn't bother to read the robots.txt file in the first place.
3. Before I posted here at WebmasterWorld, I checked around and saw that other admins had run into the same issues with your bot -- they also complained it failed to read the robots.txt.
Therefore, I still stand by my original post and assertion that NimbleCrawler -- as it currently functions -- is a bad bot and worth banning.
FYI -- I can assure you that I don't enjoy spending time debugging your bot's issues for you. It's your bot; you get it working correctly. Until then, I highly recommend everyone ban it unless they want explicitly disallowed pages spidered and indexed.
You should allow Google, Yahoo, MSN, etc. by IP range only, as their user agents are faked all over the place.
Next, block anything whose user agent doesn't start with Mozilla; then you'll still need to filter any user agent that starts with Mozilla but contains one of the other crawler names, and boot those as well.
Only then can you tell a supposed visitor from a bot -- and even that's not foolproof.
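The filtering steps above might be sketched like this (the crawler-name list is a small illustrative sample, not an exhaustive blocklist):

```python
# Illustrative sketch of the user-agent filtering described above; the
# crawler-name list is a sample only, not an exhaustive blocklist.
KNOWN_CRAWLER_NAMES = ("googlebot", "slurp", "msnbot", "nimblecrawler")

def looks_like_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    # Step 1: anything that doesn't start with "Mozilla" is treated as a bot.
    if not ua.startswith("mozilla"):
        return True
    # Step 2: "Mozilla" UAs that embed a known crawler name are bots too.
    return any(name in ua for name in KNOWN_CRAWLER_NAMES)

print(looks_like_bot("curl/7.68.0"))                                    # True
print(looks_like_bot("Mozilla/5.0 (Windows;) NimbleCrawler 1.12"))      # True
print(looks_like_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))      # False
```

Even after both checks, as the post says, a clean user agent only tells you the visitor *claims* to be a browser; combine this with IP-range checks for the big engines.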
Install Alex K's script to stop high speed or low speed scrapers:
[webmasterworld.com...]
You still aren't 100% safe but it's a start and you'll probably whack 90% of it.
When it first appeared in our logs a month or two ago, I thought it would go away, as none of our sites are even remotely health-related. But it just keeps coming back, and delving deeper and deeper.
I can't see any benefit in it for us - so decided to ban it. Even if the bandwidth is minimal - a principle is involved.
Their website says:
Unlike general purpose search engines, Healthline only searches the top health sites on the Web, so users receive precise and relevant health information without having to sift through pages of unnecessary and unrelated results.
If they find sites that are NOT remotely health related, why would they spider them again and again? Seems very suss to me.
I initially banned Nimblebot via robots.txt as they strongly claim to honour it. Then I watched carefully...
For a couple of days it didn't request robots.txt, but last night it did. One hour later it was back crawling pages again.
Now I've banned it via .htaccess, it will only feed off 403 error pages ;)
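For anyone wanting to do the same, a minimal .htaccess sketch (Apache 2.2-style mod_setenvif / mod_access syntax; the UA pattern and IP range come from the details reported earlier in this thread, adjust to taste):

```apache
# Flag NimbleCrawler by user agent, then deny it and the reported IP block.
SetEnvIfNoCase User-Agent "NimbleCrawler" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from 72.5.115.0/25
```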