
Healthbot / Health and Longevity Project

Healthbot/Health_and_Longevity_Project_(HealthHaven.com)


dstiles

4:47 pm on Mar 12, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This bot, new to me, came into a health site today (ok, fair game) and scraped the complete site, pics, CSS etc. included, in under five minutes. It scraped payment forms as well as ordinary pages.

UA: Healthbot/Health_and_Longevity_Project_(HealthHaven.com)
IP: 98.165.214.nnn (dynamic Cox USA)
Robots.txt: No

Still trying to discover how it got in, since the headers were blank.

dstiles

8:57 pm on Mar 12, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just looked at the site HealthHaven.com and read their bot info - top right of home page.

Not nice!

Distributed bot for anyone to run, with financial incentive.

I had it in a whitelist from previous experience and sympathies, which is how it got in. After it ignored robots.txt today and turned out to be a potential grub-type bot, it's now banned.

GaryK

10:16 pm on Mar 12, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's too bad. At least you caught it quickly though. I just added it to my list.

keyplyr

4:29 am on Mar 13, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



AFAIK - if it's coming from a dynamic Cox IP address, chances are it's not a robot but a Cox user posing as a bot to log spam. Cox would catch any significant downstream traffic and force the account owner to switch to a fixed-IP business account.

That said, thanks for the heads-up.

dstiles

9:21 pm on Mar 13, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



No. Read the bot page on that site. It's a distributed bot like grub - anyone can run it from their own desktop or network.

The bot isn't high-profile enough to be used for spoofing and it's fairly easy to spot in the logs, with that unusual UA.

As to Cox limiting their users' traffic - maybe they do, but I see a lot of trapped Cox IPs from the US and Canada.

keyplyr

9:38 am on Mar 15, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My bad, I didn't notice the "distributed" part on the bot page until after I posted.

Pfui

9:49 pm on Mar 15, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@dstiles, you might want to think about whitelisting the okay UAs rather than playing catch-up blacklisting the bad ones. Bill's posted about this before (oh, maybe a year ago; and probably since then, too).

In a nutshell, since the majority of legit UAs begin with "Mozilla," 403 all UAs that don't. Then selectively whitelist your choice of okay bots/hosts/UAs whose names begin with something other than Mozilla. For example:

RewriteCond %{HTTP_USER_AGENT} !^Google-Sitemaps
RewriteCond %{HTTP_USER_AGENT} !^Googlebot

It's still a lot of work weeding out the bots hiding behind/after ^Mozilla. But it's nice knowing you're preventatively protected from the likes of Healthbot, Java, VRTServers' triplet _viewer bots, and literally scores and scores of bad, non-Mozilla UAs.
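For anyone wanting to try the approach described above, here is a minimal sketch of how the whole ruleset might hang together in .htaccess. The whitelisted bot names are illustrative examples only, not a vetted list - substitute whichever non-Mozilla UAs you actually want to allow:

```apache
RewriteEngine On

# Conditions AND together: deny any UA that does NOT start with
# "Mozilla" AND is NOT one of the whitelisted non-Mozilla bots.
RewriteCond %{HTTP_USER_AGENT} !^Mozilla
RewriteCond %{HTTP_USER_AGENT} !^Googlebot
RewriteCond %{HTTP_USER_AGENT} !^Google-Sitemaps
RewriteCond %{HTTP_USER_AGENT} !^msnbot [NC]
# Everything else gets a 403 Forbidden
RewriteRule .* - [F]
```

Note the default-deny logic: a request only reaches the [F] rule if every negative condition matches, i.e. the UA starts with none of the whitelisted prefixes.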

dstiles

12:11 am on Mar 16, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the advice but I already have a complex UA trap system that caters for almost all bots and browsers.

The problem anyway isn't non-Mozilla UAs but Mozilla UAs that are really scrapers, injectors and similar malevolent swine. My trap caters to these as well as to badly behaved bots such as this one, which, as I said, I originally decided was a good one.

wilderness

1:39 am on Mar 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In a nutshell, since the majority of legit UAs begin with "Mozilla," 403 all UAs that don't. Then selectively whitelist your choice of okay bots/hosts/UAs whose names begin with something other than Mozilla. For example:

RewriteCond %{HTTP_USER_AGENT} !^Google-Sitemaps
RewriteCond %{HTTP_USER_AGENT} !^Googlebot

Pfui,
Are you using IPs for the 2nd condition, or rather UAs? Does it look something like:

RewriteCond %{HTTP_USER_AGENT} ^name
RewriteCond %{HTTP_USER_AGENT} !^Google-Sitemaps
RewriteRule .* - [F]

TIA

Pfui

7:22 am on Mar 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The two !^ examples aren't a complete code couplet as-is. Rather, they're excerpted from a longish list of this-'n'-that conditions and a rule. (I didn't want to hijack the thread:)
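On wilderness's IP-vs-UA question: a RewriteCond can test %{REMOTE_ADDR} instead of %{HTTP_USER_AGENT}, and an IP block is usually written as its own separate rule rather than mixed into the UA whitelist. A hedged sketch, using the documentation-only 192.0.2.x range as a placeholder:

```apache
# UA-based rule: the negative !^ conditions AND together,
# so only non-whitelisted, non-Mozilla UAs are denied.
RewriteCond %{HTTP_USER_AGENT} !^Mozilla
RewriteCond %{HTTP_USER_AGENT} !^Google-Sitemaps
RewriteCond %{HTTP_USER_AGENT} !^Googlebot
RewriteRule .* - [F]

# IP-based rule, kept separate: deny a whole /24 regardless of UA.
# 192.0.2. is a placeholder; dots are escaped because the pattern is a regex.
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
RewriteRule .* - [F]
```

Mixing a positive ^name UA condition with negative !^ conditions in one rule (as in wilderness's draft) would only fire on UAs that start with "name", which defeats the default-deny idea - hence two separate rules.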