I block all distributed bots on principle: they are generally uncontrolled, have no accountability, hit hundreds of pages at high speeds and use a lot of bandwidth for no return. On my server, using a distributed bot is a guaranteed way of getting a blocked IP.
The IP I checked responded on ports 80 and 8080, and the site it serves talks about distributed internet services (or the services "botnets" provide). So it's not only from Comcast.
And headers alone are not enough to detect them. Nor does robots.txt do anything. One way I have seen so far that works against rogue bots (and any hijacked browser) is to store visitors' IPs in a database and, the first time a visitor enters the site, check whether they are human with a simple form. The problem, I think, is that spiders aren't going to like it.
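A rough sketch of the "store the IP, challenge on first visit" idea, in Python with SQLite. The table name `verified` and the function names are made up for illustration, and the form itself is left out; this only shows the bookkeeping, not a complete defence.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the store of IPs that have passed the form.
    The 'verified' table is a hypothetical schema for this sketch."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS verified (ip TEXT PRIMARY KEY)")
    return db

def needs_challenge(db, ip):
    """True if this IP has not yet passed the human-check form."""
    row = db.execute("SELECT 1 FROM verified WHERE ip = ?", (ip,)).fetchone()
    return row is None

def mark_verified(db, ip):
    """Record that the visitor at this IP submitted the form successfully."""
    db.execute("INSERT OR IGNORE INTO verified (ip) VALUES (?)", (ip,))
    db.commit()
```

On each request the site would call `needs_challenge()` and show the form if it returns true, then call `mark_verified()` once the form is submitted. As noted above, legitimate spiders would hit the same form unless they are exempted some other way.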
I added a post on the cloaking forum to see if anyone else has done that successfully. I remember that when this forum deployed a login form to read posts, it was considered cloaking, although that was some time ago.
I know it's not only from Comcast, but I get a lot of non-browser traffic from them: certainly more than from most other USA services.
I monitor non-standard header combinations. Most are from "privacy" tools or proxies, often broken in some way. I have a range of header + UA combinations that block most bad bots and I add new ones once or twice a week.
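To give an idea of what a header + UA check could look like, here is a minimal Python sketch. The three rules are invented examples, not the actual combinations used by anyone in this thread, and real rule sets would need tuning against your own logs.

```python
def suspicious_reasons(headers):
    """Return the reasons a header set looks non-browser (empty list = pass).
    All rules below are hypothetical examples."""
    h = {name.lower(): value for name, value in headers.items()}
    ua = h.get("user-agent", "").strip()
    reasons = []
    if not ua:
        reasons.append("missing User-Agent")
    # Real browsers send an Accept header; many crude bots do not.
    if ua.startswith("Mozilla/") and "accept" not in h:
        reasons.append("browser-style UA without Accept header")
    # A UA wrapped in quote characters usually means a misconfigured tool.
    if ua[:1] in ('"', "'"):
        reasons.append("UA starts with a quote character")
    return reasons
```

A request would be blocked (or logged for review) when the returned list is non-empty. Keeping the reasons as text makes it easier to see which rule trapped a given bot.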
Blocking single IPs isn't always the answer, in any case. I'm currently getting hit by a persistent high-speed set of Global Crossing IPs (about eight or so, I think), but I've blocked the whole block of 256 out of pique. They were originally trapped automatically on a faulty header. From this I discovered a new badly formed UA that will probably trap other bots in due course.
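For anyone who wants to block the enclosing 256 addresses rather than single IPs, Python's standard `ipaddress` module can do the arithmetic. This is only a sketch; the addresses are documentation examples, not the actual offenders, and `enclosing_24` and `is_blocked` are names made up here.

```python
import ipaddress

def enclosing_24(ip):
    """Return the /24 network (256 addresses) that contains this IPv4 address."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)

def is_blocked(ip, blocked_networks):
    """True if the address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in blocked_networks)

# Example: one trapped address leads to blocking its whole /24.
blocked = [enclosing_24("192.0.2.77")]
```

With that list, `is_blocked("192.0.2.200", blocked)` is true and `is_blocked("198.51.100.1", blocked)` is false.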
We try very hard to follow proper and respectful crawling behavior. Check out [80legs.pbworks.com...] for some of the ways we do this.
If there are some specific instances where we negatively impacted your site, please let us know by contacting us (http://www.80legs.com/contact.html). We can manually set the rate at which we crawl your site to make sure we don't use up too much of your bandwidth.
Rather than having to contact you, it would be nice if your bot supported the Crawl-Delay directive in robots.txt.
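For reference, honouring Crawl-delay on the crawler side is not much work. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content is an invented example, and the `008` token is only an assumption taken from the UA string seen in the logs, not a confirmed robots.txt name for this bot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish.
ROBOTS_TXT = """\
User-agent: 008
Crawl-delay: 10

User-agent: *
Crawl-delay: 2
"""

def delay_for(user_agent, robots_txt=ROBOTS_TXT):
    """Return the Crawl-delay in seconds that applies to this agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)
```

A polite crawler would call something like `delay_for("008")` once per host and sleep that many seconds between requests to that host.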
Do I understand correctly that your bot is used to crawl sites for anyone submitting a job request? I ask because frankly, unless I know who you're crawling for, so I can look in my logs and see how much traffic they're sending me, I don't have much incentive to let you do what I would see as wasting my bandwidth.
Oh, and 403/405 also means "Please do not come back." Some bots treat that status code as "keep hitting the site every few minutes." Not everyone adds disallows to robots.txt for every bot going. It's easier and (for most badly configured bots) the only way to block them.
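From the bot's side, treating 403/405 as "do not come back" could be as simple as the Python sketch below. The class and method names are hypothetical, and it assumes all crawler nodes share this state, which may not hold for a distributed bot.

```python
# Status codes this sketch treats as "the site has refused us".
REFUSAL_STATUSES = {403, 405}

class CrawlState:
    """Remembers which hosts have refused the crawler."""

    def __init__(self):
        self.refused_hosts = set()

    def record_response(self, host, status):
        """Note a response; a refusal status stops further fetches to that host."""
        if status in REFUSAL_STATUSES:
            self.refused_hosts.add(host)

    def may_fetch(self, host):
        """False once the host has answered with a refusal status."""
        return host not in self.refused_hosts
```

The point is that the refusal is remembered per host instead of being retried every few minutes.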
Of course, the problem with distributed bots is that they may not be communicating with each other. :(
Thanks for dropping by. Hope you don't get too much flak. :)
c-66-176-217-11*.hsd1.fl.comcast.net
Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/spider.html;) Gecko/2008032620
01/28 13:23:23 /1.1
01/28 13:23:32 /1.1
01/28 13:23:38 /1.1
Aside: Does anyone know if that URI is an exploit or probe, or simply a malformed HTTP/1.1 request? I've only seen it in connection with really bad UAs, like Toata, or Hosts/IPs.
"[b]GET[/b] ?80flag=EnOfWvLYGJaaanatiuUuqMC6hs2uGIlijVddraIBGwIaed19*2VN-FePIUU6BMdfUHi1PkRqng1kmOIqLM*h0cq*KIYjKIW*qbYxoYOW*UDnoW_4w5n*djSVC_
sIniPIY4eqJJbbP_P6ZdHE*jSfMWzYmTvrt7pWUrIN0V9nD*BH6W-
IfBitP6YCBJdHBagDDPbhDgyzpHqu0nbV*65oqG9Dl4DC0p66_
VGgYRv9oG48DAVtuAJ2E-mNP9Tu*M7Ey1vyPAL*cpjAL*Wc__
L5Ssd4uU2O9fMpke*86wrFG_MYLaypEP2MSL4tp8kg5TOKVLeEzcb2Psa-j_
Re2qcFX*WxTxhI*CnHCvXtKcnRbm*dglgtNavT_1kuypdFdw-Rc*luFSXRK-
4ZFL7U-ODGcj21eNTqIU*RqoBjx-2oG8lflad_GMsd1a [b]HTTP/1.1[/b]"
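On the "malformed?" question: as I read the HTTP/1.1 message syntax, a normal request target to an origin server has to begin with "/", so `GET ?80flag=... HTTP/1.1` does not fit the usual form. A small Python sketch of that check follows; the function name and the returned labels are my own, and it covers only the basic cases.

```python
def classify_request_line(line):
    """Roughly classify the request target of an HTTP/1.x request line."""
    parts = line.split(" ")
    if len(parts) != 3:
        return "malformed: expected 'METHOD target VERSION'"
    method, target, version = parts
    if not version.startswith("HTTP/"):
        return "malformed: bad version"
    if method == "CONNECT":
        return "authority-form"
    if target.startswith("/"):
        return "origin-form"        # e.g. /index.html?x=1
    if target == "*":
        return "asterisk-form"      # e.g. OPTIONS * HTTP/1.1
    if target.lower().startswith(("http://", "https://")):
        return "absolute-form"      # typically sent to proxies
    return "malformed: target does not start with '/'"
```

Run against a shortened version of the request above, `classify_request_line("GET ?80flag=abc HTTP/1.1")` falls through to the last branch. That says nothing about intent, only that the request line is not well formed.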
The ?80flag thing is part of a custom job one of our customers is running.
I really don't believe that the people running this bot are ever going to be customers.