Gary, thanks. Seems like it would be nice of them to identify who they are crawling for. Until then, I'll block them with Robots.txt and see what happens.
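For anyone wanting to check a robots.txt block before deploying it, here is a minimal sketch in Python. It assumes the bot matches on the token "008" taken from its user-agent string; that token and the example.com host are assumptions for illustration, not something confirmed in this thread.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the bot identifies itself to robots.txt as "008".
ROBOTS_TXT = """\
User-agent: 008
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# example.com is a placeholder host.
print(parser.can_fetch("008", "http://example.com/page.html"))
print(parser.can_fetch("SomeOtherBot", "http://example.com/page.html"))
```

This only shows what a well-behaved parser would conclude from the file; a bot that ignores robots.txt is unaffected.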
Been blocked for a while. I would also like to block Comcast, which sends a heap of junk bots my way, but some of my clients get trade from that part of the world. :(
I block all distributed bots on principle: they are generally uncontrolled, have no accountability, hit hundreds of pages at high speeds and use a lot of bandwidth for no return. On my server, using a distributed bot is a guaranteed way of getting a blocked IP.
dstiles, we discussed this before, I think. I am not sure that blocking IPs on a server does anything. I get hit constantly by bots and they come from all over the place. I see the particular agent Gary posted here from other IPs, e.g. 64.125.222.nnn
The IP I checked responded on ports 80 and 8080, and the site it serves talks about distributed internet services (or the services "botnets" provide). So it's not only from Comcast.
And headers alone are not enough to detect them. Nor does robots.txt do anything. One way I have seen so far that works against the rogue bots (and any hijacked browser) is to store visitors' IPs in a database and, the first time a visitor enters the site, check that he's human by showing a simple form. The problem I think with that is spiders ain't gonna like it.
I added a post on the cloaking forum to see if anyone else did that successfully. I remember when this forum deployed a login form to read posts, it was considered as cloaking although that was sometime ago.
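A rough sketch of the "store the IP, challenge on first visit" idea, in Python with an in-memory SQLite table. The table name, function names and the "serve"/"challenge" labels are all made up for illustration; the form itself and any spider whitelist are left out.

```python
import sqlite3

# In-memory database for illustration; a real site would use persistent storage.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE verified_ips (ip TEXT PRIMARY KEY)")

def mark_verified(ip: str) -> None:
    """Record that this IP has passed the simple human-check form."""
    db.execute("INSERT OR IGNORE INTO verified_ips (ip) VALUES (?)", (ip,))
    db.commit()

def action_for(ip: str) -> str:
    """Return 'serve' for known IPs and 'challenge' for first-time visitors."""
    row = db.execute(
        "SELECT 1 FROM verified_ips WHERE ip = ?", (ip,)
    ).fetchone()
    return "serve" if row else "challenge"
```

A first request from an unseen IP gets "challenge"; once `mark_verified` has been called for it, later requests get "serve".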
Blocking IPs that belong to unwanted bots, distributed or not, prevents them coming in under a more common "browser" disguise later on. Some broadband IPs are retained by a single user for longish periods and I do find this helps in the longer term as well as in the immediate term. And, hopefully, it will frustrate the bot handler and make him think (correctly) that his bot is causing the access problem; with luck he'll stop.
I know it's not only from Comcast but I get a lot of non-browser traffic from them: certainly more than most other USA services.
I monitor non-standard header combinations. Most are from "privacy" tools or proxies, often broken in some way. I have a range of header + UA combinations that block most bad bots and I add new ones once or twice a week.
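To make the header + UA idea concrete, here is a small Python sketch. The single rule in it (a UA that claims to be Mozilla but sends no Accept header) is a hypothetical example, not one of the actual combinations used on my server or anyone else's.

```python
def looks_suspicious(headers: dict) -> bool:
    """Flag a request whose headers do not fit the browser its UA claims to be."""
    # Header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    claims_browser = h.get("user-agent", "").startswith("Mozilla/")
    # Hypothetical rule: real browsers send an Accept header.
    return claims_browser and "accept" not in h
```

In practice there would be a list of such rules, extended as new bad combinations turn up in the logs.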
Blocking single IPs isn't always the answer, in any case. I'm currently getting hit by a persistent high-speed set of Global Crossing IPs - about eight or so, I think, but I've blocked the whole 256 out of pique. They were automatically trapped originally on a faulty header. From this I discovered a new badly-formed UA that will probably trap other bots in due course.
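Blocking "the whole 256" amounts to blocking a /24 network. A minimal Python sketch of that check, using the documentation range 192.0.2.0/24 as a stand-in for the real block (the range and the function name are assumptions):

```python
import ipaddress

# 192.0.2.0/24 is a documentation range standing in for the real block.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)
```

One /24 entry covers all 256 addresses, so the eight or so offenders and any neighbours that join in later are caught by the same rule.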
I never block IPs of distributed bots, just the user agent.
If the IP, other than that user agent, continues to behave badly it will block itself.
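A sketch of blocking on the user agent alone, in Python. The pattern is an assumption based on the "008/0.83" token in the UA string quoted in this thread; the function name is made up.

```python
import re

# Assumption: the bot's UA always carries a "008/<version>" token.
BAD_UA_PATTERNS = [re.compile(r"\b008/\d")]

def status_for_user_agent(user_agent: str) -> int:
    """Return 403 for a blocked user agent, 200 otherwise."""
    if any(pattern.search(user_agent) for pattern in BAD_UA_PATTERNS):
        return 403
    return 200
```

The IP is left alone here: if it comes back under another UA and misbehaves, other traps have to catch it.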
This 'spider-for-hire' launched in April. Since then, I've noted at least two instances where it did not ask for robots.txt.
Aside: Most frequent Host is Abovenet, followed by Comcast, then Charter.
Hi guys - came across this post and thought I would chime in. I actually work on 80legs so thought it would be great to directly answer any questions you may have about our crawler.
We try very hard to follow proper and respectful crawling behavior. Check out [80legs.pbworks.com...] for some ways on how we do this.
If there are some specific instances where we negatively impacted your site, please let us know by contacting us (http://www.80legs.com/contact.html). We can manually set the rate at which we crawl your site to make sure we don't use up too much of your bandwidth.
Welcome to WebmasterWorld.
Rather than having to contact you, it would be nice if your bot supported the Crawl-Delay directive in robots.txt.
Do I understand correctly that your bot is used to crawl sites for anyone submitting a job request? I ask because frankly, unless I know who you're crawling for, so I can look in my logs and see how much traffic they're sending me, I don't have much incentive to let you do what I would see as wasting my bandwidth.
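For reference, this is roughly what honouring Crawl-Delay looks like from the crawler's side, sketched with Python's standard robots.txt parser. The 10-second value, the fallback default and the function name are assumptions; Crawl-delay is a non-standard extension, so whether a bot reads it is entirely up to the bot.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_WITH_DELAY = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

delay_parser = RobotFileParser()
delay_parser.parse(ROBOTS_WITH_DELAY.splitlines())

def seconds_between_requests(user_agent: str, default: float = 1.0) -> float:
    """Use the site's Crawl-delay if one is given, else a fallback default."""
    delay = delay_parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

A crawler would sleep for `seconds_between_requests("008")` between fetches from the same site.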
We plan on implementing Crawl-Delay in one of the upcoming releases.
Adding something that specifies who the crawl is for is not a bad idea. We can definitely take that under consideration.
And when you find a site that does not supply the kind of data you're looking for, add it to a "Do not crawl again" list, please. There are hundreds of bots out there trying to get access to sites that have absolutely no relevant content.
Oh, and 403/405 also means "Please do not come back." Some bots treat that status code as "keep hitting the site every few minutes." Not everyone adds disallows to robots.txt for every bot going. It's easier and (for most badly configured bots) the only way to block them.
Of course, the problem with distributed bots is that they may not be communicating with each other. :(
Thanks for dropping by. Hope you don't get too much flak. :)
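The "do not crawl again" request could be sketched on the crawler side like this, in Python. The class and method names are hypothetical; the one assumption that matters is that the list lives with the central coordinator, so every distributed node consults the same copy.

```python
class CrawlFrontier:
    """Hypothetical coordinator-side record of hosts that said 'go away'."""

    def __init__(self):
        self.do_not_crawl = set()

    def record_response(self, host: str, status: int) -> None:
        # Treat 403 and 405 as "please do not come back".
        if status in (403, 405):
            self.do_not_crawl.add(host)

    def may_crawl(self, host: str) -> bool:
        return host not in self.do_not_crawl
```

Each node would report statuses back through `record_response` and ask `may_crawl` before fetching anything.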
I block all distributed bots since I do not know how my data will be used. I'm fed up with all the C&D, network complaints, Google removals, etc I am forced to endure because of internet thieves scraping my content and mirroring it across the WWW. It never ends. Any bot that does not clearly define how data will be used will stay blocked across all the web sites I manage.
I block everything except Google, Yahoo, Bing and Ask.
All the rest of them go bouncing off my white-listed firewall.
FYI, I have crawl-delay set for all bots whether they use it or not, the rest of them get a 503 "Service Unavailable" when they get too greedy.
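One common way to do the "503 when they get too greedy" part is a sliding-window counter per IP. The Python sketch below is only that, a generic technique with made-up limits and names, not a description of my actual setup.

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most max_hits requests per IP within a sliding time window."""

    def __init__(self, max_hits: int = 10, window: float = 60.0):
        self.max_hits = max_hits
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def status_for(self, ip: str, now: float = None) -> int:
        now = time.monotonic() if now is None else now
        recent = self.hits[ip]
        # Forget timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_hits:
            return 503  # Service Unavailable: too greedy
        recent.append(now)
        return 200
```

With `RateLimiter(max_hits=3, window=10)`, a fourth request inside ten seconds gets the 503, and the IP is served again once earlier hits age out.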
It's bad enough this bot doesn't request robots.txt. Now, it's trying to do something else:
Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/spider.html;) Gecko/2008032620
01/28 13:23:23 /1.1
01/28 13:23:32 /1.1
01/28 13:23:38 /1.1
Aside: Does anyone know if that URI is an exploit or probe, or simply a malformed HTTP/1.1 request? I've only seen it in connection with really bad UAs, like Toata, or Hosts/IPs.
@incrediBill: How do you work out your "503 'Service Unavailable' when [bots] get too greedy" thing, please?
@shiondev: "We plan on implementing Crawl-Delay in one of the upcoming releases."
That was six months ago. Today? Twenty-two robots.txt hits from the 80legs bot* in four hours from six Hosts, with most hitting mere seconds apart. So much for Crawl-Delay.
Why can't the 80legs Mother Ship retrieve robots.txt and then have your clones touch base with you, using your resources instead of ours? Because this is ridiculous:
*UA: Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/spider.html;) Gecko/2008032620
Digsby IM Enables Web Crawlers Control of Your PC & Bandwidth
Plura Processing and 80Legs to Leverage Digsby Network
Those 22 robots.txt hits from the 80legs bot in four hours from six Hosts were ridiculous.
These 30 fake-file hits from the 80legs bot in 45 minutes from 11 more Hosts are insane:
Pfui - if you contact us at [80legs.com...] and tell us which domains we're hitting, it will be much easier for us to identify any potential problems.
shiondev: Thank you for your reply. The "/1.1" and robots.txt problems cited are actual at my end, affecting one niche domain.
My dilemma in contacting you privately is that I prefer my anonymity vis-a-vis WW. For example, less than a handful know my work-specific info. Also, the sole off-WW exchange my SysAdmin had with a bot-runner (politely & simply requesting removal of our Class C from their hit list) was met with surprisingly aggressive resistance.
Regardless, surely a single IP wasn't alone in receiving scores of strange "/1.1" hits from your bot(s) on Feb. 2. Seeing as how your crawler is distributed, wouldn't analysis of the data returned to you show the aberrant activity?
Pfui - we crawl millions of domains and receive virtually zero complaints, so it would really help us to know which domain(s) to look at.
I have no intention of sharing your private information if you contact us. My primary interest is making sure 80legs is behaving properly.
The reason you don't get a LOT of complaints is because most webmasters don't even realise what's happening.
I have several dozen domains that I would complain to you about but it's easier to block the bot and ban the IP. For every hit I get from your bot, the bot rider's IP gets canned and they can no longer access anything on my server. In a week or two I may release the IP but a re-hit will kill it again.
I don't allow distributed bots but yours is one I REALLY don't allow.
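The ban-then-release cycle can be sketched as a ban list with expiry times, here in Python. The two-week duration is an assumption standing in for "a week or two", and the class name is made up.

```python
import time

BAN_SECONDS = 14 * 24 * 3600  # assumption: two weeks

class BanList:
    """Temporary IP bans; a re-hit simply starts a fresh ban."""

    def __init__(self, ban_seconds: float = BAN_SECONDS):
        self.ban_seconds = ban_seconds
        self.banned_until = {}  # ip -> time at which the ban lapses

    def ban(self, ip: str, now: float = None) -> None:
        now = time.time() if now is None else now
        self.banned_until[ip] = now + self.ban_seconds

    def is_banned(self, ip: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        until = self.banned_until.get(ip)
        if until is None:
            return False
        if now >= until:
            del self.banned_until[ip]  # ban has lapsed, release the IP
            return False
        return True
```

Calling `ban` again on a released IP when the bot re-hits "kills it again" for another full period.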
Okay. I'll send log excerpts via WW to you personally. FYI, while grepping this month's raw apache access_log for your bot's activity, I was stunned to see what kind of hit is 'behind' what truncates to "/1.1" in my tail script...
Note: The following is split to prevent screen scroll. The original "?80flag=" strings are a whopping ~450 characters in one solid block -- x30 of those whopper blocks for Feb. 2 alone (so much for heeding all those robots.txt requests.) In this excerpt, two chars are obfuscated w/ asterisks:
Anyone else see "?80flag=" or "/1.1" hits in your raw logs, independent of robots.txt requests? The UA is:
Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/spider.html;) Gecko/2008032620
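If you want to check your own raw logs, a minimal Python sketch along these lines should do. It assumes an Apache-style access log with the client host first and the request line in quotes; the sample lines, IPs, dates and the "EXAMPLE" payload are invented for illustration.

```python
import re
from collections import Counter

# Matches the client host and the quoted request line of an Apache-style entry.
REQUEST_RE = re.compile(
    r'^(?P<host>\S+) .*?"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+"'
)

def count_80flag_hits(lines):
    """Count requests per host whose target contains the '80flag=' parameter."""
    hits = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and "80flag=" in match.group("target"):
            hits[match.group("host")] += 1
    return hits

# Invented sample lines (documentation IPs, placeholder dates and payloads).
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2000:00:00:00 +0000] "GET ?80flag=EXAMPLE HTTP/1.1" 403 0',
    '192.0.2.11 - - [01/Jan/2000:00:00:05 +0000] "GET /robots.txt HTTP/1.1" 200 120',
    '192.0.2.10 - - [01/Jan/2000:00:00:09 +0000] "GET /page?80flag=EXAMPLE HTTP/1.1" 403 0',
]

print(count_80flag_hits(SAMPLE_LOG))
```

The pattern deliberately accepts a target that starts with "?" rather than "/", since that is the malformed shape a tail script may be truncating to "/1.1".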
Pfui - I've disabled crawling to your domain through an internal "do not crawl" list. Thanks for the info.
The ?80flag thing is part of a custom job one of our customers is running.
1.) Thanks for disabling the domain (and again, the CIDR, please).
2.) FWIW, last month's access_log shows 13 similar, although not nearly as huge, "?80flag=" hits, again all in violation of robots.txt:
"GET ?80flag=%255*B%2540152*4d9 HTTP/1.1"
Oh, and those were in addition to 184 robots.txt hits.
3.) On behalf of everyone whose servers were/still will be assaulted and resources wasted by that job:
It's unfortunate you let one customer repeatedly inflict havoc, let alone through scores of minions. And I think it's deplorable when sites distinctly say, "No."
4.) You've been professional and polite, shiondev, thank you. Alas, your bot is neither. I am exceedingly glad to be rid of it.
I am so glad I found this thread. I had a site on my server that was constantly hit by the 80 legs spider, though I didn't know it. It was tagging the ?80flag=(and about 450 chars) onto all my URLs, and on several of my pages this was causing a terrible IIS issue where the site would freeze because the worker process was maxed out. I thought we were being hacked, so we rewrote several pages to boot bad queries like this, yet these attacks just moved on to other pages with similar vulnerabilities. I finally got IIS and my pagecode to properly handle the attacks (and discovered Web Gardens, which I'm now using, and my site is so much faster - the only good thing to come of this).
I thought what was happening was the work of a hacker because of the IP patterns. We set up a report today to be emailed to us each time a hack occurs, and that's how I discovered this 80 legs bot was the issue all along. I cannot tell you how frustrating this has been.
Thanks for letting me vent on this forum for my first post. I'm blocking them with my robots file and have emailed them directly asking to remove my site from their search service.
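One generic way to "boot bad queries" like the ~450-character ?80flag= strings is to refuse any request whose query string is implausibly long before the page code runs. The Python sketch below stands in for whatever the server-side language is; the 256-character limit and the function name are assumptions, and 414 is the standard "URI Too Long" status.

```python
from urllib.parse import urlsplit

MAX_QUERY_LENGTH = 256  # assumption: no legitimate query on the site is longer

def status_for_target(target: str) -> int:
    """Return 414 for an overlong query string, 200 otherwise."""
    query = urlsplit(target).query
    if len(query) > MAX_QUERY_LENGTH:
        return 414  # URI Too Long
    return 200
```

Because the check sits in front of every page, it does not matter which URL the junk gets tagged onto next.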
@comicartfans: Welcome. You're surrounded by comrades-in-arms. And we share your frustration and ire over the resources and funds wasted, ditto the hours spent fending off bad bots and making good bots heel.
@shiondev: Alas, my 'confirmed removed' status didn't last long: 11 days. Here's a six-hour period today:
UA: Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/spider.html;) Gecko/2008032620
shiondev, you know your crawler's poly-Host visits to my IPs are unwanted and unauthorized. And per your request, I privately provided you with all the log data you required, and then some, as well as my CIDR. Subsequently, you publicly confirmed my do-not-crawl request. And yet, your bot's back.
Please cease and desist.
Five days later, no response, no reply, no cessation. Just more chumps bot-running 80legs for and with PBworks. Here's a 90-minute period today on one site:
I'm getting clusters of about three or four at a time, all different IPs and generally hitting different sites.
I even got one the other day through a Barracuda proxy! You have to wonder about the mental state of someone who sets out to secure their computer and then installs this rubbish on it.
It seems to fall into bot traps fairly easily, which is good, although I wonder how it found the trap URLs given that it always (in my case) hits pages which are served up as no-content 403's and hence no traps. This suggests it's getting information from other, more aggressive bots, as the traps are never triggered by a real browser except perhaps the occasional one with a stupid enhancement plug-in.
In passing, I check most of the IPs I block and for 80legs almost ALL of them come from the big USA-based ISPs (Bell, AT&T etc). I don't suppose it's being advertised on their sites? :)
|The ?80flag thing is part of a custom job one of our customers is running. |
IMO that "?80flag" thing looks malicious and you shouldn't run jobs like that.
It has the potential to scare the hell out of novice webmasters (I got more than a few emails over that one) who won't know any better and will end up blocking innocent users running your software.
I'm afraid my attitude to 80legs users is: if they are dumb enough to be using it they deserve to be blocked from the internet. All of it! Unfortunately I can only block my little corner of it. :(
I really don't believe that the people running this bot are ever going to be customers. As noted above, most are on USA IPs and hitting UK sites they would never otherwise visit; and there is no evidence that there are returns to those or any other urls from blocked IPs.
|I really don't believe that the people running this bot are ever going to be customers. |
The people aren't running the bot, technically they're running an IM client that has this bot net built in.
I don't think it's wise at this point to lump the behavior of the bot with the behavior of the unwitting people that are running it and aren't fully aware of what's going on with their PC.