|80legs Flooding Site - Ignoring Robots.txt|
Literally thousands of requests
So the past few days have seen a massive increase in traffic that looks legit based on IP addresses, but it turns out we have in fact been hit over 80,000 times in the past 48 hours by 80legs!
Who would have thought that <10,000 pages of content would interest the bot that much.
Regardless, I blocked them with robots.txt 24 hours ago to no avail, forcing me to block any and all user agents for which preg_match('/80legs/i', $ua) returns true :-(
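For anyone wanting to do the same at the application layer, the check above amounts to a case-insensitive substring match on the User-Agent header. A minimal Python sketch of the same idea (the function name and sample UA strings are illustrative, not from the original post):

```python
import re

# Case-insensitive match, equivalent in spirit to PHP's preg_match('/80legs/i', $ua).
BLOCKED_UA = re.compile(r'80legs', re.IGNORECASE)

def is_blocked(user_agent: str) -> bool:
    """Return True if the request should be refused (e.g. with HTTP 403)."""
    return BLOCKED_UA.search(user_agent) is not None

# The 80legs crawler UA includes a link to 80legs.com, so the substring matches.
print(is_blocked('Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html)'))
print(is_blocked('Mozilla/5.0 (Windows NT 6.1)'))
```

Note that this only works while the bot keeps announcing itself honestly, which is exactly the caveat raised later in this thread.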
On another thread here on WW back in 2010, many expressed concerns that they might not honour robots.txt in the future; seems like they were correct.
|Regardless, I blocked them with robots.txt 24 hours ago to no avail, forcing me to block any and all user agents for which preg_match('/80legs/i', $ua) returns true |
Just to be grammatically correct, and in case some unfortunate noob finds their way to this forum:
Robots.txt does NOT block anything; rather, it is a REQUEST to compliant bots, of which there are very few.
There are some old and long threads on 80legs.
Visits from 80legs stopped when I put this in robots.txt sometime during 2011:
If you also have a robots.txt section like this that applies generically to all agents:
put that last in your robots.txt file. Agents should encounter the code that applies specifically to them before they hit the generic section.
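To illustrate the ordering (the poster's actual directives aren't shown above, so this is a reconstruction assuming the bot still identifies itself as "008", as mentioned later in this thread):

```
# Bot-specific section first -- 80legs' crawler announces itself as "008"
User-agent: 008
Disallow: /

# Generic catch-all section last
User-agent: *
Disallow: /cgi-bin/
```

Compliant crawlers use the most specific User-agent section that matches them, so the "008" block should take effect regardless of what the generic section allows.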
@ SteveWh - 80 legs is a distributed bot, running on many machines. It can be config'd differently by different people.
As wilderness said, only some bots will follow robots.txt directives. There are many discussions here at WW attesting that 80 legs (008) does not obey robots.txt.
For a while I also thought it obeyed robots.txt until the day it ignored it and ran my entire site.
Just wanted to note in this thread that contacting them by email does seem to work. They didn't tell me who it was that wanted all of our data, but they did promise not to spider us anymore. Since then, they haven't visited our site.
Ah, so it was actually 80legs?
That's sad as I'd rather think that some other botnet was faking their UA than they were ignoring robots.txt.
Oh well, another one goes south.
Not like I didn't predict when this thing first surfaced that eventually the desire to collect the data would exceed their desire to honor robots.txt.
What happens next when they stop announcing their user agent and use a common browser user agent like the botnet that attacked sites owned by a few of our members a month or so ago?
Then you won't even know who to contact, which they will prefer, since it stems the tide of complaints.
Here is my Mod Security 2.xx rule for 80legs.
SecRule REQUEST_HEADERS:User-Agent "80bot" "deny,log,status:403"
These folks just hit my org yesterday, about 6k hits over the 2 hours before I completely blocked them. Massively overloaded our DB and killed the sites. I did use the robots.txt mod as suggested on their site, but I also collected approximately 850 IPs (all overseas) from my web logs and blocked them via iptables rules in my firewall.
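One quick way to turn a pile of harvested addresses into firewall rules is to generate one DROP rule per unique IP and feed the output to a shell. A rough sketch (the chain, addresses, and function name are assumptions; for ~850 entries an ipset would scale better than individual rules):

```python
# Generate iptables commands that DROP traffic from a list of collected
# addresses (e.g. grepped out of the web access logs).
def iptables_rules(ips):
    # De-duplicate and sort so repeated log entries don't produce repeated rules.
    return ["iptables -A INPUT -s {} -j DROP".format(ip) for ip in sorted(set(ips))]

collected = ["203.0.113.7", "198.51.100.23", "203.0.113.7"]  # example IPs; duplicate is dropped
for rule in iptables_rules(collected):
    print(rule)
```

The output can be redirected into a script and run as root, or adapted to whatever firewall front end you use.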
I did have a filter set up in my Apache server to reject requests from specified agents; however, it had no effect. It wasn't a crawler so much as a botnet-based DDoS. IMHO, if you see these guys in your logs, block them.
The incidents above involve 80legs.com's spidering at a rate of less than one page per second. My server could have easily handled that, but the spidering of my site was on a different order of magnitude. If it's taking down your server, being hit by a "respectable" company's distributed spider doesn't feel much different from a DDoS attack.
I appreciate that 80legs.com's customer service acknowledges that they are at times responsible for overwhelming the servers their customers hire them to target, and that they make some effort to respect robots.txt. But given that they can manually slow the rate at which their botnet hits a website when they receive a complaint, there does not appear to be any reason why they haven't set reasonable default limits for all websites they spider in order to prevent their botnet from ever running amok.