
Search Engine Spider and User Agent Identification Forum

    
80legs Flooding Site - Ignoring Robots.txt
Literally thousands of requests
Andem
1:07 pm on May 24, 2012 (gmt 0)

The past few days have seen a massive increase in traffic that looked legit based on the IP addresses, but it turns out we have been hit over 80,000 times in the past 48 hours by 80legs!

Who would have thought that <10,000 pages of content would interest the bot that much.

Regardless, I blocked them with robots.txt 24 hours ago to no avail, forcing me to block any and all user agents matching preg_match('/80legs/i') :-(
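Something along these lines (a simplified sketch of the idea, not the exact code on our server):

<?php
// Refuse any request whose User-Agent contains "80legs" (case-insensitive)
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (preg_match('/80legs/i', $ua)) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}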

On another thread here on WW back in 2010, many expressed concerns that they might not honour robots.txt in the future; seems like they were correct.

 

wilderness
5:51 pm on May 24, 2012 (gmt 0)

Regardless, I blocked them with robots.txt 24 hours ago to no avail, forcing me to block any and all user agents matching preg_match('/80legs/i')


Just to be technically correct, and in case some unfortunate noob finds their way to this forum:

Robots.txt does NOT block anything; rather, it is a REQUEST to compliant bots, of which there are very few.
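If you actually want to block, do it at the server - e.g. with mod_rewrite in .htaccess. A minimal sketch (pattern and placement are up to you):

RewriteEngine On
# Return 403 Forbidden to any User-Agent containing "80legs" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} 80legs [NC]
RewriteRule .* - [F]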

There are some old and long threads here on 80legs.

SteveWh
9:37 am on May 25, 2012 (gmt 0)

Visits from 80legs stopped when I put this in robots.txt sometime during 2011:

User-agent: 008
Disallow: /

If you also have a robots.txt section like this that applies generically to all agents:

User-agent: *
Disallow: /whatever.html
...

put that last in your robots.txt file. Order shouldn't matter to a spec-compliant parser (it picks the most specific matching User-agent section), but sloppily coded bots sometimes stop at the first section that matches, so it's safest if agents encounter the section that applies specifically to them before they hit the generic one.
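So the combined file would read like this, with the 008 section first:

User-agent: 008
Disallow: /

User-agent: *
Disallow: /whatever.html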

keyplyr
10:21 am on May 25, 2012 (gmt 0)

@SteveWh - 80legs is a distributed bot, running on many machines. It can be configured differently by different people.

As wilderness said, only some bots will follow robots.txt directives. There are many discussions here at WW attesting that 80legs (008) does not obey robots.txt.

For a while I also thought it obeyed robots.txt, until the day it ignored it and crawled my entire site.

Andem
11:40 am on May 30, 2012 (gmt 0)

Just wanted to note in this thread that contacting them by email does seem to work. They didn't tell me who it was that wanted all of our data, but they did promise not to spider us anymore. Since then, they haven't visited our site.

incrediBILL
8:26 pm on May 30, 2012 (gmt 0)

Ah, so it was actually 80legs?

That's sad, as I'd rather believe that some other botnet was faking their UA than that they were ignoring robots.txt.

Oh well, another one goes south.

Not like I didn't predict when this thing first surfaced that eventually the desire to collect the data would exceed their desire to honor robots.txt.

What happens next when they stop announcing their user agent and use a common browser user agent like the botnet that attacked sites owned by a few of our members a month or so ago?

Then you won't even know whom to contact - which they may well prefer, since it stems the tide of complaints.

frontpage
5:03 pm on Jun 1, 2012 (gmt 0)

Here is my ModSecurity 2.x rule for 80legs.


SecRule REQUEST_HEADERS:User-Agent "80legs" "deny,log,status:403"
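For anyone copying that: REQUEST_HEADERS:User-Agent is how ModSecurity 2.x references the User-Agent header (the old 1.x variable HTTP_User-Agent no longer exists), and the "80legs" pattern matches the 80legs.com URL the crawler carries in its UA string. Note that 2.7 and later also require a unique id action in every rule, e.g. "id:1000001,deny,log,status:403".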

bitbucket
2:38 pm on Jun 15, 2012 (gmt 0)

These folks just hit my org yesterday: about 6k hits over the 2 hours before I completely blocked them. They massively overloaded our DB and killed the sites. I did use the robots.txt mod as suggested on their site, but I also collected approximately 850 IPs (all overseas) from my web logs and blocked them via iptables rules in my firewall.

I did have a filter set up in my Apache server to reject requests from specified agents; however, it had no effect. It wasn't a crawler so much as a botnet-based DDoS. IMHO - if you see these guys in your logs, block them.
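The firewall rules were just one DROP per address, along these lines (the address shown is a TEST-NET placeholder, not one of the actual 850):

# Drop all traffic from a crawler IP collected from the web logs
iptables -A INPUT -s 203.0.113.45 -j DROP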

--SJ

hyperkik
3:48 am on Jun 28, 2012 (gmt 0)

The incidents above involve 80legs.com's spidering at a rate of less than one page per second. My server could have easily handled that, but the spidering of my site was on a different order of magnitude. If it's taking down your server, being hit by a "respectable" company's distributed spider doesn't feel much different from a DDoS attack.

I appreciate that 80legs.com's customer service acknowledges that they are at times responsible for overwhelming the servers their customers hire them to target, and that they make some effort to respect robots.txt. But given that they can manually slow the rate at which their botnet hits a website when they receive a complaint, there does not appear to be any reason why they haven't set reasonable default limits for all websites they spider in order to prevent their botnet from ever running amok.
