This 246-message thread spans 9 pages.
Attack of the Robots, Spiders, Crawlers, etc.
Seems like there is a great deal of interest in the topic, so I thought I would post a synopsis of everything thus far. Continued from:
WebmasterWorld is one of the largest sites on the web with easily crawlable flat or static HTML content. All of the robotic download programs (aka site rippers) available on Tucows can download our entire 1m+ pages of the site. Those same bots cannot download the majority of other content-rich sites (such as forums or auction sites) on the web. This makes WebmasterWorld one of the most attractive targets on the web for site ripping. The problem has grown to critical proportions over the years.
Therefore, WebmasterWorld has taken the strong proactive action of requiring login cookies for all visitors.
It is the biggest problem I have ever faced as a webmaster. It threatens the existence of the site if left unchecked.
The more advanced tech people understand some of the tech issues involved here, but given the nature of the current feedback, there needs to be a better explanation of the severity of the problem. So let's start with a review of the situation and the steps we have taken that led us to the required-login action.
It is not a question of how fast the site rippers pull pages, but rather the totality of all of them combined. About 75 bots hit us for almost 12 million page views two weeks ago from about 300 IPs (most were proxy servers). It took the better part of an entire day to track down all the IPs and get them banned. I am sure the majority of them will be back to abuse the system. This has been a regular occurrence that has been growing every month we have been here. They were so easy to spot in the beginning, but it has grown and grown until I can't do anything about it. I have asked and asked the engines about the problem - looking for a tech solution - but up until this week, not one of them would even acknowledge the problem to us.
The action here was not about banning bots - that was an outgrowth of the required-login step. By requiring login, we require cookies. That alone will stop about 95% of the bots. The other 5% are the hard-core bots that will manually log in and then cut-n-paste the cookie over to the bot, or hand-walk the bot through the login page.
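For illustration, the cookie gate boils down to something like the sketch below (the cookie name and session store are hypothetical, not WebmasterWorld's actual code). Requests without a valid session cookie get bounced to the login page, which most off-the-shelf rippers never complete.

```python
# Minimal sketch of a login-cookie gate (illustrative names throughout).
VALID_SESSIONS = {"s3cr3t-session-id"}   # in practice, a server-side session store

def gate(cookies):
    """Return 'serve' if the session cookie checks out, else 'login'."""
    sid = cookies.get("ww_session")      # hypothetical cookie name
    return "serve" if sid in VALID_SESSIONS else "login"
```

The remaining 5% of bots defeat this exactly as described above: a human logs in once and hands the resulting cookie to the bot.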
How big of an issue was it on an IP level? 4000 IPs banned in the htaccess since June, when I cleared it out. If unchecked, I'd guess we would be somewhere around 10-20 million page views a day from the bots. I was spending 5-8 hours a week (about an hour a day) fighting them.
We have been doing everything you can think of from a tech standpoint. This is a part of that ongoing process. We have pushed the limits of page delivery and banning - IP based and agent based - and downright agent-cloaked a good portion of the site to avoid the rogue bots. Now that we have taken this action, the total amount of work, effort, and code that has gone into this problem is amazing.
Even after doing all this, there are still a dozen or more bots going at the forums. The only thing that has slowed them down is moving back to session ids and random page urls (eg: we have made the site uncrawlable again). One of the worst offenders was the monitoring services. At least 100 of their IPs are still banned today. All they do is try to crawl your entire site to look for trademarked keywords for their clients.
> how fast we fell out of the index.
Google yes - I was taken aback by that. I totally overlooked the automatic url removal system. *kick can - shrug - drat* can't think of everything. Not the first mistake we've made - won't be the last.
It has been over 180 days since we blocked GigaBlast, 120 days since we blocked Jeeves, over 90 days since we blocked MSN, and almost 60 days since we blocked Slurp. As of last Tuesday we were still listed in all but Teoma. MSN was fairly quick, but still listed the urls without a snippet.
The spidering is controllable by crawl-delay, but I will not use nonstandard robots.txt syntax - it defeats the purpose and makes a mockery of the robots.txt standard. If we start down that road, then we are accepting that problem as real. The engines should be held accountable for changing a perfectly good and accepted internet standard. If they want to change it wholesale, then let's get together and rewrite the thing with all parties involved - not just those little bits that suit the engines' purposes. Without webmaster voices in the process, playing with the robots.txt standards is as unethical as Netscape and Microsoft playing fast and loose with HTML standards.
Steps we have taken:
- Page View Throttling: (many members regularly hit 1-2k page views a day and can view 5-10 pages a minute at times). Thus, for the majority of requests there is absolutely no way to determine whether it is a bot or a human at the keyboard. Surprisingly, most bots DO try to be system friendly and only pull at a slow rate. However, if there are 100 running at any given moment, that is 100 times the necessary load. Again, this site is totally crawlable by even the barest of bots you can download. eg: making our site this crawlable to get indexed by search engines has left us vulnerable to every off-the-shelf bot out there.
- Bandwidth Throttling: mod_throttle was tested on the old system. It tends to clog up the system for other visitors - it is processor intensive, and it is very noticeable to all when you flip it on. Bandwidth is not much of an issue here - it is system load that is.
- Agent Name Parsing: Bad bots don't use anything but real browser agent names, and some sites require browser agent names to work.
- Cookie requirements: (eg: login). I think you would be surprised at the number of bots that support cookies and can be quickly set up with a login and password. They hand-walk the bot through the login, or cut-n-paste the cookie to the bot.
- IP Banning: takes excessive hand monitoring (which is what we've been doing for years). The major problem is that when you get 3000-4000 IPs in your htaccess, it tends to slow the whole system down. And what happens when you ban a proxy server that feeds an entire ISP?
- One Pixel Links and/or Link Poisoning: we throw out random 1-pixel links or no-text hrefs and see who takes the link. Only the bots should take the link. It is difficult to do, because you essentially have to cloak for the engines and let them pass (it is very easy to make a mistake - which we have done even recently when we moved to the new server).
- Cloaking and/or Site Obfuscation: makes the site uncrawlable only to the non-search-engine bots. It is pure cloaking or agent cloaking, and it goes against SE guidelines.
An intelligent combo of all of the above is where we are at right now today.
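As a rough illustration of how a couple of the steps above fit together, here is a minimal sketch of the link-poisoning trap with a search-engine whitelist carved out (all paths and IPs are made up; this is a sketch of the technique, not WebmasterWorld's actual code):

```python
# Sketch of a one-pixel "link poisoning" trap. A hidden URL is linked
# invisibly from normal pages; no human should ever request it, so any hit
# flags the client IP as a probable bot -- unless it is a whitelisted search
# engine crawler, which must be excluded (the cloaking caveat noted above).

TRAP_PATH = "/bot-trap.html"          # hypothetical hidden URL
SE_WHITELIST = {"66.249.66.1"}        # example crawler IP, purely illustrative

banned_ips = set()

def handle_request(path, client_ip):
    """Return True if the request should be served, False if banned."""
    if client_ip in banned_ips:
        return False
    if path == TRAP_PATH and client_ip not in SE_WHITELIST:
        banned_ips.add(client_ip)     # only a bot would follow the hidden link
        return False
    return True
```

The fragile part is exactly what the post describes: the whitelist is a form of cloaking, and one mistake in it bans a legitimate crawler.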
The biggest issue is that I couldn't take any of the above steps and require login for users without blocking the big se crawlers too. I don't believe the engines would have allowed us to cloak the site entirely - I wouldn't bother even asking if we could or tempting fate. We have cloaked a lot of stuff around here from various bots - our own site search engine bot - etc. - it is so hard to keep it all managed right. It is really frustrating that we have to take care of all this *&*$% for just the bots.
That is the main point I wanted you to know - this wasn't some strange action at banning search engines. Trust me, no one is more upset about that part than I am - I adore se traffic. This is about saving a community site for the people - not the bots. I don't know how long we will leave things like this. Obviously, the shorter the better.
The heart of the problem? WebmasterWorld is too easy to crawl. If you move to an htaccess mod_rewrite setup that hides your CGI parameters behind static-looking URLs, you will see a big increase in spidering that will grow continually. Ten times as many off-the-shelf site rippers support static URLs as support CGI-based URLs.
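For reference, the kind of rewrite in question looks something like this (the path and parameter names are hypothetical):

```
# .htaccess sketch: expose /forum/threads/12345.htm instead of a CGI URL.
# Off-the-shelf rippers that choke on query strings will happily crawl the
# static-looking form -- which is exactly the double-edged sword described.
RewriteEngine On
RewriteRule ^forum/threads/([0-9]+)\.htm$ /cgi-bin/forum.cgi?thread=$1 [L]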
Rogue Bot Resources:
What I do not understand is why you restricted access to the whole forum.
Wouldn't it be better if the newest topics could be accessed without login and cookies and only topics older than a few days or one week would require login?
Many newspapers do it like this. The new content is freely accessible and only the "archive" requires registration.
It seems you're damned if you do and damned if you don't: either you are out of the SEs to avoid the bots, or they kick you out for cloaking.
With the prestige and renown of WebmasterWorld I would try the cloaking and explain to Google just what you are doing.
What's the worst that could happen? You are not in the Google database anyway; they can't kick you out any further :)
Good luck with what you are doing, it's fascinating to read and watch.
From Brett answering question...
> If you required users to have cookies,
> but didn't require them to login. Could
> you require everyone but google to have
> a cookie to view the site?
That exact question has been asked of the se's for years, and they say universally, that would be cloaking and against the major se guidelines and you would be subject to removal.
>>But if I were a new visitor coming from G
Unlikely under the current circumstances... :)
The only solution outside of "cloaking" would be to teardrop the bots and release them at a rate you could tolerate.
Maybe GG will chime in.
>> but I'd like to think we could 'open source' a solution
That will probably not be a good idea, as Brett loses control of his site. Remember, this is not "feel free to copy" stuff.
As far as the server, I wonder if that would help. It's one site against many rogue bots. If Brett triples the server power and 30 of the bots become 10% more annoying, we're in the same spot.
The best solution seems to be to talk to Google and Yahoo and figure out something that they alone can access. Brett knows them, plus the search engines want, and need, this site.
I've had to implement similar measures on several of my sites, but I choose to allow the known IP blocks that are owned by Google, Yahoo and MSN to crawl freely. I also added a throttling function that allowed visitors to view a set number of pages in a set amount of time and after that they get a "slow down" message. If they don't slow down, then obviously I'm dealing with a bot and not a human, so the IP gets temporarily added to an IP ban list. After 30 days if the abuse stops, the IP is removed from the ban list.
This is all done programmatically without need for human intervention, of course. I was at first worried about the added overhead of counting page views, but after most of the bots went away there was an overall net reduction in required processing by my servers.
I know Brett deals in a much higher scale, but it seems a similar system could be implemented that wouldn't block the major SE's altogether.
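A system like the one described above might be sketched roughly as follows (the thresholds and ban duration are illustrative guesses, not the poster's actual numbers):

```python
import time

# Sketch of a page-view throttle with a temporary, self-expiring IP ban:
# allow MAX_VIEWS page views per WINDOW seconds; offenders are banned for
# BAN_SECONDS, after which the ban ages out automatically.
MAX_VIEWS   = 60               # pages per window (illustrative)
WINDOW      = 60.0             # seconds
BAN_SECONDS = 30 * 24 * 3600   # ~30 days, as in the post above

views = {}    # ip -> list of recent request timestamps
bans  = {}    # ip -> ban expiry time

def allow(ip, now=None):
    """Return True if this request should be served."""
    now = time.time() if now is None else now
    if ip in bans:
        if now < bans[ip]:
            return False
        del bans[ip]           # ban aged out; IP is forgiven
    recent = [t for t in views.get(ip, []) if now - t < WINDOW]
    recent.append(now)
    views[ip] = recent
    if len(recent) > MAX_VIEWS:
        bans[ip] = now + BAN_SECONDS
        return False
    return True
```

A real deployment would also whitelist the known crawler IP blocks of Google, Yahoo, and MSN before this check, as the poster describes.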
Of course if Brett really wants to get a good night's sleep he could just slap some AdSense on every page, have the check delivered to his bank account in the Cayman Islands, and retire.... ;-)
(Hey, this is in Foo after all.)
bots (legit and otherwise) use up about 30% of my total bandwidth (although they are only 2% of my users).
for me, i just figure the bandwidth is part of the cost of doing business - half of that bandwidth are bots that are sending me traffic (g, msn, yahoo), so that's a good investment - as for the others, my bandwidth costs as a percentage of my profits can carry them - after all, my overall bandwidth costs have come down in the last 4 years.
i'd be interested in other webmasters' % of total bandwidth used by spiders - how about you Brett?
|That exact question has been asked of the se's for years, and they say universally, that would be cloaking and against the major se guidelines and you would be subject to removal. |
Removing yourself is worse, IMO. We've continuously talked about (or read about, in my case) sites that cloak that are allowed to get away with it.
I've had to login for a long time now, because of my ISP. No big deal. No longer having a search function is what bothers me. I didn't mind using the engines, but that option is mostly lost now. There are people like me that search before they ask. Now, I'll be asking things that were already discussed in detail 3 weeks ago. That's not really good for any of us.
I'd never put a site in my profile, so that's a non point for me. (And yes, I've read every post of this entire thread.)
a couple of comments:
1> have you considered banning bots through a hardware firewall? this method is MUCH more effective than depending on whether or not a bot correctly reads the robots.txt file or follows the .htaccess rules. it would also lower the load on the server.
2> i wonder (thinking out loud) if the increased concern about bot traffic is related to switching to hosting with Rackspace. i know they charge quite a bit for bandwidth!
|other webmaster's % of total bandwidth used by spiders |
1:2.64 (spider:human) currently; was parity a month or two ago due to Jagger-hit.
BUT BANDWIDTH IS NOT THE ISSUE! Brett keeps pointing out that he does not have any bandwidth charges (lucky *!?&**!); it is server load that is the issue. WebmasterWorld is a dynamic site, and the bots are bringing the server to its knees.
I can see why search will be yet another issue. Search causes server-load issues on my site. Hmm.
> You can use those graphical 'words' that
> this issue sounds more like a DDOS attack:
I am amazed by the number of webmasters that don't get the technology issues involved here.
Captcha logins are nothing for people with bots. They simply log in by hand and pass it off to the bot. Many of the bots you can freely download will even sit there in the background in an IE toolbar while you walk through the login page by hand. Some bots run off IE and simply need you to log in by hand, and then run the spider that drives IE.
Other bots, in PHP or Perl, simply need the generated cookie cut-n-pasted to them.
WebmasterWorld is one of the largest freely crawlable sites on the web today. The majority of the bots you can download from Tucows can spider WebmasterWorld at will. Most of those bots can NOT touch a CGI-URL-based forum like a vBulletin. That means this is prime pickings for content raiders, scrapers, translators, blog headline generators, and people wanting to generate content. It is becoming clear to me how rare that is. eg: someone running a little old vBulletin with even a small 10k users is not going to understand or appreciate this problem.
Your suggestion is excellent, and a workable solution I think. That is what I am going to try next year. My only question is what to do when the ban list gets back up there to 4-5k IPs and starts to hurt system performance?
|It seems you are damned if you do or your damned if you don't, either you are out of the SE to avoid the bots, or they kick you out for cloaking. |
True, but me and HG did get to see the Rolling Stones while in Vegas (thanks FISH dude!) - best concert of about 50 we've seen - so life is good! ;-)
Anyway - y'all that are itchy about not having a search engine to find info: please remember that this is a community, and it requires a give as well as a take.
Please take a few moments and answer some questions you can here on the active list of unanswered questions [webmasterworld.com].
From the start of this thread, Brett, it sounds like you may be thinking about letting some SE's in again after awhile. Any idea if they'll actually crawl the thousands of existing threads so they'll be available again for site search?
|Any way - ya'll that are itchy about not having a search engine to find info : please remember that this is a community and requires a give as well as a take. |
Understood, but how many of the more knowledgeable members will have to keep giving the same stuff time after time due to there being no search function available to newcomers? (I don't have an answer, incidentally - this whole thing sure sounds like a sticky wicket to me any way you slice it.)
|Your suggestion excellent and a workable solution I think. That is what I am going to try next year. My only question, is what to do when the ban list gets back up there to 4-5k ips and starts to hurt sys performance? |
An IP-based blacklist in .htaccess or even httpd.conf doesn't scale very well. So I'd rather do it in the CGI itself. I seem to recall that WebmasterWorld is implemented in Perl. Perl hashes are pretty fast. And there are other solutions to make the IP lookup even more efficient. B-Trees and Patricia Trees are some keywords here. These are well researched algorithms.
You need some kind of aging, too. You don't want to block an IP for eternity. Sometimes, these rogue bots come from dial-in IP's and after a day or so the rogue IP will be assigned to a completely different and innocent user.
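The hash-plus-aging idea could be sketched like this (a Python illustration of the approach, not the site's actual Perl; the TTL value is an arbitrary example):

```python
import ipaddress
import time

# Sketch of an in-application ban list with aging. A dict gives O(1) lookup
# for exact IPs; CIDR blocks sit in a small list since there are usually few
# of them. Entries expire after TTL seconds so recycled dial-in IPs are not
# punished forever.
TTL = 6 * 30 * 24 * 3600      # ~6 months, an illustrative choice

_banned = {}                  # "1.2.3.4" -> expiry timestamp
_banned_nets = []             # (ip_network, expiry) pairs

def ban(ip_or_cidr, now=None, ttl=TTL):
    now = time.time() if now is None else now
    if "/" in ip_or_cidr:
        _banned_nets.append((ipaddress.ip_network(ip_or_cidr), now + ttl))
    else:
        _banned[ip_or_cidr] = now + ttl

def is_banned(ip, now=None):
    now = time.time() if now is None else now
    exp = _banned.get(ip)
    if exp is not None and now < exp:
        return True
    addr = ipaddress.ip_address(ip)
    return any(now < e and addr in net for net, e in _banned_nets)
```

For very large lists of CIDR blocks, the Patricia-trie structures mentioned above replace the linear scan over `_banned_nets` with a longest-prefix match.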
Brett...Bloody brilliant my friend, bloody brilliant!
I am sitting here laughing at the thought of all these "bigtime" webmasters who now look like deer caught in headlights...Lost, bewildered, stunned.
What you did takes balls. More balls than any hotshot "blackhat" or scrapester has. You've literally kicked ass overnight and are probably toasting to it with an expensive bottle of champagne.
I owe a lot to WW. I make more than 99% of the people on this earth because of WW. (Yes, I damn will brag about it). Over a year ago I kicked the SE habit and couldn't be happier. I don't rely on their BS games and, like you, I have peaceful sleeps at night.
Did I search WW using the Google sitesearch feature? Sure! Am I freaking about it now? No, because the info I need is all in my head. Will I continue to support WW? You bet, for life, because WW will thrive for years and years to come....Yes, WITHOUT the SE's. In fact, I am going to place a PROMINENT link on the front page of my main site right now to WW. Don't know why I didn't do this before, but you can expect bigtime referrals from me now.
For those who are UNINFORMED, you can and DO survive without search engines. In fact, it's a recommended practice. SE traffic? Just gravy :) You have ALL become so dependent on free SE traffic that it is downright sickening. Will humans NOT learn?
Brett is NOT dumb, folks. He's brilliant. He knows what the hell he is doing, whether you can figure it out or not. This is history in the making, mark my words. I have no doubt that over the past 7 days WW has gotten as many backlinks as it did in the past year or two. Do you not think that WW will get any referrals from those links? If you think that WW only ever got referrals from the SE's then you must be as dumb as a doorknob. Flame me for my comments, go ahead, but know this: The Truth Shall Set You Free!
Rock on Brett. Looking forward to the internal site search that gets put in place in the future. Until then, let the mud continue to sling! Woohoo, this is better than any damn update thread ;)
Why block the IP in htaccess? Just ban the IP at the firewall and let the kernel reject them. Better than an htaccess file of 150kb+ heh :)
I made a daemon that reads access_log on the fly (like "tail -f access_log") to kick bad robots out with iptables rules - custom rules limiting access by IP (currently bans of 30, 60, 300, 900, and 1800 seconds work for me, but you can add longer times to stop "slow crawling"), each with a limit on the number of pages it can get, excluding images, CSS, and static content. This works very well, and very fast too. It isn't perfect - it can ban the AOL proxy, for example... but the world isn't perfect either heh. It can be done in Perl/Python without performance issues, but if the access_log grows fast (like 5gb-10gb/day+) just use C :)
You may ask for login after a user/robot gets N pages too... whitelisting the IPs of G, MSN, Y!, etc. will be needed too.
[edited by: nsqlg at 5:12 am (utc) on Nov. 26, 2005]
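A stripped-down sketch of that log-scanning approach (the log format, the limit, and the iptables invocation are assumptions; a real daemon would tail the file continuously and track per-IP ban durations as the poster describes):

```python
import re
import subprocess

# Count page hits per IP from Apache common-log-format lines, skipping
# static assets, and report IPs that exceed a limit. Banning is done by
# inserting an iptables DROP rule (requires root).
LIMIT = 300                                   # pages per scan interval (illustrative)
STATIC = re.compile(r"\.(gif|jpe?g|png|css|js)\b", re.I)
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)')

def scan(lines, whitelist=frozenset()):
    """Return the set of IPs that exceeded LIMIT page views."""
    hits = {}
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip, path = m.groups()
        if ip in whitelist or STATIC.search(path):
            continue
        hits[ip] = hits.get(ip, 0) + 1
    return {ip for ip, n in hits.items() if n > LIMIT}

def ban(ip):
    # -I puts the rule ahead of any ACCEPT rules in the INPUT chain.
    subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"],
                   check=True)
```

The whitelist parameter is where the known search engine crawler IPs would go, per the note above.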
Why people are so scared of cloaking I have no clue...
It is ok if the USE is for a good purpose.
Brett, in your case you would not be tricking search engines, therefore you have nothing to worry about. There are many uses for cloaking other than showing optimized text content... and many of these uses are in effect right now as we speak!... and have been for a while!
But do what you will...
|For those who are UNINFORMED, you can and DO survive without search engines. |
Sound like famous last words to me. Don't get me wrong, I'm going to stick with WebmasterWorld, but guess how I heard about it in the first place. Hint: the site name starts with G.
|Why block the IP in htaccess? Just ban IP at firewall and let the kernel reject then. |
Yes, that's an option. But the question is not WHERE to block but rather how to maintain the list of blocked IPs.
I do not think that measures like IP lists are of any use.
Many of those bots today are run by people in their home office who do not even have a static IP address and always get assigned a new one when they connect to their Internet provider.
So the only thing that happens is that the next person who gets assigned the IP is cut off from WebmasterWorld. And your IP ban list gets longer and longer.
Again my question. What would happen if you would restrict access only to topics older than a few days?
Let the bots run freely on the new topics but prevent them to spider old posts by requiring login. That would also reduce your system load. However the recent content would be available to new visitors without registering.
And for those bots that circumvent that, the only thing you can do is make things a little harder, like implementing auto-logoff after a certain amount of time. That would at least stop them from running wild for hours.
Now I get a better idea about why this happened.
If I had WW and had to manage that, the only solution would have been, as suggested, to do what bigger organisations do:
- Have multiple servers, and a script to write to a separate location
- Have an SQL-based forum
- Allow the bots to take whatever they want, with selective blocking of the top 10% bandwidth-consuming bots (non search engines), reviewed every X months and kept secret.
What do you say Brett?
I think you could omit the database option now.
I also think that you deserve to earn more money by WW, in form of more ads. At least advertisements should be served to those coming direct from search engines. ( they pay more in a cpc model )
i once had an idea in mind: why not have the SEs publish the IP lists of their crawlers, so webmasters can then exclude those IPs from being flagged as "bad" crawlers. but take care not to be recognized as a cloaking site
This has been an interesting development. I've read some good suggestions, but they all come back to bandwidth. You can only push so many bits through a pipe. Avoiding cloaking issues just makes plain sense. Taking a perfectly good domain and potentially blacklisting it goes against all logic.
I agree that WebmasterWorld can survive without SEs. And isn't that what it's about? Build your site until it can stand on its own. Good job Brett!
I'm waiting to see if Santa is going to put a search appliance under Brett's tree this year. Of course, then we'll all have to go back to talking about the next update.
A final thought on this.. Brett, why not at least let them in to spider your home page?
Oh, and another thing. Wouldn't it be nice if the SE reps were to take a position on the issue of robots.txt standards? We know you folks are reading this thread.. this issue is more about rogue behaviour than anything else. The irony is this: because of search we have lost search.
I think it would not even be necessary to change the robots.txt standard. I think it would be sufficient if Google, Overture, Espotting, and the others were a little more picky about where they put their advertising.
What are those rogue bots doing mainly? Scraping content and feeding pseudo search engines for the only purpose of showing Adwords and the like.
They are stealing and monetizing our content and Google and the others don't give a damn. It's a shame on what kind of sites for example Google allows to show their adwords. Same with the Search Engine Spam issue. Look at those sites which Google is fighting in their SERPs and chances are they are showing Adwords.
(Sorry if I am getting a little off topic.)
Davewray msg 45
You'd better go to the doctor's - I think your tongue has turned brown.
- most rogue crawlers do not obey robots.txt anyway
- why disallow Google? But I haven't read the whole thread
- If I were Brett I would just write a mod for this BB that would put an X-second delay between viewing pages for non-logged-in users, with an exception for the IP blocks of Google, Yahoo, MSN, and/or whatever search engine spiders he would like to allow. Rogue crawlers really hate delays and run away.
- That looks like a Google ban. I have seen sites whose "deep" pages are not indexed by Google but which still have home/contact/faq/etc indexed, maintaining their original PR. Probably using "index, nofollow" html tags
Ironically, days ago I was searching Google for "google+dogpile site:webmasterworld.com" to check what happened to Dogpile (they either did what Brett did, or Google banned them, but that's another story). I saw no results, figured out what had happened to WebmasterWorld, and found that Dogpile is fine with Google - all indexed, same PR.
I am just sad!
> Many of those bots today are run by people in their home
> office which do not even have a static IP address
Yes. My rule of thumb is that for every bad bot, you will have to ban an average of 4 to 5 IPs to get that bot stopped.
> ban IP at firewall
What firewall software would you suggest that has that easily manageable capability? Under RedHat enterprise.
> like you may be thinking about letting
> some SE's in again after awhile.
That was always the plan. We have to solve the original problem first.
> So I'd rather do it in the CGI itself.
That is part of the system we were/are doing. It is amazing how much faster the site is with all this junk designed for bots taken out.
> You don't want to block an IP for eternity.
That is the hardest part. I was tracking all the IPs with "reason for ban" in a spreadsheet. It has about 20k IPs in it that we have banned over the last few years. I take an IP off after about 5-6 months.
The sad fact is that some people abuse the site, we ban their IP, and then someone from that ISP who had nothing to do with the original bot attack gets that banned IP. That very scenario has happened dozens of times. One of the worst bots we have ever had was on a recycled IP that ended up with one of our best users from time to time.
> Looking forward to the internal site search
It is going to take awhile to perfect. Problems will abound everywhere for awhile.
> What would happen if you would restrict
> access only to topics older than a few days?
Then you miss out on the super content down there in the archives/library.
> public their ip list of the crawlers
Here is one maintained by a WebmasterWorld moderator: [iplists.com...]
> Brett, why not at least let them in to spider your home page?
I started to make a robots.txt to do just that, and after about the 40th exception line I threw in the towel. Given our site structure, there is no way to do it.
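For what it's worth, the usual attempt at a home-page-only robots.txt relies on nonstandard syntax - exactly the kind Brett refuses to use earlier in this thread - and even then only crawlers that support the extensions honor it:

```
# Hypothetical robots.txt: permit only the home page.
# "Allow" and the "$" end-anchor are nonstandard extensions recognized
# by some major crawlers only; everything else stays blocked.
User-agent: *
Allow: /$
Disallow: /
```

With a deep static URL structure like WebmasterWorld's, enumerating exceptions with standard `Disallow` lines alone quickly becomes unmanageable, which matches the experience described above.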
> the SE reps were to take a position
> on the issue of robots.txt standards?
Oh, they have - and they were the first ones to break that standard. What's worse, we let them get away with it.
> They are stealing and monetizing our content
YOUR content being the operative word. This site is not here as search engine fodder. This site is here for you.
[edited by: pmac at 7:50 pm (utc) on Nov. 28, 2005]
[edit reason] fixed link [/edit]
|many members regularly hit 1-2k page views a day |
Kill the auto-refresh on the active list page.
It sounds like this would be the perfect opportunity for a search engine to step in and offer to solve your search problem. You could provide an XML feed of your content to them. They could provide a search box. You would get a search feature. They would get (for now) exclusive content for their engine and some great exposure.
As for cloaking, the distinction I've always heard is whether it's for the benefit of visitors or to trick search engines. If you're cloaking to provide default localized information, browser compatibility, or things like that, I don't think any reasonable search engine would consider that bad. The only risk is that they might not be able to tell the difference.
Also, I'm not sure why you wouldn't be able to handle 10-20 million page views per day with multiple servers.
|What firewall software would you suggest that has that easily manageable capability? Under RedHat enterprise. |
Any. I use iptables; you can do something like this (some features used below need to be enabled at kernel compile time, but your kernel will probably already have them):
iptables -N DENY_IP
iptables -N DENY_IP_RULE
iptables -A DENY_IP_RULE -m limit --limit 64/hour --limit-burst 16 -j REJECT --reject-with icmp-host-prohibited
iptables -A DENY_IP_RULE -j DROP
iptables -I INPUT -j DENY_IP
# Now the cool part! f*king bad IPs :)
iptables -A DENY_IP -s 126.96.36.199 -j DENY_IP_RULE
iptables -A DENY_IP -s 188.8.131.52 -j DENY_IP_RULE
To change the firewall you must have root privileges, but of course this opens security issues, so you need to write a custom tool owned by UID/GID 0 with the setuid bit (chmod 4700) to ban the IPs and expire older ones. But a superuser tool needs to be written VERY carefully - you don't want somebody gaining extra privileges by exploiting your app, do you? :) If you like, I can help.