This 246 message thread spans 9 pages.
Attack of the Robots, Spiders, Crawlers.etc
Seems like there is a great deal of interest in the topic, so I thought I would post a synopsis of everything thus far. Continued from:
WebmasterWorld is one of the largest sites on the web with easily crawlable flat or static html content. All of the robotic download programs (aka: site rippers) available on Tucows can download our entire 1m+ pages of the site. Those same bots can not download the majority of other content rich sites (such as forums or auction sites) on the web. This makes WebmasterWorld one of the most attractive targets on the web for site ripping. The problem has grown to critical proportions over the years.
Therefore, WebmasterWorld has taken the strong proactive action of requiring login cookies for all visitors.
It is the biggest problem I have ever faced as a webmaster. It threatens the existence of the site if left unchecked.
The more advanced tech people understand some of the tech issues involved here, but given the nature of the current feedback, there needs to be a better explanation of the severity of the problem. So let's start with a review of the situation and the steps we have taken that led us to the required login action.
It is not a question of how fast the site rippers pull pages, but rather the totality of all of them combined. About 75 bots hit us for almost 12 million page views two weeks ago from about 300 IPs (most were proxy servers). It took the better part of an entire day to track down all the IPs and get them banned. I am sure the majority of them will be back to abuse the system. This has been a regular occurrence that has been growing every month we have been here. They were so easy to spot in the beginning, but it has grown and grown until I can't do anything about it. I have asked and asked the engines about the problem - looking for a tech solution, but up until this week, not a one would even acknowledge the problem to us.
The action here was not banning the bots - that was an outgrowth of the required login step. By requiring login, we require cookies. That alone will stop about 95% of the bots. The other 5% are the hard core bots that will manually login and then cut-n-paste the cookie over to the bot, or hand walk the bot through the login page.
How big of an issue was it on an IP level? 4000 IPs banned in the htaccess since June when I cleared it out. If unchecked, I'd guess we would be somewhere around 10-20 million page views a day from the bots. I was spending 5-8 hrs a week (about an hour a day) fighting them.
We have been doing everything you can think of from a tech standpoint. This is a part of that ongoing process. We have pushed the limits of page delivery, banning, ip based, agent based, and downright agent cloaking a good portion of the site to avoid the rogue bots. Now that we have taken this action, the total amount of work, effort, and code that has gone into this problem is amazing.
Even after doing all this, there are still a dozen or more bots going at the forums. The only thing that has slowed them down is moving back to session ids and random page urls (eg: we have made the site uncrawlable again). Among the worst offenders were the monitoring services. I have at least 100 of those IPs still banned today. All they do is try to crawl your entire site to look for trademarked keywords for their clients.
> how fast we fell out of the index.
Google yes - I was taken aback by that. I totally overlooked the automatic url removal system. *kick can - shrug - drat* can't think of everything. Not the first mistake we've made - won't be the last.
It has been over 180 days since we blocked GigaBlast, 120 days since we blocked Jeeves, over 90 days since we blocked MSN, and almost 60 days since we blocked Slurp. As of last Tuesday we were still listed in all but Teoma. MSN was fairly quick, but still listed the urls without a snippet.
The spidering is controllable by crawl-delay, but I will not use nonstandard robots.txt syntax - it defeats the purpose and makes a mockery of the robots.txt standard. If we start down that road, then we are accepting that problem as real. The engines should be held accountable for changing a perfectly good and accepted internet standard. If they want to change it in whole, then let's get together and rewrite the thing with all parties involved - not just those little bits that suit the engines' purposes. Without webmaster voices in the process, playing with the robots.txt standards is as unethical as Netscape and Microsoft playing fast and loose with HTML standards.
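For anyone who hasn't met the directive being argued about: `Crawl-delay` is an extra line that some crawlers honor, and it is not part of the original robots.txt standard. Below is a minimal sketch of how a well-behaved crawler might read it, using Python's standard `urllib.robotparser`. The robots.txt body, the `ExampleBot` name, and the paths are made up for illustration and are not WebmasterWorld's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; agent name, delay and paths are invented.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /
"""

def load_rules(text):
    """Parse a robots.txt body held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

rules = load_rules(ROBOTS_TXT)

# Seconds the named bot is asked to wait between requests.
delay = rules.crawl_delay("ExampleBot")

# The named bot may fetch forum pages; every other agent is shut out.
allowed = rules.can_fetch("ExampleBot", "/forum1/1.htm")
blocked = rules.can_fetch("SomeOtherBot", "/forum1/1.htm")
```

Note that this only describes what a polite crawler would do with the file. Nothing here forces a site ripper to read robots.txt at all.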
Steps we have taken:
- Page View Throttling: (many members regularly hit 1-2k page views a day and can view 5-10 pages a minute at times). Thus, there is absolutely no way to determine if it is a bot or a human at the keyboard for the majority of bots. Surprisingly, most bots DO try to be system friendly and only pull at a slow rate. However, if there are 100 running at any given moment, that is 100 times the necessary load. Again, this site is totally crawlable by even the barest of bots you can download. eg: making our site this crawlable to get indexed by search engines has left us vulnerable to every off-the-shelf bot out there.
- Bandwidth Throttling: Mod_throttle was tested on the old system. (It tends to clog up the system for other visitors - it is processor intensive - it is very noticeable to all when you flip it on.) Bandwidth is not much of an issue here - it is system load that is.
- Agent Name Parsing: Bad bots don't use anything but a real agent name. Some sites require browser agent names to work.
- Cookie requirements: (eg: login). I think you would be surprised at the number of bots that support cookies and can be quickly set up with a login and password. They hand walk the bot through the login, or cut-n-paste the cookie to the bot.
- IP Banning: - takes excessive hand monitoring (which is what we've been doing for years). The major problem is when you get 3000-4000 ips in your htaccess, it tends to slow the whole system down. What happens when you ban a proxy server that feeds an entire ISP?
- One Pixel Links and/or Link Poisoning: - we throw out random 1 pixel links or no text hrefs and see who takes the link. Only the bots should take the link. It is difficult to do, because you have to essentially cloak for the engines and let them pass (It is very easy to make a mistake - which we have done even recently when we moved to the new server).
- Cloaking and/or Site Obfuscation: that makes the site uncrawlable only to the non search engine bots. It is pure cloaking or agent cloaking, and it goes against se guidelines.
An intelligent combo of all of the above is where we are at right now today.
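To make the "combo" idea concrete, here is a rough sketch that chains three of the checks from the list: a trap link, a login cookie requirement, and a page view throttle. It is not the code running on this site. The `VisitorScreen` class, the trap path, and the outcome labels are all hypothetical, and the 10 pages per minute ceiling is only borrowed from the member figure quoted above.

```python
import time
from collections import defaultdict, deque

TRAP_PATHS = {"/trap/x1.htm"}   # hypothetical target of a hidden 1 pixel link
WINDOW_SECONDS = 60
MAX_PAGES_PER_WINDOW = 10       # assumption: busy members view 5-10 pages a minute

class VisitorScreen:
    """Combine a trap link, a cookie requirement and a page view throttle."""

    def __init__(self):
        self.hits = defaultdict(deque)   # ip -> timestamps of recent page views
        self.suspects = set()            # ips that followed a trap link

    def check(self, ip, path, has_login_cookie, now=None):
        now = time.time() if now is None else now
        # Only a bot should ever follow the hidden link. The ip is queued
        # for hand review rather than banned outright, since it may be a
        # proxy that feeds an entire ISP.
        if path in TRAP_PATHS:
            self.suspects.add(ip)
        if ip in self.suspects:
            return "review"
        # No cookie means no login, so send the visitor to the login page.
        if not has_login_cookie:
            return "login"
        # Sliding window count of page views for this ip.
        window = self.hits[ip]
        window.append(now)
        while window and window[0] <= now - WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_PAGES_PER_WINDOW:
            return "throttle"
        return "allow"
```

As the list itself points out, a slow and polite bot with a pasted cookie passes all three checks, so a sketch like this reduces the problem without solving it.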
The biggest issue is that I couldn't take any of the above steps and require login for users without blocking the big se crawlers too. I don't believe the engines would have allowed us to cloak the site entirely - I wouldn't bother even asking if we could or tempting fate. We have cloaked a lot of stuff around here from various bots - our own site search engine bot - etc. - it is so hard to keep it all managed right. It is really frustrating that we have to take care of all this *&*$% for just the bots.
That is the main point I wanted you to know - this wasn't some strange action at banning search engines. Trust me, no one is more upset about that part than I am - I adore se traffic. This is about saving a community site for the people - not the bots. I don't know how long we will leave things like this. Obviously, the shorter the better.
The heart of the problem? WebmasterWorld is too easy to crawl. If you move to an htaccess mod rewrite setup that hides your cgi parameters in static-looking urls, you will see a big increase in spidering that will grow continually. Ten times as many off-the-shelf site rippers support static urls as support cgi based urls.
Rogue Bot Resources:
I still think the best solution is to block the offenders with a firewall.
Ideally you would do this with a firewall device that sits in front of your webserver(s). A good Cisco device would be able to handle a really high request load, and would not cost your webserver a single CPU cycle.
A software based firewall running on the webserver would be ok too - though, since the request needs to be handled by your server still, that's going to result in some usage on that server.
Still - blocking at the level of a software firewall is better than trying to block from the apache .htaccess level - because the firewall operates at a lower level on the TCP/IP stack.
A free and very easy to setup/use software firewall that would run on red hat enterprise is called 'apf firewall'. It includes an IP block list (/etc/apf/deny_hosts.rules).
As far as the hardware firewall - I would definitely suggest you contact the RackSpace support people about this - they definitely are the experts. While the external hardware firewall is definitely the better choice for reducing usage on your server - it would be more expensive. Depending on the exact device they recommend for this, the cost might be in the range of $500 - $5000.
Maybe there's something big I missed that prevents this from being an option, but why not switch to a dynamic forum like vBulletin then, if it does indeed stop a lot of bots from being able to index the pages? If the current posts are the main concern, they can all be archived.
From a user perspective, I think vBulletin would greatly add to Webmasterworld. I've been so frustrated at times using WW I wanted to pull my hair out. Search is the most lacking feature, even when it was there. vBulletin is all around a great pleasure to use, post, find what you want etc.
Just my 2 cents.
|why not switch to a dynamic forum like vBulletin |
(Since Brett is fast asleep at this moment) The issue is system-load, as I understand it - exactly the same issue as this one (the bots). Brett has a lean, mean Forum that he has written himself in Perl, and which handles a staggeringly high number of postings on a single RH-box. He prides himself on the fact that just this single box is able to handle what other fora need a server-farm for.
As a Yorkshireman (renowned for our stinginess) I applaud him!
I was just thinking that this must be the longest ever thread in Foo. I would like to check this out but I can't without an effective search feature :(
I am afraid you guys lost me a week ago with the technology involved in this. I just don't understand much of it and it may be that it is all necessary. What surprises me is that with all the experts in here no one has come up with an alternative so perhaps there is none.
All I can say is that the effect on my WW experience is a bit like losing a limb. It's just unfortunate that some other search feature could not have been implemented first. (Incidentally the site search option at the top of the page is no longer applicable).
>> the experts in here no one has come up with an alternative so perhaps there is none
The issue is an acceptable alternative for Brett. We all run businesses where there are alternatives acceptable to us. :)
|The issue is an acceptable alternative for Brett. |
I know. To what did you think I was referring?
I think LexiPixel has a good point in msg #:101
Why not create an 'archive' area where you create truly static versions of your pages via a cron job when the server load is low.
Allow anyone without an IE/Opera/Mozilla UserAgent in, and anyone who arrives through those UA's gets redirected to the real pages.
At the end of the day the worst scenario is that the posts in the archive are up to 24hrs out of date. The server load should drop right back as the static pages should cache nicely and serve up quick.
"> You don't want to block an IP for eternity.
That is the hardest part. I was tracking all the ips with "reason for ban" in a spreadsheet. it has about 20k ips in it that we have banned the last few years. I take the ip off after about 5-6 months."
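The spreadsheet routine described in that quote (record the reason for each ban, lift it after some months) can be sketched as a small ban list with expiry. This is only an illustration: the `BanList` class is hypothetical, the 165 day lifetime is an assumed stand-in for "about 5-6 months", and the addresses in the usage lines come from the reserved documentation range.

```python
from datetime import datetime, timedelta

BAN_LIFETIME = timedelta(days=165)   # assumption: roughly 5-6 months

class BanList:
    """Track banned ips with a reason, and let bans lapse over time."""

    def __init__(self):
        self._bans = {}   # ip -> (banned_at, reason)

    def ban(self, ip, reason, when):
        self._bans[ip] = (when, reason)

    def is_banned(self, ip, now):
        entry = self._bans.get(ip)
        if entry is None:
            return False
        banned_at, _reason = entry
        if now - banned_at >= BAN_LIFETIME:
            del self._bans[ip]   # the ban has lapsed, so forget it
            return False
        return True

    def purge(self, now):
        """Drop every lapsed ban and return the ips that were released."""
        expired = [ip for ip, (banned_at, _reason) in self._bans.items()
                   if now - banned_at >= BAN_LIFETIME]
        for ip in expired:
            del self._bans[ip]
        return expired

bans = BanList()
bans.ban("192.0.2.10", "site ripper", datetime(2005, 6, 1))
still_banned = bans.is_banned("192.0.2.10", datetime(2005, 7, 1))
```

A dictionary lookup like this does not get slower as the list grows, which is one reason to keep a large ban list in a lookup table rather than as thousands of individual htaccess rules.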
I'm no Unix Expert and I apologise if this has already been covered. But this has been sitting in my bookmarks for last year or so. [neilgunton.com ] From my limited POV it looks good.
looks like a good tip, but i doubt (without analyzing it yet) that the cache hit/miss ratio would be good.
bots tend to crawl a page no more than once a day, except for active update-checking.
but can u give an approximate guess or a tested/analyzed result? :)
afaik, some bad guys are fond of setting up a proxy page like:
$ctx = file_get_contents("http://your domain/" . $_SERVER['REQUEST_URI']);
$ctx = preg_replace(.... replace your banner, logo with theirs..);
then sitting and waiting for money from the ad links they put up.
this may hit your cache nicely, but i bet u would prefer to ban these ips. really easy to analyze from the log if u do.
> Allow anyone without an IE/Opera/Mozilla UserAgent in, and anyone who arrives through those UA's gets redirected to the real pages.
Any suggestions that rely on UA checks implicitly rely on the "visitor" honestly reporting their UA. In other words, UA checks are worthless (especially since even mainstream browsers allow you to spoof a UA now).
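A short sketch of why that is so. The `route_for` function is a hypothetical version of the archive/live split proposed above, and `BROWSER_TOKENS` is an assumed, crude way of recognising the three browser families named there. The request is only built, never sent.

```python
from urllib.request import Request

# Assumption: crude markers for the IE/Opera/Mozilla agents named above.
BROWSER_TOKENS = ("MSIE", "Opera", "Mozilla")

def route_for(user_agent):
    """Send browser-looking agents to the live pages, the rest to the archive."""
    if any(token in user_agent for token in BROWSER_TOKENS):
        return "live"
    return "archive"

def build_request(url, claimed_agent):
    """A client can claim to be any agent it likes with one header."""
    return Request(url, headers={"User-Agent": claimed_agent})

# A site ripper that simply announces itself as a browser.
ripper = build_request(
    "http://example.com/forum1/1.htm",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
)
sent_agent = ripper.get_header("User-agent")
route = route_for(sent_agent)
```

The server only ever sees the string the client chose to send, so the routing decision is effectively made by the bot.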
> It all depends on the type of site notsleepy. I have another site, that hasn't seen 1 in 20k visitors use the site search.
If you work out that WebmasterWorld search feature and it brings back relevant results quickly I think you'll be surprised to see how much higher that number goes. When it works, they use it.
|Brett has a lean, mean Forum that he has written himself in Perl, and which handles a staggeringly high number of postings on a single RH-box. He prides himself on the fact that just this single box is able to handle what other fora need a server-farm for. |
There's nothing wrong with having a lean, mean site, but everything has its limits. If the number of real webmasters using WebmasterWorld increased 10-fold, would he still be able to handle it on a single server? What if Google had tried to build their search engine to run on just a single server?
It sounds like now is the time to redesign the site to function on multiple servers. It might also make sense to build it around a database (assuming it isn't already), add search functionality, and do a number of other things.
good point, is that why you are called Moving on up? :)
there was one until recently. Name was Google.
|It sounds like now is the time to redesign the site to function on multiple servers. It might also make sense to build it around a database (assuming it isn't already), add search functionality, and do a number of other things. |
For some reason an earlier post of mine was deleted without explanation. In it I basically made two points:
- What exactly is the problem which is trying to be solved here? - it's not entirely clear
- Server load seems to be a big issue. Has getting the BestBBS generated source cleaned up been considered? It could probably be chopped in half by moving a fair amount of presentational stuff (e.g. attributes in <td> tags, <font>, etc.) into CSS. (I'm NOT suggesting a full CSS-P layout, but you could get rid of a lot of code without really affecting what things look like at all.)
>>why not switch to a dynamic forum like vBulletin
vBulletin is really powerful, but is very resource intensive. I admin a forum that gets a fraction of the traffic WebmasterWorld does, and it takes two dual Xeon servers with 4 GB of RAM each. I've gone to pains to eliminate features and queries to keep the load down, but we can still get overwhelmed by traffic peaks. I'm getting very zippy fractional-second page loads with 800 users online, but I'm waiting to see what will happen when we peak at 2000+ simultaneous users. (I can tell you what happens with one dual Xeon server - it gets really slow, and starts throwing database errors.)
I think vBB would be an interesting solution for WebmasterWorld, but it would take some serious recoding to streamline page creation and some even more serious hardware.
Here's the vBulletin hardware setup for a busy forum (not mine) - Alexa rank about 14,000, with a base of 40 million posts: "5 Web Servers each with Dual Xeon 3.0 Ghz Processors, 2GB Ram, Ultra 320 15k SCSI Drive. Gallery/Uploads Server with Single Xeon 2.8, 2GB Ram, Ultra 320 15k Raid Array. Database Server with Dual Xeon 3.0 1MB, 4GB Ram, Ultra 320 15k Raid Array." That's serious hardware.
Comparing forums is difficult, of course, without detailed information on user behavior. Very few sites report stats like "peak hourly pageviews", which is a major determinant of server load, as are the number of searches and posts. One site can have a few million posts and 100K members and loaf along because it's rare for more than a couple of hundred members (and visitors) to be online at the same time. Another forum that looked smaller could actually be much more server intensive if they have more members online at once or get a lot of external traffic.
At Pubcon presentations on Community Building, Brett has emphasized the importance of snappy page loads to keep members and visitors coming back, so you can be sure that he wouldn't want to get into a software/hardware setup that could bog down under load.
|We didn't ban bots to start with - that was not the real action - we required login - which mandated blocking the bots or suffer the wrath of a totally abused login page. I would rather throw up a road block, than sit here and let a bot crash into the 404 or login page because we required login and cookie support. |
Presumably login/cookie access can be disabled by IP address therefore
1) Cloak the robots.txt file allowing approved bots in.
2) Let the approved bots in without login/cookie requirements.
I've never heard of anyone cloaking the robots.txt file, but I can see no technical difficulty.
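Roughly what that two step idea might look like, as a sketch only. The `APPROVED_NETWORKS` list holds a placeholder documentation range, not any real crawler's addresses, and all the function names are made up. Whether the search engines would tolerate this kind of IP based cloaking is a separate question from whether it can be built.

```python
import ipaddress

# Placeholder range (TEST-NET-1); a real list of approved crawler
# networks would have to be obtained and kept current by hand.
APPROVED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

OPEN_RULES = "User-agent: *\nDisallow:\n"      # crawl anything
CLOSED_RULES = "User-agent: *\nDisallow: /\n"  # crawl nothing

def is_approved(ip_text):
    """True if the requesting ip falls inside an approved crawler network."""
    ip = ipaddress.ip_address(ip_text)
    return any(ip in network for network in APPROVED_NETWORKS)

def robots_txt_for(ip_text):
    """Step 1: serve a different robots.txt depending on who is asking."""
    return OPEN_RULES if is_approved(ip_text) else CLOSED_RULES

def needs_login(ip_text, has_cookie):
    """Step 2: approved crawlers skip the login/cookie requirement."""
    return not (has_cookie or is_approved(ip_text))
```

Keying on the ip rather than the agent name matters here, because the agent name is whatever the client claims it is.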
> I've never heard of anyone cloaking the robots.txt file, but I can see no technical difficulty.
I have in the past ;)
yes, and traffic will keep dropping as the number of new users drawn from search queries drops, which it will.
Essential for creating a so called 'community' - man I hate that term for anonymous web communications... - is openness. The doors are open, you drop in, look around, leave. If you like what you see, you join, and start contributing. That decision is made after you've determined you like what you see, not before. I never believe anything on the web until I see it myself, and I never signup for something without having seen it myself. Very simple. If there are walls around the community and you can only enter through one door, people turn away and go to 'communities' that are open.
Think of a small village with many roads leading into it, there's a public market, everyone hears about it from various news sources and gossip [search engines and word of mouth]. One day, all the roads except one, with a guard on it, block access to everything. It no longer feels like an open community, and people start to drift away, except for the ones who are used to hanging out at that market already, and a few others. But most just go to the next town over, that still has open access to its markets. And those are the ones people start reading about in their local papers, so that's where they go.
Then of course there's the other things, simple, people like their work to be seen, but when pictures of their sculptures stop appearing in the surrounding papers, they too start moving their work to more open markets. Brett knows all this, if bots are part of the web, you have to have a site that can support that part too.
While many webmasters out there would and will love it if brett stays off google/search in general, that decision will only have one outcome long term. Which means bbs will have to be able to support running on multiple boxes. So my guess is Brett will keep the search engines out until he's reprogrammed the forums to run on multiple servers, then let the bots back in.
back to that server load issue, has anyone ever tried punbb?
I was looking at punbb too, it's an interesting product, by far and away the best output html and css of any forum software I've ever seen, I tested some sample page html sizes, stunningly low, the lowest I've seen on a reasonably full featured package.
Don't know how truly optimized the db queries are though, my guess is that end probably needs work, but the punbb forums I checked out loaded very fast, much faster than most phpbb forums, even heavily html optimized ones, which suggests some good work on the db end of things.
Hey Brett, I like the sound of dataguy's redirection idea in message 103. You haven't commented on it. Did I miss something?
Also, I too have wondered about the inefficient markup on the site, but presume this takes us back to discussions about bandwidth and that isn't the problem.
WW Alexa ratings have plunged since this started. We'll have to see how long past Thanksgiving this trend continues. Given the number of webmasters with Alexa's plugin installed, the chart, while not accurate for comparing one site to another, does give strong indications of the trends of a single site. Plus robots/spiders/crawlers don't use it, so this shows only true visitor trends.
Looks like about a sudden 60 percent drop in WW user traffic.
I run a very high volume site with forums. Over 1 terabyte of bandwidth per month. I can say from my own experience there are better ways to deal with this while still allowing legit search engine crawlers in.
Our intelligent automated approach works well:
1. Dynamic robots.txt
2. HTTP honey-pot without instant auto-blocking.
3. Honey-pot from projecthoneypot to catch the sneakiest of crawlers.
4. Automated log file scanning.
5. Firewalling of problematic IP ranges (not just single IPs). If I have to lose a handful of visitors in order to gain many more legit visitors from SERPs, so be it!
6. Lastly, hardware and bandwidth is cheap... throw more at it if you can. It's less costly than 8 hours per week of your time.
We're going to be adding a geotargeting database soon, once we get another server up.
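Items 4 and 5 in the list above (log scanning, then firewalling ranges rather than single IPs) could be sketched along these lines. This is a guess at the general shape, not the poster's actual tooling. It assumes an access log with the client ip as the first field on each line, and the sample lines, threshold, and /24 grouping are all invented for illustration.

```python
import ipaddress
from collections import Counter

def requests_per_ip(log_lines):
    """Count requests per client ip, assuming the ip is the first field."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        try:
            ipaddress.ip_address(fields[0])
        except ValueError:
            continue   # skip lines that do not start with an ip
        counts[fields[0]] += 1
    return counts

def heavy_ranges(counts, threshold, prefix=24):
    """Sum the per-ip counts by network and list the busy ranges."""
    per_range = Counter()
    for ip, hits in counts.items():
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        per_range[network] += hits
    return sorted(str(net) for net, hits in per_range.items()
                  if hits >= threshold)

# Invented sample lines using reserved documentation addresses.
SAMPLE_LOG = [
    '198.51.100.5 - - [29/Nov/2005:12:00:00 +0000] "GET /forum1/1.htm HTTP/1.1" 200 512',
    '198.51.100.5 - - [29/Nov/2005:12:00:01 +0000] "GET /forum1/2.htm HTTP/1.1" 200 512',
    '198.51.100.77 - - [29/Nov/2005:12:00:02 +0000] "GET /forum1/3.htm HTTP/1.1" 200 512',
    '203.0.113.9 - - [29/Nov/2005:12:00:03 +0000] "GET /forum2/1.htm HTTP/1.1" 200 512',
]

counts = requests_per_ip(SAMPLE_LOG)
candidates = heavy_ranges(counts, threshold=3)
```

Grouping by range catches a bot that spreads its requests over neighbouring addresses. The price is the one accepted in item 5: some legit visitors inside the same range get blocked too.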
Well you must have made googlebot hungry cause it's swallowing my site in huge chunks. Server load is topping 170... Ouch.
> WW Alexa ratings have plunged since this started.
Alexa stats have never made any sense whatsoever for my sites. I used to think that they were part of Amazon, but found out they weren't when I tried to contact them awhile ago. I don't see the value in watching stats that appear to have no basis in reality.
The problem with auto banning of ip addresses, is that sooner or later you will ban a proxy server of an ISP and not even know it. We recently banned one of the largest isps in New Zealand and didn't know it. In the past we have flirted with nuking aol and the like.
Wow I really like the bravery shown here. It's really now got me hooked on webmasterworld and what you do next. You guys are rocking again.
"6. Lastly, hardware and bandwidth is cheap... throw more at it if you can. It's less costly than 8 hours per week of your time. "
I guess more hardware is not an option at webmasterworld because it does not use a database. It's file system based. But this limitation is just my assumption, and I may be wrong.
|Earlier Brett mentioned that 1 in 1,000 use internal site search. I'm not sure where you got that figure but on my sites it's WAY higher. |
that's because on other sites, site search works!
I stopped trying long ago.. did the Giga thing, and then just gave up.
|Presumably login/cookie access can be disabled by IP address therefore |
1) Cloak the robots.txt file allowing approved bots in.
2) Let the approved bots in without login/cookie requirements.
Kaled, Brett already made the point earlier that this would be seen as unapproved cloaking by the SEs and would result in WebmasterWorld getting banned. He mentioned that he even asked the SE reps personally and they said he would get banned.
Suppose I suffer from a troublesome skin infection on my left hand. I could "cure" the infection by taking an axe and amputating the hand. Some people would, no doubt, admire my bravery, but it would still be a bad decision, especially if this action in no way ensured that the infection would not return somewhere else.
Since I am not Brett, I cannot know just how bad the problem is that caused him to take this action - nevertheless, I think it is highly likely he will regret this decision. Brett has already made the point himself that bad bots can be walked through the login/cookie stage so, using the metaphor above, the infection will probably be back. If it doesn't return, it will be because WebmasterWorld has fallen so far off the internet map that it is no longer attractive.
[edited by: kaled at 12:05 pm (utc) on Nov. 29, 2005]