WebmasterWorld is one of the largest sites on the web with easily crawlable flat or static html content. All of the robotic download programs (aka: site rippers) available on Tucows can download our entire 1m+ pages of the site. Those same bots can not download the majority of other content rich sites (such as forums or auction sites) on the web. This makes WebmasterWorld one of the most attractive targets on the web for site ripping. The problem has grown to critical proportions over the years.
Therefore, WebmasterWorld has taken the strong proactive action of requiring login cookies for all visitors.
It is the biggest problem I have ever faced as a webmaster. It threatens the existence of the site if left unchecked.
The more advanced tech people understand some of the tech issues involved here, but given the nature of the current feedback, there needs to be a better explanation of the severity of the problem. So let's start with a review of the situation and the steps we have taken that led us to the required login action.
It is not a question of how fast the site rippers pull pages, but rather the totality of all of them combined. About 75 bots hit us for almost 12 million page views two weeks ago from about 300 IPs (most were proxy servers). It took the better part of an entire day to track down all the IPs and get them banned. I am sure the majority of them will be back to abuse the system. This has been a regular occurrence that has been growing every month we have been here. They were so easy to spot in the beginning, but it has grown and grown until I can't do anything about it. I have asked and asked the engines about the problem - looking for a tech solution, but up until this week, not a one would even acknowledge the problem to us.
The action here was not banning the bots - that was an outgrowth of the required login step. By requiring login, we require cookies. That alone will stop about 95% of the bots. The other 5% are the hard core bots whose operators will manually log in and then cut-n-paste the cookie over to the bot, or hand walk the bot through the login page.
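To make the cookie requirement concrete, here is a minimal sketch in Python (not the forum's actual Perl code). The cookie name `session`, the `/login` path, and the `gate` function are all made up for illustration. The only point is that a client which never sends back a valid login cookie gets bounced to the login page instead of the content.

```python
from http.cookies import SimpleCookie

LOGIN_COOKIE = "session"  # hypothetical cookie name
LOGIN_PATH = "/login"     # hypothetical login URL


def gate(path, cookie_header, valid_sessions):
    """Return (status, redirect_location) for one request.

    Clients that do not send a valid login cookie are redirected to the
    login page; everyone else is served normally.
    """
    if path == LOGIN_PATH:
        return 200, None  # the login page itself must stay reachable
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    morsel = cookies.get(LOGIN_COOKIE)
    if morsel is not None and morsel.value in valid_sessions:
        return 200, None
    return 302, LOGIN_PATH
```

A ripper that ignores cookies never gets past the redirect. One that has a valid cookie pasted in by hand passes this check, which is the remaining 5% described above.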
How big of an issue was it on an IP level? 4000 IPs banned in the htaccess since June, when I cleared it out. If unchecked, I'd guess we would be somewhere around 10-20 million page views a day from the bots. I was spending 5-8 hrs a week (about an hour a day) fighting them.
We have been doing everything you can think of from a tech standpoint. This is a part of that ongoing process. We have pushed the limits of page delivery, banning, ip based, agent based, and down right agent cloaking a good portion of the site to avoid the rogue bots. Now that we have taken this action, the total amount of work, effort, and code that has gone into this problem is amazing.
Even after doing all this, there are still a dozen or more bots going at the forums. The only thing that has slowed them down is moving back to session ids and random page urls (ie: we have made the site uncrawlable again). Among the worst offenders were the monitoring services. At least 100 of those IPs are still banned today. All they do is try to crawl your entire site to look for trademarked keywords for their clients.
> how fast we fell out of the index.
Google yes - I was taken aback by that. I totally overlooked the automatic url removal system. *kick can - shrug - drat* can't think of everything. Not the first mistake we've made - won't be the last.
It has been over 180 days since we blocked GigaBlast, 120 days since we blocked Jeeves, over 90 days since we blocked MSN, and almost 60 days since we blocked Slurp. As of last Tuesday we were still listed in all but Teoma. MSN was fairly quick, but still listed the urls without a snippet.
The spidering is controllable by crawl-delay, but I will not use nonstandard robots.txt syntax - it defeats the purpose and makes a mockery of the robots.txt standard. If we start down that road, then we are accepting that problem as real. The engines should be held accountable for changing a perfectly good and accepted internet standard. If they want to change it in whole, then let's get together and rewrite the thing with all parties involved - not just those little bits that suit the engines' purposes. Without webmaster voices in the process, playing with the robots.txt standards is as unethical as Netscape and Microsoft playing fast and loose with HTML standards.
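For anyone who has not seen the extension being argued about, here is a small sketch of what a `Crawl-delay` line looks like and how a well-behaved crawler might read it. It uses Python's standard `urllib.robotparser`. The bot name `ExampleBot` and the paths are invented for illustration, and nothing here is taken from WebmasterWorld's real robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt using the nonstandard Crawl-delay extension.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds the named bot is asked to wait between requests.
delay = parser.crawl_delay("ExampleBot")
# A bot with no Crawl-delay rule of its own gets no delay value.
other = parser.crawl_delay("SomeOtherBot")
# Ordinary Disallow rules are still honoured alongside the extension.
blocked = not parser.can_fetch("ExampleBot", "/private/page.htm")
```

The directive is only a request. It slows down crawlers that choose to honour it and does nothing about rogue bots that ignore robots.txt altogether.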
Steps we have taken:
An intelligent combo of all of the above is where we are at right now today.
The biggest issue is that I couldn't take any of the above steps and require login for users without blocking the big se crawlers too. I don't believe the engines would have allowed us to cloak the site entirely - I wouldn't bother even asking if we could or tempting fate. We have cloaked a lot of stuff around here from various bots - our own site search engine bot - etc. - it is so hard to keep it all managed right. It is really frustrating that we have to take care of all this *&*$% for just the bots.
That is the main point I wanted you to know - this wasn't some strange action at banning search engines. Trust me, no one is more upset about that part than I am - I adore se traffic. This is about saving a community site for the people - not the bots. I don't know how long we will leave things like this. Obviously, the shorter the better.
The heart of the problem? WebmasterWorld is too easy to crawl. If you move to an htaccess mod rewrite setup that hides your cgi parameters behind static-looking urls, you will see a big increase in spidering that will grow continually. Ten times as many off-the-shelf site rippers support static urls as support cgi based urls.
Rogue Bot Resources:
Ideally you would do this with a firewall device that sits in front of your webserver(s). A good Cisco device would be able to handle a really high request load, and would not cost your webserver a single CPU cycle.
A software based firewall running on the webserver would be ok too - though, since the request needs to be handled by your server still, that's going to result in some usage on that server.
Still - blocking at the level of a software firewall is better than trying to block from the apache .htaccess level - because the firewall operates at a lower level on the TCP/IP stack.
A free and very easy to setup/use software firewall that would run on red hat enterprise is called 'apf firewall'. It includes an IP block list (/etc/apf/deny_hosts.rules).
As far as the hardware firewall - I would definitely suggest you contact the RackSpace support people about this - they definitely are the experts. While the external hardware firewall is definitely the better choice for reducing usage on your server - it would be more expensive. Depending on the exact device they recommend for this, the cost might be in the range of $500 - $5000.
From a user perspective, I think vBulletin would greatly add to Webmasterworld. I've been so frustrated at times using WW I wanted to pull my hair out. Search is the most lacking feature, even when it was there. vBulletin is all around a great pleasure to use, post, find what you want etc.
Just my 2 cents.
why not switch to a dynamic forum like vBulletin
As a Yorkshireman (renowned for our stinginess) I applaud him!
I am afraid you guys lost me a week ago with the technology involved in this. I just don't understand much of it and it may be that it is all necessary. What surprises me is that with all the experts in here no one has come up with an alternative so perhaps there is none.
All I can say is that the effect on my WW experience is a bit like losing a limb. It's just unfortunate that some other search feature could not have been implemented first. (Incidentally the site search option at the top of the page is no longer applicable).
Why not create an 'archive' area where you create truly static versions of your pages via a cron job when the server load is low.
Allow anyone without an IE/Opera/Mozilla UserAgent in, and anyone who arrives through those UA's gets redirected to the real pages.
At the end of the day the worst scenario is that the posts in the archive are up to 24hrs out of date. The server load should drop right back as the static pages should cache nicely and serve up quick.
That is the hardest part. I was tracking all the IPs with "reason for ban" in a spreadsheet. It has about 20k IPs in it that we have banned over the last few years. I take the IP off after about 5-6 months.
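The spreadsheet bookkeeping described above could be automated along these lines. This is only a sketch under stated assumptions: the 180-day lifetime stands in for "about 5-6 months", the function names are invented, and the output uses Apache 2.2-style `Deny from` lines, which may not match the server configuration actually in use.

```python
from datetime import date, timedelta

BAN_LIFETIME = timedelta(days=180)  # assumption: roughly "5-6 months"


def active_bans(bans, today):
    """Keep only bans younger than BAN_LIFETIME.

    `bans` is a list of (ip, reason, banned_on) tuples, the same columns
    a spreadsheet of banned IPs would hold.
    """
    return [ban for ban in bans if today - ban[2] < BAN_LIFETIME]


def htaccess_lines(bans):
    """Render each ban as a comment carrying the reason plus a Deny line."""
    return [f"# {reason}\nDeny from {ip}" for ip, reason, _ in bans]
```

Run from a nightly job, something like this would let old bans expire on their own instead of being pruned by hand.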
I'm no Unix Expert and I apologise if this has already been covered. But this has been sitting in my bookmarks for last year or so. [neilgunton.com ] From my limited POV it looks good.
AFAIK, some bad guys are fond of setting up a proxy page like:
$ctx = file_get_contents("http://your domain/" . $_SERVER['REQUEST_URI']);
$ctx = preg_replace(.... replace your banner, logo with theirs..);
then sitting back and waiting for money from the ad links they put up.
This may hit your cache nicely, but I bet you would prefer to ban these IPs. They are really easy to spot in the logs if you look.
Any suggestions that rely on UA checks implicitly rely on the "visitor" honestly reporting their UA. In other words, UA checks are worthless (especially since even mainstream browsers now let you spoof a UA).
If you work out that WebmasterWorld search feature and it brings back relevant results quickly I think you'll be surprised to see how much higher that number goes. When it works, they use it.
Brett has a lean, mean Forum that he has written himself in Perl, and which handles a staggeringly high number of postings on a single RH-box. He prides himself on the fact that just this single box is able to handle what other fora need a server-farm for.
There's nothing wrong with having a lean, mean site, but everything has its limits. If the number of real webmasters using WebmasterWorld increased 10-fold, would he still be able to handle it on a single server? What if Google had tried to build their search engine to run on just a single server?
It sounds like now is the time to redesign the site to function on multiple servers. It might also make sense to build it around a database (assuming it isn't already), add search functionality, and do a number of other things.
- What exactly is the problem which is trying to be solved here? - it's not entirely clear
- Server load seems to be a big issue. Has getting the BestBBS generated source cleaned up been considered? It could probably be chopped in half by moving a fair amount of presentational stuff (e.g. attributes in <td> tags, <font>, etc.) into CSS. (I'm NOT suggesting a full CSS-P layout, but you could get rid of a lot of code without really affecting what things look like at all.)
vBulletin is really powerful, but is very resource intensive. I admin a forum that gets a fraction of the traffic WebmasterWorld does, and it takes two dual Xeon servers with 4 GB of RAM each. I've gone to pains to eliminate features and queries to keep the load down, but we can still get overwhelmed by traffic peaks. I'm getting very zippy fractional-second page loads with 800 users online, but I'm waiting to see what will happen when we peak at 2000+ simultaneous users. (I can tell you what happens with one dual Xeon server - it gets really slow, and starts throwing database errors.)
I think vBB would be an interesting solution for WebmasterWorld, but it would take some serious recoding to streamline page creation and some even more serious hardware.
Here's the vBulletin hardware setup for a busy forum (not mine) - Alexa rank about 14,000, with a base of 40 million posts: "5 Web Servers each with Dual Xeon 3.0 Ghz Processors, 2GB Ram, Ultra 320 15k SCSI Drive. Gallery/Uploads Server with Single Xeon 2.8, 2GB Ram, Ultra 320 15k Raid Array. Database Server with Dual Xeon 3.0 1MB, 4GB Ram, Ultra 320 15k Raid Array." That's serious hardware.
Comparing forums is difficult, of course, without detailed information on user behavior. Very few sites report stats like "peak hourly pageviews", which is a major determinant of server load, as are the number of searches and posts. One site can have a few million posts and 100K members and loaf along because it's rare for more than a couple of hundred members (and visitors) to be online at the same time. Another forum that looks smaller could actually be much more server intensive if it has more members online at once or gets a lot of external traffic.
At Pubcon presentations on Community Building, Brett has emphasized the importance of snappy page loads to keep members and visitors coming back, so you can be sure that he wouldn't want to get into a software/hardware setup that could bog down under load.
We didn't ban bots to start with - that was not the real action - we required login - which mandated blocking the bots or suffering the wrath of a totally abused login page. I would rather throw up a road block than sit here and let a bot crash into the 404 or login page because we required login and cookie support.
I've never heard of anyone cloaking the robots.txt file, but I can see no technical difficulty.
Essential for creating a so called 'community' - man I hate that term for anonymous web communications... - is openness. The doors are open, you drop in, look around, leave. If you like what you see, you join, and start contributing. That decision is made after you've determined you like what you see, not before. I never believe anything on the web until I see it myself, and I never signup for something without having seen it myself. Very simple. If there are walls around the community and you can only enter through one door, people turn away and go to 'communities' that are open.
Think of a small village with many roads leading into it, there's a public market, everyone hears about it from various news sources and gossip [search engines and word of mouth]. One day, all the roads except one, with a guard on it, block access to everything. It no longer feels like an open community, and people start to drift away, except for the ones who are used to hanging out at that market already, and a few others. But most just go to the next town over, that still has open access to its markets. And those are the ones people start reading about in their local papers, so that's where they go.
Then of course there's the other things, simple, people like their work to be seen, but when pictures of their sculptures stop appearing in the surrounding papers, they too start moving their work to more open markets. Brett knows all this, if bots are part of the web, you have to have a site that can support that part too.
While many webmasters out there would and will love it if Brett stays off google/search in general, that decision will only have one outcome long term. Which means bbs will have to be able to support running on multiple boxes. So my guess is Brett will keep the search engines out until he's reprogrammed the forums to run on multiple servers, then let the bots back in.
Don't know how truly optimized the db queries are though, my guess is that end probably needs work, but the punbb forums I checked out loaded very fast, much faster than most phpbb forums, even heavily html optimized ones, which suggests some good work on the db end of things.
Looks like about a sudden 60 percent drop in WW user traffic.
I run a very high volume site with forums. Over 1 terabyte of bandwidth per month. I can say from my own experience there are better ways to deal with this while still allowing legit search engine crawlers in.
Our intelligent automated approach works well:
1. Dynamic robots.txt
2. HTTP honey-pot without instant auto-blocking.
3. Honey-pot from projecthoneypot to catch the sneakiest of crawlers.
4. Automated log file scanning.
5. Firewalling of problematic IP ranges (not just single IPs). If I have to lose a handful of visitors in order to gain many more legit visitors from SERPs, so be it!
6. Lastly, hardware and bandwidth are cheap... throw more at it if you can. It's less costly than 8 hours per week of your time.
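Steps 4 and 5 in the list above might look roughly like this in Python. It is a sketch, not the poster's actual system. It assumes common/combined log format (client IP as the first field), and the threshold, the /24 grouping, and the function names are illustrative choices rather than anything stated in the post.

```python
import ipaddress
from collections import Counter


def heavy_hitters(log_lines, threshold):
    """Count requests per client IP and return the IPs at or above threshold.

    Assumes the client IP is the first whitespace-separated field on each
    line, as in common/combined log format.
    """
    counts = Counter(
        line.split(None, 1)[0] for line in log_lines if line.strip()
    )
    return {ip: n for ip, n in counts.items() if n >= threshold}


def ranges_to_block(ips, min_hosts=3, prefix=24):
    """Group offending IPv4 addresses by network and return the networks
    that contain at least min_hosts distinct offenders."""
    per_net = Counter(
        ipaddress.ip_network(f"{ip}/{prefix}", strict=False) for ip in ips
    )
    return sorted(net for net, n in per_net.items() if n >= min_hosts)
```

Blocking a whole range only when several offenders share it is one way to limit how many innocent visitors get caught, which is the trade-off step 5 accepts.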
We're going to be adding a geotargetting database soon, once we get another server up.
Alexa stats have never made any sense whatsoever for my sites. I used to think that they were part of Amazon, but found out they weren't when I tried to contact them awhile ago. I don't see the value in watching stats that appear to have no basis in reality.
I guess more hardware is not an option at webmasterworld because it does not use a database. It's file system based. But this limitation is just my assumption, and I may be wrong.
Presumably the login/cookie requirement can be disabled by IP address. Therefore:
1) Cloak the robots.txt file allowing approved bots in.
2) Let the approved bots in without login/cookie requirements.
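Purely to make the two-step proposal concrete, here is a sketch in Python. The allowlist is hypothetical and uses documentation-only address ranges. Real crawler ranges would have to come from the search engines themselves, and all the function names are invented.

```python
import ipaddress

# Hypothetical allowlist built from documentation-only networks.
APPROVED_BOT_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_approved_bot(remote_ip):
    """True if the client address falls inside an approved crawler range."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in APPROVED_BOT_NETWORKS)


def robots_txt_for(remote_ip):
    """Step 1: permissive robots.txt for approved IPs, blanket Disallow otherwise."""
    if is_approved_bot(remote_ip):
        return "User-agent: *\nDisallow:\n"
    return "User-agent: *\nDisallow: /\n"


def needs_login(remote_ip, has_login_cookie):
    """Step 2: approved IPs skip the login/cookie requirement."""
    return not (has_login_cookie or is_approved_bot(remote_ip))
```

Matching on IP rather than user agent avoids trusting a spoofable UA string, but it still serves different content to bots than to people.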
Kaled, Brett already made the point earlier that this would be seen as unapproved cloaking by the SEs and would result in WebmasterWorld getting banned. He mentioned that he even asked the SE reps personally and they said he would get banned.