WebmasterWorld is one of the largest sites on the web with easily crawlable flat or static html content. All of the robotic download programs (aka: site rippers) available on Tucows can download all 1m+ pages of the site. Those same bots cannot download the majority of other content-rich sites (such as forums or auction sites) on the web. This makes WebmasterWorld one of the most attractive targets on the web for site ripping. The problem has grown to critical proportions over the years.
Therefore, WebmasterWorld has taken the strong proactive action of requiring login cookies for all visitors.
It is the biggest problem I have ever faced as a webmaster. It threatens the existence of the site if left unchecked.
The more advanced tech people understand some of the tech issues involved here, but given the nature of the current feedback, there needs to be a better explanation of the severity of the problem. So let's start with a review of the situation and the steps we have taken that led us to the required login action.
It is not a question of how fast the site rippers pull pages, but rather the totality of all of them combined. About 75 bots hit us for almost 12 million page views two weeks ago from about 300 IPs (most were proxy servers). It took the better part of an entire day to track down all the IPs and get them banned. I am sure the majority of them will be back to abuse the system. This has been a regular occurrence that has been growing every month we have been here. They were so easy to spot in the beginning, but it has grown and grown until I can't do anything about it. I have asked and asked the engines about the problem - looking for a tech solution, but up until this week, not a one would even acknowledge the problem to us.
The action here was not the banning of bots - that was an outgrowth of the required login step. By requiring login, we require cookies. That alone will stop about 95% of the bots. The other 5% are the hard core bots whose owners will manually log in and then cut-n-paste the cookie over to the bot, or hand walk the bot through the login page.
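To make the cookie requirement concrete, here is a minimal sketch of such a gate in Python. It is not the site's actual code: the cookie name `ww_session`, the `/login` path and the `gate_request` helper are all assumptions made up for illustration.

```python
from http.cookies import SimpleCookie

# Hypothetical cookie name and login path, chosen only for illustration.
SESSION_COOKIE = "ww_session"
LOGIN_PATH = "/login"


def gate_request(path, cookie_header, valid_sessions):
    """Return (status, redirect_location) for one request.

    A request without a known session cookie is redirected to the login
    page, so a client that never stores cookies never gets past it.
    """
    if path == LOGIN_PATH:
        return 200, None  # the login page itself must stay reachable
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    morsel = cookies.get(SESSION_COOKIE)
    if morsel is not None and morsel.value in valid_sessions:
        return 200, None
    return 302, LOGIN_PATH
```

A bot that is handed a valid cookie copied from a browser passes this check just like a person does, which is the remaining 5% case described above.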
How big of an issue was it on an ip level? 4000 ips banned in the htaccess since June when I cleared it out. If unchecked, I'd guess we would be somewhere around 10-20 million page views a day from the bots. I was spending 5-8 hrs a week (about an hour a day) fighting them.
We have been doing everything you can think of from a tech standpoint. This is a part of that ongoing process. We have pushed the limits of page delivery, banning, ip based, agent based, and downright agent cloaking a good portion of the site to avoid the rogue bots. Now that we have taken this action, the total amount of work, effort, and code that has gone into this problem is amazing.
Even after doing all this, there are still a dozen or more bots going at the forums. The only thing that has slowed them down is moving back to session ids and random page urls (ie: we have made the site uncrawlable again). Among the worst offenders were the monitoring services. At least 100 of those ips are still banned today. All they do is try to crawl your entire site to look for trademarked keywords for their clients.
> how fast we fell out of the index.
Google yes - I was taken aback by that. I totally overlooked the automatic url removal system. *kick can - shrug - drat* can't think of everything. Not the first mistake we've made - won't be the last.
It has been over 180 days since we blocked GigaBlast, 120 days since we blocked Jeeves, over 90 days since we blocked MSN, and almost 60 days since we blocked Slurp. As of last Tuesday we were still listed in all but Teoma. MSN was fairly quick, but still listed the urls without a snippet.
The spidering is controllable by crawl-delay, but I will not use nonstandard robots.txt syntax - it defeats the purpose and makes a mockery of the robots.txt standard. If we start down that road, then we are accepting that problem as real. The engines should be held accountable for changing a perfectly good and accepted internet standard. If they want to change it in whole, then let's get together and rewrite the thing with all parties involved - not just those little bits that suit the engines' purposes. Without webmaster voices in the process, playing with the robots.txt standards is as unethical as Netscape and Microsoft playing fast and loose with HTML standards.
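For readers who have not met the directive: `Crawl-delay` is an extension that some crawlers honor, not part of the original robots exclusion standard, which is the objection raised above. The sketch below only shows what the syntax looks like and how a crawler might read it, using Python's standard `urllib.robotparser`. The robots.txt content and the `ExampleBot` name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; Crawl-delay is the nonstandard line.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds a polite crawler should wait between requests, or None if unset.
delay = parser.crawl_delay("ExampleBot")

# The standard Disallow rules are evaluated independently of the delay.
forum_allowed = parser.can_fetch("ExampleBot", "/forum1/")
private_allowed = parser.can_fetch("ExampleBot", "/private/page.htm")
```

Nothing forces a bot to read or obey either line, which is why it does nothing against site rippers.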
Steps we have taken:
An intelligent combo of all of the above is where we are at right now today.
The biggest issue is that I couldn't take any of the above steps and require login for users without blocking the big se crawlers too. I don't believe the engines would have allowed us to cloak the site entirely - I wouldn't bother even asking if we could or tempting fate. We have cloaked a lot of stuff around here from various bots - our own site search engine bot - etc. - it is so hard to keep it all managed right. It is really frustrating that we have to take care of all this *&*$% for just the bots.
That is the main point I wanted you to know - this wasn't some strange action at banning search engines. Trust me, no one is more upset about that part than I am - I adore se traffic. This is about saving a community site for the people - not the bots. I don't know how long we will leave things like this. Obviously, the shorter the better.
The heart of the problem? WebmasterWorld is too easy to crawl. If you move to a htaccess mod_rewrite setup that hides your cgi parameters behind static-looking urls, you will see a big increase in spidering that will grow continually. Ten times as many off-the-shelf site rippers support static urls as support cgi based urls.
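As an illustration of what "hiding cgi parameters" means, here is a rough Python sketch of the kind of mapping a rewrite rule performs. The URL scheme, the `forum.cgi` script name and the parameter names are assumptions for the example, not the site's real layout.

```python
import re

# Hypothetical scheme: /forum<N>/<thread>.htm stands in for
# /cgi-bin/forum.cgi?forum=<N>&thread=<thread>
STATIC_URL = re.compile(r"^/forum(?P<forum>\d+)/(?P<thread>\d+)\.htm$")


def rewrite(path):
    """Map a static-looking path to the underlying CGI URL, or None."""
    match = STATIC_URL.match(path)
    if match is None:
        return None
    return "/cgi-bin/forum.cgi?forum={forum}&thread={thread}".format(
        **match.groupdict()
    )
```

From the outside the site looks like a tree of plain .htm files, so any ripper that can follow ordinary links can walk all of it.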
Rogue Bot Resources:
I was also thinking one of those Google search appliances would be a good replacement for the site search, but looking at it and the number of pages here it would cost a fortune. Brett, know anybody over there that would give you a good discount on that? Maybe as part of some beta test or something?
[edited by: Brett_Tabke at 9:51 pm (utc) on Nov. 27, 2005]
> Seems like there is a great deal of interest in the topic
You bet - one of the most interesting I can remember (because it is live, and totally relevant to my site - that just has to also echo out to others).
> It was a, throw hands in the air I've had it moment
That is the human element that gets us all coming back here, Brett. Give me human over spiders every time.
> rogue bots are the #1 issue we face
Another echo of a problem we all face [webmasterworld.com] - the link is the way I've managed to fix it on my site (and how I found WebmasterWorld in the first place), and it is so useful to know the problems faced here, and how you are managing it.
On the issues of Search - since G is so good at this, do they have a solution that could be installed on your server? Just a thought.
> I was also thinking one of those Google search appliances...
Great minds, huh?
[edited by: AlexK at 8:16 pm (utc) on Nov. 25, 2005]
I suspect this comment is made with a mouthful of facetious irony, but the answer is certainly "yes."
In fact, this is what I do with a few news sites from a certain Third World country. Each news site individually is terribly biased (pro/anti-government), but by aggregating the stories into a local searchable database, I can get a more clear, "bigger picture."
So instead of a full blown cloaking script, you would just need to edit the login script to not require logins from certain IPs. Can that truly be considered cloaking?
Just a thought.
This Sucks! No site search in any engine AS WELL as no search box on webmasterworld.com
This doesn't seem right at all...
Why not go purely cloak and only serve pages to known IPs? Because cloaking is bad?
I don't get it, why be scared of "purely cloaking" if you're willing to give up all search engine traffic anyway?
After being a member with many names over the past seven years I have to say that this is unbelievable.
There has been so much focus on search on this site over the years I cannot believe two things:
--The action taken to remedy the problem you are having.
--There is no search box on ww.com?
For the first time I cannot find what I am looking for and I am going to be forced to get the information elsewhere.
Figure it out man!
I believe the explanation is that the site requires a login now so it would require cloaking to allow Googlebot in - and the cloaking would lead to a ban anyway.
If that's the case, why not enable guests to browse the forums, just not respond? Wouldn't this solve the problem?
This wouldn't be cloaking because as long as the user's browser supported cookies they could see the site, but you would allow Google to also see the site without cookies. This would stop most non-authorized spiders but also allow the search engines in.
Seems requiring the ability to set a cookie would accomplish the same thing as requiring a login.
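The exemption suggested in this part of the thread (cookies for everyone except known search engine addresses) could be sketched as below. The networks listed are reserved documentation ranges used as placeholders. Real crawler ranges would have to come from the engines themselves, and `needs_cookie` is a hypothetical helper name.

```python
import ipaddress

# Placeholder networks (reserved documentation ranges), NOT real crawler IPs.
CRAWLER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def needs_cookie(remote_addr):
    """True if this visitor must present a cookie before seeing pages."""
    addr = ipaddress.ip_address(remote_addr)
    return not any(addr in network for network in CRAWLER_NETWORKS)
```

Whether the engines would treat this as cloaking is exactly the open question in the thread. The sketch only shows that the mechanics are simple.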
The human verifier page could have something saying "register so you don't have to do this anymore."
> (most were proxy servers)
Isn't it nice to know your "friends" at least put on the carnival masks before attempting the burglary :)
warm and cosy feeling ..
actually..given the expression on the "kittens"( where'd they go?..404's aint what they used to be ;) I am slightly surprised that you didn't smack a few bots back into their boxes and break a few spider legs and lairs whilst you were at it ..
showed restraint ..
whilst I think of it and there is some agreement behind scenes in stickies ..particularly with relation to "update" threads etc .."post" POST MODERATION mightn't be a bad idea either ..even if there lies the path to censorship ..cut out an awful lot of the "can't be bothered to read the TOS" or "I can't read 500 posts" etc ..WHY NOT!
[edited by: Leosghost at 9:53 pm (utc) on Nov. 25, 2005]
Point is ..if as a side effect of the current config the quality of the threads gets back to what it was ..( endless cut and pastes of what "Cutts" / "gg" said are really not good for this place ..after all it does say professional in the description ..and whilst all of us were "noobs" somewhere ..at some time ..of late there is a tendency of many to post just to see their own posts ..and of some real new visitors to think that some of them ( the multi-cut and pasters and DC watchers ) actually know what they are talking about ) ..
If one wants slash dot type "Warhol fame" posts ..this site is not what one opens ..
mini rant ended..needed saying ..semi on topic IMO
Alexa ..rotfalol ..only for drive by black hat planning are alexa relevant ..
Can you use something like VBulletin and limit Search to Supporters Only?
I'm seeing it too on a much smaller sized message board. I've seen some over in the phpbb forums wondering "why all these large increases in visitors online?" Sometimes tenfold the number of actual registered visitors on much smaller boards. It has been posted in their forums but nobody had an answer for it.
How else can it be explained? Man I really feel for Brett because I've been dealing with what almost looks to be the same problem. I was on a shared server and they shut it down twice. I've recently upgraded but the strange IPs are still hammering away.
But then again, I'm kinda lost with a lot of this stuff.
My two cents: Drop any cookie requirements for the initial visit but present a captcha check once the hit rate coming from a particular IP exceeds a certain value. Ask again every 5 minutes for as long as the hit rate exceeds that limit. If the captcha check fails, blacklist the IP. If the captcha check succeeds, use a session cookie to identify the user agent. If the session cookie is rejected by the UA, do another captcha check along with a message saying that you require cookies.
I don't know what the hit rate limit should be exactly, but it could be that it needs to be so low that busy members might trigger it incidentally. To deal with that, members other than "New Users" should be white listed.
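The scheme in this post can be sketched roughly as follows. Everything here is an assumption for illustration: the `HitRateGate` class, the default limits, and the simplification of remembering a solved captcha per IP rather than through the session cookie the post describes.

```python
import time
from collections import defaultdict, deque


class HitRateGate:
    """Decide per request whether an IP should be shown a captcha."""

    def __init__(self, max_hits=30, window=60.0, recheck=300.0, whitelist=()):
        self.max_hits = max_hits    # hits allowed per window (assumed value)
        self.window = window        # window length in seconds (assumed value)
        self.recheck = recheck      # ask again after 5 minutes
        self.whitelist = set(whitelist)  # established member names
        self.hits = defaultdict(deque)   # ip -> timestamps of recent hits
        self.passed_at = {}              # ip -> time of last solved captcha
        self.blacklist = set()

    def check(self, ip, member=None, now=None):
        """Return 'allow', 'captcha' or 'blocked' for one request."""
        now = time.monotonic() if now is None else now
        if ip in self.blacklist:
            return "blocked"
        if member in self.whitelist:
            return "allow"
        hits = self.hits[ip]
        hits.append(now)
        # Drop hits that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window:
            hits.popleft()
        if len(hits) <= self.max_hits:
            return "allow"
        last_pass = self.passed_at.get(ip)
        if last_pass is not None and now - last_pass < self.recheck:
            return "allow"
        return "captcha"

    def captcha_result(self, ip, solved, now=None):
        """Record the outcome of a captcha shown to this IP."""
        now = time.monotonic() if now is None else now
        if solved:
            self.passed_at[ip] = now
        else:
            self.blacklist.add(ip)
```

A usage example with deliberately tiny limits:

```python
gate = HitRateGate(max_hits=3, window=10.0, whitelist={"old_timer"})
results = [gate.check("203.0.113.5", now=float(t)) for t in range(4)]
# The fourth hit inside the window triggers the captcha.
```

As the post says, the hard part is picking the limit. Set it too low and busy members trip it, which is why the whitelist check comes before any counting.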
But if I were a new visitor coming from G, I couldn't be bothered to register just to see some contents I have never seen before and don't know the quality of.
Fair enough, but that could help weed out the useless "me too" posts and raise the quality bar.
[edited by: engine at 11:31 pm (utc) on Nov. 25, 2005]
This is what does not make sense....
And Brett if you could...please answer this one!
Why do you not opt to cloak the site?
If the answer is because it is "bad"....explain why!
If the reason you don't want to cloak is because you are afraid of search engines frowning down on you, then I am COMPLETELY LOST.. I mean, here you are not cloaking, but who really cares if you're not even in the search engines..
<joking>But yeah at least you're not cloaking</joking>
I have read through many suggestions...some may work some may not!
You say you tried everything but have you tried to cloak the whole site?
That exact question has been asked of the se's for years, and they say universally, that would be cloaking and against the major se guidelines and you would be subject to removal.
> If the reason you don't want to cloak is because you are afraid of search engines frowning down on you, then I am COMPLETELY LOST.. I mean, here you are not cloaking, but who really cares if you're not even in the search engines.
Not my place to answer this, but it seems obvious to me - this state of affairs might not be permanent. Why poison the domain? We've all seen tales of woe from people here who have domains that are so poisoned that they're beyond redemption/recovery.
And for those folks who want the wavy letter captcha things - man, most of the time when I run into those, I can't tell what the heck they're supposed to be spelling. Please don't put me through that.
And great intro to part two of this thread, Brett. It really puts things in perspective.
Added: He's playing by the rules, man - rather hard to find fault with that, isn't it?
[edited by: Stefan at 11:51 pm (utc) on Nov. 25, 2005]
- donations for new server equip
- custom software
- distributed computing resources (Akamai or SETI@Home style)
This is going to sound naive, but I'd like to think we could 'open source' a solution given that we have a lot of talented webmasters, distributed servers, the will, etc.
To the issue with syncing, maybe WW's own server(s) (How many exactly are there Brett?) would host fresh content, and a distributed network of member servers (I'll donate some cycles) form an Akamai-style distributed WW. Maybe P2P syncing could reduce the load on WW to update the mirrors? And since only 'changes' would need to be synced, it would be limited to new posts.