Forum Moderators: open
[webmasterworld.com...]
Summary:
WebmasterWorld is one of the largest sites on the web with easily crawlable, flat or static HTML content. All of the robotic download programs (aka site rippers) available on Tucows can download the entire 1m+ pages of the site. Those same bots cannot download the majority of other content-rich sites on the web (such as forums or auction sites). This makes WebmasterWorld one of the most attractive targets on the web for site ripping. The problem has grown to critical proportions over the years.
Therefore, WebmasterWorld has taken the strong proactive action of requiring login cookies for all visitors.
It is the biggest problem I have ever faced as a webmaster. It threatens the existence of the site if left unchecked.
The more advanced tech people understand some of the tech issues involved here, but given the nature of the current feedback, there needs to be a better explanation of the severity of the problem. So let's start with a review of the situation and the steps we have taken that led us to the required-login action.
It is not a question of how fast the site rippers pull pages, but rather the totality of all of them combined. About 75 bots hit us for almost 12 million page views two weeks ago from about 300 IPs (most were proxy servers). It took the better part of an entire day to track down all the IPs and get them banned. I am sure the majority of them will be back to abuse the system. This has been a regular occurrence that has grown every month we have been here. They were so easy to spot in the beginning, but it has grown and grown until I can't do anything about it. I have asked and asked the engines about the problem - looking for a tech solution - but up until this week, not one of them would even acknowledge the problem to us.
The action here was not about banning bots - that was an outgrowth of the required-login step. By requiring login, we require cookies. That alone will stop about 95% of the bots. The other 5% are the hard-core bots that will manually log in and then cut-n-paste the cookie over to the bot, or hand-walk the bot through the login page.
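The cookie gate can be sketched in Apache mod_rewrite terms (the cookie name "wwsession" and the /login path here are invented for illustration, not WebmasterWorld's actual setup): any request that does not present a session cookie is bounced to the login page, which by itself stops every bot that doesn't handle cookies.

```apache
# Hypothetical sketch - cookie name "wwsession" and /login path are made up.
RewriteEngine On
# Let the login page itself through, or we'd redirect in a loop.
RewriteCond %{REQUEST_URI} !^/login
# No session cookie presented? Off to the login page.
RewriteCond %{HTTP_COOKIE} !wwsession=
RewriteRule .* /login [R=302,L]
```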
How big an issue was it on an IP level? 4,000 IPs banned in the .htaccess since June, when I cleared it out. If unchecked, I'd guess we would be somewhere around 10-20 million page views a day from the bots. I was spending 5-8 hours a week (about an hour a day) fighting them.
We have been doing everything you can think of from a tech standpoint; this is part of that ongoing process. We have pushed the limits of page delivery and banning - IP based, agent based, and downright agent-cloaking a good portion of the site to avoid the rogue bots. Now that we have taken this action, the total amount of work, effort, and code that has gone into this problem is amazing.
Even after doing all this, there are still a dozen or more bots going at the forums. The only thing that has slowed them down is moving back to session IDs and random page URLs (eg: we have made the site uncrawlable again). One of the worst offenders was the monitoring services; at least 100 of their IPs are still banned today. All they do is try to crawl your entire site looking for trademarked keywords for their clients.
> how fast we fell out of the index.
Google yes - I was taken aback by that. I totally overlooked the automatic url removal system. *kick can - shrug - drat* can't think of everything. Not the first mistake we've made - won't be the last.
It has been over 180 days since we blocked GigaBlast, 120 days since we blocked Jeeves, over 90 days since we blocked MSN, and almost 60 days since we blocked Slurp. As of last Tuesday we were still listed in all but Teoma. MSN was fairly quick, but still listed the urls without a snippet.
The spidering is controllable by crawl-delay, but I will not use nonstandard robots.txt syntax - it defeats the purpose and makes a mockery of the robots.txt standard. If we start down that road, then we are accepting that problem as real. The engines should be held accountable for changing a perfectly good and accepted internet standard. If they want to change it wholesale, then let's get together and rewrite the thing with all parties involved - not just those little bits that suit the engines' purposes. Without webmaster voices in the process, playing with the robots.txt standards is as unethical as Netscape and Microsoft playing fast and loose with HTML standards.
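For reference, the nonstandard extension under discussion looks like this in robots.txt - Crawl-delay (seconds between fetches) was honored by Slurp and MSNbot but was never part of the original robots exclusion standard:

```
User-agent: Slurp
Crawl-delay: 30
```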
Steps we have taken:
An intelligent combination of all of the above is where we are right now.
The biggest issue is that I couldn't take any of the above steps and require login for users without blocking the big SE crawlers too. I don't believe the engines would have allowed us to cloak the site entirely - I wouldn't bother even asking if we could, or tempting fate. We have cloaked a lot of stuff around here from various bots - our own site search engine bot, etc. - and it is so hard to keep it all managed right. It is really frustrating that we have to take care of all this *&*$% just for the bots.
That is the main point I wanted you to know - this wasn't some strange action aimed at banning search engines. Trust me, no one is more upset about that part than I am - I adore SE traffic. This is about saving a community site for the people - not the bots. I don't know how long we will leave things like this. Obviously, the shorter the better.
The heart of the problem? WebmasterWorld is too easy to crawl. If you move to a .htaccess mod_rewrite setup that hides your CGI parameters behind static-looking URLs, you will see a big increase in spidering that will grow continually. Ten times as many off-the-shelf site rippers support static URLs as support CGI-based URLs.
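The kind of rewrite being described looks roughly like this (the script name and parameters are invented for illustration) - the ripper sees only static-looking .htm URLs while a CGI script actually serves the pages:

```apache
# Hypothetical mod_rewrite rule: /forum12/345.htm is served by a CGI script.
RewriteEngine On
RewriteRule ^forum([0-9]+)/([0-9]+)\.htm$ /cgi-bin/board.pl?forum=$1&topic=$2 [L]
```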
Thanks
Brett
Rogue Bot Resources:
Eh? The data the google bot sees is EXACTLY the same data the user would see. You're just requiring the user to login. The data behind the login page is still the same. No one is being deceived.
The USER who clicks on Google's result is being deceived - instead of seeing a page with relevant data, he will see a login page. I feel seriously pissed off when I click on some links in Google News only to be greeted with a subscribers-only page - it may be acceptable for Google News, since they have trusted feeds, but it's not for the Google search engine.
Granted, registration is free, but that's not the point - what's important is that there is no way computers can distinguish between good Brett cloaking for a good reason and bad Bill-The-Spammer who cloaks for bad reasons. Thus anybody who cloaks should be penalised, because machines simply can't see the difference.
eg: if you have an ip that has been banned and you are running as a bot name - that's why...
Actual IP parsing is left as an exercise to the reader:
#####################
#!/usr/bin/perl
# Append the requesting IP to the first "deny from" line in .htaccess,
# or add a new "deny from" line if none exists yet.
use strict;
use warnings;
# use CGI::Carp qw(fatalsToBrowser);

print "Content-type: text/plain\n\n"; # if needed...

my $ip  = $ENV{'REMOTE_ADDR'};
my $hta = ".htaccess";
my @htaccess;

GetHtaccess();
BurnIP($ip);
PutHtaccess();

sub BurnIP {
    my $z    = shift;
    my $done = 0;
    my @out;
    foreach my $t (@htaccess) {
        # Append the IP to the first existing "deny from" line only.
        if ($t =~ /deny from/i && !$done) {
            $t .= " $z";
            $done++;
        }
        push(@out, $t);
    }
    # No "deny from" line found - add one.
    push(@out, "\ndeny from $z\n") unless $done;
    @htaccess = @out;
}

sub PutHtaccess {
    open(my $fh, '>', $hta) or die "can't write $hta: $!";
    print $fh "$_\n" for @htaccess;
    close($fh);
}

sub GetHtaccess {
    return 0 unless -e $hta;
    open(my $fh, '<', $hta) or die "can't read $hta: $!";
    @htaccess = <$fh>;
    chomp @htaccess;
    close($fh);
    return 1;
}
#####################
If not logged in :-
1) Disable all outward links from every page.
2) If cookies are enabled, insert a login form at the top of the page, otherwise insert a "cookies required" message at the top of the page.
By disabling all the outward links, robots would be totally screwed. By displaying the indexed content (albeit with a login form at the top of the page) those that enter the site directly from a search engine will not be overly annoyed/disappointed.
Kaled.
I don't know anything about the workings of this site but I presume it's driven by some sort of database, could not a search function be written for it?
Is it because the site is too big and it would take too long to do a search?
By disabling all the outward links, robots would be totally screwed.
It seems to me that this will defeat the point - if robots can't find links, then they won't index much, so there will be no bots and hence no results in Google to get traffic from. This is fine if you don't want bots, but the way I understand the situation here (with alleged cloaking of robots.txt and possibly other content depending on user-agent), the point is to make people register to view content, yet some bots are allowed in using cloaking.
It's either having your cake or eating it.
You would disable links if not logged in (thereby defeating unwanted robots) but enable the links if logged in and/or on a white-listed IP address. Thus, Googlebot et al could be allowed in.
This is still cloaking, but the user would see the same content as indicated by the search, however links would be disabled and a login form would be displayed at the top of the page.
To disable links, simply set href="#" (to go to top of page I think). You could also use javascript to focus the first item on the form.
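A rough sketch of that logged-out view (all names invented for illustration): the indexed content stays visible, internal links are neutered to href="#", and a script focuses the login form's first field.

```html
<!-- Hypothetical logged-out page: content intact, links disabled. -->
<form action="/login" method="post" name="login">
  <input type="text" name="user">
  <input type="password" name="pass">
  <input type="submit" value="Login">
</form>
<script>document.forms["login"].elements["user"].focus();</script>
<p>The thread content a searcher expected is still here, but
   <a href="#">next page</a> leads nowhere until they log in.</p>
```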
Kaled.
Is this something that will be implemented in the near future?
Another thing: I mentioned PunBB as being a very light and efficient BB system, but I forgot to mention the best system: MesDiscussions. I have no other words - it simply is the best. The only thing is, I'm not sure they have support in English (it's a French system).
cheers
Plus it can be "taken down" too easily for most usages ..France has some of the best coders in the world ..they did not work on "MesDiscussions"
support is french only ..24 hour delay except holidays ..France has lots of holidays ..form mail problem submit system..
BTW ..you seriously think that TF1 has the specific bot problems of here ..most of the posters to TF1 fora can barely read let alone run a bot!
This is still cloaking, but the user would see the same content as indicated by the search, however links would be disabled and a login form would be displayed at the top of the page.
Some people will certainly be confused at seeing no links - but I suppose in this case it could actually be a reasonable compromise. The only thing I don't like is producing different content depending on whether requests come from particular search engine IPs. All good search engines should keep an unknown range of IPs to catch exactly such things, and this brings up the ultimate problem with cloaking: how can a machine determine whether it's cloaking with the best intentions or just plain black-hat spamming? It can't, so the only reasonable course of action is to ban all cloakers. (Note: geo-IP delivery is not cloaking.)
I was not aware TF1 did use this system at any stage.
MesDiscussions is used for some of the largest online communities, including hardware.fr (about 350,000 members and not far from 30 million messages).
The "smiley" thing is not an issue, it can be deactivated. The support however can become an issue.
And by the way, I suggest those who doubt systems such as phpBB for large communities have a look at the "Gaia Online" forums. You'll be surprised... I think they're not far from 1,000,000 new posts each month.
1) Use entirely cgi links.
2) Unless logged in or on an IP address white-list, all links will lead to a login page.
3) The login page must include a noindex robots meta tag (to avoid a duplicate-content penalty!)
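The meta tag in step 3, on the login page only, would be:

```html
<meta name="robots" content="noindex">
```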
Kaled.
2) Unless logged in or on an IP address white-list, all links will lead to a login page.
This is based on the assumption that you know all the IPs of a given search engine - this perhaps will work now, but with the explosion of spam sites, it seems to me that using cloaking is exposing yourself to a serious penalty.
This also does not address the issue of users who came from a search engine being misled - they expect to see the content they searched for on the first page after the click, but instead they will have to register, etc. When I come across these things, I just close the window and search harder.
effisk, about those French forums: I was going to post a code sample from their page - it certainly does nothing to support your claim that they are the best - but leosghost already covered the question.
With phpBB forums, it's a different idea than these: they are DB driven, while this BBS is flat-file driven. Different animals. PunBB does look interesting, but hasn't been stress-tested yet, as far as I know, on a major forum site. It's probably more or less phpBB lite, with some extras and some subtractions - definitely, as I noted, the best output CSS/HTML of any forum software I've yet looked at. And very quick. But these forums aren't going to migrate to any generic solution, so there's really no point in bringing that up.
Not sure how to translate it into numbers, but the Alexa ranking is now at about 500, down from a top 300 or so. Sure there's a drop, but it's holding up pretty well IMO.
Brett, how about selling your blacklists? If you have every bad bot in the Delta Quadrant coming at you I would think it would be a trustworthy and effective blacklist that you could make a profit out of (and of course make operations a bit cheaper). I *LOVE* access logs and will probably always love them, I'm sure there are plenty of others who spend countless hours tracking like I do. Maybe you could hire someone to cover the work for you (unless you do like dealing with the issue though probably not)...either way sell the lists and make a profit and buy some new super servers or something? ;)
I fully agree with requiring cookies to serve content, it must be done to keep this or any site under siege operationally sound.
WW is rarely the one to break a story. Usually stuff is posted on the front page days or weeks after the rest of the web gets it (I've noticed there are peaks and troughs, spurts and dry spells, in the stories posted). So basically you come to read the valuable comments posted by members. That will continue.
But searching for other people who have started threads on a topic I am interested in will now, unfortunately, have to happen elsewhere (and sorry, I don't prefer to use MSN search...and the web is about choices - you can't choose for me.)
Absolutely nothing has changed in the abstract. Anyone who is willing to accept a cookie can come to WebmasterWorld and rip the entire site. It's just an added requirement - just as before, when everyone who had an internet connection could come and rip the entire site.