[webmasterworld.com...]
Summary:
WebmasterWorld is one of the largest sites on the web with easily crawlable flat or static HTML content. All of the robotic download programs (aka site rippers) available on Tucows can download all 1m+ pages of the site. Those same bots cannot download the majority of other content-rich sites (such as forums or auction sites) on the web. This makes WebmasterWorld one of the most attractive targets on the web for site ripping. The problem has grown to critical proportions over the years.
Therefore, WebmasterWorld has taken the strong proactive action of requiring login cookies for all visitors.
It is the biggest problem I have ever faced as a webmaster. It threatens the existence of the site if left unchecked.
The more advanced tech people understand some of the tech issues involved here, but given the nature of the current feedback, there needs to be a better explanation of the severity of the problem. So let's start with a review of the situation and the steps we have taken that led us to the required login action.
It is not a question of how fast the site rippers pull pages, but rather the totality of all of them combined. About 75 bots hit us for almost 12 million page views two weeks ago from about 300 IPs (most were proxy servers). It took the better part of an entire day to track down all the IPs and get them banned. I am sure the majority of them will be back to abuse the system. This has been a regular occurrence that has been growing every month we have been here. They were so easy to spot in the beginning, but it has grown and grown until I can't do anything about it. I have asked and asked the engines about the problem - looking for a tech solution, but up until this week, not a one would even acknowledge the problem to us.
The action here was not banning bots - that was an outgrowth of the required login step. By requiring login, we require cookies. That alone will stop about 95% of the bots. The other 5% are the hard-core bots whose operators will manually log in and then cut-n-paste the cookie over to the bot, or hand-walk the bot through the login page.
How big an issue was it on an IP level? 4000 IPs banned in the htaccess since June, when I cleared it out. If unchecked, I'd guess we would be somewhere around 10-20 million page views a day from the bots. I was spending 5-8 hrs a week (about an hour a day) fighting them.
We have been doing everything you can think of from a tech standpoint. This is a part of that ongoing process. We have pushed the limits of page delivery, banning, IP based, agent based, and downright agent cloaking a good portion of the site to avoid the rogue bots. Now that we have taken this action, the total amount of work, effort, and code that has gone into this problem is amazing.
Even after doing all this, there are still a dozen or more bots going at the forums. The only thing that has slowed them down is moving back to session ids and random page urls (i.e., we have made the site uncrawlable again). Among the worst offenders were the monitoring services. At least 100 of those IPs are still banned today. All they do is try to crawl your entire site to look for trademarked keywords for their clients.
> how fast we fell out of the index.
Google yes - I was taken aback by that. I totally overlooked the automatic url removal system. *kick can - shrug - drat* can't think of everything. Not the first mistake we've made - won't be the last.
It has been over 180 days since we blocked GigaBlast, 120 days since we blocked Jeeves, over 90 days since we blocked MSN, and almost 60 days since we blocked Slurp. As of last Tuesday we were still listed in all but Teoma. MSN was fairly quick, but still listed the urls without a snippet.
The spidering is controllable by crawl-delay, but I will not use nonstandard robots.txt syntax - it defeats the purpose and makes a mockery of the robots.txt standard. If we start down that road, then we are accepting that problem as real. The engines should be held accountable for changing a perfectly good and accepted internet standard. If they want to change it in whole, then let's get together and rewrite the thing with all parties involved - not just those little bits that suit the engines' purposes. Without webmaster voices in the process, playing with the robots.txt standards is as unethical as Netscape and Microsoft playing fast and loose with HTML standards.
Steps we have taken:
An intelligent combo of all of the above is where we are at right now today.
The biggest issue is that I couldn't take any of the above steps and require login for users without blocking the big se crawlers too. I don't believe the engines would have allowed us to cloak the site entirely - I wouldn't bother even asking if we could or tempting fate. We have cloaked a lot of stuff around here from various bots - our own site search engine bot - etc. - it is so hard to keep it all managed right. It is really frustrating that we have to take care of all this *&*$% for just the bots.
That is the main point I wanted you to know - this wasn't some strange action at banning search engines. Trust me, no one is more upset about that part than I am - I adore se traffic. This is about saving a community site for the people - not the bots. I don't know how long we will leave things like this. Obviously, the shorter the better.
The heart of the problem? WebmasterWorld is too easy to crawl. If you move to an htaccess mod_rewrite setup that hides your cgi parameters behind static-looking urls, you will see a big increase in spidering that will grow continually. Ten times as many off-the-shelf site rippers support static urls as support cgi-based urls.
Thanks
Brett
Rogue Bot Resources:
What firewall software would you suggest that has that easily manageable capability? Under RedHat enterprise.
Any will do. I use iptables; you can do something like this (some features used below need to be enabled at kernel compile time, but I'd guess your kernel already has them) ->
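# Chain that holds the list of banned IPs
iptables -N DENY_IP
# Chain that decides what a banned IP gets back
iptables -N DENY_IP_RULE
# Answer a limited number of packets with an ICMP rejection...
iptables -A DENY_IP_RULE -m limit --limit 64/hour --limit-burst 16 -j REJECT --reject-with icmp-host-prohibited
# ...and silently drop everything beyond that rate
iptables -A DENY_IP_RULE -j DROP
# Check every incoming packet against the ban list first
iptables -I INPUT -j DENY_IP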
# Now the cool part! f*king bad IPs :)
iptables -A DENY_IP -s 200.200.200.1 -j DENY_IP_RULE
iptables -A DENY_IP -s 200.200.200.2 -j DENY_IP_RULE
...
Changing the firewall requires root privileges, which of course opens security issues, so you need to write a custom tool owned by UID/GID 0 with the setuid bit set (chmod 4700) to ban the IPs and expire older IPs. But a superuser tool has to be written VERY carefully; you don't want somebody gaining extra privileges by exploiting your app. :) If you like, I can help.
*cough* this is where it gets a little too dicey for me.
If you wanna ban IPs, what place is better than kernel/firewall? Maybe the router, but... :-)
Writing a tool like this isn't hard; it just takes practice. It seems to me that you like doing everything yourself, and so do I... so good work/fun :-)
( edit: Want to see this system in action? I can tell you my site; just run "wget -r" and your IP will be banned before it puts any load on my server. Very cool, heh :) ... )
[edited by: nsqlg at 3:12 pm (utc) on Nov. 26, 2005]
they [SEs] were the first ones to break that standard [robots.txt]
Maybe that standard doesn't work anymore for 2005 and needs to be updated. I would think the SEs of all people would have some good input into such an update.
Given the prominence of webmasterworld, part of me wonders if this isn't partly a stunt to provoke a reaction from the SEs. Something's got to give, and white-listing SE spider IPs sounds like a pretty good idea, while banning you for cloaking in this scenario sounds like a pretty dumb idea.
What firewall software would you suggest that has that easily manageable capability? Under RedHat enterprise.
I use APF on RHEL...
[rfxnetworks.com...]
there are others.
APF is donationware, loads of people at ev1 use it, and in my experience it works well and can be extended easily. It has a whitelist capability for the good guys. BFD is an associated shell script that might also be helpful.
I feed apf with IPs from a script on one of my sites that is susceptible to rippers. More than x pages in a minute and they're blocked server wide.
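For anyone wanting to try the "more than x pages in a minute" trigger, here is a rough sketch of the counting side only. It is not the poster's script: the class name, the limit of 30 pages, and the idea of passing timestamps in by hand are all assumptions made for illustration, and handing the offending IP to the firewall is left out.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60  # "in a minute"
MAX_PAGES = 30       # the poster's "x"; this value is only a guess


class RateWatcher:
    """Tracks recent page views per IP in a sliding time window (hypothetical helper)."""

    def __init__(self, max_pages=MAX_PAGES, window=WINDOW_SECONDS):
        self.max_pages = max_pages
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent page views

    def record(self, ip, now):
        """Record one page view at time `now` (seconds).

        Returns True when the IP has made more than max_pages requests
        inside the window, i.e. when it should be handed to the firewall.
        """
        recent = self.hits[ip]
        recent.append(now)
        # Forget page views that have slid out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        return len(recent) > self.max_pages
```

The page script would call record() with the visitor's IP and the current time on each request, and pass the IP on to the firewall whenever it returns True.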
It's dead easy to use a honeypot url and feed the IPs to apf. Use robots.txt to block google et al from /honeydir/honeypage.html and you will have the rogue bots bang to rights without lifting a finger. Have a script auto-generate a new honeypot url and update robots.txt to keep bad bots guessing.
I use apf with IPs from a geo IP database to ban port 25 from various countries. Works brilliantly for email spam reduction, and took a big load off scanning for spam.
Have a script release the IPs after a period (unless from the honeypot). A short period is OK because the next time they hit the trigger they get blocked right off with little o/head to you. You can increment the block period if you have the script check a db of last block per IP. 3 strikes and you're out.
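The release-and-escalate idea above could be sketched roughly like this. Everything specific here is an assumption: the ten-minute base period, doubling on each repeat offence, and the in-memory dicts standing in for the "db of last block per ip". Honeypot hits and third strikes are treated as permanent, as described.

```python
BASE_BLOCK_SECONDS = 600  # assumed first block period (10 minutes)
MAX_STRIKES = 3           # "3 strikes and you're out"


class BlockList:
    """In-memory stand-in for the per-IP block database (hypothetical)."""

    def __init__(self):
        self.strikes = {}  # ip -> number of times it has been blocked
        self.until = {}    # ip -> expiry timestamp, or None for a permanent block

    def block(self, ip, now, honeypot=False):
        """Block an IP and return its expiry time (None means permanent)."""
        count = self.strikes.get(ip, 0) + 1
        self.strikes[ip] = count
        if honeypot or count >= MAX_STRIKES:
            self.until[ip] = None
        else:
            # Double the block period on each repeat offence.
            self.until[ip] = now + BASE_BLOCK_SECONDS * 2 ** (count - 1)
        return self.until[ip]

    def is_blocked(self, ip, now):
        if ip not in self.until:
            return False
        expiry = self.until[ip]
        return expiry is None or now < expiry

    def release_expired(self, now):
        """Drop timed blocks that have run out; return the released IPs."""
        expired = [ip for ip, t in self.until.items() if t is not None and t <= now]
        for ip in expired:
            del self.until[ip]
        return expired
```

A cron job would call release_expired() and remove the returned IPs from the firewall; the strike counts stay behind so the next offence gets a longer block.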
You can have a script scan your logs for suspicious behaviour falling in between 'normal' but below the 'block' threshold and email you a report so you can manually add selected ips to apf from an admin page or the c/l. (BFD will do this)
htaccess and httpd.conf are too slow if the list is long. My firewall block lists get to be a few k IPs and I can't see the difference in server load or script time if I turn the firewall off, so it's not a big issue - but I don't have your numbers of visitors :-(
I've had this stuff running on my server, but I'm not any kind of expert and for sure there are smart people here that can sort it for you or maybe talk to Ryan who wrote APF
Whatever, please... there *has* to be a better solution than banning google. So here's to you finding it cos I need the search function back, and despite some of the gungho optimism I dont really believe WebmasterWorld can live without g.
good luck
l.
Also, the login is not case sensitive (even though it says it is) - perhaps you may want to fix that too?
rogue bots are the #1 issue we face
If the Search engines want to maintain their new business dominance they had better come up with an accepted method of cloaking so they can spider what must be made un-spiderable.
I would so love to hear GoogleGuy's take on this. This is not just about WebmasterWorld, it is about a shadow growing over the internet (sorry, sounds a bit middle earth)... The legion of rogue bots IS increasing, they WILL get smarter, webmasters MUST take precautions. Brett (as ever) is simply in the lead, many others will follow. Like I said, this is bigger than WebmasterWorld; the search engines had better chime in or lose out.
Search engines, keep quiet at your peril. Brett will not lose much; search engines could lose everything.
Ya - which is nothing for them. How often do you require login? More than once a month and people get real itchy.
> How come that isn't "cloaking"?
It is - and they have agreements with Google to allow that. Google will not make the same agreement with us - yes, we have asked. Which is the right thing in the end, because there is no reason we should get (or even ask for) special treatment that our own webmaster visitors don't get.
I would think the answer is straightforward.
1) Create a clean list - known good bots (by IP address).
2) Create a dynamic bot detection system that bans bots by IP address (for 24 hours).
Part 2) of this solution may be tricky but it is far from impossible.
Kaled.
1) Use compression, its downside would seem to be the lesser of two evils at the moment.
2) Using ebay, have an auction where the winner gets to host webmasterworld, they will get their hosting logo on every page of webmasterworld and they supply all the bandwidth for free. Plus you get the money from the auction.
3) Tell the three major SE's that whichever one mirrors your site (for free) gets the exclusive right to crawl it. Imagine MSN using that against Google and Yahoo, "MSN, the only search engine on earth that can be used to search WebmasterWorld".
4) As someone suggested before, setup a p2p solution where the pages I have most recently visited could in turn be downloaded by other users on the p2p network.
5) Start a petition or group to lobby governments to do something about this problem that all webmasters potentially face. If you end up fixing this problem by throwing more money at it, then it only saves you and leaves the rest of us just as defenseless as you are now.
I started to make a robots.txt to do just that, and after about the 40th exception line - I threw in the towel. Given our site structure - there is no way to do it.
Well, that's certainly something to look at with the design of BestBBS. It would seem easy enough to move the page 'out of the box', or at least a copy of the page that could be updated as necessary.
There is a reasonable need to have a page listed. Word of mouth is a valuable tool. I gave someone your URL last night over the phone. Now maybe he wrote it down, and found your site this morning. Maybe he didn't, or maybe he wrote it wrong. He would then still be trying to search for you this morning.
As I said, this may be tricky, but it is doable and not overly messy.
Kaled.
Spoofing an IP address for crawling purposes is pretty much impossible, so any request coming from one of Google's IPs is pretty much guaranteed to be from Google, and checking for a few ranges of IPs entails only a few lines of code.
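To give an idea of the "few lines of code", here is a sketch using Python's standard ipaddress module. The two networks are placeholder documentation ranges, not real crawler ranges; the actual ranges would have to come from the engines themselves, and the function name is made up for the example.

```python
import ipaddress

# Placeholder networks (reserved documentation ranges), NOT real crawler ranges.
ALLOWED_CRAWLER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed_crawler(remote_addr):
    """Return True if the connecting address falls inside an allowed range."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # not a parseable IP address
    return any(addr in net for net in ALLOWED_CRAWLER_NETWORKS)
```

The check has to be made against the address of the TCP connection itself, not against anything the client sends in a header.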
lets hear em.
Requiring cookies is a good way. Checking referral information and user agent info is not fool-proof but can go a long way while adding hardly any overhead. My point is that no matter what you choose to use, an exception could be made for Google's IP ranges, which are nearly impossible to spoof and also add very little overhead.
What if you made it so anyone could view the very first page they come to on WW?
Yet, if you tried to access a second page that would require a login.
This way you could check the referrer string and IP ranges to see if it was Googlebot or Slurp, and let them continue through the pages without requiring the login.
Since anyone can then view any single page of the WW site, that then would not be considered cloaking.
This would then allow only the search engines you wanted to make it past the login and therefore keeping WW in the Google and Yahoo search results.
This way you could check the referrer string and IP ranges to see if it was Googlebot or Slurp, and let them continue through the pages without requiring the login.
Through all the noise, what is easy to overlook is the root of this problem.
Slurp knocked us offline twice because of aggressive spidering
It's not about letting bots in or out... well, ok, it is about that, but their aggressive behaviour and refusal to obey robots.txt is the problem here. Any time you start checking referrals you have opened the door once again to unruly, disobedient web bots.
I think scaling is one reason that so many of us see this as a problem with a simple solution. Our sites don't have the massive bandwidth issues.
The only real solution is to add a site search.
One Pixel Links and/or Link Poisoning: - we throw out random 1 pixel links or no text hrefs and see who takes the link. Only the bots should take the link.
I am not sure how this is implemented, but have you tried this:
You allow all bots, good and bad, to follow these links in robots.txt but instead of banning them for following a hidden link, let them follow a certain number of hidden links before banning. For example, you ban the IP for 24 hours after it has followed 50 of these hidden links. You would end up banning google and yahoo but at least they would still get the opportunity to follow a few hundred links a day.
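A rough sketch of that counting scheme, with the limit of 50 and the 24-hour ban taken from the suggestion above. The class and method names are invented for the example, and the state lives in plain dicts rather than whatever store the site would really use.

```python
HIDDEN_LINK_LIMIT = 50        # hidden links allowed before a ban
BAN_SECONDS = 24 * 60 * 60    # ban length: 24 hours


class HiddenLinkTrap:
    """Counts hidden-link fetches per IP and bans after a limit (hypothetical)."""

    def __init__(self, limit=HIDDEN_LINK_LIMIT, ban_seconds=BAN_SECONDS):
        self.limit = limit
        self.ban_seconds = ban_seconds
        self.counts = {}        # ip -> hidden links followed since the last ban
        self.banned_until = {}  # ip -> timestamp when the ban ends

    def is_banned(self, ip, now):
        until = self.banned_until.get(ip)
        if until is None:
            return False
        if now >= until:
            del self.banned_until[ip]  # ban has run out
            return False
        return True

    def hit(self, ip, now):
        """Call when an IP requests a hidden link; True means it is banned."""
        if self.is_banned(ip, now):
            return True
        self.counts[ip] = self.counts.get(ip, 0) + 1
        if self.counts[ip] >= self.limit:
            self.counts[ip] = 0
            self.banned_until[ip] = now + self.ban_seconds
            return True
        return False
```

Every page request would check is_banned() first, and the handler behind the hidden links would call hit().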
I have suggested cloaking a couple times in this thread..
However, dataguy brings up a valid point here... you don't have to cloak, but why not use the things a cloaker may use, like Google's IP addresses, to serve the content only to Google?
This way you wouldn't be doing deceptive cloaking, and it shouldn't cause a problem.
I mean come on brett..Google must acknowledge that this is a serious issue that has depreciated the value of their results.
Further it is an issue that will only increase over the years to come!...can we all agree on that?
Now if we can agree on that then wouldnt it be safe to say that this can not POSSIBLY be the solution for the future? Is this not the best forum on websites on the internet?
How long will we not have a solution and thousands of new users NOT find the best information resource?
Not to mention finding a solution for others to utilize and benefit from! That's what this is all about!
-----
I just cant really stand the whole this is bad and this is good with cloaking...
It is not bad to restrict the bots that crawl only to ones that come from a Google, MSN or Yahoo IP address!
Pointblank Bottomline!
I use a multi-level system to lock out bots. I have root access on a dedicated Linux box, and do all my programs in 'C'. I would not recommend anything slower than compiled 'C' for the log analysis I describe below.
First I use a very simple bot trap that catches a lot of rogue bots. It's a one-pixel transparent GIF that has a link to a cgi-bin trap behind it. No human can find this with a mouse unless they know about it and even then it's difficult to find with a mouse. Since all bots are disallowed from cgi-bin, I know that if anyone follows this link, they should be immediately zapped in htaccess. I have this trap on so-called "doorway" pages that lead to thousands of deep links.
Second, I have my ssh on a non-standard port. There are a lot of dictionary attacks from scripts that assume ssh is on port 22. These can be very load-intensive. Just change your ssh configuration to some other port and the problem is solved.
Third, I don't use sendmail. I have a different program that can send email through authenticated SMTP, but I don't receive any mail on the server. I did this when I had a zombie problem a year ago. Zombies were using one of my domains for the return address on spam, and I was getting thousands of bounces per day from all over the world. For receiving mail I use other providers, and other accounts. Many of these do an okay job of filtering spam for me. Many of these same providers also allow authenticated SMTP from anywhere.
Next I run a cron program every minute to analyze the one-minute load against a threshold that I've set, and also analyze the number of current connections by counting lines in /proc/net/tcp. This is very fast. If the threshold is exceeded, it trips the program in the next paragraph.
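The poster does this in C. Purely as an illustration of the same check, here is a Python sketch; the two threshold values and the function names are assumptions, and it counts every line in /proc/net/tcp after the header, as the post describes.

```python
import os

LOAD_THRESHOLD = 8.0   # assumed one-minute load limit
CONN_THRESHOLD = 500   # assumed connection-count limit


def count_tcp_connections(path="/proc/net/tcp"):
    """Count socket entries by counting lines, skipping the header line."""
    with open(path) as f:
        return max(sum(1 for _ in f) - 1, 0)


def should_trip(load, connections,
                load_threshold=LOAD_THRESHOLD, conn_threshold=CONN_THRESHOLD):
    """True when either reading is over its threshold."""
    return load > load_threshold or connections > conn_threshold


def check_once():
    """One cron-style check on a Linux box (reads live system state)."""
    load_1min = os.getloadavg()[0]
    return should_trip(load_1min, count_tcp_connections())
```

A once-a-minute cron entry would run check_once() and start the log analysis whenever it returns True.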
My log-analysis program runs about once an hour, or whenever tripped by the previous paragraph. This extracts the IP addresses in today's access_log and sorts them, and counts that day's total for each address. This is a lot faster than you might expect if you are using compiled 'C', but then if I had Brett's traffic I'd be rotating my access_log every hour or so, instead of every day. If the first threshold is exceeded for load and/or connections, I set an htaccess block so that they get a 403. If it keeps coming and exceeds threshold number two, I block it in the router table so that it doesn't even make it to Apache. I have a list of a dozen IP addresses from approved bots that serves as an exception list, so that they don't get blocked.
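Here is a small sketch of the per-IP counting step, again in Python rather than the poster's C. It assumes the client IP is the first field of each access_log line, reads the two thresholds as per-IP request counts (one possible reading of the description), and uses made-up threshold values and a placeholder exception list.

```python
from collections import Counter

HTACCESS_THRESHOLD = 2000          # assumed: over this, 403 via htaccess
ROUTER_THRESHOLD = 10000           # assumed: over this, block before Apache
APPROVED_BOT_IPS = {"192.0.2.10"}  # placeholder exception list


def count_requests_by_ip(log_lines):
    """Tally requests per client IP (first whitespace-separated field)."""
    counts = Counter()
    for line in log_lines:
        fields = line.split(None, 1)
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


def classify(counts, soft=HTACCESS_THRESHOLD, hard=ROUTER_THRESHOLD,
             approved=APPROVED_BOT_IPS):
    """Split offenders into (htaccess_blocks, router_blocks), skipping approved IPs."""
    htaccess_blocks, router_blocks = [], []
    for ip, total in counts.items():
        if ip in approved:
            continue
        if total > hard:
            router_blocks.append(ip)
        elif total > soft:
            htaccess_blocks.append(ip)
    return sorted(htaccess_blocks), sorted(router_blocks)
```

Usage would be something like opening today's access_log, passing the file object to count_requests_by_ip(), and feeding the two lists from classify() to whatever writes the htaccess and router blocks.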
I reboot about once a month, which cleans out the router blocks, and I clean out the htaccess once a day with a cron job.
I would dearly love to be able to do a complete disallow for all bots. I've seen dudes from .edu domains (they're sitting in a dorm room with their caps on backwards, I suspect), hit me as fast as 40 fetches per second. They usually get nailed by the cgi-bin bot trap, but even if they don't they're nailed a couple of minutes later.
Brett is doing the right thing. If you can afford to disallow all those bots, more power to you. There is nothing special about Google -- they are probably the major reason why there is so much spam on the web today.