| This 246 message thread spans 9 pages |
|Attack of the Robots, Spiders, Crawlers.etc|
Seems like there is a great deal of interest in the topic, so I thought I would post a synopsis of everything thus far. Continued from:
WebmasterWorld is one of the largest sites on the web with easily crawlable flat or static html content. All of the robotic download programs (aka: site rippers) available on Tucows can download our entire 1m+ pages of the site. Those same bots can not download the majority of other content rich sites (such as forums or auction sites) on the web. This makes WebmasterWorld one of the most attractive targets on the web for site ripping. The problem has grown to critical proportions over the years.
Therefore, WebmasterWorld has taken the strong proactive action of requiring login cookies for all visitors.
It is the biggest problem I have ever faced as a webmaster. It threatens the existence of the site if left unchecked.
The more advanced tech people understand some of the tech issues involved here, but given the nature of the current feedback, there needs to be a better explanation of the severity of the problem. So let's start with a review of the situation and the steps we have taken that led us to the required login action.
It is not a question of how fast the site rippers pull pages, but rather the totality of all of them combined. About 75 bots hit us for almost 12 million page views two weeks ago from about 300 IPs (most were proxy servers). It took the better part of an entire day to track down all the IPs and get them banned. I am sure the majority of them will be back to abuse the system. This has been a regular occurrence that has been growing every month we have been here. They were so easy to spot in the beginning, but it has grown and grown until I can't do anything about it. I have asked and asked the engines about the problem - looking for a tech solution, but up until this week, not a one would even acknowledge the problem to us.
The action here was not the banning of bots - that was an outgrowth of the required login step. By requiring login, we require cookies. That alone will stop about 95% of the bots. The other 5% are the hard core bots whose operators will manually login and then cut-n-paste the cookie over to the bot, or hand walk the bot through the login page.
How big of an issue was it on an IP level? 4000 IPs banned in the htaccess since June, when I cleared it out. If unchecked, I'd guess we would be somewhere around 10-20 million page views a day from the bots. I was spending 5-8 hrs a week (about an hour a day) fighting them.
We have been doing everything you can think of from a tech standpoint. This is a part of that ongoing process. We have pushed the limits of page delivery, banning, ip based, agent based, and down right agent cloaking a good portion of the site to avoid the rogue bots. Now that we have taken this action, the total amount of work, effort, and code that has gone into this problem is amazing.
Even after doing all this, there are still a dozen or more bots going at the forums. The only thing that has slowed them down is moving back to session ids and random page urls (eg: we have made the site uncrawlable again). Among the worst offenders were the monitoring services. At least 100 of those IPs are still banned today. All they do is try to crawl your entire site to look for trademarked keywords for their clients.
> how fast we fell out of the index.
Google yes - I was taken aback by that. I totally overlooked the automatic url removal system. *kick can - shrug - drat* can't think of everything. Not the first mistake we've made - won't be the last.
It has been over 180 days since we blocked GigaBlast, 120 days since we blocked Jeeves, over 90 days since we blocked MSN, and almost 60 days since we blocked Slurp. As of last Tuesday we were still listed in all but Teoma. MSN was fairly quick, but still listed the urls without a snippet.
The spidering is controllable by crawl-delay, but I will not use nonstandard robots.txt syntax - it defeats the purpose and makes a mockery of the robots.txt standard. If we start down that road, then we are accepting that problem as real. The engines should be held accountable for changing a perfectly good and accepted internet standard. If they want to change it in whole, then let's get together and rewrite the thing with all parties involved - not just those little bits that suit the engines' purposes. Without webmaster voices in the process, playing with the robots.txt standards is as unethical as Netscape and Microsoft playing fast and loose with HTML standards.
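For readers who have not met the directive: Crawl-delay is an extension line that some crawlers honor, and it is not part of the original robots.txt standard, which is the objection raised above. The following is only a rough sketch of how a cooperating crawler could read it with Python's standard urllib.robotparser. The robots.txt content, the bot name ExampleBot, and the 10 second value are made up for illustration and are not WebmasterWorld's actual settings.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, not WebmasterWorld's actual file.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def read_rules(robots_txt):
    """Parse robots.txt text and return the parser holding its rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = read_rules(ROBOTS_TXT)

# Seconds the crawler is asked to wait between requests (None if not set).
delay = rules.crawl_delay("ExampleBot")

# Whether the crawler is permitted to fetch a given path at all.
allowed = rules.can_fetch("ExampleBot", "/private/page.htm")
```

Note that this only helps with bots that choose to read and obey the file. A site ripper that ignores robots.txt is unaffected.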
Steps we have taken:
- Page View Throttling: (many members regularly hit 1-2k page views a day and can view 5-10 pages a minute at times). Thus, there is absolutely no way to determine if it is a bot or a human at the keyboard for the majority of bots. Surprisingly, most bots DO try to be system friendly and only pull at a slow rate. However, if there are 100 running at any given moment, that is 100 times the necessary load. Again, this site is totally crawlable by even the barest of bots you can download. eg: making our site this crawlable to get indexed by search engines has left us vulnerable to every off-the-shelf bot out there.
- Bandwidth Throttling: Mod_throttle was tested on the old system. It tends to clog up the system for other visitors - it is processor intensive, and it is very noticeable to all when you flip it on. Bandwidth is not much of an issue here - it is system load that is.
- Agent Name Parsing: Bad bots don't use anything but a real agent name. Some sites require browser agent names to work.
- Cookie requirements: (eg: login). I think you would be surprised at the number of bots that support cookies and can be quickly set up with a login and password. Their operators hand walk the bot through the login, or cut-n-paste the cookie to the bot.
- IP Banning: - takes excessive hand monitoring (which is what we've been doing for years). The major problem is when you get 3000-4000 ips in your htaccess, it tends to slow the whole system down. What happens when you ban a proxy server that feeds an entire ISP?
- One Pixel Links and/or Link Poisoning: - we throw out random 1 pixel links or no text hrefs and see who takes the link. Only the bots should take the link. It is difficult to do, because you have to essentially cloak for the engines and let them pass (It is very easy to make a mistake - which we have done even recently when we moved to the new server).
- Cloaking and/or Site Obfuscation: that makes the site uncrawlable only to the non search engine bots. It is pure cloaking or agent cloaking, and it goes against se guidelines.
An intelligent combo of all of the above is where we are at right now today.
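To make the one pixel link idea from the list above concrete, here is a minimal sketch. It is not the code running on WebmasterWorld. The class name LinkTrap, the /t/ path prefix, and the allowlist of crawler IPs are all assumptions made for illustration.

```python
import secrets


class LinkTrap:
    """Hand out hidden trap links and flag any client that requests one."""

    def __init__(self, allowlist=()):
        self.trap_paths = set()          # URLs no human reader should ever click
        self.flagged = set()             # client IPs that followed a trap link
        self.allowlist = set(allowlist)  # crawlers we choose to let pass

    def new_trap_link(self):
        """Return HTML for a link with no anchor text, invisible to readers."""
        # A random path, so a bot cannot learn and skip one fixed trap URL.
        path = "/t/" + secrets.token_hex(8)
        self.trap_paths.add(path)
        return f'<a href="{path}"></a>'

    def record_request(self, ip, path):
        """Return True if this request marks the client as a suspected bot."""
        if path in self.trap_paths and ip not in self.allowlist:
            self.flagged.add(ip)
            return True
        return False
```

The allowlist is the delicate part described in the list: letting known search engine crawlers through means treating them differently from other visitors, and a wrong entry either bans an engine or lets a rogue bot pass.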
The biggest issue is that I couldn't take any of the above steps and require login for users without blocking the big se crawlers too. I don't believe the engines would have allowed us to cloak the site entirely - I wouldn't bother even asking if we could or tempting fate. We have cloaked a lot of stuff around here from various bots - our own site search engine bot - etc. - it is so hard to keep it all managed right. It is really frustrating that we have to take care of all this *&*$% for just the bots.
That is the main point I wanted you to know - this wasn't some strange action at banning search engines. Trust me, no one is more upset about that part than I am - I adore se traffic. This is about saving a community site for the people - not the bots. I don't know how long we will leave things like this. Obviously, the shorter the better.
The heart of the problem? WebmasterWorld is too easy to crawl. If you move to a htaccess mod rewrite setup that hides your cgi parameters behind static-looking urls, you will see a big increase in spidering that will grow continually. Ten times as many off-the-shelf site rippers support static urls as support cgi based urls.
Rogue Bot Resources:
Anyway, I was thinking if it would be possible to set up a mirror site where posting wasn't allowed and redirecting all known and detected robots over there, would that work? Let the bots chew on a slow mirrored site and keep the real site relatively bot free. When a user logs into the mirror site, have them redirected over to the real site. At least Google will have something to crawl.
I was also thinking one of those Google search appliances would be a good replacement for the site search, but looking at it and the number of pages here it would cost a fortune. Brett, know anybody over there that would give you a good discount on that? Maybe as part of some beta test or something?
[edited by: Brett_Tabke at 9:51 pm (utc) on Nov. 27, 2005]
You bet - one of the most interesting I can remember (because it is live, and totally relevant to my site - that just has to also echo out to others).
|Seems like there is a great deal of interest in the topic |
That is the human element that gets us all coming back here, Brett. Give me human over spiders every time.
|It was a, throw hands in the air I've had it moment |
Another echo of a problem we all face [webmasterworld.com] - the link is the way I've managed to fix it on my site (and how I found WebmasterWorld in the first place), and it is so useful to know the problems faced here, and how you are managing it.
|rogue bots are the #1 issue we face |
On the issues of Search - since G is so good at this, do they have a solution that could be installed on your server? Just a thought.
Great minds, huh?
|I was also thinking one of those Google search appliances... |
[edited by: AlexK at 8:16 pm (utc) on Nov. 25, 2005]
> So if I had a copy on my cpu, I could then use Google desktop to search it?
I suspect this comment is made with a mouthful of facetious irony, but the answer is certainly "yes."
In fact, this is what I do with a few news sites from a certain Third World country. Each news site individually is terribly biased (pro/anti-government), but by aggregating the stories into a local searchable database, I can get a more clear, "bigger picture."
What I don't get is, why did you block Google?
Can't you just block all bots, except google?
> Can't you just block all bots, except google?
I believe the explanation is that the site requires a login now so it would require cloaking to allow Googlebot in - and the cloaking would lead to a ban anyway.
> The major problem is when you get 3000-4000 ips in your htaccess
Also, why are these in htaccess not in the server config (httpd.conf) directly? That's much faster.
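Separately from where the list lives, a ban list of that size can sometimes be shortened when many of the banned addresses sit next to each other. The sketch below uses Python's standard ipaddress module to merge individual addresses into the fewest CIDR blocks. The function names and the sample addresses (taken from documentation ranges) are illustrative assumptions, not anything from the site's real list.

```python
import ipaddress


def collapse_bans(ips):
    """Merge individual banned IPv4 addresses into the fewest CIDR blocks."""
    # Each single address becomes a /32 network; adjacent ones are then merged.
    nets = [ipaddress.ip_network(ip) for ip in ips]
    return list(ipaddress.collapse_addresses(nets))


def is_banned(ip, networks):
    """Check one client address against the collapsed ban list."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


banned = collapse_bans([
    "203.0.113.0", "203.0.113.1", "203.0.113.2", "203.0.113.3",
    "198.51.100.9",
])
```

Merging like this only combines addresses that exactly fill a block, so it never bans an address that was not already on the list. How much it shrinks a real list depends entirely on how clustered the offenders are.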
Would it really be cloaking? Cloaking content and feeding a SE different content because it is a SE is bad. But all it is really doing is giving the SE exactly what the human gets after they log in.
So instead of a full blown cloaking script, you would just need to edit the login script to not require logins from certain IPs. Can that truly be considered cloaking?
Just a thought.
This Sucks! No site search on any engine AS WELL as no search box on webmasterworld.com
This doesn't seem right at all...
Why not go purely cloak and only serve pages to known IPs? Because cloaking is bad?
I don't get it, why be scared of "purely cloaking" if you're willing to give up all search engine traffic anyway?
After being a member with many names over the past seven years I have to say that this is unbelievable.
There has been so much focus on search on this site over the years I cannot believe two things:
--The action taken to remedy the problem you are having.
--There is no search box on ww.com?
For the first time I cannot find what I am looking for and I am going to be forced to get the information elsewhere.
Figure it out man!
|I believe the explanation is that the site requires a login now so it would require cloaking to allow Googlebot in - and the cloaking would lead to a ban anyway. |
If that's the case, why not enable guests to browse the forums, just not respond? Wouldn't this solve the problem?
If you required users to have cookies, but didn't require them to login. Could you require everyone but google to have a cookie to view the site?
This wouldn't be cloaking because as long as the user's browser supported cookies they could see the site, but you would allow google to also see the site without cookies. This would stop most non-authorized spiders but also allow the search engines in.
Seems requiring the ability to set a cookie would accomplish the same thing as requiring a login.
What about requiring unknown IP addresses who are not logged in (or even those with < xx posts) to enter a human verifier/captcha after viewing 10 or 20 pages? I would imagine a lot of the SE traffic is looking for answers to specific problems, so this probably would not even affect too many people.
The human verifier page could have something saying "register so you don't have to do this anymore."
|(most were proxy servers) |
and reading betwixt the lines..
Isn't it nice to know your "friends" at least put on the carnival masks before attempting the burglary :)
warm and cosy feeling ..
actually..given the expression on the "kittens"( where'd they go?..404's aint what they used to be ;) I am slightly surprised that you didn't smack a few bots back into their boxes and break a few spider legs and lairs whilst you were at it ..
showed restraint ..
whilst i think of it and there is some agreement behind scenes in stickies ..particularly with reletion to "update" threads etc .."post" POST MODERATION mightn't be a bad idea either ..even if there lies the path to censorship ..cut out an awfull lot of the "can't be bothered to read the TOS" or "I cant read 500 posts" etc ..WHY NOT!
[edited by: Leosghost at 9:53 pm (utc) on Nov. 25, 2005]
> Figure it out man!
Funny, but useless.
"Where's my waitress?"
Holy cow... the backlash begins.
Alexa's trend for WW since the Google listings were sliced shows how important good Google rankings are currently.
'scuse the "speeling" in the foregoing..I was worried about running out of "edit window"..
Point is ..if as a side effect of the current config the quality of the threads gets back to what it was ..( endless cut and pastes of what "Cutts" / "gg" said are really not good for this place ..after all it does say professional in the description ..and whilst all of us were "noobs" somewhere ..at some time ..of late there is a tendancy of many recently to post just to see their own posts ..and of some real new visitors to think that some of them ( the multi-cut and pasters and DC watchers )actually know what they are talking about ..
If one wants slash dot type "Warhol fame" posts ..this site is not what one opens ..
mini rant ended..needed saying ..semi on topic IMO
Alexa ..rotfalol ..only for drive by black hat planning are alexa relevant ..
You can use those graphical 'words' that are warped and angled for login. Having four letters shouldn't be a big deal for us users. That will stop all bots. Then make special access for Google bots and maybe Yahoo and MSN.
As you have no search facility for the forum software that you use, I spent about 10 minutes today trying the roundabout way to search for something on Webmaster World through Google and I thought Google was broken. Now I know what happened....
Can you use something like VBulletin and limit Search to Supporters Only?
"rogue bots are the #1 issue we face"
I'm seeing it too on a much smaller sized message board. I've seen some over in the phpbb forums wondering "why all these large increases in visitors online?" Sometimes ten fold of the actual registered visitors on much smaller boards. It has been posted in their forums but nobody had an answer for it.
How else can it be explained? Man I really feel for Brett because I've been dealing with what almost looks to be the same problem. I was on a shared server and they shut it down twice. I've recently upgraded but the strange IP's are still hammering away.
But then again, I'm kinda lost with a lot of this stuff
I don't understand. Writing a bot that accepts cookies is a piece of cake, so to speak. So it's the registration/login that's used to separate humans from bots. But if I were a new visitor coming from G, I couldn't be bothered to register just to see some contents I have never seen before and don't know the quality of. I'd just move on to the next result. The most I could be bothered with is a captcha but even that is going to drive first-timers away.
My two cents: Drop any cookie requirements for the initial visit but present a captcha check once the hit rate coming from a particular IP exceeds a certain value. Ask again every 5 minutes for as long as the hit rate exceeds that limit. If the captcha check fails, blacklist the IP. If the captcha check succeeds, use a session cookie to identify the user agent. If the session cookie is rejected by the UA, do another captcha check along with a message saying that you require cookies.
I don't know what the hit rate limit should be exactly, but it could be that it needs to be so low that busy members might trigger it incidentally. To deal with that, members other than "New Users" should be white listed.
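The proposal in the last two paragraphs can be sketched roughly as follows. This is an illustration only: the class name HitRateGate, the default limits, and the member names are made up, and the session cookie step is left out to keep it short.

```python
import time
from collections import defaultdict, deque


class HitRateGate:
    """Decide per request whether to allow, ask for a captcha, or deny."""

    def __init__(self, max_hits=30, window=60.0, recheck=300.0, whitelist=()):
        self.max_hits = max_hits          # hits allowed per window (assumed value)
        self.window = window              # window length in seconds
        self.recheck = recheck            # ask again every 5 minutes
        self.whitelist = set(whitelist)   # members other than "New Users"
        self.hits = defaultdict(deque)    # ip -> timestamps of recent hits
        self.verified_until = {}          # ip -> time a passed captcha expires
        self.blacklist = set()            # ips that failed a captcha

    def check(self, ip, member=None, now=None):
        """Return 'allow', 'captcha', or 'deny' for one page request."""
        now = time.monotonic() if now is None else now
        if ip in self.blacklist:
            return "deny"
        if member in self.whitelist:
            return "allow"
        recent = self.hits[ip]
        recent.append(now)
        # Forget hits that have fallen out of the sliding window.
        while recent and recent[0] <= now - self.window:
            recent.popleft()
        if len(recent) <= self.max_hits:
            return "allow"
        # Over the limit: allow only while a recent captcha pass is still valid.
        if self.verified_until.get(ip, float("-inf")) > now:
            return "allow"
        return "captcha"

    def captcha_result(self, ip, passed, now=None):
        """Record the outcome of a captcha shown to this ip."""
        now = time.monotonic() if now is None else now
        if passed:
            self.verified_until[ip] = now + self.recheck
        else:
            self.blacklist.add(ip)
```

As noted above, the hard part is choosing max_hits: set it low enough to catch slow bots and busy members trip it too, which is why the whitelist check comes before any counting.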
|But if I were a new visitor coming from G, I couldn't be bothered to register just to see some contents I have never seen before and don't know the quality of. |
Fair enough, but that could help weed out the useless "me too" posts and raise the quality bar.
[edited by: engine at 11:31 pm (utc) on Nov. 25, 2005]
this issue sounds more like a DDOS attack:
- first, contact the FBI (you'll need to explain to them that there are damages >$50k)
- There is (very) expensive router/firewall hardware / very smart software out there to detect this kind of "unnatural" behaviour.
- The IPs doing this are either hacked servers (OK to ban them right away) or trojans on enduser XPs etc. (dangerous to ban them as you might ban an AOL proxy completely)
This is what does not make sense....
And Brett if you could...please answer this one!
Why do you not opt to cloak the site?
If the answer is because it is "bad"....explain why!
If the reason you don't want to cloak is because you are afraid of search engines frowning down on you, then I am COMPLETELY LOST..I mean so here you are not cloaking but who really cares if you're not even in the search engines..
<joking>But yeah at least you're not cloaking</joking>
I have read through many suggestions...some may work some may not!
You say you tried everything but have you tried to cloak the whole site?
> If you required users to have cookies,
> but didn't require them to login. Could
> you require everyone but google to have
> a cookie to view the site?
That exact question has been asked of the se's for years, and they say universally, that would be cloaking and against the major se guidelines and you would be subject to removal.
|If the reason you don't want to cloak is because you are afraid of search engines frowning down on you, then I am COMPLETELY LOST..I mean so here you are not cloaking but who really cares if you're not even in the search engines. |
Not my place to answer this, but it seems obvious to me - this state of affairs might not be permanent. Why poison the domain? We've all seen tales of woe from people here who have domains that are so poisoned that they're beyond redemption/recovery.
And for those folks who want the wavy letter captcha things - man, most of the time when I run into those, I can't tell what the heck they're supposed to be spelling. Please don't put me through that.
And great intro to part two of this thread, Brett. It really puts things in perspective.
Added: He's playing by the rules, man - rather hard to find fault with that, isn't it?
[edited by: Stefan at 11:51 pm (utc) on Nov. 25, 2005]
>>use Google desktop to search it?
If you have Google Desktop installed, at least you can use it to find the posts that you have read, and want to locate again.
>>subject to removal
and here we are...
sorry couldn't resist :)
|I am going to be forced to get the information elsewhere |
If it even exists elsewhere...
Not having WW on Google is like driving a car on a dark foggy night.
Brett -- this is heartbreaking. I feel like my puppy died. Is there any way we can help?
- donations for new server equip
- custom software
- distributed computing resources (Akamai or SETI@Home style)
This is going to sound naive, but I'd like to think we could 'open source' a solution given that we have a lot of talented webmasters, distributed servers, the will, etc.
To the issue with syncing, maybe WW's own server(s) (How many exactly are there Brett?) would host fresh content, and a distributed network of member servers (I'll donate some cycles) form an Akamai-style distributed WW. Maybe P2P syncing could reduce the load on WW to update the mirrors? And since only 'changes' would need to be synced, it would be limited to new posts.