<don't worry, be happy>
> combined with bandwidth-throttling,
I think it would only hurt members (probably me worst of all). Most of the bots are not aggressive at speed; they fall somewhere in the middle of regular member usage patterns. It is the constant number of them and the consistent spidering. Hence the reappearance of session ids and tagged pages on most pages of the site. This has helped weed out about another 30 bots that were smart enough to support cookies and be set up with a u/p login. The content is flowing back to unindexable...
> bad bot repellent script
And what would that look like? It wouldn't look like:
- page view throttling. (As mentioned, many members regularly hit 1-2k page views a day and can view 5-10 pages a minute at times.) However, throttling above that threshold is doable... or as a mod put it, 'doneable'.
- bandwidth throttling. Mod_throttle was tested on the old system and only clogs up the system for other visitors. It is also pretty processor intensive - it is very noticeable to everyone when you flip it on.
- agent name parsing - ya, it's laughable. Any bot can claim to be a browser.
- cookie requirements (eg: login). I think you would be surprised at the number of bots that support cookies and can be quickly setup with a login and password.
- ip banning - takes excessive hand monitoring (which is what we've been doing for 7 years). The major problem is that when you get 3-4k IPs in your htaccess, it tends to slow the whole system down, since that file gets rechecked on every request.
- intelligent combo of all of the above? yep, that is basically where we are at right now.
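To make the "above that threshold is doable" idea concrete, here is a minimal sketch of per-session page-view throttling. The thresholds and the session-keyed counters are my own assumptions for illustration, not the site's actual limits; a real deployment would also need to expire the daily counters.

```python
import time
from collections import defaultdict, deque

# Hypothetical limits - real thresholds would have to sit well above
# normal member usage (members can hit 5-10 pages a minute legitimately
# and 1-2k page views a day).
MAX_PER_MINUTE = 30
MAX_PER_DAY = 5000

_minute_hits = defaultdict(deque)   # session id -> recent request timestamps
_day_counts = defaultdict(int)      # session id -> page views today

def allow_request(session_id, now=None):
    """Return True if this session is under both thresholds."""
    now = now if now is not None else time.time()
    window = _minute_hits[session_id]
    # Drop timestamps older than 60 seconds from the sliding window.
    while window and now - window[0] > 60:
        window.popleft()
    if len(window) >= MAX_PER_MINUTE:
        return False
    if _day_counts[session_id] >= MAX_PER_DAY:
        return False
    window.append(now)
    _day_counts[session_id] += 1
    return True
```

The point of the two counters is that a human burst (a fast reader) passes the per-minute check, while a bot's constant, round-the-clock spidering eventually trips the daily one.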
> before having something else in place
Agreed. The speed with which we fell out of the index caught me by surprise - I was expecting 30 days. In the interim, a project came up that demanded attention this week. Oops.
In the meantime, we could use some help from members [webmasterworld.com]. Many of those are from people who found us via an engine...
> we better index it like crazy
The major problem is that all this content is dynamic, so getting "If-Modified-Since" support is difficult. We would have to move to non-parsed header scripts and generate the modification headers ourselves. That would slow down spidering enormously.
> can you use a program like Fluid Dynamics Search Engine?
I am a major fan of Zoltan's work over on XAV. What a script. As most know here, it was what we were using up to about 200k pages, when it slowly faded because it was so slow and such a system killer. Great program - I highly recommend it on smaller sites.
> immense bandwidth charges.
Currently, we have no bandwidth charges other than the base fees. I don't expect that to change any time soon.
> Ever thought of burning the whole site
> to DVD and selling it?
Yes, thought of it a lot. There are a lot of issues involved - many of which outweigh the benefits at this time.
> incentive for people to subscribe
Your word of mouth recommendation is the best reason there is for people to subscribe and support the site.
> Not sure how it got fixed, but it did.
They did several things:
- moved from dynamic to static content.
- moved from a mid-range box to 3 high-end boxes with round-robin dns.
- optimized their scripts/programs.
- were more aggressive in offering downloads of the db.
- banned scripts designed to raid them.
- started changing key bytes in the html, which crippled older scripts' ability to parse it - they would have to be rewritten.
Lastly, there were so many sites running copies of the ODP that Google got aggressive at PR0'ing them. That decreased the value of an ODP clone site to about zero.
> I hope you know what you are doing Brett,
hehe. ya right. Life is best when it is lived on a Wing-n-Prayer baby!
> I think we should trust his judgement on this one.
Oh, I'm not saying it was a perfect decision. Oilman mentioned we'd been considering requiring cookies for ages, but this was not long preplanned. It was an emotional reaction to all the bot attacks, scraper sites, blog leeches, open Chinese proxy sites, and hyper-aggressive big search engine crawlers. It was a throw-your-hands-in-the-air, I've-had-it moment. The site is here for the members to be involved and engaged in - robots do not make an online industry community - people do.
> announced his decision to kill them on
> this site just as that month finished.
Hey, I remember Nostradamus mentioning something like this. ;-) (I think this is how rumors start. lol!)
> tell us the real reason
I give up trying to disabuse people of any notion to the contrary. I've laid it all out for you here. It doesn't matter what I say, Mick - there will be those that think just the opposite. I've said since day one that rogue bots are the #1 issue we face as a community site. I don't talk about it too much in public because it is hard to discuss when the very problem is reading the page. It is just like Google talking about spam issues too much: once they do, someone will design a system to game that feature within a few hours. After we put up the required login here, I found 10 new bots that had been given cookie support and were crawling the site as logged-in registered members.
> put up more ads.
Sorry, not at this time. We like it direct-advertising-free as it is now. I don't discount the possibility, but we have no plans at present to put advertising on the site.
Yes, we do give exhibitors and sponsors of PubCon page views here, and that is in their agreement with us. We do that to promote the conference and let people know who is going to be at the show. It supports us and the members MORE than it does the exhibitors.
> then how do the big sites deal with it?
They have multiple servers, most of which are static and don't require constant syncing. Other big sites - auction houses, email sites, and other massively dynamic sites - have custom software that keeps their servers in sync. Basically what they have is "read from any server, but write to ALL servers" code. That code is about half done here. Yes, we will be investing in infrastructure and fleshing out that setup in the next year. Gasp - moving to a full real db (SQL) is in our not-too-distant future.
> doesn't need the same 20 questions asked again and again every single day
And a search function does not always do a lot to help that. The content here is all so similar that, on many issues, the major search engines were of little help. How many duplicate "duplicate content" or "supplemental results" questions has the Google forum seen? A search engine didn't help there.
> I just hope that his new search system
> doesn't cause the same issue for him.
It will for a while. There will be problems with it at the start. Moving the search engine to a new server is the answer, but that will take a few weeks.
So, patience please...
In the meantime, we could use some help from members [webmasterworld.com].
</don't worry, be happy>
...wow those are pretty black helicopters outside.