
Forum Library, Charter, Moderators: incrediBILL & lawman

Foo Forum

This 246 message thread spans 9 pages: < < 246 ( 1 2 3 [4] 5 6 7 8 9 > >     
Attack of the Robots, Spiders, Crawlers, etc.
part two

 7:55 pm on Nov 25, 2005 (gmt 0)

Seems like there is a great deal of interest in the topic, so I thought I would post a synopsis of everything thus far. Continued from:


WebmasterWorld is one of the largest sites on the web with easily crawlable flat or static HTML content. All of the robotic download programs (aka site rippers) available on Tucows can download our entire 1m+ page site. Those same bots cannot download the majority of other content-rich sites (such as forums or auction sites) on the web. This makes WebmasterWorld one of the most attractive targets on the web for site ripping. The problem has grown to critical proportions over the years.

Therefore, WebmasterWorld has taken the strong proactive action of requiring login cookies for all visitors.

It is the biggest problem I have ever faced as a webmaster. It threatens the existence of the site if left unchecked.

The more advanced tech people understand some of the tech issues involved here, but given the nature of the current feedback, there needs to be a better explanation of the severity of the problem. So let's start with a review of the situation and the steps we have taken that led us to the required login action.

It is not a question of how fast the site rippers pull pages, but rather the totality of all of them combined. About 75 bots hit us for almost 12 million page views two weeks ago from about 300 IPs (most were proxy servers). It took the better part of an entire day to track down all the IPs and get them banned. I am sure the majority of them will be back to abuse the system. This has been a regular occurrence that has been growing every month we have been here. They were so easy to spot in the beginning, but it has grown and grown until I can't do anything about it. I have asked and asked the engines about the problem - looking for a tech solution - but up until this week, not a one would even acknowledge the problem to us.

The action here was not about banning bots - that was an outgrowth of the required login step. By requiring login, we require cookies. That alone will stop about 95% of the bots. The other 5% are the hard-core bots that will manually login and then cut-n-paste the cookie over to the bot, or hand-walk the bot through the login page.

How big of an issue was it on an IP level? 4000 IPs banned in the htaccess since June, when I cleared it out. If unchecked, I'd guess we would be somewhere around 10-20 million page views a day from the bots. I was spending 5-8 hrs a week (about an hour a day) fighting them.

We have been doing everything you can think of from a tech standpoint, and this is a part of that ongoing process. We have pushed the limits of page delivery and banning - IP based, agent based - and downright agent cloaked a good portion of the site to avoid the rogue bots. Now that we have taken this action, the total amount of work, effort, and code that has gone into this problem is amazing.

Even after doing all this, there are still a dozen or more bots going at the forums. The only thing that has slowed them down is moving back to session IDs and random page URLs (e.g. we have made the site uncrawlable again). Some of the worst offenders were the monitoring services. At least 100 of those IPs are still banned today. All they do is try to crawl your entire site to look for trademarked keywords for their clients.

> how fast we fell out of the index.

Google yes - I was taken aback by that. I totally overlooked the automatic url removal system. *kick can - shrug - drat* can't think of everything. Not the first mistake we've made - won't be the last.

It has been over 180 days since we blocked GigaBlast, 120 days since we blocked Jeeves, over 90 days since we blocked MSN, and almost 60 days since we blocked Slurp. As of last Tuesday we were still listed in all but Teoma. MSN was fairly quick, but still listed the urls without a snippet.

The spidering is controllable by crawl-delay, but I will not use nonstandard robots.txt syntax - it defeats the purpose and makes a mockery of the robots.txt standard. If we start down that road, then we are accepting that problem as real. The engines should be held accountable for changing a perfectly good and accepted internet standard. If they want to change it in whole, then let's get together and rewrite the thing with all parties involved - not just those little bits that suit the engines' purposes. Without webmaster voices in the process, playing with the robots.txt standards is as unethical as Netscape and Microsoft playing fast and loose with HTML standards.

Steps we have taken:

  • Page View Throttling: (many members regularly hit 1-2k page views a day and can view 5-10 pages a minute at times). Thus, there is absolutely no way to determine if it is a bot or a human at the keyboard for the majority of bots. Surprisingly, most bots DO try to be system friendly and only pull at a slow rate. However, if there are 100 running at any given moment, that is 100 times the necessary load. Again, this site is totally crawlable by even the barest of bots you can download. e.g. making our site this crawlable to get indexed by search engines has left us vulnerable to every off-the-shelf bot out there.
  • Bandwidth Throttling: Mod_throttle was tested on the old system. It tends to clog up the system for other visitors - it is processor intensive - and it is very noticeable to all when you flip it on. Bandwidth is not much of an issue here - it is system load that is.
  • Agent Name Parsing: Bad bots don't use anything but a real agent name. Some sites require browser agent names to work.
  • Cookie Requirements: (eg: login). I think you would be surprised at the number of bots that support cookies and can be quickly set up with a login and password. They hand-walk the bot through the login, or cut-n-paste the cookie to the bot.
  • IP Banning: takes excessive hand monitoring (which is what we've been doing for years). The major problem is that when you get 3000-4000 IPs in your htaccess, it tends to slow the whole system down. And what happens when you ban a proxy server that feeds an entire ISP?
  • One Pixel Links and/or Link Poisoning: we throw out random 1-pixel links or no-text hrefs and see who takes the link. Only the bots should take the link. It is difficult to do, because you have to essentially cloak for the engines and let them pass (it is very easy to make a mistake - which we have done even recently when we moved to the new server).
  • Cloaking and/or Site Obfuscation: that makes the site uncrawlable only to the non search engine bots. It is pure cloaking or agent cloaking, and it goes against se guidelines.
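The page-view throttling idea in the first step above can be sketched as a sliding-window counter per IP. This is only an illustration of the technique, not WebmasterWorld's implementation; the window and limit values are made up:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60          # length of the sliding window (assumed value)
MAX_VIEWS_PER_WINDOW = 30    # page views a human plausibly makes in it (assumed)

_hits = defaultdict(deque)   # ip -> timestamps of recent requests

def allow_request(ip, now=None):
    """Return True if this IP is still under the page-view limit."""
    now = time.time() if now is None else now
    q = _hits[ip]
    # Discard timestamps that have fallen out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    if len(q) >= MAX_VIEWS_PER_WINDOW:
        return False             # over the limit: serve a 403 or a ban page
    q.append(now)
    return True
```

A single slow bot stays under such a limit, which is exactly the problem described above: the throttle catches aggressive rippers but not 100 polite ones running at once.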

An intelligent combo of all of the above is where we are right now.

The biggest issue is that I couldn't take any of the above steps and require login for users without blocking the big se crawlers too. I don't believe the engines would have allowed us to cloak the site entirely - I wouldn't bother even asking if we could or tempting fate. We have cloaked a lot of stuff around here from various bots - our own site search engine bot - etc. - it is so hard to keep it all managed right. It is really frustrating that we have to take care of all this *&*$% for just the bots.

That is the main point I wanted you to know - this wasn't some strange action at banning search engines. Trust me, no one is more upset about that part than I am - I adore se traffic. This is about saving a community site for the people - not the bots. I don't know how long we will leave things like this. Obviously, the shorter the better.

The heart of the problem? WebmasterWorld is too easy to crawl. If you move to an htaccess mod_rewrite setup that hides your CGI parameters behind static-looking URLs, you will see a big increase in spidering that will grow continually. Ten times as many off-the-shelf site rippers support static URLs as support CGI-based URLs.


Rogue Bot Resources:



 11:52 am on Nov 27, 2005 (gmt 0)

That's the best post I've read for a while Scarecrow, thank you.

From a business point of view, I don't agree that WebmasterWorld is doing the right thing however. Crippling the site search isn't good for the existing subscribers and you cut off a big source of potential new subscribers.

Besides, 6 months down the line when the original reasons for this are long since forgotten, will anyone take a PR0 site with no pages in any search engines as a credible source of information for their SEO/webmaster tips and news?


 11:54 am on Nov 27, 2005 (gmt 0)

No, Google is not taking a sitemap as gospel. At least on my site, I've seen Googlebot crawl lots of pages without regard to the sitemap, and not at all jump to new pages "announced" in it.


 1:08 pm on Nov 27, 2005 (gmt 0)

My own bot traffic can be anything from 30% to 60% of the total, and the trend is generally upwards. I knew they are a nuisance, but I was gobsmacked by Brett's figures. If this is the shape of things to come, we should all be worried.

Some people have mentioned CAPTCHAs as a way around this. It probably won't help WebmasterWorld for the reasons Brett has mentioned. But I've been considering using them on some of my sites. What worries me is how to make them accessible.

How can you have a good CAPTCHA that doesn't exclude users with vision problems? I often wonder why I've never come across a site using sound files as an alternative CAPTCHA test.

Anyway I wish Brett the best of luck in dealing with this issue.


 1:30 pm on Nov 27, 2005 (gmt 0)

<OT>Rosalind ..I found one yesterday that uses voice links for people with problems reading captcha ..

It'll take me a little time to find it in my history ..
I'll sticky you when I have it ..

Couldn't find the quote ..did find the resource .


suggestions on how to use audio captcha's


 1:52 pm on Nov 27, 2005 (gmt 0)

<Still OT>
more on captcha's here

for those who need ..
</back OT>


 2:24 pm on Nov 27, 2005 (gmt 0)

I'm proud of you Brett and the WW team. It's a brave decision to do what you did, and I think it indeed shows your perceptiveness about why WW is actually popular. While some sites thrive on SE traffic, WW survives on the resourceful and brilliant individuals in its community who share their knowledge - that is what keeps individuals such as myself coming back again and again. It's good to know that someone who has a clue is running the show.



 2:32 pm on Nov 27, 2005 (gmt 0)

Great post ScareCrow.

I hate to strip just one comment here and parse it, but I wanted to remind folks that:

> No human can find this with a mouse unless they know about
> it and even then it's difficult to find with a mouse.

this is WebmasterWorld - our HTML is regularly read and analyzed by members on a daily basis. There is an HTML comment somewhere on the site that has an intentional abbreviation in it that looks like a misspelling. I can't recall how many times that has been pointed out to me.

Many people have found our random 1px spider traps - and have taken the link and been autobanned. Some of them really good members (*cough* even moderators) that have to send me an email and ask for access again.

So, I switched format to a 2 step page. They take the 1px or href link to the ban page that says "click here to be auto banned..." with some more info on it. Only the bots will take the page...
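The two-step trap described above can be sketched as plain request-handling logic. The paths and page text here are invented for illustration, not the site's actual URLs:

```python
# Step 1: an invisible 1px link points at /trap. A human who lands there sees
# a warning page whose only link says "click here to be auto banned...".
# Step 2: only a bot blindly following every href takes that second link.

banned_ips = set()

def handle_request(path, ip):
    """Return a response body for a request; ban on the second trap step."""
    if ip in banned_ips:
        return "403 Forbidden"
    if path == "/trap":
        # Warning page: a human reads this and backs out.
        return ('<p>This is a bot trap.</p>'
                '<a href="/trap/confirm">click here to be auto banned...</a>')
    if path == "/trap/confirm":
        banned_ips.add(ip)
        return "403 Forbidden"
    return "normal page"
```

The extra confirmation step is what spares the curious members and moderators mentioned above: a person stops at the warning, a link-following bot does not.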


 3:51 pm on Nov 27, 2005 (gmt 0)

> will anyone take a PR0 site with no pages in any
> search engines as a credible source of information
> for their SEO/webmaster tips and news?

Never thought of it that way. Guess I gotta hurry up and split so as to avoid getting trampled in a mass exodus come summertime.

Buh-bye. I am so outta here, it's like I'm an Eagles tune. If the content isn't showing in the SE's, and if there's no soylent green dripping off a toolbar, there's absotively posilutely nothing about the content or the community that builds the content that would be worth my time.

Of course, I could be wrong. :)


 5:18 pm on Nov 27, 2005 (gmt 0)

> Many people have found our random 1px spider traps - and have taken the link and been autobanned. Some of them really good members (*cough* even moderators) that have to send me an email and ask for access again.
>
> So, I switched format to a 2 step page. They take the 1px or href link to the ban page that says "click here to be auto banned..." with some more info on it. Only the bots will take the page...

How hard would it be for a bot programmer, once they know that your second page says "click here to be auto banned...", to program their bot not to follow any links from pages with the word "banned" on them?


 5:38 pm on Nov 27, 2005 (gmt 0)

That is the crux of the situation, twist. Every measure gives rise to a countermeasure. Which is, again, why we don't talk about the issue too much.


 5:53 pm on Nov 27, 2005 (gmt 0)

The measure v. countermeasure issue is a huge one.

Long ago, when terminal emulation and browser incompatibility was "the" time-consuming issue in BBS and website design, I decided to shoot for 90% (accessibility).

Needless to say I saved myself countless hours of workaround programming, client detection (and subsequent re-invention when "browser x" or "terminal z" altered their code).

Here's an off the wall suggestion:

1. Pare down WebmasterWorld to a "best of" (static, read-only) site --- let the bots and everyone else have at it.

2. Move all active content to (a completely registration based access) "WebmasterWorld-Pro" site.

3. Put links on every page of the "best of" site to the Pro site.

This would keep WebmasterWorld in the SE's, protect your content and reduce your server load.

Alternatively, blow off all SEs / bots / crawlers and go 100% subscription. Offer a discount for a reciprocal link to all subscribers --- since many are professional developers, programmers, designers, etc., your new visitor traffic will continue, and it will be visitors referred from on-topic sites.

Or... you could just require cookie login <grin>.

The problem needs to be resolved one way or the other if you plan to release your BestBBS software as the same thing will happen to anyone who develops a site with valuable content in the BB.


 5:58 pm on Nov 27, 2005 (gmt 0)

As I said, I made a Perl script that calculates the average page-access rate on the fly by reading access_log (excluding files like css, jpg, png, etc...) over a couple of different time frames, banning at the firewall if an IP exceeds X requests. This is very CPU-friendly at peak time, using 0.1% with an access_log that grows 1 GB/day. But WebmasterWorld needs a C program to do this; maybe I can write one and share the source code... Brett, how many uniques does WebmasterWorld get? (using cat access_log | cut -d" " -f1 | sort | uniq | wc -l)


 6:26 pm on Nov 27, 2005 (gmt 0)

This is not the method that I use, as I have chosen to use a database to keep track of pageviews so that they may be throttled, but here is another method that a friend of mine uses which doesn't use a database, so it is pretty light weight. It's not perfect, but it may be something to think about.

When a page is viewed, the HTTP Referer is checked along with a cookie that was saved with the current URL of the previous page that was viewed. If they match or if the request comes from a white listed IP, the cookie is updated with the current URL and everything continues as a normal page view.

If the HTTP Referer doesn't match the cookie, a cookie is set with the URL and a 302 redirect is used to send the visitor to a message page that says that their browser is being tested. On the message page, the HTTP Referer is again checked against the cookie, and if it matches, the visitor is sent on a javascript redirect back to the originally requested page. If it doesn't match, a friendly message is displayed recommending that the visitor use a compatible browser with its default settings.

This means that the visitor must have HTTP referral information, cookies, and javascript working properly in order to view pages of the web site. This is very hard to do with a web crawler. The beauty of this system is that the referral can be from anywhere (or blank, as would be the case with a URL type-in) and still be ok, as long as the browser can go through the little cookie/redirect sequence.
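A rough sketch of that referer-vs-cookie handshake; all function names, return values, and the whitelist entry are invented for illustration:

```python
def on_page_request(referer, cookie_last_url, requested_url, ip, whitelist):
    """First check: does the referer match the URL stored in the cookie?"""
    if ip in whitelist or (cookie_last_url and referer == cookie_last_url):
        # Looks like a real browser session (or an approved crawler):
        # update the cookie to the requested URL and serve the page.
        return ("serve", requested_url)
    # Mismatch or no cookie: set the cookie and 302 to the browser-test page.
    return ("redirect_to_test", requested_url)

def on_test_page(referer, cookie_url):
    """Second check, on the test page itself."""
    if cookie_url and referer == cookie_url:
        return "js_redirect_back"        # real browser: bounce back via JS
    return "show_browser_message"        # likely a crawler: friendly refusal
```

The point of the design is that no single signal is trusted: the crawler must carry cookies, send referers consistently, and execute javascript before it sees a second page.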

WebmasterWorld is a unique beast, and Brett has done an incredible job of bringing the board to its current level. There is nothing like it anywhere... Frankly, I like the speed at which the site is working now that the bots are gone, but WebmasterWorld is such a valuable resource that it just seems like there has to be something that can be done to alleviate the problem without barring Google and maybe a few others from helping others find their way here.


 7:13 pm on Nov 27, 2005 (gmt 0)

If any of you are thinking of writing C code to read access_log, here's a little tip. I'm not trying to be esoteric, but you Perl and PHP programmers may have gotten too used to the joys of automatic, indeterminate-length variables. (And you pay a CPU-load price for this during the automatic garbage collection.)

On Apache 2.0.50, I sometimes see errors of "request failed: URI too long (longer than 8190)" in the error_log. These are from a Trojan probe, probably issued by a zombie. I looked in access_log for these lines, and the corresponding log line was 26K bytes!

Use a big line-input buffer on C if you use fgets() with streaming input. And if the beginning of the line does not look like an IP address, make sure it continues gracefully to the next line.

Also, in the exception list of IP addresses for approved bots, use a leading-characters matching algo on the addresses. That way you'll have a dozen lines that do the trick, instead of a couple hundred. Google is 6.249.6* and 6.249.7* for me.

Sometimes I even have to let in the entire Class B (first two quads). I think slurp is one of those.
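The leading-characters matching described above is just a string-prefix test. A minimal sketch, with made-up prefixes rather than real crawler ranges:

```python
# Prefixes are illustrative examples only. Note that prefix matching is
# deliberately coarse: "66.249.6" also matches 66.249.60-69, which is the
# point - a dozen prefix lines replace a couple hundred exact addresses.
APPROVED_PREFIXES = (
    "66.249.6",    # e.g. one crawler's range, written as string prefixes
    "66.249.7",
    "65.55.",      # an entire Class B, for engines that crawl from many hosts
)

def is_approved_bot(ip):
    """True if the address starts with any whitelisted prefix."""
    return ip.startswith(APPROVED_PREFIXES)   # str.startswith accepts a tuple
```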

Do a DNS lookup and geolocation lookup and email yourself when you block someone, in case an approved bot with a new IP address got caught in your net. Then you can undo the block and add the address to the exception list in a timely fashion. Just last week I saw MSN doing a heavy crawl using 65.55.246.*, which was the first time I've seen that address. I white-list only half a dozen engines, and on average I've been surprised with new addresses about once every three months.

If you have cgi scripts that are especially CPU-expensive, each of them needs its own IP-address access monitor and access limiter inside the script itself. This is fairly trivial to do, in that you only need to look at the last few minutes of access. Keep a separate monitor file, and append just the IP address to it for each access. The advantage of doing this is that for some of your CPU-expensive scripts, you do not have the luxury of waiting for a block to kick in a minute or two later. Your box won't even survive half a minute if such a script is accessed many times per second. At this point it doesn't matter whether a particular IP address is from a human or a white-listed bot -- it must be stopped. In this mode, I don't use htaccess but just issue my own 403 from the script. I have a separate "abuser" log for just these scripts. Since I'm blocking only a handful of expensive scripts, I block the entire Class C of the offender. These blocks get reset once a week.
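One way the in-script monitor described above might look, assuming a simple append-only file of caller IPs; the path, tail size, and limit are made up:

```python
import os

MONITOR_FILE = "/tmp/expensive_script.ips"   # hypothetical per-script log
TAIL_LINES = 200                             # "the last few minutes" of accesses
MAX_HITS = 20                                # allowed per Class C in that tail

def class_c(ip):
    """First three octets - the whole /24 gets blocked, as described above."""
    return ".".join(ip.split(".")[:3])

def allow_and_log(ip):
    """Append this access; return False (emit your own 403) if its Class C
    appears too often in the recent tail of the monitor file."""
    recent = []
    if os.path.exists(MONITOR_FILE):
        with open(MONITOR_FILE) as f:
            recent = f.read().splitlines()[-TAIL_LINES:]
    hits = sum(1 for line in recent if class_c(line) == class_c(ip))
    with open(MONITOR_FILE, "a") as f:
        f.write(ip + "\n")
    return hits < MAX_HITS
```

Because the check runs inside the expensive script itself, the refusal is immediate - there is no waiting for an htaccess ban to kick in a minute later.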

Put it all together, and most of the programming on my site that I'm proud of, is stuff that no one ever sees! Is this any way to run the Internet?


 8:38 pm on Nov 27, 2005 (gmt 0)

Yeah Scarecrow, it's true.

Writing something in C for the internet is like art - especially making it safe and high performance.

(For this program, you don't need more than 4096 bytes as the line buffer, but you must guarantee a null at the end. The IP/date comes first, so if the request is too long, count it as a page view -- or just ban the IP as punishment -- and ignore the remaining characters until a '\n' is reached in the loop.)

>> Put it all together, and most of the programming on my site that I'm proud of, is stuff that no one ever sees! Is this any way to run the Internet?

It's not the only way, but it helps a lot :)


 9:20 pm on Nov 27, 2005 (gmt 0)

> - ip banning - takes excessive hand monitoring

Hand monitoring should be a temporary measure while you tweak your dynamic monitoring. You don't need to keep a massive list of banned IPs if you take Yahoo's approach. Ban the IP for an hour or so. (In other words, you are only tracking a time frame of no more than a few hours of IPs that were aggressive.) Couple this with a list of SE bot IPs to let the good bots in.
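The hour-long ban idea can be sketched as a small expiring table; the duration and names here are assumed for illustration:

```python
import time

BAN_SECONDS = 3600           # "an hour or so"
_banned_until = {}           # ip -> unix time when the ban lapses

def ban(ip, now=None):
    now = time.time() if now is None else now
    _banned_until[ip] = now + BAN_SECONDS

def is_banned(ip, now=None):
    now = time.time() if now is None else now
    expiry = _banned_until.get(ip)
    if expiry is None:
        return False
    if now >= expiry:
        del _banned_until[ip]    # ban lapsed: forget the IP entirely,
        return False             # so the table never grows into the thousands
    return True
```

The table stays small because entries are purged as they expire, which is the advantage over a permanent 4000-line htaccess list.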

> So, I switched format to a 2 step page. They take the 1px or href link to the ban page that says "click here to be auto banned..." with some more info on it. Only the bots will take the page

If I am coding a crawler to rip your site, this only stops me once, till I modify my code, renew my IP from my ISP, and start ripping again. It's the same dilemma you mentioned with captchas. (p.s. - I never scrape this site. just hypothetical)

I would give this a try. [webmasterworld.com...] Don't know if it will scale to the demand of this site but it basically depends on Linux filesystem lookups which are quite fast. Works great on some pretty busy sites for me.


 9:42 pm on Nov 27, 2005 (gmt 0)

Brett is NOT dumb, folks. He's brilliant. He knows what the hell he is doing, whether you can figure it out or not. This is history in the making, mark my words. I have no doubt that over the past 7 days WW has gotten as many backlinks as it did in the past year or two.

Maybe the only true one... I really cannot imagine a very big site refusing visitors/SEs due to overload... Traffic isn't a problem, it is the solution, so handling the bad robots and balancing the load will be the fun part :)

I'm sure I can help, but I'd be wasting my time trying to solve an issue that never existed.


 9:54 pm on Nov 27, 2005 (gmt 0)

> Ban the IP for an hour or so

There are some bot runners that think it is ok to run bots if they are respectful and at a slow pull rate. They think a page view per half minute is ok. Then they leave them on for the entire week and forget about them. An hour ban would stop maybe a third of the bots. This isn't an average site architecture or visitor. These are highly knowledgeable, highly motivated, and proactive visitors who are not new to site ripping.

Then we get into the hyper, super, merciless monitoring services. There must be 100 of these things running from 500 IPs. These guys are so totally out of control it isn't funny. I had calls this weekend asking why we had banned their xyz robot and they threatened us if we did not allow them back in. [webmasterworld.com...]


 10:03 pm on Nov 27, 2005 (gmt 0)

> These are highly knowledgeable, highly motivated, and proactive visitors who are not new to site ripping.

I see. I didn't realize that the bulk of the problem was from skilled rippers. Honestly, it's easy to get around any deterrent, including Google's, Yahoo's, and Jeeves'. (I still have trouble believing that these highly knowledgeable, ruthless rippers are obeying robots.txt.) In this case the only solution is load balancing. Hire an architect, gut BestBBS, and rebuild with a DB solution. Yes it's hard, and yes it's a lot of work, but bandaids won't cut it anymore.


 10:52 pm on Nov 27, 2005 (gmt 0)

FUD quote of the moment:

> will anyone take a PR0 site with no pages in any search engines as a credible source of information for their SEO/webmaster tips and news?

I do.

Given that most of us claim to have found WebmasterWorld via Google*, and that WebmasterWorld cannot be found via a search engine, how are these hypothetical "anyone"s from the above quote finding WebmasterWorld?

If "they" are finding the site via word of mouth, then a lot more information than the URL is being conveyed at the same time. I think it's a safe assumption that the site came up in a positive light, or at least due to current events - I don't usually point out which sites suck to my friends. Apparently this happens (positive WOM), as noted by grandpa:

> Word of mouth is a valuable tool. I gave someone your URL last night over the phone.

Unfortunately, he needed to sprinkle a little FUD on his comments:

> Now maybe he wrote it down, and found your site this morning. Maybe he didn't, or maybe he wrote it wrong. He would then still be trying to search for you this morning.

The friend obviously has a phone, but not the wits to call back to confirm the URL. Sad.

So, what if "they" found the site via a link from a WebmasterWorld member's site? Text surrounding the link may or may not convey additional info; however, like Google, given that the link is merely present, I'll see it as a "vote" for WebmasterWorld. (And again, like Google, I'll give that individual "vote" little weight, although my "algorithms" are better.)

Now, word of mouth or a text link may or may not inform me that WebmasterWorld is PR0, and that the site is not indexed in any SE - either way it doesn't matter, because I now know of the existence of WebmasterWorld. One look at the homepage will strongly influence your opinion of the credibility of the site - first impressions, you know. And to suggest that the PR of a site is directly correlated to the credibility of the site, well that's just daft. (Case in point, The Drudge Report with a PR of 7, and a "CredibilityRank" considerably lower.) Shoot, the vast majority of netizens don't even know what PR is ("so why should they care?", I say in my best Sony voice). Besides, haven't you heard? Google is broken! [webmasterworld.com]

*Myself, I can't remember how I first got here, but then again, I don't have the agenda, err, memory that others have.

Wow! While doing some research for this post, I was shocked to discover just how much others rely on WebmasterWorld to pad their own (lack of original) content. Going with the last example I cared to look at...

A SEO blog-type site has fifteen stories on its homepage; here's the breakdown:

WebmasterWorld stories: 7 (only one dealt with the bot-banning issue)
(5 different) WebmasterWorld "competitor" stories: 6
Original stories: 1
Personal stories: 1

Throw out the personal story, and WebmasterWorld accounts for 50% of the content! Counting links to WebmasterWorld, I found 22.

Because of its users, WebmasterWorld has been able to establish itself as an authority. (Everyone pat yourself on the back!) Because it is an authority, many sites link to it, therefore establishing an influx of new (l)users. There is life beyond Google!

I came across a story that states that it was Google that booted WebmasterWorld for cloaking, not the other way around...


 11:02 pm on Nov 27, 2005 (gmt 0)

> It has been over 180 days since we blocked GigaBlast, 120 days since we blocked Jeeves, over 90 days since we blocked MSN, and almost 60 days since we blocked Slurp. As of last Tuesday we were still listed in all but Teoma. MSN was fairly quick, but still listed the urls without a snippet.

Some noticed it in Vegas when I did Eytan's side-by-side comparison with Google: Google listed a snippet, but MSN just listed the url :-) Even Matt Cutts noticed it.

However, none of the WebmasterWorld members noticed it buried in that long robots.txt file. So - this action was started in July and has been ongoing.


 11:40 pm on Nov 27, 2005 (gmt 0)

I am not an expert at all, and have no clue how this is going to affect this site.

All I can say honestly is that if this site had not been in Google two years ago when I started with my first website, I would probably never have been able to achieve what I have done in two years.

I learned almost everything from this site, and I feel really sorry for all the people who are just starting out that they will not be able to find this site any more...

I just hope somehow, this will be soon resolved.


 11:50 pm on Nov 27, 2005 (gmt 0)

So... has it worked? Are the bad bots stymied? Or is this maneuver mostly useful as a way to find out which bots ignore robots.txt?

I have neither "bravos" nor "boos" for your move, Brett... I'm merely interested in it as an experiment, and it is interesting. I've long had to be logged in to read the site (Bellsouth DSL ban), so that isn't bothering me any. I am missing the site search, though <g>.


 11:52 pm on Nov 27, 2005 (gmt 0)

There are some places friends consider out of bounds to snooping ..like washbaskets when you are invited for dinner or bathroom cabinets when you take a leak ..

( like it just is weird that others do this ..like burglarising buddies )

'course I'm not normally that polite ( elsewhere ..other sites ) and most of us aren't otherwise we wouldn't be here ..but balam hit it right ..

BTW ..I retract my earlier comment about "shouldnt be discussed" ..I'd have missed too much stuff that I never knew was here ..the link outs to older threads here that I never saw ( Brett ..site search?..here always was flaky ..even a beta would be "betah" than none ..huh.. rush it dont be picky ;)..

nsqlg ..greetz

I am slightly curious as to the lack of input from jim , jatar_k and andreas ..carfac etc?..

and although it is waaaaay OT ..when /if WebmasterWorld comes back into the SE's ..might I suggest fora in other languages if bandwidth isnt a problem ..i personally know many who would contribute in french , spanish, portugese , german , turkish, chinese etc ..but hesitate due to the english only restriction ..the contributions particularly on this subject would be worthwhile ..

almost forgot ..( my son distracted me ..want's I see his bryce and flash navs ..so my speeling will have to stand ;) ..Scarecrow ..nice post ..


 12:18 am on Nov 28, 2005 (gmt 0)

So Brett - are you saying that it's the legit bots from the SEs that are causing the load on the system?
(re: your last post)


 12:24 am on Nov 28, 2005 (gmt 0)

Earlier Brett mentioned that 1 in 1,000 use internal site search. I'm not sure where you got that figure, but on my sites it's WAY higher. I see that 80% of my visitors use site search.


 12:55 am on Nov 28, 2005 (gmt 0)

Unless Googlebot has been causing problems, banning it is illogical. It will not help in the fight against bad bots. Either this is a bad decision caused by frustration or there is a reason for it that has not yet been mentioned.

I'm sure we've all made bad decisions as a result of frustration. Earlier today I blew up a circuit board in a flurry of sparks and melted diodes - just not thinking straight.


Lord Majestic

 1:25 am on Nov 28, 2005 (gmt 0)

I applaud Brett's decision to ban all bots using robots.txt - it should be either every-bot (good bots of course) or no-bot.

Allowing the front page to be crawled - it does not change often, and thus can be a compressed static page - would have been a good idea however; at least people who search directly for WebmasterWorld would be greeted with reasonably up-to-date info from the site, rather than a dodgy link.

I doubt it will solve the main issue of rogue bots however - only charging money for access and limiting pageviews can achieve that.


 2:28 am on Nov 28, 2005 (gmt 0)

> Are the bad bots stymied?

Working on it. It is like a good dog trying to shake a case of fleas.

>legit bots from SE's that are causing the load on the system?

A wee bit, yes. One engine hit us for 250k views in a day, even after being blocked by robots.txt. The only thing we could do was htaccess ban the IP.

> I'm not sure where you got that figure but on my sites its WAY higher. I see that 80% of my visitors use site search.

It all depends on the type of site notsleepy. I have another site, that hasn't seen 1 in 20k visitors use the site search.

> banning it is illogical.
> I applaud to Brett's decision to ban all bots using robots.txt

We didn't ban bots to start with - that was not the real action - we required login, which mandated blocking the bots or suffering the wrath of a totally abused login page. I would rather throw up a road block than sit here and let a bot crash into the 404 or login page because we required login and cookie support.

Lord Majestic

 3:18 am on Nov 28, 2005 (gmt 0)

Some people must be really desperate to crawl this site, perhaps providing highly compressed data feed that they can only retrieve in off-peak time will give reason for them not to engage in war of bullet (them) and armour (WebmasterWorld).


 5:03 am on Nov 28, 2005 (gmt 0)

I still think the best solution is to block the offenders with a firewall.

Ideally you would do this with a firewall device that sits in front of your webserver(s). A good Cisco device would be able to handle a really high request load, and would not cost your webserver a single CPU cycle.

A software based firewall running on the webserver would be ok too - though, since the request needs to be handled by your server still, that's going to result in some usage on that server.

Still - blocking at the level of a software firewall is better than trying to block from the apache .htaccess level - because the firewall operates at a lower level on the TCP/IP stack.

A free and very easy to setup/use software firewall that would run on red hat enterprise is called 'apf firewall'. It includes an IP block list (/etc/apf/deny_hosts.rules).

As far as the hardware firewall - I would definitely suggest you contact the RackSpace support people about this - they definitely are the experts. While the external hardware firewall is definitely the better choice for reducing usage on your server - it would be more expensive. Depending on the exact device they recommend for this, the cost might be in the range of $500 - $5000.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved