Let's try this for a month or three...
last recourse against rogue bots
Brett_Tabke
msg:329347
1:21 am on Nov 19, 2005 (gmt 0)

[webmasterworld.com...]

required login - the real story here...
MSN and Yahoo bots were blocked in October. This does everyone else.

 

Brett_Tabke
msg:329557
3:25 pm on Nov 25, 2005 (gmt 0)

<don't worry, be happy>

> combined with bandwidth-throttling,

I think it would only hurt members (probably me worst of all). Most of the bots are not aggressive at speed, and fall somewhere in the middle of regular members' usage patterns. It is the constant number of them and their consistent spidering. Hence the reappearance of session ids and tagged pages on most pages of the site. This has helped weed out about another 30 bots that were smart enough to support cookies and be set up with a u/p login. The content is flowing back to unindexable...
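A rough sketch of that kind of link tagging, assuming a per-session token is stamped into every internal URL (the key and helper names here are hypothetical):

    import hmac, hashlib

    SECRET = b"server-side secret"   # hypothetical site-wide key

    def tag_url(path, session_id):
        # Tag internal links with a token tied to the session; bots
        # replaying links from other sessions hand back stale tokens.
        token = hmac.new(SECRET, session_id.encode(), hashlib.sha1).hexdigest()[:8]
        return "%s?sid=%s&t=%s" % (path, session_id, token)

    def token_is_stale(session_id, token):
        expected = hmac.new(SECRET, session_id.encode(), hashlib.sha1).hexdigest()[:8]
        return not hmac.compare_digest(expected, token)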

> bad bot repellent script

And what would that look like? It wouldn't look like:

- page view throttling (as mentioned, many members regularly hit 1-2k page views a day and can view 5-10 pages a minute at times). However, throttling above that threshold is doable... or, as a mod put it, "doneable".
- bandwidth throttling. Mod_throttle was tested on the old system and only clogs up the system for other visitors - it is also pretty processor-intensive, and it is very noticeable to all when you flip it on.
- agent name parsing - ya, it's laughable.
- cookie requirements (eg: login). I think you would be surprised at the number of bots that support cookies and can be quickly set up with a login and password.
- ip banning - takes excessive hand monitoring (which is what we've been doing for 7 years). The major problem is that when you get 3-4k IPs in your htaccess, it tends to slow the whole system down.
- intelligent combo of all of the above? yep, that is basically where we are at right now (rough sketch below).
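Something like that combo, sketched minimally in Python - the thresholds are hypothetical, set well above the heaviest member usage:

    import time
    from collections import defaultdict, deque

    MAX_PER_MINUTE = 30     # hypothetical: members burst 5-10/min
    MAX_PER_DAY = 5000      # hypothetical: heavy members hit 1-2k/day

    hits = defaultdict(deque)   # ip or session id -> request times

    def allow(key):
        now = time.time()
        q = hits[key]
        q.append(now)
        while q and now - q[0] > 86400:    # keep one day of history
            q.popleft()
        recent = sum(1 for t in q if now - t <= 60)
        return recent <= MAX_PER_MINUTE and len(q) <= MAX_PER_DAY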

> before having something else in place

agreed, the speed with which we fell out of the index caught me by surprise. I was expecting 30 days. In the interim a project came up that demanded attention this week. oops.

In the meantime, we could use some help from members [webmasterworld.com]. Many of those are from people who found us via an engine...

> we better index it like crazy

The major problem is that all this content is dynamic, and getting "if modified since" support is difficult. We will have to move to non-parsed-header scripts and generate our own if-mod-since headers. That would slow down spidering enormously.
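For the curious, a sketch of what generating our own headers might look like (a Python stand-in; the handler shape is hypothetical):

    from email.utils import formatdate, parsedate_to_datetime

    def conditional_headers(last_modified, if_modified_since):
        # last_modified: unix time of the newest post on the page.
        # Answer an unchanged page with a bodyless 304 instead of
        # regenerating it.
        if if_modified_since:
            try:
                since = parsedate_to_datetime(if_modified_since).timestamp()
                if last_modified <= since:
                    return 304, {}
            except (TypeError, ValueError):
                pass   # malformed header - send the full page
        return 200, {"Last-Modified": formatdate(last_modified, usegmt=True)}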

> can you use a program like Fluid Dynamics Search Engine?

I am a major fan of Zoltan's work over on XAV. What a script. As most know here, it was what we were using up to about 200k pages, when it slowly faded because it was so slow and so much of a system killer. Great program - I highly recommend it on smaller sites.

> immense bandwidth charges.

Currently, we have no bandwidth charges other than the base fees. I don't expect that to change any time soon.

> Ever thought of burning the whole site
> to DVD and selling it?

Yes, thought of it a lot. There are a lot of issues involved - many of which outweigh the benefits at this time.

> incentive for people to subscribe

Your word of mouth recommendation is the best reason there is for people to subscribe and support the site.

> Not sure how it got fixed, but it did.

They moved from dynamic to static content. They moved from a mid-range box to 3 high-end boxes with round-robin DNS. They optimized their scripts/programs. They were more aggressive in offering downloads of the db. They banned scripts designed to raid them. They started changing key bytes in the html, which crippled older scripts' ability to parse it - thus they would have to be rewritten. Lastly, there were so many sites running copies of the ODP that Google got aggressive at pr0'ing them. That decreased the value of an ODP clone site to about zero.

> I hope you know what you are doing Brett,

hehe. ya right. Life is best when it is lived on a Wing-n-Prayer baby!

> I think we should trust his judgement on this one.

Oh, I'm not saying it was a perfect decision. Oilman mentioned we'd been considering requiring cookies for ages, but this was not long preplanned. It was an emotional reaction to all the bot attacks, scraper sites, blog leeches, open Chinese proxy sites, and hyper-aggressive big search engine crawlers. It was a throw-your-hands-in-the-air, I've-had-it moment. The site is here for the members to be involved and engaged in - robots do not make an online industry community - people do.

> announced his decision to kill them on
> this site just as that month finished.

Hey, I remember Nostradamus mentioning something like this. ;-) (I think this is how rumors start. lol!)

> tell us the real reason

I give up trying to disabuse people of any notion to the contrary. I've laid it all out for you here. It doesn't matter what I say, Mick - there will be those that think just the opposite. I've said since day one that rogue bots are the #1 issue we face here as a community site. I don't talk about it too much in public because it is hard to discuss when the very problem is reading the page. It is just like Google talking about spam issues too much. Once they do, someone will design a system to game that feature within a few hours. After we put up the required login here, I found 10 new bots that had been given cookie support and were crawling the site as logged-in registered members.


> put up more ads.

Sorry, not at this time. We like it direct-advertising free as it is now. I don't discount the possibility, but we have no plans at present to put advertising on the site.

Yes, we do give exhibitors and sponsors of PubCon page views here and that is in their agreement with us. We do that to promote the conference and let people know who is going to be at the show. It supports us and the members, MORE than it does the exhibitors.

> then how do the big sites deal with it?

They have multiple servers, most of which are static and don't require constant syncing. Other big sites - auction houses, email sites, and other massively dynamic sites - have custom software that keeps their servers in sync. Basically, what they have is "read from any server, but write to ALL servers" code. That code is about half done here. Yes, we will be investing in infrastructure and fleshing out that setup in the next year. Gasp - moving to a full real db (sql) is in our not-too-distant future.
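A minimal sketch of that pattern, assuming each server exposes simple get/put calls (hypothetical stand-ins for the real storage layer):

    import random

    class ReplicaPool:
        # Reads are spread across replicas; writes fan out to every
        # replica so the copies never drift apart.
        def __init__(self, servers):
            self.servers = list(servers)

        def read(self, key):
            return random.choice(self.servers).get(key)

        def write(self, key, value):
            for server in self.servers:
                server.put(key, value)   # a failed replica would need
                                         # retry/repair logic in practice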

> doesn't need the same 20 questions asked again and again every single day

And a search function does not always do a lot to help that. The content is all so similar here that, on many issues, even the major search engines were of little help. How many duplicate "duplicate content" or "supplemental results" questions has the Google forum seen? A search engine didn't help there.

> I just hope that his new search system
> doesn't cause the same issue for him.

It will for a while. There will be problems with that at the start. Moving the search engine to a new server is the answer, but that will take a few weeks.

So, patience please...

In the meantime, we could use some help from members [webmasterworld.com].

</don't worry, be happy>

...wow those are pretty black helicopters outside.

stuntdubl
msg:329558
4:02 pm on Nov 25, 2005 (gmt 0)

Reasons I can see for banning all bots:

  • bandwidth costs
  • testing bots that don't obey robots.txt
  • giving people seizures from over contemplation
  • threads like this [webmasterworld.com] get bookmarked and will get found
  • because no one else has the balls and everyone will mention it anyways
  • virality is more potent than spiderability
  • the supporters' forum never got spidered anyways
  • lots of chaff that will sink to the bottom
  • 2m pages is a helluva lot to crawl, even for legit bots
  • Quality is better than quantity - if you could show me the BEST 10% of WebmasterWorld it would probably take me a year or three at 3 hours per day to even come close to reading the gold.
  • focusing on the small percentage of quality by axing quantity
  • encouraging users not spiders to be the quality filters

There are plenty of reasons WHY NOT to ban bots. I'm trying to see the best reasons WHY it's potentially a GOOD idea.

Perhaps I'm simple, but has anyone else got additional speculation on rationale that I'm missing?

There are definitely pros to search functionality. What are the additional major POSITIVES to banning ALL bots?

dmoz24
msg:329559
4:54 pm on Nov 25, 2005 (gmt 0)

Firstly, wish you all the best.

If there are no bandwidth charges, why turn out the bots?

And from what I have found, turning out Google is not good. I feel that nowadays Google is sort of a necessary evil - evil because most webmasters are over-dependent on Google.

Also, an off-topic question, just if you happen to read my post: when can we small webmasters get our hands on BestBBS?

To cut the off-topic rant short: good luck, and I hope that you succeed.

tigertom
msg:329560
5:00 pm on Nov 25, 2005 (gmt 0)

What I would like to know is:

What do bad-bot users get out of spidering WebmasterWorld all the time? What do they do with the data?

An established search engine I understand, but not some guy in his bedroom...

trillianjedi
msg:329561
5:02 pm on Nov 25, 2005 (gmt 0)

> why turn out bots

Bandwidth is not the only resource (CPU/RAM/Apache max connections, etc.).

TJ

balam
msg:329562
5:24 pm on Nov 25, 2005 (gmt 0)

> What do they do with the data?

Create their own copy of the site, burned to DVD, and then pan for the gold locally?

Powdork
msg:329563
6:22 pm on Nov 25, 2005 (gmt 0)

So if I had a copy on my cpu, I could then use Google Desktop to search it?

lawman
msg:329564
6:52 pm on Nov 25, 2005 (gmt 0)

> Hello Play_Bach. How did you find WebmasterWorld?

Probably through a Google search. And you?

You must've found us before we dropped out of the index. :)

I found Search Engine World in my logs one day. From there I found WebmasterWorld. BT has a name for that, but it escapes me right now. ;)

chopin2256
msg:329565
7:06 pm on Nov 25, 2005 (gmt 0)

Here is a suggestion for a good site search:

A good search engine, which I was referred to by doing a Google site search on this site, is [swish-e.org...] I looked at tons of site-search scripts, but this one caught my eye. Even though it was harder to set up, I liked it better, and I asked tons of questions on how to set it up. It was worth it, because I have a superb site search on my site now. It works very much like Google, and you can even give priority to certain words/phrases, such as meta tags, the word "webmaster" or "webmaster world", etc.

You can use a perl script to spider only HTML files, so you don't spider other garbage.
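Something along these lines (a Python stand-in for the perl script; the names are hypothetical):

    from urllib.parse import urlparse

    PAGE_SUFFIXES = (".html", ".htm", "/")   # crawl pages, skip assets

    def should_fetch(url):
        # should_fetch("http://example.com/forum/index.html") -> True
        # should_fetch("http://example.com/logo.gif") -> False
        path = urlparse(url).path or "/"
        return path.lower().endswith(PAGE_SUFFIXES)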

I would advise checking it out; it is very good, and their support team even helped a beginner like me set this complicated thing up. In fact, their support and software are so good that I even wrote up a beginner tutorial on how to set it up, what you need, and how to install it (in case you want to take a look).

But definitely give [swish-e.org...] a shot!

And if the direct URLs aren't allowed, Google swish-e.

[edited by: chopin2256 at 7:15 pm (utc) on Nov. 25, 2005]

Brett_Tabke
msg:329566
7:14 pm on Nov 25, 2005 (gmt 0)

fyi: just read a post suggesting that this was all done at the same time. No, it was not. We slowly weeded out the other SEs and bots until the final action.

We had specifically blocked slurp/yahoo/inktomi/msn in robots.txt over 45 days ago, and Jeeves/Gigablast and several other 3rd-tier crawlers over 60 days ago.

Swish - been there, done that - numerous problems, most of which I don't remember (that was like 3 years and two dozen SEs ago...).

> stuntdubl

Very nice post.

Lastly, some are saying that I am recommending this for others - absolutely NOT. This is not something you should do unless you are in the same boat as us...

Yes, you can ban a lot of bots via IPs - but when your htaccess hits about 3000-4000 IPs and you are cleaning those out monthly, it is time to do something else.
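One sketch of "something else": load the ban list once into memory, so each request costs a single O(1) set lookup instead of Apache rescanning thousands of .htaccess Deny lines on every hit (the file name and helpers are hypothetical):

    BANNED = set()

    def load_banlist(path="banned_ips.txt"):   # hypothetical file
        # Load once at startup, reload on change - not per request.
        with open(path) as f:
            BANNED.update(line.strip() for line in f if line.strip())

    def is_banned(ip):
        return ip in BANNED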

Play_Bach
msg:329567
7:16 pm on Nov 25, 2005 (gmt 0)

> You must've found us before we dropped out of the index. :)

I think I first started visiting WebmasterWorld last December as a result of some Google search. Once I figured out how to site-search it using Google, I found WebmasterWorld could often provide the answers I was looking for.

However, having recently dabbled with trying to get a PHP/MySQL full-text Boolean search script to work right, I quickly realised that getting results on a level with Google was going to be no small feat! :-(

Three months later, I've bagged the idea and am back to reading programming books until I have a better grasp of what *really* is involved in making a site search that actually works... yikes.
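For reference, the kind of query involved - MySQL boolean-mode full-text over a hypothetical posts table with a FULLTEXT index:

    # Requires: ALTER TABLE posts ADD FULLTEXT(title, body);
    QUERY = """
        SELECT id, title,
               MATCH(title, body) AGAINST (%s IN BOOLEAN MODE) AS score
        FROM posts
        WHERE MATCH(title, body) AGAINST (%s IN BOOLEAN MODE)
        ORDER BY score DESC
        LIMIT 20
    """

    def search(cursor, terms):
        # Boolean mode supports +required, -excluded, and "phrases".
        cursor.execute(QUERY, (terms, terms))
        return cursor.fetchall()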

chopin2256
msg:329568
7:34 pm on Nov 25, 2005 (gmt 0)

> Swish - been there, done that - numerous problems, most of which I don't remember (that was like 3 years and two dozen SEs ago...)

Let me know which site search you choose. I am curious, because I only want the best for my site as well.

Although 3 years is a long time ago, and things can change. The latest version is from Fri, 17 Dec 2004.

My forum (100,000 pages) is nowhere near as big as this one, but it is a decent size and swish-e seems to handle it beautifully. But maybe it breaks with super-huge sites?

Brett_Tabke
msg:329569
7:57 pm on Nov 25, 2005 (gmt 0)

continued:

[webmasterworld.com...]
