Attack of the Robots, Spiders, Crawlers, etc.
part two
Brett_Tabke




msg:307365
 7:55 pm on Nov 25, 2005 (gmt 0)

Seems like there is a great deal of interest in the topic, so I thought I would post a synopsis of everything thus far. Continued from:

[webmasterworld.com...]


Summary:
WebmasterWorld is one of the largest sites on the web with easily crawlable flat or static HTML content. All of the robotic download programs (aka site rippers) available on Tucows can download our entire 1M+ pages. Those same bots cannot download the majority of other content-rich sites (such as forums or auction sites) on the web. This makes WebmasterWorld one of the most attractive targets on the web for site ripping. The problem has grown to critical proportions over the years.

Therefore, WebmasterWorld has taken the strong proactive action of requiring login cookies for all visitors.


It is the biggest problem I have ever faced as a webmaster. It threatens the existence of the site if left unchecked.

The more advanced tech people understand some of the tech issues involved here, but given the nature of the current feedback, there needs to be a better explanation of the severity of the problem. So let's start with a review of the situation and the steps we have taken that led us to the required login action.

It is not a question of how fast the site rippers pull pages, but rather the totality of all of them combined. About 75 bots hit us for almost 12 million page views two weeks ago from about 300 IPs (most were proxy servers). It took the better part of an entire day to track down all the IPs and get them banned. I am sure the majority of them will be back to abuse the system. This has been a regular occurrence that has been growing every month we have been here. They were so easy to spot in the beginning, but it has grown and grown until I can't do anything about it. I have asked and asked the engines about the problem - looking for a tech solution - but up until this week, not one of them would even acknowledge the problem to us.

The action here was not about banning bots - that was an outgrowth of the required login step. By requiring login, we require cookies. That alone will stop about 95% of the bots. The other 5% are the hard-core bots that will manually log in and then cut-n-paste the cookie over to the bot, or hand-walk the bot through the login page.
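For anyone wondering what a login/cookie gate like that looks like at the server level, here is a stripped-down mod_rewrite sketch - an illustration only; the cookie name and paths are made up, not what we actually run:

    RewriteEngine On
    # let the login page (and robots.txt) through, or you get a redirect loop
    RewriteCond %{REQUEST_URI} !^/(login|robots\.txt)
    # no session cookie present? send the visitor to the login page
    RewriteCond %{HTTP_COOKIE} !(^|;\s*)member_session= [NC]
    RewriteRule .* /login [R=302,L]

A dumb ripper that ignores Set-Cookie never gets past that redirect - that is the 95%. The remaining 5% are the ones that carry a real cookie through.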

How big of an issue was it on an IP level? 4000 IPs banned in the htaccess since June, when I cleared it out. If unchecked, I'd guess we would be somewhere around 10-20 million page views a day from the bots. I was spending 5-8 hours a week (about an hour a day) fighting them.
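For context, an .htaccess ban list is literally just a pile of deny lines like the sketch below (the addresses are from the reserved example ranges, not real offenders). Apache walks the list on every single request, which is why thousands of entries start to drag the whole site down:

    # classic Apache allow/deny access control
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.15
    Deny from 192.0.2.44
    Deny from 198.51.100.0/24
    # ...multiply by a few thousand entries and every request pays the cost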

We have been doing everything you can think of from a tech standpoint. This is a part of that ongoing process. We have pushed the limits of page delivery, banning (IP based and agent based), and downright agent-cloaking a good portion of the site to avoid the rogue bots. Now that we have taken this action, the total amount of work, effort, and code that has gone into this problem is amazing.

Even after doing all this, there are still a dozen or more bots going at the forums. The only thing that has slowed them down is moving back to session IDs and random page URLs (i.e., we have made the site uncrawlable again). Some of the worst offenders were the monitoring services - at least 100 of those IPs are still banned today. All they do is try to crawl your entire site looking for trademarked keywords for their clients.
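As a rough illustration of what session IDs in the URLs can mean in practice (a sketch only, not our actual code; the key and parameter names are invented): tie every internal link to the visitor's session, so a ripper replaying a saved URL list, or one that drops the token, finds nothing but dead links.

    import hmac, hashlib

    SECRET = b"rotate-this-key-now-and-then"   # hypothetical signing key

    def tokenized_url(path, session_id):
        # every link the page emits carries the visitor's session plus a signature
        sig = hmac.new(SECRET, f"{session_id}:{path}".encode(), hashlib.sha1).hexdigest()[:10]
        return f"{path}?s={session_id}&sig={sig}"

    def url_is_valid(path, session_id, sig):
        # reject requests whose token does not match the session that generated the link
        expected = hmac.new(SECRET, f"{session_id}:{path}".encode(), hashlib.sha1).hexdigest()[:10]
        return hmac.compare_digest(expected, sig)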

> how fast we fell out of the index.

Google, yes - I was taken aback by that. I totally overlooked the automatic URL removal system. *kick can - shrug - drat* Can't think of everything. Not the first mistake we've made - won't be the last.

It has been over 180 days since we blocked GigaBlast, 120 days since we blocked Jeeves, over 90 days since we blocked MSN, and almost 60 days since we blocked Slurp. As of last Tuesday we were still listed in all but Teoma. MSN was fairly quick, but still listed the urls without a snippet.

The spidering is controllable by crawl-delay, but I will not use nonstandard robots.txt syntax - it defeats the purpose and makes a mockery of the robots.txt standard. If we start down that road, then we are accepting that problem as real. The engines should be held accountable for changing a perfectly good and accepted internet standard. If they want to change it in whole, then let's get together and rewrite the thing with all parties involved - not just those little bits that suit the engines' purposes. Without webmaster voices in the process, playing with the robots.txt standard is as unethical as Netscape and Microsoft playing fast and loose with HTML standards.
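For reference, the nonstandard syntax in question looks like the fragment below (the numbers are arbitrary). Slurp and msnbot honor Crawl-delay, but it was never part of the original robots exclusion standard - which is exactly the point:

    User-agent: Slurp
    Crawl-delay: 20

    User-agent: msnbot
    Crawl-delay: 20

    User-agent: *
    Disallow: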

Steps we have taken:

  • Page View Throttling: (many members regularly hit 1-2k page views a day and can view 5-10 pages a minute at times). Thus, there is absolutely no way to determine whether it is a bot or a human at the keyboard for the majority of bots. Surprisingly, most bots DO try to be system friendly and only pull at a slow rate. However, if there are 100 running at any given moment, that is 100 times the necessary load. Again, this site is totally crawlable by even the barest of bots you can download; making our site this crawlable to get indexed by search engines has left us vulnerable to every off-the-shelf bot out there. (A rough sketch of this kind of throttle follows the list.)
  • Bandwidth Throttling: Mod_throttle was tested on the old system. It tends to clog up the system for other visitors - it is processor intensive, and it is very noticeable to all when you flip it on. Bandwidth is not much of an issue here - it is system load that is.
  • Agent Name Parsing: bad bots don't use anything but a real agent name, and some sites require browser agent names to work.
  • Cookie Requirements: (i.e., login). I think you would be surprised at the number of bots that support cookies and can be quickly set up with a login and password. They hand-walk the bot through the login, or cut-n-paste the cookie to the bot.
  • IP Banning: takes excessive hand monitoring (which is what we've been doing for years). The major problem is that when you get 3000-4000 IPs in your htaccess, it tends to slow the whole system down. And what happens when you ban a proxy server that feeds an entire ISP?
  • One Pixel Links and/or Link Poisoning: we throw out random 1-pixel links or no-text hrefs and see who takes the link. Only the bots should take the link. It is difficult to do, because you have to essentially cloak for the engines and let them pass (it is very easy to make a mistake - which we have done even recently when we moved to the new server).
  • Cloaking and/or Site Obfuscation: makes the site uncrawlable only to the non-search-engine bots. It is pure cloaking or agent cloaking, and it goes against SE guidelines.
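To make the page view throttling item concrete, here is a minimal sketch of a per-IP throttle (thresholds invented, not our actual numbers or code): keep a sliding window of recent hits per IP and refuse anything far beyond what a fast human reader could plausibly do.

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_VIEWS_PER_WINDOW = 30   # invented cap; fast humans peak around 5-10 pages a minute

    _recent_hits = defaultdict(deque)   # ip -> timestamps of recent page views

    def allow_request(ip, now=None):
        """Return True while this IP stays under the per-minute page view cap."""
        now = time.time() if now is None else now
        hits = _recent_hits[ip]
        hits.append(now)
        while hits and now - hits[0] > WINDOW_SECONDS:
            hits.popleft()
        return len(hits) <= MAX_VIEWS_PER_WINDOW

The hard part, as the first item above says, is that a cautious bot pulling one page every few seconds looks exactly like a keen member, so a throttle only catches the greedy ones.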

An intelligent combo of all of the above is where we are right now.

The biggest issue is that I couldn't take any of the above steps and require login for users without blocking the big SE crawlers too. I don't believe the engines would have allowed us to cloak the site entirely - I wouldn't bother even asking if we could, or tempting fate. We have cloaked a lot of stuff around here from various bots - our own site search engine bot, etc. - and it is so hard to keep it all managed right. It is really frustrating that we have to take care of all this *&*$% just for the bots.

That is the main point I wanted you to know - this wasn't some strange action aimed at banning search engines. Trust me, no one is more upset about that part than I am - I adore SE traffic. This is about saving a community site for the people - not the bots. I don't know how long we will leave things like this. Obviously, the shorter the better.

The heart of the problem? WebmasterWorld is too easy to crawl. If you move to an htaccess mod_rewrite setup that hides your CGI parameters behind static-looking content, you will see a big increase in spidering that will grow continually. Ten times as many off-the-shelf site rippers support static URLs as support CGI-based URLs.
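For anyone who has not seen that kind of setup, the rewrite being described looks roughly like the sketch below (script name and parameters invented): the public URL looks like a flat .htm file while Apache quietly maps it back onto a CGI script, which is exactly what makes the site look trivially rippable to off-the-shelf tools.

    RewriteEngine On
    # public URL /forum9/4567.htm is served by a CGI script behind the scenes
    RewriteRule ^forum([0-9]+)/([0-9]+)\.htm$ /cgi-bin/forum.cgi?forum=$1&thread=$2 [L]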

Thanks
Brett


jdancing




msg:307605
 6:41 pm on Dec 6, 2005 (gmt 0)

Offer tastefully done sponsorships for each WebmasterWorld sub-forum. Then use the extra ~$20K/mo. from those sponsorships to pay someone to migrate WebmasterWorld from flat files to a database-driven forum (like vBulletin) with search built in, and use the money left over to pay for more server power as needed. Problem solved.

I'd rather see a few non-obtrusive sponsorships than not have a search function to find the answers I need quickly at WebmasterWorld. I fear that unless WebmasterWorld gets search, the redundant posts and lack of utility will cause this site to start to die on the vine.

2by4




msg:307606
 9:54 pm on Dec 6, 2005 (gmt 0)

"vbulletin..."

Has the very worst search of any major forum software I've ever come across. When I land on a website running vBulletin I cringe in anticipation of the agony I will experience. Why vBulletin is considered a good option by anyone is absolutely beyond me, but I guess that's why Brett rolled his own - the alternatives all kind of suck, each in their own way. Search will have to be resolved; it will be interesting to see how Brett does that.

When I think bloatware, I think vBulletin, but I guess different people see different things when they look at this stuff. Anyway, again, the solution is not going to be some package like that; it's going to be something interesting, I'd guess, since site search is really pretty critical for this type of stuff - you can't bookmark every interesting thread. Yahoo search is also working fine, by the way; although Yahoo claims to have only 69k pages indexed, I found what I wanted that way.

walkman




msg:307607
 5:23 pm on Dec 7, 2005 (gmt 0)

>> migrate WebmasterWorld from flat files to a database driven forum (like vBulletin)

Why? To run a site like this on vBulletin, one would need 3-4 servers with GBs of RAM. Why complicate things even more?

Xuefer




msg:307608
 2:50 am on Dec 8, 2005 (gmt 0)


it will be interesting to see how brett does that

2by4: what do you guess it will be before it's out? And any ideas on a "good enough" search solution, both flat-file and database-driven?

DamonHD




msg:307609
 7:01 am on Dec 8, 2005 (gmt 0)

Years and years ago I wrote a completely flat-file driven Java search engine (which has survived unchanged since the JDK 1.0 / Netscape 2 days!) to work in the kind of environment that shuns DBs and CGI for efficiency: run a nightly/daily update of the compressed index file, and all searches run locally. Brett can have the code if he wants, and can make it look pretty. I still use the system every day...
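For anyone curious what a flat-file search like that involves, here is a rough Python sketch of the general idea (invented file names, and not my actual Java code): a nightly job builds a compressed inverted index from the flat HTML pages, and a query is just an intersection of posting lists read back from that one file.

    import gzip, json, pathlib, re

    INDEX_FILE = "search-index.json.gz"   # hypothetical index location

    def build_index(root="forum-pages"):
        """Nightly job: map every word to the flat files it appears in."""
        index = {}
        for path in pathlib.Path(root).rglob("*.htm"):
            text = path.read_text(errors="ignore").lower()
            for word in set(re.findall(r"[a-z0-9]{3,}", text)):
                index.setdefault(word, []).append(str(path))
        with gzip.open(INDEX_FILE, "wt") as f:
            json.dump(index, f)

    def search(*words):
        """Intersect per-word posting lists; no database, no CGI round trip."""
        with gzip.open(INDEX_FILE, "rt") as f:
            index = json.load(f)
        hits = [set(index.get(w.lower(), [])) for w in words]
        return sorted(set.intersection(*hits)) if hits else []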

Rgds

Damon

sleepy_kiwi




msg:307610
 8:25 am on Dec 8, 2005 (gmt 0)

OK, I haven't read every post so I don't know if this has been talked about or not (or what outcomes have been reached in the last 17 pages - I only read up to page 8).

Also, I'm not very knowledgeable on cloaking issues, IP banning methods, etc... so why am I posting?

What I read in the first 8 pages indicated that:
- If cloaking were allowed, much or all of the problem might be overcome.
- Some sites have an agreement with Google to allow this, although WebmasterWorld does not.
(If this is incorrect, then feel free to shoot down this part - although my main point below still stands:)

We are a community here guys and girls - a very powerful, valuable and profitable one to Google.

Surely (collectively) we can do something more than just write our objections and ideas here in this thread.

Sleepy
