This 246 message thread spans 9 pages.

Attack of the Robots, Spiders, Crawlers, etc.
Seems like there is a great deal of interest in the topic, so I thought I would post a synopsis of everything thus far. Continued from:
WebmasterWorld is one of the largest sites on the web with easily crawlable flat or static HTML content. All of the robotic download programs (aka site rippers) available on Tucows can download our entire 1m+ pages of the site. Those same bots cannot download the majority of other content-rich sites (such as forums or auction sites) on the web. This makes WebmasterWorld one of the most attractive targets on the web for site ripping. The problem has grown to critical proportions over the years.
Therefore, WebmasterWorld has taken the strong proactive action of requiring login cookies for all visitors.
It is the biggest problem I have ever faced as a webmaster. It threatens the existence of the site if left unchecked.
The more advanced tech people understand some of the tech issues involved here, but given the nature of the current feedback, there needs to be a better explanation of the severity of the problem. So let's start with a review of the situation and the steps that led us to the required login action.
It is not a question of how fast the site rippers pull pages, but rather the totality of all of them combined. About 75 bots hit us for almost 12 million page views two weeks ago from about 300 IPs (most were proxy servers). It took the better part of an entire day to track down all the IPs and get them banned. I am sure the majority of them will be back to abuse the system. This has been a regular occurrence that has grown every month we have been here. They were so easy to spot in the beginning, but it has grown and grown until I can't do anything about it. I have asked and asked the engines about the problem - looking for a tech solution - but up until this week, not one of them would even acknowledge the problem to us.
The action here was not about banning bots - that was an outgrowth of the required login step. By requiring login, we require cookies. That alone will stop about 95% of the bots. The other 5% are the hard-core bots that will manually log in and then cut-and-paste the cookie over to the bot, or hand-walk the bot through the login page.
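The cookie requirement works because most off-the-shelf rippers simply never send cookies back. A minimal sketch of that kind of gate, assuming a WSGI-style request dict (the cookie name and login URL are hypothetical, not WebmasterWorld's actual code):

```python
# Sketch of a cookie-based login gate: requests without a session
# cookie get bounced to the login page. A ripper that ignores
# Set-Cookie never gets past the redirect.
from http.cookies import SimpleCookie

LOGIN_URL = "/login"  # hypothetical login page

def gate(environ):
    """Return None if the request carries a session cookie,
    otherwise the URL to redirect the visitor to."""
    cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    if "session_id" in cookies and cookies["session_id"].value:
        return None      # cookie present: serve the page
    return LOGIN_URL     # no cookie: send to the login page
```

As the post notes, this only stops bots that cannot (or do not bother to) handle cookies; a hand-walked bot carrying a pasted cookie sails straight through.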
How big of an issue was it on an IP level? 4000 IPs banned in the htaccess since June, when I cleared it out. If unchecked, I'd guess we would be somewhere around 10-20 million page views a day from the bots. I was spending 5-8 hours a week (about an hour a day) fighting them.
We have been doing everything you can think of from a tech standpoint; this is part of that ongoing process. We have pushed the limits of page delivery: banning, IP based, agent based, and downright agent cloaking a good portion of the site to avoid the rogue bots. Now that we have taken this action, the total amount of work, effort, and code that has gone into this problem is amazing.
Even after doing all this, there are still a dozen or more bots going at the forums. The only thing that has slowed them down is moving back to session IDs and random page URLs (eg: we have made the site uncrawlable again). Some of the worst offenders were the monitoring services - at least 100 of those IPs are still banned today. All they do is try to crawl your entire site looking for trademarked keywords for their clients.
> how fast we fell out of the index.
Google yes - I was taken aback by that. I totally overlooked the automatic url removal system. *kick can - shrug - drat* can't think of everything. Not the first mistake we've made - won't be the last.
It has been over 180 days since we blocked GigaBlast, 120 days since we blocked Jeeves, over 90 days since we blocked MSN, and almost 60 days since we blocked Slurp. As of last Tuesday we were still listed in all but Teoma. MSN was fairly quick, but still listed the urls without a snippet.
The spidering is controllable by crawl-delay, but I will not use nonstandard robots.txt syntax - it defeats the purpose and makes a mockery of the robots.txt standard. If we start down that road, then we are accepting that problem as real. The engines should be held accountable for changing a perfectly good and accepted internet standard. If they want to change it in whole, then let's get together and rewrite the thing with all parties involved - not just those little bits that suit the engines' purposes. Without webmaster voices in the process, playing with the robots.txt standard is as unethical as Netscape and Microsoft playing fast and loose with HTML standards.
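For reference, the nonstandard extension being rejected here looks like this - Crawl-delay is a per-agent pause in seconds that Slurp and msnbot honor, but it was never part of the original robots.txt standard:

```
User-agent: Slurp
Crawl-delay: 10

User-agent: msnbot
Crawl-delay: 30
```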
Steps we have taken:
- Page View Throttling: many members regularly hit 1-2k page views a day and can view 5-10 pages a minute at times. Thus, there is absolutely no way to determine whether it is a bot or a human at the keyboard for the majority of bots. Surprisingly, most bots DO try to be system friendly and only pull at a slow rate. However, if there are 100 running at any given moment, that is 100 times the necessary load. Again, this site is totally crawlable by even the barest of bots you can download. eg: making our site this crawlable to get indexed by search engines has left us vulnerable to every off-the-shelf bot out there.
- Bandwidth Throttling: mod_throttle was tested on the old system. It tends to clog up the system for other visitors - it is processor intensive, and it is very noticeable to all when you flip it on. Bandwidth is not much of an issue here - it is system load that is.
- Agent Name Parsing: bad bots don't use anything but a real browser agent name, so there is nothing distinctive to filter on. Some sites require browser agent names to work.
- Cookie Requirements (eg: login): I think you would be surprised at the number of bots that support cookies and can quickly be set up with a login and password. Their operators hand-walk the bot through the login, or cut-and-paste the cookie to the bot.
- IP Banning: takes excessive hand monitoring (which is what we've been doing for years). The major problem is that when you get 3000-4000 IPs in your htaccess, it tends to slow the whole system down. And what happens when you ban a proxy server that feeds an entire ISP?
- One Pixel Links and/or Link Poisoning: we throw out random 1-pixel links or no-text hrefs and see who takes the link. Only the bots should take the link. It is difficult to do, because you essentially have to cloak for the engines and let them pass. (It is very easy to make a mistake - which we have done even recently when we moved to the new server.)
- Cloaking and/or Site Obfuscation: making the site uncrawlable only to the non-search-engine bots. It is pure cloaking or agent cloaking, and it goes against SE guidelines.
An intelligent combo of all of the above is where we are at right now today.
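The one-pixel link poisoning described in the list above is the least familiar of these steps, so here is a minimal sketch of the idea. The trap URL and the in-memory ban set are illustrative assumptions, not WebmasterWorld's implementation:

```python
# Sketch of one-pixel "link poisoning": embed a link no human would
# ever click, then treat any client that fetches it as a bot.
TRAP_URL = "/t/0x1.html"  # hypothetical URL linked via a 1x1-pixel anchor

banned_ips = set()

def trap_markup():
    """HTML fragment to drop into a page footer."""
    return (f'<a href="{TRAP_URL}">'
            '<img src="/1px.gif" width="1" height="1" alt=""></a>')

def on_request(path, ip):
    """Call for every hit; returns True if the request should be served."""
    if path == TRAP_URL:
        banned_ips.add(ip)   # only a crawler follows the invisible link
        return False
    return ip not in banned_ips
```

As the post warns, the hard part is the other half: search engine crawlers have to be prevented from ever seeing the trap (the cloaking step), or they get banned along with the rippers.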
The biggest issue is that I couldn't take any of the above steps and require login for users without blocking the big se crawlers too. I don't believe the engines would have allowed us to cloak the site entirely - I wouldn't bother even asking if we could or tempting fate. We have cloaked a lot of stuff around here from various bots - our own site search engine bot - etc. - it is so hard to keep it all managed right. It is really frustrating that we have to take care of all this *&*$% for just the bots.
That is the main point I wanted you to know - this wasn't some strange action at banning search engines. Trust me, no one is more upset about that part than I am - I adore se traffic. This is about saving a community site for the people - not the bots. I don't know how long we will leave things like this. Obviously, the shorter the better.
The heart of the problem? WebmasterWorld is too easy to crawl. If you move to an htaccess mod_rewrite setup that hides your CGI parameters behind static-looking URLs, you will see a big increase in spidering that will grow continually. Ten times as many off-the-shelf site rippers support static URLs as support CGI-based URLs.
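The kind of rewrite being described - a hypothetical .htaccess rule that makes a CGI forum look like flat HTML to every off-the-shelf ripper (script name and parameters are made up for illustration):

```apache
# /forum12/345.htm is really served by the CGI script, but appears
# to be a static page, so even the simplest rippers can walk it.
RewriteEngine On
RewriteRule ^forum([0-9]+)/([0-9]+)\.htm$ /cgi-bin/forum.cgi?forum=$1&thread=$2 [L]
```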
Rogue Bot Resources:
Suppose I suffer from a troublesome skin infection on my left hand. I could "cure" the infection by taking an axe and amputating the hand. Some people would, no doubt, admire my bravery, but it would still be a bad decision, especially if this action in no way ensured that the infection would not return somewhere else.
Since I am not Brett, I cannot know just how bad the problem is that caused him to take this action - nevertheless, I think it is highly likely he will regret this decision. Brett has already made the point himself that bad bots can be walked through the login/cookie stage, so, using the metaphor above, the infection will probably be back. If it doesn't return, it will be because WebmasterWorld has fallen so far off the internet map that it is no longer attractive.
[edited by: kaled at 12:05 pm (utc) on Nov. 29, 2005]
Folks - this thread is like army top brass having a discussion of tactics to which the enemy is invited. The more people keep suggesting things Brett could do to counter rogue bots, the more ideas are fed to the owners of said bots on how to get round whatever Brett puts in place.
I'd like to bet that serious discussion on security has been going on behind the scenes for many, many months. Isn't that the best place for it?
Many SEO forums are talking about what's happening here. I am sure this brings a lot of high-value traffic to the WW forums.
> I'd like to bet that serious discussion on security has been going on behind the scenes for many, many months. Isn't that the best place for it?
Perhaps, but didn't Brett start this thread?
> The more people keep suggesting things Brett could do to counter rogue bots,
> the more ideas are fed to the owners of said bots on how to get round whatever Brett puts in place.
I don't buy that argument, sorry. From a security standpoint, it makes more sense to make something that is just too damn hard to break into - what's happening here are just ideas, that's all.
The old test still applies: You make the code and then see if you can break it - if you can, then somebody else can too.
A few things to add to the pile.
Create a posting server with WebmasterWorld as we know it (named posting.webmasterworld.com) and a non-posting server (to keep things simple for the S/E, this is what would be known as www.webmasterworld.com).
posting.webmasterworld.com is off limits to all bots, and www.webmasterworld.com is file-synced from posting.webmasterworld.com via a periodic rsync or a custom syncing system.
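The syncing step might be as simple as a cron entry on the www host; the paths and schedule here are assumptions for illustration:

```
# Pull a read-only copy from the posting server every 15 minutes;
# --delete keeps removed threads from lingering on the public mirror.
*/15 * * * * rsync -az --delete posting.webmasterworld.com:/var/www/forum/ /var/www/forum/
```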
1: Ban IP addys from known large-scale hosting companies; this catches a lot of rogue critters.
2: White-list bots that play nice by IP address.
3: Teardrop bots that don't play nicely, releasing pages at a rate you can tolerate.
1: Ban IP addys from known large-scale hosting companies; this catches a lot of rogue critters.
2: Ban all S/E bots by IP and agent.
> vBulletin is really powerful, but is very resource intensive. I admin a forum that gets a fraction of the traffic WebmasterWorld does, and it takes two dual-Xeon servers with 4 GB of RAM each. I've gone to pains to eliminate features and queries to keep the load down, but we can still get overwhelmed by traffic peaks. I'm getting very zippy fractional-second page loads with 800 users online, but I'm waiting to see what will happen when we peak at 2000+ simultaneous users. (I can tell you what happens with one dual-Xeon server - it gets really slow and starts throwing database errors.)
So? There's the solution right there. Put up something like vBulletin so we've got the features we're all looking for, and throw hardware at it. So what if the system needs 5 servers to handle the load? Hardware isn't the problem... it's the solution. It sounds like we're clinging to some ideology while the users suffer. It's a busy forum - get some hardware and fix the problem. It's what any of the rest of us would likely do if we had sites that were bogging down the server.
(OT: As for your vBulletin specs, I've run vBulletin with almost a thousand simultaneous, very aggressively active users on a single P4 2.8GHz machine with a gig of RAM. And that's without tuning MySQL, Apache, etc. And when the forum started to make my business stuff lag, guess what I did? I threw more hardware at it and fixed the problem. Now we're looking at thousands of users online in the near future, and guess what the plan is to handle the load? Yup, a bigger server. It's fast, it's effective, and these days it's pretty cheap.)
> I guess more hardware is not an option at WebmasterWorld because it does not use a database - it's file-system based. But this limitation is just my assumption, and I may be wrong.
Even if it's file-based...
1. Get a fast 15K rpm RAID system with a high end SCSI controller with a hardware cache. Set the cache to write-back (make sure the card you purchase uses battery backed up RAM) to negate disc write delays.
2. Go 64-bit and load up with enough RAM to keep the entire forum file system in cache.
3. Use Google sitemaps. Re-generate the file each day via a cron job. Specify a recrawl time for threads based on their last reply/update date. For instance, threads replied to in the last 48 hours have a daily recrawl time, threads up to a week old a weekly one, etc. Threads over a month old get recrawled yearly. Since you're regenerating the file each day, any old thread a user resurrects will still get crawled in a timely manner. This is extremely effective for dynamic sites and has a much lower impact on bandwidth/cpu. Our forum has nearly 2 million pages in Google... and the site moves along just fine. (Hopefully Yahoo will come up with something better than its current scheme.)
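The age-based scheme in step 3 is easy to sketch. The thread data shape and URL set are hypothetical; the changefreq values are the ones the sitemap protocol defines:

```python
# Sketch of a daily-regenerated sitemap where recrawl frequency is
# derived from each thread's last-reply date, per the scheme above.
from datetime import datetime, timedelta

def changefreq(last_reply, now):
    age = now - last_reply
    if age <= timedelta(hours=48):
        return "daily"
    if age <= timedelta(days=7):
        return "weekly"
    if age <= timedelta(days=30):
        return "monthly"
    return "yearly"

def sitemap(threads, now):
    """threads: iterable of (url, last_reply_datetime) pairs."""
    items = "\n".join(
        f"  <url><loc>{url}</loc>"
        f"<changefreq>{changefreq(ts, now)}</changefreq></url>"
        for url, ts in threads
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{items}\n</urlset>")
```

Because the file is rebuilt nightly, a resurrected old thread moves back into the "daily" bucket automatically on the next run.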
Quick, I need a picture of a shark in mid air. ;)
Having no search is going to really increase the amount of reposts, and redundant newbie questions. While it probably won't match the amount of excessive pageviews created by spambots, how can blocking all bots be the right answer?
ok - couple threads in now and you all still don't get it do you?
It's not about the software, the hardware, or the money to run WebmasterWorld. The software rocks, hardware is cheap, and Brett can afford to run whatever configuration it would take to make any of the recent suggestions work.
I can't say I know for sure what it's really all about, but at this point I know for sure what it's not about. So stop offering servers and solutions already - BT and the team are pretty sharp folks - I'm sure they've covered the bases more than once in the past few weeks. Or keep offering - it's kinda fun to watch BT nix them all one by one ;)
earth to oilman... we were knocked offline twice since we moved to the new server... awareness is the first step.
Oilman, if he is just railing against automated critters in general, he is actually in a no win situation but hey he can rail all he wants. It is his nickel.
Thanks for sending a telegraph signal to oilman Brett. I was trying to figure out if FOO stood for something like confuse the techies out there ;-).
>>we were knocked offline twice since we moved to the new server
I know that :) My point is tho that if you wanted to implement any of the ideas that have been put forward in these threads you could. You're a good programmer (as this software alone proves) and I know that getting a couple/few dedicated boxes and load balancing etc is easily within your means if you wanted to do that as well.
Maybe there is no good way to actually manage the bot problem effectively and the only solution is to throw hardware and bandwidth at it and that's admitting defeat in a way.
The only other option is banning them entirely and requiring cookies and logins etc. Yes this is effective but there are side effects to the community. Just today I was asked if I saw a particular thread about something. I hadn't and I can't find it and the person who wanted me to read it can't find it either.
So how do you survive in this new WebmasterWorld? You live on the active list - check it every 30-60 minutes and flag threads of interest or set up a folder system in your bookmarks to categorize what will be many many bookmarks.
Now there's a thot - how do you lower page views? Direct page loads...
Off subject, but in the hopes of keeping traffic up,
I am always checking my WW RSS feed to see if there are any new headlines. Any chance of adding an RSS feed of recent messages that updates every hour or so. An RSS feed of "unanswered messages" or "hot topics" would also be nice. Other websites posting those RSS feeds couldn't hurt either.
I agree with Oil in a lot of ways, but - like a lot of other guys - he isn't getting the scale and scope of the situation. It isn't just server management, page views, bandwidth, or servers. There are also issues of scraping, copyright, and liability. The site is here for the human members - not bots.
I bet if we took off all the bot controls, it would take 20 servers just like this one to feed them.
while brett may or may not be interested in the advice in this thread, I definitely am, there's some really solid stuff in here, have to bookmark it and come back to some of these posters at a later date, if it's still an interesting problem that is.
If I was running a site this big I wouldn't be using redhat, and I wouldn't be doing the server admin myself, I'd be talking to a hardcore server freak who can make this stuff dance around his little finger, and it would probably be running on 64 bit systems.
"There are also issues of scrapping, copyright, and liability"
Such as? Interesting, but the disclaimer on the bottom of the page says posters own their postings. All forums have this problem, and all big ones have a big version of it I assume. There's nothing unique about webmasterworld except its size, and how good its software is.
u slipped that in while I was typing the reply.
> My point is tho that if you wanted to implement any
> of the ideas that have been put forward in
> these threads you could.
I am. There were a boatload of great ideas here - a whole bunch more in sticky also.
As you know, awareness is the first step - and as you also know, I've been talking about this issue forever. How are we supposed to do anything about it when our friends don't even get it?
Big thanks to XTug for the code.
> disclaimer on the bottom of the page says posters own their postings.
Exactly. You let us use it, but it is yours. eg: it doesn't belong to some RSS refeeder scraper site. Nor to some big site that scrapes the content for clients looking for their trademark name.
What seems to make the most sense is that this site was removed non-voluntarily by Google... Anyone else?
This site has no PageRank, backlinks, or links from the Google directory (although it is in DMOZ), and there are ways to solve this problem other than removing it from Google or banning Googlebot. Also, the new robots.txt and cookie requirements should have been enough... why remove it from Google entirely?
I'll save the anti-Google rant for when/if this line of thinking gets any support. But if this happened, I think Google is way out of line.
Read the whole thread... we put up the robots.txt ban, then required login, and then someone used the URL removal tool... so backlinks, etc. are gone.
I have an ASP.NET site on IIS, no logins etc.
I am getting sick of seeing my scraped content all over the web.
Can you tell me a few simple things to slow this down?
Obviously not to the lengths here, as that is impossible - just so I feel like I am fighting back?
I actually saw this with my own eyes earlier -- Brett agrees with me that it's high time to make robots.txt an opt-in protocol instead of opt-out. It won't solve the problem of rogue bots, but it will make it easier to identify them. The presumption of innocence will no longer be present if a particular bot isn't white-listed by a site.
This would also keep Google from grabbing directories it has no business grabbing, due to webmasters who don't know about robots.txt. Examples are lists of driving-while-intoxicated arrests (not convictions) from a county court in Florida, etc. The usefulness of Google as a provider of leads for hackers would go way down. Identity theft would be reduced.
It would help the scraping situation. It would help little websites, because niche search engines would be able to get a start - search engines generally would have to make the case that they are really interested in good content, and prove it with their rankings, or else lose the goodwill of webmasters who have real content.
It might hurt Google's profits a bit, but it's way too late to worry about that. It's time to save the Web.
It would help solve the cache/copyright problem by making it clear that web content is all copyrighted by default.
If and when the APA and Authors Guild win big against Google Book Search, it might be time to make a push for an opt-in system for robots.txt. Until that happens, we don't have a chance. After that happens, we will at least have the attention of some judges.
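Under today's opt-out protocol, the nearest approximation of the opt-in model argued for above is a whitelist robots.txt - with the obvious catch that only compliant bots respect it, which is exactly the complaint. An empty Disallow means "allow everything" for that agent:

```
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```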
this is so confusing, here I am, surfing webmasterworld with cookies off, using the googlebot useragent, and it's all there. I thought cookies were required to view content?
How is this supposed to stop bad bots?
>> someone used the url removal tool
Not good, Google. Not all people who ban robots want to be out for 180 days.
> with cookies off
Did you delete all cookies as well?
web developer toolbar, block all cookies, I even restarted firefox with cookies off, they're off, I can't post or anything but I can view the content fine. As far as I know, when you use the toolbar to block cookies, they are blocked. No cookies sent that is. Unless there is some bug I don't know about.
No problem surfing any part of the site with cookies off and google agent on.
This is actually a change, in the past I couldn't surf the search forums at all, especially google, using the googlebot user agent.
<added>just double checked, no, you won't duplicate this with opera, cookies off and standard useragent doesn't let you on the site, cookies off and googlebot useragent does, all of it. Hmmmmm.... same for msn and yahoo bots by the way. So there you have it, to scrape the site in its current configuration all you have to do is turn cookies off and use a fake search bot useragent... LOL.... next to try it with cookies on? It can't be this easy. then just set up a few proxies, and start downloading the site, should take a day or two...
[edited by: 2by4 at 2:40 am (utc) on Nov. 30, 2005]
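The spoofing demonstrated above is exactly why useragent cloaking alone can't hold: anyone can send the Googlebot string. The commonly recommended countermeasure is to verify a claimed Googlebot with a reverse-then-forward DNS lookup - genuine crawler IPs reverse-resolve to googlebot.com / google.com hostnames. A sketch (not WebmasterWorld's actual code; the network half needs DNS access):

```python
# Verify a claimed Googlebot by reverse DNS, then confirm the name
# resolves back to the same IP (so a spoofed PTR record won't pass).
import socket

def looks_like_google_host(hostname):
    """Pure check on the reverse-DNS name (testable without network)."""
    return hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")

def is_real_googlebot(ip):
    """Reverse-then-forward DNS check; requires network access."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        if not looks_like_google_host(host):
            return False
        # forward-confirm: the hostname must resolve back to the same IP
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

With this in place, a visitor sending the Googlebot useragent from a non-Google IP gets treated as an ordinary (cookie-required) client.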
Thanks, 2by4. I thought I was the only one - getting into twilight-zone territory until I read your post.
Added: No web developer toolbar, just FF 1.5, useragent changed, cookies disabled. You can't post, and you see the different outbound links (that we won't mention), but the whole site seems available.
Added again: Using the G useragent, that is, and yes with the cookies cleared first.
[edited by: Stefan at 2:58 am (utc) on Nov. 30, 2005]
brett's stuff has always been fun to look at, it's one of my favorite uses of all the firefox developer tools. What always cracks me up is that despite being very expert at many things, so many WebmasterWorld search engine forum posters have a really hard time actually seeing what brett does, which is always impressive, to me at least, but it goes unappreciated too often... this will probably be changed sometime soon, I need to grab the site while I can... no, just kidding brett, I don't want to grab it.
Craven de Kere
>How we suppose to do anything about it, when our friends don't even get it?
You say this over and over again, but I really think that in many cases it's not a matter of people not "getting it" so much as not agreeing with your method of handling it.
A simple example is the more hardware vs. less hardware lines. It's easy to appreciate the desire to manage load, but at the same time I don't think you can expect all of us to agree that your willingness to cripple the user experience instead of scaling the hardware is the best route.
>I bet if we took off all the bot controls, it would take 20 servers just like this one to feed them.
What to you is a problem to me is a solution. 19 more servers isn't worth doing this to a site to me and this may come across as not "getting it" to you, but that's a matter of differently weighed criteria and disagreement on priorities and methods, not obtuseness on my part.
Anywho, good luck. If you take on the windmill of zombie computing you will need it.
Yeah, looks like they just changed it.... but until a few minutes ago, you could turn off cookies and surf around the site as Googlebot.
You could do the same thing as Slurp, but the really interesting thing there was the robots.txt that you would see. It was not a "User-agent: *; Disallow: /", and it did not disallow Slurp. You can still see that--you should check it out while you still can. So I guess we'll be able to search Yahoo for WW pages for the foreseeable future. Another reason to use Yahoo.
I'll bet *anything* that with a Googlebot IP address and useragent you again would not see all useragents disallowed in robots.txt, and could crawl the site. But that I guess we'll never know...
Cloaking robots.txt is a pretty crafty trick, I thought we were told Slurp was banned from WW? I'm not sure what to believe anymore. Something funny is going on, that is for sure.
"I'll bet *anything* that with a Googlebot IP address and useragent you again would not see all useragents disallowed in robots.txt, and could crawl the site. But that I guess we'll never know..."
I'll bet you anything that at the time of this posting robots.txt is being cloaked by useragent.
I was just regretting not grabbing the old robots.txt for its bot list, but there it is, just like it used to be, no search bots banned at all.... now it's saved though... brett is still one of my favorite tricksters out there, I learn a lot from him, and some others here, all the time.
by the way, how do you cloak a .txt file? I remember trying that with .css but I couldn't figure it out, didn't spend very long on it though, that's in an apache php environment, is it just a htaccess thing? Anybody feel like sharing the code?
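Since the poster asks: in an Apache/PHP environment, a .txt file is just another URL to mod_rewrite, so it can be swapped by useragent from .htaccess. A hypothetical sketch of the general trick (a guess at the technique being observed, not a claim about how WebmasterWorld actually does it):

```apache
# Serve a different robots.txt to Slurp; everyone else
# gets the default /robots.txt untouched.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
RewriteRule ^robots\.txt$ /robots-slurp.txt [L]
```

The same pattern works for .css or any other extension, which is why the earlier attempt with CSS should also have been doable this way.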
I've never seen the point of these though:
do these actually obey robots.txt? Somehow I doubt it.
Of course the real question is what a google ip is seeing.
to the poster asking why brett doesn't just add hardware: he's answered that several times - the bbs software isn't written for multiple servers yet, although he's working on it.