|In the era of increasing numbers of bad bots, is robots.txt irrelevant?|
Why bother? Why not go the other way and white-list with controls?
I just read through a number of posts in an attempt to understand what, if anything, can be done about bad bots.
I found the proposals for white-listing interesting: Throttle all bot type activity except bots with benefits. :)
Based upon what I've been reading I'm reduced to asking this about robots.txt:
And if we do bother, how much?
There are 3 types of bots:
1) good bots that obey robots.txt and can bring you immediate benefits (Googlebot, Slurp etc)
2) good bots that obey robots.txt but will not bring you immediate benefits, but may do so in the future - you can't be certain which of those bots will however
3) bad bots that don't obey robots.txt and will result in some sort of harm
So, clearly, you'd want to allow category #1, and you definitely will hate category #3. As for category #2, opinions split into:
1) short term - people care only about the CURRENT state of things, doing a fine job of maintaining the oligopoly in the search engine space - less competition means bigger Google update threads on this forum
2) long term - support the competition, because one or a few of them will be the search engines of tomorrow.
Robots.txt won't save you from #3, but even if you take the short-term view and only want to allow the well-known bots, still DO use robots.txt to disallow the others - this will save you bandwidth and the memory/CPU spent processing their requests.
You may not want every good bot that obeys robots.txt to crawl your site, but do yourself a favour - use robots.txt to disallow them rather than just issuing Access Denied or the like. What's the problem anyway? It's not as if disallowing only specific dirs is hard, and disallowing everything is even easier.
Before you ban all or some of the good bots that care to obey your robots.txt, ask yourself: do you want to be one of the posters in those mega-huge threads that appear on this BBS every time Google updates its index? Do you like depending so much on one search engine? For your own sake, support alternative search engines - if you have a fixed-fee traffic quota that is not fully used, treat it as an investment by letting good bots crawl you; one or a few of them could bring you countless benefits later.
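A minimal robots.txt along these lines might look like this (the named bots are just examples; a compliant crawler follows the most specific User-agent group that matches it):

```
# Named good bots may crawl everything
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

# Everyone else is asked to stay out entirely
User-agent: *
Disallow: /
```

An empty Disallow line means "nothing is off limits" for that bot, while `Disallow: /` under `*` tells every other compliant crawler to go away.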
You forgot #4:
4) bad bots that do obey robots.txt but use good bot names to mask their intrusions
Which is why I no longer list any bots in my robots.txt file and only put up the generic information that all bots should honor. All authorized bots are whitelisted via the .htaccess file, and everything else is bounced by Apache.
Scrapers get no publicly visible clue in robots.txt about which bots are or aren't allowed to crawl, so they don't know whom to mimic when trying to skirt security.
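A minimal sketch of that setup in Apache 1.3/2.x .htaccess terms might be the following (the IP range and environment variable names are illustrative only - verify a crawler's current ranges before trusting them):

```apache
# Whitelist a known crawler IP range (example range only)
SetEnvIf Remote_Addr "^66\.249\." allowed_visitor
# Let anything claiming to be a browser through; other layers vet those
SetEnvIf User-Agent "^Mozilla" allowed_visitor

# Deny by default, allow only whitelisted visitors
Order Deny,Allow
Deny from all
Allow from env=allowed_visitor
```

The point of the design is that robots.txt stays generic while the real access policy lives server-side, where scrapers can't read it.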
But if you don't want a good bot that obeys robots.txt to crawl your site, then you would disallow it completely - and if a bad bot that pretends to be that good bot actually obeys your robots.txt, it won't crawl your site. That would certainly qualify it as a good bot!
On the other hand, if you see a known good bot that you have disallowed in robots.txt actually trying to crawl your site, you can ban it safely, knowing it's most likely an impersonator.
The whole point of disallowing a good bot in robots.txt is to minimise requests from the real bot and also to catch impersonators of it quickly.
Let me qualify the word OBEYS: bad bots read robots.txt to gather just enough information to fly under the radar undetected, which is why I whitelist good bots by IP - that stops the problem in its tracks.
Many bad bots use popular browser user agent strings to fly under the radar, but how many browsers do you know that read robots.txt to avoid spider traps? The robots.txt file itself therefore becomes a spider trap for those crawlers. Also, humans can't read 200 pages in 15 seconds, so hiding the user agent while scraping as fast as possible is just silly.
The short answer to the OP is YES, I think robots.txt is mostly irrelevant with the exception of using it as a rogue spider trap.
Your point is correct, but only if you partially disallow some URLs to a good bot.
However, if you choose not to allow a known good bot to crawl your site at all, then that bot should NOT be crawling your site, period. Therefore you can automatically ban every request whose user-agent contains that bot's string, unless it's for robots.txt itself.
Thus a true good bot will read your robots.txt and go away, but any impersonator of that bot will get banned easily. This probably happens anyway, but using robots.txt minimises the load on your server from good bots and also automatically highlights possible robots.txt handling flaws in good bots.
Isn't the reputation of the entity running the bot the impetus for opening the door? If that entity has a good reputation, don't I start by whitelisting the bot?
If you run a bot that doesn't "do me any good in the here and now" AND IF your company has no public reputation for valid business practices what am I supposed to do? Wait all the while to see if any good comes of it?
Sounds like a bad idea to let the bot run "pending an outcome". The "outcome" could already be happening elsewhere, as the spidered content is fed to other "unrelated entities" in the here and now, right? Say, content for MFA sites?
I'm not convinced that allowing bots of unknown or unproven entities to run is a good idea.
Anyone know of a reliable source for information about "good bots"?
A good bot should always include a link back to a site that explains what it's doing and why (and how to block it). A webmaster can form an opinion based on that - bot owners who choose to be very secretive about what they do are unlikely to be considered good bots, so it's left to the webmaster's judgement.
It would be good if someone created a directory of bots classified by potential usefulness - perhaps on the basis of info reported in the bot-tracking forum on this BBS.
|Isn't it the reputation of the entity running the bot the impetus for opening the door? |
Very much so, and good press and buzz around a project might cause me to whitelist it if I assume it will be beneficial in the long run. Then again, there's a crawler claiming to be Silicon Valley VC-backed that has been crawling for years and claims to honor robots.txt, but I finally blocked them: other than the constant crawling, nothing on their site has ever changed in all this time.
|Anyone know of a reliable source for information about "good bots"? |
Your log files, the good ones send you traffic, the rest waste your resources.
The bigger problem I see besides the scrapers is the new WEB 2.0 startups with things like Oodle, Kosmix, etc. crawling your site to analyze your content and aggregate it into their offerings assuming your content matches.
a) who are they to build yet another business to make money off my back
b) who are they to waste my bandwidth and CPU without my permission
There is a sense of entitlement going on: if you have content on the web, then it's fair game to make money from under fair use laws. I'm sorry, but I think it's time PayPerCrawl became a valid concept. Slip me some of your VC funds via my PayPal account and I'll let you crawl maybe 1,000 pages for $5, or some such nonsense, just to help offset my costs of doing business - I wouldn't need the more expensive dual Xeon server if it weren't for all the bots.
That's the interesting thing about bots: bots are about someone making money off a website's content, until proven otherwise.
It only makes sense to open the door to bot-type activity IF the income derived from the bot's activity exceeds the cost of that activity.
Does this make sense?
Mostly it appears that bot activity is no friend.
I tend to agree with Bill's principle: "Until the bot operator is positioned to show me, pay me."
It strikes me that we're approaching the day when robots.txt is inconsequential, so why bother? Just ban all bot activity but for a whitelist.
I'm no expert, so that's why I'm asking: is robots.txt - which is mostly disregarded by the reportedly increasing number of bad bots - irrelevant, and should the discussion therefore focus on how to bar the door AND how to identify whitelist members?
Corollary: Would it be in the interest of existing search engines - in the scenario I'm discussing - to make it easier for websites to identify and get along with their bots?
|I tend to agree with Bill's principle: "Until the bot operator is positioned to show me, pay me." |
How do you see that working in practice? Should bot operators try to send a cheque to each registered domain owner? It is simply not practical - you're forgetting that there are billions of pages out there and alternatives to pretty much ANY content. The real question is how to efficiently ensure that site owners who don't want to be crawled get their wish granted - and using robots.txt is the best way to do that.
Bad bots certainly won't care about robots.txt, but it doesn't take much time to add a few Disallow directives, does it? If some bot REALLY uses much of your bandwidth, it will be at the top of your logs - just add it to robots.txt and live happily thereafter.
As a good bot operator, I can tell you that the last thing I want is arguments with webmasters about my bot crawling their sites - that's why I want to make sure it obeys the webmaster's wishes. If a site is REALLY popular, people will link to it, and its pages can still be found without ever being crawled.
So please don't be lazy - disallow bots specifically in your robots.txt. This will save resources for YOU and for those who run good bots - believe me, implementing good robots.txt support takes far more time than adding a few lines to the file.
Not a literal payday - though iBill's point is well taken - but clear evidence of the bot's business authenticity and legitimacy AND the probability of an ROI, not just bandwidth sucking, as the end result of letting the spider feed.
Otherwise, why bother to facilitate the bot?
Absent legitimacy and promise it seems like a lot of wasted bandwidth and CPU cycles.
Brett and iBill and a number of others are on to something, are they not?
The heck with the meek "oh please" of a robots.txt approach to bots. Show the webmaster a reason why your bot should be a guest in their house.
You have to focus on the big picture. Do you want to be one of the people complaining in Google update threads about sites losing all their positions in the SERPs? Last time I checked, that update thread had well over 1000 posts - do you REALLY want to depend on one single entity like this? Do you?
If not, then you have to support the alternatives - obviously MSN and Yahoo, but also others. You can't know which one will work out, so you have to make a few reasonable guesses - not very hard, really, and not expensive either, so long as you pay a fixed fee for a traffic limit you don't reach. If you pay for XXX GB of traffic but don't use it, why ban good bots? OK, they may not bring you gold right now, but the alternative is the current situation - an oligopoly or even monopoly in the search engine space.
Talking of robots.txt - if you don't want known bots that support robots.txt to crawl your site, then for God's sake disallow them in robots.txt. This will save resources for YOU and for bot owners, and it is really the best way to avoid being crawled by good bots - do yourself a favour and don't be lazy about adding a couple of lines to robots.txt.
|do you REALLY want to depend on one single entity like this? |
I certainly don't count on a single entity but after letting bots run wild on my site for years I can count on one hand the number of bots sending me visitors.
The number I've disallowed are in the hundreds at this point, and all they did was waste my resources. Why do I need Hungarian, Polish and Chinese spiders looking for native language pages? Or technical spiders looking for technical content? Or link checkers or news aggregators or SCRAPERS or any of the other myriad excuses for crawling and content lifting all wasting my time, money and resources to combat as it overloads my server.
Enough is enough - done, they've all been vanquished, and the user experience is now better than ever. The difference between WebmasterWorld and my site is that Brett could simply ask people to register if they looked suspicious, but I have no registration and no desire to ask people to register, so the problem was much harder to solve without putting roadblocks in front of everyone. But I managed to get it done, and the site hasn't been knocked down by a scraper in a long time.
You're right, robots.txt isn't difficult to implement, but many people use it as their only defense against spiders, and the bad bots ignore it - or worse, turn it against them by cloaking as an allowed bot. Blacklisting only the bad bots still leaves a gaping hole in your site security, whether in robots.txt or .htaccess: the number of spider names is endless, including crawlers that rotate through known agent names until they get in, generate random user agent strings that couldn't possibly be blocked, or pretend to be a browser with a human at the keyboard.
Fundamentally, the user agent string concept is flawed, has been heavily abused, and is barely worth looking at for site security anymore. Whitelisting crawlers by known IP ranges is best, or you can do a reverse DNS lookup on the first access to the server and validate the domain name in real time. Depending on user agent strings alone for security and crawl control is, for all intents and purposes, a waste of time.
Robots.txt can be viewed as a set of rules telling good bots what to do, and telling other well-behaved bots that you don't want crawling to go away. But by no means should you naively assume it provides any level of security controlling access to your content - it's merely a set of SUGGESTIONS for the robots, nothing more, nothing less, and most robots couldn't care less; the list of those seems to grow so rapidly it's simply mind-numbing.
Just remember that bots are NOT used only by search engines, and you'll realize what kind of quagmire we've gotten ourselves into - we aren't just discussing Google, Yahoo and MSN, we're discussing hundreds of bots.
It's depressing and overwhelming but it can be stopped.
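The reverse DNS validation mentioned above can be sketched as a forward-confirmed lookup - rDNS the IP, check the hostname is in a trusted crawler domain, then resolve the hostname back and confirm it matches the original IP. This is a rough Python sketch; the list of trusted domain suffixes is purely illustrative.

```python
import socket

# Illustrative list of crawler hostname suffixes we trust
TRUSTED_SUFFIXES = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")

def suffix_trusted(hostname, suffixes=TRUSTED_SUFFIXES):
    """Pure check: does the hostname end in a trusted crawler domain?"""
    return hostname.lower().rstrip(".").endswith(tuple(suffixes))

def verify_crawler_ip(ip):
    """Forward-confirmed reverse DNS: the rDNS name must be in a trusted
    domain AND resolve back to the original IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]   # reverse lookup
    except socket.herror:
        return False
    if not suffix_trusted(hostname):
        return False
    try:
        # forward lookup must return the original address
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

The forward confirmation matters: without it, anyone who controls reverse DNS for their own IP block could claim to be `crawl-x.googlebot.com`.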
As a defense against bad bots, robots.txt is largely useless.
As a means of providing some direction to good bots, it will remain useful indefinitely, or until some new standard is adopted.
I recall reading (here, maybe?) a while back about someone who DID use robots.txt to separate the good bots from the bad.
I don't have the specific details, but a robots.txt rule was set up to disallow ALL bots from a certain directory.
If a bot went to that directory anyway, a script added its name and/or IP address to the .htaccess file, which then barred it from any future access to ANY area of the website.
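The script half of that trap can be sketched in a few lines of Python. This is a hypothetical handler for requests to the disallowed directory; the .htaccess path and directive style (Apache's `Deny from`) are assumptions.

```python
def htaccess_deny_line(ip):
    """Build the Apache 1.3/2.x directive that bans one address."""
    return "Deny from %s\n" % ip

def trap(ip, htaccess_path=".htaccess"):
    """Called when something fetches the robots.txt-disallowed directory:
    append a ban for this IP; Apache picks it up on the next request."""
    with open(htaccess_path, "a") as f:
        f.write(htaccess_deny_line(ip))
```

In practice you would also deduplicate IPs and log the user agent, so you can tell which "good bot" name the offender was hiding behind.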
|until some new standard is adopted |
Which is somewhat desperately needed as there are all sorts of bots looking for all sorts of content and keeping pace with them just isn't viable.
You could keep a lot of bots off your site if you could just tell them what type of data you have on your site so Oodle wouldn't crawl looking for classifieds if you didn't have any, or Kosmix wouldn't be looking for health information, or that Polish crawler I'd never heard of before or can't pronounce wouldn't be scanning for Polish language pages, etc.
Additionally, there needs to be a mechanism to verify a bot is who it says it is, so scrapers can't just mimic Google and crawl the site.
I don't have a bright idea at the moment for solving the weaknesses of robots.txt other than whitelisting the allowed bots, with everything else blocked by default and enforced via .htaccess.
I'm not even sure, in this current climate of entitlement to crawl, that creating a better standard would solve anything.
|I can count on one hand... |
|Don't have a bright idea at the moment to solve the weaknesses of robots.txt other that just whitelisting the allowed bots and everything else is blocked by default and enforced using .htaccess |
This would pretty much solve the problem for me, as I only care about bots that send me traffic and do no harm. I am willing to lose whatever minuscule amount of traffic would be considered collateral damage.
Bill, can this be done effectively on a Win platform (.NET)?
New York, NY (February 7, 2006) - IAB and ABCE Announce Global Spiders & Bots Filtering List [iab.net]
The burden of bots appears to be on other people's minds.
Anyone see the IAB list? Comments?
robots.txt is very useful for one thing:
trapping bad bots! ;-)
and rewrite crawlthisanddie.htm to your htaccess trap, and poof!
no more bad bots...
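One way to wire that up is mod_rewrite in .htaccess. The page name comes from the post above; `trap.cgi` is a hypothetical script that records the offender (for instance by appending a `Deny from` line for its IP):

```apache
# crawlthisanddie.htm is Disallow'ed for every agent in robots.txt,
# so only a bot that ignores robots.txt will ever request it.
RewriteEngine On
RewriteRule ^crawlthisanddie\.htm$ /cgi-bin/trap.cgi [L]
# trap.cgi bans the caller, e.g. by appending "Deny from <ip>" here
```

The trap link itself should be invisible to humans (e.g. an unlinked or hidden URL), so only crawlers following every link will hit it.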
I don't trust robots.txt anymore. So about two years ago I created a server-wide database recording everything that might be a 'bot'. On opening a page, every visitor is classified as a 'normal' user or a 'bot'; if a 'normal' user seems to exceed 'normal' use, I look them up in whois, and in the future they may be treated like a good or bad bot.
Denying access to bad bots brought me up in the search engines and in Google AdSense, because there are fewer impressions without any value.
Now my main web site counts about 7,000 unique users daily and between 7,000 and 10,000 page views by bots, 60% of them 'good' bots.
Because I permanently deny access to 'guys' and 'bots' caught searching for files that don't exist - but which, if they existed, could be the basis for hack attacks - I keep the servers pretty clean.
Yes, it was work to create this kind of user tracking, but now, with a few clicks and whois lookups per week, I can handle the good-and-bad-bots theme perfectly.
|and rewrite crawlthisanddie.htm to your htaccess trap, and poof! |
no more bad bots...
That's too simplistic of an approach for 2 reasons:
1. Many bad bots also read robots.txt to avoid those traps so you only catch the stupid ones
2. Firefox and Google prefetch can also read those pages if the links appear on any pages and suddenly you have a bunch of people in your spider trap that aren't bots
Trust me, I use the same tricks, but it takes more profiling of the behavior to determine whether the visitor is really a bot or just someone using what I consider abusive browser technology.
One of the simplest things to do is kick the visitor into a random captcha once they step into the spider trap, which lets a human using a pre-fetching browser exit while the robots get stuck.
|Deny access to bad bots brought me up in search engines and in google adsense because there are less impressions without any value |
What you may have encountered - it happened to me - is high-speed scrapers/bots overloading your server at the same time Google/Y!/MSN was crawling, so the SEs got page timeouts while waiting behind the scrapers. This will lower your SERPs, as the SEs seem to assume the page content may not be available, and they only raise your SERPs again once those pages can be crawled without interference. So keeping bots out can raise your positions and income, but for different reasons than you speculate.
|The burden of bots appears to be on other people's minds. |
Without threadjacking into advertising territory, that's another problem I have as well: Firefox and Google pre-fetch both inflate page impressions, along with all the bots and scrapers. It's a bloody mess.
To those that value their content, it makes no sense to make it available to every bot that crawls out from underneath of every rock.
I went white-list only access a while ago, and have seen no drop in traffic. CPU % usage, however, has dropped from nearly 70% to around 30% and bandwidth usage has dropped by about 38%.
Robots.txt used to work, and was the norm back when things ran on an honor system. With all these MFA scraper sites and wannabe SEs out there trying to monetize someone else's work, putting up an "invitation only" sign at the gate makes sense to me.
It is a headache to keep the whitelist updated with the IPs and UAs to be allowed, and I am nowhere close to having it as complete and accurate as I would like, but it no longer makes sense for me to let every parasitic nightcrawler feast on my original work (my stock in trade) with impunity.
Sure, I am probably painting with a broad brush, and keeping some good guys from accessing my content unintentionally. That's the collateral damage I am willing to accept.
Well, this has been a good discussion.
But that raises the question: exactly what can a person do about "bad bots"? How can a site owner cut off the bad bots while letting the Googlebots/Slurps/MSN bots of the world through, if robots.txt doesn't work?
A simple primer for people would be nice!
|Jordo needs a drink|
What does a typical .htaccess whitelist file look like? Are you allowing access by IP or agent or both? If by agent, is it a correct assumption that you are allowing browsers and certain bots in, and if so, can't bad bots spoof those? If by IP, then how do you keep from blocking normal traffic?
My whitelisting goes something like this:
Allow Google's IPs
Allow MSN's IPs
Allow Yahoo's IPs
Allow user agents that start with Mozilla as browsers start with Mozilla
All other user agents bounce off the walls here, and this stops a huge amount of nonsense.
Then I check anything that starts with Mozilla against a huge list of banned substrings - words like "download", "nicebot", "java", etc. - and block anything that matches.
Beyond this, I have a script running in real time that analyzes all access by IP and session-tracking cookie (yes, some bots are silly enough to accept my cookies) and automatically blocks access based on speed of access, volume of access, stepping into spider traps, and so on.
I'm logging all bounced attempts and review those every couple of days just to make sure one of the allowed bots hasn't opened up a new range of IPs or something.
It's not pretty but it's pretty effective.
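The steps above could be sketched roughly like this in Python (the IP prefixes, banned words, and rate thresholds are all illustrative, not the poster's actual values):

```python
import time
from collections import defaultdict, deque

ALLOWED_IP_PREFIXES = ("66.249.",)               # e.g. a Googlebot range (illustrative)
BANNED_UA_WORDS = ("download", "nicebot", "java")
MAX_HITS = 20        # more than this many requests...
WINDOW = 10.0        # ...within this many seconds looks like a bot

_hits = defaultdict(deque)   # per-IP timestamps of recent requests

def allowed(ip, user_agent, now=None):
    """Return True if the request should be served."""
    now = time.time() if now is None else now
    # 1) whitelisted crawler IPs always pass
    if ip.startswith(ALLOWED_IP_PREFIXES):
        return True
    # 2) everything else must at least look like a browser...
    if not user_agent.startswith("Mozilla"):
        return False
    # 3) ...and not contain a banned substring
    ua = user_agent.lower()
    if any(word in ua for word in BANNED_UA_WORDS):
        return False
    # 4) sliding-window speed check per IP
    q = _hits[ip]
    q.append(now)
    while q and now - q[0] > WINDOW:
        q.popleft()
    return len(q) <= MAX_HITS
```

In production this logic would live in the web server or an access layer in front of it; the sketch just shows the layered checks in order.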
It'll be interesting to see how this is handled over the years as I too believe that robots.txt is outdated. You wonder if we'll hit a day when bot owners must lobby web hosts to allow access to their server. Kind of like an invitation only policy.
The problem with bad bots is only going to get worse with time.
I think my whitelist does not yet include all SE IPs that I ought to include in my allowed list, but I am adding them as I discover them in my logs. I allow Google, Yahoo, MSN, Ask, Looksmart, and a couple other niche SEs and directories that I like. I block the rest. I also have a mile long list of UAs that I have discovered in my logs, and see mentioned on this forum.
Keeping up with this is turning out to be a not-so-trivial task, but it seems better than the alternative.