This 57-message thread spans 2 pages.
|Do Bot-Blocking Techniques Alter Bot Behavior?|
Various methods discussed: 403 Forbidden vs. 200 OK
System: The following 8 messages were cut out of thread at: http://www.webmasterworld.com/search_engine_spiders/3758861.htm [webmasterworld.com] by incredibill - 10:31 am on Dec. 11, 2008 (PST -8)
The problem with robots.txt is that you can't just keep adding disallows for ever on the off-chance that a bot will actually obey them - most don't and there are far too many in any case, especially when one has to update robots.txt by hand for several dozen sites, all with different blocking requirements. Life is too short.
The converse is to allow certain bots (eg googlebot) and disallow all the others, but then one never can be sure when a useful new one turns up without trawling through the logs.
I favour blocking/logging on certain header and IP criteria within an actual page access and returning an appropriate header code (e.g. 403, 405). Not that a lot of bots take any notice either way. It would be useful if htaccess were a standard part of IIS but as ever MS broke it, and the cost to add a proprietary version to a multi-site server is prohibitive.
|It would be useful if htaccess were a standard part of IIS |
ISAPI_Rewrite v3 supports the Apache standard for htaccess and is only US$99 for each server. Even for someone like me with 10 servers the cost versus benefit is hardly prohibitive. Without it I'd likely need several more servers at around US$250 each to handle the load. Granted though, "prohibitive" is still a relative term.
-- allow certain bots (eg googlebot) and disallow all the others --
There are IP ranges & reverse DNS to catch "eg googlebot". For the rest: remove robots.txt from your server and set up a 404.asp that catches requests for files that are not found on your server, e.g. robots.txt. If the request is made from a real bot, serve your real robots.txt text.
Here is a ColdFusion routine that does the trick (written for versions older than CF8).
On IIS, CGI.query_string arrives as "404;http://www.example.com:80/robots.txt" when the file is not present on the server.
<!--- Strip the "404;" prefix and the ":80" port to recover the requested URI --->
<cfset notFoundURI = ListGetAt(ReplaceNoCase(ReplaceNoCase(cgi.QUERY_STRING, ':80', '', 'all'), '404;', '', 'all'), 1, '?')>
<cfset notFoundQS = "">
<cfif ListLen(cgi.QUERY_STRING, '?') gt 1>
    <!--- Anything after the "?" is the original query string --->
    <cfset notFoundQS = ListGetAt(ReplaceNoCase(ReplaceNoCase(cgi.QUERY_STRING, ':80', '', 'all'), '404;', '', 'all'), 2, '?')>
</cfif>
For "https", replace ':80' with the port https is served on.
hope this helps.
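The "IP ranges & reverse DNS" check mentioned above can be sketched in Python (a minimal sketch, not production code; the function names are mine). The key point is the forward-confirm step: a forged PTR record alone is not enough.

```python
import socket

def hostname_is_google(host):
    """Pure check: Google's crawlers reverse-resolve into these zones."""
    return host.endswith(".googlebot.com") or host.endswith(".google.com")

def is_verified_googlebot(ip):
    """Reverse-resolve the IP, check the zone, then forward-resolve the
    hostname and require it to map back to the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # e.g. crawl-66-249-66-1.googlebot.com
    except OSError:
        return False
    if not hostname_is_google(host):
        return False
    try:
        return ip in {info[4][0] for info in socket.getaddrinfo(host, None)}
    except OSError:
        return False
```

The same two-step check works for the other major engines, with their own reverse-DNS zones substituted.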
I don't think there is a way to prevent scrapers via the UA or IP. Perhaps as webmasters we are making the mistake of serving 403s or 404s, and that is an indication to the other end to start changing IPs, referrers, UAs etc. until they get in. How about serving a 200 in such cases with some irrelevant content? Even a blank page may be more useful (and wastes less b/w). The headers at least won't signify anything out of the ordinary for a scraper, and the advantage is they may even fetch the content and try to use it.
my 403 is a royal 806 bytes, double the size if you will, for Free! Would you like fries (206 + every letter O replaced with 0, makes it really fun and easy to find where the content ends up) with that? ;)
yes I understand but it is still a 403. And that basically tells the scraper-bot "try something else". While a 200 OK has to go under review, doesn't it?
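A minimal sketch of that 200-OK decoy idea, assuming some upstream scraper-detection test (the `is_scraper` flag and page strings here are hypothetical):

```python
# Decoy body: looks like an ordinary page, carries nothing worth scraping.
DECOY = "<html><body><p>Welcome! Content is loading...</p></body></html>"

def respond(is_scraper, real_page):
    """Return an (http_status, body) pair. Detected scrapers get a 200 with
    decoy content: the status line and headers look ordinary, so there is
    no signal to start rotating IPs, referrers, or user agents."""
    if is_scraper:
        return 200, DECOY
    return 200, real_page
```

The decoy body could also be watermarked (e.g. the O-to-0 substitution mentioned above) to trace where scraped content ends up.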
Gary, when I last looked that was $99 per virtual server - ie per site.
Blend27 - I don't have cold fusion. I've considered setting up IIS to process/serve .txt files but it's too much hassle at the moment.
Enigma1 - the problem with serving 200's is that they will continue to hit the site, often at a high rate, until a human intervenes, probably when he gets up next morning. Although they often disobey 403's they do tend to go away more quickly and the bandwidth is smaller than actually serving a dummy page - although admittedly that isn't really the problem. In any case with a 403 I block the IP and if it gets too busy I drop the IP into IIS's IP security box.
What we're talking about with bot blocking is actually adding a DMZ (demilitarized zone) security layer to your web site.
Whether it's blocking via 403 forbidden, 200 OK, UA filtering or IP blocking is more or less purely academic.
There are 2 basic types of IP ranges out there which are hosting and non-hosting (business/residential), or bots vs browsers.
Firewall hosting (bots):
The first layer of defense is to firewall off all the hosting data centers and then punch holes in that firewall for valid allowed services such as search engines, blog feed services, etc.
Firewall non-hosting (humans):
Whitelist to allow only valid browser user agents, because major services don't operate outside of hosting companies, so anything non-browser coming from these areas gets the boot.
Non-hosting challenges and bot traps
There will definitely be things operating from non-commercial locations trying to penetrate your sites pretending to be browsers.
The final approach after whitelisting browsers obviously has to be behavior based such as setting up bot traps, speed traps, checking other behaviors like use of images, JS or CSS, validating headers, validating requests, etc. and last but not least a challenge to validate it's a human.
Once you know the average page views for your site, which is typically 5-10 on mine, anything that skews that number is a prime candidate for challenging at perhaps 20 page views and perhaps set a daily page view cap (100-200) in case the scraper manually answers the challenge.
There are some other techniques but you get the basic idea.
[edited by: incrediBILL at 11:18 pm (utc) on Jan. 8, 2009]
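The challenge-at-20 / daily-cap thresholds described above could be sketched roughly like this (an in-memory Python sketch; the constants and function names are my assumptions, and a real deployment would persist the counters across processes):

```python
import time
from collections import defaultdict

CHALLENGE_AT = 20     # page views before a human-verification challenge
DAILY_CAP = 150       # hard cap, in case the scraper answers the challenge
WINDOW = 86400        # counters reset daily (seconds)

# ip -> [page view count, window start time]
hits = defaultdict(lambda: [0, 0.0])

def classify(ip, now=None):
    """Return 'allow', 'challenge', or 'block' for this page request."""
    now = time.time() if now is None else now
    count, start = hits[ip]
    if count == 0 or now - start > WINDOW:
        count, start = 0, now  # fresh window for this IP
    count += 1
    hits[ip] = [count, start]
    if count > DAILY_CAP:
        return "block"
    if count > CHALLENGE_AT:
        return "challenge"
    return "allow"
```

With average human sessions at 5-10 page views, the thresholds leave plenty of headroom for real visitors while anything that skews the number gets challenged.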
|The final approach after whitelisting browsers obviously has to be behavior based such as setting up bot traps, speed traps. |
I had this vision of you sitting there on the Internet Highway disguised under your Wii Avatar pointing a radar gun at passing bots. :)
I am so looking forward to the "Whitelisting" methods of dealing with this stuff. One of these days I'll understand .1% of what you are doing. I'm on board, that is for sure. ;)
|Do Bot Blocking Techniques Alter Bot Behavior? |
I would surely think so. Think of it from a bandwidth perspective. If you are blocking all the bad bots, that leaves plenty of bandwidth for the good ones to come in and do their thing.
Do bots queue? I know they thread. For example, if I were on some type of limited bandwidth hosting plan and at the time Googlebot came knocking there were rogue bots using up resources and the bandwidth cap was being approached quickly. In comes Googlebot and bam, the site goes offline due to that cheap hosting service and me bandwidth limitations! < Heh, you know me and my Tin Hat thing. But, I've been responsible for referring sites and because they were on some sort of limited hosting plan, it took them down for the count right after the referral. :)
|In comes Googlebot and bam, the site goes offline due to that cheap hosting service and me bandwidth limitations! |
Actually, that has happened to me but not because of bandwidth limitations, but CPU limitations. When bad bots are behaving really badly and you have a few of them ripping a site at the same time that Google, Yahoo and Live (which crawl me all day) are also hitting the site, plus add a few hundred live visitors and the CPU backlog starts.
Next thing you know there are several hundred pending page requests in the queue and it's all grinding to a slow halt.
It would eventually clear itself out if everyone went away, but more people and more bad bots just keep coming and G/Y/L are persistent little beasts, so the odds of this bottleneck clearing soon without manual intervention are slim.
That was before I added speed traps on my site.
This does not happen any more.
Speed traps require some means of back-referencing IPs against time of access. I recently updated my (home-made) bot trap software and fully intended to store IPs, good and bad, in a MySql database with times, expiry dates etc with a view to doing just that. Time - that's the problem.
Managing an old and a new version of the software simultaneously took too long so I dropped back to using the original simple text file to store banned IPs. The problem with that is it takes too long to load a large number of IPs every time. At the Sql Injection peak I had to purge the file of IPs daily just to stop the processor falling down. Even now I have to remove probably-ex-trojanned broadband IPs weekly.
I regularly look at yahoo and msn non-bots and wonder if it's worth adding to the banned list but I'm never sure what the effect will be.
I treat robots.txt as a means of telling the big three not to access / index certain pages such as contact forms and AUPs. If something disobeys I log it, look at it and usually kill it. Hence my comment in another thread today re: bad behaviour of yahoo-test.
It also contains a couple of bot traps, but very seldom does a bad bot read robots.txt, and even less often does it follow up on blocked pages, so the trap seldom gets triggered.
|MySql database with times, expiry dates etc with a view to doing just that. Time - that's the problem. |
MySQL adds an additional load on the OS and when the database reaches a critical level it itself will become a vulnerability. Why do I know this? Been there, done that, watched it hit overload when a massive attack came and caused a meltdown.
Now I use the OS as the file system, breaking up the IPs into 4K directory buckets, and it's 1 file per IP. Screaming fast, no additional software to cause meltdowns, and I clean up old files on a cron job every hour so I don't utilize too much disk space.
The only things I do log and save long term are the bad bots themselves and what they did for further analysis.
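A rough Python sketch of the one-file-per-IP scheme Bill describes (the directory layout, bucket count, and names here are my assumptions, not his actual implementation):

```python
import hashlib
import os
import tempfile

BASE = os.path.join(tempfile.gettempdir(), "badip")  # hypothetical base dir
BUCKETS = 4096  # keep each directory small so filesystem lookups stay fast

def _path(ip):
    """Hash the IP into one of 4K directory buckets; the filename is the IP."""
    bucket = int(hashlib.md5(ip.encode()).hexdigest(), 16) % BUCKETS
    return os.path.join(BASE, "%04x" % bucket, ip)

def block(ip):
    path = _path(ip)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()  # empty file; its mtime drives cron cleanup

def is_blocked(ip):
    # A bare stat() call per request: no database, no file contents to read.
    return os.path.exists(_path(ip))
```

An hourly cron job deleting files older than some cutoff (e.g. `find $BASE -type f -mmin +60 -delete`) keeps disk usage bounded, matching the cleanup described above.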
|the trap seldom gets triggered |
They're onto the 1 pixel image trap and some of the other stuff.
I use a hidden iframe, you can't get to it in your browser, that points to a blocked page in robots.txt and it gets hit quite often.
Another trick is to put a hidden spider trap page out there that isn't blocked in robots.txt at all, use meta NOINDEX, NOFOLLOW, NOARCHIVE on that page in case a real bot lands on it and give the real bots a free pass for landing on that page.
Other natural spider trap pages to look for that humans rarely visit are PRIVACY, LEGAL, etc.
I use the free IIRF rewrite engine on IIS. It's on Microsoft's CodePlex.
On the other hand, Bill, I've seen the CPU go into overload trying to read all of the IPs into memory for every hit - that was at the height of SQLI at 1500+ IPs per day with no purging.
My rationale is that with MySQL the actual page has to do very little work - is it in the database? Yes, block / no, test something else. If the load gets bad then push the db onto another server then the db can be used by other servers...
I sort of like the one-file-per-IP idea but Windows has never liked a lot of files in one directory. Yes, they could be split into several on a Class-C basis. But I suppose if they are only for speed testing that reduces the number. Something to think about. :)
The trap URLs in robots.txt are by no means the only traps! I don't use the 1-pixel trap at the moment but have recently been considering it in addition to others. I'm paranoid about it being hit by the big three, though, even with a nofollow in the link - yahoo is often very sloppy about following links it shouldn't and not identifying itself properly (though that is increasingly applying to google and msn). The same reasoning applies to trap pages. I can see a way around the problem but again it's time. :(
iframes are now becoming subject to AV checks and some browsers are set to not load them. I suppose under your circumstances that's not so much a problem. :)
|My rationale is that with MySQL the actual page has to do very little work |
That was the same as my rationale until it hit the fan.
I'm processing about 40K unique IPs a day and when some bad boys start to hammer the site and ramp up the bot tracking code running that database is when things really get interesting.
Remember that at the same time your site is being abused by a multi-IP attack, all the regular SE bots and visitors are still online, not to mention the regular daily barrage of 1K junk IPs wanting on your site.
That's when the core overloads and the site starts leaking radiation into the atmosphere and the next thing you know you need to have the server declared a hazardous site and apply for government super fund grants.
|overload trying to read all of the IPs into memory for every hit |
I don't do that either, for just that reason.
Without giving away all my tricks, I can literally ping the OS to see if the file exists and know if the IP is bad or not based on the filename. Each IP gets its own filename and those are cleaned up as I go along.
|You whitelist only allow valid browsers user agents because major services don't operate outside of hosting companies so anything non-browser coming from these areas gets the boot. |
Bill, you need to consider how some legitimate visitors think about this, including me.
There is a reason I block the UA when I go around the web. As you understand, I trust a server (when I am a visitor) as much as I trust a client (when I am the site owner). If I release the UA information of my browser, it is very easy for the server to know the browser vulnerabilities. Every time a browser is patched, as you know, there are all kinds of "highlights". So it's possible for a server to send the right data and hijack my browser, knowing exactly the weaknesses of a given version. And that's something I cannot afford. If an attacker compromises a browser, he can see everything that follows. So I have to block it and simply send a blank one.
Now you did say to filter based on IPs. Whether you use a database or files to store IP information shouldn't matter. It's how you manage the resources. Obviously if we overload a database server it will break down. But the same goes for files: if you have a huge number of files in the same folder, things will start to break down. So we segment these, either by using different folders or different servers or different databases. It also depends what kind of access you have for a site. On shared, virtual or dedicated servers, privileges and resources are different, and they also change from host to host. Because it's not only what filtering goes in the scripts, it's also how the firewall behaves, what ports are open, etc. In many cases I will have to rely on the existing host's services and can do little if they have a problem. For instance, what good will all the filters in the world do if the cPanel software isn't updated?
Now if someone is determined and wants to scrape the site's content, he can first go into the cached content of SEs, because it's fast, uptime is 100% and he's after the main text content. Of course you can block that with some meta-tags, but there are always other ways. Ways that do not involve visiting the site yet. We did discuss at some point some of the translation services from search engines; we also have elite proxies and many other "man in the middle" entities. If a scraper cannot access a site one way he will always try another. One element that is disturbing, and I see it in my logs, is that RFIs are coming from all kinds of IPs and try to include a piece of code that is located on whatever site you can imagine. Put together with the fact that I see this on low-traffic sites, it means the number of compromised servers out there is immense. That number of compromised servers can give the scrapers all the resources they ever need. (I am trying to get some real figures from the logs)
Anyways, in my opinion 40K daily unique IPs is high traffic for a single server to cope with. If you had a store with products to sell and 40K IPs daily, there is still the need to generate a session and validate it, pretty much the same way as you do with IPs. That would stretch the database server anyways.
dstiles - they will continue to hit the site either way. A 403 signals something: "Change IP" and/or "Change Headers". With 200 they may very well assume the content they received is real, so perhaps they will go away faster. In my opinion the more different signals you send, the easier for the other end to determine a weakness. The thing is to properly detect the scraper, because you don't want to send some garbage to a human visitor.
|That's when the core overloads and the site starts leaking radiation into the atmosphere and the next thing you know you need to have the server declared a hazardous site and apply for government super fund grants. |
Do you know if the high MySQL CPU overhead is primarily associated with insert/update query activity vs. just selects (read only)? Or is basic MySQL overhead too much to start with?
On entry I hit the db once just to see if the IP is range blocked.
I'm not positive about which activity caused the problem, probably update if I had to guess, all I know is under heavy fire the result was MySQL errors showing up with no website. Not a good situation for a bot blocker when it ultimately killed the site.
That's when I switched to PLAN B ;)
|If I release the UA information of my browser, is very easy for the server to know the browser vulnerabilities. |
Remember: raw HTML cannot harm you, it's text.
The only other vulnerability that I've seen exploited is the meta redirect to an EXE file, but that too is blocked without permission.
However, if you're not using a legit UA, you'll get booted from all my sites and a lot of other sites I know.
|Now if someone is determined and wants to scrape the site's content, he can first go into the cached content of SEs |
Not on my sites, I use NOARCHIVE for all and have excluded myself from Archive.org.
For more reasons than just scraping, there are potential legal problems that arise from cache.
Go read [noarchive.net...] , lots of links back to WebmasterWorld on the topic.
|Now you did say to filter based on IPs. |
I said ranges of IPs for data centers.
When you aren't chasing each bot or badly behaved bot, instead just blocking entire data centers, your list is much smaller.
The small whitelist of allowed bots that punch thru the firewall is ahead of that list so processing the few good guys is very quick.
|But same goes for files if you have a huge number of files in the same folder |
I don't, I run a high volume site and the speed of my solution is of the utmost importance, it has to be transparent. The files are fragmented into buckets that never exceed an amount shown to have high performance. I'm also on Linux which has a faster directory system than Windows.
I used to work in operating systems and mass storage backup solutions in the early 80s so I do know a little about directory overhead issues and workarounds ;)
|That would stretch the database server anyways |
True, and I do generate tons of sessions daily but I generate a lot less since all the crap is filtered out from the beginning.
Besides, once you know what's a bot and what's not a bot you can skip session generation for all the Google, Yahoo and MSN bots, they won't be shopping anyway, just let them have the page. That'll speed your site up by simply stopping some overhead with normal bots.
Basically 40K IPs/day will average out to about 1,666 IPs/hour or roughly 28 IPs/minute. Obviously there will be peaks that exceed the average but it's a reasonable load for a dual Xeon box until it comes under a high speed attack.
On average I block a minimum of 500 IPs a day and none of it human.
|We did discuss at some point some of the translation services from search engines, we also have elite proxies |
Those are also blocked.
Obviously can't block man-in-the-middle, but rarely would someone go to that extreme to scrape unless it's CC #s.
|That number of compromised servers can give the scrapers all the resources they ever need. |
Which is why I block data centers, which brings you full circle. :)
[edited by: incrediBILL at 6:43 pm (utc) on Dec. 12, 2008]
-- Do Bot Blocking Techniques Alter Bot Behavior? --
Why do you want to know? ;)
Do they learn? Yes, they do!
1. Two years ago we wrote the script for the "guest book" to change the Action="h**p:www.some-bot-page.tld" when the request to the "guest book" is orphaned.
2. A year ago we registered that h**p:www.some-bot-page.tld to track the UFOs!
3. Sweet and buttery too! Oh, some one put a Logo over my face...
@phred -- MySql cpu overhead --
you could store the IP status in an application-scoped array before it gets to the MySQL call and then go from there.
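In Python terms, that application-scoped cache might look like this (a sketch; `db_lookup` stands in for whatever MySQL query the site actually runs, and a real cache would also need expiry):

```python
# Application-scoped: lives for the life of the process, shared across requests.
_ip_cache = {}

def ip_status(ip, db_lookup):
    """Check the in-memory cache first; only fall through to the database
    on a miss, then remember the answer so repeat hits from the same IP
    never touch MySQL again."""
    if ip not in _ip_cache:
        _ip_cache[ip] = db_lookup(ip)
    return _ip_cache[ip]
```

Since a scraper hammering the site repeats the same few IPs, almost all of its requests become pure dictionary lookups.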
Thanks for the extra info, Bill.
I can't be precise since I'm running over 50 sites on a single server, but extrapolating from a few of them I'd say I'm getting between 5,000 and 10,000 IPs per day, but on a couple of sites that could be a dozen users behind each of, say, ten firewall/cache services for schools which pushes up the bandwidth and CPU time. Generally I block an average of 35 new IPs per day but SQLI pushed that out of all proportion.
This is Windows 2003 on a 4-CPU machine about two years old. Windows, of course, is slow and cumbersome anyway.
I can see how your application could work. My own IP listing technique goes back several years, when the web was relatively quiet and only a few nasties got blocked per day. In fact the technique worked well - and still does apart from peaks like the SQLI overload. I'm coming around to your file idea but I'm also intrigued by the MySql solution. Wonder what I'm doing Christmas Day? :)
I wonder if, on Windows, the directory method would be slower and use more resources than MySql. I would have thought MySQL on keepalive would have coped. What version and when?
From my viewpoint as a server manager with several sites including informational and e-commerce, you won't be allowed into any of them without a valid UA AND certain other valid header combinations.
I've seen enough comments around this forum to know I'm by no means alone in this. I suspect you are only seeing sites that are not protected in any way at all. And THAT is dangerous because it means the sites are probably open to a lot of hacks and scrapes that could compromise them.
As to status code and switching IPs - my experience is that they rarely retry with the same or equally bad headers, perhaps a couple or three times, but mainly they come back to another site on the server with a new IP scavenged from idiots whose computers are part of a botnet. These are usually detectable and are equally blocked by IP.
In practice I think exploit searches are more important to this class of visitor than content scraping. The latter is fairly easy to counteract and I doubt multiple IPs are available to them. I doubt hackers are interested in further attempts on a site that blocks them, no matter what the method. Not unless the site were very high profile. It's much like rattling door-handles to see who's left the door unlocked.
Nice discussion .. this is good if you have access to the server. What about shared hosting? There are so many open source projects but none is created to address this issue.
On shared hosting, there are still many steps you can take, assuming that you have privileges to run mod_rewrite, ISAPI Rewrite, or the IIRF rewriting package mentioned above (Thanks Salzano!).
See the WebmasterWorld PERL and PHP forum libraries for two scripts: a PERL script that blocks IP addresses based on robots.txt violations, and a PHP script that detects excessively-fast requests (one aspect of non-human site access). Then look into IP address and user-agent whitelisting and blacklisting, reverse-DNS lookups, HTTP request header validation, and proxy header detection.
In short, there is a lot you can do on shared hosting, given a minimum required privilege level.
There is also a lot you can do on IIS servers without rewrite. As noted above, I don't have rewrite installed on my server. Instead I include a common script that parses various headers and checks for unwanted IPs. With the SQLI exception noted above this has worked fine using a common script across a slowly growing number of sites for the past several years.
If you only have one web site on a virtual host then this method can still be used, either with ASP on IIS or PHP on almost anything.
thanks for sharing info about php. I will try em out...
So, how does a less savvy person, someone who is largely GUI dependent and who is leasing a dedicated server, identify a person capable of implementing these solutions?
Where are such people found?
It's a bit frustrating to read about solutions knowing that I, and probably many others, are not up to the task of implementing these ideas.
|overload trying to read all of the IPs into memory for every hit |
RAM is exactly where the IP list should be stored. If you're cross checking each hit on your web site with a block list then reading the block list from disk each time is going to really slow your site down.
How big are the block lists that you guys are using? You can store 1 million IPs in well under 4MB of RAM, I don't understand why that would be problematic.
I really can't understand all these "solutions" that involve flat files and databases, they're going to slow your site down and that's precisely what you're trying to avoid by blocking the bots.
[edited by: mrMister at 5:05 pm (utc) on Dec. 15, 2008]
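For what it's worth, the 4MB-for-a-million-IPs figure works out if each address is packed into a 32-bit integer. A hedged Python sketch (names mine) using a sorted array and binary search:

```python
import socket
import struct
from array import array
from bisect import bisect_left

def pack(ip):
    """Convert a dotted-quad string to a 32-bit unsigned integer."""
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def build_blocklist(ips):
    """Sorted array of unsigned 32-bit ints: ~4 bytes per IP, so a million
    blocked addresses fit in roughly 4MB of RAM."""
    return array("I", sorted(pack(ip) for ip in ips))

def blocked(blocklist, ip):
    """Membership test is a binary search: O(log n) per hit, no disk I/O."""
    n = pack(ip)
    i = bisect_left(blocklist, n)
    return i < len(blocklist) and blocklist[i] == n
```

The list would be built once at startup (or on change) and held in memory, which is exactly the point being argued above.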
The problem is storing data in memory. No problem if you're programming in a real language such as C, with just a backup to disk, but in a kiddy-language such as ASP forget it. The problems are just too great. The only available solutions are to a) store data in app vars (high latency) or b) design a COM module to handle memory. Storing data in ordinary ASP vars or arrays is out because they can only manage a limited size before complaining.
I suspect the recovery time would still be slower than MySQL in any case. Not so much for single IPs perhaps, since they can be retrieved using instr or similar. The problem is having to extract a single IP from an x.x.x.0 - x.x.x.255 range. A properly indexed SQL database (or even Bill's folder idea) has to be faster and more convenient.