The converse is to allow certain bots (eg googlebot) and disallow all the others, but then one can never be sure when a useful new one turns up without trawling through the logs.
I favour blocking/logging on certain header and IP criteria within an actual page access and returning an appropriate status code (eg 403, 405). Not that a lot of bots take any notice either way. It would be useful if htaccess were a standard part of IIS, but as ever MS broke it, and the cost of adding a proprietary version to a multi-site server is prohibitive.
RAM is exactly where the IP list should be stored.
I range-block colos identified first-hand or through posts here at WW, most non-US IPs, and some people/organizations I consider slimy (M$ for example) - a relatively static list of 6300 or so unique ranges. I thought about a RAM-resident table but it wasn't obvious how to achieve that in a shared/virtual hosting environment using PHP - even though I switch all web pages through a central bit of PHP code where I do all the checking. So I went for a read-only MySQL IP range table.
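For anyone wanting to try the same thing, a minimal sketch of that kind of read-only range lookup might look like this (the table and column names are invented for illustration, not the actual schema):

<?php
// Sketch: check the visitor's IP against a table of blocked ranges.
// Assumes a table `blocked_ranges` with unsigned integer columns
// `range_start` and `range_end` (all names hypothetical).
$db = new mysqli('localhost', 'dbuser', 'dbpass', 'sitedb');
$stmt = $db->prepare(
    'SELECT 1 FROM blocked_ranges
     WHERE INET_ATON(?) BETWEEN range_start AND range_end LIMIT 1');
$stmt->bind_param('s', $_SERVER['REMOTE_ADDR']);
$stmt->execute();
$stmt->store_result();
if ($stmt->num_rows > 0) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}
?>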
Without giving away all my tricks, I can literally ping the OS to see if the file exists and know if the IP is bad or not based on the filename. Each IP gets its own filename and those are cleaned up as I go along.
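Purely as an illustration - the directory and naming scheme below are guesses, not the actual layout - the file-per-IP test is about as cheap as a lookup gets:

<?php
// Sketch: one empty file per bad IP; the file's existence is the
// whole lookup. /var/cache/badip/ is a hypothetical location.
$flag = '/var/cache/badip/' . $_SERVER['REMOTE_ADDR'];
if (file_exists($flag)) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}
// Banning later is just: touch($flag);  unbanning: unlink($flag);
?>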
I originally built velocity checking (fast/slow scrape) around AlexK's code. It turns out I want to accumulate more temp info than can be kept in the directory entry date/time fields, so I had to implement another MySQL table where I record "current" IPs and data about their behavior. I track flow through the site using the central switch/routing PHP code and can punish atypical human behavior or reward good behavior. I can also limit real human "scraping" to an extent - at least make them come back over a few days. I'd like to use just the operating system's file system for current IP data, but the only way I can see it working for me is if I actually write data into the file - probably not all that efficient.
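A rough sketch of that "current IPs" idea, with an invented schema and an arbitrary threshold (not AlexK's or Phred's actual numbers):

<?php
// Sketch: accumulate per-IP hit counts and flag anything moving
// faster than a human could. Assumes $db is an open mysqli
// connection and `current_ips` has columns ip (PRIMARY KEY),
// hits, first_hit - all hypothetical.
$ip = $db->real_escape_string($_SERVER['REMOTE_ADDR']);
$db->query("INSERT INTO current_ips (ip, hits, first_hit)
            VALUES ('$ip', 1, NOW())
            ON DUPLICATE KEY UPDATE hits = hits + 1");
$row = $db->query("SELECT hits,
                          TIMESTAMPDIFF(SECOND, first_hit, NOW()) AS secs
                   FROM current_ips WHERE ip = '$ip'")->fetch_assoc();
if ($row['secs'] > 0 && $row['hits'] / $row['secs'] > 2) {
    header('HTTP/1.0 403 Forbidden');   // faster than any human reads
    exit;
}
?>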
Phred
I suspect the recovery time would still be slower than MySQL in any case.
I half fancy doing a test to see what the performance difference is between in-memory and database lookups.
Checking an IP range (in a full block) should, in theory at least, be quicker than checking a single IP. In your example you only have to check 3 bytes instead of 4.
I'm not sure where Instr comes in to play. Anyone who's storing IP addresses as strings is doing it wrong!
[edited by: mrMister at 12:25 pm (utc) on Dec. 16, 2008]
The first time an IP hits the site it has no session, so it is compared to the Application-level array of banned IPs (fast scrape). If it's not there, it goes against the session array; if not there, then the UA and headers routine, then the hosting ranges table. If it's found there, the IP gets PRE-pended to the application array - the reason to pre-pend is that next time it hits, you can get out of the loop as soon as it's found. The last IP in that banned array gets dropped from it. If the IP passes the GoodHuman check it gets a session, so most DB calls are unnecessary from there on. If the browser cannot hold the session (cookies turned off - this happens very rarely; I guess it's down to the visitor type and audience) then a challenge should be presented.
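The pre-pend trick translates outside IIS too. A PHP analogy (APCu standing in for Application state; the key name and the cap of 500 are made up, and this is not Blend27's actual code):

<?php
// Sketch: most-recently-seen banned IPs kept at the front of a
// shared list so repeat offenders are found early.
$ip = $_SERVER['REMOTE_ADDR'];
$banned = apcu_fetch('banned_ips');
if ($banned === false) { $banned = array(); }
$pos = array_search($ip, $banned, true);
if ($pos !== false) {
    if ($pos > 0) {                      // pre-pend so the next hit
        unset($banned[$pos]);            // exits the loop immediately
        array_unshift($banned, $ip);
        apcu_store('banned_ips', array_slice($banned, 0, 500));
    }
    header('HTTP/1.0 403 Forbidden');
    exit;
}
?>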
Application-level arrays are also for good bots. The sites I've been involved with are usually crawled all day long by Google, but the IP address is somewhat the same. As for M$ and Y!, these are slightly different: M$ bots like to hunt in groups, whereas Y!'s little rascals just RAIN from every possible variation of 1-255 they can get away with from the "white list" of ranges.
So if in Bill's example it's 28 IPs/minute plus all the good bots - say another 50 - it should be OK, I guess. There are software packages that are almost impossible to detect, but then again stuff happens and every bot has its signature.
This is done on IIS server.
The use of instr is because the data is (in this case) read in from a file as a single string separated by CRLF, which makes an instr theoretically faster than an array check. But as I say, not really practical anyway.
Blend27 - the number of bots that return sessions is relatively small, so sessions are of little use against fast non-browser scrapers. I do hold them but they are only minimally useful in reducing lookups. I accept the possibility of storing IPs for browser-based scrapers in session vars, BUT it's still possible for the browser to hit a trap or change its UA and thus expose its nasty intentions, so one cannot rely on whitelisting via session-var'd IPs.
For known good IP ranges I prefer to hard-code the ranges in the parsing script. The list is small and this makes checking them fast. The script does need to be updated occasionally, but not often enough to be a chore. This has the added advantage that app var latency is of no account, and there is no disk storage/retrieval involved for backup beyond what is involved in loading the parsing script.
Alternatively, I could use a ramdisk for the task, but the performance savings would be negligible.
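Something like this, for illustration - the ranges shown are examples only, not a recommended whitelist, and the variable names are my own:

<?php
// Sketch: known-good crawler ranges hard-coded in the parsing script.
$goodRanges = array(
    array(ip2long('66.249.64.0'), ip2long('66.249.95.255')),   // eg Googlebot
    array(ip2long('65.52.0.0'),   ip2long('65.55.255.255')),   // eg msnbot
);
$ip = ip2long($_SERVER['REMOTE_ADDR']);
$knownGood = false;
foreach ($goodRanges as $r) {
    if ($ip >= $r[0] && $ip <= $r[1]) { $knownGood = true; break; }
}
?>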
the number of bots that return sessions are relatively small
:-) exactly the case in my limited experience... Hits from one IP without returning a session are one of the additional items I track. The way the site is structured, in many cases I can pretty much tell it's a bot after just a couple of hits, independent of any velocity info.
Phred
in many cases I can pretty much tell it's a bot after just a couple of hits, independent of any velocity info.
Exactly, behavior analysis will catch them faster than anything.
However, I'm seeing some behave more and more like a browser just to avoid that sort of thing, including site spammers.
There's irony in the fact that we block bots to conserve resources (bandwidth/cpu) which has driven them to use more resources attempting to mask the fact that they're bots in the first place.
Oh the tangled world wide web we weave...
[edited by: incrediBILL at 2:30 am (utc) on Dec. 17, 2008]
This one sends back a valid cookie, and would also be blocked by hosting ranges, but it's blocked on the first request, before it ever gets to hosting ranges. The proxy header part is only 10% of the formula.
IP: 194.8.75.nnn (currently in Project HoneyPot and a local pool of man-made forums created to be spammed, but that's beside the point...)
Headers:
Cookie: CFID=11nnn111; CFTOKEN=22nnn22
Accept: */*
Proxy-Connection: Keep-Alive
Host: www.example.com
User-Agent: Opera/9.0 (Windows NT 5.1; U; en)
Content-Length: 0
------------------------
request_method: GET
server_protocol: HTTP/1.0
It's funny how they do it, but it gets trapped and blocked for a short while...
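For what it's worth, one signal from that kind of formula might be sketched like this (the scores are invented, and as Blend27 says the proxy header is maybe 10% of a real formula):

<?php
// Sketch: a Proxy-Connection header plus HTTP/1.0 is rare from
// real browsers; score it rather than block on it alone.
$suspicion = 0;
if (isset($_SERVER['HTTP_PROXY_CONNECTION'])) { $suspicion += 10; }
if (isset($_SERVER['SERVER_PROTOCOL'])
        && $_SERVER['SERVER_PROTOCOL'] === 'HTTP/1.0') {
    $suspicion += 5;
}
?>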
BTW: You could always return a 200 with the preferred IBL formula (they decide on it), if you know where your content ends up ;) - don't tell me you never thought of that one...
Blend27
Don't tell me you can block based on referrer, because you cannot.
Ex. Referrer:
[google.com...]
where example.com is the site you want to browse. And at the top someone can make combinations with the keywords (or, and, the, for, etc).
And the same goes for the UA. You may block my blank one because I expose it willingly, but you won't block someone else's because he uses a fake UA (one that appears perfectly valid - very easy to do) while scraping content from your site.
For me the referrer is out of the question as grounds for a decision. The UA I only use to determine whether someone intends to enter as a bot or a regular visitor, and it has to be consistent throughout a session, but again it is not grounds to ban/block on its own. The data centers Bill mentioned earlier are efficient, but I have seen a couple of false positives there too. One of the IPs for the Slurp bot was blacklisted for some reason and I had to modify code to allow it.
Don't tell me you can block based on referrer, because you cannot
Too sweeping a statement there I think:
RewriteCond %{HTTP_REFERER} example\.com [NC]
RewriteRule .* - [F]
Sure, it's not foolproof, but it will deal with a lot of automated pests.
As for your example:
RewriteCond %{HTTP_REFERER} google [NC]
RewriteCond %{HTTP_REFERER} site [NC]
RewriteRule .* - [F]
what happens if someone fakes the referrer?
Then you hope to catch them with a different filter - most here will be using a combination of multiple methods to separate bots from humans, and if a human deliberately chooses to fake or conceal their referrer or user-agent they can have no complaints if they are refused access.
had to modify code
I would imagine that most of us do so frequently.
We are shooting moving targets, and expect an occasional miss.
And we are all entitled to set the rules for our sites as we see fit - I don't intercept blank referrers myself, but if another webmaster wants to serve them an interstitial page or block them outright then that is their prerogative.
I don't think anyone here claims there is one perfect method of bot control.
Looking for one is probably a doomed quest.
...
I started to word a reply similar to yours and then realized it was not what they are discussing (i.e., general techniques).
Don
Do Bot-Blocking Techniques Alter Bot Behaviour?
Yes.
Some bots (or their masters) are quicker to adapt, some are programmed to change fast.
Different bots use different techniques at different times and require different responses.
Only eternal vigilance and a combination of techniques can hope to thwart them.
There is no magic bullet.
...
I do block on referer but these are very specific and generally based on keywords or bad domains.
Most faked UAs are obvious to some degree or fall into a general "be wary" category to be cross-checked against other characteristics. Blank UAs are VERY suspicious. They're blocked. No argument. You want me to talk to you, you be polite and at least tell me which train you arrived on.
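That rule is about as simple as bot-blocking gets; a sketch (not dstiles' actual code):

<?php
// Sketch: refuse blank User-Agents outright.
if (empty($_SERVER['HTTP_USER_AGENT'])) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}
?>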
Samizdata - I have reservations about whether bots adapt. I suspect the majority don't, although targeted scrapers might if the bot-driver really wants your site. On our server not many do.
I think most bots visit so many sites that the occasional rejection is just ignored. Sure, they come back next pass, but my logs suggest they haven't changed anything in the meantime. Why should they? Well over 99% of sites will let them in so why bother with anti-social webmasters like us?
Let's face it, a lot of bots are bored-kid-in-bedroom playthings. They have little idea what a bot is but manage to drive a simple one until they eventually discover girls. Most of these come from dynamic IP ranges. Others are possibly good-intentioned let's-build-a-new-google attempts from dedicated servers but those are usually blocked by IP range anyway. If they turn out to be any good they will eventually come to light here and be provisionally accepted.
Let's face it, a lot of bots are bored-kid-in-bedroom playthings.
That accounts for some of the stuff but others are very organized, often corporate data mining operations or scrapers involved in criminal activities using your content to lure victims into their snares.
Those types of data mining operations are the most likely to try to adapt, because without access to the content they're out of business.
The scary part is where I most often find my scrapings is on the criminal sites leading to malware.
[edited by: incrediBILL at 10:50 pm (utc) on Dec. 18, 2008]
As far as I can tell these are home users with high-speed browser attachments but I'm still working on that diagnosis. Certainly I can't so far persuade google to return secondary sources for specific keywords, and since these keywords are product parts I would expect it to if it were a commercial scrape.
The problem with detecting on speed is that broadband is getting ludicrously fast. Is it a scraper or just some idiot trying to be clever? I know I should slow them down but see previous discussions. :)
The problem with detecting on speed is that broadband is getting ludicrously fast.
Connection speed isn't the issue, it's how fast a human can read that's the issue.
I block the pre-fetch technology so that when people hit my site they don't get 20 pages they'll never read on the first access.
So then you have people with feed readers and other new bookmark tools in Firefox and such with asinine innovations such as "Open all in tabs" and they do it with 20-40 pages attempting to open all at once and BOOM! it locks them out ;)
Beyond that silliness, a real human can't visually process a page in under a second so loading 3 pages in a mere second is enough to trigger a trap.
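A sketch of that kind of trap, using session storage as a simplification (which, as noted earlier in the thread, only works for clients that hold a session; the threshold follows the 3-pages-in-a-second rule of thumb above):

<?php
// Sketch: trip a trap if more pages are requested in one second
// than a human could visually process.
session_start();
$now = time();
if (!isset($_SESSION['hits'])) { $_SESSION['hits'] = array(); }
$_SESSION['hits'][] = $now;
$recent = 0;
foreach ($_SESSION['hits'] as $t) {
    if ($t >= $now - 1) { $recent++; }
}
$_SESSION['hits'] = array_slice($_SESSION['hits'], -10); // keep it small
if ($recent > 3) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}
?>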
I have reservations about whether bots adapt
I didn't mean to suggest Darwinism of any sort.
But a trawl through this forum will surely find cases where bot behaviour has changed over time (due to a rethink and code amendment by the botmaster).
The other suggestion made by enigma1 was that some bots are programmed to try different approaches if they hit a 403, and I believe that may be the case, though as you say they are far from the majority.
why bother with anti-social webmasters like us?
You are my kind of people :)
...
The other thing is to look at what kind of bots we are talking about. A scraper doesn't need sophisticated code. The script can be just 50 lines of PHP and run reliably. With PHP, all it needs is an fsockopen, send the right headers, and retrieve the content from a given set of pages. It can also be done via jscript or via an ad, by anyone who simply browses another site: the browser downloads an ajax script, and once it runs it retrieves data from another site and uploads the info to the server, completely transparently to the visitor. So in essence the visitor becomes the man in the middle for this kind of service.
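To illustrate how little it takes - a stripped-down sketch of the sort of fetcher described, not anyone's actual bot:

<?php
// Sketch: fsockopen, plausible headers, read the body. That's all.
$host = 'www.example.com';
$fp = fsockopen($host, 80, $errno, $errstr, 10);
if ($fp) {
    fwrite($fp, "GET /page.html HTTP/1.0\r\n" .
                "Host: $host\r\n" .
                "User-Agent: Mozilla/5.0 (Windows NT 5.1)\r\n" .
                "Connection: close\r\n\r\n");
    $page = '';
    while (!feof($fp)) { $page .= fgets($fp, 4096); }
    fclose($fp);
}
?>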
And the UA, referrer, even the IP will be legitimate for all these filters we are talking about here. And if you detect and ban an IP that does it, it may very well be a dynamic one.
@dstiles, there are tools/programs that, once installed (eg firewalls, antivirus, browser plugins etc), block the UA and Referrer fields. They may have some cryptic option like "surf privately" or something. So what you're blocking could really be real visitors who might buy something from your store. My point was that if you rely on these headers to take a decision you will end up with false positives more often. And no, I am not trying to talk you out of it - each one of us does what he thinks best for his site. Having said that, this diversity makes it extremely hard to set up a browser securely and still be able to browse the web without hitting restricted pages. Other webmasters rely on cookies: if they see the cookie is not accepted by a human they block access. Others block access if the jscripts don't run, and so on.
Most "scrapes" I see tend to be more than a second between hits - well, sometimes maybe a couple in a "second" with a step beyond for the next fetch. Could be because a lot of hits are from UK broadband, which is still a bit slower than some places in the world (one company just announced 50Mbps cable), but my main server is also fairly slow on a 10Mbps bandwidth so that may be part of the reason.
And that is a problem for the future: high-speed downloads using accelerators or other browser add-ons working on connections that are far faster than a lot of servers. Half a dozen hits from something like that simultaneously and the server collapses (I already had something like that with one site and had to move it to a faster new server). I'm convincing myself I really do need to implement slow-down techniques. :(
Samizdata - I'm not saying bots don't evolve but my own experience suggests it's relatively slow apart from UAs that rotate anyway. Could be (probably is!) we're seeing different types of bots, possibly due to different site content and location (do bots geo-locate?). But I agree: longer term they are bound to evolve - define "longer term". :)
Enigma1 - I see very few false positives through blank UAs. Most FPs are through badly modified UAs (eg recent threads re: nested Mozillas and broken AV scanners). In fact the most persistent blanks come through google proxies on the 72.14.193.* range.
If a firewall or proxy blocks all of the headers (as some do) how can the correct content be served up? There are several reasons for serving up different content to different types of browsers and robots and by no means all are black-hat. For example, I block contact details and forms from robots. What do I serve up to a blank UA? Is it a nasty robot or a paranoid (present company excepted!) person?
Apropos which, I have one site whose customers really are paranoid and often try to hide behind proxies or block referers. They've been like this for about a decade now. And some of them complain if the web site won't let them in. I have yet to have a complaint from a blank UA.
There are combinations of browser characteristics that CAN be relied on to indicate baddies. Our problem isn't that: it's the baddies who are good at masquerading as goodies that we need to resolve.
You can't rely on cookies (nor javascript for that matter). I block cookies on a lot of sites and I know a lot of others do. I enable them if I think the site won't work without (eg shopping) but if I'm never likely to visit the site again (which is usually the case) I leave them off or perhaps only enable them for a session. Judging by the requirement for browser cookie-blockers I'd say I'm by no means alone. Exactly the same applies to javascript (I think that's what you're saying anyway).
Is that right or am I missing something?
More info: of 87 blank UAs this month, only 17 came with referers. Of those, almost all would have failed on other browser characteristics. Possibly some or even all were genuine browsers behind very strict firewalls or proxies that wouldn't identify themselves in any way, but since they wouldn't admit their browser identities they didn't gain access.
In any case, about half of those with referers came from unexpected country IP blocks, including countries known for scraping and having no obvious legitimate need for the site they hit. Ditto for the referer-less ones. In both cases several came from IPs that responded to an http request, indicating there were servers behind the IP. Not all such are bad, but in certain circumstances it can add to the evidence.
And as I've said before, several come in on a google "proxy" - or at least something that has a multi-purpose google IP with no rDNS.
The use of instr is because the data is (in this case) read in from a file as a single string separated by CRLF, which makes an instr theoretically faster than an array check. But as I say, not really practical anyway.
As I say, if you're using strings then you're doing it wrong.
An IP address is nothing more than a 4-byte integer. If you're using strings to record IP addresses then you're going to be using something like 16 bytes per single IP address!
Using a byte array saved to disk, it will only cost 4 bytes for each individual IP, and no contiguous IP range should need more than 8 bytes. Any full Class A, B or C range that you have blocked should only be costing 1, 2 or 3 bytes respectively. If you're using 16 bytes for a single range then that's a massive overhead.
However, it's not just the excessive memory use. If you're using string manipulation to determine whether a particular IP is in a list of ranges then you're going to be using massively more CPU time than you should be.
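To put the arithmetic in PHP terms (mrMister is presumably on IIS, so this is only an illustration of the principle, not his code):

<?php
// Sketch: IPs as integers rather than strings.
$ip = ip2long('192.168.10.77');     // 4-byte integer, not a 13-char string
$raw = pack('N', $ip);              // exactly 4 bytes for disk storage
// The "3 bytes instead of 4" check for a whole /24:
$blockedC = ip2long('192.168.10.0');
if (($ip & ~0xFF) === $blockedC) {
    // inside 192.168.10.0/24 - one mask and one integer compare,
    // versus scanning a 16-byte-per-entry string list with instr
}
?>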
[edited by: mrMister at 3:30 am (utc) on Dec. 24, 2008]