The converse is to allow certain bots (eg googlebot) and disallow all the others, but then one can never be sure when a useful new one turns up without trawling through the logs.
I favour blocking/logging on certain header and IP criteria within an actual page access and returning an appropriate status code (eg 403, 405). Not that a lot of bots take any notice either way. It would be useful if htaccess were a standard part of IIS, but as ever MS broke it, and the cost of adding a proprietary version to a multi-site server is prohibitive.
RAM is exactly where the IP list should be stored.
I range-block colos identified first-hand or through posts here at WW, most non-US IPs, and some people/organizations I consider slimy (M$ for example) - a relatively static list of 6300 or so unique ranges. I thought about a RAM-resident table but it wasn't obvious how to achieve that in a shared/virtual hosting environment using PHP - even though I switch all web pages through a central bit of PHP code where I do all the checking. So I went for a read-only MySQL IP range table.
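For anyone wanting to try the same thing, a minimal sketch of that kind of read-only range lookup might look like this (the table and column names are invented for illustration, not the actual schema):

<?php
// Sketch: check the visitor's IP against a table of blocked ranges.
// Assumes a table `blocked_ranges` with unsigned integer columns
// `range_start` and `range_end` (all names hypothetical).
$db = new mysqli('localhost', 'dbuser', 'dbpass', 'sitedb');
$stmt = $db->prepare(
    'SELECT 1 FROM blocked_ranges
     WHERE INET_ATON(?) BETWEEN range_start AND range_end LIMIT 1');
$stmt->bind_param('s', $_SERVER['REMOTE_ADDR']);
$stmt->execute();
$stmt->store_result();
if ($stmt->num_rows > 0) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}
?>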
Without giving away all my tricks, I can literally ping the OS to see if the file exists and know if the IP is bad or not based on the filename. Each IP gets its own filename and those are cleaned up as I go along.
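Purely as an illustration - the directory and naming scheme below are guesses, not the actual layout - the file-per-IP test is about as cheap as a lookup gets:

<?php
// Sketch: one empty file per bad IP; the file's existence is the
// whole lookup. /var/cache/badip/ is a hypothetical location.
$flag = '/var/cache/badip/' . $_SERVER['REMOTE_ADDR'];
if (file_exists($flag)) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}
// Banning later is just: touch($flag);  unbanning: unlink($flag);
?>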
I originally built velocity checking (fast/slow scrape) around AlexK's code. It turns out I want to accumulate more temp info than can be kept in the directory entry date/time fields, so I had to implement another MySQL table where I record "current" IPs and data about their behavior. I track flow through the site using the central switch/routing PHP code and can punish atypical human behavior or reward good behavior. I can also limit real human "scraping" to an extent - at least make them come back over a few days. I'd like to use just the operating system's file system for current IP data, but the only way I can see it working for me is if I actually write data into the file - probably not all that efficient.
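A rough sketch of that "current IPs" idea, with an invented schema and an arbitrary threshold (not AlexK's or Phred's actual numbers):

<?php
// Sketch: accumulate per-IP hit counts and flag anything moving
// faster than a human could. Assumes $db is an open mysqli
// connection and `current_ips` has columns ip (PRIMARY KEY),
// hits, first_hit - all hypothetical.
$ip = $db->real_escape_string($_SERVER['REMOTE_ADDR']);
$db->query("INSERT INTO current_ips (ip, hits, first_hit)
            VALUES ('$ip', 1, NOW())
            ON DUPLICATE KEY UPDATE hits = hits + 1");
$row = $db->query("SELECT hits,
                          TIMESTAMPDIFF(SECOND, first_hit, NOW()) AS secs
                   FROM current_ips WHERE ip = '$ip'")->fetch_assoc();
if ($row['secs'] > 0 && $row['hits'] / $row['secs'] > 2) {
    header('HTTP/1.0 403 Forbidden');   // faster than any human reads
    exit;
}
?>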
Phred
I suspect the recovery time would still be slower than MySQL in any case.
I half fancy doing a test to see what the performance difference is between in-memory and database lookups.
Checking an IP range (in a full block) should, in theory at least, be quicker than checking a single IP. In your example you only have to check 3 bytes instead of 4.
I'm not sure where Instr comes in to play. Anyone who's storing IP addresses as strings is doing it wrong!
[edited by: mrMister at 12:25 pm (utc) on Dec. 16, 2008]
The first time an IP hits the site it has no session, so it is compared to the Application-level array of banned IPs (fast scrape). If it's not there, it goes against the session array; if not there, then the UA and headers routine, then the hosting ranges table. If it's found there, the IP gets PRE-pended to the application array - the reason to pre-pend is that next time it hits, you can get out of the loop as soon as it's found. The last IP in that banned array gets dropped from it. If the IP passes the GoodHuman check it gets a session, so most DB calls are unnecessary from there on. If the browser cannot hold the session (cookies turned off - this happens very rarely; I guess it's down to the visitor type and audience) then a challenge should be presented.
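The pre-pend trick translates outside IIS too. A PHP analogy (APCu standing in for Application state; the key name and the cap of 500 are made up, and this is not Blend27's actual code):

<?php
// Sketch: most-recently-seen banned IPs kept at the front of a
// shared list so repeat offenders are found early.
$ip = $_SERVER['REMOTE_ADDR'];
$banned = apcu_fetch('banned_ips');
if ($banned === false) { $banned = array(); }
$pos = array_search($ip, $banned, true);
if ($pos !== false) {
    if ($pos > 0) {                      // pre-pend so the next hit
        unset($banned[$pos]);            // exits the loop immediately
        array_unshift($banned, $ip);
        apcu_store('banned_ips', array_slice($banned, 0, 500));
    }
    header('HTTP/1.0 403 Forbidden');
    exit;
}
?>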
Application-level arrays are also for good bots. The sites I've been involved with are usually crawled all day long by Google, but the IP address is somewhat the same. As for M$ and Y!, these are slightly different: M$ bots like to hunt in groups, whereas Y!'s little rascals just RAIN from every possible variation of 1-255 they can get away with from the "white list" of ranges.
So if in Bill's example it's 28 IPs/minute plus all the good bots - say another 50 - it should be OK, I guess. There are software packages that are almost impossible to detect, but then again stuff happens and every bot has its signature.
This is done on IIS server.
The use of instr is because the data is (in this case) read in from a file as a single string separated by CRLF, which makes an instr theoretically faster than an array check. But as I say, not really practical anyway.
Blend27 - the number of bots that return sessions is relatively small, so sessions are of little use against fast non-browser scrapers. I do hold them but they are only minimally useful in reducing lookups. I accept the possibility of storing IPs for browser-based scrapers in session vars, BUT it's still possible for the browser to hit a trap or change its UA and thus expose its nasty intentions, so one cannot rely on whitelisting via session-var'd IPs.
For known good IP ranges I prefer to hard-code the ranges in the parsing script. The list is small and this makes checking them fast. The script does need to be updated occasionally, but not often enough to be a chore. This has the added advantage that app var latency is of no account, and there is no disk storage/retrieval involved for backup beyond what is involved in loading the parsing script.
Alternatively, I could use a ramdisk for the task, but the performance savings would be negligible.
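Something like this, for illustration - the ranges shown are examples only, not a recommended whitelist, and the variable names are my own:

<?php
// Sketch: known-good crawler ranges hard-coded in the parsing script.
$goodRanges = array(
    array(ip2long('66.249.64.0'), ip2long('66.249.95.255')),   // eg Googlebot
    array(ip2long('65.52.0.0'),   ip2long('65.55.255.255')),   // eg msnbot
);
$ip = ip2long($_SERVER['REMOTE_ADDR']);
$knownGood = false;
foreach ($goodRanges as $r) {
    if ($ip >= $r[0] && $ip <= $r[1]) { $knownGood = true; break; }
}
?>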
the number of bots that return sessions are relatively small
:-) exactly the case in my limited experience... Hits from one IP without returning a session are one of the additional items I track. The way the site is structured, in many cases I can pretty much tell it's a bot after just a couple of hits, independent of any velocity info.
Phred
in many cases I can pretty much tell it's a bot after just a couple of hits, independent of any velocity info.
Exactly, behavior analysis will catch them faster than anything.
However, I'm seeing some behave more and more like a browser just to avoid that sort of thing, including site spammers.
There's irony in the fact that we block bots to conserve resources (bandwidth/cpu) which has driven them to use more resources attempting to mask the fact that they're bots in the first place.
Oh the tangled world wide web we weave...
[edited by: incrediBILL at 2:30 am (utc) on Dec. 17, 2008]
This one sends back a valid cookie, and would also be blocked by hosting ranges, but it's blocked on the first request, before it ever gets to hosting ranges. The proxy header part is only 10% of the formula.
IP: 194.8.75.nnn (currently in Project HoneyPot and a local pool of man-made forums created to be spammed, but that's beside the point...)
Headers:
Cookie: CFID=11nnn111; CFTOKEN=22nnn22
Accept: */*
Proxy-Connection: Keep-Alive
Host: www.example.com
User-Agent: Opera/9.0 (Windows NT 5.1; U; en)
Content-Length: 0
------------------------
request_method: GET
server_protocol: HTTP/1.0
It's funny how they do it, but it gets trapped and blocked for a short while...
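For what it's worth, one signal from that kind of formula might be sketched like this (the scores are invented, and as Blend27 says the proxy header is maybe 10% of a real formula):

<?php
// Sketch: a Proxy-Connection header plus HTTP/1.0 is rare from
// real browsers; score it rather than block on it alone.
$suspicion = 0;
if (isset($_SERVER['HTTP_PROXY_CONNECTION'])) { $suspicion += 10; }
if (isset($_SERVER['SERVER_PROTOCOL'])
        && $_SERVER['SERVER_PROTOCOL'] === 'HTTP/1.0') {
    $suspicion += 5;
}
?>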
BTW: You could always return a 200 with the preferred IBL formula (they decide on it), if you know where your content ends up ;) - don't tell me you never thought of that one...
Blend27
Don't tell me you can block based on referrer, because you cannot.
Ex. Referrer:
[google.com...]
where example.com is the site you want to browse. And at the top someone can make combinations with the keywords (or, and, the, for, etc).
And the same goes for the UA. You may block my blank one because I expose it willingly, but you won't block someone else's because he uses a fake UA (one that appears perfectly valid - very easy to do) while scraping content from your site.
For me the referrer is out of the question as grounds for a decision. The UA I only use to determine whether someone intends to enter as a bot or a regular visitor, and it has to be consistent throughout a session, but again it is not grounds to ban/block on its own. The data centers Bill mentioned earlier are efficient, but I have seen a couple of false positives there too. One of the IPs for the Slurp bot was blacklisted for some reason and I had to modify code to allow it.
Don't tell me you can block based on referrer, because you cannot
Too sweeping a statement there I think:
RewriteCond %{HTTP_REFERER} example\.com [NC]
RewriteRule .* - [F]
Sure, it's not foolproof, but it will deal with a lot of automated pests.
As for your example:
RewriteCond %{HTTP_REFERER} google [NC]
RewriteCond %{HTTP_REFERER} site [NC]
RewriteRule .* - [F]
what happens if someone fakes the referrer?
Then you hope to catch them with a different filter - most here will be using a combination of multiple methods to separate bots from humans, and if a human deliberately chooses to fake or conceal their referrer or user-agent they can have no complaints if they are refused access.
had to modify code
I would imagine that most of us do so frequently.
We are shooting moving targets, and expect an occasional miss.
And we are all entitled to set the rules for our sites as we see fit - I don't intercept blank referrers myself, but if another webmaster wants to serve them an interstitial page or block them outright then that is their prerogative.
I don't think anyone here claims there is one perfect method of bot control.
Looking for one is probably a doomed quest.
...
I started to word a reply similar to yours and then realized it was not what they are discussing (i.e., general techniques).
Don
Do Bot-Blocking Techniques Alter Bot Behaviour?
Yes.
Some bots (or their masters) are quicker to adapt, some are programmed to change fast.
Different bots use different techniques at different times and require different responses.
Only eternal vigilance and a combination of techniques can hope to thwart them.
There is no magic bullet.
...
I do block on referer but these are very specific and generally based on keywords or bad domains.
Most faked UAs are obvious to some degree or fall into a general "be wary" category to be cross-checked against other characteristics. Blank UAs are VERY suspicious. They're blocked. No argument. You want me to talk to you, you be polite and at least tell me which train you arrived on.
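That rule is about as simple as bot-blocking gets; a sketch (not dstiles' actual code):

<?php
// Sketch: refuse blank User-Agents outright.
if (empty($_SERVER['HTTP_USER_AGENT'])) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}
?>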
Samizdata - I have reservations about whether bots adapt. I suspect the majority don't, although targeted scrapers might if the bot-driver really wants your site. On our server not many do.
I think most bots visit so many sites that the occasional rejection is just ignored. Sure, they come back next pass, but my logs suggest they haven't changed anything in the meantime. Why should they? Well over 99% of sites will let them in so why bother with anti-social webmasters like us?
Let's face it, a lot of bots are bored-kid-in-bedroom playthings. They have little idea what a bot is but manage to drive a simple one until they eventually discover girls. Most of these come from dynamic IP ranges. Others are possibly good-intentioned let's-build-a-new-google attempts from dedicated servers but those are usually blocked by IP range anyway. If they turn out to be any good they will eventually come to light here and be provisionally accepted.
Let's face it, a lot of bots are bored-kid-in-bedroom playthings.
That accounts for some of the stuff but others are very organized, often corporate data mining operations or scrapers involved in criminal activities using your content to lure victims into their snares.
Those types of data mining operations are the most likely to try to adapt, because without access to the content they're out of business.
The scary part is where I most often find my scrapings is on the criminal sites leading to malware.
[edited by: incrediBILL at 10:50 pm (utc) on Dec. 18, 2008]
As far as I can tell these are home users with high-speed browser attachments but I'm still working on that diagnosis. Certainly I can't so far persuade google to return secondary sources for specific keywords, and since these keywords are product parts I would expect it to if it were a commercial scrape.
The problem with detecting on speed is that broadband is getting ludicrously fast. Is it a scraper or just some idiot trying to be clever? I know I should slow them down but see previous discussions. :)
The problem with detecting on speed is that broadband is getting ludicrously fast.
Connection speed isn't the issue, it's how fast a human can read that's the issue.
I block the pre-fetch technology so that when people hit my site they don't get 20 pages they'll never read on the first access.
So then you have people with feed readers and other new bookmark tools in Firefox and such with asinine innovations such as "Open all in tabs" and they do it with 20-40 pages attempting to open all at once and BOOM! it locks them out ;)
Beyond that silliness, a real human can't visually process a page in under a second so loading 3 pages in a mere second is enough to trigger a trap.
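A sketch of that kind of trap, using session storage as a simplification (which, as noted earlier in the thread, only works for clients that hold a session; the threshold follows the 3-pages-in-a-second rule of thumb above):

<?php
// Sketch: trip a trap if more pages are requested in one second
// than a human could visually process.
session_start();
$now = time();
if (!isset($_SESSION['hits'])) { $_SESSION['hits'] = array(); }
$_SESSION['hits'][] = $now;
$recent = 0;
foreach ($_SESSION['hits'] as $t) {
    if ($t >= $now - 1) { $recent++; }
}
$_SESSION['hits'] = array_slice($_SESSION['hits'], -10); // keep it small
if ($recent > 3) {
    header('HTTP/1.0 403 Forbidden');
    exit;
}
?>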
I have reservations about whether bots adapt
I didn't mean to suggest Darwinism of any sort.
But a trawl through this forum will surely find cases where bot behaviour has changed over time (due to a rethink and code amendment by the botmaster).
The other suggestion made by enigma1 was that some bots are programmed to try different approaches if they hit a 403, and I believe that may be the case, though as you say they are far from the majority.
why bother with anti-social webmasters like us?
You are my kind of people :)
...
The other thing is to look at what kind of bots we are talking about. A scraper doesn't need sophisticated code. The script can be just 50 lines of PHP and run reliably. With PHP, all it needs is an fsockopen, send the right headers, and retrieve the content from a given set of pages. It can also be done via jscript or via an ad, by anyone who simply browses another site: the browser downloads an ajax script, and once it runs it retrieves data from another site and uploads the info to the server, completely transparently to the visitor. So in essence the visitor becomes the man in the middle for this kind of service.
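To illustrate how little it takes - a stripped-down sketch of the sort of fetcher described, not anyone's actual bot:

<?php
// Sketch: fsockopen, plausible headers, read the body. That's all.
$host = 'www.example.com';
$fp = fsockopen($host, 80, $errno, $errstr, 10);
if ($fp) {
    fwrite($fp, "GET /page.html HTTP/1.0\r\n" .
                "Host: $host\r\n" .
                "User-Agent: Mozilla/5.0 (Windows NT 5.1)\r\n" .
                "Connection: close\r\n\r\n");
    $page = '';
    while (!feof($fp)) { $page .= fgets($fp, 4096); }
    fclose($fp);
}
?>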
And the UA, referrer, even the IP will be legitimate for all these filters we are talking about here. And if you detect and ban an IP that does it, it may very well be a dynamic one.
@dstiles, there are tools/programs that, once installed (eg firewalls, antivirus, browser plugins etc), block the UA and Referrer fields. They may have some cryptic option like "surf privately" or something. So what you're blocking could really be real visitors who might buy something from your store. My point was that if you rely on these headers to take a decision you will end up with false positives more often. And no, I am not trying to talk you out of it - each one of us does what he thinks best for his site. Having said that, this diversity makes it extremely hard to set up a browser securely and still be able to browse the web without hitting restricted pages. Other webmasters rely on cookies: if they see the cookie is not accepted by a human they block access. Others block access if the jscripts don't run, and so on.
Most "scrapes" I see tend to be more than a second between hits - well, sometimes maybe a couple in a "second" with a step beyond for the next fetch. Could be because a lot of hits are from UK broadband, which is still a bit slower than some places in the world (one company just announced 50Mbps cable), but my main server is also fairly slow on a 10Mbps bandwidth so that may be part of the reason.
And that is a problem for the future: high-speed downloads using accelerators or other browser add-ons working on connections that are far faster than a lot of servers. Half a dozen hits from something like that simultaneously and the server collapses (I already had something like that with one site and had to move it to a faster new server). I'm convincing myself I really do need to implement slow-down techniques. :(
Samizdata - I'm not saying bots don't evolve but my own experience suggests it's relatively slow apart from UAs that rotate anyway. Could be (probably is!) we're seeing different types of bots, possibly due to different site content and location (do bots geo-locate?). But I agree: longer term they are bound to evolve - define "longer term". :)
Enigma1 - I see very few false positives through blank UAs. Most FPs are through badly modified UAs (eg recent threads re: nested Mozillas and broken AV scanners). In fact the most persistent blanks come through google proxies on the 72.14.193.* range.
If a firewall or proxy blocks all of the headers (as some do) how can the correct content be served up? There are several reasons for serving up different content to different types of browsers and robots and by no means all are black-hat. For example, I block contact details and forms from robots. What do I serve up to a blank UA? Is it a nasty robot or a paranoid (present company excepted!) person?
Apropos which, I have one site whose customers really are paranoid and often try to hide behind proxies or block referers. They've been like this for about a decade now. And some of them complain if the web site won't let them in. I have yet to have a complaint from a blank UA.
There are combinations of browser characteristics that CAN be relied on to indicate baddies. Our problem isn't that: it's the baddies who are good at masquerading as goodies that we need to resolve.
You can't rely on cookies (nor javascript for that matter). I block cookies on a lot of sites and I know a lot of others do. I enable them if I think the site won't work without (eg shopping) but if I'm never likely to visit the site again (which is usually the case) I leave them off or perhaps only enable them for a session. Judging by the requirement for browser cookie-blockers I'd say I'm by no means alone. Exactly the same applies to javascript (I think that's what you're saying anyway).
Is that right or am I missing something?
More info: of 87 blank UAs this month, only 17 came with referers. Of those, almost all would have failed on other browser characteristics. Possibly some or even all were genuine browsers behind very strict firewalls or proxies that wouldn't identify themselves in any way, but since they wouldn't admit their browser identities they didn't gain access.
In any case, about half of those with referers came from unexpected country IP blocks, including countries known for scraping and having no obvious legitimate need for the site they hit. Ditto for the referer-less ones. In both cases several came from IPs that responded to an http request, indicating there were servers behind the IP. Not all such are bad, but in certain circumstances it can add to the evidence.
And as I've said before, several come in on a google "proxy" - or at least something that has a multi-purpose google IP with no rDNS.
The use of instr is because the data is (in this case) read in from a file as a single string separated by CRLF, which makes an instr theoretically faster than an array check. But as I say, not really practical anyway.
As I say, if you're using strings then you're doing it wrong.
An IP address is nothing more than a 4-byte integer. If you're using strings to record IP addresses then you're going to be using something like 16 bytes per single IP address!
Using a byte array saved to disk, it will only cost 4 bytes for each individual IP, and no contiguous IP range should need more than 8 bytes. Any full Class A, B or C range that you have blocked should only be costing 1, 2 or 3 bytes respectively. If you're using 16 bytes for a single range then that's a massive overhead.
However, it's not just the excessive memory use. If you're using string manipulation to determine whether a particular IP is in a list of ranges then you're going to be using massively more CPU time than you should be.
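To put the arithmetic in PHP terms (mrMister is presumably on IIS, so this is only an illustration of the principle, not his code):

<?php
// Sketch: IPs as integers rather than strings.
$ip = ip2long('192.168.10.77');     // 4-byte integer, not a 13-char string
$raw = pack('N', $ip);              // exactly 4 bytes for disk storage
// The "3 bytes instead of 4" check for a whole /24:
$blockedC = ip2long('192.168.10.0');
if (($ip & ~0xFF) === $blockedC) {
    // inside 192.168.10.0/24 - one mask and one integer compare,
    // versus scanning a 16-byte-per-entry string list with instr
}
?>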
[edited by: mrMister at 3:30 am (utc) on Dec. 24, 2008]