The Whitelist
Key Elements
tangor - msg:4640883 - 12:13 am on Jan 29, 2014 (gmt 0)

Whitelisting access to a web site is easier than blacklisting. One is who you let in, the other is endless whack-a-mole. To start:

.htaccess to allow all comers access to robots.txt

robots.txt allows a short list (b, g, y, for example); all others: 403

UAs allowed, a slightly longer list, but still very limited. No match: 403

Getting granular: referer keywords (this is a blacklist, but still pretty short): 403

These are my basic tools. Are there others, or perhaps refinements? And where does whitelisting fail? And how do you poke holes for desired traffic? (I know HOW, but have yet to really find a need to do it.)
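For illustration, a minimal .htaccess sketch of the above (Apache mod_rewrite assumed; the UA and referer patterns are placeholders, not a recommended list):

RewriteEngine On

# robots.txt stays open to all comers
RewriteRule ^robots\.txt$ - [L]

# referer keyword blacklist (placeholder patterns): 403
RewriteCond %{HTTP_REFERER} (poker|pharma|warez) [NC]
RewriteRule .* - [F]

# UA whitelist (placeholder patterns): anything not matching gets 403
RewriteCond %{HTTP_USER_AGENT} !(googlebot|bingbot|slurp|mozilla|opera) [NC]
RewriteRule .* - [F]

The robots.txt side of the same policy is just a short list of allowed crawlers plus "User-agent: * / Disallow: /" for everyone else.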

These are concepts I've been using for a few years, but I suspect there are other methods, so this is a request for discussion on whitelisting in general. I went down the blacklisting hole for a number of years--with the subsequent heartburn and agitation--until making the paradigm shift from taking out the bad to allowing the good. And, though I'm not unhappy with the results so far, I wonder if the whitelist might be losing some potential traffic.

Your thoughts?

Thanks.

 

not2easy - msg:4640903 - 1:52 am on Jan 29, 2014 (gmt 0)

UAs allowed, a slightly longer list, but still very limited. No match: 403


Does whitelisting have a way to keep up with the wide variety of mobile browsers in use? How can you discern unlabeled bots? More and more of them can only be detected by their activity. Lots of these use badly configured UAs, but they are catching on.

Don't get me wrong, I'd like to have enough faith in whitelisting to use it, but I find so much (junk) in access logs. If I never found new tactics, UAs or camouflage attempts in the access logs, maybe then.

tangor - msg:4640909 - 2:33 am on Jan 29, 2014 (gmt 0)

Does whitelisting have a way to keep up with the wide variety of mobile browsers in use? How can you discern unlabeled bots? More and more of them can only be detected by their activity. Lots of these use badly configured UAs, but they are catching on.

Whitelisting does not mean ignoring logs or activity. That is regular service on the site and leads to adjustments to the whitelist... that said, even the unlabeled traffic won't be fit traffic most of the time. A 403, for me, is "take a look and adjust" if need be.

I know it sounds crazy, but if out of 16B hits per day I get 1M real ones, I'm not that disappointed. (Exaggerated numbers; pick your traffic and go from there.)

trintragula - msg:4641056 - 6:10 pm on Jan 29, 2014 (gmt 0)

On my site, I use some algorithmic filters to separate the bots from the browsers and whitelist a handful of the bots and all browsers.
It's not 100% accurate but it appears to work pretty well.
In the last year I've had to add a couple of new keywords to catch new browser technologies - both mobile.
I don't manually collect any blacklists, and don't currently subscribe to any online services.
This is in PHP though: I'm not using robots.txt or .htaccess very much.
We also don't send 403s - we send login prompts. As we're a forum, to a bot we just look as if we've turned off guest access, while visitors we think are browsers get guest access. This seems like a gentler option, given that our filters can't always be right. It's also not so obviously a challenge to the bots.

lucy24 - msg:4641086 - 9:10 pm on Jan 29, 2014 (gmt 0)

I'd like to have enough faith in whitelisting to use it, but I find so much (junk) in access logs.

There's an underlying philosophical position that applies in many aspects of life, not just on your www site. It goes like this:

#1 Mistakes Will Be Made.
#2 Given that there will be mistakes, you've got an a priori choice: punish the innocent or let the guilty go unpunished?

Either way, your 403 page should be written for humans, because they are the ones who will look at it. That's why mine says, in effect, "I'm heartbroken to have to tell you that the server thinks you are a robot" even though in reality the server only "thinks" what I have told it to think.

dstiles - msg:4641101 - 10:10 pm on Jan 29, 2014 (gmt 0)

Tangor - UAs are not very reliable. They are very easy to forge - even the major browsers allow you to switch them. I would recommend examining all the HTTP headers. I won't say more about those here as I'm sure the forum is monitored by, er, blackhats? :(

The other point is: block all server farms. Those are the sources for many bots, good and bad. You can only block botnet access by serious attention to means. A common feature of botnet accesses is the use of "known" UAs to bull their way in: googlebot, of course, but I've seen endless variations on current and out-of-date browser UAs.
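As a purely generic illustration (not dstiles' actual checks, which he is deliberately keeping to himself), a header sanity test can look for things a real browser almost always sends; the header choices and threshold below are placeholders:

<?php
// Generic illustration only: crude bots often omit headers that real
// browsers send on every request.
$suspicious = 0;
if (empty($_SERVER['HTTP_ACCEPT']))          $suspicious++;
if (empty($_SERVER['HTTP_ACCEPT_LANGUAGE'])) $suspicious++;
if (empty($_SERVER['HTTP_ACCEPT_ENCODING'])) $suspicious++;
// HTTP/1.0 combined with a modern browser UA is another odd pairing
if ($_SERVER['SERVER_PROTOCOL'] === 'HTTP/1.0'
        && isset($_SERVER['HTTP_USER_AGENT'])
        && stripos($_SERVER['HTTP_USER_AGENT'], 'Mozilla') !== false) {
    $suspicious++;
}
if ($suspicious >= 2) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}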

trintragula - msg:4641113 - 11:51 pm on Jan 29, 2014 (gmt 0)

I think about 10% of the bots I block are not who they say they are. I'm not blocking/blacklisting any server farms, but they don't figure prominently in the traffic that my filters let through. In fact, at the moment, I've had 40 guests in the last 15 minutes. The unblocked ones are from comcast, sbcglobal, btinternet, googlebot.com, yahoo.net, etc. The blocked ones are from 163data.com.cn, vpn999.com, amazonaws.com, mail.ru, etc. There is rarely any overlap.
There are alternatives to blacklisting.

tangor - msg:4641129 - 1:28 am on Jan 30, 2014 (gmt 0)

The other point is: block all server farms. Those are the sources for many bots, good and bad. You can only block botnet access by serious attention to means.

Conversely, you whitelist acceptable IPs and disregard the rest. Depends on your intended audience.

I'm looking for whitelisting methods, not blacklisting... though both are useful, and sometimes necessary. Whitelisting is who I include/allow; blacklisting is not. A fine point, I know, but that's the approach I'm taking, and I'm looking to make it work better.

keyplyr - msg:4641147 - 5:24 am on Jan 30, 2014 (gmt 0)

Another opinion - The difference between whitelisting & blacklisting can be compared in this perspective:

When developing a whitelist strategy, many legit users will be blocked until you determine that they are in fact legit users. They may never come back.

When developing a blacklist strategy, many bad agents will get through until you determine that they are in fact bad agents. When they come back, they'll be blocked.

Ergo, it is prudent to use a combination of blacklisting & whitelisting along with header checks, UA filters and server-side scripting in developing an effective defense.

trintragula - msg:4641173 - 11:04 am on Jan 30, 2014 (gmt 0)

I dry-ran my filters for several weeks and monitored who was getting flagged, so by the time I threw the switch to actually start blocking, I was pretty confident that the right visitors were getting blocked.

dstiles - msg:4641309 - 8:09 pm on Jan 30, 2014 (gmt 0)

> Conversely, you whitelist acceptable IPs and disregard the rest.

If I did that I would need to whitelist more IPs than I currently blacklist. But then, some people block the whole of Europe, Asia and South America. Our web server gets valid as well as invalid hits from all over the world. Many of these are transients, a small number become regulars.

I have reached the state where my server blocks most bad hits, whether or not I've seen the IPs before, and blocks only a very few goodies.

trintragula - msg:4642120 - 3:39 pm on Feb 3, 2014 (gmt 0)

I'm still interested to talk more about this.

Whitelisting is only part of the non-blacklisting story for me.
My approach is:

1. separate the bots from the browsers using several standalone software filters
2. allow the browsers through
3. stop all bots except those on a short whitelist

I think you can do a certain amount of separating and whitelisting using just .htaccess pattern matching, but I don't think it's enough in general to be satisfactory unless your audience is unusually easily identifiable. .htaccess may be more expressive than I'm aware of, but I don't really use it at all, so I'm no expert.

Is there a thread somewhere here about bot-blocking software? I rolled my own, but I realise that many people are not in a position to do that.

wilderness - msg:4642144 - 5:35 pm on Feb 3, 2014 (gmt 0)

I'm still interested to talk more about this.


If you're referring to white-listing?
Good luck with that.
Users are willing to explain "I do this or I do that", however any examples of syntax (except for a couple of brief, fundamental ones) are non-existent in the open forum.

Whitelisting is only part of the non-blacklisting story for me.
My approach is:

1. separate the bots from the browsers using several standalone software filters


Why reinvent the wheel?
There are some log analysis software packages that work both online (if you have your own server) and offline.
One example is "Analog [analog.cx]"

2. allow the browsers through

A list of browser UA's (along with the many variations of each; e.g., the recent Yahoo thread) would take years to accumulate.

3. stop all bots except those on a short whitelist


Any examples of syntax (except for a couple of brief, fundamental ones) are non-existent in the open forum.

I think you can do a certain amount of separating and whitelisting using just .htaccess pattern matching, but I don't think it's enough in general to be satisfactory unless your audience is unusually easily identifiable.


You only believe that because you've not used or implemented htaccess extensively.

RegEx is very powerful, either for the simplest of techniques, or the most complicated of techniques.

.htaccess may be more expressive than I'm aware of, but I don't really use it at all, so I'm no expert.


lucy is an example of a person with RegEx skills that are leaps and bounds above the rest of us.
Just roll up your sleeves and get to work!

Is there a thread somewhere here about bot-blocking software?


There are a couple of old Perl or PHP methods (scripts), links to which were supplied recently in another thread, that people are still using.

Bill (others also use some similar methods) uses extensive PHP methods for indexing, identifying and tracking bots; however, despite his repeated mention of their existence (and their results), the syntax has never been provided in the open forum.

I rolled my own, but I realise that many people are not in a position to do that.


IMO, nothing beats going through logs manually; however, if your website's daily traffic is very high, then manual viewing is impossible.

lucy24 - msg:4642217 - 11:35 pm on Feb 3, 2014 (gmt 0)

leaps and bounds above the rest of us.

Aargh, don't talk to me about Regular Expressions. I just discovered that one of my sites was down for 12 hours because when I edited out a no-longer-needed |\.pdf from a pattern, I inadvertently deleted the following close-parenthesis as well.

In theory, anything can be done in htaccess. But beyond a certain point it's safer to detour to a php script.

Whitelisting is much simpler if none of your visitors use cell phones ;)

iomfan - msg:4642224 - 12:06 am on Feb 4, 2014 (gmt 0)

1. separate the bots from the browsers using several standalone software filters

What standalone software?
I use a combination of whitelisting and blacklisting, and all I need for that is .htaccess :)

A) Some simple pre-filtering (see examples below) weeds out the bulk of the unsavory access
B) Bots are managed via a combination of white listing and access control: approved bots get access to information rich "main pages" but not to images, not to information poor pages like forms, not to pages that only interested humans would need, such as price lists or contact information, and not to incidental information like SSI
C) A "fool the scammers" site design: most unsavory access (including screenshot and thumbnail scrapers) is to a domain's index page, so I have decided to make those pages irrelevant, meaning, my index pages contain only minimal information (and bots are told they are "noindex"), and human users quickly click through to the information rich second page. The bots that are welcome on my sites "understand" what's going on and scammers get what they deserve: nothing of value...
D) Certain information only of use to serious visitors (such as price lists or certain videos) can only be accessed from within the site in question

An illustration of the pre-filtering I employ:

1) Everybody has access to robots.txt and error messages (403, etc.)

2) Bots are tightly managed. Those that have "Googlebot" or "Googlebot-Mobile" in the UA are allowed in IF they come from a Google domain; otherwise they are served a zero byte file. Bots (or other users) that come from a Google domain WITHOUT an acceptable UA also get the zero byte file. Bots that have "msnbot" or "bingbot" in the UA are allowed IF they come from an msn.com domain; otherwise zero bytes. Bots (or other users) that come from an msn.com domain and show neither "msnbot" nor "bingbot" in the UA get the same as well. And so on, for a dozen other bots whose visits I appreciate (a sketch of the Googlebot check appears at the end of this post). This also means that bots coming in from hosts where reverse name lookup does not work get no access. The settings in robots.txt (which is based on whitelisting) match those in .htaccess, and there are many bots that are well behaved but unwanted on my sites; those get "User-agent: * Disallow: /" and never look for more than robots.txt. Then there are a few bots that have shown they do not play by the rules (example: Baidu), and they are managed more strictly: anything from a related IP block gets a zero byte file.

3) KNOWN malicious or otherwise unwanted access results in a zero byte file being sent (examples: looking for filenames like "admin" or "config"; using methods like "put" or "trace"; having terms like "harvest" or "image" in the user agent).

E) HTTP/1.0 gets denied with a (human) user friendly explanation

F) In some cases site access is restricted to a specific country - the wanted bots are allowed in regardless of that restriction, the known junk is pre-filtered, and everybody else who happens to not get in sees a user friendly explanation with a feedback button and an invitation to use it if they think they should have access (this helps me fix errors in the access whitelist for the country in question).

Looking at the log files, I fine-tune the settings from time to time...
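A sketch of the Googlebot part of point 2 above, assuming mod_rewrite and that %{REMOTE_HOST} is populated (HostnameLookups or an equivalent reverse-DNS mechanism); /empty.txt stands in for the zero byte file:

RewriteEngine On

# UA claims to be Googlebot but the host is not a Google domain: zero bytes
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_HOST} !\.googlebot\.com$ [NC]
RewriteCond %{REMOTE_HOST} !\.google\.com$ [NC]
RewriteRule .* /empty.txt [L]

# Host is a Google crawler domain but the UA is not an accepted Googlebot UA
RewriteCond %{REMOTE_HOST} \.googlebot\.com$ [NC]
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteRule .* /empty.txt [L]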

m2c

iomfan - msg:4642225 - 12:17 am on Feb 4, 2014 (gmt 0)

Aargh, don't talk to me about Regular Expressions. I just discovered that one of my sites was down for 12 hours because when I edited out a no-longer-needed |\.pdf from a pattern, I inadvertently deleted the following close-parenthesis as well.

Been there... done that... therefore, after an .htaccess file edit, I ALWAYS check to confirm there is no error 500. ;)

incrediBILL - msg:4642693 - 4:28 pm on Feb 5, 2014 (gmt 0)

Separating bots from browsers can be simplified somewhat using browscap.ini, see [browscap.org...] for details and keep that updated on a cron job.

Basically, use crowd-sourced user agent identification.
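For example (assuming the browscap directive in php.ini points at an up-to-date browscap.ini, refreshed by that cron job), PHP's get_browser() exposes the crowd-sourced data directly:

<?php
// Sketch only: get_browser() needs the "browscap" setting in php.ini.
$info = get_browser(null, true);   // data for the current User-Agent

if (!empty($info['crawler'])) {
    // browscap says this UA is a crawler; apply whatever policy you use
    // (whitelist check, 403, and so on)
}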

Considering the sheer volume of people chasing down user agents and identifying them, it has always baffled me that there isn't more stuff like this out there in a format everyone can use.

Sure, I could volunteer to build a list, but I too have a day job, and Gary did a great job with browscap, so why reinvent the wheel?

trintragula - msg:4643003 - 1:30 pm on Feb 6, 2014 (gmt 0)

Thanks, there are some useful-looking ideas there.
1. separate the bots from the browsers using several standalone software filters

What standalone software?

I only meant standalone in the sense of standalone filters - those requiring no external black-lists that need ongoing manual updates. The filters I use are not separated from the rest of my website code.

The filters are implemented using PHP and MySQL (a sketch of filter 1 follows after this list):
1. separate bots from browsers by matching against a fixed list of keywords: a short list of words like 'Mozilla' or 'Opera' indicates a browser; words like 'bot', 'spider', 'crawl' indicate a bot. UAs containing neither are also assumed to be bots. This catches 90% of the bots we identify with a few lines of PHP - the ones that aren't pretending to be a browser. A handful of bots are whitelisted explicitly; the browsers are not.
2. keep a list of recent requests and watch for bursts of requests that are too fast to be possible for a human with a browser.
3. watch for requests for pages which are not followed up with requests for icons.
4. watch for requests for links on a page that are not visible to humans.
5. watch for visitors who take an unreasonable number of pages in a short period.
There are other possibilities, but this is most of what I'm using so far.
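A sketch of filter 1 (the keyword lists and the whitelist below are illustrative placeholders, not the actual lists):

<?php
// Illustrative only: classify a UA as whitelisted bot, bot, or browser.
function classify_ua($ua) {
    $whitelist    = array('googlebot', 'bingbot');                // bots let through
    $botWords     = array('bot', 'spider', 'crawl', 'curl', 'wget');
    $browserWords = array('mozilla', 'opera');

    $ua = strtolower($ua);
    foreach ($whitelist as $w)    { if (strpos($ua, $w) !== false) return 'whitelisted-bot'; }
    foreach ($botWords as $w)     { if (strpos($ua, $w) !== false) return 'bot'; }
    foreach ($browserWords as $w) { if (strpos($ua, $w) !== false) return 'browser'; }
    return 'bot';   // no browser keyword at all: assume bot
}

$class = classify_ua(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '');
if ($class === 'bot') {
    // soft block: show the login prompt instead of a 403, as described earlier
}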

These filters can afford to be conservative because they don't issue 403s - humans always have a means to recover access.

The thing I like about this approach is that it does not require ongoing attention as part of its normal operation: most new bots are blocked automatically; most new browsers are not.
It's not 100% accurate, but nothing is, and with 'soft blocking' the consequences of being wrong are not bad. As it is, the entire public content of my site is in Google's cache anyway, so it's not as if this is really going to prevent content theft very much, unless Google are really good at stopping bots from scraping their cache.

Blocking the bots has removed 2/3 of the traffic on our site, and most of the older bots that are still trying are asking only for the pages that were present before I put in the bot blocker, and are not actually hitting us very often now. At this point, about 15% of our traffic is identifiable bots (mostly being refused access), 20% is known registered users, and most of the other 65% resembles human behaviour (it might be largely bot nets for all we know - archive.is certainly got in under the wire - but it doesn't appear to be coming from server farms).

What I've built is a working prototype. There are actual available tools out there that do bot blocking to some degree. I'm aware of a couple (ZBblock and Bad Behavior) and I would assume there are many more. There are doubtless better tools than I have built.
I think it would be good to find out more about them.

I definitely agree with iBill - it seems mad to have everyone building their own lists.
I think it's still better to use methods that don't depend on indefinite lists at all if you can.

wilderness - msg:4643008 - 2:05 pm on Feb 6, 2014 (gmt 0)

I definitely agree with iBill - it seems mad to have everyone building their own lists.


When it comes to UA's and IP's there's no one-size-fits-all for all websites and all webmasters.

Each person must determine what is beneficial or detrimental to their own site(s).

If the task were standard, then hosting servers would have a universal implementation in CP available for all their customers, and that is simply not so. Most CP options don't even create error-free syntax.

trintragula - msg:4643021 - 3:43 pm on Feb 6, 2014 (gmt 0)

Indeed, one size does not fit all. Let's explore that a little bit.

I think there are two stages to deciding what we do with our visitors:
1. mechanism: putting them into categories (e.g. bot or not)
2. policy: deciding what to do with each category (e.g. block or not)

With regard to categories:
1. You can make your own lists of visitors from each category
2. You can use someone else's list (either by dynamic lookup, or by download)
3. You can use a software filter

With regard to what you do with them:
1. You can let them through
2. You can send them a 403
3. You can send them a message
4. You can allow them to change their category by correcting your guess
5. You can just treat them differently on an ongoing basis

For the most part we're all picking one or more points in this 2D space.

There are also many possible categories:
1. Browser
2. Good Bot
3. Badly behaved bot (according to various criteria)
4. Known Spammer
5. Known Hacker
6. Known Content thief
7. Member of server farm, or other distinguished ASN
8. IP from a given country
9. Old Browser
10. Download tool
11...
The list is potentially long, but not as long as the list of UAs. What categories a visitor can be placed in is not generally a judgement call: it rests on evidence of behaviour or identity. This is mechanism. The judgement (policy) is in what you do with them.
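One way to picture the split (the category names and actions below are hypothetical): the mechanism produces a category, and the policy is just a lookup against a table you control:

<?php
// Hypothetical policy table: mechanism decides the category,
// policy decides what each category gets.
$policy = array(
    'browser'        => 'allow',
    'good bot'       => 'allow',
    'bad bot'        => 'deny',
    'server farm IP' => 'deny',       // or 'allow': a policy choice, not a fact
    'old browser'    => 'challenge',
);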

* BrowsCap is a list that distinguishes user agents as bot or browser (among other things).
* NirSoft have a database of IP blocks by country that you can download (google "nirsoft country"). Probably there are others.
* domaintools knows the ASNs and CIDRs associated with each IP address. If it were possible to use an API to get this for free, that would be great, but I think it's monetised. I don't know if there's a legal free source of this information.
Maybe there are other lists available out there.

One common feature of all lists of this kind is that they are all in constant churn, so if you use them you have to keep up to date, either by doing your own research, or by signing up for some update process.

I'm presenting the above just as a way of thinking about the issues. I make no claims of authority about this.

lucy24 - msg:4643068 - 7:26 pm on Feb 6, 2014 (gmt 0)

7. Member of server farm, or other distinguished ASN

I don't think this is a separate category. People don't block server farms on those grounds alone; they block them because it's a simple means of excluding robots.

dstiles - msg:4643079 - 8:00 pm on Feb 6, 2014 (gmt 0)

It's not possible to rely solely on browscap, even if server farms are excluded.

Botnets include many DSL IPs and are often used with a User-Agent du jour (Firefox, Opera, Chrome, whatever) that is an accurate copy, or at least a reasonable facsimile, of a true browser User-Agent, though usually (but not always) around a year out of date. Also, the browscap UAs are woefully outdated in many cases - Firefox/1.n anyone?

It may be useful to accept browscap's list of bots but if those few are whitelisted anyway there is little point.

There are other tell-tales that have to be investigated before accepting browscap UAs, both browser and bot. Once browscap gets involved one has to then filter out unlikely UAs and IPs so it's probably down to manually-maintained lists anyway.

incrediBILL - msg:4643087 - 8:36 pm on Feb 6, 2014 (gmt 0)

It's not possible to rely solely on browscap, even if server farms are excluded.


It's not possible to rely exclusively on any one technology; that was never the recommendation. Browscap is a good UA validator, nothing more.

Bot blocking is like peeling the layers off an onion: start with robots.txt for the nice bots, .htaccess for the not-so-nice bots, then filtering bot names from browser names, validating browser user agents, validating browser headers, validating the source of the IP address as residential, business or hosting, and so on and so forth.

You use all of the methodologies, organized to perform various checks along the way, and if any one of them fires off a warning shot then the IP is blocked from further access.

Worrying about which technology does this or that, or how effective it is, is kind of meaningless, as each one is just a single voice in the bot-blocking choir.
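As a sketch of that layering (every function name below is hypothetical, standing in for whichever checks you actually implement):

<?php
// Hypothetical pipeline: each check is one voice in the choir; the first
// one that objects blocks the IP from further access.
$checks = array('ua_is_acceptable', 'headers_are_sane', 'ip_is_not_hosting');
foreach ($checks as $check) {
    if (!call_user_func($check)) {                    // false means the check failed
        block_ip($_SERVER['REMOTE_ADDR'], $check);    // hypothetical helper
        header('HTTP/1.1 403 Forbidden');
        exit;
    }
}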

keyplyr - msg:4643088 - 8:37 pm on Feb 6, 2014 (gmt 0)

I agree. Many tout "I whitelist" as if this alone were some higher ascended consciousness capable of curing all woes. In reality, it's a combination of blacklisting & whitelisting along with header checks, UA filters, behavior filters and server-side scripting that builds an effective defense (in the proper linear server-response order).

trintragula - msg:4643109 - 9:29 pm on Feb 6, 2014 (gmt 0)

7. Member of server farm, or other distinguished ASN

I don't think this is a separate category. People don't block server farms on those grounds alone; they block them because it's a simple means of excluding robots.


There have been a couple of replies since I started writing this and possibly they overlap with this. But here it is anyway:

@lucy24
The idea with the list of categories is that these are 'mechanism'. Membership of these categories is mechanically verifiable, either by some authority (such as ARIN), or by behaviour.

Separately, there's the policy of what you do with members of those categories: your choice whether to use membership of a server farm as grounds for blocking, or not: some people may choose to block some or all server farms, some may not. But before you block them, you identify them.

@dstiles
I think most of us will use multiple mechanisms to implement our policies.
A useragent can tell you a visitor is a bot, but it can't tell you it isn't.

On my site I have no manually maintained lists of UAs or IPs, and no subscriptions to such lists. If a visitor shows up with Firefox/1.n or some other improbably old browser UA, my User Agent filter will not stop them, and I haven't looked far into finding a way to fix that. I'm certainly not going to list them all by hand.
Having said that, visitors with unusually old browser UAs rarely get in, but generally because they get caught in other ways.

Following up iBill and Keyplyr,
I agree. I think I'm saying much the same thing.

I guess the one sentence summary of what I'm saying (which may be surprising) is that with a reasonable suite of software checks, you can avoid having any manually maintained open-ended lists at all, and still keep out the guys you don't want to see on your site.

moTi - msg:4643144 - 12:10 am on Feb 7, 2014 (gmt 0)

allow the browsers through

First of all, it is useless to check the UA string, as it is being faked, period. We want a solution that is hassle-free, low-maintenance and nearly perfect. I have other things to do than professionally maintain an htaccess blacklist for the spammers and scrapers.

Task: detect humans, weed out all other traffic except desired bots.

My solution (deliberately vague):

1. a client-side script that executes a server-side script that logs IP + reverse DNS.
2. a server-side script that logs IP + reverse DNS.
3. match those against each other.
4. except for the few desirable bots: send 403 on a mismatch.

Works flawlessly since day one (a rough sketch of the idea follows at the end of this post).

Two problems:

1. JavaScript (which is essential for my sites anyway)
2. headless browsers (which I don't have under control yet)
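A rough sketch of the beacon idea above (file names and log paths are made up for illustration): the page includes something like <script src="/beacon.php"></script>, and the beacon logs the same data the normal page handler logs, so the two logs can be matched afterwards.

<?php
// beacon.php - only reached by clients that actually execute JavaScript.
// IPs that appear in the ordinary access log but never here are presumed
// to be bots, except for the whitelisted crawlers.
$ip   = $_SERVER['REMOTE_ADDR'];
$host = gethostbyaddr($ip);                    // reverse DNS, as in steps 1 and 2
$line = date('c') . ' ' . $ip . ' ' . $host . "\n";
file_put_contents('/var/log/site/js-visitors.log', $line, FILE_APPEND | LOCK_EX);
header('Content-Type: application/javascript');
echo '/* noop */';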

tangor - msg:4643223 - 8:55 am on Feb 7, 2014 (gmt 0)

Thanks to all the folks who participated. I've been heading in some of the same directions for my "whitelist quest".

As we all know, it all evolves over time and will continue to do so.

Again, thanks!

blend27 - msg:4643235 - 10:33 am on Feb 7, 2014 (gmt 0)

@moTi - 2. headless browsers (which I don't have under control yet)

mouseMove + mouseOver, or a CSS/media-query content overlay style, plus touch (ontouchstart or onmsgesturechange) for mobile UAs - think like a human... triggered by an unusual request not from.... (say you know where your average visitor is from).

Angonasec - msg:4643289 - 3:04 pm on Feb 7, 2014 (gmt 0)

A useful thread.

Wilderness:
"...hosting servers would have a universal implementation in CP available for all their customers..."

I suggested (privately) to my site hosting service that it'd surely be attractive to their customers to offer the +option+ of being hosted on a server running a config file that bans problem nations such as the BRICs.

Volunteering to be on that server would be at volunteers' own risk, naturally.

The idea was promptly scorned. They're locked in 90's www thinking, despite seeing their server CPUs glowing with over 60% bot traffic, most of it undesirable if not malicious.

Wake up, hosting providers: there's a niche market awaiting you.

trintragula - msg:4643343 - 5:32 pm on Feb 7, 2014 (gmt 0)

@moTi
Getting the client to execute some javascript and verifying it happened is a good idea. I've used this for spam prevention, but hadn't thought of applying it to more general bot blocking. I'll think seriously about doing that.
--
There's something of an arms race here, much as there is with spam: when a defensive measure becomes popular enough to be a nuisance to the offenders, they'll figure out what they have to do to work around it. In this case, that means building better and better human simulators.

I hear reports that Google are already running javascript on pages they visit. Doubtless some of the other bots will follow suit with headless browsers if they haven't already.

When building more and more elaborate human-detectors, it's worth bearing in mind that regardless of what the human has to do, the only clue the server gets is a stream of bytes that the client sends back based on the stream of bytes it sent. The reply doesn't have to be generated the way you expected.
