
Search Engine Spider and User Agent Identification Forum

    
How do you decide on an IP ban?
Status_203




msg:4150339
 9:41 am on Jun 10, 2010 (gmt 0)

I find this forum interesting but feel a little out of my depth. I do a little bit of bot handling - mostly white listing certain SEs, cloaking robots.txt to all others and putting a few technical hurdles in the way of the stupider bots. What I haven't had the confidence to do, but get the impression is becoming increasingly important with the spread of cloud based hosting, is to get into banning at the IP level.

Say you've determined that a visitor from IP address 256.257.258.259 (yes, I know ;) ) is a bot that you would prefer to block but that you cannot, for whatever reason, rely on the user-agent.

How do you determine:

a) whether that IP address belongs to an organisation that only supplies hosting rather than access to any significant (more than just employees) number of human beans? Is there anything that can be done other than checking out their website? Do you need a certain traffic level before you can be confident that if they provided legitimate access you would be seeing it in your logs/analytics?

b) the entire range containing that IP? Is this just based on what hits your site or can it be looked up? I've tried a few searches for tools in this area but I don't really know how to phrase the search.

c) whether the organisation has any other IP ranges that you might want to block? Clearly cannot be done from logs/analytics before showing up in said logs/analytics.

 

wilderness




msg:4150522
 3:13 pm on Jun 10, 2010 (gmt 0)

How do you determine:

a) whether that IP address belongs to an organisation that only supplies hosting rather than access to any significant (more than just employees) number of human beans? Is there anything that can be done other than checking out their website? Do you need a certain traffic level before you can be confident that if they provided legitimate access you would be seeing it in your logs/analytics?


You've got multiple questions here, but they're out of order (i.e., the "b)" section of your inquiry is really the first order of business):
1) "backbone" (i.e., server farm, reseller, colocation, or whatever you choose to call them). Generally speaking there is NOT any benefit in potential visitor traffic from other websites or visitors using this type of IP/hosting.
2) the backbone is in the business of selling its services and generally will NOT deceive the public in presenting its service products.
3) Most of these types of pests will use a bare-bones initial visit to test the waters. If you're able to recognize them in these test-the-waters visits, that's enough in most instances. Others will simply appear and grab every page on your site(s), although the latter has become less and less common in recent years (see this thread [webmasterworld.com]).

b) the entire range containing that IP? Is this just based on what hits your site or can it be looked up? I've tried a few searches for tools in this area but I don't really know how to phrase the search.


There are three major regional registries covering most of the world (ARIN, RIPE, and APNIC; there are a few smaller ones as well), ALL of which offer a "whois" search.
Enter the IP from your logs and the entire range (including any registered sub-nets) will be provided in the results.

There are other websites and/or tools which use the data from these organisations and/or pass queries through to them, but why deviate when the source is available?
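For anyone who would rather script the lookup than use the web forms, below is a minimal Python sketch of a raw whois query over port 43 (RFC 3912). The starting server (whois.arin.net) and the IP are just placeholders; the registry returns the registered block containing the address, or refers you on to RIPE/APNIC/LACNIC/AFRINIC if the space isn't theirs.

import socket

def whois_query(query, server="whois.arin.net", port=43):
    """Send a raw whois query (RFC 3912) and return the text response."""
    with socket.create_connection((server, port), timeout=10) as sock:
        sock.sendall((query + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

# Placeholder IP: the response shows the registered range (and any sub-nets)
# containing it, or a referral to another registry.
print(whois_query("203.0.113.45"))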


c) whether the organisation has any other IP ranges that you might want to block? Clearly cannot be done from logs/analytics before showing up in said logs/analytics.


I recently provided an example of doing a name search at the registry (i.e., ARIN) in this thread [webmasterworld.com] (see "PPPoX Pool").

Summary... In the end, each webmaster must decide what is beneficial or detrimental to their own website(s).
There's NO rule of thumb. A visitor (IP range or otherwise) that I choose to deny may be a visitor that another webmaster desires.

Most folks today are using "white listing" (denying nearly all visitors by default and allowing only specific User Agents or IP ranges), rather than "black listing" (denying specific User Agents or IP ranges).
The problem with presenting usable examples in these open and readable forums is that the pest bots and harvesters may also read the criteria for access and then modify their future User Agents. As a result, most longtimers here are very reluctant to post examples of "white listing".

dstiles




msg:4150834
 10:26 pm on Jun 10, 2010 (gmt 0)

Lacnic, Afrinic and JPnic are also important registries. The first two are for countries that form a significant source of individual DSL "bad hits" (but by no means the only ones - I get more from Arin areas than from all three of these). :(

My primary defense is UA detection, not only specific UAs but generalised ones as well (robots.txt only mops up a few genuine-ish bots). The UA trap blocks access at this level and logs the fact in a "this is new" log (banned IPs get logged in a different log to simplify viewing, bots in another). Depending on the type of UA the IP may get automatically blocked, sometimes with no warning; some such as MSIE screw-ups / downloaders (eg bsalsa) may get a warning message, possibly with a complaints form.

Over several years this set of logs, plus a lot of extra help from this forum, has led to about 1000 server ranges being blocked (these are mostly ones that have hit my sites or shown up in various forums such as this one). If a detected IP is in a server / hosting range then that's the end of the matter: dead. Likewise certain persistent DSL or business ranges.

Exceptions to this are a few good bots which have sub-records to "drill a hole" in these blocks to whitelist the bot (eg cuil). An increasing number of bots are now publishing their bot IPs, so it's worth looking for them on the bots' pages. Bots in the Clouds stand no chance here even if they are good because they do not present a consistent IP.

Discovering whether an IP is used for hosting or not is tedious: I'm still discovering new ranges now, after all this time (but I have a more efficient system for trapping and logging now).

Easiest way of checking for hosting services is to check rDNS across, say, a whole C-net (/24) (ideally with a locally based tool, but robtex is useful). If a significant set of IPs within the /24 have different domains they are likely to be a server farm of some kind - OR a set of static business DSL lines, which it is usually a bad idea to block (but there are some business lines that run bots...). If the IP range is large then take several samples; sometimes only a sub-range is servers. Good service providers register usage in DNS, and a whois should show this where they do.
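A rough Python sketch of that sampling idea, for a quick local check instead of a service like robtex: reverse-resolve a handful of addresses spread across the /24 and see how many distinct domains come back. The sample spacing and the example range are arbitrary placeholders.

import ipaddress
import socket

def sample_rdns(cidr, step=16):
    """Reverse-resolve every `step`-th address in the block and group the
    results by the last two DNS labels (a rough stand-in for the domain)."""
    domains = {}
    hosts = list(ipaddress.ip_network(cidr).hosts())
    for ip in hosts[::step]:
        try:
            name = socket.gethostbyaddr(str(ip))[0]
        except OSError:
            continue  # no PTR record (or the lookup failed) for this address
        domain = ".".join(name.split(".")[-2:]).lower()
        domains.setdefault(domain, []).append(str(ip))
    return domains

# Placeholder range; use one taken from your own logs. Many different
# domains across the block often means a server farm (or static business
# lines, which dstiles warns you may not want to block).
for domain, ips in sorted(sample_rdns("203.0.113.0/24").items()):
    print(f"{domain}: {len(ips)} sampled addresses")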

Blocking sequence: first check for previously-blocked IP (it may be a nasty with a different UA and it may be quicker, depending on the IP database used), then SEs (UA plus IP), then known harvesters, then possibly bad browsers. Header checks scattered through these tests as applicable. Vary according to your own ideas. :)

On the way through make a note of whether the UA is a bot or a browser - useful for minor page content cloaking such as hiding email addresses. :)
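A minimal Python sketch of that ordering at the application level; every list and range below (blocked ranges, search-engine networks, harvester tokens, the header check) is a placeholder standing in for whatever database the real setup uses.

import ipaddress

# Placeholder data; in practice these come from your own logs, research,
# and threads like this one.
BLOCKED_RANGES = [ipaddress.ip_network("198.51.100.0/24")]
SEARCH_ENGINE_NETS = {"googlebot": [ipaddress.ip_network("66.249.64.0/19")]}
HARVESTER_TOKENS = ("libwww-perl", "java/", "python-urllib")

def classify(ip_str, user_agent, header_names):
    """Return (action, looks_like_bot) following the order described above."""
    ip = ipaddress.ip_address(ip_str)
    ua = user_agent.lower()

    # 1) previously blocked IP ranges: end of the matter.
    if any(ip in net for net in BLOCKED_RANGES):
        return "deny", True

    # 2) search engines: the UA and the IP must both match.
    for token, nets in SEARCH_ENGINE_NETS.items():
        if token in ua:
            return ("allow" if any(ip in net for net in nets) else "deny"), True

    # 3) known harvesters by UA token.
    if any(token in ua for token in HARVESTER_TOKENS):
        return "deny", True

    # 4) one example of the header checks scattered through these tests.
    if "accept" not in {name.lower() for name in header_names}:
        return "deny", True

    # Otherwise assume a browser; the flag is handy for minor content
    # cloaking such as hiding email addresses from bots.
    return "allow", False

Checking the previously blocked ranges first keeps the common case cheap, which is the point of running that test before any UA parsing.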

Status_203




msg:4152239
 9:09 am on Jun 14, 2010 (gmt 0)

Thanks for the replies.

Some very helpful information for me there.

@wilderness: One question though...

Most folks today are using "white listing" (denying nearly all visitors by default and allowing only specific User Agents or IP ranges), rather than "black listing" (denying specific User Agents or IP ranges).


Do you mean that most folks (in this forum anyway) are using white listing for crawlers (which was my understanding from following the forum) or white listing for any access to the site (which is how the quote reads to me but seems like increasing hassle rather than reducing it)?

wilderness




msg:4152357
 2:43 pm on Jun 14, 2010 (gmt 0)

or white listing for any access to the site


Whether a crawler or a human visitor is involved, when accepted/conforming User Agents are used to determine acceptable traffic, the result applies to all visitors.

Most crawlers/harvesters have changed their User Agents to either standard-looking or dysfunctional UAs. Some examples of the dysfunctional ones are extra spaces, missing spaces, extra characters, missing or extra semi-colons, and more.

Some tokens in UAs were only ever used with specific operating systems and/or browsers; thus a Win95 token is generally not used (or acceptable) in a UA alongside IE8.
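A few of those tells, sketched as simple Python checks; the patterns are illustrative examples drawn from the ones mentioned above (double spaces, a missing space after a semicolon, a Win95 token combined with MSIE 8), not an exhaustive or authoritative list.

import re

# Illustrative red flags only; tune these against your own logs rather
# than treating them as a definitive list.
UA_CHECKS = [
    (lambda ua: "  " in ua, "double space inside the UA"),
    (lambda ua: re.search(r";\S", ua) is not None, "missing space after a semicolon"),
    (lambda ua: "MSIE 8" in ua and "Windows 95" in ua, "Win95 token paired with IE8"),
    (lambda ua: not ua.strip(), "empty User-Agent"),
]

def ua_red_flags(user_agent):
    """Return the reasons a User-Agent string looks malformed or spoofed."""
    return [reason for check, reason in UA_CHECKS if check(user_agent)]

print(ua_red_flags("Mozilla/4.0 (compatible;MSIE 8.0; Windows 95)"))
# -> ['missing space after a semicolon', 'Win95 token paired with IE8']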

Although the learning curve is steeper, in the long run whitelisting reduces the time spent constantly monitoring your raw logs and the time required to stay abreast.

keyplyr




msg:4152595
 10:33 pm on Jun 14, 2010 (gmt 0)

I use a combination of UA whitelisting, IP blacklisting and generic UA blacklisting (a rough sketch in code follows the list below).

1.) UA whitelisting using mod_rewrite:
Allowing access only from verified IPs when certain UA criteria are present. This filters out UA spoofing.

2.) IP blacklisting using mod_access:
Banning specific IP addresses as well as partial IP ranges that are known to send trouble. Info gathered from server access logs, research from web searches, and here at the WW forums.

3.) Generic UA blacklisting using mod_rewrite:
Blocking access when certain criteria are found in the UA string, regardless of who it is.
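keyplyr implements these layers with Apache's mod_rewrite and mod_access; since no actual directives appear in the thread, here is an application-level Python sketch of the same three checks, with every IP range and UA pattern a made-up placeholder.

import ipaddress
import re

# --- placeholder lists standing in for the real data ---
UA_WHITELIST = {  # UA token -> networks that UA is allowed to come from
    "googlebot": [ipaddress.ip_network("66.249.64.0/19")],
}
IP_BLACKLIST = [ipaddress.ip_network("198.51.100.0/24")]   # ranges known to send trouble
GENERIC_UA_BLACKLIST = re.compile(r"(wget|libwww|python-requests)", re.I)

def allow_request(ip_str, user_agent):
    ip = ipaddress.ip_address(ip_str)
    ua = user_agent.lower()

    # 1) UA whitelisting: a whitelisted UA is only honoured from verified
    #    IPs, which filters out UA spoofing.
    for token, nets in UA_WHITELIST.items():
        if token in ua:
            return any(ip in net for net in nets)

    # 2) IP blacklisting: specific addresses / partial ranges known to send trouble.
    if any(ip in net for net in IP_BLACKLIST):
        return False

    # 3) Generic UA blacklisting: block on UA content regardless of who it is.
    if GENERIC_UA_BLACKLIST.search(user_agent):
        return False

    return True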

Status_203




msg:4152805
 9:12 am on Jun 15, 2010 (gmt 0)

Ah, thanks wilderness. Just differing opinions as to terminology then. Personally I'd class that as validation - determining whether or not it *is* a bot before taking an approach of

likely human - innocent until proven guilty and allowing access unless and until other evidence suggests a bot - blacklisting.

likely bot - guilty until... well probably just guilty unfortunately these days ;) , if it's not on the list it's not coming in - whitelisting.

(although I can see an argument for claiming you're simply whitelisting humans! ;) )

UAs are next on my (bot-blocking) list. I hadn't thought about how what was once a valid component might become invalid over time. Thanks.

Then it'll be on to behavioural heuristics. I'm well aware that people probably won't want to discuss that in too much detail. I have some ideas; it's just a matter of deciding where to draw the line!

enigma1




msg:4152875
 12:32 pm on Jun 15, 2010 (gmt 0)

@Status_203, you should be cautious about IP bans, especially permanent ones. The way I prefer to configure a server for identification is:

UA - irrelevant, don't care
Referrer - irrelevant, don't care
Other HTTP headers - check for validity (browser vs bot identification)
Then the IP itself - is rDNS valid, and what does it resolve to? ISPs vs hosts
Rate of page accesses within a time frame

Have some logic in the application to decide whether or not to bounce a specific request.
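A minimal Python sketch of the time-frame check from that list, assuming an in-memory store; the window length and request cap are arbitrary placeholders, and a real setup would combine this with the header and rDNS results before deciding to bounce anything.

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60            # placeholder values; tune to your traffic
MAX_REQUESTS_PER_WINDOW = 30

_recent = defaultdict(deque)   # ip -> timestamps of recent page requests

def too_fast(ip, now=None):
    """True if this IP has requested more pages than allowed in the window."""
    now = now or time.time()
    hits = _recent[ip]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_REQUESTS_PER_WINDOW

# e.g. if too_fast(remote_ip): bounce the request (503 or a captcha)
# rather than permanently banning the IP, as cautioned above.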

Finally, keep in mind that an iframe with some JS, redirects, or pointers to external sites via resources embedded in HTML pages can produce stealth requests (the visitor doesn't even know it) to other domains, and these can trigger traps and bans. You don't want to ban IPs and hosts because of that.

Megaclinium




msg:4155113
 1:03 am on Jun 19, 2010 (gmt 0)

While it is sometimes obvious from your raw logs which visitors are bots (e.g. they retrieve only the HTML, where a real user would also retrieve the embedded graphics),

I also add links to two small files that aren't visible on the main web page but that robots will tend to follow.

This identifies them as bots if other methods fail.

The reason I have two: one of them is excluded in robots.txt.

That way, if they retrieve that one they are both a bot and a 'bad bot'.
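A small Python sketch of the two-trap bookkeeping; the trap paths, the robots.txt excerpt, and the log file name are all made-up placeholders, and the hidden links themselves would sit somewhere invisible in the page templates.

# Hypothetical trap paths, linked invisibly from your pages.
# /trap-open.html   - not mentioned in robots.txt: fetching it marks a bot.
# /trap-closed.html - Disallowed in robots.txt: fetching it marks a *bad* bot,
#                     since it ignored robots.txt.
#
# robots.txt (excerpt):
#   User-agent: *
#   Disallow: /trap-closed.html

TRAPS = {"/trap-open.html": "bot", "/trap-closed.html": "bad bot"}

def log_trap_hit(path, ip, user_agent, logfile="trap.log"):
    """Record any request to a trap URL; returns the label or None."""
    label = TRAPS.get(path)
    if label:
        with open(logfile, "a") as fh:
            fh.write(f"{label}\t{ip}\t{user_agent}\t{path}\n")
    return label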

I've been seeing an increasing number of small RIPE ranges of maybe 255 IP addresses whose registry entries show, at the bottom, the 'route' for that range, which is probably the hosting server farm, so I add that to the ban as well.

I also have a setting in my control panel so that end users can't directly get my media files. They have to retrieve a web page on my site with an embedded link to the media (this is how most visitors view web pages: they get a web page and it pulls in the embedded graphics).

Robots could get your pages this way, but they are either stupid or lazy: they compile a list of media they want and then try to fetch it directly. They eat endless 302s on my site because of this (which saves a lot of bandwidth, as a 302 returns only a few bytes).

This will frustrate the occasional real user who wants to link directly to a .jpg on your site. They can still link to the web page that contains that .jpg, which is what I want anyway.

Status_203




msg:4155884
 8:25 am on Jun 21, 2010 (gmt 0)

Not fetching images - indicative, not proof positive. I'll browse with images off sometimes; especially over a mobile connection. This to me would be a reason to throw up a (text based) captcha early in the session (and continue to watch carefully for further indications just in case). If they can't answer the captcha or trigger further indicators then flag the session for offline research.

Hidden links - could be a bot, could be a downloader. Don't really want either on the site but I don't want to IP ban a downloader on a home connection IP range. A flag for research again.

Hotlink protection - probably a referrer check. I track sessions on my sites so I can do a little better by ensuring the presence of a valid session and that the 'gatekeeper' page was requested recently by that session (and probably throw in a referrer check (allowing blank as well) for good measure!). Unrelated to IP banning though ;)
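A rough Python sketch of that session-based check, assuming some session store keyed by a cookie; the field name, time window, and hostnames are placeholders for whatever your own framework provides.

import time
from urllib.parse import urlparse

GATEKEEPER_MAX_AGE = 300                         # placeholder: seconds since the page view
MY_HOSTS = {"www.example.com", "example.com"}    # placeholder hostnames

def may_serve_media(session, referrer, now=None):
    """Allow a media request only if the session recently loaded the
    'gatekeeper' page, with a lenient referrer check (blank is allowed)."""
    now = now or time.time()
    if not session:                                   # no valid session at all
        return False
    last_page = session.get("gatekeeper_seen", 0)     # set when the HTML page is served
    if now - last_page > GATEKEEPER_MAX_AGE:
        return False
    if referrer:                                      # blank referrer is tolerated
        if urlparse(referrer).hostname not in MY_HOSTS:
            return False
    return True

# On a failed check, respond with a 302 to the containing page (as Megaclinium
# does) rather than serving the file directly.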

Ocean10000




msg:4156833
 2:55 pm on Jun 22, 2010 (gmt 0)

Here is an older thread in which I outlined some of the items I use to filter out unwanted traffic from spiders/scrapers/problem makers.

Quick primer on identifying bot activity [webmasterworld.com]

Status_203




msg:4157431
 8:06 am on Jun 23, 2010 (gmt 0)

Thread seems to be going off topic now, but if a one-stop-shop-reminder-thread is going to be useful then Point 8 in the (very good) post Ocean10000 linked to reminded me of Incredibill's thread Default User Agents of Programming Libraries and Command Line Tools [webmasterworld.com]

And while looking for that I was reminded of the following threads (from the very useful Library [webmasterworld.com])

IP Banning Primer [webmasterworld.com] by httpwebwitch

and

Do Bot-Blocking Techniques Alter Bot Behavior? [webmasterworld.com] by dstiles

Megaclinium




msg:4168072
 1:47 pm on Jul 11, 2010 (gmt 0)

Not fetching images - yes, I usually don't ban them just for this unless they start scraping.

I hadn't really thought about mobile users, or others browsing with images off for speed, but I don't usually ban for that alone unless they do it repeatedly and never visit other parts of the site,

or they have a weird UA indicating a scraper or bot,
or the address resolves to a hosting service rather than end users when I look it up.
