
Search Engine Spider and User Agent Identification Forum

    
Filtering Out Really Hard To Find Bad Bots
incrediBILL

Msg#: 4536448 posted 7:44 am on Jan 16, 2013 (gmt 0)

Everything that comes to my sites is already filtered into 3 buckets:
1. Allowed - whitelisted crawlers given instant access
2. Blocked - automated tools blocked from data centers and a variety of other criteria
3. Browsers - appear to be valid browsers, at least by today's definition ;)

Now I'm taking what was assumed to be browsers, all of which has been logged, and doing more in-depth analysis.

To do this, I first built a residential ISP rDNS filter: a big list of rDNS strings from all of the major residential ISPs. Any logged browser is then discarded if its rDNS result matches an entry on the ISP rDNS list.
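Roughly, in Python, a minimal sketch of that kind of filter; the suffix list, names and logged_ips are placeholders here, not the actual implementation:

# Sketch of the residential-ISP rDNS filter described above (placeholder data).
import socket

# rDNS suffixes of large residential ISPs (examples only, not a real list)
RESIDENTIAL_RDNS_SUFFIXES = (
    ".comcast.net",
    ".rr.com",
    ".verizon.net",
    ".cox.net",
)

def rdns(ip):
    # Return the reverse-DNS hostname for an IP, or "" if there is none.
    try:
        return socket.gethostbyaddr(ip)[0].lower()
    except (socket.herror, socket.gaierror):
        return ""

def is_residential(ip):
    # True if the IP's rDNS ends with a known residential-ISP suffix.
    host = rdns(ip)
    return any(host.endswith(suffix) for suffix in RESIDENTIAL_RDNS_SUFFIXES)

# Keep only the leftovers: logged "browser" IPs NOT on the residential list.
# leftovers = [ip for ip in logged_ips if not is_residential(ip)]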

Now we've basically eliminated IPs from data centers and well-known ISPs, filtering all traffic down into a much smaller and very manageable list of stuff left to evaluate.

The dregs of the web.

So far I'm finding ever smaller and lesser-known ISPs, and of course smaller and more obscure hosts. After adding much of that to the filters, I repeat and see what's left, until hopefully there's nothing left at all, as everything ends up filtered into the allow and deny lists.

With all that filtering, what you have left is the nastiest stealth stuff that really tries hard to hide and is now easily exposed and just sitting there like a glaring sore thumb.

With IPv6 it may not be possible due to the sheer volume of IP addresses to track, but with IPv4 it's looking pretty good so far and I'm quite pleased with the results.

I think all the data miners that don't want to be found will be switching to IPv6 just to be harder to find.

Assuming ISPs continue to provide useful rDNS for IPv6 addresses, maybe the solution to that problem isn't blocking data centers but allowing business and residential ISPs only: the ultimate whitelist. Assume everything else is a data center and block it, with holes punched thru that firewall as needed on a case-by-case basis.

Any thoughts or comments?

 

lucy24

Msg#: 4536448 posted 11:20 am on Jan 16, 2013 (gmt 0)

sitting there like a glaring sore thumb

With the remaining IPv4 space being doled out in /22 slivers, "glaring ingrown hair" may be more like it :(

Eventually you hit a point of diminishing returns. Sounds as if you expect yours to be somewhere in IPv6-land.

:: business with calculator leading to depressing discovery* that 2^32 is considerably less than the current population of the planet ::


* Followed by doubly depressing realization that I could have estimated this perfectly well in my head.

dstiles

Msg#: 4536448 posted 7:38 pm on Jan 16, 2013 (gmt 0)

I think there are more "broadband" ISP ranges than server ranges. Admittedly many are /11 or greater, but there are also a lot of very small ones, as Lucy suggests. A mitigation would be to select only those ranges assigned to a specific set of countries, but that would not be helpful to those websites that attract legitimate traffic from, e.g., RU, UA, CN, IN, etc.

I would also point out that a LOT of "broadband" IPs I block are run by botnets: there are a lot of very careless (mostly Windows) users out there whose machines are compromised, often massively so. A lot of mobile devices are also easily compromised, especially Android. These IPs are responsible for scrape and injection attacks.

In the scope of "business" ranges, I have blocked several sub-ranges (and sometimes full ranges) which are used by scraping and mining businesses.

Also see my current thread hereabouts re: synapse, which seems to be coming in (at least mostly) on "broadband" IPs.

As to ipv6 - I'm holding off on converting to that for as long as possible.

incrediBILL

Msg#: 4536448 posted 5:35 am on Jan 19, 2013 (gmt 0)

I think there are more "broadband" ISP ranges than server ranges.


That's why I'm doing it with reverse DNS results at the moment, as that list is a lot smaller than the large lists of small, fragmented IP ranges.

Also, when you're using reverse DNS you can automatically compile IP range lists as you go along based on actual accesses and do a WHOIS sanity check later if something looks suspicious.
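For illustration only, a rough sketch of that compile-as-you-go idea, grouping hits into /24s and flagging anything without a recognizable rDNS for a later WHOIS look (all names here are made up):

# Sketch: accumulate /24 ranges from live accesses and flag the ones whose
# rDNS matches nothing known, for a manual WHOIS sanity check later.
import socket
from collections import defaultdict

range_hits = defaultdict(set)   # "203.0.113" -> {"203.0.113.7", ...}
needs_whois = set()             # /24 prefixes worth a manual WHOIS look

def record_access(ip, known_suffixes):
    prefix = ip.rsplit(".", 1)[0]            # crude /24 grouping for IPv4
    range_hits[prefix].add(ip)
    try:
        host = socket.gethostbyaddr(ip)[0].lower()
    except (socket.herror, socket.gaierror):
        host = ""
    if not any(host.endswith(s) for s in known_suffixes):
        needs_whois.add(prefix)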

I would also point out that a LOT of "broadband" IPs I block are run by botnets:


Those IPs are handled on an individual basis because you couldn't block all of Comcast or Cox unless you wanted to lose a ton of customers. Likewise with compromised Android, which I've yet to see; I suspect you're seeing a fake user agent string. If it's legit, Android is typically coming from a 3G or 4G IP pool, and you can't really block those for more than a few hours at most unless you don't want mobile customers either.

wilderness

Msg#: 4536448 posted 2:20 pm on Jan 19, 2013 (gmt 0)

Those IPs are handled on an individual basis because you couldn't block all of Comcast or Cox unless you wanted to lose a ton of customers.


Bill,
I agree with the broad statement about denying an entire provider; however, if a webmaster is aware of the origin of traffic for his site(s), then he could certainly deny a subnet region, or even a portion of a region.
EX:
I get very little valid traffic from the San Francisco Bay Area, regardless of provider.

Visitors from Nebraska, the Dakotas, Oklahoma, Colorado, Las Vegas, Oregon, Washington, in fact for me most everything west of the Mississippi: these folks are simply without interest in widgets.

not2easy

Msg#: 4536448 posted 10:15 pm on Jan 19, 2013 (gmt 0)

It is easy enough to block an individual IP or a malformed UA that you find doing things you don't want it to do, but what do you do with IPs inside AT&T or Verizon that use a common vanilla agent? I am blocking individual IPs; I just don't think that is a very reliable defense when I see my own IP change daily, and even within the same day. Do most Communications ISP clients have unique static IPs?

wilderness

Msg#: 4536448 posted 10:52 pm on Jan 19, 2013 (gmt 0)

It is easy enough to block an individual IP or a malformed UA that you find doing things you don't want it to do, but what do you do with IPs inside AT&T or Verizon that use a common vanilla agent?


The standard browser assignment for a new customer seems to be IE; however, these days folks are using a variety of browsers and versions.

I am blocking individual IPs; I just don't think that is a very reliable defense when I see my own IP change daily, and even within the same day.


Assignments (at least broadband) are dynamic by nature; see below.

Do most Communications ISP clients have unique static IPs?


My broadband IP assignment (US) is dynamic; however, mine has not changed in months and, for the most part (with the exception of short periods), has been the same IP for more than two years, despite turning my computer off nightly. I rarely reset my broadband router unless there's an issue.
I've seen similar with longtime widget correspondents on other US broadband providers.

My biggest problems are with mobile devices, both browsers and IPs.
The browsers are vanilla agents.
The IPs may change on the next connection and/or place of origin, while being the same person and/or device all the while.

lucy24

Msg#: 4536448 posted 2:14 am on Jan 20, 2013 (gmt 0)

Do most Communications ISP clients have unique static IPs?

Genuinely static, as opposed to just not changing very often? Depends on how much you pay. When I had cable internet it came with a static IP; with DSL I'd have to move to a higher (= more expensive) service level.

But in practice the IP only changes when I turn off the modem, which I normally don't do. Neither the computer nor the router factors in. In fact it's useful for when people think they've got you identified and then you duck behind the curtain and change IPs, mwa ha ;)

Then there are those remote satellite locations where each town has its own /30 block* so you may not be officially static, but in practice...


* Really. When a disproportionate amount of your traffic comes from Arctic Bay, you notice this stuff. /30 is extreme though; generally it's /27 or /28.

not2easy

Msg#: 4536448 posted 2:41 am on Jan 20, 2013 (gmt 0)

My own IP is usually in the middle of a range that I have blocked from several sites (locking myself out more than once and needing to FTP an edit), but it jumps all over (I'm on ADSL): from being thisclose to PEAK in a 67 class A, to being sandwiched in between HSI and INTERSERVER in a 173 class A range, and sometimes on 192. Gah! If a higher class of service were available to me I'd be on it. I had a Hughes dish for several years, but their limits are incredibly low even with premium service. That's why I had questions about reliably blocking individual IPs. Your answers are helpful, thanks!

wilderness

Msg#: 4536448 posted 3:37 am on Jan 20, 2013 (gmt 0)

n2e,
The rural US areas certainly face different challenges than the larger metro areas.

I've a home in a very small rural area and there is no cable or DSL.

ADSL appears to be an option, however I'm not sure if that requires two phone lines, which would make the expense hard to justify for that slower speed (256kb up).

Hughes simply has too many restrictions, however for speed-users it's the only valid rural option. It's my understanding that the Hughes routers only allow a single machine connection.

Dial-up is simply no longer acceptable.

lucy24

Msg#: 4536448 posted 3:59 am on Jan 20, 2013 (gmt 0)

My own IP is usually in the middle of a range that I have blocked from several sites (locking myself out more than once and needing to FTP an edit), but it jumps all over

That's where mod_rewrite steps in. (Assuming Apache here.) Most of the time a clean and simple "Deny from..." will do, but sometimes you simply have to look at more than one option: "slam the door on this IP range unless the user-agent is me".
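For the record, that "unless the user-agent is me" exception looks something like this in .htaccess; the IP range and the UA token below are placeholders for your own values, not a recommendation of any particular range:

# Sketch only: forbid a troublesome range unless the user-agent contains
# a token identifying my own browser (both values are placeholders).
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.
RewriteCond %{HTTP_USER_AGENT} !MyOwnBrowserToken
RewriteRule .* - [F]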

Admittedly this is easier if you've got a browser with some distinguishing features. In other words, not Chrome ;) and definitely not a tablet.

wilderness

Msg#: 4536448 posted 4:05 am on Jan 20, 2013 (gmt 0)

Admittedly this is easier if you've got a browser with some distinguishing features.


It's not difficult to modify UAs; I have widget users do it occasionally. The user adds a keyword, notifies me of the word, and I add an exception.
The most recent was in a rural area that was assigned an Embarq IP by their local provider. (I'm not a happy camper with Embarq.)

not2easy

Msg#: 4536448 posted 3:22 pm on Jan 20, 2013 (gmt 0)

ADSL appears to be an option, however I'm not sure if that requires two phone lines, which would make the expense hard to justify for that slower speed (256kb up).

No, one phone line handles it, but what affects the service is where you are in relation to the nearest physical routing point. I am rural and remote, at the end of the phone service line, and I am lucky to get 225kb (down) on the 2Mb plan, so I have stopped trying to buy the 50Mb plan.

Hughes simply has too many restrictions, however for speed-users it's the only valid rural option. It's my understanding that the Hughes routers only allow a single machine connection.

The satellite service enters via a cable to their modem, which has only one outlet, so I ran the Hughes service through a wireless workstation router and connected whatever I wanted, but staying up till 3 AM to get unlimited up/down service was too much. Simple platform update downloads could put you offline for a day. The fear of clicking links that might (auto)start a video really cramps online enjoyment. I had the top-of-the-line ElitePlus plan, so I had a whopping 500 Mb a day of bandwidth to play with, and it cost 3 times what I pay today. Because they use a running average to determine fair use, it is easy to forget how close you are to the edge, and then you are offline until it averages out to their limits. And their ads encourage people to "Download your favorite music" heh.

OK sorry for OT ISP rants.

One of the things I do to check for unidentified robots in raw logs is to pull out all the lines with "GET /robots.txt", another set with all the "GET / HTTP/1.1" lines, and another with all the lines with 304 responses. With these I can go look at the activity in context and see what was going on. It also makes it easier to check against my long list of bad IPs. It is slow and done manually, but I only do this for an update audit. If I find I need more data I look at more logs for the same site. I know there are lots of people here who could do these things dynamically, but I have decided I'll never have the time available to learn all that. Just added them as ideas.
dstiles

Msg#: 4536448 posted 7:17 pm on Jan 20, 2013 (gmt 0)

The best method I have found for dynamic IPs (broadband and mobile, and even nominally static ones) is to define the full IP range as "dynamic" and then ban offenders for a given period of time, removing the ban at the end of that time. If the IP offends several times within the ban timeout, increase the timeout. Servers and known bad ranges are permanently banned, with holes for good bots.
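A bare-bones sketch of that escalating-timeout idea; the base period and the in-memory storage are just one possible way to do it:

# Sketch: temporary bans for dynamic ranges, escalating on repeat offences.
import time

BASE_BAN = 60 * 60          # 1 hour starting ban, arbitrary
bans = {}                   # ip -> (ban_expires_at, strikes)

def punish(ip):
    now = time.time()
    expires, strikes = bans.get(ip, (0, 0))
    strikes = strikes + 1 if now < expires else 1   # re-offended during a ban?
    bans[ip] = (now + BASE_BAN * strikes, strikes)  # longer ban each time

def is_banned(ip):
    expires, _ = bans.get(ip, (0, 0))
    return time.time() < expires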

I run half a dozen computers here off a single static broadband IP. The router connects to the phone line and an 8-way Netgear switch-box fans out the single router output to the local computers, one switched line per computer. Switching is high-speed and automatic. It's worked well for over a decade now. I see no reason why this technique should not work for any router or modem, with the proviso that the modem does not rely on a computer for any reason.

lucy24

Msg#: 4536448 posted 11:52 pm on Jan 20, 2013 (gmt 0)

One of the things I do to check for unidentified robots in raw logs is to pull out all the lines with "GET /robots.txt", another set with all the "GET / HTTP/1.1" lines, and another with all the lines with 304 responses. With these I can go look at the activity in context and see what was going on. It also makes it easier to check against my long list of bad IPs. It is slow and done manually, but I only do this for an update audit. If I find I need more data I look at more logs for the same site. I know there are lots of people here who could do these things dynamically, but I have decided I'll never have the time available to learn all that. Just added them as ideas.

About the same here.

Logs of course have one advantage over real-time activity: you can see what the next request will be. So f'rinstance if I get a request for robots.txt followed by a 403 from the same IP, then both lines get chopped out of the log-wrangling routine and I don't have to think about them. The 403 may not even be IP-based; all that matters is that this source has already been Dealt With. The only ones that need brain-and-eyeball attention are the robots.txt requests from unknown sources.

In my case it also helps that I'm not a front-driven site. Most one-off robots go no further than the front page, which no human ever visits except on the way to somewhere else, and those go in the "no skin off my nose" category.

Another check I've added recently is any large number of page requests from the same IP. It might be someone spreading out and looking at lots of your pages-- which is gratifying when it happens :) --but it may also be a robot harvesting everything in sight.

And then there are the auto-referers. mod_rewrite pigheadedly refuses to let me block these upfront (that is, it's syntactically not possible) but in your own log-wrangling script it's trivial. So when someone comes by and asks for robots.txt, giving robots.txt as referer... well, that's what linguists call Double Markedness.
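Those checks are easy to bolt onto a log-wrangling script. Roughly, assuming combined log format and a placeholder file name:

# Sketch: per-IP request counts, IPs already answered with a 403, and
# "auto-referers" that request robots.txt giving robots.txt as the referer.
import re
from collections import Counter

LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3}) \S+ "([^"]*)"')

hits_per_ip = Counter()
already_403 = set()
auto_referers = set()

for line in open("access.log", encoding="utf-8", errors="replace"):
    m = LOG_RE.match(line)
    if not m:
        continue
    ip, method, path, status, referer = m.groups()
    hits_per_ip[ip] += 1
    if status == "403":
        already_403.add(ip)        # this source has already been Dealt With
    if path == "/robots.txt" and referer.endswith("robots.txt"):
        auto_referers.add(ip)      # the Double Markedness case

# IPs with lots of page requests that haven't already been blocked:
heavy_hitters = [ip for ip, n in hits_per_ip.most_common(20) if ip not in already_403]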

blend27

Msg#: 4536448 posted 3:42 am on Jan 21, 2013 (gmt 0)

@lucy24
Logs of course have one advantage over real-time activity: you can see what the next request will be.

Now take the knowledge you have gained, create a MySQL/MSSQL schema, and log all that info:

request headers
robots.txt access
URI requested/QueryString/Referrer
UAs
IPs(including rdns)
hosting ranges
country ranges (2 indexed views: first search allowed; if not found, search not-allowed (log data, block))
media files access
speed of access
Errors, redirects / Click Path / Scrape Path

You will be surprised how much real-time data matters/is useful nowadays, and how much faster it is.

I have 9 tables with 3GB of data in MSSQL, with a sub-domain on one of the busiest sites that is used for web services that spit out all that data live to other sites I own. Seven queries, all together under a second. Authenticated access only.

I could tell you how many times Googlebot had crawled URI #672 in the second week of April 2004, when a particular UA first showed up on the site, and which geographical area in the US or CA was more interested in "curly red widgets" on Black Friday/Cyber Monday of 2008. Oh, and that iPhone- and iPad-based UAs send request headers in a different order altogether :).

I could ban/unban an IP/range based on that info on more than two dozen sites via a custom BlackBerry app that I wrote.

It's a lot more fun that way.

And no, it's not on an Apache/PHP platform, sorry ;)
-----------------------------------------------------------

@incrediBILL

I have a function that does lookups by taking advantage of Java classes.

in short:
function rdnsLookUp(address) {
    // Variables
    var iaclass = "";
    var addr = "";
    // Init the Java InetAddress class
    iaclass = CreateObject("java", "java.net.InetAddress");
    // Get the address object for the IP passed in
    addr = iaclass.getByName(address);
    // Return the canonical host name (the rDNS result)
    return addr.getCanonicalHostName();
}

Problem with running rDNS requests against the IPs that do not have them is that the time to look up is USUALLY 4-5 seconds. So if someone were to run a DDoS-style scrape, that would slow down a server a bit. So I time out the requests after 2 seconds (no more), and then a scheduled task that runs on the back burner (a diff app pool) picks it up. Mostly these are hosting ranges.

httpwebwitch

Msg#: 4536448 posted 7:09 am on Jan 22, 2013 (gmt 0)

I've discovered many bots using old-fashioned honeypots. Using randomized class names, <a> elements are scattered around a page like land mines, hidden from view using a variety of CSS tricks. If any of those receive a click, then ka-pow. IP banned!
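In very rough terms it can be as simple as the sketch below; the class-name generation, trap path and storage are placeholders, not httpwebwitch's actual implementation:

# Sketch of a CSS honeypot: emit a hidden link under a randomized class name
# and ban any IP that later requests the trap URL.
import secrets

banned_ips = set()
TRAP_PATH = "/dontcrawl.html"          # never linked anywhere visible

def trap_link():
    cls = "x" + secrets.token_hex(4)   # randomized class name per page
    style = "<style>.%s{position:absolute;left:-9999px}</style>" % cls
    return style + '<a class="%s" href="%s">widgets</a>' % (cls, TRAP_PATH)

def allow_request(ip, path):
    # Call from the request handler: ka-pow if the trap gets a hit.
    if path == TRAP_PATH:
        banned_ips.add(ip)
    return ip not in banned_ips        # False means serve a 403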

incrediBILL

Msg#: 4536448 posted 7:41 am on Jan 22, 2013 (gmt 0)

Problem with running rDNS requests against the IPs that do not have them is that the time to look up is USUALLY 4-5 seconds


That's not a problem if you cache the rDNS lookups for 24 hours like I do, which means only the first hit from that IP causes a delay. If you run into any problem IP ranges that cause incredibly long delays, you can drop them in a list to skip. Been doing this for years and have survived many a DDoS so far.

Also, using the "one page free" theory, you do the rDNS lookup after page processing so it doesn't hold up the page from being displayed, and when you do get an answer you block subsequent page requests. Not a 100% solution, but it mixes the best of all approaches.

Additionally, you can set up an rDNS daemon that processes them in a dedicated multi-threaded process, which takes up very little memory, unlike a bunch of large scripts hanging around. Subsequent page requests can query the daemon for results.
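Roughly, the cache-plus-timeout part could look like this in Python; the 24 hours and 2 seconds come from the posts above, everything else is a placeholder rather than the actual setup:

# Sketch: cache rDNS answers (including failures) for 24 hours so only the
# first hit from an IP pays for the lookup, and give up after ~2 seconds
# rather than stalling the page. A timed-out lookup keeps running in a
# background thread, in the spirit of the back-burner approach above.
import socket
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

RDNS_TTL = 24 * 60 * 60
rdns_cache = {}                     # ip -> (hostname_or_empty, fetched_at)
_pool = ThreadPoolExecutor(max_workers=4)

def _lookup(ip):
    try:
        return socket.gethostbyaddr(ip)[0].lower()
    except (socket.herror, socket.gaierror):
        return ""

def cached_rdns(ip, timeout=2.0):
    host, fetched = rdns_cache.get(ip, ("", 0))
    if time.time() - fetched < RDNS_TTL:
        return host                 # only the first hit pays for the lookup
    try:
        host = _pool.submit(_lookup, ip).result(timeout=timeout)
    except TimeoutError:
        host = ""                   # don't hold up the page
    rdns_cache[ip] = (host, time.time())
    return host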

With good engineering practices you can always find suitable solutions ;)

hidden from view using a variety of CSS tricks.


Sites have natural honeypot pages such as legal.html, policy.html, terms.html, etc., which humans rarely ever look at but bots blindly crawl. Links like "dontcrawl.html" hidden with CSS are just gravy and confirmation it's not human. As a matter of fact, whether they honor it or not, a lot of bad bots access robots.txt, which is another honeypot I use.

Visitors from Nebraska, the Dakotas, Oklahoma, Colorado, Las Vegas, Oregon, Washington, in fact for me most everything west of the Mississippi: these folks are simply without interest in widgets.


Except the danger there is blocking travelers, or people using services with IP pools that could be forwarding traffic from literally anywhere.

I use Comcast, which is pretty static; even rebooting the modem doesn't tend to change the IP, and when they do change it, the IP tends to remain in the same general region, although sites like MaxMind and others can't seem to keep up. However, my mobile provider often uses landline IP addresses over 600 miles away! I could be in No. CA, AZ or NV and still be using an IP pool from Los Angeles.

My point is, never assume where your customer is based solely on IP, except within a country itself, because it could literally be from anywhere these days, especially with large IP pools. Even country IP blocking isn't foolproof, as IP blocks in border states could be in either country. It takes a lot of extra data to check for all those edge cases, beyond the scope of simple blocking.

Anyway, country filtering is still close enough for me, as long as you accept that some collateral damage will happen.

dstiles

Msg#: 4536448 posted 7:28 pm on Jan 22, 2013 (gmt 0)

httpwebwitch - I have similar traps, but I find they only return a small percentage of the total trapped.

incrediBILL

Msg#: 4536448 posted 2:38 am on Jan 23, 2013 (gmt 0)

I have similar traps, but I find they only return a small percentage of the total trapped.


Yes, but it's often stuff pretending to be a browser that stumbles into those traps that might otherwise go unnoticed. It's why those types of bots also read robots.txt trying to avoid those traps, which is why robots.txt is also a trap.

When you get down to filtering out stealth crawlers, the Picscouts and other data miners that really don't want you to know they're watching you, every little trap helps as well as checking for little tells.

It's the combination of all the filters and tracking bugs that catches the sneakiest, so I never discard any method just because of a low rate of return. Just the mere fact that it's catching something that slipped thru all the other cracks means it's a useful tool to keep in your arsenal.

blend27

Msg#: 4536448 posted 5:06 am on Jan 23, 2013 (gmt 0)

often stuff pretending to be a browser that stumbles into those traps that might otherwise go unnoticed

IT IS CALLED: zactly disease, minus "TE: deflate,gzip;q=0.3" header ;)

@incrediBILL, thanks for the rDNS cache hint.

httpwebwitch

Msg#: 4536448 posted 4:52 pm on Jan 23, 2013 (gmt 0)

Another subtle one: block any IP that makes a request for phpmyadmin.

if you're using phpmyadmin... stop.
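If you want to automate that, a crude path check catches most of the obvious probes; the pattern list below is illustrative, not exhaustive:

# Sketch: treat requests for common database-admin probe paths as bot probes.
import re

PROBE_RE = re.compile(r"/(phpmyadmin|pma|myadmin|mysqladmin)\b", re.IGNORECASE)

def is_probe(path):
    return bool(PROBE_RE.search(path))

# e.g. is_probe("/phpMyAdmin-2.11/index.php") -> True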

dstiles

Msg#: 4536448 posted 8:33 pm on Jan 23, 2013 (gmt 0)

Bill, I agree - I was just pointing out that I do not see "many", as httpwebwitch does.

incrediBILL

Msg#: 4536448 posted 11:08 pm on Jan 23, 2013 (gmt 0)

if you're using phpmyadmin... stop.


It's installed by default for millions of websites on Plesk, cPanel, etc.

lucy24

Msg#: 4536448 posted 1:18 am on Jan 24, 2013 (gmt 0)

if you're using phpmyadmin... stop.

Do you mean the file/function, or the literal name?

If you look through your past 403s and 404s you will occasionally find a helpful robot that rattles off a list of the most likely variant names. I remember one that went through about 30 of them. Almost as useful as those pages that list the Top Ten transparent passwords :) ("Oh, oops, I guess 'sesame' wasn't so clever after all.")

not2easy

Msg#: 4536448 posted 2:16 am on Jan 24, 2013 (gmt 0)

For access logs on WordPress I pull out all lines with "POST /wp-login.php", and for sites without any access controls there can be hundreds of lines in a month's time, from dozens of IPs, most of them at regular intervals, automated. It's easy to see how a poorly chosen password gives them entry - or there wouldn't be so many trying.
