I am in Bill's camp when it comes to this. It's really not *hard* to white list properly; it just takes a little homework and a decent-sized set of log file history. If you're sophisticated enough to make your own rewrite rules to catch bad user agents, you've got 95% of the skill set needed to look at other headers and write rules based on those.
For me, in the long run white listing results in less effort, not more. Blacklists, unfortunately, must be constantly updated, and even then they are terribly ineffective. Long gone are the days when they would catch most bots... in recent years I have found they catch only a minority of them.
Here's what it takes to white list without giving the bad guys the info we don't want them to have:
- Initially, log all headers for a month or so, not just user agent, IP, requested page, referrer, etc. (see the logging sketch after this list).
- Hit the site yourself with every browser, and every version of that browser, you can lay your hands on. Not just on your PC: test mobile browsers as well, via services like Browser Cam. On my Android phone I must have installed at least a dozen browsers (luckily 95% of them use the same "engine", so the headers aren't unique!). Log the headers so you can see what all of these send. Make sure you do this through a non-proxy connection.
- Using the header information from above, white list based on the User-Agent, Accept-Encoding, Connection, and Accept headers when there are no proxy headers present (see the rewrite sketch after this list). I won't list the proxy headers here, since that information is readily available elsewhere, but there are roughly a dozen which cover 99% of the proxies humans use. If you want to accept only users with specific languages, also filter on Accept-Language. You've now white listed 95% of valid users.
- Build a list of IP ranges for the search engines you allow, along with the user agents you want to allow from those ranges, and white list them. Block any search engine user agent not coming from its valid IP ranges (example after this list). These IP ranges are also available on WebmasterWorld.
- Next, filter your logs with grep, awk, or similar tools on your platform for requests which do not match the preliminary white list above (see the log-filter sketch after this list). You want a large sample, so depending on your traffic this may be a week's, a month's, or several months' worth of logs. Whenever you get something which doesn't pass the filters, look at the headers closely, especially the various proxy headers. Determine manually, using IP lookup tools, analysis of the time between page fetches, etc., whether it is a human or not. If it is, add the unique header combination (usually user agent, Accept-Encoding, Connection, and one of the proxy headers) to the white list.
- Now you're at 99.9% or better! Sounds good, but 1 out of 1,000 visitors blocked is too many for me. I like to get it above 99.99%, so next...
- Anything still blocked, rewrite to a human verification page (sketch after this list). Have the results of the human verification page emailed to you along with all headers, and use this information to add a few more rules to your white list.
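Since the above is fairly abstract, here are some rough sketches of what these steps can look like in practice. Every header value, path, and IP range in them is a placeholder to be replaced with what your own logs tell you. First, the logging; something along these lines with mod_log_config captures the headers that matter (mod_log_forensic's ForensicLog can capture literally every header, if you want them all):

```apache
# Log the headers that matter for white listing alongside the usual
# fields. %{Name}i logs a request header; extend the list to taste.
LogFormat "%h %t \"%r\" %>s \"%{User-Agent}i\" \"%{Accept}i\" \"%{Accept-Encoding}i\" \"%{Accept-Language}i\" \"%{Connection}i\" \"%{Via}i\" \"%{X-Forwarded-For}i\"" headers
CustomLog logs/headers.log headers
```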
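For the header white list itself, a stripped-down mod_rewrite sketch. The browser tokens and header values below are examples only; build the real list from the logs above:

```apache
RewriteEngine On

# Pass requests whose core headers look like a known browser combination.
# These values are examples only; derive the real ones from your own logs.
RewriteCond %{HTTP_USER_AGENT} (Firefox|Chrome|Safari|MSIE|Opera)
RewriteCond %{HTTP:Accept-Encoding} gzip
RewriteCond %{HTTP:Accept} text/html
RewriteRule .* - [S=1]

# Anything that fell through gets flagged; a later rule decides what to do.
RewriteRule .* - [E=SUSPECT:1]
```

The [S=1] skips the flagging rule when the headers look like a real browser; everything else carries a SUSPECT flag for the rules further down.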
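For search engines, the same idea keyed to IP ranges. The Googlebot range below is purely for illustration; verify current ranges yourself, and repeat per engine:

```apache
# Accept Googlebot's user agent only from an IP range Google is known
# to crawl from (66.249.64.0/19 here; verify before relying on it).
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteRule .* - [F]
```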
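For combing the logs, something as simple as this gets you a ranked list of requests that didn't pass the preliminary white list (the field positions depend on your own LogFormat):

```sh
#!/bin/sh
# Quick-and-dirty filter for requests that don't match the preliminary
# white list yet. Patterns and the log path are examples.
grep -vE 'Firefox|Chrome|Safari|MSIE|Opera|Googlebot|bingbot' \
    /var/log/apache2/headers.log |
  awk -F'"' '{print $2 " | " $4}' |  # request line and User-Agent
  sort | uniq -c | sort -rn | head -50
```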
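And for the human verification step, anything still carrying the SUSPECT flag from the sketch above gets rewritten to a verification page. /verify.php is a made-up name for this sketch; that page would do the emailing:

```apache
# Send anything still flagged as suspect to the verification page
# instead of the content it asked for.
RewriteCond %{ENV:SUSPECT} =1
RewriteCond %{REQUEST_URI} !^/verify\.php$
RewriteRule .* /verify.php [L]
```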
I found that in the first couple of weeks of doing the above I ended up adding a handful of additional rules to my white list. After that I started saving time, expending far less effort chasing down bots.
- For the first couple of months after that, I went back through the logs at the end of each month looking for additional white list items. Now, a year later, I spend very little time chasing bots.
- All this has resulted in a much smaller set of rules, and since page load time can be a ranking factor these days, it also slightly speeds up time to first byte.
Additional things which will result in less pulling out of your hair in the long run:
- Depending on your user base, you may want to block certain countries. I choose to do this at the firewall level instead of with rewrite or deny rules (firewall sketch after this list). It is far more CPU efficient (better time to first byte), and it also blocks a ton of SSH probes, SMTP attacks, etc.
- Use mod_geoip to block anonymous proxies (example after this list). mod_geoip caches IP lookups, so the overhead is low.
- Set up an hourly script to update a list of the Tor exit nodes which can connect to your server's IP(s), and block them (sketch after this list).
- Set up a free account at Project Honey Pot and install their Apache module (lookup sketch after this list). The module has a few settings you can use to block recent comment spammers, harvesters, spam servers, dictionary attackers, and "rule breakers". To avoid overhead and added time to first byte, I have it set up to do a lookup only on pages which require human input, such as registration and contact pages.
- If you have a vBulletin forum, install the Spambot Stopper plugin. I have found it to be more effective than all other spam/crawler related plugins combined. It has helped me catch, and block at the firewall, numerous IP ranges of server farms running very sophisticated bots which emulate real user headers effectively enough to get through the white list.
- If you use a Content Delivery Network, do not use it for your actual content (i.e., HTML pages and content images). Use it for CSS, JavaScript, and navigation images, but not other images. Block any CDN requests which do not fetch these (sketch after this list). The reason: I found that many crawlers will attempt to get at your content via the CDN. If you are in a situation where you must serve actual content via the CDN, make sure the CDN company has a method or a plug-in/module for your web server to receive the end user's real header information so you can white list properly.
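For the country blocking, a sketch with ipset and iptables. The country code and the zone-file source are placeholders; any maintained per-country CIDR list will do:

```sh
#!/bin/sh
# Drop one country's traffic at the firewall with ipset + iptables.
CC=cn
ipset create "block-$CC" hash:net -exist
wget -qO- "http://www.ipdeny.com/ipblocks/data/countries/$CC.zone" |
  while read -r net; do
      ipset add "block-$CC" "$net" -exist
  done
# Add the DROP rule once (not on every refresh):
iptables -C INPUT -m set --match-set "block-$CC" src -j DROP 2>/dev/null ||
  iptables -I INPUT -m set --match-set "block-$CC" src -j DROP
```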
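For mod_geoip, the legacy MaxMind country database flags anonymous proxies with the pseudo country code A1, so blocking them can look roughly like this (paths are placeholders, and the access directives are shown in Apache 2.2 style):

```apache
# Block anonymous proxies via mod_geoip; the MemoryCache flag keeps
# lookups cheap by holding the database in memory.
GeoIPEnable On
GeoIPDBFile /usr/share/GeoIP/GeoIP.dat MemoryCache

SetEnvIf GEOIP_COUNTRY_CODE A1 AnonProxy
<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    Deny from env=AnonProxy
</Directory>
```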
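The Tor refresh can be a small cron script along these lines. The exit-list URL has moved around over the years, so treat it as a placeholder and check the Tor project's current location for it:

```sh
#!/bin/sh
# Hourly refresh of an ipset holding Tor exit nodes able to reach this
# server. SERVER_IP is a placeholder for your own address.
# Cron entry, e.g.: 0 * * * * /usr/local/sbin/refresh-tor-exits.sh
SERVER_IP=203.0.113.10
ipset create tor-exits hash:ip -exist
ipset flush tor-exits
wget -qO- "https://check.torproject.org/cgi-bin/TorBulkExitList.py?ip=$SERVER_IP" |
  grep -v '^#' |
  while read -r ip; do
      ipset add tor-exits "$ip" -exist
  done
# Add the DROP rule once (not every run):
iptables -C INPUT -m set --match-set tor-exits src -j DROP 2>/dev/null ||
  iptables -I INPUT -m set --match-set tor-exits src -j DROP
```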
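I won't reproduce the honey pot module's own settings here, but for the curious, the lookup it performs boils down to a single DNS query against http:BL, roughly like this (KEY stands in for your own http:BL access key; the IP is an example visitor address):

```sh
#!/bin/sh
# One http:BL lookup: query <key>.<reversed-ip>.dnsbl.httpbl.org.
KEY=mykey
IP=192.0.2.55
REV=$(echo "$IP" | awk -F. '{print $4"."$3"."$2"."$1}')
# A response of 127.<days>.<threat>.<type> means the IP is listed;
# type is a bitmask: 1=suspicious, 2=harvester, 4=comment spammer.
host -t A "$KEY.$REV.dnsbl.httpbl.org"
```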
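And for the CDN rule, assuming your CDN stamps its fetches with an identifying header (X-From-CDN below is hypothetical, as is the /images/nav/ path; check what your CDN really sends), the origin can refuse to serve it anything but static assets:

```apache
# Refuse CDN-originated requests for anything that isn't CSS,
# JavaScript, or a navigation image.
RewriteEngine On
RewriteCond %{HTTP:X-From-CDN} !^$
RewriteCond %{REQUEST_URI} !\.(css|js)$ [NC]
RewriteCond %{REQUEST_URI} !^/images/nav/ [NC]
RewriteRule .* - [F]
```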
As a side note...
Lucy24, my white list lets stuff like your browser through... I ran into that early on and addressed it. :)