Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 39 message thread spans 2 pages; this is page 2.
Stale bad bot lists
Need a list of live versus stale bad bots

 2:07 am on Mar 24, 2012 (gmt 0)

A Google search about bad bots turns up several examples of long lists of bots to block.


RewriteCond %{HTTP_USER_AGENT} ^Morfeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
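
Excerpts like this are usually the middle of a longer chain; to actually function, the final condition must drop its [OR] flag and be followed by a blocking rule. A minimal complete sketch, using the bot names quoted above:

```apache
RewriteEngine On
# Conditions are ORed together; note the LAST one carries no [OR] flag
RewriteCond %{HTTP_USER_AGENT} ^Morfeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts
# Any match is answered with 403 Forbidden
RewriteRule .* - [F]
```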

Elsewhere, comments have been made about the number of script kiddies represented there that have moved on to other things.

It would be helpful if someone knowledgeable published a current hotlist of bots to ban, indicating whether each is a mere pest or pure evil, or at least pointed to the best regularly updated source for such a list.

It would also be interesting to see a second list of apparently lifeless bots that could be purged from .htaccess as dead wood.

(Of course, the latter list becomes a resource for future bot namers.)



 9:26 pm on Mar 25, 2012 (gmt 0)

The problem with discussing all the bad headers in public is that once you put out a list of what you're blocking, it's easy for the bot owners to adapt. For instance, when everyone started posting lists of bot names, all the bots started using MSIE's user agent.

They didn't want to be stopped, and still don't, so there are some bot-blocking trade secrets that we simply cannot post, or they won't be valid in a week.

The simple fact that residential botnets are now being employed to do some bot herder's bidding proves that too many high-value target sites are successfully blocking data centers.

It's just like any war with an escalation of weaponry being used by both sides until we either hit an impasse or one side wins.

In this case, short of forcing all non-verified, possibly-human traffic through CAPTCHAs, which are easily defeated with blow-through techniques (tricking unwitting humans into answering those CAPTCHAs to gain access to other sites), I'm guessing it'll ultimately be an impasse.


 11:30 pm on Mar 25, 2012 (gmt 0)

Under the scheme of white-listing you're not required to ID every newcomer as you have all the doors closed.

I think I will have to give up on this, as there is obviously something utterly fundamental that I'M NOT GETTING and it's making me very cross.

If you say

Deny from all

then what's the site for? We are talking about ordinary, public websites, right? Not ones that are restricted to your immediate friends and family whom you admit on a case-by-case basis.


 11:40 pm on Mar 25, 2012 (gmt 0)

Deny from all

It goes more like this (pseudocode, not real directives):

# first block the world with a firewall
Deny from all
# now allow invited guests only
Allow MSIE
Allow Firefox
Allow Opera
Allow Android
Allow Safari
Allow Googlebot
Allow Slurp
Allow Bingbot
Allow Yandex
Allow Twitterbot
Allow Facebook
Allow a few others...
# anyone not specifically Allowed above gets the bounce

My whitelist is a little bigger than this, maybe 20-30 entries tops.
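
The pseudocode above maps onto real mod_rewrite roughly like this: one negated alternation of allowed tokens, so that only requests matching none of them get bounced. The token list here is illustrative, not the poster's actual whitelist:

```apache
RewriteEngine On
# The ! negation means: block only if the UA contains NONE of the invited tokens
RewriteCond %{HTTP_USER_AGENT} !(MSIE|Firefox|Opera|Android|Safari|Googlebot|Slurp|bingbot|YandexBot|Twitterbot|facebookexternalhit) [NC]
# Anyone not specifically allowed above gets the bounce
RewriteRule .* - [F]
```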


 11:47 pm on Apr 5, 2012 (gmt 0)

Bill, how about giving us a sample of your code? I am interested in what you are doing!


 1:20 am on Apr 6, 2012 (gmt 0)

I'm not Bill, but here are a couple of samples:

jdMorgan from 2006 [webmasterworld.com]

Bill 2006 and a few days after Jim [webmasterworld.com]

I just provided these links on March 24, 2012 [webmasterworld.com]


 1:21 am on Apr 6, 2012 (gmt 0)

A quick follow-up. After participating earlier in this thread, I decided to take about 30 days' worth of log files, compile a list of "good" user agents (including a few cell-phone browsers), and make a whitelist. This knocked out far more invalid visitors than my previous huge blacklist, though the blacklist remains in place for now, until I have time to further refine everything.

Then I installed mod_geoip and started filtering out countries from which we receive no valid visitors and earn zero income.
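
With mod_geoip loaded, that kind of country filter can be written along these lines; the two-letter codes are placeholders, not the poster's actual list:

```apache
<IfModule mod_geoip.c>
    GeoIPEnable On
    GeoIPDBFile /usr/share/GeoIP/GeoIP.dat
</IfModule>
# Flag requests from countries that send no valid visitors
# (replace XX|YY with real ISO country codes)
SetEnvIf GEOIP_COUNTRY_CODE ^(XX|YY)$ BlockCountry
Order Allow,Deny
Allow from all
Deny from env=BlockCountry
```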

Next, I started logging gzip/deflate Accept-Encoding headers and compiled a very small whitelist of valid agents which do not use gzip/deflate. All others are blocked.
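
A sketch of that Accept-Encoding check in mod_rewrite; the whitelisted agent names are illustrative stand-ins:

```apache
RewriteEngine On
# The request doesn't advertise gzip/deflate support...
RewriteCond %{HTTP:Accept-Encoding} !(gzip|deflate) [NC]
# ...and isn't one of the few valid agents known not to send it
RewriteCond %{HTTP_USER_AGENT} !(SomeValidAgent|AnotherValidAgent) [NC]
RewriteRule .* - [F]
```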

I've been able to chop about 30% off our bandwidth and drop server loads even further without any drop in legit traffic or income.

This significant change was accomplished without even looking at proxy-related headers. I'm saving those for last, because some valid proxy data is also sent by "bad" proxies, so it's going to be a challenge to do this properly without blocking legitimate visitors.

By putting in some big crunch time just this once to determine exactly what a "valid" visitor to my site looks like, I'll spend less time in the future handling "bad" visitors, since the majority will be blocked by default.


 1:40 am on Apr 6, 2012 (gmt 0)

I'm glad this worked out for you.
For some while I've contended that 40-50% of visitors to websites are non-beneficial.

Just imagine how much spam could be reduced, and how much cost saved across the entire internet, if providers initiated these practices as preliminary steps.


 2:16 am on Apr 6, 2012 (gmt 0)

A little tip: bots that use proxies send the same bad headers as those that don't. So if you learn what the bad headers are, it won't really matter whether they come in through a proxy.

Mobile phone proxies are what you really want to focus on. Their headers are slightly different from conventional computer browsers', but easy enough to master.
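
Before blocking on proxy headers, it helps to log them for a while first. A sketch using a custom Apache log format (this goes in the main server config, not .htaccess):

```apache
# Record the common proxy-related request headers alongside client IP and UA
LogFormat "%h \"%{Via}i\" \"%{X-Forwarded-For}i\" \"%{User-Agent}i\"" proxyhdrs
CustomLog logs/proxy_headers.log proxyhdrs
```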


 2:17 am on Apr 6, 2012 (gmt 0)

Heh. While the previous six people were posting, I was off doing some number-crunching of my own. My version involved grabbing a couple days of logs, pulling out various categories of visitor and then checking them against the response they actually got.

--authorized robots by IP (google and yandex): check
--authorized bing/msnbots by IP plus UA (I don't approve of the plainclothes MSIEbots): check
--authorized minor robots by UA: check
--all requests for robots.txt, favicon.ico (Oi!! You put those favicons back!), error documents: check

--Safari, Firefox, Opera: Whoops! discrepancy of 7 after adding "AppleWebKit" to allow for MUN-man, whose Vienna RSS reader eats the browser string.
--a specific image file: check
--two specific non-indexed Forums as referers: check (I would really have to add two whole domains, they just didn't come up in these few days)

--MSIE 7+ if not from bing/msn IP: discrepancy of 6

--leftover files should all be 403 (or 404/410): discrepancy of 13

The discrepancies:

Safari/Firefox/Opera: I should have said "Opera without 'Mozilla'" (4 false positives). One more was a human blundering into an index-less directory. That leaves two visits by
"Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/20100101 Firefox/6.0"
blocked in real life by IP.

MSIE: This is a pain. Three of the six are websense, which never reuses a UA. The other three are blocked by IP; one of them would also have been blocked for a bogus .ru referer. All six are HTTP/1.0. But so is one bona fide human using MSIE 9 (!) and-- woo hoo!-- one faker who shouldn't have got in.

False negatives (visitors who really got a 200 or similar):

--4 facebooks. Different thread.
--5 + 1 that deserved to be blocked (and now are).
--2 from a take-it-or-leave-it bot. No loss-- but it might possibly be useful to my other site.
--1 from Singlehop. Now blocked; they've been around before, asking for the same isolated image file.*

Hm. Interesting exercise. But it's a pretty small sampling. I'm not so worried about blocking potentially useful robots. It's the humans with antiquated browsers and slow connections that trouble me. There's a reason those people are still using MSIE for Mac-- and I don't think the reason is that they're bonkers ;)

* Illustrating the three-syllable Inuktitut word for "Take out the garbage!" (intransitive singular imperative), in case anyone cares.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved