| 9:26 pm on Mar 25, 2012 (gmt 0)|
The problem with discussing all the bad headers in public is that if you put out a list of what you're blocking, it's easy for the bot owners to fix it. For instance, when everyone started posting lists of bot names, all the bots started using MSIE's user agent.
They didn't want to be stopped, still don't, so there are some bot blocking trade secrets that we simply cannot post or they won't be valid in a week.
The simple fact that residential botnets have recently been employed to do the bidding of some bot herder proves that too many high value target sites are now successfully blocking data centers.
It's just like any war with an escalation of weaponry being used by both sides until we either hit an impasse or one side wins.
In this case, short of forcing all non-verified possible human traffic to use CAPTCHAs, which are easily defeated with blow-through techniques (tricking stupid humans into answering those CAPTCHAs to gain access to other sites), I'm guessing ultimately it'll be an impasse.
| 11:30 pm on Mar 25, 2012 (gmt 0)|
|Under the scheme of white-listing you're not required to ID every newcomer as you have all the doors closed. |
I think I will have to give up on this, as there is obviously something utterly fundamental that I'M NOT GETTING and it's making me very cross.
If you say
Deny from all
then what's the site for? We are talking about ordinary, public websites, right? Not ones that are restricted to your immediate friends and family whom you admit on a case-by-case basis.
| 11:40 pm on Mar 25, 2012 (gmt 0)|
Goes more like this, and this is not real code:
# first block the world with a firewall
Deny from all
# now allow invited guests only
Allow a few others...
# anyone not specifically Allowed above gets the bounce
My whitelist is a little bigger than this, maybe 20-30 entries tops.
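For anyone who wants to see the "close every door, then open a few" logic in runnable form, here is a minimal Python sketch. It is not the poster's actual setup: it assumes the whitelist entries are IP ranges, and the ranges shown are reserved documentation addresses used as placeholders. The names `ALLOWED_NETWORKS` and `is_allowed` are made up for illustration.

```python
import ipaddress

# Hypothetical whitelist. These are reserved documentation ranges,
# standing in for whatever networks you actually want to admit.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(remote_addr: str) -> bool:
    """Default deny: admit an address only if it sits inside a whitelisted network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # An address we cannot even parse gets the bounce as well.
        return False
    return any(addr in net for net in ALLOWED_NETWORKS)
```

The same shape would work if the entries were user-agent patterns instead of networks: nothing gets in unless some rule explicitly lets it in.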
| 11:47 pm on Apr 5, 2012 (gmt 0)|
Bill, how about giving us a sample of your code? I am interested in what you are doing!
| 1:20 am on Apr 6, 2012 (gmt 0)|
I'm not Bill, however here's a couple of samples:
jdMorgan from 2006 [webmasterworld.com]
Bill 2006 and a few days after Jim [webmasterworld.com]
I just provided these links on March 24, 2012 [webmasterworld.com]
| 1:21 am on Apr 6, 2012 (gmt 0)|
A quick follow-up. After participating earlier in this thread I decided to take about 30 days' worth of log files, compile a list of "good" user agents, including a few cell phone browsers, and make a white list. This knocked out far more non-valid visitors than my previous huge black list, though the black list is still in place for now until I have time to further refine everything.
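A rough sketch of how a list like that could be pulled out of the logs, in Python. This is only one way to do it and assumes Apache's "combined" log format, where the user agent is the last quoted field on each line. The function names and the `min_hits` cutoff are hypothetical, and the output still needs a human to go through it and throw out the bots.

```python
import re
from collections import Counter

# Assumes "combined" log format: the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Count how often each user agent string appears in the log lines."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def candidate_whitelist(counts, min_hits=5):
    """Return agents seen at least min_hits times, most frequent first.

    This is only a list of candidates for manual review, not a finished whitelist.
    """
    return [ua for ua, hits in counts.most_common() if hits >= min_hits]
```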
Then I installed mod_geoip and started filtering out countries from which we receive no valid visitors and earn zero income.
Next, I started logging gzip/deflate Accept-Encoding headers. I compiled a very small white list of valid agents which do not use gzip/deflate. All others are blocked.
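The Accept-Encoding rule boils down to something like the following Python sketch. It is an illustration of the logic only, not the actual server config. The agent name in `NO_COMPRESSION_WHITELIST` is a made-up placeholder, and the check deliberately ignores q-values.

```python
# Hypothetical short list of agents known to omit gzip/deflate.
# The entry below is a placeholder; build the real list from your own logs.
NO_COMPRESSION_WHITELIST = ("ExampleFeedReader/",)


def passes_encoding_check(headers: dict) -> bool:
    """Allow a request if it advertises gzip or deflate, or if its
    user agent is on the small no-compression whitelist."""
    # Header names are case-insensitive, so normalize the keys first.
    normalized = {name.lower(): value for name, value in headers.items()}
    accept = normalized.get("accept-encoding", "").lower()
    # Strip any ";q=..." parameter from each listed encoding (q-values are ignored).
    tokens = {part.split(";")[0].strip() for part in accept.split(",")}
    if tokens & {"gzip", "deflate"}:
        return True
    return normalized.get("user-agent", "").startswith(NO_COMPRESSION_WHITELIST)
```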
I've been able to chop about 30% off our bandwidth and drop server loads even further without any drop in legit traffic or income.
This significant change was accomplished without even looking at proxy-related headers. I'm saving this for last, because some valid proxy data is also used by "bad" proxies so it's going to be a challenge to do this properly without blocking legitimate visitors.
By putting in some large crunch time just this once to determine exactly what "valid" visitors are to my site I will be spending less time in the future handling "bad" visitors since the majority will be blocked by default.
| 1:40 am on Apr 6, 2012 (gmt 0)|
I'm glad this worked out for you.
For some while I've contended that 40-50 is a valid number of non-beneficial visitors to websites.
Just imagine how much spam could be reduced, and the possible reduction in costs across the entire internet, IF providers initiated these practices as preliminary steps.
| 2:16 am on Apr 6, 2012 (gmt 0)|
A little tip: Bots that use proxies send the same bad headers as those that don't. So if you learn what the bad headers are, it won't really matter if they do come in through a proxy.
Mobile phone proxies are what you really want to focus on. The headers are slightly different than conventional computer browsers, but easy enough to master.
| 2:17 am on Apr 6, 2012 (gmt 0)|
Heh. While the previous six people were posting, I was off doing some number-crunching of my own. My version involved grabbing a couple days of logs, pulling out various categories of visitor and then checking them against the response they actually got.
--authorized robots by IP (google and yandex): check
--authorized bing/msnbots by IP plus UA (I don't approve of the plainclothes MSIEbots): check
--authorized minor robots by UA: check
--all requests for robots.txt, favicon.ico (Oi! 126.96.36.199! You put those favicons back!), error documents: check
--Safari, Firefox, Opera: Whoops! discrepancy of 7 after adding "AppleWebKit" to allow for MUN-man, whose Vienna RSS reader eats the browser string.
--a specific image file: check
--two specific non-indexed Forums as referers: check (I would really have to add two whole domains, they just didn't come up in these few days)
--MSIE 7+ if not from bing/msn IP: discrepancy of 6
--leftover files should all be 403 (or 404/410): discrepancy of 13
Safari/Firefox/Opera: I should have said "Opera without 'Mozilla'" (4 false positives). One more was a human blundering into an index-less directory. That leaves two visits by
"Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/20100101 Firefox/6.0"
blocked in real life by IP.
MSIE: This is a pain. Three of the six are websense, which never reuses a UA. The other three are blocked by IP; one of them would also have been blocked for a bogus .ru referer. All six are HTTP/1.0. But so is one bona fide human using MSIE 9 (!) and-- woo hoo!-- one faker who shouldn't have got in.
False negatives (visitors who really got a 200 or similar):
--4 facebooks. Different thread.
--5 + 1 that deserved to be blocked (and now are).
--2 from a take-it-or-leave-it bot. No loss-- but it might possibly be useful to my other site.
--1 from Singlehop. Now blocked; they've been around before, asking for the same isolated image file.*
Hm. Interesting exercise. But it's a pretty small sampling. I'm not so worried about blocking potentially useful robots. It's the humans with antiquated browsers and slow connections that trouble me. There's a reason those people are still using MSIE for Mac-- and I don't think the reason is that they're bonkers ;)
* Illustrating the three-syllable Inuktitut word for "Take out the garbage!" (intransitive singular imperative), in case anyone cares.
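The cross-check in this post (sort each request into a category, decide whether that category ought to get in, then compare with the status it actually received) could be sketched in Python roughly as follows. This is not the script that produced the numbers above. It assumes "combined" log format, and the three categories in `classify` are simplified stand-ins for the longer list in the post.

```python
import re
from collections import Counter

# Assumes Apache "combined" log format.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def classify(entry):
    """Return (category, should_be_allowed). Simplified, hypothetical rules."""
    parts = entry["request"].split()
    path = parts[1] if len(parts) > 1 else ""
    if path in ("/robots.txt", "/favicon.ico"):
        return "open files", True
    browser_tokens = ("Safari", "Firefox", "Opera", "AppleWebKit")
    if any(name in entry["agent"] for name in browser_tokens):
        return "mainstream browsers", True
    return "leftovers", False


def tally_discrepancies(lines):
    """Count requests whose real status disagrees with what the rules expect."""
    discrepancies = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        entry = match.groupdict()
        category, should_allow = classify(entry)
        status = entry["status"]
        if should_allow:
            mismatch = status == "403"  # wanted in, but was blocked
        else:
            mismatch = status not in ("403", "404", "410")  # wanted out, but got through
        if mismatch:
            discrepancies[category] += 1
    return discrepancies
```

Any category with a nonzero count is where the rules on paper and the rules on the server disagree, which is the list worth reading by hand.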