Forum Moderators: open
One of these visitors uses a PlayStation 3 to browse with. So my job was to figure out what criteria to validate against to rule out a bot spoofing the PlayStation 3 to get in. The following is what I came up with.
The User-Agent in question is "Mozilla/5.0 (PLAYSTATION 3; 1.00)".
The PS3 browser does not send an "Accept" header, which is unusual: most major browsers and bots always supply it. One of my standard tests for spotting a spoofer is to check whether the "Accept" header is missing, which usually means a bot is trying to hide behind a well-known User-Agent.
Here are a few examples of x-ps3-browser header values taken from my library to date.
"1.30 (WP; system=1.32)"
"1.70 (WP; system=1.70)"
"1.80 (WP; system=1.81)"
"1.90 (WP; system=1.90)"
"2.10 (WP; system=2.10)"
You have missed the newest BIOS versions in the x-ps3-browser string. The newest is 2.20, with the string "2.20 (WP; system=2.20)", and I remember there was a 2.16 or 2.17 for a short time a few weeks ago. The PlayStation 3 updates its BIOS rapidly because some in-game features stop working if the current BIOS is older than the version available online.
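Putting the pieces above together, a minimal sketch of the PS3 check might look like the following. This is illustrative only, not anyone's production code: the header names and the version-string pattern are assumptions drawn from the examples quoted in this thread.

```python
import re

# Assumed from this thread: the genuine PS3 UA string and the shape of
# the x-ps3-browser values quoted above (e.g. "2.20 (WP; system=2.20)").
PS3_UA = "Mozilla/5.0 (PLAYSTATION 3; 1.00)"
PS3_BROWSER_RE = re.compile(r'^\d\.\d\d \(WP; system=\d\.\d\d\)$')

def looks_like_real_ps3(headers):
    """headers: dict of raw HTTP request headers."""
    h = {k.lower(): v for k, v in headers.items()}
    if h.get("user-agent") != PS3_UA:
        return False
    # A genuine PS3 browser sends no Accept header at all,
    # so its presence marks a spoofer.
    if "accept" in h:
        return False
    # It does send an x-ps3-browser header matching the pattern above.
    return bool(PS3_BROWSER_RE.match(h.get("x-ps3-browser", "")))
```

Note that the version list keeps growing with firmware updates, so matching the general pattern is more durable than whitelisting exact strings.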
Here's the HTTP/1.1 definitions:
[w3.org...]
You won't find much about what constitutes valid HTTP headers out there, if anything at all, but you can draw some conclusions from that spec about which conflicting directives shouldn't show up in the same HTTP field at the same time.
FWIW, Ocean is basically teaching "BOTS ADVANCED 302", which doesn't show up in any log files or discussion forums anywhere. The only way to get this information is to compare the HTTP headers sent by the actual tools and browsers against the spoofs and to collect a database full of HTTP header information, which Ocean does, to find out what's valid and invalid on the web.
That makes Ocean basically the HTTP header guru when it comes to analyzing fields like HTTP_ACCEPT and HTTP_CONNECTION, and knowing how they are set by legitimate tools vs. the quick-and-dirty scripts that spoof the UA but don't set these fields properly.
Ocean does more advanced stuff with headers than even I do, but the basics are: HTTP_ACCEPT should exist and not be blank (unless you're a PS3, in which case its mere presence is invalid), and MSIE, Firefox, and Opera are invalid if HTTP_ACCEPT is set to something minimal like "text/html, text/plain" or "text/html".
The HTTP_CONNECTION field shouldn't have conflicting directives such as both "close" and "keep-alive" at the same time.
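These two sanity checks can be sketched roughly as follows. This is a simplified illustration, not Ocean's actual method; the list of "suspicious" minimal Accept values comes from this thread, not from any spec.

```python
# Minimal Accept values that real MSIE/Firefox/Opera never send
# (per the discussion above; assumption, not a spec rule).
SUSPECT_ACCEPTS = {"text/html", "text/html, text/plain"}

def headers_look_spoofed(user_agent, accept, connection):
    ua = user_agent.upper()
    is_major_browser = any(b in ua for b in ("MSIE", "FIREFOX", "OPERA"))
    # Major browsers always send a non-blank Accept header...
    if is_major_browser and not accept:
        return True
    # ...and never one of the minimal values above.
    if is_major_browser and accept in SUSPECT_ACCEPTS:
        return True
    # Connection must not carry conflicting directives at once.
    tokens = {t.strip().lower() for t in (connection or "").split(",")}
    if {"close", "keep-alive"} <= tokens:
        return True
    return False
```

A real implementation would need many more rules per browser family; the point is only that each check is a cheap string test on headers the spoofing scripts forget to fake.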
Then we get to PROXY detection which is fun.
If you want to track secondary IPs behind a proxy server, so you can allow individuals to access your site (via Google's translator, for instance) without lumping them all into a single IP that quickly gets blocked for abusive-looking behavior, you have to look at headers like HTTP_VIA, HTTP_X_FORWARDED_FOR, and HTTP_PROXY_CONNECTION.
I process the IP specified in HTTP_X_FORWARDED_FOR as the actual IP I'm tracking, which lets me stop a scraper on a proxy while allowing others to keep using that same proxy even as I block the specific abusive activity.
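A minimal sketch of that idea, assuming the proxy itself is trusted. One important caveat: X-Forwarded-For is trivially forgeable by the client, so it should only feed rate-tracking and throttling decisions, never access control on its own.

```python
def effective_ip(remote_addr, x_forwarded_for=None):
    """Return the IP to track: the client behind the proxy if the
    X-Forwarded-For header is present, else the connecting address."""
    if x_forwarded_for:
        # The left-most entry is the original client; later entries
        # are proxies the request passed through.
        return x_forwarded_for.split(",")[0].strip()
    return remote_addr
```

Keyed on this value instead of REMOTE_ADDR, a block list stops the one abusive client without blacklisting the whole proxy.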
Lots more stuff in HTTP headers you can use, but it's typically beyond the scope of most webmasters to deal with, which is why I rarely discuss anything except blocking user agents and IPs: the log files don't show this info, and most can't program to address it anyway.
If you want to know why we bother with this stuff: I've been getting hundreds of hits per day from random IPs around the world, each requesting a single page with no images, no CSS, no JS, all with the UA "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)", which appears to be a botnet. Checking for the bad headers is currently the only thing standing between my site and this huge distributed network of most likely hacked machines.
Once they read this post...
[edited by: incrediBILL at 7:10 pm (utc) on April 6, 2008]
You can use something like the apache_request_headers [php.net] function to get an associative array of the headers that have been sent with the HTTP request for the page. After you have the headers in your array, you can perform whatever tests you like on them.
By the way, this is really a great thread. Thanks, Ocean!
RewriteCond %{HTTP_ACCEPT} [,\ ]?text/html[;,]? [NC]
I also threw in an [NC] flag in case some user-agents use uppercase characters.
Jim
NC = No Case or Ignore Case so any variety of upper/lower case will be caught
RewriteCond %{HTTP_ACCEPT} ^text/html$ [NC,OR]
RewriteCond %{HTTP_ACCEPT} ^text/html,\ text/plain$ [NC]
RewriteRule !^403.*\.html$ - [F]
Jim pointed out that I needed to escape the space as "\ ", which I had missed.
However, you want to leave the pattern anchored, as in my original example and this one, because a floating "text/html" could incorrectly match many things and generate lots of false positives.
Example of where a false positive would occur:
"MJ12bot/v1.0.8 (http://majestic12.co.uk/bot.php?+)"
HTTP_ACCEPT=text/html,text/plain,text/xml,text/*,application/xml,application/xhtml+xml
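You can see the difference with a quick regex test against that MJ12bot Accept value. This is a Python illustration of the matching behavior, not the Apache rules themselves; RewriteCond patterns are likewise searched anywhere in the value unless anchored.

```python
import re

# The MJ12bot Accept header from the example above:
accept = "text/html,text/plain,text/xml,text/*,application/xml,application/xhtml+xml"

# Floating pattern (like the first RewriteCond): matches anywhere in the value.
floating = re.compile(r'[, ]?text/html[;,]?', re.IGNORECASE)

# Anchored patterns (like the corrected RewriteCond pair): whole value only.
anchored = re.compile(r'^text/html$|^text/html, text/plain$', re.IGNORECASE)

print(bool(floating.search(accept)))   # True  -> false positive, MJ12bot blocked
print(bool(anchored.search(accept)))   # False -> the legitimate bot passes
```

The floating pattern fires on the "text/html," at the start of MJ12bot's perfectly normal Accept header, while the anchored pair only fires on the exact minimal values the spoofers send.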
Thanks again to Jim for the proper syntax updates.
[edited by: incrediBILL at 3:29 am (utc) on April 7, 2008]