Start with the HTTP/1.1 definitions (RFC 2616). You won't find much out there about what constitutes valid HTTP headers, if anything at all, but you can draw some conclusions from that spec about which conflicting directives shouldn't show up in the same HTTP field at the same time.
FWIW, Ocean is basically teaching "BOTS ADVANCED 302", which doesn't show up in any log files or discussion forums anywhere. The only way to get this information is to compare the HTTP headers sent by the actual tools and browsers against the spoofs and to collect a database full of HTTP header information, which is what Ocean does, to find out what's valid and invalid on the web.
That makes Ocean basically the HTTP header guru when it comes to analyzing fields like HTTP_ACCEPT and HTTP_CONNECTION and knowing how they are set by legitimate tools vs. the quick-and-dirty scripts that spoof the UA but don't set these fields properly.
Ocean does more advanced stuff with headers than even I do, but the basics are: HTTP_ACCEPT should exist and not be blank (the exception is the PS3 browser, which sends no Accept header at all, so for a PS3 its existence is what's invalid), and MSIE, Firefox and Opera are invalid if HTTP_ACCEPT is set to something like "text/html, text/plain" or just "text/html".
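To make that concrete, here's a rough sketch in Python of the HTTP_ACCEPT check. The function name and the dict-of-server-variables interface are mine for illustration, not Ocean's actual code, and real browsers vary by version, so treat the exact rules as a starting point:

    # Rough sketch of the HTTP_ACCEPT sanity check (names and exact
    # rules are illustrative; env is a dict of CGI-style variables).
    def accept_header_looks_spoofed(env):
        ua = env.get('HTTP_USER_AGENT', '')
        accept = env.get('HTTP_ACCEPT', '').strip()
        # The PS3 browser legitimately sends *no* Accept header, so
        # for a PS3 UA the presence of one is the red flag.
        if 'PLAYSTATION 3' in ua:
            return bool(accept)
        # Every other real browser sends a non-blank Accept header.
        if not accept:
            return True
        # Bare values like these never come from real MSIE, Firefox
        # or Opera builds, only from quick-and-dirty spoof scripts.
        if any(token in ua for token in ('MSIE', 'Firefox', 'Opera')):
            return accept in ('text/html', 'text/html, text/plain')
        return False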
The HTTP_CONNECTION field shouldn't have conflicting directives such as both "close" and "keep-alive" at the same time.
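A sketch of that check, same illustrative setup as above: split HTTP_CONNECTION into its comma-separated tokens and flag the request when mutually exclusive directives appear together:

    # Rough sketch of the conflicting-directive check; HTTP_CONNECTION
    # is a comma-separated token list per HTTP/1.1.
    def connection_header_conflicts(env):
        tokens = {t.strip().lower()
                  for t in env.get('HTTP_CONNECTION', '').split(',')}
        # "close" and "keep-alive" are mutually exclusive, so seeing
        # both in one request is a strong spoof signal.
        return {'close', 'keep-alive'} <= tokens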
Then we get to PROXY detection, which is fun.
If you want to track secondary IPs behind a proxy server, so that individuals can access your site via something like Google's translator without all being lumped into a single IP that quickly gets blocked for abusive-looking behavior, you have to look at fields like HTTP_VIA, HTTP_X_FORWARDED_FOR and HTTP_PROXY_CONNECTION.
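Spotting a proxied request is just a presence test on those fields. A minimal sketch, again using CGI-style variable names in a plain dict:

    # These are the fields (CGI-style names) that give away a proxy.
    PROXY_HEADERS = ('HTTP_VIA', 'HTTP_X_FORWARDED_FOR',
                     'HTTP_PROXY_CONNECTION')

    def came_through_proxy(env):
        # Any one of them being present and non-empty is enough.
        return any(env.get(h) for h in PROXY_HEADERS)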
I process the IP specified in HTTP_X_FORWARDED_FOR as the actual IP I'm tracking, which lets me stop a scraper behind a proxy while others continue to use that proxy even as I block the specific abusive activity.
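Here's roughly what that looks like. Note that X-Forwarded-For can carry a comma-separated chain of addresses, with the originating client on the left, and that the header is trivially forgeable, so it's a tracking key rather than gospel:

    def tracking_ip(env):
        # Prefer the client address in X-Forwarded-For; the left-most
        # entry in the chain is the originating client.
        xff = env.get('HTTP_X_FORWARDED_FOR', '')
        if xff:
            return xff.split(',')[0].strip()
        # No proxy header: track the connecting address itself.
        return env.get('REMOTE_ADDR', '')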
There's lots more stuff in HTTP headers you can use, but it's typically beyond the scope of what the majority of webmasters can deal with, which is why I rarely discuss anything except blocking user agents and IPs: the log files don't show this header info, and most people can't program to address it anyway.
If you want to know why we bother with this stuff: I've been having a rash of hundreds of hits per day on my site from random IPs around the world, all requesting single pages with no images, no CSS and no JS, with the UA of "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)", which appears to be a botnet. The only thing currently standing between my site and this huge distributed network of most likely hacked machines is the bad-header checking that's keeping them out.
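Pulling the pieces together, this sketch (built on the illustrative helpers above, with only the UA string taken straight from my logs) shows the kind of gate that's keeping them out:

    # The exact UA string from my logs; everything else here is a
    # sketch using the illustrative helpers defined above.
    BOTNET_UA = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)'

    def looks_like_botnet_hit(env):
        if env.get('HTTP_USER_AGENT', '') != BOTNET_UA:
            return False
        # A genuine MSIE 6 install passes both header checks; these
        # botnet requests fail them, which is what gives them away.
        return (accept_header_looks_spoofed(env)
                or connection_header_conflicts(env))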
Once they read this post...