Ocean10000 - 7:29 am on Mar 29, 2008 (gmt 0)
The following items can be used to identify bots and to slow down or stop most unwanted traffic, if applied with proper due care.
The reason this matters: you do not want, say, Googlebot to crawl your entire site through a proxy and leave you with a duplicate content penalty, or have someone else earn money by inserting their own Google AdSense ads into your pages.
Make sure robots.txt only allows the bots you wish to crawl and index the website. I suggest only the top 3 or 4, which in my opinion are Google, Yahoo, MSN, and Ask Jeeves.
The following checks will also stop major search engines that are unknowingly crawling through a transparent proxy server, so avoiding duplicate content penalties for the website comes as a side benefit.
(A) Reverse DNS check: look up the IP to get its hostname, then check the resolved hostname against the known patterns for the search engine in question. If they do not match, mark the IP as banned and give it a proper message.
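The pattern comparison in (A) could be sketched roughly as below, in Python. The suffix table and the function name hostname_matches_engine are my own illustration, not an official list; check each search engine's documentation for the hostnames its crawler really uses. Matching on a suffix that starts with a dot is deliberate, so that something like "googlebot.com.evil.example" does not slip through.

```python
# Hostname suffixes here are illustrative assumptions; confirm them against
# each search engine's own documentation before relying on them.
KNOWN_CRAWLER_SUFFIXES = {
    "google": (".googlebot.com", ".google.com"),
    "yahoo": (".crawl.yahoo.net",),
    "msn": (".search.msn.com",),
    "ask": (".ask.com",),
}


def hostname_matches_engine(hostname, engine):
    """Return True if hostname ends with one of the suffixes listed for engine."""
    # Normalise: drop a trailing dot (fully qualified form) and ignore case.
    host = hostname.rstrip(".").lower()
    suffixes = KNOWN_CRAWLER_SUFFIXES.get(engine, ())
    return any(host.endswith(suffix) for suffix in suffixes)
```

A request claiming to be Googlebot whose IP reverses to, say, "crawl-66-249-66-1.googlebot.com" would pass this step; one that reverses to an unrelated ISP hostname would not.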
(B) Then do a forward lookup on that hostname and confirm that the list of IP addresses it resolves to contains the IP you started with.
Something to watch out for: some fake bots have their IP address resolve to a "hostname" that is just the IP address itself, which would pass the test in (B). This case must be tested for explicitly so that such results are bounced by default. For example, IP "10.0.0.1" would resolve to hostname "10.0.0.1".
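One way to catch this case, as a rough Python sketch: treat any reverse-lookup result that parses as an IP address as "no hostname". The function name hostname_is_ip_literal is just something I made up for the example.

```python
import ipaddress


def hostname_is_ip_literal(hostname):
    """Return True if the 'hostname' is really just an IPv4 or IPv6 address."""
    try:
        # ip_address() raises ValueError for anything that is not an address.
        ipaddress.ip_address(hostname.strip().rstrip("."))
    except ValueError:
        return False
    return True
```

Any visitor for which this returns True should be rejected before the forward lookup in (B) is even attempted.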
MSN, Yahoo, Google, and Ask Jeeves all support this functionality currently; others may as well. The purpose of this check is to prevent others from spoofing well-known crawlers by setting up their DNS records so that their IPs resolve to a well-known search engine hostname. Since they do not control the resolution of that hostname back to an IP, they will get caught by this check.
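Putting (A) and (B) together, a minimal Python sketch might look like this. It uses the standard socket module for the lookups; the function names and the allowed_suffixes parameter are hypothetical, and the lookup functions are passed in as arguments only so the logic can be exercised without touching real DNS. It compares addresses as plain strings, which is fine for IPv4 but may need normalising for IPv6.

```python
import socket


def reverse_lookup(ip):
    """Return the hostname the IP reverses to, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def forward_lookup(hostname):
    """Return the set of addresses the hostname resolves to (empty on failure)."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except (OSError, UnicodeError):
        return set()
    return {info[4][0] for info in infos}


def is_verified_crawler(ip, allowed_suffixes,
                        reverse=reverse_lookup, forward=forward_lookup):
    """Apply check (A), the IP-as-hostname guard, then check (B)."""
    hostname = reverse(ip)
    if not hostname:
        return False
    host = hostname.rstrip(".").lower()
    if host == ip:
        # The "hostname" is just the IP echoed back: reject by default.
        return False
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False  # (A) hostname does not match the engine's pattern
    # (B) the hostname must resolve back to the IP we started with.
    return ip in forward(host)
```

For example, is_verified_crawler(visitor_ip, (".googlebot.com",)) would be called only for requests whose user agent claims to be Googlebot. DNS lookups are slow, so it is probably worth caching the result per IP rather than repeating the check on every request.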