Forum Moderators: open
The reason to note this is that you do not want, say, Googlebot to crawl your entire site through a proxy and get hit with a duplicate content penalty, or have someone else earn money by inserting their own Google AdSense ads.
Make sure robots.txt only allows the bots you wish to crawl and index the website. I suggest only the top 3 or 4, which in my opinion are Google, Yahoo, MSN, and Ask Jeeves.
The following checks will also stop major search engines that are unknowingly crawling through a transparent proxy server, sparing the website duplicate content penalties as a side benefit.
(A) DNS check: look up the IP to get the hostname, then check the resolved hostname against the known patterns for the search engine in question. If they do not match, mark the IP as banned and return an appropriate message.
(B) Then do a lookup on the hostname to see if it resolves back to a list of IP addresses that contains the IP you started with.
Something to watch out for: some fake bots will have their IP address resolve to a hostname that is simply the IP address itself, and thus would pass the test, so this case must be explicitly tested for and bounced by default. For example, the IP "10.0.0.1" would resolve to the "hostname" "10.0.0.1".
MSN, Yahoo, Google, and Ask Jeeves all support this functionality currently; others may as well. The purpose of this check is to prevent others from spoofing well-known crawlers by setting up their DNS records to resolve their IPs to a well-known search engine hostname; since they do not control how that hostname resolves back to an IP, they will get caught by this check.
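Roughly, the double lookup looks like this (a Python sketch; the hostname suffixes are only my assumptions about each engine's crawler hosts, so check and adjust them for the bots you actually whitelist):

import ipaddress
import socket

# Assumed crawler hostname suffixes -- verify these for the engines you allow.
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com",
                    ".crawl.yahoo.net",
                    ".search.msn.com",
                    ".ask.com")

def verify_crawler_ip(ip):
    """True only if ip reverse-resolves to an allowed hostname AND that
    hostname forward-resolves back to the same ip."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # (A) reverse lookup
    except socket.herror:
        return False

    try:                                                      # bounce "hostnames"
        ipaddress.ip_address(hostname)                        # that are just an IP
        return False
    except ValueError:
        pass

    if not hostname.lower().endswith(ALLOWED_SUFFIXES):       # pattern check
        return False

    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname) # (B) forward lookup
    except socket.gaierror:
        return False

    return ip in forward_ips                                  # must round-trip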
One thing we do, at an early stage, is check against the IP blocks of known datacenters/hosting ranges/colos. From real-life experience, most of the scrapers come from those.
These most often are: ThePlanet (EV1), nLayer, SoftLayer, GNAC, Abovenet, ISPRIME, bluehost, NOC4HOSTS, SCHLUND (1&1), KEYWEB, OVH (the French have a sense of humor), and my recent favorite of them all, Netdirekt.
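A minimal sketch of that range check, assuming you keep the ranges as CIDR blocks (the two ranges below are placeholders, not real datacenter blocks):

import ipaddress

DATACENTER_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "192.0.2.0/24",        # placeholder ranges only -- substitute your own list
    "198.51.100.0/24",
)]

def from_datacenter(ip):
    """True if the visitor's IP falls inside a known hosting/datacenter range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_RANGES)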
That cuts down on a lot of overhead. But then again, I remember someone (I think it was Martinibuster) mentioning something that stuck in my head: Give your site to scrapers, they provide backlinks....
We do, but it's custom.
There are also different kinds of scrapers. Chances are there are only a few (hundred) that are targeting your niche, so tracking where they scrape your stuff from and where they host it gets to be a lot of fun. OK, I'll stop with the ranting (Slayer stuff :) ), but more to come at a later date.
Blend27
Give your site to scrapers, they provide backlinks....
While Martini's thesis was correct at some level, it only covered a few scrapers: the "legit" scrapers trying to build actual resource sites, and they are few and far between. It's what we call a "BAD IDEA".
Giving your site to the scrapers normally results in them competing with you for your own content, trying to hijack your SERPs, or associating your site with bad keywords, nasty neighborhoods and worse.
When it comes to scrapers, just say 403.
The value of this header is usually a URL pointing to an xml file describing the supported features of the browser and phone.
Most major web browsers will send this header along with the request, which tells the web servers what the browser can accept. I have only listed the major browser providers, but this is usually safe for all known browsers except for a few mobile browsers, which is why the mobile browser checks are in place earlier.
1. Hosting providers exist that offer robust bot blocking as a value added feature or benefit of their hosting plans. (They needn't make it an across the board offer of all hosting plans.)
It seems to me that such an offer, "as a value added service" - a service that would also cut down on bandwidth and server load - would be attractive.
Of course, those hosting providers that host scrapers and bots likely wouldn't be first to offer the service. ;-P
2. Absent #1, an entity such as CPanel would offer bot identification and blocking software-as-a-service on a per-server basis, with routine automatic updates.
Under version #2, the service could offer opt-out functions, built into the bot blocking control panel, which would allow user control of IP range or block exclusions, etc.
The downside of bot blocking is the risk of blocking a friendly bot. IF the search engines are a contributing cause of bad bot activity, then one would think that the search engines, as an expression of their do-no-harm policy, would support efforts and initiatives to identify, track and block bad bots. At the very least, one would think that the search engines would provide the bot blocking services with a ready remedy to reverse the effects of any improvident blocking of their bots.
If the search engines aren't helping third-party software developers or hosting firms to identify, track and block bad bots what would be the reason(s)?
Number 3 doesn't look good to me; blocking everyone else in robots.txt does not help new search engines entering the market. From what I have seen, bad bots do not fetch robots.txt anyway. Blocking a bot that fetches robots.txt but then accesses a blocked page would be more reasonable.
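A rough sketch of that trap idea (the path name and in-memory ban list here are made up; the point is that anything requesting a URL your robots.txt disallows gets banned):

BANNED_IPS = set()                      # in memory here; use a file/DB in practice
TRAP_PREFIX = "/secret-trap/"           # hypothetical path, Disallowed in robots.txt

def allow_request(ip, path):
    """Return False if the request should be refused (403)."""
    if path.startswith(TRAP_PREFIX):
        BANNED_IPS.add(ip)              # it ignored the Disallow rule -> ban it
        return False
    return ip not in BANNED_IPS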
Even if you block a specific bot, it is trivial to crawl your site through one of the many public caching services.
4. ..... Generally I prefer to send the user a 403 status code with no further content, so as not to waste valuable bandwidth on bad bots, and not to supply the bot owners with information on how to sneak around the anti-scraping measures put in place on the website.
Yes, but the 403 itself may already be too much information for them. It may be more fun to send them a 200 with an empty page containing just a few blank characters, or a short piece of alternate content.
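Both approaches are only a few lines. A rough WSGI sketch, where is_bad_bot() is just a stand-in for whatever detection you already run:

KNOWN_BAD_AGENTS = {"BadBot/1.0"}       # hypothetical placeholder list

def is_bad_bot(environ):
    # stand-in for whatever detection you already run
    return environ.get("HTTP_USER_AGENT", "") in KNOWN_BAD_AGENTS

def application(environ, start_response):
    if is_bad_bot(environ):
        # Option A: a bare 403, no body to waste bandwidth on
        start_response("403 Forbidden", [("Content-Length", "0")])
        return [b""]
        # Option B instead: pretend all is well and serve a blank page
        # start_response("200 OK", [("Content-Type", "text/html")])
        # return [b" "]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html>the real page</html>"]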
I once even came across a public forum discussion where one scraper script kiddie complained that his scraping script might have a bug when trying to scrape www.<mysite>.com, because it was getting no content, and asked what to do about that bug.
Lol. That made my day, of course.
Kind regards,
R.
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: Teoma
Disallow:
Please note that bad bots will mostly ignore your robots.txt.
There are also lists of bad bot IPs; load them into your firewall using a shell script or a PHP script - [spamhaus.org...]
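As a sketch, loading such a list straight into iptables might look like this (it assumes you've already downloaded the list locally, one CIDR per line with ';' starting a comment, and that the script runs as root on a box with iptables):

import subprocess

def load_drop_list(path="drop.txt"):
    """Feed every CIDR range in the list to iptables as a DROP rule."""
    with open(path) as fh:
        for line in fh:
            cidr = line.split(";")[0].strip()   # strip trailing comments
            if not cidr:
                continue
            subprocess.run(["iptables", "-A", "INPUT",
                            "-s", cidr, "-j", "DROP"], check=True)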
Hope this helps someone.
I am very quiet about my security (and it's quiet to my human visitors), but one thing I don't allow is any open source spider. Don't get me wrong, I'm all for open source, but online, believe it or not, I'm mostly business minded. Commercial spiders are allowed on my site so long as they aren't redistributable in any way (open source or not).
Beyond that it only takes a little ingenuity to combat bad bots. It's not really that difficult and the logic behind it is a lot of fun. :)
- John
It's not really that difficult
Most of them aren't that difficult, but when you get into the wonderful world of commercial spybots it can become quite difficult, because they don't want to be found in the first place and have the money to hide quite effectively.
Imagine, if you would, a bot using MSIE's default user agent and 100 different IP addresses from various locations around the world: how easy would it be to spot even a sequential scan of your site that hops from service provider to service provider, or country to country?
Now imagine how easy that is to accomplish when you get your hands on a list of 6K open proxies operating from random locations around the world, which means I could probably scrape 100K pages from any web site without getting caught unless they block this proxy list.
OK, now imagine this proxy list isn't public and it's run and used by a private consortium of customers that need to operate without being detected...
If they read robots.txt, log the IP addresses and User-Agents, and whether, by the rules outlined in robots.txt, they should be banned or not. I usually assume anything that reads the robots.txt file is a bot, or someone snooping around who is up to no good.

Make sure robots.txt only allows the bots you wish to crawl and index the website. I suggest only the top 3 or 4, which in my opinion are Google, Yahoo, MSN, and Ask Jeeves.

I left out that I also have a dynamic robots.txt, similar to WebmasterWorld here, where only the bots I want are shown the version that doesn't block everything. So if they are not whitelisted to receive the unblocked robots.txt, the next file they take will get them blocked by default.
As for new search engines, I stopped being an early adopter when I had to pay the high bandwidth bill every month that allowing all the new spiders free access caused. I like to have some profits, and eat nicely, thank you very much.
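For what it's worth, a bare-bones sketch of that dynamic robots.txt idea might look like the following; the whitelist test is passed in as a callable (for example the DNS round-trip check sketched earlier in the thread), and everything about the storage is simplified:

OPEN_ROBOTS = (            # the permissive/whitelist version shown earlier in the thread
    "User-agent: Googlebot\n"
    "Disallow:\n"
)
CLOSED_ROBOTS = (          # what everyone else sees
    "User-agent: *\n"
    "Disallow: /\n"
)

robots_log = []            # IPs that fetched robots.txt -- almost always bots

def robots_txt(ip, is_verified_crawler):
    """Return the robots.txt body to serve this visitor."""
    robots_log.append(ip)                 # keep a record of who fetched it
    if is_verified_crawler(ip):           # whitelisted, verified search bot
        return OPEN_ROBOTS
    return CLOSED_ROBOTS                  # everyone else: everything disallowed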
Now imagine how easy that is to accomplish when you get your hands on a list of 6K open proxies operating from random locations around the world which means I could probably scrape 100K pages from any web site without getting caught unless they block this proxy list.

OK, now imagine this proxy list isn't public and it's run and used by a private consortium of customers that need to operate without being detected...
Unique concept!
Imagine how the scheming, collective group above (as well as many of those within "the www" who banter about the theme of "free access" or "public domain") perceive a collective group of webmasters at SSID discussing limiting the access of possible infractors? ;)
Saw where some universities were taking part in a public-domain archive of their libraries through Archive.org (I believe), rather than Google, because Google presented too many possibilities for future access restrictions and future paid (subscription) access.
On a side note, I always find it fun to feed known scrapers either mashed-up copy, or text that's completely reversed so it's backwards and mixed up, and/or to feed them their own tail through a proxy. Yes, it's mean, but so is using my content to display ads and trashing our rankings.
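A toy sketch of that sort of mangling (nothing fancy, just reverse each word and shuffle the order before the page goes out to a known scraper IP):

import random

def mangle(text):
    """Reverse each word and shuffle the order -- garbage for the scraper."""
    words = [word[::-1] for word in text.split()]
    random.shuffle(words)
    return " ".join(words)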
We recently signed up with HackerSafe, and one of the things that threw a flag in their system was disclosing ANY directories within robots.txt.
-- bot blockers then the bots would easily work around them. It is possible to make bots unidentifiable from normal visitors. --
But 99% of them (the scrapers) have no clue how to do it. I've compiled a long list of hosting/datacenter ranges over the past couple of years. Anything that comes from those gets an automatic boot. The ones that do know what they are doing would most likely rent a server and a big PIPE.
The rest get caught in later subroutines/logic, plus random spider traps.
It is possible to make bots unidentifiable from normal visitors.
They can try, but even trying to look 100% human they often fail.
The problem they have is looking too human slows the scraping process down and uses a significant amount of bandwidth just to hide their activities.
It's all a cat and mouse game; I've automatically snared some impressive bots, but they still hit bot traps and can't respond to unknown situations, which makes them vulnerable.
Any thoughts on using an IP list such as the one at [iplists.com...] to cloak a sitemap? I have noticed that most scrapers are using my sitemaps in their scripts. If I disallow all except known spider IPs from viewing the sitemaps, might this help?
Restricting access to your sitemap file is a good idea to stop unwanted bots from reading it. Whether you use the lists of IPs offered at that site is your choice. I think what you are asking is whether the public lists provided are good enough to build your sitemap access list with. Personally, I would only use those lists as a small part of the solution, not the whole of it. I use a bunch of smaller tests to validate the allowed bots, so as to weed out the fake bots without having to maintain lists of IPs for each bot.
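As a sketch, gating the sitemap can be as simple as this, again taking a crawler verification routine as a callable (for example the DNS round-trip check from earlier); anything that doesn't pass gets a bare 403:

def serve_sitemap(ip, is_verified_crawler):
    """Return (status, body) -- the real sitemap only for verified crawlers."""
    if is_verified_crawler(ip):
        with open("sitemap.xml", "rb") as fh:
            return 200, fh.read()
    return 403, b""                       # everyone else gets nothing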
Thanks, I had come to the conclusion today that I would run a risk relying on public IP lists, and had found a few pages about spider traps and such that I thought might be a better solution. Also, as you say, keeping any kind of IP list updated could be difficult.

Is it these kinds of spider traps that you use? I saw a few, such as a little test to see if the bot followed the robots.txt rules; that sounded like a good idea. Do you do stuff like that?
Yes, I use a custom-written module for Asp.Net which allows me to filter out the unwanted traffic by following the steps I outlined in my posts here, and then some.