Forum Moderators: phranque

whitelist bots in .htaccess

smallcompany

5:27 am on Feb 28, 2010 (gmt 0)

I have searched and searched, but all I can find is advice on how to block bots, which is what I'm already doing.

For one site, I'm curious about going the opposite way. I know that a user-agent can be faked, but I would still like to block everything that doesn't at least claim to be one of the major bots, such as those of Google, Yahoo, etc.

What would be the least amount of code in .htaccess that lets only selected bots through?
How about regular visitors' UAs?

Thanks

jdMorgan

4:01 pm on Mar 1, 2010 (gmt 0)

Generally, whitelisting the major search engine robots requires anything from a few score lines of code that check only very basic user-agent-string factors, up to several hundred lines of code that check the request much more fully -- both for lexical validity (user-agent-string formatting) and for logical validity (e.g. the claimed OS versus the claimed version number) -- plus checking all of the HTTP request headers associated with that request (e.g. Accept, Accept-Encoding, Accept-Charset, Accept-Language, Connection, Via, X-Forwarded-For, From, and proprietary headers such as MJ12's Crawler-Ident header). In addition, the code may also check the claimed user-agent against the current list of known IP addresses from which that robot should be crawling.
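To make the "basic factors only" end of that range concrete, a bare-bones sketch in .htaccess might look something like the following (mod_rewrite is assumed to be available, and the robot names shown are only examples rather than a complete or current list; a real whitelist would test many more tokens and cross-check them):

RewriteEngine On
# Return 403-Forbidden unless the user-agent at least claims to be one of the listed robots
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot) [NC]
RewriteRule .* - [F]

Note that this checks nothing except the claimed name, which is exactly why it stays so short.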

This latter sort of code often has to be maintained on a daily basis, since the whitelist code will answer any new or unknown user-agent string with a 403-Forbidden response, and unfortunately, crawler engineers seem to place little value on holding their own user-agent strings to any company-standard format. Depending on how many "factors" are checked and cross-checked, I've seen examples ranging from 20 to 200 lines of code. It should also be noted that this code often contains complex regular expressions, and so is not easy to create or maintain.

However, a major problem with taking the simple (small-code) approach is that some of the most "dangerous" requests may in fact come from unknown companies or people spoofing the major search engine robots. There are many Googlebot spoofers at work right now, for example, and I suspect they will continue until Google can identify them and sue them for "trading as" Google and infringing the Google trademark...
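The usual defense against such spoofing is the reverse-and-forward DNS check that the engines themselves recommend: resolve the requesting IP address to a hostname, confirm that it ends in the engine's crawl domain, then resolve that hostname back to the same IP. One way to get that in Apache -- if the extra DNS lookups are acceptable -- is to allow by hostname rather than by address, since Apache then performs the double lookup itself (Apache 2.2-style directives shown as a sketch; the crawl domains are examples and should be verified against each engine's own documentation):

Order Deny,Allow
Deny from all
Allow from .googlebot.com
Allow from .crawl.yahoo.net
Allow from .search.msn.com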

For browsers, the problem is far worse, since there are many, many more browsers and browser variants in use, and old browser versions don't disappear as soon as they are replaced, because those user-agents are owned, controlled, and updated by many, many individual users rather than by a single company.

In addition, you've got the mobile-device user-agents to contend with, where even "real" and valid requests come with user-agent strings and HTTP request headers that do not conform to standards and look invalid. (It seems to me that mobile devices are rushed to market so quickly that no-one even checks the validity of their user-agent strings, and the UA string for one model of phone may be completely different from that of the immediately-previous model, even within the same-named "product line.")

As a result, the code for browser whitelisting can be much bigger, say 400 to 800 lines, and the regular expressions even more complex. For example, just to check the formatting of the HTTP Accept-Charset request header sent by current Mozilla-based desktop browsers, a minimally-selective regex pattern might be something like:

^((([A-Za-z][A-Za-z0-9]*([_\-][A-Za-z0-9]+)*)+(,(\*|[A-Za-z][A-Za-z0-9]+([_\-][A-Za-z0-9]+)*))*)+(;q=0\.[0-9])?)+(,\*;q=0\.[0-9])?$

Note that the length of the Accept-Charset header is not fixed and varies according to how many character sets the user has configured as acceptable, so this pattern contains several iterative (repeatable) internal sub-patterns.
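To show how a header test like that would actually be applied, it is typically wired into the same mod_rewrite rule-set, rejecting "Mozilla" claimants whose Accept-Charset header is present but fails the pattern. A simplified sketch (the character class here is deliberately looser than the pattern above, and a real rule-set would combine it with many other header tests):

RewriteCond %{HTTP_USER_AGENT} ^Mozilla
RewriteCond %{HTTP:Accept-Charset} !^$
RewriteCond %{HTTP:Accept-Charset} "!^[a-z0-9*,;=. _-]+$" [NC]
RewriteRule .* - [F]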

This is why relatively few Webmasters take on the whitelisting challenge: It may be too complex and take too much maintenance time for one Webmaster to keep up with, and far too costly to contract out to a third party.

Jim