Where available, double-rDNS lookups are implemented automatically by using a hostname in a mod_access "Allow from" or "Deny from" directive. For example:
Allow from googlebot.com
See Apache mod_access and the notes under HostnameLookups in Apache core.
Note that not all hosts will allow the use of double-rDNS lookups, in which case the only option is to use a list of googlebot IP addresses, monitor googlebot accesses, and keep your IP address list updated.
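As a rough sketch of both approaches, a directory you want to reserve for verified Googlebot might be protected like this on a host that supports the lookups (the commented-out IP range is purely illustrative; check Google's published ranges and maintain the list yourself):
# Allow only requests whose IP double-reverse-resolves to googlebot.com;
# everything else receives a 403 for this directory.
Order Deny,Allow
Deny from all
Allow from googlebot.com
#
# Fallback for hosts without double-rDNS support: allow by IP range instead.
# Example range only -- verify it and keep it current.
# Allow from 66.249.64.0/19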
Jim
from Matt's blog (read the comments section as well):
How to verify Googlebot [mattcutts.com]
from G's official blog:
How to verify Googlebot [googlewebmastercentral.blogspot.com]
this should point you in the right direction...
No rDNS checking is done for other User-agents; as shown, the code allows them unconditional access. The purpose of this code is to prevent spoofers of desirable robots from crawling your site, and to prevent desirable robots from crawling your site through the "site proxies" discussed in this thread.
Major robot User-agents which fail the rDNS check will receive a 403-Forbidden response from your server.
The code is also enclosed in an (optional) <FilesMatch> container which limits the rDNS checks to requests for .htm, .html, .shtm, .shtml, .php, .php4, and .php5 files. This is intended to reduce the number of rDNS lookups your server must request, but it does leave other filetypes open to scraping by spoofed robots. You can omit this container or adjust it to suit your needs.
<FilesMatch "\.(s?html?|php[45]?)$">
SetEnvIfNoCase User-Agent "!(Googlebot|msnbot|Teoma)" notRDNSbot
#
Order Deny,Allow
Deny from all
Allow from env=notRDNSbot
Allow from googlebot.com
Allow from search.live.com
Allow from ask.com
#
</FilesMatch>
References:
Apache core HostnameLookups directive [httpd.apache.org]
Apache mod_setenvif [httpd.apache.org]
Apache mod_access [httpd.apache.org]
Jim
Try modifying the code at the top like this, and see if it works better for you:
<FilesMatch "\.(s?html?|php[45]?)$">
SetEnv notRDNSbot
SetEnvIfNoCase User-Agent "(Googlebot|msnbot|Teoma)" !notRDNSbot
#
Order Deny,Allow
Deny from all
Allow from env=notRDNSbot
Allow from googlebot.com
Allow from search.live.com
Allow from ask.com
#
</FilesMatch>
Jim
Your other option is to allow only known Google, MSN, and Teoma IP addresses to claim that they're 'bots, as described in the Page hijacking by a proxy server can take your Google ranking [webmasterworld.com] thread.
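As a rough sketch (not taken from the cited thread), that approach might look something like the following; the IP ranges shown are placeholders only and must be replaced with ranges you have verified yourself:
# Flag any request whose User-agent claims to be one of the major robots.
BrowserMatchNoCase "(Googlebot|msnbot|Teoma)" claimsbot
#
# Deny flagged requests unless they come from a trusted IP range.
# Ordinary visitors are never flagged, so they are unaffected.
Order Deny,Allow
Deny from env=claimsbot
# Placeholder ranges -- substitute verified Google/MSN/Ask ranges here.
Allow from 66.249.64.0/19
Allow from 65.54.0.0/16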
Do make sure you've flushed your browser cache completely, though, before testing *any* change to your .htaccess code.
Jim
Anyway, does this do any kind of caching of lookups? Or does it do this for every access? IncrediBILL was saying something about doing it once every 24 hours in the other thread, but I think he's using a different method than this...
<FilesMatch "\.(s?html?|php[45]?)$">
#
BrowserMatchNoCase Googlebot rDNSbot
BrowserMatchNoCase msnbot rDNSbot
BrowserMatchNoCase Teoma rDNSbot
#
Order Deny,Allow
Deny from env=rDNSbot
Allow from googlebot.com
Allow from search.live.com
Allow from ask.com
#
# For testing only: Put your own public ISP IP address in the following
# line. This will allow you to use your IP address to spoof the above
# robots successfully, using WannaBrowser or a "User Agent Switcher"
# extension for Firefox/Mozilla browsers. After testing, remove this
# line or comment it out; You should then no longer be able to spoof.
Allow from 192.168.0.1
#
</FilesMatch>
This code cannot simply be added to your .htaccess file if you already have other Order, Allow, and Deny directives. Instead, the code must be integrated* with the existing code, and this may require a complete re-design of that existing code. That said, anything beyond the most general support for such projects goes far beyond the charter of this forum.
* For example, only one Order directive can be used in .htaccess unless great care is taken to use <Files>, <FilesMatch>, and other containers to ensure that they are mutually exclusive. Otherwise, if more than one Order directive is present, the last one that applies will be used, and this can lead to unexpected results.
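As an illustrative sketch (the filename patterns and the hostname are examples only), two Order groups can coexist when their containers can never match the same request:
# Page requests get one access-control group, as in the code above...
<FilesMatch "\.(s?html?|php[45]?)$">
Order Deny,Allow
Deny from env=rDNSbot
Allow from googlebot.com
</FilesMatch>
#
# ...while image requests get a separate group with its own Order directive.
# Because the two patterns never match the same filename, the two Order
# directives cannot conflict.
<FilesMatch "\.(gif|jpe?g|png)$">
Order Allow,Deny
Allow from all
Deny from badrobot.example.com
</FilesMatch>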
Jim
What I did, as per Jim's suggestion, was insert the code into my .htaccess file (after enabling double-DNS lookups on the server) and then check to make sure it did not block (403) normal users. Then I used the User Agent Switcher in Firefox to declare my browser to be "Googlebot" and added an Allow for my own IP address, to see if I could still connect and fetch pages; I could. So far so good.
But then I removed my IP address, so it was no longer allowed, which should have prevented me from fetching any page from my site. Unfortunately not; I was still able to fetch pages with no problem.
I think the problem with Jim's code is that it works based on the server (hostname) information provided, not the user agent. For any IP / hostname with googlebot in it, for example crawl-66-249-67-218.googlebot.com, the code WILL do a DNS check on it. BUT (and it is a huge but), if the fetch / call comes from a proxy server (e.g. i-steal-your-website-and-pr.nastyproxy.com) forwarding / proxying a request from Google on its own server, this code does not appear to block it.
I am not having a go at Jim; I am very, very grateful for his efforts and help, and hope he might find a remedy for me. But Jim did mention that he is not himself able to use this code because his server does not allow it, so he cannot test it himself. This is just for people's information, so they do not rely on this code to prevent proxy hijackers from stealing their PR and content.
If anyone can help come up with a remedy / solution that checks based on the user agent name not the server name, I for one would be very grateful.
For starters, the user-agent isn't independently matched against the hostname. Secondly, msnbot uses three hostnames that I am aware of. Finally, it's not uncommon for Ask [webmasterworld.com] and msnbot spiders to come from IPs that do not reverse-resolve to a hostname. Google IPs have also been known not to reverse-resolve at times.
As to the "holes" in the code, they are inevitable due to the limitations of Apache mod_access and my attempt to "keep it simple." However, the code will work to stop spoofers of the listed User-agents. Also, it is highly unlikely that, for example, Google will be spoofing as msnbot or Teoma, or the like.
Google, Ask, and MSN/Live have all published statements encouraging Webmasters to use rDNS to authenticate their robots. Neither I nor Apache can do anything about the fact that not all of their requests will pass a double-reverse DNS lookup, and I cannot vouch for the authenticity of those non-resolving IP address ranges they're using. You may add exceptions for IP address ranges that you believe are making authentic requests from these companies by using "Allow from <IP address or range goes here>" if you so desire.
This code, like many solutions, is "the best one can do" without a full-on access-control system. It is offered for those who need a quick fix to the currently-discussed problem, but who don't have the time, resources, or expertise to implement a more comprehensive solution.
Jim
Perfect! It works great. I went through the entire checking procedure to make sure nobody was being 403'd who I did not want blocked (real users and search engines). When I switched my user agent to "Googlebot" in Firefox, I got a 403. When I switched back to the default, no problem. Watched my logs: Google, MSN and Ask all crawling OK.
Now all I need to do I guess is put up a custom 403 page just to make sure. Do you have code for that too?
:-)
If you want to add a custom 403 page, then create an HTML page that looks somewhat like the rest of your pages and upload it. Note: I suggest in the strongest possible terms that you create and use *simple* custom error pages; do not use images, CSS, external headers and footers, external JavaScripts, PHP scripting, etc. Just a simple HTML page that has no external dependencies.
The reason I make this suggestion is that pages for 403 and 500 errors are displayed when there is a problem, and having external dependencies in your error-handling may make that problem much worse.
For example, you will have to add some code to the routine posted above to *allow* your 403 page to be fetched -- even by bad robots spoofing one of the legitimate robots. If you don't add this code, then a denied client will try to fetch the custom 403 page, and get another 403 error, because access is denied. Then it will try to fetch the custom 403 page again because of this second denial, and you'll get yet another 403 error because of that. Now your server is in a loop, which offers opportunities for a low-tech denial-of-service attack, because one denied request can trigger a cascade of 403 errors...
So, if you use a custom 403 or 500 page, make sure it is simple and stands completely alone with no external dependencies. If you absolutely must use an image or external include in your custom error document, you will need to add an "Allow" for it in the same way as for the custom 403 page itself.
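For example, if you omit the <FilesMatch> container so that all filetypes are checked, and your custom 403 page absolutely had to load an image (the filename here is hypothetical), you would add something along these lines:
# Let the error-page image be fetched even by otherwise-denied clients.
SetEnvIf Request_URI "/custom-403-logo\.gif$" AllowAll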
In order to allow serving a simple custom 403 page, you'll need to add three directives:
ErrorDocument 403 /path-to-custom-403-page.html
SetEnvIf Request_URI "/path-to-custom-403-page\.html$" AllowAll
Allow from env=AllowAll
So, pulling all of this together, the modified routine will look something like this:
ErrorDocument 403 /path-to-custom-403-page.html
#
<FilesMatch "\.(s?html?|php[45]?)$">
#
BrowserMatchNoCase Googlebot rDNSbot
BrowserMatchNoCase msnbot rDNSbot
BrowserMatchNoCase Slurp rDNSbot
BrowserMatchNoCase Teoma rDNSbot
#
SetEnvIf Request_URI "/path-to-custom-403-page\.html$" AllowAll
#
Order Deny,Allow
Deny from env=rDNSbot
Allow from env=AllowAll
Allow from googlebot.com
Allow from search.live.com
Allow from crawl.yahoo.net
Allow from ask.com
#
# For testing only: Put your own public ISP IP address in the following
# line. This will allow you to use your IP address to spoof the above
# robots successfully, using WannaBrowser or a "User Agent Switcher"
# extension for Firefox/Mozilla browsers. After testing, remove this
# line or comment it out; You should then no longer be able to spoof.
Allow from 192.168.0.1
#
</FilesMatch>
Jim
It is dangerous to redirect any visitor to any external site not under your control; who knows what the visitor might find there (nasty pages) or be subjected to (trojan/worm download). Don't be partly responsible for this possibility; just 403 the request with a terse but helpful message, and be done with it. Remember, it is not the visitor's fault if he/she finds your site through the proxy site, so just invite them to use the 'correct' URL for your site.
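As a small sketch of that idea, Apache also accepts a quoted text message in place of a page path (the Apache 2.x form is shown; the wording is just an example):
# Serve a terse, self-contained message instead of redirecting anywhere.
ErrorDocument 403 "Access denied. Please visit our site directly at its normal address."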
Jim
In my .htaccess file I also have:
RewriteCond %{HTTP_USER_AGENT} ^attach [OR]
etc.
RewriteRule ^.* - [F,L]
Is this affected by your code, do you think? I use the above to stop scraper programs. I noticed a bot called Twiceler-0.9 coming from cuill.com, which I think is a scraper, and decided to add it like this:
RewriteCond %{HTTP_USER_AGENT} ^Twiceler [OR]
But this does not work; their bot is back.
Any advice?
The "^" anchors your pattern to the start of the User-agent string, but "Twiceler" evidently does not appear at the very start of that robot's User-agent string, so the anchored pattern never matches. Either change the anchoring of your pattern, or include all of the UA string that precedes "Twiceler" in the pattern.
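A minimal sketch of the un-anchored form, assuming RewriteEngine is already on and your block ends with a final condition followed by the [F] rule you posted:
# "^attach" still matches UAs that start with "attach"; the Twiceler
# condition is left un-anchored so it matches "Twiceler" anywhere in the
# UA string, case-insensitively ([NC]).
RewriteCond %{HTTP_USER_AGENT} ^attach [OR]
RewriteCond %{HTTP_USER_AGENT} Twiceler [NC]
RewriteRule ^.* - [F,L]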
See the concise regular-expressions tutorial cited in our Forum Charter for more information.
By the way, Twiceler fetches and respects robots.txt, and that would be a more "humane" way to stop this 'bot.
Jim
thanks for your work on this solution..
i'm trying to figure out how to get this solution to work for my situation, where most pages are rewritten to directory names, but some still have .html, .php, or .pdf file extensions. any ideas?
i looked at the Directory and DirectoryMatch directives, but couldn't figure out how to combine those with the file matching directives.
Jim