Where available, double-rDNS lookups are implemented automatically by using a hostname in a mod_access "Allow from" or "Deny from" directive. For example:
Allow from googlebot.com
See Apache mod_access and the notes under HostnameLookups in Apache core.
Note that not all hosts will allow the use of double-rDNS lookups, in which case the only option is to use a list of googlebot IP addresses, monitor googlebot accesses, and keep your IP address list updated.
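As a rough sketch of both approaches, a directory you want to reserve for verified Googlebot might be protected like this on a host that supports the lookups (the commented-out IP range is purely illustrative; check Google's published ranges and maintain the list yourself):
# Allow only requests whose IP double-reverse-resolves to googlebot.com;
# everything else receives a 403 for this directory.
Order Deny,Allow
Deny from all
Allow from googlebot.com
#
# Fallback for hosts without double-rDNS support: allow by IP range instead.
# Example range only -- verify it and keep it current.
# Allow from 66.249.64.0/19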
Jim
from Matt's blog (read the comments section as well):
How to verify Googlebot [mattcutts.com]
from G's official blog:
How to verify Googlebot [googlewebmastercentral.blogspot.com]
this should point you in the right direction...
No rDNS checking is done for other User-agents; as shown, the code allows them unconditional access. The purpose of this code is to prevent spoofers of desirable robots from crawling your site, and to prevent desirable robots from crawling your site through the "site proxies" discussed in this thread.
Major robot User-agents which fail the rDNS check will receive a 403-Forbidden response from your server.
The code is also enclosed in an (optional) <FilesMatch> container which limits the rDNS checks to requests for .htm, .html, .shtm, .shtml, .php, .php4, and .php5 files. This is intended to reduce the number of rDNS lookups your server must request, but it does leave other filetypes open to scraping by spoofed robots. You can omit this container or adjust it to suit your needs.
<FilesMatch "\.(s?html?|php[45]?)$">
SetEnvIfNoCase User-Agent "!(Googlebot|msnbot|Teoma)" notRDNSbot
#
Order Deny,Allow
Deny from all
Allow from env=notRDNSbot
Allow from googlebot.com
Allow from search.live.com
Allow from ask.com
#
</FilesMatch>
References:
Apache core HostnameLookups directive [httpd.apache.org]
Apache mod_setenvif [httpd.apache.org]
Apache mod_access [httpd.apache.org]
Jim
Try modifying the code at the top like this, and see if it works better for you:
<FilesMatch "\.(s?html?|php[45]?)$">
SetEnv notRDNSbot
SetEnvIfNoCase User-Agent "(Googlebot|msnbot|Teoma)" !notRDNSbot
#
Order Deny,Allow
Deny from all
Allow from env=notRDNSbot
Allow from googlebot.com
Allow from search.live.com
Allow from ask.com
#
</FilesMatch>
Jim
Your other option is to allow only known Google, MSN, and Teoma IP addresses to claim that they're 'bots, as described in the Page hijacking by a proxy server can take your Google ranking [webmasterworld.com] thread.
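As a rough sketch (not taken from the cited thread), that approach might look something like the following; the IP ranges shown are placeholders only and must be replaced with ranges you have verified yourself:
# Flag any request whose User-agent claims to be one of the major robots.
BrowserMatchNoCase "(Googlebot|msnbot|Teoma)" claimsbot
#
# Deny flagged requests unless they come from a trusted IP range.
# Ordinary visitors are never flagged, so they are unaffected.
Order Deny,Allow
Deny from env=claimsbot
# Placeholder ranges -- substitute verified Google/MSN/Ask ranges here.
Allow from 66.249.64.0/19
Allow from 65.54.0.0/16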
Do make sure you've flushed your browser cache completely, though, before testing *any* change to your .htaccess code.
Jim
Anyway, does this do any kind of caching of lookups? Or does it do this for every access? IncrediBILL was saying something about doing it once every 24 hours in the other thread, but I think he's using a different method than this...
<FilesMatch "\.(s?html?|php[45]?)$">
#
BrowserMatchNoCase Googlebot rDNSbot
BrowserMatchNoCase msnbot rDNSbot
BrowserMatchNoCase Teoma rDNSbot
#
Order Deny,Allow
Deny from env=rDNSbot
Allow from googlebot.com
Allow from search.live.com
Allow from ask.com
#
# For testing only: Put your own public ISP IP address in the following
# line. This will allow you to use your IP address to spoof the above
# robots successfully, using WannaBrowser or a "User Agent Switcher"
# extension for Firefox/Mozilla browsers. After testing, remove this
# line or comment it out; You should then no longer be able to spoof.
Allow from 192.168.0.1
#
</FilesMatch>
This code cannot simply be added to your .htaccess file if you already have other Order, Allow, and Deny directives. Instead, the code must be integrated* with the existing code, and this may require a complete re-design of that existing code. That said, anything beyond the most general support for such projects goes far beyond the charter of this forum.
* For example, only one Order directive can be used in .htaccess unless great care is taken to use <Files>, <FilesMatch>, and other containers to ensure that they are mutually exclusive. Otherwise, if more than one Order directive is present, the last one that applies will be used, and this can lead to unexpected results.
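As an illustrative sketch (the filename patterns and the hostname are examples only), two Order groups can coexist when their containers can never match the same request:
# Page requests get one access-control group, as in the code above...
<FilesMatch "\.(s?html?|php[45]?)$">
Order Deny,Allow
Deny from env=rDNSbot
Allow from googlebot.com
</FilesMatch>
#
# ...while image requests get a separate group with its own Order directive.
# Because the two patterns never match the same filename, the two Order
# directives cannot conflict.
<FilesMatch "\.(gif|jpe?g|png)$">
Order Allow,Deny
Allow from all
Deny from badrobot.example.com
</FilesMatch>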
Jim
What I did, as per Jim's suggestion, was insert the code into my .htaccess file (after enabling double-DNS lookups on the server) and then check to make sure it did not block (403) normal users. Then I used the User Agent Switcher in Firefox to declare my browser to be "Googlebot" and added an Allow for my own IP address, to see if I could still connect and fetch pages; I could. So far so good.
But then I removed my IP address, so it was no longer allowed, which should have prevented me from fetching any page from my site. Unfortunately not; I was still able to fetch pages with no problem.
I think the problem with Jim's code is that it works based on the server (hostname) information provided, not the user agent. For any IP / hostname with googlebot in it, for example crawl-66-249-67-218.googlebot.com, the code WILL do a DNS check on it. BUT (and it is a huge but), if the fetch / call comes from a proxy server (e.g. i-steal-your-website-and-pr.nastyproxy.com) forwarding / proxying a request from Google on its own server, this code does not appear to block it.
I am not having a go at Jim; I am very, very grateful for his efforts and help, and hope he might find a remedy for me. But Jim did mention that he is not himself able to use this code because his server does not allow it, so he cannot test it himself. This is just for people's information, so they do not rely on this code to prevent proxy hijackers from stealing their PR and content.
If anyone can help come up with a remedy / solution that checks based on the user agent name not the server name, I for one would be very grateful.
For starters, the user-agent isn't independently matched against the hostname. Secondly, msnbot uses three hostnames that I am aware of. Finally, it's not uncommon for Ask [webmasterworld.com] and msnbot spiders to come from IPs that do not reverse-resolve to a hostname. Google IPs have also been known not to reverse-resolve at times.
As to the "holes" in the code, they are inevitable due to the limitations of Apache mod_access and my attempt to "keep it simple." However, the code will work to stop spoofers of the listed User-agents. Also, it is highly unlikely that, for example, Google will be spoofing as msnbot or Teoma, or the like.
Google, Ask, and MSN/Live have all published statements encouraging Webmasters to use rDNS to authenticate their robots. Neither I nor Apache can do anything about the fact that not all of their requests will pass a double-reverse DNS lookup, and I cannot vouch for the authenticity of those non-resolving IP address ranges they're using. You may add exceptions for IP address ranges that you believe are making authentic requests from these companies by using "Allow from <IP address or range goes here>" if you so desire.
This code, like many solutions, is "the best one can do" without a full-on access-control system. It is offered for those who need a quick fix to the currently-discussed problem, but who don't have the time, resources, or expertise to implement a more comprehensive solution.
Jim
Perfect! It works great. I went through the entire checking procedure to make sure nobody was being 403'd who I did not want blocked (real users and search engines). When I switched my user agent to "Googlebot" in Firefox, I got a 403. When I switched back to the default, no problem. Watched my logs: Google, MSN and Ask all crawling OK.
Now all I need to do I guess is put up a custom 403 page just to make sure. Do you have code for that too?
:-)
If you want to add a custom 403 page, then create an HTML page that looks somewhat like the rest of your pages and upload it. Note: I suggest in the strongest possible terms that you create and use *simple* custom error pages; do not use images, CSS, external headers and footers, external JavaScripts, PHP scripting, etc. Just a simple HTML page that has no external dependencies.
The reason I make this suggestion is that pages for 403 and 500 errors are displayed when there is a problem, and having external dependencies in your error-handling may make that problem much worse.
For example, you will have to add some code to the routine posted above to *allow* your 403 page to be fetched -- even by bad robots spoofing one of the legitimate robots. If you don't add this code, then a denied client will try to fetch the custom 403 page, and get another 403 error, because access is denied. Then it will try to fetch the custom 403 page again because of this second denial, and you'll get yet another 403 error because of that. Now your server is in a loop, which offers opportunities for a low-tech denial-of-service attack, because one denied request can trigger a cascade of 403 errors...
So, if you use a custom 403 or 500 page, make sure it is simple and stands completely alone with no external dependencies. If you absolutely must use an image or external include in your custom error document, you will need to add an "Allow" for it in the same way as for the custom 403 page itself.
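For example, if you omit the <FilesMatch> container so that all filetypes are checked, and your custom 403 page absolutely had to load an image (the filename here is hypothetical), you would add something along these lines:
# Let the error-page image be fetched even by otherwise-denied clients.
SetEnvIf Request_URI "/custom-403-logo\.gif$" AllowAll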
In order to allow serving a simple custom 403 page, you'll need to add three directives:
ErrorDocument 403 /path-to-custom-403-page.html
SetEnvIf Request_URI "/path-to-custom-403-page\.html$" AllowAll
Allow from env=AllowAll
So, pulling all of this together, the modified routine will look something like this:
ErrorDocument 403 /path-to-custom-403-page.html
#
<FilesMatch "\.(s?html?|php[45]?)$">
#
BrowserMatchNoCase Googlebot rDNSbot
BrowserMatchNoCase msnbot rDNSbot
BrowserMatchNoCase Slurp rDNSbot
BrowserMatchNoCase Teoma rDNSbot
#
SetEnvIf Request_URI "/path-to-custom-403-page\.html$" AllowAll
#
Order Deny,Allow
Deny from env=rDNSbot
Allow from env=AllowAll
Allow from googlebot.com
Allow from search.live.com
Allow from crawl.yahoo.net
Allow from ask.com
#
# For testing only: Put your own public ISP IP address in the following
# line. This will allow you to use your IP address to spoof the above
# robots successfully, using WannaBrowser or a "User Agent Switcher"
# extension for Firefox/Mozilla browsers. After testing, remove this
# line or comment it out; You should then no longer be able to spoof.
Allow from 192.168.0.1
#
</FilesMatch>
Jim
It is dangerous to redirect any visitor to any external site not under your control; who knows what the visitor might find there (nasty pages) or be subjected to (trojan/worm download). Don't be partly responsible for this possibility; just 403 the request with a terse but helpful message, and be done with it. Remember, it is not the visitor's fault if he/she finds your site through the proxy site, so just invite them to use the 'correct' URL for your site.
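As a small sketch of that idea, Apache also accepts a quoted text message in place of a page path (the Apache 2.x form is shown; the wording is just an example):
# Serve a terse, self-contained message instead of redirecting anywhere.
ErrorDocument 403 "Access denied. Please visit our site directly at its normal address."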
Jim
In my .htaccess file I also have:
RewriteCond %{HTTP_USER_AGENT} ^attach [OR]
etc.
RewriteRule ^.* - [F,L]
Is this affected by your code, do you think? I use the above to stop scraper programs. I noticed a bot called Twiceler-0.9 coming from cuill.com, which I think is a scraper, and decided to add it like this:
RewriteCond %{HTTP_USER_AGENT} ^Twiceler [OR]
But this does not work; their bot is back.
Any advice?
The "^" anchors your pattern to the start of the User-agent string, but "Twiceler" evidently does not appear at the very start of that robot's User-agent string, so the anchored pattern never matches. Either change the anchoring of your pattern, or include all of the UA string that precedes "Twiceler" in the pattern.
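A minimal sketch of the un-anchored form, assuming RewriteEngine is already on and your block ends with a final condition followed by the [F] rule you posted:
# "^attach" still matches UAs that start with "attach"; the Twiceler
# condition is left un-anchored so it matches "Twiceler" anywhere in the
# UA string, case-insensitively ([NC]).
RewriteCond %{HTTP_USER_AGENT} ^attach [OR]
RewriteCond %{HTTP_USER_AGENT} Twiceler [NC]
RewriteRule ^.* - [F,L]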
See the concise regular-expressions tutorial cited in our Forum Charter for more information.
By the way, Twiceler fetches and respects robots.txt, and that would be a more "humane" way to stop this 'bot.
Jim
thanks for your work on this solution..
i'm trying to figure out how to get this solution to work for my situation, where most pages are rewritten to directory names, but some still have .html, .php, or .pdf file extensions. any ideas?
i looked at the Directory and DirectoryMatch directives, but couldn't figure out how to combine those with the file matching directives.
Jim