Forum Moderators: phranque


Need to ban Google from getting .php extensions

I have it on robots.txt, mod_rewrite is just in case


walkman

1:37 am on Oct 9, 2005 (gmt 0)



I recently got Google Sitemaps, and after validating my site, Google showed me that it had tried to fetch many of my .php pages. It respected robots.txt this time, but I'd rather double up and block Google from getting them via a rewrite too.

I have the .php files rewritten to another extension, and if Googlebot showed up and ignored robots.txt (known to have happened), I'm in trouble. Google must have gotten them from the toolbar, since there are no links to them and no one has any idea of the .php extension.

Can it be done and does anyone have any suggestions?

thanks again,

jd01

9:11 am on Oct 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would need a little more information, but there are a couple of options:

1. Redirect the php files to the static equivalents, so even if they are requested, any link or browser request will be redirected to the correct static page. (This is done with THE_REQUEST to avoid a loop.)

Option 1 needs *much* more information -- the entire .htaccess you are running, and all possible query_string patterns.
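As a rough illustration only (the page names here are placeholders, and a real rule set would have to account for your existing rewrites and query strings), a redirect of this kind usually looks something like:

```apache
# Hypothetical sketch: an external request for page.php is redirected to
# its static-looking equivalent. Matching against THE_REQUEST (the
# original client request line) prevents a loop when page.html is later
# rewritten back to page.php internally.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /page\.php\ HTTP/
RewriteRule ^page\.php$ http://www.example.com/page.html [R=301,L]
```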

2. Deny all requests to php pages, and do not worry about the redirect. (This is again done with THE_REQUEST, and is the one I personally use most of the time, because it offers some protection for the files that run my sites.)

Option 2 is easy:

RewriteCond %{THE_REQUEST} .
RewriteRule \.php - [F]

The rule in this case is not anchored, so any request containing .php will match the pattern. The condition just checks for a single character, so we can define it as an original request. If it is an original request (link, typed in a browser, etc.) a forbidden error will be generated. If the request is secondary (internal, rewritten to, etc.) the condition will fail and the page will be served. This way, you can stop external access to your php files, but can still use them to serve the information to the static locations.
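For example (file names invented for illustration, and with \.php as the THE_REQUEST pattern, which is the variant this thread eventually settles on), the deny rule can sit alongside an internal rewrite like this:

```apache
# Hypothetical pairing: direct external requests for .php URLs are
# refused, but the same script still serves the static-looking URL.
# THE_REQUEST holds the original client request line, so it never
# contains .php when the rewrite below happens internally.
RewriteCond %{THE_REQUEST} \.php
RewriteRule \.php - [F]

# Internal rewrite: /widgets.html is served by /widgets.php
RewriteRule ^widgets\.html$ /widgets.php [L]
```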

If you want to redirect the pages instead, please let us know.

Hope this helps.

Justin

walkman

2:14 pm on Oct 9, 2005 (gmt 0)



Hi Justin,
the second option is perfect. The only thing is that a few pages like "Send this Page" are still .php, for users only. Is it possible to make it so that just Googlebot, MSNbot and Slurp are denied the .php files?
Something like this:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Msnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Slurp [NC,OR]
RewriteCond %{THE_REQUEST} .
RewriteRule \.php - [F]

I'm not sure if the last two lines fit in.
thanks again,

jd01

5:24 pm on Oct 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, you could do it that way, but as with cloaking, there are no guarantees you will pick them up by user-agent alone. What I usually do is set another condition that allows access to those files, and noindex,nofollow,noarchive them in the meta tags if necessary.

RewriteCond %{THE_REQUEST} .
RewriteCond %{REQUEST_URI} !(thispage|anotherpage|somepage)\.php
RewriteRule \.php - [F]

I have left the second condition unanchored, because I do not know the path to the files, but you could use the full path like this:

RewriteCond %{REQUEST_URI} !^/(somedir/thispage|another/dir/anotherpage|somepage)\.php

I added the condition after the check for an original request, so we will not test all internal requests against the REQUEST_URI condition (IOW internal requests will fail the first condition sooner and free up a little processing), but they can go in either order.

Hope this helps.

Justin

walkman

7:05 pm on Oct 9, 2005 (gmt 0)



Hi Justin,
sorry to bother you again. The user-agent approach is fine. I know it's not perfect, but between this and robots.txt I think I will be fine. The chances of Google ignoring robots.txt and using a different name or technique at the same time are really small.

I tried this:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Msnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Slurp [NC,OR]
RewriteCond %{THE_REQUEST} .
RewriteRule \.php - [F]

but my entire site is 403'd (as a regular user too).

any ideas?

thanks again

jd01

7:27 pm on Oct 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You will need to remove the OR from the last user-agent condition; it should use the implicit AND.

If that does not work, change the THE_REQUEST condition to \.php -- it should not be needed, but maybe I am missing something.

Justin

walkman

7:52 pm on Oct 9, 2005 (gmt 0)



I tested this:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [NC]
RewriteCond %{HTTP_USER_AGENT} ^Msnbot [NC]
RewriteCond %{HTTP_USER_AGENT} ^Slurp [NC]
RewriteCond %{THE_REQUEST} .
RewriteRule \.php - [F]

with a tool that emulates a user agent, and it still shows 200 as opposed to 403. I tried several .php files and even replaced {THE_REQUEST} just in case. The same tool shows a ban (403) for the same bots (across an entire domain) with this code:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Googlebot.*$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FAST-Crawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mediapartners-Google [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Msnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Msnbot.*$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Teoma [NC]
RewriteRule .* - [F,L]

I will look at it again later on--with a clearer head I hope ;).

jd01

8:06 pm on Oct 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sorry, I should have been clearer...

Should be this:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Msnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Slurp [NC]
RewriteCond %{THE_REQUEST} .
RewriteRule \.php - [F]

OR

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Msnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Slurp [NC]
RewriteCond %{THE_REQUEST} \.php
RewriteRule \.php - [F]

Justin

BTW I find it easier to test with the Firefox user-agent switcher extension than with most sites -- it allows you to set the user-agent to anything you want.

Added: This will still allow all regular users access to all php files; the best way to overcome that (if necessary) is to remove the user-agent conditions and use the specific-files condition instead. I should have noted this before, sorry -- trouble communicating clear thought today =)
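Putting the pieces together, a version without the user-agent conditions might look like this (the excepted file names are placeholders for user-facing pages such as a "Send this Page" script):

```apache
# Hypothetical combined rule set: deny direct external requests for any
# .php URL, except the listed user-facing pages. Internal rewrites to
# .php files are unaffected, since THE_REQUEST keeps the original URL.
RewriteEngine On
RewriteCond %{THE_REQUEST} \.php
RewriteCond %{REQUEST_URI} !(sendpage|printpage)\.php
RewriteRule \.php - [F]
```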

walkman

10:06 pm on Oct 9, 2005 (gmt 0)



Thank you very much Justin,
this worked perfectly. Thanks again.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Msnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Slurp [NC]
RewriteCond %{THE_REQUEST} \.php
RewriteRule \.php - [F]