Forum Moderators: Ocean10000 & incrediBILL & phranque

Blocking ALL robots using .htaccess?

7:54 am on Aug 6, 2007 (gmt 0)

5+ Year Member



Hi all,

I am developing a website, first locally and then on a test server so my customer can look at it. Ultimately I will move everything from the test server to the customer's website.

What I would like to accomplish is that nothing on my test server appears in any search engine's listings. My question is simple: how do I block all search engines from finding the content of my test site and subsequently showing it in their search results?

I already have the following robots.txt:


# No robots should visit this site
User-agent: *
Disallow:

But the contents of robots.txt are not always honored by robots. Access to the website can also be denied with .htaccess, but there you need to exclude robots by name. Is there a catch-all that prohibits ALL robots from accessing my test pages?

Macamba

8:04 am on Aug 6, 2007 (gmt 0)

5+ Year Member



Would this help?

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^A* [OR]
RewriteCond %{HTTP_USER_AGENT} ^a* [OR]
RewriteCond %{HTTP_USER_AGENT} ^B* [OR]
RewriteCond %{HTTP_USER_AGENT} ^b* [OR]
RewriteCond %{HTTP_USER_AGENT} ^C* [OR]
...
RewriteCond %{HTTP_USER_AGENT} ^Z* [OR]
RewriteCond %{HTTP_USER_AGENT} ^z*
RewriteRule ^(.*)$ http://www.robotstxt.org/

I have now excluded all user agents starting with any letter of the alphabet. But am I not excluding too much now, such as my own web browser?

Macamba

8:22 am on Aug 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



User-agent: *
Disallow:

This robots.txt explicitly allows all bots to crawl your site. If you want to disallow them all then it should look like this:

User-agent: *
Disallow: /

If you use .htaccess, bear in mind that you will also deny access to robots.txt itself. Many bots will then assume it does not exist and that your site is therefore okay to crawl. Of course they will keep getting access denied, and you will see lots of those entries in your log analysis, but don't complain to the bot masters: you have not provided a publicly available robots.txt that tells bots to go away.

Always allow robots.txt to be read by any bot!

7:13 pm on Aug 6, 2007 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I'd recommend using robots.txt, as described by Lord_Majestic.

If you want more protection, then password-protect the entire domain, or exclude all IP addresses except your own by checking the REMOTE_ADDR variable and denying access to every request except the one for robots.txt.
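
A minimal sketch of the second option, assuming mod_rewrite is available in .htaccess. The address 192.0.2.10 is a documentation-range placeholder standing in for your own IP, not a real value:

```apache
RewriteEngine On
# Always let robots.txt through so well-behaved bots can read it
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# Placeholder address: replace 192.0.2.10 with your own IP
RewriteCond %{REMOTE_ADDR} !^192\.0\.2\.10$
# Every other request gets 403 Forbidden
RewriteRule ^ - [F]
```

The two conditions are ANDed, so a request is refused only if it is neither for robots.txt nor from the listed address.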

Jim

11:22 am on Aug 8, 2007 (gmt 0)

5+ Year Member



Thanks Lord Majestic,

The point is, there are robots that disregard what is in robots.txt. So I thought .htaccess would be a better way, as it denies access to all spiders on the server side.

I changed my robots.txt as suggested and will wait to see what the future brings.

11:40 am on Aug 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For the bots that ignore robots.txt you will certainly need to use .htaccess or something similar. However, it is still a very good idea to have a valid robots.txt that denies access to legitimate bots, and to avoid the mistake of preventing those bots from reading robots.txt in the first place.

12:10 am on Aug 9, 2007 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



The robots protocol is not foolproof. I would not use that.

Blocking by User-agent will likely block your intended visitors too.
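
To see why: a pattern such as ^A* means "zero or more A at the start of the string", which every User-Agent satisfies, browsers included. If you did want to filter by User-Agent, a sketch would have to name specific crawlers instead. "ExampleBot" and "OtherBot" below are made-up names for illustration, not real crawlers:

```apache
RewriteEngine On
# Keep robots.txt readable for everyone
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# Hypothetical bot names, matched anywhere in the string, case-insensitively
RewriteCond %{HTTP_USER_AGENT} (ExampleBot|OtherBot) [NC]
# Matching requests get 403 Forbidden
RewriteRule ^ - [F]
```

This only stops bots that announce themselves under those names, so it is no catch-all.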

Use .htaccess and .htpasswd to set a password that has to be typed in to get access.

7:10 am on Aug 9, 2007 (gmt 0)

5+ Year Member



If I change the last line of my .htaccess, as shown in my first post, to:

RewriteRule ^(.*)$ robots.txt

do I not deny access to all directories AND inform the robots that obey robots.txt that browsing the website is not allowed? Is this a valid modification?

4:29 pm on Aug 9, 2007 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



The rewrite is a bit scary. For every requested URL, every visitor will see the content of your robots.txt file.

That isn't what you want to do.

You just made your whole website have infinite duplicate content, and stopped everyone from seeing any of the real pages on the site.

This is not a good method to block access.

Try using .htpasswd to keep everyone except authorised people out.
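
A minimal sketch of that setup, assuming Apache 2.4 with basic authentication enabled. The file path, user name and realm name are placeholders:

```apache
# Create the password file first, outside the web root, e.g.:
#   htpasswd -c /full/path/to/.htpasswd someuser
AuthType Basic
AuthName "Test site"
AuthUserFile /full/path/to/.htpasswd
Require valid-user

# Leave robots.txt open so compliant bots can still read it
<Files "robots.txt">
    Require all granted
</Files>
```

On Apache 2.2 the robots.txt exemption is written differently (Satisfy Any together with Allow from all inside the Files section).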
