Apache Web Server Forum

    
Blocking ALL robots using .htaccess?
block robots .htaccess
Macamba




msg:3414624
 7:54 am on Aug 6, 2007 (gmt 0)

Hi all,

I am developing a website, first locally, then on a test server so my customer can look at it. Ultimately I will move everything from the test server to the customer's website.

What I would like to accomplish on my test server is that my work does not appear in the listings of any search engine. My question is simple: how do I block all search engines from finding the content of my test site and subsequently showing it in their search results?

I already have the following robots.txt:

# No robots should visit this site
User-agent: *
Disallow:

But the contents of robots.txt are not always honored by robots. .htaccess can also prohibit access to the website, but there you need to exclude robots by name. Is there a catch-all that prohibits ALL robots from accessing my test pages?

Macamba

 

Macamba




msg:3414630
 8:04 am on Aug 6, 2007 (gmt 0)

Would this help?

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^A* [OR]
RewriteCond %{HTTP_USER_AGENT} ^a* [OR]
RewriteCond %{HTTP_USER_AGENT} ^B* [OR]
RewriteCond %{HTTP_USER_AGENT} ^b* [OR]
RewriteCond %{HTTP_USER_AGENT} ^C* [OR]
...
RewriteCond %{HTTP_USER_AGENT} ^Z* [OR]
RewriteCond %{HTTP_USER_AGENT} ^z*
RewriteRule ^(.*)$ http://www.robotstxt.org/

I have now excluded all user agents starting with any letter of the alphabet. But do I not exclude too much now? Like my own web browser?

Macamba

Lord Majestic




msg:3414639
 8:22 am on Aug 6, 2007 (gmt 0)

User-agent: *
Disallow:

This robots.txt explicitly allows all bots to crawl your site. If you want to disallow them all then it should look like this:

User-agent: *
Disallow: /

If you use .htaccess, consider that you will also deny access to robots.txt itself. Many bots will then assume it does not exist and that your site is therefore okay to crawl. Of course they will keep getting access denied, and you will see lots of those entries in your log analysis, but don't complain to the bot masters: you have not provided a publicly available robots.txt telling bots to go away.

Always allow robots.txt to be read by any bot!

jdMorgan




msg:3415116
 7:13 pm on Aug 6, 2007 (gmt 0)

I'd recommend using robots.txt, as described by Lord_Majestic.

If you want more protection, then password-protect the entire domain, or exclude all IP addresses except your own by checking the REMOTE_ADDR variable and denying access to every request except those for robots.txt.
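
A minimal sketch of the REMOTE_ADDR approach, assuming mod_rewrite is enabled and the rules live in the .htaccess file at the site root. The address 203.0.113.10 is a placeholder for your own IP, not something taken from this thread:

```apache
# Sketch only: 203.0.113.10 is a placeholder address.
RewriteEngine On
# Let every client fetch robots.txt so compliant bots can read the Disallow rule
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# Skip the block for the developer's own address
RewriteCond %{REMOTE_ADDR} !^203\.0\.113\.10$
# Everything else gets 403 Forbidden
RewriteRule ^ - [F]
```

Both conditions must hold for the rule to fire, so a request is refused only when it is not for robots.txt and does not come from the listed address.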

Jim

Macamba




msg:3416663
 11:22 am on Aug 8, 2007 (gmt 0)

Thanks Lord Majestic,

The point is, there are robots that disregard what is in robots.txt. So I thought .htaccess would be a better way, as it denies access to all spiders on the server side.

I changed my robots.txt as suggested, and will wait to see what the future brings.

Lord Majestic




msg:3416672
 11:40 am on Aug 8, 2007 (gmt 0)

For the bots that ignore robots.txt you will certainly need to use .htaccess or something similar. However, it is still a very good idea to have a valid robots.txt that denies access to legitimate bots, and to avoid the mistake of blocking those bots from reading robots.txt in the first place.

g1smd




msg:3417464
 12:10 am on Aug 9, 2007 (gmt 0)

The robots protocol is not foolproof. I would not use that.

Blocking by User-agent will likely block your intended visitors too.
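
To illustrate why the alphabet rules posted earlier catch browsers too, here is a rough sketch of a narrower match. ExampleBot and OtherBot are hypothetical placeholder names, and the rules assume mod_rewrite in a root-level .htaccess:

```apache
RewriteEngine On
# "^A*" means "zero or more A at the start", which every User-Agent string
# satisfies, including a browser's. Matching named tokens is narrower.
# ExampleBot and OtherBot are placeholder names, not real crawlers.
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{HTTP_USER_AGENT} (ExampleBot|OtherBot) [NC]
RewriteRule ^ - [F]
```

Even this only stops bots that identify themselves honestly, since the User-Agent header is whatever the client chooses to send.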

Use .htaccess and .htpasswd to set a password that has to be typed in to get access.
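
A rough sketch of what that could look like, assuming Basic authentication is available and allowed in .htaccess. The AuthUserFile path, the realm name, and the Apache 2.4 style exception for robots.txt are assumptions, not settings posted in this thread:

```apache
# Sketch only: the path and realm name are placeholders.
AuthType Basic
AuthName "Test site"
AuthUserFile /path/to/.htpasswd
Require valid-user

# Assumes Apache 2.4 syntax: keep robots.txt readable without a password
<Files "robots.txt">
    Require all granted
</Files>
```

The password file can be created with `htpasswd -c /path/to/.htpasswd username`. On Apache 2.2 the robots.txt exception would instead use `Satisfy Any` together with `Allow from all`.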

Macamba




msg:3417639
 7:10 am on Aug 9, 2007 (gmt 0)

If I change the last line of my .htaccess, as shown in my earlier post, to:

RewriteRule ^(.*)$ robots.txt

do I not deny access to all directories AND inform the robots that honor robots.txt that browsing the website is not allowed? Or, is this a valid modification?

g1smd




msg:3418041
 4:29 pm on Aug 9, 2007 (gmt 0)

The rewrite is a bit scary. For any requested URL, every visitor will see the content of your robots.txt file.

That isn't what you want to do.

You have just given your whole website infinite duplicate content, and stopped everyone from seeing any of the real pages on the site.

This is not a good method to block access.

Try using .htpasswd to keep everyone except authorised people out.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved