Home / Forums Index / Code, Content, and Presentation / Apache Web Server
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

htaccess help - stopping spiders, allowing others
need help with htaccess

 7:28 pm on Mar 7, 2005 (gmt 0)


I want to set up a rule in my .htaccess so that all pages ending in the following extensions:
are not indexed by spiders, but still work internally for customers on the site. This code is used to sort products by price, etc. on the same page, but I don't want spiders to index the same page multiple times.
How can I prevent indexing of all these pages, but still allow them to work on my website?

Thanks in advance!



 10:21 pm on Mar 7, 2005 (gmt 0)

You can use mod_rewrite with several RewriteConds testing both %{QUERY_STRING} and %{HTTP_USER_AGENT} to accomplish what you describe. See our forum charter [webmasterworld.com] for links to some basic resources.
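In skeleton form, that combination looks like this (a sketch only; "SomeSpider" is a placeholder user-agent, and the RewriteRule action is left as a no-op):

```apache
RewriteEngine On
# Condition 1: the request comes from a particular spider...
RewriteCond %{HTTP_USER_AGENT} ^SomeSpider [NC]
# Condition 2: ...and the URL carries a non-empty query string.
RewriteCond %{QUERY_STRING} .
# Action: "-" is a no-op substitution; replace it with a redirect,
# a 403, or a rewrite to suit your needs.
RewriteRule .* - [L]
```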



 10:36 pm on Mar 7, 2005 (gmt 0)

Did some research and came up with this, but I can't figure out how to end it so that spiders will know they should not index the page and, if they already have, should drop it...

# Redirect search engine spider requests which include a query string to same URL with blank query string
RewriteCond %{HTTP_USER_AGENT} ^FAST(-(Real)?WebCrawler/|\ FirstPage\ retriever) [OR]
RewriteCond %{HTTP_USER_AGENT} ^Gigabot/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Googlebot(-Image)?/[0-9]\.[0-9]{1,2} [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mediapartners-Google/[0-9]\.[0-9]{1,2} [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*(Ask\ Jeeves|Slurp/|ZealBot|Zyborg/) [OR]
RewriteCond %{HTTP_USER_AGENT} ^msnbot/ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Overture-WebCrawler/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Robozilla/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^(Scooter/|Scrubby/) [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} .
RewriteRule .*\?(sort=2d&page=1|sort=2a&page=1|etc.)$


 11:24 pm on Mar 7, 2005 (gmt 0)

What do you want to do with search engines that request these resources? Among other things, you can:
  • Return a 403 Forbidden response.
  • Redirect them to another page.
  • Feed them alternate content, such as an HTML page with a robots "noindex" meta tag on it.
  • Feed them a password-required page.

It all depends on what you want to do in this case.
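For example, the first option above (the 403) can be sketched like this; the user-agent pattern is a placeholder:

```apache
# Deny spider requests that carry a query string with 403 Forbidden.
RewriteCond %{HTTP_USER_AGENT} ^SomeSpider [NC]
RewriteCond %{QUERY_STRING} .
RewriteRule .* - [F]
```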


jmdb71

     2:18 pm on Mar 8, 2005 (gmt 0)

The engines have already indexed some of these pages with these extensions. I want the engines to simply not index these pages and, if they already have, to remove them from their indexes.
What's the best code for this that would not hurt my ranking?


     5:12 pm on Mar 8, 2005 (gmt 0)

    I guess I'd go with the rewrite-to-noindex-page method in this case.

RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} (sort=2d&page=1|sort=2a&page=1|etc.)
RewriteRule .* /noindex.html [L]

    You can improve the efficiency of this code if all of the pages are of the same type -- for example, php -- by restricting the RewriteRule to act on those page types only:

RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} (sort=2d&page=1|sort=2a&page=1|etc.)
RewriteRule \.php$ /noindex.html [L]

    Create a file called "noindex.html" in your Web root directory, and place this code in it:

    <head><meta name="robots" content="noindex,nofollow"></head>

    You must allow access to this file in robots.txt.

Change any remaining broken-pipe "¦" characters to solid pipe "|" characters before use.
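On the robots.txt point: robots.txt permits anything it does not explicitly disallow, so the main thing to verify is that no existing Disallow rule covers /noindex.html. A minimal illustration (the /private/ line is just an example of an unrelated rule):

```
User-agent: *
Disallow: /private/
```

If a broad rule such as "Disallow: /" were present, spiders could never fetch noindex.html and would never see the meta tag inside it.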



     6:17 pm on Mar 8, 2005 (gmt 0)

    I really appreciate the help.

One final question: is there a way I can do this without keeping an updated spider list in the file? Basically, I want the RewriteCond %{HTTP_USER_AGENT} line to identify any spider request/non-browser request.


     8:32 pm on Mar 8, 2005 (gmt 0)

No, not really. I'd suggest you keep the user-agent strings simple, leaving out version numbers and the like, and only try to cover the engines that really matter to you.
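Following that advice, the long list earlier in the thread might collapse to a single alternation with no anchors or version numbers (a sketch; adjust the names to the engines that matter to you):

```apache
# Match the spider name anywhere in the user-agent string, so
# variants like "Mozilla/5.0 ... Slurp" are covered as well.
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot|Teoma) [NC]
RewriteCond %{QUERY_STRING} .
RewriteRule .* /noindex.html [L]
```

Dropping the "^" anchor trades a little precision for much less list maintenance.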



     12:43 pm on Mar 9, 2005 (gmt 0)


To send all Google requests for the subdomain "foo.foo.com" to noindex.html, is this correct, or must the "." characters in the domain name be escaped?
    RewriteCond %{HTTP_USER_AGENT} ^Google
    RewriteCond %{QUERY_STRING} (foo.foo.com)
    RewriteRule .* /noindex.html [L]

*foo will be replaced by the actual subdomain name
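On the escaping question: in a regular expression an unescaped "." matches any single character, so the pattern foo.foo.com also matches strings where those positions hold other characters. Escaping the dots (foo\.foo\.com) makes the match literal and therefore stricter; the loose pattern will still work, it is just less precise. A quick demonstration with grep -E, which uses a comparable extended-regex syntax:

```shell
# The unescaped pattern matches even when the "dots" are other characters:
echo "fooXfooXcom" | grep -Eq "foo.foo.com" && echo "loose pattern matched"

# The escaped pattern requires literal dots, so here it does not match:
echo "fooXfooXcom" | grep -Eq "foo\.foo\.com" || echo "escaped pattern did not match"
```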


     11:50 pm on Mar 9, 2005 (gmt 0)

So, to put everything together: I want the major spiders from the major engines (Google, Yahoo, AOL, MSN, Jeeves, Lycos, Excite)
to not index any URL containing the "?" character.

    So would this work:
RewriteCond %{HTTP_USER_AGENT} ^(spider1|spider2|etc.)
RewriteCond %{QUERY_STRING} (?)
RewriteRule .* /noindex.html [L]
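One note on the second line: "(?)" is not a valid regular expression, because "?" is a quantifier and needs something to repeat. To test for any non-empty query string, a single dot, as used earlier in the thread, is enough:

```apache
# "." matches any single character, so this condition succeeds
# whenever the query string is non-empty:
RewriteCond %{QUERY_STRING} .
```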

Also, some of the pages have already been indexed. Would this also tell Google to remove the pages from its index if they already exist? Finally, could someone point me to a basic spider list to use in the first line of code above?

    Thanks in advance - i really appreciate this forum!

    WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
    © Webmaster World 1996-2014 all rights reserved