Apache Web Server Forum

    
htaccess help - stopping spiders, allowing others
need help with htaccess
jmdb71
msg:1511385
7:28 pm on Mar 7, 2005 (gmt 0)

Hi,

I want to set up a rule in my .htaccess so that all pages ending with the following query strings:
?sort=2d&page=1
?sort=2a&page=1
?sort=3a&page=1
?sort=3d&page=1
are not indexed by spiders, but still work normally for customers on the site. These query strings sort products by price, etc. on the same page, but I don't want spiders to index the same pages multiple times.
How can I prevent indexing of all these pages, but still allow them to work on my website?

Thanks in advance!

 

jdMorgan
msg:1511386
10:21 pm on Mar 7, 2005 (gmt 0)

You can use mod_rewrite with several RewriteConds testing both %{QUERY_STRING} and %{HTTP_USER_AGENT} to accomplish what you describe. See our forum charter [webmasterworld.com] for links to some basic resources.

Jim

jmdb71
msg:1511387
10:36 pm on Mar 7, 2005 (gmt 0)

Did some research and came up with this, but I can't figure out how to finish it, so that spiders will know they should not index the page and, if they already have, should drop it...

# Redirect search engine spider requests which include a query string to same URL with blank query string
RewriteCond %{HTTP_USER_AGENT} ^FAST(-(Real)?WebCrawler/|\ FirstPage\ retriever) [OR]
RewriteCond %{HTTP_USER_AGENT} ^Gigabot/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Googlebot(-Image)?/[0-9]\.[0-9]{1,2} [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mediapartners-Google/[0-9]\.[0-9]{1,2} [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*(Ask\ Jeeves|Slurp/|ZealBot|Zyborg/) [OR]
RewriteCond %{HTTP_USER_AGENT} ^msnbot/ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Overture-WebCrawler/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Robozilla/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^(Scooter/|Scrubby/) [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} .
RewriteRule .*\?(sort=2d&page=1|sort=2a&page=1|etc.)$

jdMorgan
msg:1511388
11:24 pm on Mar 7, 2005 (gmt 0)

What do you want to do with search engines that request these resources? Among other things, you can:
  • Return a 403 Forbidden response (a minimal sketch follows this list).
  • Redirect them to another page.
  • Feed them alternate content, such as an HTML page with a robots "noindex" meta tag on it.
  • Feed them a password-required page.
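
For the 403 option, a minimal sketch (the user-agent names and query-string pattern here are placeholders - substitute your own list):

RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot) [NC]
RewriteCond %{QUERY_STRING} ^sort=
RewriteRule .* - [F]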

It all depends on what you want to do in this case.

Jim

jmdb71
msg:1511389
2:18 pm on Mar 8, 2005 (gmt 0)

The engines have already indexed some of these pages with these query strings. I want the engines to simply not index these pages, and if they already have, to remove them from their indexes.
What's the best code for this that would not hurt my rankings?

jdMorgan
msg:1511390
5:12 pm on Mar 8, 2005 (gmt 0)

I guess I'd go with the rewrite-to-noindex-page method in this case.

...
RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} (sort=2d&page=1|sort=2a&page=1|etc.)
RewriteRule .* /noindex.html [L]

You can improve the efficiency of this code if all of the pages are of the same type -- for example, PHP -- by restricting the RewriteRule to act on those page types only:

...
RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} (sort=2d&page=1|sort=2a&page=1|etc.)
RewriteRule \.php$ /noindex.html [L]

Create a file called "noindex.html" in your Web root directory, and place this code in it:

<html>
<head><meta name="robots" content="noindex,nofollow"></head>
<body></body>
</html>

You must allow access to this file in robots.txt.
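
That is, make sure no Disallow line in robots.txt covers it; for crawlers that understand the Allow directive, an explicit entry (just a sketch) would look like:

User-agent: *
Allow: /noindex.html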


Jim

jmdb71
msg:1511391
6:17 pm on Mar 8, 2005 (gmt 0)

I really appreciate the help.

One final question - is there a way I can do this without keeping an updated spider list in the file? Basically, where the RewriteCond %{HTTP_USER_AGENT} line will identify any spider request/non-browser request.

jdMorgan
msg:1511392
8:32 pm on Mar 8, 2005 (gmt 0)

No, not really. I'd suggest you keep the user-agent strings simple, leaving out version numbers, etc., and only try to cover the engines that really matter to you.
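
For example, a single condition along these lines (just a sketch, using engines already mentioned in this thread) could stand in for the longer [OR]'d list above:

RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot|Teoma) [NC]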

Jim

jk3210
msg:1511393
12:43 pm on Mar 9, 2005 (gmt 0)

Jim-

To send all Google queries for sub-domain "foo.foo.com" to noindex.html, is this correct, or must the "." in the domain name be escaped?
...
RewriteCond %{HTTP_USER_AGENT} ^Google
RewriteCond %{QUERY_STRING} (foo.foo.com)
RewriteRule .* /noindex.html [L]

*foo will be replaced by the actual sub-domain name
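
If escaping turns out to be needed, presumably it would look like this (the domain is still a placeholder):

RewriteCond %{HTTP_USER_AGENT} ^Google
RewriteCond %{QUERY_STRING} foo\.foo\.com
RewriteRule .* /noindex.html [L]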

jmdb71
msg:1511394
11:50 pm on Mar 9, 2005 (gmt 0)

OK - so to put everything together, I want the major spiders from the major engines - Google, Yahoo, AOL, MSN, Jeeves, Lycos, Excite -
to not index any URL containing the ? character.

So would this work:
RewriteCond %{HTTP_USER_AGENT} ^(spider1|spider2|etc.)
RewriteCond %{QUERY_STRING} (?)
RewriteRule .* /noindex.html [L]
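
Or, since the ? itself never appears in %{QUERY_STRING} (it's just the separator), presumably the test should be for a non-empty query string, as in Jim's earlier example:

RewriteCond %{HTTP_USER_AGENT} ^(spider1|spider2|etc.)
RewriteCond %{QUERY_STRING} .
RewriteRule .* /noindex.html [L]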

Also, some of the pages have already been indexed. Would this also tell Google to remove the pages from its index if they already exist? Finally, could someone point me to a basic spider list to use in the first line of code above..?

Thanks in advance - I really appreciate this forum!
