

Apache Web Server Forum

    
htaccess help - stopping spiders, allowing others
need help with htaccess
jmdb71
10+ Year Member
Msg#: 3036 posted 7:28 pm on Mar 7, 2005 (gmt 0)

Hi,

I want to set up a rule in my .htaccess file so that all pages ending in the following query strings:
?sort=2d&page=1
?sort=2a&page=1
?sort=3a&page=1
?sort=3d&page=1
are not indexed by spiders, but still work for customers on the site. These query strings sort products by price, etc. on the same page, and I don't want spiders to index the same page multiple times.
How can I prevent indexing of all these URLs, but still allow them to work on my website?

Thanks in advance!

 

jdMorgan
WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member
Msg#: 3036 posted 10:21 pm on Mar 7, 2005 (gmt 0)

You can use mod_rewrite with several RewriteConds testing both %{QUERY_STRING} and %{HTTP_USER_AGENT} to accomplish what you describe. See our forum charter [webmasterworld.com] for links to some basic resources.
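In shape, that is one or more conditions on the user-agent, one condition on the query string, and a rule deciding what those requests receive. A minimal sketch of the pattern (the names "SomeSpider" and "/alternate.html" are placeholders, not from this thread):

RewriteEngine On
# One condition naming the spider...
RewriteCond %{HTTP_USER_AGENT} ^SomeSpider [NC]
# ...one condition requiring a non-empty query string...
RewriteCond %{QUERY_STRING} .
# ...and a rule deciding what such requests get.
RewriteRule .* /alternate.html [L]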

Jim

jmdb71
10+ Year Member
Msg#: 3036 posted 10:36 pm on Mar 7, 2005 (gmt 0)

Did some research and came up with this, but I can't figure out how to end it, so that spiders will know they should not index the page, and drop it if they already have...

# Redirect search engine spider requests which include a query string to same URL with blank query string
RewriteCond %{HTTP_USER_AGENT} ^FAST(-(Real)?WebCrawler/|\ FirstPage\ retriever) [OR]
RewriteCond %{HTTP_USER_AGENT} ^Gigabot/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Googlebot(-Image)?/[0-9]\.[0-9]{1,2} [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mediapartners-Google/[0-9]\.[0-9]{1,2} [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*(Ask\ Jeeves|Slurp/|ZealBot|Zyborg/) [OR]
RewriteCond %{HTTP_USER_AGENT} ^msnbot/ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Overture-WebCrawler/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Robozilla/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^(Scooter/|Scrubby/) [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} .
RewriteRule .*\?(sort=2d&page=1|sort=2a&page=1|etc.)$

jdMorgan
WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member
Msg#: 3036 posted 11:24 pm on Mar 7, 2005 (gmt 0)

What do you want to do with search engines that request these resources? Among other things, you can:
  • Return a 403-Forbidden response.
  • Redirect them to another page.
  • Feed them alternate content, such as an HTML page with a robots "noindex" meta-tag on it.
  • Feed them a password-required page.

It all depends on what you want to do in this case; the first option is sketched below.
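For illustration, the 403 option might be wired up like this (a sketch only; "Googlebot" stands in for whichever user-agents you target, and the query-string pattern covers the four sort variants from the first post):

# Refuse the sorted views to a matched spider with a 403-Forbidden.
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [NC]
RewriteCond %{QUERY_STRING} ^sort=[23][ad]&page=1$
RewriteRule .* - [F]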

    Jim

jmdb71
10+ Year Member
Msg#: 3036 posted 2:18 pm on Mar 8, 2005 (gmt 0)

The engines have already indexed some of these pages with these query strings. I want the engines to stop indexing these pages, and if they have indexed them already, to remove them from their indexes.
What's the best code for this that would not hurt my ranking?

jdMorgan
WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member
Msg#: 3036 posted 5:12 pm on Mar 8, 2005 (gmt 0)

I guess I'd go with the rewrite-to-noindex-page method in this case.

...
RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} (sort=2d&page=1|sort=2a&page=1|etc.)
RewriteRule .* /noindex.html [L]

You can improve the efficiency of this code if all of the pages are of the same type -- for example, php -- by restricting the RewriteRule to act on those page types only:

...
RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} (sort=2d&page=1|sort=2a&page=1|etc.)
RewriteRule \.php$ /noindex.html [L]

Create a file called "noindex.html" in your Web root directory, and place this code in it:

<html>
<head><meta name="robots" content="noindex,nofollow"></head>
<body></body>
</html>

You must allow access to this file in robots.txt.
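To illustrate, a robots.txt that works with this setup (the /cgi-bin/ line is only a hypothetical example; the point is that no Disallow rule may match /noindex.html):

User-agent: *
Disallow: /cgi-bin/
# no Disallow line matches /noindex.html, so spiders can still
# fetch it and see the noindex meta-tag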


Jim

jmdb71
10+ Year Member
Msg#: 3036 posted 6:17 pm on Mar 8, 2005 (gmt 0)

I really appreciate the help.

One final question: is there a way I can do this without keeping an updated spider list in the file? Basically, a way for the RewriteCond %{HTTP_USER_AGENT} line to identify any spider or non-browser request.

jdMorgan
WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member
Msg#: 3036 posted 8:32 pm on Mar 8, 2005 (gmt 0)

No, not really. I'd suggest you keep the user-agent strings simple, leaving out version numbers, etc., and only try to cover the engines that really matter to you.
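For example, a trimmed-down set covering four of the engines named earlier in the thread (a sketch, not a complete list; versions dropped, matched case-insensitively):

RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot|Teoma) [NC]
RewriteCond %{QUERY_STRING} .
RewriteRule .* /noindex.html [L]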

Jim

jk3210
WebmasterWorld Senior Member 10+ Year Member
Msg#: 3036 posted 12:43 pm on Mar 9, 2005 (gmt 0)

Jim-

To send all Google queries for sub-domain "foo.foo.com" to noindex.html, is this correct, or must the "." in the domain name be escaped?
...
RewriteCond %{HTTP_USER_AGENT} ^Google
RewriteCond %{QUERY_STRING} (foo.foo.com)
RewriteRule .* /noindex.html [L]

*foo will be replaced by the actual sub-domain name
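For comparison, the escaped form would look like this (a sketch; an unescaped "." still matches the literal dot, but it also matches any other character, so the escaped version is the stricter test):

RewriteCond %{HTTP_USER_AGENT} ^Google
RewriteCond %{QUERY_STRING} foo\.foo\.com
RewriteRule .* /noindex.html [L]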

jmdb71
10+ Year Member
Msg#: 3036 posted 11:50 pm on Mar 9, 2005 (gmt 0)

OK, so to put everything together: I want the major spiders from the major engines -- Google, Yahoo, AOL, MSN, Jeeves, Lycos, Excite -- to not index any URL containing the ? character.

So would this work (testing %{QUERY_STRING} against "." since the query string itself never contains the "?" separator):
RewriteCond %{HTTP_USER_AGENT} ^(spider1|spider2|etc.)
RewriteCond %{QUERY_STRING} .
RewriteRule .* /noindex.html [L]

Also, some of the pages have already been indexed. Would this also tell Google to remove the pages from their index if they already exist? Finally, could someone point me to a basic spider list to use in the first line of code above?
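For illustration, filled in with user-agent names taken from Jim's list earlier in the thread (a sketch assembled from the posts above, not an authoritative spider list):

RewriteCond %{HTTP_USER_AGENT} ^(Googlebot|Mediapartners-Google|msnbot|Gigabot|Teoma) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Slurp|Ask\ Jeeves|ZealBot|Zyborg) [NC]
RewriteCond %{QUERY_STRING} .
RewriteRule .* /noindex.html [L]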

Thanks in advance - I really appreciate this forum!
