
htaccess help - stopping spiders, allowing others

need help with htaccess

   
7:28 pm on Mar 7, 2005 (gmt 0)

10+ Year Member



Hi,

I want to set up a rule in my .htaccess so that all pages ending in the following query strings:
?sort=2d&page=1
?sort=2a&page=1
?sort=3a&page=1
?sort=3d&page=1
are not indexed by spiders, but still work for customers on the site. These query strings sort products by price, etc. on the same page, but I don't want spiders to index the same page multiple times.
How can I prevent these URLs from being indexed while still allowing them to work on my website?

Thanks in advance!

10:21 pm on Mar 7, 2005 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



You can use mod_rewrite with several RewriteConds testing both %{QUERY_STRING} and %{HTTP_USER_AGENT} to accomplish what you describe. See our forum charter [webmasterworld.com] for links to some basic resources.
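
For example, the general shape of such a block might look something like this (a minimal sketch only -- the user-agent names are placeholders and the redirect target is just one of several possible actions, not a recommendation from this post):

# Sketch: if a selected spider requests one of the sort/page query strings,
# redirect it to the same URL with the query string stripped off
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot) [NC]
RewriteCond %{QUERY_STRING} sort=(2a|2d|3a|3d)&page=1
RewriteRule (.*) /$1? [R=301,L]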

Jim

10:36 pm on Mar 7, 2005 (gmt 0)

10+ Year Member



Did some research and came up with this, but I can't figure out how to finish it so that spiders know they should not index the page, and drop it if they already have...

# Redirect search engine spider requests which include a query string to same URL with blank query string
RewriteCond %{HTTP_USER_AGENT} ^FAST(-(Real)?WebCrawler/|\ FirstPage\ retriever) [OR]
RewriteCond %{HTTP_USER_AGENT} ^Gigabot/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Googlebot(-Image)?/[0-9]\.[0-9]{1,2} [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mediapartners-Google/[0-9]\.[0-9]{1,2} [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*(Ask\ Jeeves|Slurp/|ZealBot|Zyborg/) [OR]
RewriteCond %{HTTP_USER_AGENT} ^msnbot/ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Overture-WebCrawler/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Robozilla/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^(Scooter/|Scrubby/) [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} .
RewriteRule .*\?(sort=2d&page=1|sort=2a&page=1|etc.)$

11:24 pm on Mar 7, 2005 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



What do you want to do with search engines that request these resources? Among other things, you can:
  • Return a 403-forbidden response.
  • Redirect them to another page.
  • Feed them alternate content, such as an html page with a robots "noindex" meta-tag on it.
  • Feed them a password-required page.

It all depends on what you want to do in this case; a sketch of the 403 option is shown below.
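
For instance, the first option (a 403 response) might be sketched like this, assuming the same kind of user-agent and query-string conditions discussed above (the names and patterns shown are illustrative only):

# Sketch: return 403-Forbidden to selected spiders requesting sort pages
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot) [NC]
RewriteCond %{QUERY_STRING} sort=(2a|2d|3a|3d)&page=1
RewriteRule .* - [F]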

Jim

2:18 pm on Mar 8, 2005 (gmt 0)

10+ Year Member



The engines have already indexed some of these pages with these query strings. I want the engines not to index these pages, and to remove any that are already in their indexes.
What's the best way to do this without hurting my ranking?

5:12 pm on Mar 8, 2005 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I guess I'd go with the rewrite-to-noindex-page method in this case.

...
RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} (sort=2d&page=1|sort=2a&page=1|etc.)
RewriteRule .* /noindex.html [L]

You can improve the efficiency of this code if all of the pages are of the same type -- for example, php -- by restricting the RewriteRule to act on those page types only:

...
RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} (sort=2d&page=1|sort=2a&page=1|etc.)
RewriteRule \.php$ /noindex.html [L]

Create a file called "noindex.html" in your Web root directory, and place this code in it:

<html>
<head><meta name="robots" content="noindex,nofollow"></head>
<body></body>
</html>

You must allow access to this file in robots.txt.
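
For instance, a robots.txt along these lines leaves /noindex.html fetchable (a sketch only; the Disallow entry shown is just an example of other rules you might already have):

User-agent: *
Disallow: /cgi-bin/
# Note: no Disallow line covers /noindex.html, so spiders can fetch it
# and see the "noindex,nofollow" meta tag.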


Jim

6:17 pm on Mar 8, 2005 (gmt 0)

10+ Year Member



I really appreciate the help.

One final question: is there a way I can do this without keeping an updated spider list in the file? Basically, where the RewriteCond %{HTTP_USER_AGENT} line of code will identify any spider or non-browser request.

8:32 pm on Mar 8, 2005 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



No, not really. I'd suggest you keep the user-agent strings simple, leaving out version numbers, etc. and only try to cover the engines that really matter to you.
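
A simplified set of conditions along those lines might look something like this (a sketch only; substitute the engines that actually matter to you):

# Sketch: simple user-agent tests with no version numbers
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Slurp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} msnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Teoma [NC]
RewriteCond %{QUERY_STRING} sort=(2a|2d|3a|3d)&page=1
RewriteRule .* /noindex.html [L]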

Jim

12:43 pm on Mar 9, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Jim-

To send all Google queries for subdomain "foo.foo.com" to noindex.html, is this correct, or must the "." in the domain name be escaped?
...
RewriteCond %{HTTP_USER_AGENT} ^Google
RewriteCond %{QUERY_STRING} (foo.foo.com)
RewriteRule .* /noindex.html [L]

*foo will be replaced by the actual subdomain name

11:50 pm on Mar 9, 2005 (gmt 0)

10+ Year Member



OK, so to put everything together: I want the major spiders from the major engines - Google, Yahoo, AOL, MSN, Jeeves, Lycos, Excite - not to index any URL containing the ? character.

So would this work:
RewriteCond %{HTTP_USER_AGENT} ^(spider1|spider2|etc.)
RewriteCond %{QUERY_STRING} (?)
RewriteRule .* /noindex.html [L]

Also, some of the pages have already been indexed. Would this also tell Google to remove those pages from its index? Finally, could someone point me to a basic spider list to use in the first line of code above?

Thanks in advance - I really appreciate this forum!
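
For comparison, the earlier posts in this thread matched "any non-empty query string" with a single dot in the %{QUERY_STRING} test rather than with (?); a sketch along those lines, with placeholder spider names, would be:

# Sketch: send selected spiders to noindex.html for any URL with a query string
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot) [NC]
RewriteCond %{QUERY_STRING} .
RewriteRule .* /noindex.html [L]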