htaccess help - stopping spiders, allowing others

need help with htaccess

7:28 pm on Mar 7, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 15, 2004
posts:42
votes: 0


Hi,

I want to set up a rule in my .htaccess so that all pages ending in the following query strings:
?sort=2d&page=1
?sort=2a&page=1
?sort=3a&page=1
?sort=3d&page=1
are not indexed by spiders, but still work for customers on the site. These query strings are used to sort products by price, etc., on the same page, but I don't want spiders to index the same page multiple times.
How can I prevent indexing of all these pages, but still allow them to work on my website?

Thanks in advance!

10:21 pm on Mar 7, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


You can use mod_rewrite with several RewriteConds testing both {QUERY_STRING} and {HTTP_USER_AGENT} to accomplish what you describe. See our forum charter [webmasterworld.com] for links to some basic resources.
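
In rough outline, such a block might look like this (a sketch only -- the spider names and query-string pattern below are placeholders chosen to show the shape, and the final rule can deny, redirect, or rewrite as needed):

RewriteEngine On
# Example spider user-agents (illustrative, not a complete list)
RewriteCond %{HTTP_USER_AGENT} ^Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^msnbot [NC]
# Only act when one of the sort/page query strings is present
RewriteCond %{QUERY_STRING} sort=(2|3)(a|d)&page=1
# Deny the request (or substitute a redirect/rewrite target here)
RewriteRule .* - [F]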

Jim

10:36 pm on Mar 7, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 15, 2004
posts:42
votes: 0


Did some research and came up with this, but I can't figure out how to finish it, so spiders will know they should not index the page and, if they already have, should drop it...

# Redirect search engine spider requests which include a query string to same URL with blank query string
RewriteCond %{HTTP_USER_AGENT} ^FAST(-(Real)?WebCrawler/|\ FirstPage\ retriever) [OR]
RewriteCond %{HTTP_USER_AGENT} ^Gigabot/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Googlebot(-Image)?/[0-9]\.[0-9]{1,2} [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mediapartners-Google/[0-9]\.[0-9]{1,2} [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*(Ask\ Jeeves|Slurp/|ZealBot|Zyborg/) [OR]
RewriteCond %{HTTP_USER_AGENT} ^msnbot/ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Overture-WebCrawler/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Robozilla/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^(Scooter/|Scrubby/) [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} .
RewriteRule .*\?(sort=2d&page=1|sort=2a&page=1|etc.)$

11:24 pm on Mar 7, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


What do you want to do with search engines that request these resources? Among other things, you can:
  • Return a 403-Forbidden response.
  • Redirect them to another page.
  • Feed them alternate content, such as an HTML page with a robots "noindex" meta tag on it.
  • Feed them a password-required page.

It all depends on what you want to do in this case; rough sketches of the first two options follow below.
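
For example, with user-agent and query-string conditions like the ones you already posted, the first two options differ only in the final rule (a sketch; www.example.com is a placeholder for your own host name):

# Option 1: return a 403-Forbidden response
RewriteRule .* - [F]

# Option 2: redirect to the same URL with the query string stripped
# (the trailing "?" on the substitution clears the query string)
RewriteRule (.*) http://www.example.com/$1? [R=301,L]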

Jim

2:18 pm on Mar 8, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 15, 2004
posts:42
votes: 0


The engines have already indexed some of these pages with these query strings. I want the engines to simply not index these pages and, if they already have, to remove them from their indexes.
What's the best code for this that would not hurt my ranking?

5:12 pm on Mar 8, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


I guess I'd go with the rewrite-to-noindex-page method in this case.

...
RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} (sort=2d&page=1|sort=2a&page=1|etc.)
RewriteRule .* /noindex.html [L]

You can improve the efficiency of this code if all of the pages are of the same type -- for example, php -- by restricting the RewriteRule to act on those page types only:

...
RewriteCond %{HTTP_USER_AGENT} ^Teoma
RewriteCond %{QUERY_STRING} (sort=2d&page=1|sort=2a&page=1|etc.)
RewriteRule \.php$ /noindex.html [L]

Create a file called "noindex.html" in your Web root directory, and place this code in it:

<html>
<head><meta name="robots" content="noindex,nofollow"></head>
<body></body>
</html>

You must allow access to this file in robots.txt.
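
In other words, robots.txt must not contain a Disallow line that covers /noindex.html; a minimal robots.txt along these lines would do (a sketch -- the /cgi-bin/ entry is only an example of some other rule you might already have):

User-agent: *
Disallow: /cgi-bin/
# No Disallow covering /noindex.html, so spiders can fetch it and see the meta tag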


Jim

6:17 pm on Mar 8, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 15, 2004
posts:42
votes: 0


I really appreciate the help.

One final question: is there a way I can do this without keeping an updated spider list in the file? Basically, where the RewriteCond %{HTTP_USER_AGENT} line will identify any spider request/non-browser request.

8:32 pm on Mar 8, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


No, not really. I'd suggest you keep the user-agent strings simple, leaving out version numbers, etc., and only try to cover the engines that really matter to you.
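
Along those lines, a trimmed-down user-agent block, using only spider names that already appear in the longer list earlier in this thread and leaving out the version numbers, might look like this (a sketch -- verify the names against your own logs):

RewriteCond %{HTTP_USER_AGENT} Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Slurp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} msnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Teoma [NC]
RewriteCond %{QUERY_STRING} (sort=2d&page=1|sort=2a&page=1|sort=3a&page=1|sort=3d&page=1)
RewriteRule \.php$ /noindex.html [L]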

Jim

12:43 pm on Mar 9, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 12, 2001
posts:1150
votes: 0


Jim-

To send all Google queries for subdomain "foo.foo.com" to noindex.html, is this correct, or must the "." in the domain name be escaped?
...
RewriteCond %{HTTP_USER_AGENT} ^Google
RewriteCond %{QUERY_STRING} (foo.foo.com)
RewriteRule .* /noindex.html [L]

*foo will be replaced by the actual subdomain name
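
For what it's worth, escaping the dots only makes the match stricter: an unescaped "." matches any single character, so the unescaped pattern will still match foo.foo.com (it could just also match similar strings). The escaped form would be:

RewriteCond %{HTTP_USER_AGENT} ^Google
RewriteCond %{QUERY_STRING} foo\.foo\.com
RewriteRule .* /noindex.html [L]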

11:50 pm on Mar 9, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 15, 2004
posts:42
votes: 0


OK, so to put everything together: I want the major spiders from the major engines (Google, Yahoo, AOL, MSN, Jeeves, Lycos, Excite) to not index any URL containing the ? character.

So would this work:
RewriteCond %{HTTP_USER_AGENT} ^(spider1|spider2|etc.)
RewriteCond %{QUERY_STRING} (?)
RewriteRule .* /noindex.html [L]
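
Or, borrowing the query-string test from the longer snippet earlier in the thread -- %{QUERY_STRING} never includes the "?" itself, so a single "." matches any non-empty query string:

RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot|Teoma) [NC]
RewriteCond %{QUERY_STRING} .
RewriteRule .* /noindex.html [L]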

Also, some of the pages have already been indexed. Would this also tell Google to remove the pages from its index if they already exist? Finally, could someone point me to a basic spider list to use in the first line of code above?

Thanks in advance - I really appreciate this forum!