
Sitemaps, Meta Data, and robots.txt Forum

    
Wildcards on robots.txt disallows
MindTwist




msg:3503532
 9:36 am on Nov 13, 2007 (gmt 0)

I run an osCommerce store, and I am trying to stop Google from indexing the product review pages. The reason is that some of the review pages rank above the actual product pages on Google, so a potential customer lands on an empty reviews page (yeah, there are no reviews on the products, but the reviews page still shows up above the product page, doh).

So I just want to disallow Google from indexing my review pages, but I am using SEO URLs, so I don't really know how to do it. My real reviews URL would be:

http://www.example.com/product_reviews.php?products_id=72

but with SEO URLs, it is:

http://www.example.com/vga-splitter-duplicator-puertos-pc-monitores-pr-72.html

so I do not know how to add that to my robots.txt. What all the review URLs have in common is the ending, *-pr-number.html. Their RewriteRule is the following:

RewriteRule ^(.*)-pr-([0-9]+).html$ product_reviews.php?products_id=$2&%{QUERY_STRING}

So I was wondering what would be the right way of disallowing all these review pages, since I do not think that placing a "Disallow: /product_reviews.php" will do the trick.

Disallow: /*-pr-*

Would this work OK? I am not sure I am using the right syntax, and I do know that this will also disallow any other URLs that have -pr- in them (but I can live with that, I doubt I am going to use the word "pr" a lot, hehe).

Many thanks! :)

[edited by: jatar_k at 6:34 pm (utc) on Nov. 13, 2007]
[edit reason] please use example.com [/edit]

 

phranque




msg:3503562
 10:35 am on Nov 13, 2007 (gmt 0)

welcome to WebmasterWorld, mindtwist!

you should examplify your urls in this forum (use example.com)

the correct way to exclude this would be "Disallow: /product_reviews.php" since wildcarding and globbing aren't officially supported for robots.txt.

MindTwist




msg:3503594
 11:25 am on Nov 13, 2007 (gmt 0)

Thx phranque :)

The problem is that the URLs I want to exclude are not /product_reviews.php?products_id=72 , but /vga-splitter-duplicator-puertos-pc-monitores-pr-72.html

I have a contribution installed on my store that changes the URLs to the second form, to make them friendlier for search engines. So I guess that if I add "Disallow: /product_reviews.php" to my robots.txt, URLs like example.com/vga-splitter-duplicator-puertos-pc-monitores-pr-72.html will still be spidered, which is what I want to avoid.

Thank you!
Aitor

jdMorgan




msg:3503637
 12:43 pm on Nov 13, 2007 (gmt 0)

It's probably too late, but could you use either of the following formats for your reviews' friendly URLs?

/pr-vga-splitter-duplicator-puertos-pc-monitores-72.html
/pr-72-vga-splitter-duplicator-puertos-pc-monitores.html

URL systems should be designed with the limitations of robots.txt URL prefix-matching in mind.

Jim

MindTwist




msg:3503923
 5:28 pm on Nov 13, 2007 (gmt 0)

Mmmh yeah, I could probably just modify the "Ultimate SEO URL" contribution a little bit so it will not SEO product_reviews.php.

After all, if I do not want them indexed, I couldn't care less about how nice they look to search engines... I will try to take this route: make it so they are not SEOed, disallow product_reviews.php in robots.txt, and wait for the old -pr- URLs to vanish from Google.

Thanks! :D

MindTwist




msg:3503926
 5:32 pm on Nov 13, 2007 (gmt 0)

Mmmh, I just saw on another thread that this could be used to disallow files that have "cat_id" as an argument:

Disallow: /*cat_id=*

Would this work for me to disallow files that have -pr- in the URL, so that Google would not index my review pages?

Disallow: /*-pr-*

Thx!
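
For anyone following along, here is a minimal Python sketch of how a crawler that supports wildcards might evaluate these rules. It assumes the Google-style extensions, where `*` matches any run of characters and a trailing `$` anchors the end of the URL; the original robots.txt convention only defines plain prefix matching, so other crawlers may ignore such rules. The function names are made up for illustration, and this is not how any particular crawler is implemented.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with '*' and '$' into a compiled regex.

    Assumption: '*' means "any run of characters" and a trailing '$'
    means "the URL must end here" (Google-style extensions).
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for every '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path, pattern):
    """Return True if the rule matches the path, starting from its first character."""
    # re.match anchors at the start only, which mirrors prefix matching.
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    seo_url = "/vga-splitter-duplicator-puertos-pc-monitores-pr-72.html"
    raw_url = "/product_reviews.php?products_id=72"
    for rule in ("/product_reviews.php", "/*-pr-*"):
        print(rule, is_disallowed(seo_url, rule), is_disallowed(raw_url, rule))
```

Under those assumptions, `/*-pr-*` would catch the rewritten review URLs, while `/product_reviews.php` alone would only catch the raw script URL. The trailing `*` is redundant, since rules already match as prefixes.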

jd01




msg:3505072
 9:14 pm on Nov 14, 2007 (gmt 0)

I am not sure about the robots.txt answer, but osCommerce is written in PHP, so if you find the include that generates the head section, you could use the following to achieve the same effect:

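<?php

// Emit a robots meta tag only on review pages, identified by "-pr-" in the requested URL.
// strpos() returns an offset or false, so compare against false explicitly.
if (strpos($_SERVER['REQUEST_URI'], '-pr-') !== false) {
    echo "<meta name=\"robots\" content=\"noindex,nofollow,noarchive\" />";
}

?>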

Justin

<added>
If you want some SE credit for the pages, you might consider changing the line to:
echo "<meta name=\"robots\" content=\"noindex,follow,noarchive\" />";
</added>

MindTwist




msg:3505168
 10:49 pm on Nov 14, 2007 (gmt 0)

OMG... You really made my day now.

That has to be the simplest solution possible; it just couldn't get any easier. No messing with URLs, no messing with .htaccess, and no messing with almost anything. I only had to create a new variable in my meta tags module so that it adds "noindex,follow,noarchive" to my product_reviews.php and product_reviews_info.php, and just leaves it as "all" everywhere else.

Thank you! :D

MindTwist




msg:3505182
 11:01 pm on Nov 14, 2007 (gmt 0)

Up and working, I checked everywhere, checked source, and indeed I have "all" everywhere except on the reviews, where I have "noindex,follow,noarchive"

Now Google should start forgetting about those indexed pages over time, shouldn't it? :)
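
One way to double-check this kind of change without viewing source by hand is to parse the page and read back the robots meta tag. Below is a rough sketch using only Python's standard `html.parser` module. The class and function names are hypothetical, and the HTML string stands in for a page you would fetch from your own store.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Split "noindex,follow" into separate, normalized directives.
            self.directives.extend(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def robots_directives(html):
    """Return the robots meta directives found in an HTML string, in order."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    # Stand-in for the head section of a review page.
    sample = (
        '<html><head><meta name="robots" '
        'content="noindex,follow,noarchive" /></head></html>'
    )
    print(robots_directives(sample))
```

A review page should report `noindex` among its directives, while a product page should not.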

jd01




msg:3505954
 6:28 pm on Nov 15, 2007 (gmt 0)

You really made my day now.

Thanks.
It's nice to know when something I post actually helps someone out.

Now Google should start forgetting about those indexed pages over time, shouldn't it?

Yes, they should be dropped from the index the next time they are spidered.

Justin

MindTwist




msg:3514592
 5:59 pm on Nov 27, 2007 (gmt 0)

Just came back to say that the solution worked great :D

I just checked with Google Webmaster Tools, and I have a long list of URLs restricted by robots.txt (386). All the review pages seem to be showing up there, so people won't find review pages ahead of the product page any more :D

Thx again! ^^


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved