
Sitemaps, Meta Data, and robots.txt Forum

    
Wildcards on robots.txt disallows
MindTwist

5+ Year Member



 
Msg#: 3503530 posted 9:36 am on Nov 13, 2007 (gmt 0)

I run an osCommerce store, and I am trying to stop Google from indexing the product review pages. The reason is that some of the review pages rank on Google above the actual product page, so a potential customer lands on an empty reviews page (yeah, there are no reviews on the products, but the reviews page still shows up above the product page, doh).

So I just wanted to disallow Google from indexing my review pages, but I am using SEO URLs, so I don't really know how to do it. My real reviews URL would be:

http://www.example.com/product_reviews.php?products_id=72

but with SEO URLs, it is:

http://www.example.com/vga-splitter-duplicator-puertos-pc-monitores-pr-72.html

so I do not know how I could add that to my robots.txt. What all the review URLs have in common at the end is the *-pr-number.html part. Their RewriteRule is the following:

RewriteRule ^(.*)-pr-([0-9]+).html$ product_reviews.php?products_id=$2&%{QUERY_STRING}
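To see what this rule captures, here is a minimal PHP sketch that applies the same pattern with preg_match(). The function name reviewProductId is hypothetical, and the path is assumed to arrive without its leading slash, as it typically does for a rule in a per-directory .htaccess file.

```php
<?php
// Hypothetical helper: applies the same regex as the RewriteRule above
// to a request path (assumed to have no leading slash).
function reviewProductId(string $path): ?string
{
    // The "." before "html" is unescaped in the original rule, so it
    // matches any single character, not only a literal dot. Kept as written.
    if (preg_match('#^(.*)-pr-([0-9]+).html$#', $path, $m) === 1) {
        return $m[2]; // second capture group: the numeric products_id
    }
    return null; // not a review URL
}

// Prints 72 for the example URL from the post.
echo reviewProductId('vga-splitter-duplicator-puertos-pc-monitores-pr-72.html'), "\n";
```

Any URL whose path ends in -pr-<digits>.html is therefore served by product_reviews.php, which is why the review pages share that suffix.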

So I was wondering what would be the right way of disallowing all these review pages, since I do not think that placing a "Disallow: /product_reviews.php" will do the trick.

Disallow: /*-pr-*

Would this work OK? I am not sure I am using the right syntax, and I do know that this will also disallow any other URLs that have -pr- in them (but I can live with that, I doubt I am going to use the word "pr" a lot, hehe).

Many thanks! :)

[edited by: jatar_k at 6:34 pm (utc) on Nov. 13, 2007]
[edit reason] please use example.com [/edit]

 

phranque

WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 3503530 posted 10:35 am on Nov 13, 2007 (gmt 0)

welcome to WebmasterWorld [webmasterworld.com], mindtwist!

you should exemplify your urls in this forum (use example.com)

the correct way to exclude this would be "Disallow: /product_reviews.php" since wildcarding and globbing aren't officially supported for robots.txt.

MindTwist

5+ Year Member



 
Msg#: 3503530 posted 11:25 am on Nov 13, 2007 (gmt 0)

Thx phranque :)

The problem is that the URLs I want to exclude are not /product_reviews.php?products_id=72, but /vga-splitter-duplicator-puertos-pc-monitores-pr-72.html

I have a contribution installed on my store that changes the URLs to the second form, to make them friendlier for search engines. So I guess that if I add "Disallow: /product_reviews.php" to my robots.txt, URLs like example.com/vga-splitter-duplicator-puertos-pc-monitores-pr-72.html will still be spidered, which is what I want to avoid.

Thank you!
Aitor

jdMorgan

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3503530 posted 12:43 pm on Nov 13, 2007 (gmt 0)

It's probably too late, but could you use either of the following formats for your reviews' friendly URLs?

/pr-vga-splitter-duplicator-puertos-pc-monitores-72.html
/pr-72-vga-splitter-duplicator-puertos-pc-monitores.html

URL systems should be designed with the limitations of robots.txt URL prefix-matching in mind.
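As a rough illustration of that prefix-matching limitation, the PHP sketch below compares each Disallow value against the start of the URL path only. The function name isDisallowedByPrefix and the rule list are assumptions made for the example, not part of any real crawler.

```php
<?php
// Hypothetical helper: prefix matching as in the original robots.txt
// convention, where a Disallow value is compared with the start of the path.
function isDisallowedByPrefix(string $path, array $disallowPrefixes): bool
{
    foreach ($disallowPrefixes as $prefix) {
        // An empty Disallow value blocks nothing.
        if ($prefix !== '' && strncmp($path, $prefix, strlen($prefix)) === 0) {
            return true;
        }
    }
    return false;
}

// Assumed rules: "Disallow: /pr-" and "Disallow: /product_reviews.php".
$rules = ['/pr-', '/product_reviews.php'];

// "pr-" at the start of the path: blocked by the "/pr-" prefix.
var_dump(isDisallowedByPrefix('/pr-72-vga-splitter-duplicator-puertos-pc-monitores.html', $rules));
// "-pr-" near the end of the path: no prefix matches, so it is not blocked.
var_dump(isDisallowedByPrefix('/vga-splitter-duplicator-puertos-pc-monitores-pr-72.html', $rules));
```

With the marker moved to the front of the path, a single Disallow line covers every review URL without relying on wildcards.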

Jim

MindTwist

5+ Year Member



 
Msg#: 3503530 posted 5:28 pm on Nov 13, 2007 (gmt 0)

Mmmh yeah, I could probably just modify the "Ultimate SEO URL" contribution a little bit so it does not SEO product_reviews.php.

After all, if I do not want them indexed, I couldn't care less about how nice they look to search engines... I will try to take this route: make it so they are not SEOed, disallow product_reviews.php in robots.txt, and wait for the old -pr- URLs to vanish from Google.

Thanks! :D

MindTwist

5+ Year Member



 
Msg#: 3503530 posted 5:32 pm on Nov 13, 2007 (gmt 0)

Mmmh just saw on another thread that this could be used to disallow files that have "cat_id" as an argument.

Disallow: /*cat_id=*

Would this work for me to disallow files that have -pr- in the URL, so that Google would not index my review pages?

Disallow: /*-pr-*
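Whether a pattern like this matches depends on the crawler implementing the wildcard extension, which Googlebot documents but the original robots.txt convention does not define. The PHP sketch below assumes those semantics ("*" matches any run of characters, a trailing "$" anchors the end of the URL); the function name robotsPatternMatches is hypothetical.

```php
<?php
// Hypothetical helper: tests a robots.txt path pattern against a URL path,
// assuming wildcard-extension semantics ("*" and a trailing "$").
function robotsPatternMatches(string $pattern, string $path): bool
{
    // A trailing "$" means the pattern must match up to the end of the URL.
    $anchored = substr($pattern, -1) === '$';
    if ($anchored) {
        $pattern = substr($pattern, 0, -1);
    }
    // Quote everything literally, then turn the escaped "*" back into ".*".
    $regex = str_replace('\*', '.*', preg_quote($pattern, '#'));
    return preg_match('#^' . $regex . ($anchored ? '$' : '') . '#', $path) === 1;
}

// The rewritten review URL contains "-pr-", so the pattern matches it.
var_dump(robotsPatternMatches('/*-pr-*', '/vga-splitter-duplicator-puertos-pc-monitores-pr-72.html'));
// The raw script URL has no "-pr-" in it, so it is not matched.
var_dump(robotsPatternMatches('/*-pr-*', '/product_reviews.php?products_id=72'));
```

Under these assumed semantics the trailing "*" is redundant, since patterns already match as prefixes: "/*-pr-" behaves the same as "/*-pr-*".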

Thx!

jd01

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3503530 posted 9:14 pm on Nov 14, 2007 (gmt 0)

I am not sure about the robots.txt answer, but osCommerce is written in PHP, so if you find the include that generates the head section, you could use the following to achieve the same effect:

<?php

// strstr() returns a string or FALSE, never TRUE, so test with strpos() !== false instead.
if (strpos($_SERVER['REQUEST_URI'], "-pr-") !== false) {
    echo "<meta name=\"robots\" content=\"noindex,nofollow,noarchive\" />";
}

?>

Justin

<added>
If you want some SE credit for the pages, you might consider changing the line to:
echo "<meta name=\"robots\" content=\"noindex,follow,noarchive\" />";
</added>

MindTwist

5+ Year Member



 
Msg#: 3503530 posted 10:49 pm on Nov 14, 2007 (gmt 0)

OMG... You really made my day now.

That has to be such a simple solution that it just couldn't get any easier. No messing with URLs, no messing with .htaccess, and hardly any messing with anything else. I only had to create a new variable in my meta tags module so it adds "noindex,follow,noarchive" to my product_reviews.php and product_reviews_info.php, and leaves it as "all" everywhere else.
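The actual meta tags module is not shown in the thread, so the following is only a sketch of the idea under stated assumptions: the function names robotsMetaContent and robotsMetaTag are hypothetical, and the choice is keyed on the script name rather than on the rewritten URL.

```php
<?php
// Hypothetical helper: picks the robots meta content for a script.
// The two script names come from the post above; everything else is assumed.
function robotsMetaContent(string $scriptName): string
{
    $noindexScripts = ['product_reviews.php', 'product_reviews_info.php'];
    return in_array(basename($scriptName), $noindexScripts, true)
        ? 'noindex,follow,noarchive'
        : 'all';
}

// Hypothetical helper: renders the tag for the page head.
function robotsMetaTag(string $scriptName): string
{
    $content = htmlspecialchars(robotsMetaContent($scriptName), ENT_QUOTES);
    return '<meta name="robots" content="' . $content . '" />';
}

// SCRIPT_NAME normally names the script that actually runs, so it should
// not change when the friendly URL is rewritten to product_reviews.php.
echo robotsMetaTag($_SERVER['SCRIPT_NAME'] ?? '/index.php'), "\n";
```

Keying on the script name avoids depending on the "-pr-" marker in the friendly URL, so the tag stays correct even if the URL scheme changes later.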

Thank you! :D

MindTwist

5+ Year Member



 
Msg#: 3503530 posted 11:01 pm on Nov 14, 2007 (gmt 0)

Up and working. I checked everywhere, checked the source, and indeed I have "all" everywhere except on the reviews, where I have "noindex,follow,noarchive".

Now Google should start forgetting about those indexed pages over time, shouldn't it? :)

jd01

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3503530 posted 6:28 pm on Nov 15, 2007 (gmt 0)

You really made my day now.

Thanks.
It's nice to know when something I post actually helps someone out.

Now Google should start forgetting about those indexed pages over time, shouldn't it?

Yes, they should be dropped from the index the next time they are spidered.

Justin

MindTwist

5+ Year Member



 
Msg#: 3503530 posted 5:59 pm on Nov 27, 2007 (gmt 0)

Just came back to say that the solution worked great :D

I just checked with Google Webmaster Tools, and I have a long list of URLs restricted by robots.txt (386). All the review pages seem to be showing up there, so people won't find any more review pages ahead of the product page itself :D

Thx again! ^^

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved