
Sitemaps, Meta Data, and robots.txt Forum

    
Help: robots disallow pages with various query-strings
bramley




msg:4289725
 7:38 pm on Mar 30, 2011 (gmt 0)

Hi,

I need help to block (noindex) a page (index.php) from search engines. Because this is a CMS system, the pages have URLs like this :

/index.php?option=com_content&view=article&layout=form&Itemid=29

I would like to block ALL these pages, no matter what the query string is (?....)

How to do this in robots.txt ?

I have tried both index.php and index.php* but I still see such URLs in the search index (Google), even after using the Webmaster Tools URL removal, which seemed to accept them.

I have URLs not based on index.php that I wish to be the only ones indexed by the search engines.

Secondly, I have many URLs of the form /getpage?page=67, /getpage?page=109, etc., but I have changed these to append some extra information, and I wish to remove the URLs of this type that don't have the appended info.

For example :
/getpage?page=109 exclude this form
/getpage?page=109:video keep (index) this form

Thanks!

 

bramley




msg:4289748
 8:30 pm on Mar 30, 2011 (gmt 0)

Seems the answer to my first question might be :

/index*.php

which matches :
/index.php
/indexblah/index.php?anyparameters

according to : [code.google.com...]

I assume it will match : /index.php?anyparameters

I'm still looking for an answer to the second question. Is it a case of disallowing ?a=
then allowing ?a=blah:text ?

bramley




msg:4289876
 1:42 am on Mar 31, 2011 (gmt 0)

I guess I need to make it clearer :

exclude this form :

/getpage?page=109
/getpage?page=77
/getpage?page=811
etc

keep (allow / index) this form :

/getpage?page=109:video
/getpage?page=32:photo
/getpage?page=816:article
etc

How do I do this in robots.txt ?

I've updated the URL creation to be more descriptive, but unless I can remove the old-format URLs I will face duplicate content / title issues. (There are too many to block one by one.)
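One way to reason about this case is to pair a broad Disallow with a narrower Allow. The sketch below is only an illustration and rests on assumptions that are not from the thread: that the crawler supports Allow and the * wildcard, and that when several rules match, the longest rule wins (with Allow winning a tie). The two rules in RULES and the helper names are hypothetical.

```python
import re

# Hypothetical rule set for the /getpage case (an assumption, not from the thread).
# As robots.txt lines these would read:
#   Disallow: /getpage?page=
#   Allow: /getpage?page=*:
RULES = [
    ("disallow", "/getpage?page="),
    ("allow", "/getpage?page=*:"),
]

def matches(rule, path):
    """Prefix match where '*' stands for any run of characters; '?' is literal."""
    regex = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(regex, path) is not None

def is_allowed(path, rules=RULES):
    """Assumed precedence: longest matching rule wins, Allow wins a tie."""
    hits = [(len(rule), kind == "allow") for kind, rule in rules if matches(rule, path)]
    if not hits:
        return True  # no rule matches, so the path is not blocked
    return max(hits)[1]

for path in ["/getpage?page=109", "/getpage?page=109:video", "/getpage?page=32:photo"]:
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Under those assumptions the bare /getpage?page=109 form is blocked while the :video and :photo forms stay crawlable. A crawler that ignores Allow or wildcards would behave differently, so the real file is worth checking with the search engine's own robots.txt testing tool.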

bramley




msg:4290212
 4:04 pm on Mar 31, 2011 (gmt 0)

Update:

My attempts to remove pages of this form :

/index.php?option=com_content&view=article&id=1061

are not working.

I have:

Disallow: /index*.php
Disallow: /index.php?*

and used the Remove URL tool in Webmaster Tools, but these pages still show in the Google index. What am I doing wrong ?

pageoneresults




msg:4290216
 4:11 pm on Mar 31, 2011 (gmt 0)

Disallow: /index.php?*


I believe the trailing wildcard in this instance is ignored.

I'm not certain but I think your use of wildcards is incorrect.

I would like to block ALL these pages, no matter what the query string is (?....)


/*?

Block or remove pages using a robots.txt file
[Google.com...]

bramley




msg:4290298
 7:01 pm on Mar 31, 2011 (gmt 0)

I don't want to block every page, just those with index.php?something

Ideally I would keep index.php with no query string so that www.mydomain.com still appears in the index. Or will it appear anyway? Or is it the same as www.mydomain.com/index.php ?

g1smd




msg:4290344
 8:24 pm on Mar 31, 2011 (gmt 0)

Pattern matching in robots.txt is prefix matching "from the left".

Where a wildcard is used it is only needed "on the left" or "in the middle".

bramley




msg:4290408
 10:35 pm on Mar 31, 2011 (gmt 0)

I mean the '?' to be a literal part of the URL (the start of the query string).

Where can I read more on pattern matching for robots.txt ?

The robotstxt.org site says nothing about it.

g1smd




msg:4290434
 11:27 pm on Mar 31, 2011 (gmt 0)

The pattern matching is simple.

Disallow: /
matches: /<anything>

Disallow: /this
matches: /this<anything>

Disallow: /*that
matches: /<anything>that<anything>

Disallow: /this*that
matches: /this<anything>that<anything>
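The four cases above can be checked mechanically. The following is a minimal sketch, assuming the rules work as described in this post: matching starts at the left of the URL path, * stands for any run of characters, and every other character (including ?) is literal. The function names are made up for the illustration and are not part of any robots.txt library.

```python
import re

def rule_to_regex(rule):
    """Turn a robots.txt path rule into a regex: '*' becomes '.*', the rest is literal."""
    return ".*".join(re.escape(part) for part in rule.split("*"))

def rule_matches(rule, path):
    # re.match anchors only at the start of the string, which gives the
    # "from the left" prefix matching described above.
    return re.match(rule_to_regex(rule), path) is not None

checks = [
    ("/", "/anything/at/all"),
    ("/this", "/this-and-more"),
    ("/this", "/other/this"),
    ("/*that", "/foo/that/bar"),
    ("/this*that", "/this/x/that.html"),
]
for rule, path in checks:
    print(f"Disallow: {rule:12} {path:22} {rule_matches(rule, path)}")
```

Only the third pair fails to match, because /this is not at the left of /other/this. A trailing * adds nothing, since the rule already matches anything that follows it.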

bramley




msg:4290460
 12:24 am on Apr 1, 2011 (gmt 0)

Thanks g1smd.

So to allow index.php but disallow index.php?<anything>

use :

Disallow: /index.php*=

? (I used = rather than ? in case ? has a special meaning)

Will this do what I want ?

to allow articles (index.php?option=article&...) but not forum stuff (index.php?option=forum&...)

I could use :

Disallow: /index.php*=forum

?
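The candidate rules from this thread can be compared side by side using the matching behaviour g1smd described (match from the left, * is any run of characters, ? treated as a literal character). This is only a sketch under that assumption. The sample paths are hypothetical, and it says nothing about how quickly a search engine drops URLs it has already indexed.

```python
import re

def matches(rule, path):
    """Prefix match with '*' as a wildcard; every other character is literal."""
    regex = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(regex, path) is not None

# Hypothetical sample paths, chosen to resemble the URLs in this thread.
PATHS = [
    "/index.php",
    "/index.php?option=article&id=1061",
    "/index.php?option=forum&topic=5",
]

def blocked_by(rule, paths=PATHS):
    """Return the sample paths that a 'Disallow: <rule>' line would match."""
    return [p for p in paths if matches(rule, p)]

for rule in ["/index*.php", "/index.php?", "/index.php*=", "/index.php*=forum"]:
    print(f"Disallow: {rule}")
    for p in blocked_by(rule):
        print("   blocks", p)
```

On these samples, /index*.php also catches the bare /index.php, while /index.php? and /index.php*= leave it alone and block both query-string URLs. /index.php*=forum blocks only the forum URL, though it would also match any other parameter whose value starts with "forum".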
