


Help: robots disallow pages with various query-strings

     

bramley

7:38 pm on Mar 30, 2011 (gmt 0)



Hi,

I need help blocking (noindex) a page (index.php) from search engines. Because the site runs on a CMS, the pages have URLs like this:

/index.php?option=com_content&view=article&layout=form&Itemid=29

I would like to block ALL these pages, no matter what the query string is (?....)

How do I do this in robots.txt?

I have tried both index.php and index.php* but I still see such URLs in the search index (Google), even after using the Webmaster Tools URL removal, which seemed to accept them.

I have URLs not based on index.php that I wish to be the only ones indexed by the search engines.

Secondly, I have many URLs of the form /getpage?page=67, /getpage?page=109, etc. I have changed these to append some extra information, and I wish to remove the URLs of this type that don't have the appended info.

For example :
/getpage?page=109 exclude this form
/getpage?page=109:video keep (index) this form

Thanks!

bramley

8:30 pm on Mar 30, 2011 (gmt 0)



Seems the answer to my first question might be :

/index*.php

which matches :
/index.php
/indexblah/index.php?anyparameters

according to : [code.google.com...]

I assume it will match : /index.php?anyparameters

I'm still looking for an answer to the second question. Is it a case of disallowing ?a=
then allowing ?a=blah:text ?

bramley

1:42 am on Mar 31, 2011 (gmt 0)



I guess I need to make it clearer :

exclude this form :

/getpage?page=109
/getpage?page=77
/getpage?page=811
etc

keep (allow / index) this form :

/getpage?page=109:video
/getpage?page=32:photo
/getpage?page=816:article
etc

How do I do this in robots.txt?

I've updated the URL creation to be more descriptive, but unless I can remove the old-format URLs I will face duplicate content / title issues. (There are too many to block one by one.)
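
One way to reason about the /getpage case is to pair a broad Disallow with a more specific Allow, i.e. "Disallow: /getpage?page=" plus "Allow: /getpage?page=*:". The sketch below is not Google's code; it only assumes Google-style handling, where "*" is a wildcard, "?" is a literal character, and the longest matching pattern wins (Allow winning a tie). The two rules and the function names are illustrative, and other crawlers may resolve Allow/Disallow conflicts differently.

```python
import re

def to_regex(pattern):
    # "*" matches any run of characters; a trailing "$" anchors the end.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

# Hypothetical rules for the /getpage case (not taken from a real site).
RULES = [
    ("disallow", "/getpage?page="),
    ("allow", "/getpage?page=*:"),
]

def is_allowed(path, rules=RULES):
    # Collect every rule whose pattern matches from the start of the path.
    hits = [(len(pat), kind == "allow")
            for kind, pat in rules
            if to_regex(pat).match(path)]
    if not hits:
        return True  # no rule applies, so crawling is permitted
    # Longest pattern wins; on a tie, True (allow) sorts above False.
    return max(hits)[1]

for path in ("/getpage?page=109", "/getpage?page=109:video", "/about"):
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Under these assumptions the bare /getpage?page=109 form is blocked, while the :video, :photo and :article forms stay crawlable because the Allow pattern is the longer match.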

bramley

4:04 pm on Mar 31, 2011 (gmt 0)



Update:

My attempts to remove pages of this form :

/index.php?option=com_content&view=article&id=1061

are not working.

I have:

Disallow: /index*.php
Disallow: /index.php?*

and used the URL removal tool in Webmaster Tools, but these pages still show in the Google index. What am I doing wrong?

pageoneresults

4:11 pm on Mar 31, 2011 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Disallow: /index.php?*


I believe the trailing wildcard in this instance is ignored.

I'm not certain but I think your use of wildcards is incorrect.

I would like to block ALL these pages, no matter what the query string is (?....)


/*?

Block or remove pages using a robots.txt file
[Google.com...]

bramley

7:01 pm on Mar 31, 2011 (gmt 0)



I don't want to block every page, just those with index.php?something

Ideally I'd keep index.php with no query string, so that www.mydomain.com still appears in the index. Or will it anyway? Or is it the same as www.mydomain.com/index.php?

g1smd

8:24 pm on Mar 31, 2011 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Pattern matching in robots.txt is prefix matching "from the left".

Where a wildcard is used it is only needed "on the left" or "in the middle".

bramley

10:35 pm on Mar 31, 2011 (gmt 0)



I mean the '?' to be a literal part of the URL (the start of the query string).

Where can I read more on pattern matching for robots.txt?

The robotstxt.org site says nothing about it.

g1smd

11:27 pm on Mar 31, 2011 (gmt 0)




The pattern matching is simple.

Disallow: /

matches: /<anything>

Disallow: /this

matches: /this<anything>

Disallow: /*that

matches: /<anything>that<anything>

Disallow: /this*that

matches: /this<anything>that<anything>
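
The matching rules above can be sketched in a few lines of Python. This is only an approximation for testing patterns by hand, not any crawler's actual implementation; the function name rule_matches is made up for the example, and the "$" end anchor is an extension that Google documents but that is not mentioned in this thread.

```python
import re

def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches a URL path + query."""
    # A trailing "$" anchors the pattern to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each "*" stand for any characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start only, giving prefix matching "from the left".
    return re.match(regex, path) is not None

print(rule_matches("/this", "/this-page.html"))
print(rule_matches("/*that", "/foo/that/bar"))
print(rule_matches("/index.php?*", "/index.php?id=1"))
print(rule_matches("/index.php?", "/index.php?id=1"))
```

Because matching is already a prefix match, "/index.php?*" and "/index.php?" behave identically here, which is why a trailing wildcard adds nothing. Note also that "/index*.php" matches plain /index.php, since "*" can match an empty string.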

bramley

12:24 am on Apr 1, 2011 (gmt 0)



Thanks g1smd.

So to allow index.php but disallow index.php?<anything>

use :

Disallow: /index.php*=

? (I used = rather than ? in case ? has a special meaning)

Will this do what I want ?

to allow articles (index.php?option=article&...) but not forum stuff (index.php?option=forum&...)

I could use :

Disallow: /index.php*=forum

?
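
The two patterns proposed above can be checked by hand with a small script. This is a rough sketch that assumes Google-style wildcards, where only "*" is special and "?" and "=" are literal characters; the helper name blocks and the sample URLs are invented for the test, and the "$" anchor is deliberately left out.

```python
import re

def blocks(pattern, path):
    # Treat "*" as "any run of characters" and match from the start of the path.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None

PATTERNS = ("/index.php*=", "/index.php?", "/index.php*=forum")
PATHS = (
    "/index.php",
    "/index.php?option=article&id=7",
    "/index.php?option=forum&id=7",
    "/index.php?option=article&ref=forum",
)

for pattern in PATTERNS:
    for path in PATHS:
        print(pattern, path, "blocked" if blocks(pattern, path) else "not blocked")
```

On these samples "/index.php*=" and the simpler "/index.php?" give the same result: plain /index.php is left alone and every URL with a query string is blocked. "/index.php*=forum" blocks the forum URL but also catches the last sample, because "=forum" can appear in any parameter. If option is always the first parameter, "/index.php?option=forum" would be a tighter pattern.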
 
