robots.txt question

   
2:51 pm on Mar 17, 2010 (gmt 0)

WebmasterWorld Senior Member wheel is a WebmasterWorld Top Contributor of All Time 10+ Year Member



How do I disallow all pages like
example.com/?page_id={page number}

Can I just use
disallow: ?page
4:50 pm on Mar 17, 2010 (gmt 0)

5+ Year Member



Simply enter the pages individually.
6:18 pm on Mar 17, 2010 (gmt 0)

WebmasterWorld Senior Member wheel is a WebmasterWorld Top Contributor of All Time 10+ Year Member



There are 5000 of them.
8:05 pm on Mar 17, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You probably want
disallow: /?page

Also, this question should be in the Sitemaps, Meta Data, and robots.txt forum [webmasterworld.com].
8:05 pm on Mar 17, 2010 (gmt 0)

WebmasterWorld Senior Member lammert is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Instead of adding them to robots.txt, you could add a meta robots noindex tag to each page, or send an X-Robots-Tag: noindex header, dynamically from the script you use to generate these pages.
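
For example, assuming these pages come out of WordPress (as it turns out later in the thread they do), a rough sketch that could go in the theme's functions.php or a small plugin - the isset($_GET['page_id']) test is just one guess at how to spot the ugly URLs, so adjust it to however your script recognises them:

// mark the ?page_id= versions noindex with a meta tag in the head...
add_action( 'wp_head', function () {
    if ( isset( $_GET['page_id'] ) ) {
        echo '<meta name="robots" content="noindex, follow">' . "\n";
    }
} );

// ...or send the equivalent HTTP header instead, before any output.
add_action( 'send_headers', function () {
    if ( isset( $_GET['page_id'] ) ) {
        header( 'X-Robots-Tag: noindex, follow' );
    }
} );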
8:38 pm on Mar 17, 2010 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



robots.txt works on the url and matches left-to-right.

if you want to exclude obedient crawlers, Dijkgraaf is correct, but i would prefer to be more specific where possible:
Disallow: /?page_id=
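
in a full file that rule sits under a user-agent line, e.g. (just a sketch):

User-agent: *
Disallow: /?page_id=

the match starts at the leading slash, so this catches example.com/?page_id=123 and anything else whose path-and-query begins with /?page_id=.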


however excluding robots does not equal preventing indexing, so if you don't want your urls indexed you should heed the advice lammert provided.

i would also suggest you check out the "interview with Matt Cutts by Eric Enge" [webmasterworld.com] and discussion on this subject.
9:43 pm on Mar 17, 2010 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



If you do use the robots.txt method, do be aware that
Disallow: /?page_id=
does NOT disallow URL requests like
example.com/index.php?page_id=
so you'll need another rule like
Disallow: /index.php?page_id=
for that.

You might be tempted to use
Disallow: /*?page_id=
but not all robots understand that terminology. If you do use it, you need to place it in a specific Googlebot section, and then you have to also duplicate all of the rules from the User-agent: * section into the User-agent: Googlebot section of the file. This is because Google reads only the most specific section of the robots.txt file that applies to it.
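
For example (a sketch only - the /private/ line is just a hypothetical stand-in for whatever else the file already blocks):

User-agent: *
Disallow: /?page_id=
Disallow: /index.php?page_id=
Disallow: /private/

# Googlebot ignores the * section once its own section exists, so every
# rule above has to be repeated here alongside the wildcard rule.
User-agent: Googlebot
Disallow: /?page_id=
Disallow: /index.php?page_id=
Disallow: /private/
Disallow: /*?page_id=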
11:59 pm on Mar 17, 2010 (gmt 0)

WebmasterWorld Senior Member wheel is a WebmasterWorld Top Contributor of All Time 10+ Year Member



What happened was this: I'm using WordPress on a huge site (lots of pages). WordPress's pretty URL function fails completely at about 1,000 pages, so we wrote our own mapping function for all the pages.

Unfortunately, I left a canonical URL setting on, so every page contained a link to the underlying WordPress page - the ?page_id= version. So now all my pages got indexed twice. Messy.

I should probably deny these pages in .htaccess instead, or something - I don't want Google touching them. Would that be a better solution? Serve a 403?
12:42 am on Mar 18, 2010 (gmt 0)

WebmasterWorld Senior Member lammert is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



By blocking these pages you may waste PageRank: links to these URLs assign it, but it never propagates further. You could instead use the rel=canonical tag to tell the search engines that the ?page_id= versions are just copies of the pages with the pretty URLs. In that case the duplicates are combined in the major search engines and only the pretty version should show up in the SERPs.
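
On the ?page_id= version that comes down to one line in the head, e.g. <link rel="canonical" href="http://example.com/some-pretty-url/"> (the URL is a placeholder). A rough WordPress sketch, assuming get_permalink() returns your pretty URL - if your custom mapping function does that lookup instead, call it in place of get_permalink():

// emit a canonical link on the ?page_id= versions, pointing at the pretty URL.
add_action( 'wp_head', function () {
    if ( isset( $_GET['page_id'] ) ) {
        $pretty = get_permalink( (int) $_GET['page_id'] );
        if ( $pretty ) {
            echo '<link rel="canonical" href="' . esc_url( $pretty ) . '">' . "\n";
        }
    }
} );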
3:51 am on Mar 18, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



An excellent plugin called "Redirection" exists that lets you redirect pages in any way you wish, one at a time or in entire batches, and it takes care of your .htaccess and/or robots.txt files too.

I've used it myself with excellent results on a 1,500-page site to get rid of a Yahoo-forced "index.php" URL that the new host refused to accept.

It is EXTREMELY flexible and will also provide statistics, 404 error monitoring, and full logs so you can make sure it is working as intended.

It's even multi-language friendly; I found it to be as close to godly as you will ever find for this problem. (The author's name is John Godley - pardon the pun.)
9:36 am on Mar 18, 2010 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



the correct, preferred and ideal solution is a 301.
it is air-tight and unambiguous - no hints, ifs, ands, or buts.
the fallback solution is rel=canonical, which i would only use if the 301 is technically impossible or cost-prohibitive for you.
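
a rough sketch of that 301 in WordPress terms - again it assumes get_permalink() (or your own mapping function, called in its place) can turn the page_id into the pretty URL:

// 301 every ?page_id= request to its pretty URL before any template renders.
add_action( 'template_redirect', function () {
    if ( isset( $_GET['page_id'] ) ) {
        $pretty = get_permalink( (int) $_GET['page_id'] );
        // skip if the lookup falls back to the ugly form, to avoid a redirect loop.
        if ( $pretty && false === strpos( $pretty, 'page_id=' ) ) {
            wp_safe_redirect( $pretty, 301 );
            exit;
        }
    }
} );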
11:57 am on Mar 18, 2010 (gmt 0)

WebmasterWorld Senior Member wheel is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Gah! Yet another hack. I have a hack already installed that seamlessly displays the page when the pretty URL is entered. Now it looks like I need another hack to look up the pretty URL given a page_id, then 301 to the pretty URL (which hands off to the first hack that actually displays the WordPress page).

Not happy with WordPress over this. Drupal just works.