Simply enter pages
There are 5000 of them.
You probably want
Also, this question belongs in the Sitemaps, Meta Data, and robots.txt forum [webmasterworld.com].
Instead of adding them to robots.txt, you could add a meta robots tag in each page or an X-Robots-Tag: noindex header line dynamically with the script you use to generate these pages.
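As a rough illustration of the header approach, here is a minimal Python sketch of the decision logic only. WordPress itself is PHP, so treat this as pseudocode for whatever script generates the pages; the function name extra_headers and the sample URLs are made up for the example.

```python
from urllib.parse import urlsplit, parse_qs

def extra_headers(request_uri):
    """Return extra response headers for a request URI (path plus query).

    Assumption: every URL carrying a page_id query parameter is an
    unwanted duplicate and should be marked noindex.
    """
    query = parse_qs(urlsplit(request_uri).query, keep_blank_values=True)
    if "page_id" in query:
        return [("X-Robots-Tag", "noindex")]
    return []

print(extra_headers("/?page_id=42"))        # [('X-Robots-Tag', 'noindex')]
print(extra_headers("/products/widgets/"))  # []
```

Note that a crawler has to be allowed to fetch the page in order to see this header, so it does not combine with a robots.txt Disallow for the same URLs.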
robots.txt works on the url path (query string included) and matches left-to-right, i.e. each rule is a prefix match.
if you want to exclude obedient crawlers, Dijkgraaf is correct but i would prefer to be more specific if necessary:
however excluding robots does not equal preventing indexing, so if you don't want your urls indexed you should heed the advice lammert provided.
i would also suggest you check out the "interview with Matt Cutts by Eric Enge" [webmasterworld.com] and discussion on this subject.
If you do use the robots.txt method, do be aware that
Disallow: /?page_id= does NOT disallow URL requests like
example.com/index.php?page_id= so you'll need another rule like
Disallow: /index.php?page_id= for that.
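To see why the first rule misses, here is a minimal Python sketch of plain left-to-right prefix matching. The function name is_disallowed and the page number 42 are invented for the example, and real crawlers also deal with Allow rules, percent-encoding and so on, which this ignores.

```python
def is_disallowed(path, disallow_rules):
    """True if any non-empty Disallow value is a prefix of path + query."""
    return any(rule and path.startswith(rule) for rule in disallow_rules)

rules = ["/?page_id="]
print(is_disallowed("/?page_id=42", rules))           # True
print(is_disallowed("/index.php?page_id=42", rules))  # False: prefix differs

rules.append("/index.php?page_id=")
print(is_disallowed("/index.php?page_id=42", rules))  # True
```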
You might be tempted to use
Disallow: /*?page_id= but not all robots understand that wildcard syntax. If you do use it, place it in a dedicated User-agent: Googlebot section, and also duplicate every rule from the User-agent: * section into that Googlebot section. This is because Google reads only the most specific section of the robots.txt file that applies to it.
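The "most specific section only" behaviour can be sketched in Python like this. It is a simplified model, not Google's actual parser: the helper names are invented, the /wp-admin/ rule is only there as an example of a rule you would have to duplicate, and the $ end anchor and Allow lines are not handled.

```python
import re

ROBOTS_TXT = """\
User-agent: *
Disallow: /?page_id=
Disallow: /index.php?page_id=
Disallow: /wp-admin/

User-agent: Googlebot
Disallow: /*?page_id=
"""

def parse_groups(text):
    """Map each lowercase user-agent token to its list of Disallow values."""
    groups, agents, in_rules = {}, [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            in_rules = True
            for agent in agents:
                groups[agent].append(value)
    return groups

def rules_for(agent, groups):
    """Use only the most specific matching group; '*' is just the fallback."""
    agent = agent.lower()
    matches = [name for name in groups if name != "*" and name in agent]
    if matches:
        return groups[max(matches, key=len)]
    return groups.get("*", [])

def rule_matches(rule, path):
    """Prefix match where '*' in the rule stands for any run of characters."""
    if not rule:
        return False  # an empty Disallow blocks nothing
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.match(pattern, path) is not None

groups = parse_groups(ROBOTS_TXT)
googlebot_rules = rules_for("Googlebot", groups)
print(googlebot_rules)  # ['/*?page_id=']
print(any(rule_matches(r, "/index.php?page_id=42") for r in googlebot_rules))  # True
print(any(rule_matches(r, "/wp-admin/options.php") for r in googlebot_rules))  # False
```

The last line is the catch: in this model /wp-admin/ is blocked for everyone except Googlebot, because the Googlebot section does not repeat that rule.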
What happened was this: I'm using WordPress on a huge site (lots of pages). WordPress's pretty URL function fails completely at about 1,000 pages, so we wrote our own mapping function for all the pages.
Unfortunately, I left a canonical URL setting on, so every page contained a link to the underlying WordPress page, the ?page_id= version. So now all my pages got indexed twice. Messy.
I should probably deny the pages in .htaccess instead or something - I don't want Google touching these pages. Would that be a better solution? Serve a 403?
By blocking these pages you may waste PageRank: it is assigned by links to these URLs but never propagates further. You could instead use the rel=canonical tag to tell the search engines that the ?page_id= versions are just copies of the pages with the pretty URLs. In that case the major search engines combine the duplicates and only the pretty version should show up in the SERPs.
An excellent plugin called "Redirection" lets you redirect pages any way you wish, one at a time or in batches, and it takes care of your .htaccess and/or robots.txt files too.
I've used it myself with excellent results on a 1,500-page site to get rid of a Yahoo-forced "index.php" URL that the new host refused to accept.
It is EXTREMELY flexible and also provides statistics, 404 error monitoring and full logs, so you can make sure it is working as intended.
It's even multi-language friendly. I found it to be as close to godly as you will ever find for this problem. (The author's name is John Godley, pardon the pun.)
the correct, preferred and ideal solution is a 301 redirect from each ?page_id= url to its pretty url.
it is air-tight and unambiguous - no hints, ifs, ands or buts.
the fallback solution is rel=canonical, which i would only use if the 301 is technically impossible or cost-prohibitive for you.
Gah! Yet another hack. I already have a hack installed that seamlessly displays the page when the pretty URL is entered. Now it looks like I need another hack to look up the pretty URL for a given page_id, then 301 to the pretty URL (which goes to the first hack that actually displays the WordPress page).
Not happy with WordPress over this. Drupal just works.
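For what it's worth, the lookup-then-301 step described above is small. Here is a hedged Python sketch of the logic only; a WordPress version would be PHP, and the PRETTY_URLS table, its entries and the function name redirect_for are all hypothetical stand-ins for the site's own mapping function.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical page_id -> pretty URL table; the real site would query
# its own mapping function or database instead.
PRETTY_URLS = {42: "/products/widgets/", 43: "/about/"}

def redirect_for(request_uri, pretty_urls=PRETTY_URLS):
    """Return (301, location) for a known ?page_id= URL, else None."""
    values = parse_qs(urlsplit(request_uri).query).get("page_id")
    if not values:
        return None  # no page_id parameter: serve the page normally
    try:
        page_id = int(values[0])
    except ValueError:
        return None  # malformed id: fall through to normal handling
    target = pretty_urls.get(page_id)
    if target is None:
        return None  # unknown page: let the normal 404 path handle it
    return 301, target

print(redirect_for("/?page_id=42"))  # (301, '/products/widgets/')
print(redirect_for("/about/"))       # None
```

Because the pretty URLs carry no page_id parameter, the redirect cannot loop back on itself.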