robots.txt question

   
2:51 pm on Mar 17, 2010 (gmt 0)

WebmasterWorld Senior Member wheel is a WebmasterWorld Top Contributor of All Time 10+ Year Member



How do I disallow all pages like
example.com/?page_id={page number}

Can I just use
disallow: ?page
4:50 pm on Mar 17, 2010 (gmt 0)

5+ Year Member



Simply enter the pages individually.
6:18 pm on Mar 17, 2010 (gmt 0)

WebmasterWorld Senior Member wheel is a WebmasterWorld Top Contributor of All Time 10+ Year Member



There are 5000 of them.
8:05 pm on Mar 17, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You probably want
disallow: /?page

Also, this question should be in the Sitemaps, Meta Data, and robots.txt forum [webmasterworld.com].
8:05 pm on Mar 17, 2010 (gmt 0)

WebmasterWorld Senior Member lammert is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Instead of adding them to robots.txt, you could add a meta robots noindex tag to each page, or send an X-Robots-Tag: noindex header, dynamically from the script you use to generate these pages.
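
For example, assuming these pages come out of WordPress (as it turns out later in the thread they do), a rough sketch that could go in the theme's functions.php or a small plugin - the isset($_GET['page_id']) test is just one guess at how to spot the ugly URLs, so adjust it to however your script recognises them:

// mark the ?page_id= versions noindex with a meta tag in the head...
add_action( 'wp_head', function () {
    if ( isset( $_GET['page_id'] ) ) {
        echo '<meta name="robots" content="noindex, follow">' . "\n";
    }
} );

// ...or send the equivalent HTTP header instead, before any output.
add_action( 'send_headers', function () {
    if ( isset( $_GET['page_id'] ) ) {
        header( 'X-Robots-Tag: noindex, follow' );
    }
} );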
8:38 pm on Mar 17, 2010 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



robots.txt works on the url and matches left-to-right.

if you want to exclude obedient crawlers, Dijkgraaf is correct, but i would prefer to be more specific where possible:
Disallow: /?page_id=
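
in a full file that rule sits under a user-agent line, e.g. (just a sketch):

User-agent: *
Disallow: /?page_id=

the match starts at the leading slash, so this catches example.com/?page_id=123 and anything else whose path-and-query begins with /?page_id=.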


however excluding robots does not equal preventing indexing, so if you don't want your urls indexed you should heed the advice lammert provided.

i would also suggest you check out the "interview with Matt Cutts by Eric Enge" [webmasterworld.com] and discussion on this subject.
9:43 pm on Mar 17, 2010 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



If you do use the robots.txt method, do be aware that
Disallow: /?page_id=
does NOT disallow URL requests like
example.com/index.php?page_id=
so you'll need another rule like
Disallow: /index.php?page_id=
for that.

You might be tempted to use
Disallow: /*?page_id=
but not all robots understand that terminology. If you do use it, you need to place it in a specific Googlebot section, and then you have to also duplicate all of the rules from the User-agent: * section into the User-agent: Googlebot section of the file. This is because Google reads only the most specific section of the robots.txt file that applies to it.
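
For example (a sketch only - the /private/ line is just a hypothetical stand-in for whatever else the file already blocks):

User-agent: *
Disallow: /?page_id=
Disallow: /index.php?page_id=
Disallow: /private/

# Googlebot ignores the * section once its own section exists, so every
# rule above has to be repeated here alongside the wildcard rule.
User-agent: Googlebot
Disallow: /?page_id=
Disallow: /index.php?page_id=
Disallow: /private/
Disallow: /*?page_id=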
11:59 pm on Mar 17, 2010 (gmt 0)

WebmasterWorld Senior Member wheel is a WebmasterWorld Top Contributor of All Time 10+ Year Member



What happened was this: I'm using WordPress on a huge site (lots of pages). WordPress's pretty URL function fails completely at about 1,000 pages, so we wrote our own mapping function for all the pages.

Unfortunately, I left a canonical URL setting on, so every page contained a link to the underlying WordPress page - the ?page_id= version. So now all my pages got indexed twice. Messy.

I should probably deny these pages in .htaccess instead, or something - I don't want Google touching them. Would that be a better solution? Serve a 403?
12:42 am on Mar 18, 2010 (gmt 0)

WebmasterWorld Senior Member lammert is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



By blocking these pages you may waste PageRank: links to these URLs assign it, but it never propagates further. You could instead use the rel=canonical tag to tell the search engines that the ?page_id= versions are just copies of the pages with the pretty URLs. In that case the duplicates are combined in the major search engines and only the pretty version should show up in the SERPs.
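
On the ?page_id= version that comes down to one line in the head, e.g. <link rel="canonical" href="http://example.com/some-pretty-url/"> (the URL is a placeholder). A rough WordPress sketch, assuming get_permalink() returns your pretty URL - if your custom mapping function does that lookup instead, call it in place of get_permalink():

// emit a canonical link on the ?page_id= versions, pointing at the pretty URL.
add_action( 'wp_head', function () {
    if ( isset( $_GET['page_id'] ) ) {
        $pretty = get_permalink( (int) $_GET['page_id'] );
        if ( $pretty ) {
            echo '<link rel="canonical" href="' . esc_url( $pretty ) . '">' . "\n";
        }
    }
} );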
3:51 am on Mar 18, 2010 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



An excellent plugin called "Redirection" exists that lets you redirect pages in any way you wish, one at a time or in entire batches, and it takes care of your .htaccess and/or robots.txt files too.

I've used it myself with excellent results on a 1,500-page site to get rid of a Yahoo-forced "index.php" URL that the new host refused to accept.

It is EXTREMELY flexible and will also provide statistics, 404 error monitoring, and full logs so you can make sure it is working as intended.

It's even multi-language friendly; I found it to be as close to godly as you will ever find for this problem. (The author's name is John Godley - pardon the pun.)
9:36 am on Mar 18, 2010 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



the correct, preferred and ideal solution is a 301.
it is air-tight and unambiguous - no hints, ifs, ands, or buts.
the fallback solution is rel=canonical, which i would only use if the 301 is technically impossible or cost-prohibitive for you.
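
a rough sketch of that 301 in WordPress terms - again it assumes get_permalink() (or your own mapping function, called in its place) can turn the page_id into the pretty URL:

// 301 every ?page_id= request to its pretty URL before any template renders.
add_action( 'template_redirect', function () {
    if ( isset( $_GET['page_id'] ) ) {
        $pretty = get_permalink( (int) $_GET['page_id'] );
        // skip if the lookup falls back to the ugly form, to avoid a redirect loop.
        if ( $pretty && false === strpos( $pretty, 'page_id=' ) ) {
            wp_safe_redirect( $pretty, 301 );
            exit;
        }
    }
} );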
11:57 am on Mar 18, 2010 (gmt 0)

WebmasterWorld Senior Member wheel is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Gah! Yet another hack. I have a hack already installed that seamlessly displays the page when the pretty URL is entered. Now it looks like I need another hack to look up the pretty URL given a page_id, then 301 to the pretty URL (which hands off to the first hack that actually displays the WordPress page).

Not happy with WordPress over this. Drupal just works.