
Webmaster General Forum

robots.txt question
wheel

Msg#: 4099545 posted 2:51 pm on Mar 17, 2010 (gmt 0)

How do I disallow all pages like
example.com/?page_id={page number}?

Can I just use
disallow: ?page

 

Advis



 
Msg#: 4099545 posted 4:50 pm on Mar 17, 2010 (gmt 0)

Simply enter the pages one by one.

wheel

Msg#: 4099545 posted 6:18 pm on Mar 17, 2010 (gmt 0)

There are 5000 of them.

Dijkgraaf

Msg#: 4099545 posted 8:05 pm on Mar 17, 2010 (gmt 0)

You probably want
disallow: /?page

Also, this question belongs in the Sitemaps, Meta Data, and robots.txt forum [webmasterworld.com].

lammert

Msg#: 4099545 posted 8:05 pm on Mar 17, 2010 (gmt 0)

Instead of adding them to robots.txt, you could add a meta robots tag to each page, or send an X-Robots-Tag: noindex header dynamically from the script you use to generate these pages.
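
As a rough illustration of lammert's suggestion, here is a minimal PHP sketch (hypothetical; it assumes the header is sent before the script emits any output):

<?php
// sketch: send a noindex signal on the duplicate ?page_id= URLs
// (header() must run before any HTML output)
if (isset($_GET['page_id'])) {
    header('X-Robots-Tag: noindex');
    // alternatively, print this inside the page <head>:
    // echo '<meta name="robots" content="noindex">';
}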

phranque

Msg#: 4099545 posted 8:38 pm on Mar 17, 2010 (gmt 0)

robots.txt works on the url and matches left-to-right.

if you want to exclude obedient crawlers, Dijkgraaf is correct, but i would prefer to be more specific:
Disallow: /?page_id=


however, excluding robots is not the same as preventing indexing, so if you don't want your urls indexed you should heed the advice lammert provided.

i would also suggest you check out the "interview with Matt Cutts by Eric Enge" [webmasterworld.com] and discussion on this subject.

g1smd

Msg#: 4099545 posted 9:43 pm on Mar 17, 2010 (gmt 0)

If you do use the robots.txt method, do be aware that
Disallow: /?page_id=
does NOT disallow URL requests like example.com/index.php?page_id= so you'll need another rule like
Disallow: /index.php?page_id=
for that.

You might be tempted to use
Disallow: /*?page_id=
but not all robots understand that syntax. If you do use it, you need to place it in a specific Googlebot section, and you then have to duplicate all of the rules from the User-agent: * section into the User-agent: Googlebot section of the file, because Google reads only the most specific section of robots.txt that applies to it.
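
A sketch of the layout g1smd describes, with a hypothetical /cgi-bin/ rule standing in for whatever your User-agent: * section already contains:

User-agent: *
Disallow: /cgi-bin/
Disallow: /?page_id=
Disallow: /index.php?page_id=

User-agent: Googlebot
# every rule from the * section must be repeated here
Disallow: /cgi-bin/
Disallow: /*?page_id=
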
wheel

Msg#: 4099545 posted 11:59 pm on Mar 17, 2010 (gmt 0)

What happened was this: I'm using WordPress on a huge site (lots of pages). WordPress's pretty-URL function fails completely at about 1,000 pages, so we wrote our own mapping function for all the pages.

Unfortunately, I left a canonical URL setting on, so every page contained a link to the underlying WordPress page - the ?page_id= page. So now all my pages got indexed twice. Messy.

I should probably htaccess-deny the pages instead or something - I don't want Google touching these pages. Would that be a better solution? Serve a 403?
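
For reference, the htaccess-deny approach wheel is describing would look roughly like this - a hypothetical, untested sketch, assuming Apache with mod_rewrite:

RewriteEngine On
# return 403 Forbidden for any request whose query string contains page_id=
RewriteCond %{QUERY_STRING} (^|&)page_id= [NC]
RewriteRule ^ - [F]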

lammert

Msg#: 4099545 posted 12:42 am on Mar 18, 2010 (gmt 0)

By blocking these pages you may waste PageRank, which is assigned by links to these URLs but never propagates further. You could instead use the rel=canonical tag to tell the search engines that the ?page_id= versions are just copies of the pages with the pretty URLs. In that case the duplicates are combined in the major search engines and only the pretty version should show up in the SERPs.
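
For example, the duplicate ?page_id= version of a page would carry, inside its <head>, a link element pointing at the pretty URL (the href here is illustrative):

<link rel="canonical" href="http://example.com/pretty-page-url/">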

JS_Harris

Msg#: 4099545 posted 3:51 am on Mar 18, 2010 (gmt 0)

There is an excellent plugin called "Redirection" that will let you redirect pages any way you wish, one at a time or in whole batches, and it takes care of your .htaccess and/or robots.txt files too.

I've used it myself with excellent results on a 1500-page site to get rid of a Yahoo-forced "index.php" URL that the new host refused to accept.

It is EXTREMELY flexible and also provides statistics, 404-error monitoring, and full logs, so you can make sure it is working as intended.

It's even multi-language friendly. I found it to be as close to godly as you will ever find for this problem (the author's name is John Godley, pardon the pun).

phranque

Msg#: 4099545 posted 9:36 am on Mar 18, 2010 (gmt 0)

the correct, preferred and ideal solution is a 301.
it is air-tight and unambiguous - no hints, ifs, ands, or buts.
the fallback solution is rel=canonical, which i would use only if the 301 is technically impossible or cost-prohibitive for you.
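
A minimal PHP sketch of that 301, assuming a hypothetical my_lookup_pretty_url() built on wheel's custom mapping function:

<?php
// sketch: permanently redirect ?page_id= requests to their pretty URLs
if (isset($_GET['page_id'])) {
    $pretty = my_lookup_pretty_url((int) $_GET['page_id']); // hypothetical lookup
    if ($pretty) {
        header('Location: ' . $pretty, true, 301); // 301 Moved Permanently
        exit;
    }
}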

wheel

Msg#: 4099545 posted 11:57 am on Mar 18, 2010 (gmt 0)

Gah! Yet another hack. I already have a hack installed that seamlessly displays the page when the pretty url is entered. Now it looks like I need another hack that looks up the pretty url for a given page_id and 301s to it (which then goes through the first hack that actually displays the WordPress page).

Not happy with WordPress over this. Drupal just works.
