ergophobe - 6:51 pm on Nov 5, 2012 (gmt 0)
Whew! There's a lot in this thread. There's one potentially really dangerous piece of advice, though, so let me hit that one first.
As for pages vs posts...I think you are right to uses pages.
Emphatically, absolutely and definitely do NOT use pages if you are converting a lot of content and importing your existing URL structure. This is a BAD IDEA.
While this is formally true:
index.php which does the grunt work in making almost any url structure you want work.
there are problems. The "grunt work" is actually done by /wp-includes/canonical.php, but yes it is true that Wordpress as a system handles this internally these days rather than in a massive generated .htaccess file like it did early on. And it's also true that Wordpress can deal with any URL structure.
That said, it will deal with some of them very poorly, in particular if you are using Pages. If you have Page URLs in the form %postname% or %category%/%postname% Wordpress won't (or as of 2011 would not) scale well. I forget the reasons, but I believe it has to do with the fact that Post slugs need to be unique (regardless of what the rest of the path is, the last element must be unique, so it's a simple lookup), whereas Page slugs do not have this requirement, dramatically increasing the difficulty of the lookup, especially with a deep URL structure. I'm not sure I have that right, but here's what Sam Wood (aka Otto, core contributor to Wordpress) has to say:
Once you have about 50-100 static Pages or so, and you’re using an ambiguous custom structure, then the system tends to fall apart. Most of the time, the ruleset grows too large to fit into a single mySQL query, meaning that the rules can no longer be properly saved in the database and must be rebuilt each time. The most obvious effect when this happens is that the number of queries on every page load rises from the below 50 range to 2000+ queries, and the site slows down to snail speed.
See also: [ottopress.com...] and [core.trac.wordpress.org...] and [wordpress.org...]
The general advice is that if you think you're going to have thousands of Pages in the WP sense of that, or very very large numbers of Posts, you should do the lookup based on numeric data (date, post id) at the beginning of the URL rather than the post slug or category. Since you're importing an existing URL structure, that can't happen. Therefore, Pages are a Bad Idea(tm).
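To make the scaling argument concrete, here is a rough Python model. It is not WordPress's actual rewrite code. It assumes, following Otto's description, that a structure starting with a text tag such as %postname% or %category% forces a separate group of rules for every Page, while a structure starting with a numeric tag lets a fixed set of pattern rules cover everything. The function name and both constants are made up for illustration.

```python
# Simplified model of rewrite-rule growth; NOT WordPress's real implementation.
# Both constants are illustrative assumptions, not measured values.
RULES_PER_PAGE = 10   # hypothetical: rules emitted for each individual Page
BASE_RULES = 80       # hypothetical: generic rules for posts, feeds, archives

# Assumed list of leading tags that make a structure ambiguous.
AMBIGUOUS_PREFIXES = ("%postname%", "%category%", "%tag%", "%author%")


def estimated_rule_count(permalink_structure: str, page_count: int) -> int:
    """Estimate how many rewrite rules a site would need under this model."""
    first_tag = permalink_structure.strip("/").split("/")[0]
    if first_tag in AMBIGUOUS_PREFIXES:
        # A text-first structure cannot be told apart from a Page path by
        # pattern alone, so each Page gets its own explicit rules.
        return BASE_RULES + RULES_PER_PAGE * page_count
    # A numeric-first structure (%year%, %post_id%) is unambiguous, so the
    # rule count does not depend on the number of Pages.
    return BASE_RULES


if __name__ == "__main__":
    for pages in (10, 100, 1000):
        print(pages,
              estimated_rule_count("/%category%/%postname%/", pages),
              estimated_rule_count("/%year%/%postname%/", pages))
```

The point of the model is only the shape of the curve: one line grows with the number of Pages and the other stays flat.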
Which leads me to
Are you sure you only want to migrate your static sites to "Pages" in wordpress? Why not Posts?
I would be more emphatic and say that, in fact, you should be sure that you want Posts not Pages. Definitely NOT Pages.
Honestly, compared to that site-threatening issue, I think the rest is details.... but I'm into details.
Nobody shares this feeling using a Url-migration-tool in order to keep up the Urls?
I think people are sympathetic to that. It's a legitimate concern. The question is weighing your options. Generally, I agree with g1smd - file extensions on URLs are *evil*. It's only going on 16 years since Tim Berners-Lee laid this out (see "Axiom of URI Opacity"). No new site should do this. But you don't have a new site, and the question is whether now is the time to make that transition. You can argue it both ways. If it simplifies technology upgrades present (maybe) and future (definitely), then maybe it's the time. If it's a major headache and you are uncomfortable rolling out a huge number of 301s all at once, maybe it isn't the time.
You could roll it out with the same URL structure you currently have and then change the URLs in batches internally until you have gotten rid of the extensions and can then get rid of any corresponding plugins. In general, Wordpress has come a huge way since some years ago when it was likely to be sending out soft 404s and 302s and all that. Nowadays, if you have a URL alias and you change it from within Wordpress, it will automatically handle the 301. Obviously, you have to check your own setup, especially when you're doing something custom like this, but overwhelmingly Wordpress will now handle these sorts of things intelligently and do a decent job of canonicalization below the domain level.
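If you go the batch route, it helps to plan the old-to-new mapping before touching anything. Here is a minimal Python sketch of that planning step, assuming your legacy URLs end in .html and the new ones simply drop the extension. The function names are hypothetical and nothing here talks to Wordpress; it only produces the maps you would then apply and verify one batch at a time.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def strip_extension(path: str, ext: str = ".html") -> str:
    """Return the extensionless form of a legacy path (unchanged if no match)."""
    if ext and path.endswith(ext):
        return path[: -len(ext)]
    return path


def redirect_batches(legacy_paths: Iterable[str],
                     batch_size: int) -> Iterator[Dict[str, str]]:
    """Yield {old_path: new_path} maps covering batch_size legacy URLs each."""
    paths = iter(legacy_paths)
    while True:
        chunk = list(islice(paths, batch_size))
        if not chunk:
            return
        # Paths that would not change need no 301, so they are left out.
        yield {old: strip_extension(old)
               for old in chunk
               if strip_extension(old) != old}


if __name__ == "__main__":
    urls = ["/widgets/blue.html", "/widgets/red.html", "/about.html"]
    for number, batch in enumerate(redirect_batches(urls, 2), start=1):
        print("batch", number, batch)
```

Small batches mean that if a round of 301s misbehaves, only a handful of URLs are affected and you can check them by hand.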
htaccess can never "make" a URL.
I disagree about wp making those urls..
g1smd never said that "wp" can't make URLs, but that .htaccess can't. Of course, for internal links, WP is going to generate a lot of URLs (navigation, internal pingbacks, etc), the user will generate others (manual links inside a post) and, naturally, external sites will create URLs to your site. Of these three types of URLs, WP only really controls the first type, so you still have to deal with the others. Of course, it's important to get your internal linking structure right, because that is how search engines will crawl your site and will ultimately determine which URLs they have in their index, but in the case of a legacy site with old URLs inbound from other sites, you'll never have full control of who "makes" the URL and where. So .htaccess can serve as a traffic director, but it won't "make" anything.
That code does nothing for .html URLs.
that will work for .html urls
Again, it will work for .html URLs (and .asdfghwer URLs too) if those URL aliases are in the WP database and are a valid lookup path. But the .htaccess doesn't do anything for .html URLs. It only tests to see whether the URL points to a file or a directory, does a relatively expensive lookup to the filesystem for each of these checks, and if they fail, passes it to index.php which then does some preliminary checks and passes things on to canonical.php for all the URL parsing. So while that code will work for .html URLs given a WP setup designed to take that into account, the standard WP .htaccess does nothing with .html URLs except pass them on like any other URL.
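The decision described above can be mimicked in a few lines of Python. This is only a sketch of the logic, not the .htaccess itself and not anything Wordpress ships; the function name and return strings are invented for illustration. Note that the extension never enters into it.

```python
import os


def dispatch(document_root: str, url_path: str) -> str:
    """Mimic the decision the stock rewrite rules make for one request."""
    candidate = os.path.join(document_root, url_path.lstrip("/"))
    if os.path.isfile(candidate):   # filesystem check 1: is it a real file?
        return "serve file"
    if os.path.isdir(candidate):    # filesystem check 2: is it a real directory?
        return "serve directory"
    # Neither check matched, so the request goes to the front controller,
    # whatever the URL looks like (.html, .asdfghwer, or no extension).
    return "index.php"
```

Both checks hit the filesystem on every request that gets this far, which is the cost g1smd is talking about.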
So it's an apples and oranges discussion, which I think is leading to some disagreement where there need be none.
The .htaccess script (which is common not only with wordpress ( [codex.wordpress.org...] ), but with drupal and other CMS'es),
doesn't mean it's good code.
It is indeed standard and common to WP and Drupal and is used without problem on millions of sites. True, that doesn't mean it's "good" code - the KSES code that was shared across Wordpress, Drupal, Moodle and others made them all vulnerable to XSS attacks in 2010. But in general, a site will work fine with the distro .htaccess and compared to the Pages/Posts issue, this is minor.
g1smd's point in mentioning it is that the standard .htaccess file for these CMSes has somewhat expensive file system lookups that can be avoided, and you can gain some easy efficiency by modernizing your .htaccess file to the more efficient one that JDMorgan, g1smd and others banged out. See the relevant threads on
Joomla: [webmasterworld.com...] and [webmasterworld.com...]
What WP is doing is SERVING CONTENT from one location while pretending to live at a different location.
This only makes sense if you do not think in terms of Tim Berners-Lee's original Axiom of URI Opacity:
The only thing you can use an identifier for is to refer to an object. When you are not dereferencing, you should not look at the contents of the URI string to gain other information.
He follows up with an example in somewhat simpler language:
For example, within an HTTP identifier, even when access is made to the object, the client machine looks at the first part of the identifier to determine which server machine to talk to and from then on the rest of the string is defined to be opaque to the client. That is the client does not look inside it, it can not deduce any information from the characters in that identifier.
It is a violation of the Axiom of URI Opacity to think in terms of "pretending to live at a different location." A URL can't pretend anything. It simply points to a resource in a way that should be opaque to the user and the user agent. It's up to the script and the server to decide how this maps to a given resource which may or may not exist as a file. In other words, the URL is not "pretending" to point to a file on the file system, but is actually pointing to an object that is constructed in a black box either from a simple file lookup or a thousand rows from a hundred tables in six different databases.
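Here is a toy Python illustration of that black box. Everything in it is hypothetical (the routing table, the handler names, the content). It only shows that the server maps an opaque string to a resource however it likes, and that a path ending in .html tells you nothing about whether a file exists.

```python
from typing import Callable, Dict, Optional


def about_page() -> str:
    """Stand-in for content read straight from a static file."""
    return "<h1>About</h1>"


def products_page() -> str:
    """Stand-in for content assembled from database rows."""
    rows = ["Widgets", "Gadgets"]  # pretend these came from a query
    return "<h1>Products</h1>" + "".join(f"<li>{row}</li>" for row in rows)


# Hypothetical routing table: each key is an opaque string, never parsed.
ROUTES: Dict[str, Callable[[], str]] = {
    "/about.html": about_page,
    "/products.html": products_page,  # ends in .html, yet no such file exists
    "/products": products_page,       # same resource under a second URL
}


def dereference(url_path: str) -> Optional[str]:
    """Return the resource for a URL, or None if the server knows no such URL."""
    handler = ROUTES.get(url_path)
    return handler() if handler else None
```

From the outside, /about.html and /products.html look identical in kind. Only the server knows one is a file and the other is built on the fly, and it is free to change that at any time without the URL changing.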
Once you are in WordPress, the "real" underlying name of the page isn't html any longer though is it? It's .php.
I'm not sure what you mean by "name" here, since we have URLs, meta titles, H1s and other things, but no names. Assuming you mean URL, I must again emphatically say NO! The URL in the address bar is the "real" location of the page. The server can dereference that URL any way it wants. It can't be right to treat page.html as the "fake" one and index.php as the real one -- the PHP files in Wordpress have no content, so they are not pages in any sense at all. Again, to say otherwise violates the Axiom of Opacity. Neo, there is no real and fake. There is only the URL.