Permanent redirect to remove /index.php/ sitewide.

         

Sgt_Kickaxe

6:32 pm on Jun 3, 2010 (gmt 0)



I have a WordPress blog that has index.php added to every single URL (as a fix to make permalinks work on that host).
The site looks like this:
www.example.com/
www.example.com/index.php
www.example.com/index.php/somearticle.php
www.example.com/index.php/category/someotherarticle.php

I want the site to look like this:
www.example.com/
www.example.com/somearticle.php
www.example.com/category/someotherarticle.php

I know how to make the blog itself drop index.php from every URL, but I'm concerned with setting up a permanent redirect to maintain search rankings, and I'd like to get it right the first time.

When I fix the site to remove all traces of index.php, what should I add to my .htaccess file to make sure search engines also know the article URLs have permanently changed?

I've seen examples dealing with removing index.php, but none that remove it when there is more to the URI after it.

Sgt_Kickaxe

6:47 pm on Jun 3, 2010 (gmt 0)



Perhaps I got the title of the post wrong. I know how to remove /index.php/ sitewide; I only need to make sure that every attempt to reach a URI containing /index.php/ gets redirected to the version without it, and that a 301 code is returned.

Hope that's clearer.

Sgt_Kickaxe

7:52 pm on Jun 3, 2010 (gmt 0)



Forgot the "add your best effort" rule in this forum, just looking for confirmation to get it right the first time.

best effort:
RewriteRule ^index.php/(.+)$ http://www.example.com/$1 [R=301,L]
RewriteRule ^index.php$ http://www.example.com/ [R=301,L]

Do I need a rewrite condition? Does this go before all other rules in .htaccess or after? Any pointers welcome.

jdMorgan

2:03 pm on Jun 4, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You do need a RewriteCond to prevent an infinite loop if DirectoryIndex is set (as is customary) to include index.php. Otherwise, your rule redirects "/index.php" to "/", then DirectoryIndex rewrites "/" back to "/index.php", then the rule matches again and redirects back to "/". Lather, rinse, repeat...
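
The loop described above can be modeled roughly in Python. This is only a toy model of the behavior, not how Apache is implemented; the function names handle and follow, the use_condition flag, and the hop limit are all invented for illustration.

```python
import re

# The redirect rule's pattern, as tested against the path Apache is about to serve.
RULE = re.compile(r"^index\.php$")


def handle(request_path, use_condition):
    """Model one request: return "301" for a redirect to "/" or "200" for content."""
    # DirectoryIndex: a request for "/" is mapped internally to index.php.
    internal_path = "index.php" if request_path == "/" else request_path.lstrip("/")
    # Stand-in for a RewriteCond on THE_REQUEST: did the client itself ask for /index.php?
    client_asked_for_index = request_path == "/index.php"
    if RULE.search(internal_path) and (client_asked_for_index or not use_condition):
        return "301"
    return "200"


def follow(start_path, use_condition, max_hops=5):
    """Follow redirects like a browser; return the hop count, or None on a loop."""
    path = start_path
    for hops in range(max_hops):
        if handle(path, use_condition) == "200":
            return hops
        path = "/"  # every redirect in this model points at "/"
    return None


print("without condition:", follow("/index.php", use_condition=False))
print("with condition:", follow("/index.php", use_condition=True))
```

Without the condition the model never reaches a "200" and gives up at the hop limit; with it, a single redirect is followed by a normal response.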

So this application calls for the bog-standard "redirect /index.xyz to / without looping" rule with only a small tweak to accommodate the additional path-info optionally appended to "/index.php". This isn't so much a "rule" question, as it is a "regular-expressions" question...

RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.php(/[^\ ]*)?\ HTTP/
RewriteRule ^index\.php(/(.*))?$ http://www.example.com/$2 [R=301,L]

Note that since the slash following "index.php" is part of an optional subpattern, this rule works for both cases, so two rules are not required.

Also, because of the nested subpattern construct used in the rule, the top-level slash is always 'forced' in the target URL even if no additional path-info is appended, e.g. example.com/index.php --301--> example.com/
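
As a rough check of the two patterns above, here is a small Python sketch that applies them to a few sample request lines. It only mimics the regular-expression matching, not Apache itself; the function name redirect_target and the sample request lines are assumptions made for illustration.

```python
import re

# Same patterns as the RewriteCond/RewriteRule above, written for Python's re.
# THE_REQUEST is the raw request line sent by the client, e.g. "GET /index.php HTTP/1.1".
THE_REQUEST = re.compile(r"^[A-Z]+ /index\.php(/[^ ]*)? HTTP/")
# In .htaccess the leading slash is stripped before the rule pattern is tested.
RULE = re.compile(r"^index\.php(/(.*))?$")


def redirect_target(request_line):
    """Return the 301 target for a client request line, or None if no redirect."""
    if not THE_REQUEST.search(request_line):
        return None  # condition fails: the client did not ask for /index.php itself
    path = request_line.split(" ")[1].lstrip("/")
    match = RULE.search(path)
    if not match:
        return None
    # $2 is empty when nothing follows index.php, so the target is just "/".
    return "http://www.example.com/" + (match.group(2) or "")


for line in (
    "GET /index.php HTTP/1.1",
    "GET /index.php/category/someotherarticle.php HTTP/1.1",
    "GET / HTTP/1.1",
):
    print(line, "->", redirect_target(line))
```

The last sample is the case that matters for looping: a client request for "/" never satisfies the condition, even though the server may serve it from index.php internally.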

Jim

Sgt_Kickaxe

6:45 pm on Jun 4, 2010 (gmt 0)



I knew I was missing something, glad I asked, thanks jd.

Curious: I'm not an .htaccess expert, and in your example above you have a question mark in both lines. Do I need that even though none of the site's pages have one next to the index.php? It's all /index.php/yadayada without any variable. It all makes sense, but I don't know what the ? is for, pardon my ignorance.

g1smd

7:10 pm on Jun 4, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This explains the question mark:

> Note that since the slash following "index.php" is part of an optional subpattern, this rule works for both cases, so two rules are not required.

> Also, because of the nested subpattern construct used in the rule, the top-level slash is always 'forced' in the target URL even if no additional path-info is appended, e.g. example.com/index.php --301--> example.com/

jdMorgan

7:48 pm on Jun 4, 2010 (gmt 0)




Note that all of ? * + are regular-expressions quantifiers, and if one of those characters is meant to be matched as a literal character, it must be escaped by preceding it with a "\".

Similarly, many other characters such as $ % ^ ( ) {} [ ] | . and "\" itself must normally be escaped. Otherwise, they have meaning as 'tokens' or operators for regular expressions or for mod_rewrite itself.

For use specifically with mod_rewrite, literal ^ ! = < and > characters must be escaped if they are the first literal character to be matched in the pattern.

Note that required-escaping rules also change depending on whether the characters are part of an alternate-character group (like [a-z0-9] for example) or not.
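
A short Python sketch of the difference escaping makes, using Python's re module as a stand-in for the regular-expression engine behind mod_rewrite (the two agree on these basic cases). The variable names and sample strings are made up for illustration.

```python
import re

unescaped = re.compile(r"^index.php$")   # "." matches any single character
escaped = re.compile(r"^index\.php$")    # "\." matches only a literal dot

for candidate in ("index.php", "indexXphp"):
    print(candidate, bool(unescaped.search(candidate)), bool(escaped.search(candidate)))

quantifier = re.compile(r"^index\.php(/.*)?$")  # "?" makes the preceding group optional
literal = re.compile(r"^index\.php\?$")         # "\?" matches an actual question mark

for candidate in ("index.php", "index.php/somearticle.php", "index.php?"):
    print(candidate, bool(quantifier.search(candidate)), bool(literal.search(candidate)))
```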

There is a concise regular-expressions tutorial cited in our Forum Charter. As regular expressions are used not only in mod_rewrite, but in Perl, PHP, C, Python, and almost all other modern high-level programming and scripting languages, gaining a passing familiarity with them is well worth the investment in time.

Jim

Sgt_Kickaxe

3:14 am on Jun 5, 2010 (gmt 0)



I move sites so rarely that I'll probably get rusty before I need this again but I've bookmarked this page, and several others, for future reference. Thanks for your help.

I've moved the site from a host that required "index.php" to be added to EVERY link to make WordPress pretty permalinks work (because that host did not allow .htaccess access) to a host that DOES NOT allow index.php in ANY link (without a trailing parameter anyway, and I had none), so the site crashed and spat out "no such..." errors. The code above worked flawlessly in restoring it and redirecting all /index.php/ to /.

I'm not sure what the damage will be from not having been able to let the redirect sit before changing all links sitewide, but since the old links didn't work I had no choice: all at once or nothing.

Will Google see them all as new links and lower the rankings until it re-crawls older pages and gets the redirect signal? Is there anything else I can do to change that? Should I change the sitemap to remove all traces of index.php too, or let that sit until the index shows that all of them have been updated to reflect the new URIs?

jdMorgan

5:25 am on Jun 5, 2010 (gmt 0)




Yes, the URLs are different, so they'll be seen as 'new pages' -- any difference in a URL, whether major or minor, makes it 'unique' and therefore a 'different thing' as far as search engines go. URLs are the basis of how search engines index the Web: They don't care about "sites" or "pages" or "files" or "scripts" -- only URLs.

The impact depends on so many things that it's almost completely unpredictable. The three major factors in this situation, where it is only the index URLs that have changed, are likely to be how often your URLs are spidered, how many 'deep-links' you have (links to non-index pages), and how well the search engines 'trust' your site. If you get spidered fast, your deep-linked pages 'support' your index pages well, and your TrustRank is high, then this could be a very minor bump in the road. If none of those factors are positive, then the worst I've heard of is depressed rankings for about nine months... :o

However, it would not surprise me if the major search engines have 'hooks' in their algorithms to recognize that index-page URLs --like www and non-www hostnames-- are often screwed-up, and therefore incorporate some kind of 'forgiveness factor' for this issue.

Re the sitemap: I would put both the old and new index-page URLs in there. Leave the old ones in there until the search indexes get updated, and then wait a while longer until you've seen the search engines spider the old URLs and get redirected a few more times. Then you can get rid of the old index-page URL entries.

> .htaccess access not being allowed with that host

Unbelievable, in this day and age! Glad to hear you got out of that mess.

Please do let us know how it goes...

Jim

Sgt_Kickaxe

8:02 am on Jun 5, 2010 (gmt 0)



I can see it now...

hmmm, that's duplicate, we're not indexing that.
ahhh, a new page, let's take a look.
hmmm, 301 redirect to that duplicate page...

I have faith that some engines will restore at least partial trust to the new versions and faith that others will botch the transfer or lower site trust. I suppose any site that changes URIs deserves it really.

It's done, I'll check back with what search engines did in a couple of weeks. The sitemap has both versions of the pages, with the old listed first. The site itself, however, uses only the new versions of the links. Worst case scenario is a reconsideration request :-)

Sgt_Kickaxe

10:07 am on Jun 5, 2010 (gmt 0)



In case anyone else needs to remove all instances of /index.php/ from a WordPress site, this handy SQL command can be a lifesaver (back up your database first, and replace "index.php/" with "")...

UPDATE wp_posts SET post_content = REPLACE(
post_content,
'Item to replace here',
'Replacement text here');
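
To try the replacement safely before touching a live database, here is a minimal Python sketch that runs the same kind of UPDATE against a throwaway in-memory SQLite table. It assumes SQLite's REPLACE() behaves like MySQL's for this simple case; the two-column table and the sample row are invented for illustration and are not the real WordPress schema.

```python
import sqlite3

# Throwaway in-memory database standing in for the WordPress one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY, post_content TEXT)")
conn.execute(
    "INSERT INTO wp_posts (post_content) VALUES (?)",
    ('<a href="http://www.example.com/index.php/somearticle.php">link</a>',),
)

# Same statement as above, with "index.php/" replaced by an empty string.
conn.execute(
    "UPDATE wp_posts SET post_content = REPLACE(post_content, ?, ?)",
    ("index.php/", ""),
)

content = conn.execute("SELECT post_content FROM wp_posts").fetchone()[0]
print(content)
```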




8 hours after the change, Googlebot has already crawled more pages than its daily average; SERPs remain unchanged. I'll try to write an update at the 1 week and 2 week marks, fingers crossed.

jdMorgan

6:23 pm on Jun 5, 2010 (gmt 0)




> I have faith that some engines will restore at least partial trust to the new versions and faith that others will botch the transfer or lower site trust. I suppose any site that changes uri's deserves it really.

I don't think it's likely you'll have major problems. As I implied above, search engines actually sort of "prefer" the "/" on properly-configured sites -- for example, www.google.com/ as opposed to google.com/index.cgi or google.com/search.py

BTW, the order you did this in --links updated first, then 301 redirects-- is actually the preferred order, since the 301 is 'seen' as a 'repair step' to salvage links that *did not* get updated (e.g. inbounds from other sites).

Even if it does go temporarily bad, at least you've got this out of the way, and can move forward from a much stronger technical base -- one less thing to worry about in the future. Take the hit now, and then grow, grow, grow.

Jim

g1smd

6:33 pm on Jun 5, 2010 (gmt 0)




I can confirm that having the redirect applied the very same day as the internal links were altered can go wrong.

I am looking at just such a mess at the moment; a mess that might have been averted if the redirect implementation had been deferred for a week or two.

Sgt_Kickaxe

12:46 am on Jun 6, 2010 (gmt 0)



1 day in, some interesting changes...

site:example.com returns all pages, some with index.php removed, most of them being high traffic/rank pages.
Searching for the exact-match title of those updated pages returns the same ranking but displays the index.php version.

I thought that was odd at first, but it makes sense: Google is indexing both versions of every page but can only return one version in any keyword search. The index.php versions are still given the most internal link boost, so they are the ones ranking for individual keywords.

No loss in traffic, yet, and no errors/missing pages/broken links in webmaster tools.