This post may get a little lengthy, but I want to be as descriptive as possible so maybe some kind soul can help a newbie out.
My issue, I think, is that my .htaccess file is screwed up and, in turn, is preventing the search bots, mainly Google and Yahoo, from crawling my site properly. Below is a copy of what I have in it right now. I put letters in front of each line so that you can better follow what I did (the letters are not in the actual .htaccess file):
A)AddHandler x-httpd-php5 .php
B)AddHandler x-httpd-php .php4
C)Options +FollowSymLinks
D)RewriteEngine on
E)RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.php?\ HTTP/ [NC]
F)RewriteRule ^(.*)index.php?$ http://www.example.com/$1 [R=301,L]
G)RewriteCond %{HTTP_HOST} ^example\.com
H)RewriteRule ^(.*)$ http://www.example.com/$1 [R=permanent,L]
I)RewriteRule ^(.+)\.html$ http://www.example.com/$1.php [R=301,L]
J)ErrorDocument 404 /404page.php
K)Redirect 301 /widget.php http://www.example.com/directory/widget.php
I did a lot of research before doing the above, but I have a feeling I did things really badly, which has created problems with both Yahoo and Google.
To give a brief background, I initially had my site in html and switched it all to php about four months ago.
What I was trying to do with the commands above was:
(A & B) - Add the Handler so that I could do stuff with php5
(E THROUGH H) This will redirect the different types of home pages to: http://www.example.com/ and also redirect http://www.example.com/index.php to http://www.example.com/
(I) - Rewrite all the old html files to php
(J) - If a page is not found, then go to my 404 page
(K) - This is just one entry of about 100 that I have that redirect an old page to a new page. (I redid my structure to better organize it.)
So there it is. It took me a while to refresh my memory on what I did above; fortunately, I kept my notes. I should explain what I'd like to do so maybe I can get some direction.
Summary of what I'd like to do:
1.) Redirect all the different types of home pages to one...http://www.example.com/
2.) Since Google and Yahoo still have some of my html pages in their directories, I need to properly redirect to the new php pages.
3.) I need to redirect all the old pages to the new page which is located in a more logical document structure (new directory)
4.) My ultimate wish - to drop the .php extension entirely and just have http://www.example.com/directory name/widget
I'm trying to be logical with my post, but as I keep writing, my head is starting to buzz with confusion.
I guess my biggest hangup is this: I switched from html to php, and redirecting the pages to the new php pages was easy enough and straightforward, but the confusion started when I redid the website structure, and now I am redirecting again to the new page paths.
I should note that everything I did has tested to work, but I am concerned that Google and Yahoo are getting confused with all the different commands.
I hope someone can help me out with this. It's become really stressful for me and a cause for a lot of worry.
Thanks in advance,
Michael
A)AddHandler x-httpd-php5 .php
B)AddHandler x-httpd-php .php4
#
J)ErrorDocument 404 /404page.php
K)Redirect 301 /widget.php http://www.example.com/directory/widget.php
#
C)Options +FollowSymLinks
D)RewriteEngine on
#
E)RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.php?\ HTTP/ [NC]
F)RewriteRule ^(.*)index.php?$ http://www.example.com/$1 [R=301,L]
#
G)RewriteCond %{HTTP_HOST} ^example\.com
H)RewriteRule ^(.*)$ http://www.example.com/$1 [R=permanent,L]
#
I)RewriteRule ^(.+)\.html$ http://www.example.com/$1.php [R=301,L]
The first 'real' problem is that your redirects are not in the proper order. It's a bit difficult to explain, but you need to put your rules in order from most-specific to least-specific in order to avoid 'chained redirects' -- one requested URL resulting in multiple sequential redirects. So, for example, you'll want to redirect the specific "index.php" URLs before applying the more-general catch-all, domain canonicalization redirect last.
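To make the 'chained redirect' concrete, here is a rough trace of one request under the original ordering and under the re-ordered rules. The URL "page.html" is a made-up example, not one of your actual pages, and the trace assumes no other rule in the file matches it:

```apache
# Original order: host canonicalization (G/H) runs before the html rule (I).
# Hypothetical request: http://example.com/page.html
#   1st response: 301 -> http://www.example.com/page.html   (rule H)
#   2nd response: 301 -> http://www.example.com/page.php    (rule I)
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule ^(.*)$ http://www.example.com/$1 [R=permanent,L]
RewriteRule ^(.+)\.html$ http://www.example.com/$1.php [R=301,L]
#
# Re-ordered: the more specific html rule (I) runs first and already
# names the canonical host, so the same request needs only one hop.
#   1st response: 301 -> http://www.example.com/page.php    (rule I)
#RewriteRule ^(.+)\.html$ http://www.example.com/$1.php [R=301,L]
#RewriteCond %{HTTP_HOST} ^example\.com
#RewriteRule ^(.*)$ http://www.example.com/$1 [R=permanent,L]
```

The second group is commented out only so that the two orderings are not both active if this is pasted into a file as-is.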
So your code should probably be re-arranged like this:
A)AddHandler x-httpd-php5 .php
B)AddHandler x-httpd-php .php4
#
J)ErrorDocument 404 /404page.php
#
K)Redirect 301 /widget.php http://www.example.com/directory/widget.php
#
C)Options +FollowSymLinks
D)RewriteEngine on
#
E)RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.php?\ HTTP/ [NC]
F)RewriteRule ^(.*)index.php?$ http://www.example.com/$1 [R=301,L]
#
I)RewriteRule ^(.+)\.html$ http://www.example.com/$1.php [R=301,L]
#
G)RewriteCond %{HTTP_HOST} ^example\.com
H)RewriteRule ^(.*)$ http://www.example.com/$1 [R=permanent,L]
Next, there may be a problem with the RewriteCond shown as "E". The "?" acts as a quantifier meaning "zero or one of the previous character or parenthesized group of characters." So in this case, the pattern matches the requested URL-paths "/<anything>index.php" or "/<anything>index.ph". It doesn't seem likely to me that this is what you wanted. Perhaps you wanted to accept either "php" or "php4" here, in which case, this part of the pattern should read "php4?", not "php?".
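As a sketch of what that change might look like, here are the two readings I can think of. Which one is right depends on your intent, so treat both as assumptions rather than a fix:

```apache
# Reading 1 (assumed intent): match "index.php" only.
# Without the "?", the final "p" is no longer optional,
# so a request for "/index.ph" is not redirected.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.php\ HTTP/ [NC]
RewriteRule ^(.*)index\.php$ http://www.example.com/$1 [R=301,L]
#
# Reading 2 (assumed intent): match "index.php" or "index.php4".
# Here "4?" makes only the trailing "4" optional.
#RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.php4?\ HTTP/ [NC]
#RewriteRule ^(.*)index\.php4?$ http://www.example.com/$1 [R=301,L]
```

Use one reading or the other, not both. The literal dot is escaped in the RewriteRule pattern as well, so that "." matches only a period.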
Leaving that possible change out, but cleaning up some minor regex efficiency and consistency problems, I'd recommend:
AddHandler x-httpd-php5 .php
AddHandler x-httpd-php .php4
#
ErrorDocument 404 /404page.php
#
Redirect 301 /widget.php http://www.example.com/directory/widget.php
#
Options +FollowSymLinks
RewriteEngine on
#
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.php?\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)index\.php?$ http://www.example.com/$1 [R=301,L]
#
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1.php [R=301,L]
#
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
The index page canonicalization rule bug may also have caused you some trouble, but it's hard to say, since I don't know your original intent.
The regular-expression patterns shown should be much more efficient than the easy but greedy and promiscuous ".*" patterns previously used; they allow a straight left-to-right, single-pass evaluation of the input string against the pattern, and avoid the potentially many "back-off-and-retry" passes needed when ".*" patterns are used. Note the conspicuous absence of ".*" patterns in the Apache URL Rewriting Guide.
For more information, see the documents cited in our forum charter [webmasterworld.com] and the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com].
Jim
I appreciate your timely response, and I am actually surprised to receive such a thorough examination and response. It is greatly appreciated.
You made some very interesting points in your response, things I have not found anywhere else in my search for proper .htaccess setup.
Mainly the ordering of commands within the document. I had always wondered if order mattered and how it should be properly listed.
One question I have, since I am copying the rewritten code that you provided, is should I keep the #'s in the file?
Secondly, I just wanted to expand upon my intent with this file:
1.) Solve the Canonicalization issue so that Google doesn't treat each as a separate page, therefore penalizing me with duplicate content. (Which I find hard to believe that they would, but felt I should do it anyway.)
2.) Rewrite old html files to the new PHP files.
3.) Rewrite the new PHP files in #2 to their new home, located in new, better organized directories.
4.) Have a rewrite condition for bad urls to go to my customized 404 page.
***5.) Have the pages redirect and mask the .php extension on the end. This one I haven't figured out yet. I've seen some examples and tried them out, but for some reason it conflicts with the canonicalization rewrite. (I'm not sure if this is necessary for better search engine ranking, so I have put it on the back burner.)
I guess my main concern is that, in my analysis of Google and Yahoo using the "site:" command, I have noticed that both still have old pages in their index and are not completely indexing my site's updated pages. What's most alarming is that Yahoo has not updated my site. I made all these changes months ago, so it should have dropped the old pages by now. I also took the right step of submitting an XML sitemap to both, so that should have clearly laid out where to go, one would think.
The other interesting issue with all this is trying to access /stats through GoDaddy. I tried getting in but it doesn't work. So I called GoDaddy and they said that it was a problem with my .htaccess file and the index operation. I experimented by taking out the line:
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1.php [R=301,L]
and then it works. Should I leave this out and let Yahoo and Google hit my 404 page, where they will clearly see all the links to my new pages and therefore drop the old ones, or should I keep that line in because it gives the 301 code?
Any thoughts?
Thanks again,
Michael
Rule number one of programming successfully -- Do not add features until what you have now is thoroughly debugged and tested for several days. We will get to extensionless URLs in good time, but doing so now only complicates things unnecessarily.
To fix the /stats problem, you simply need to inhibit the rule if any URL in the /stats path is requested:
# If not a request for stats
RewriteCond %{REQUEST_URI} !^/stats/
# Then redirect html pages to php pages
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1.php [R=301,L]
Jim
[edited by: jdMorgan at 9:43 pm (utc) on Nov. 12, 2007]
Answer:
I actually found that code in a tutorial (at bottom):
<snip>
quoted text:
If you need to redirect h**p://www.domain.com/index.html to h**p://www.domain.com/ place this code in your .htaccess file.
RewriteEngine on
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.html?\ HTTP/ [NC]
RewriteRule ^(.*)index.html?$ h**p://www.domain.com/$1 [R=301,L]
RewriteCond %{HTTP_HOST} ^domain\.com
RewriteRule ^(.*)$ h**p://www.domain.com/$1 [R=permanent,L]
I just replaced the .html? with .php?
I guess I shouldn't have done that without really understanding it.
[edited by: jdMorgan at 9:53 pm (utc) on Nov. 12, 2007]
[edit reason] No URLs, please. See TOS. [/edit]
I'm trying so hard to learn how to do this stuff more efficiently, but it's hard when you are trying to learn how to build a site, learn new ways of doing graphic design (i.e. CSS), work with programming like PHP and .htaccess, and keep up on SEO. I feel like my head is going to explode now. "Breathe, Michael, breathe :)"
Anyway, thanks so much for the help on this.
One last question for you. Do you think my previous .htaccess file might be the cause of poor indexing by both Google and Yahoo? I know it might be too broad a question without a true analysis of the website, but I just thought I'd throw it out there for your thoughts.
Anyways, once again, thank you so much.
Hopefully someday I can somehow return the favor.
Michael
All in all, the only major problem was the RewriteRule order -- search engines seem to follow a single redirect fairly reliably, but things get "iffy" when you give them two or more in a row.
This could be maybe 30% of it... I'd be more concerned about the timeframe over which you made changes -- That is, how long it was after redirecting html to php before adding the old.php to new.php URL redirects. If you change too much too fast, the search engines basically have to start over with your site, and it may take a year to recover. Take-home lesson: URLs are not file names -- Using mod_rewrite to internally (silently) rewrite old URLs to new filenames illustrates that clearly. Set up your file structure to be easy to maintain, expandable, easy to 'partition' among employees with different access privileges, etc. Then set up the URL structure to use short, memorable URLs. Then map the URLs to filenames as needed. And most important to preventing similar problems in the future, pick URLs that you will never, ever, have to change [w3.org] again. Ever. :)
This is one reason to go with extensionless URLs; There's no need to change your URLs if you change from .html to .asp to .jsp to .php -- There's simply no reason to 'publicize' the technology of your site in its URLs -- especially if that technology may change over time.
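A minimal sketch of how extensionless URLs are commonly set up with mod_rewrite follows. It is not tested against your site. It assumes every public page is backed by a .php file of the same name, that the rules go after the index.php redirect, and that nothing else on the server (content negotiation, for instance) is already mapping these URLs:

```apache
# 1) Externally redirect direct client requests for /something.php
#    to the extensionless URL. THE_REQUEST holds the original request
#    line, so the internal rewrite in step 2 cannot re-trigger this.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^.]+\.php\ HTTP/
RewriteRule ^([^.]+)\.php$ http://www.example.com/$1 [R=301,L]
#
# 2) Internally (silently) rewrite the extensionless URL to the
#    .php file, but only if that file actually exists.
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule ^([^.]+)$ /$1.php [L]
```

Step 1 is an external redirect that changes the URL the visitor and the search engines see. Step 2 is an internal rewrite that only changes which file serves the request. Links on your own pages would also need to point at the extensionless URLs, otherwise every click goes through the redirect.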
Jim
I actually did the redirects from .html to .php about 6 months ago. The new redirects, from the old location to the new location of the .php I just did recently, maybe a month ago.
Your insight is dead on. I wish I had planned better from the start, but it was my first website. It's been a learning experience.
I would like to learn how to do extensionless URLs, so I can mask the fact that I am using .php.
Is that hard to do?
Thanks
Each forum page has a link to a charter page and a library page, which will typically lead to excellent resources...
# REDIRECT to non trailing slash if not real directory
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)/$ /$1 [R=301,L]
#
# REDIRECT to canonical url
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
Previously those two RewriteRules were the other way around. Now, when I browse http://example.com/page/ Firefox live headers does indeed show only one redirect is being done by the server, to http://www.example.com/page all in one go.
Great, but I must say I don't understand why, if there are still two independent redirection rules.
# REDIRECT to non trailing slash if not real directory
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)/$ http://www.example.com/$1 [R=301,L]
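One likely explanation for the single redirect: once the trailing-slash rule issues a 301, processing stops for that request, and if the target already names the canonical host, the follow-up request no longer matches the host condition. With a substitution of just "/$1", which hostname Apache puts into the redirect may depend on the server's ServerName and UseCanonicalName settings, so spelling the host out removes that dependency. As a sketch, the two rules together in most-specific-first order:

```apache
# REDIRECT to non trailing slash if not real directory,
# going straight to the canonical host in the same hop
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)/$ http://www.example.com/$1 [R=301,L]
#
# REDIRECT to canonical url (catches everything the rule above did not)
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```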
Jim