.htaccess file and Search Engine Problem

Issues with my .htaccess file

         

mkingsle

4:36 pm on Nov 12, 2007 (gmt 0)

10+ Year Member



O.K., I am brand new here so be gentle.

This post may get a little lengthy, but I want to be as descriptive as possible so maybe some kind soul can help a newbie out.

My issue, I think, is that my .htaccess file is screwed up and, in turn, is preventing the search bots, mainly Google and Yahoo, from crawling my site properly. Below is a copy of what I have in it right now. I put letters in front of each line so that you can better follow what I did (the letters are not in the actual .htaccess file):

A)AddHandler x-httpd-php5 .php
B)AddHandler x-httpd-php .php4
C)Options +FollowSymLinks
D)RewriteEngine on
E)RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.php?\ HTTP/ [NC]
F)RewriteRule ^(.*)index.php?$ http://www.example.com/$1 [R=301,L]
G)RewriteCond %{HTTP_HOST} ^example\.com
H)RewriteRule ^(.*)$ http://www.example.com/$1 [R=permanent,L]
I)RewriteRule ^(.+)\.html$ http://www.example.com/$1.php [R=301,L]
J)ErrorDocument 404 /404page.php
K)Redirect 301 /widget.php http://www.example.com/directory/widget.php

I did a lot of research before doing the above, but I have a feeling I did things really badly, which has created problems with both Yahoo and Google.

To give a brief background, I initially had my site in html and switched it all to php about four months ago.

What I was trying to do with the commands above was:

(A & B) - Add the Handler so that I could do stuff with php5
(E THROUGH H) This will redirect the different types of home pages to: http://www.example.com/ and also redirect http://www.example.com/index.php to http://www.example.com/
(I) - Rewrite all the old html files to php
(J) - If a page is not found, then go to my 404 page
(K) - This is just one entry of about 100 that redirect an old page to a new page. (I redid my structure to organize it better.)

So there it is. It took me a while to refresh my memory on what I did above; fortunately, I kept my notes. I should explain what I'd like to do so maybe I can get some direction.

Summary of what I'd like to do:

1.) Redirect all the different types of home pages to one...http://www.example.com/
2.) Since Google and Yahoo still have some of my html pages in their indexes, I need to properly redirect those to the new php pages.
3.) I need to redirect all the old pages to the new pages, which are located in a more logical document structure (new directories).
4.) My ultimate wish - to drop all the php extensions and just have http://www.example.com/directory name/widget

I'm trying to be logical with my post, but as I keep writing, my head is starting to buzz with confusion.

I guess my biggest hangup is this: when I switched from html to php, redirecting the pages to the new php pages was easy enough and straightforward, but then I redid the website structure, and now I am redirecting a second time to the new page paths. That is where the confusion comes in.

I should note that everything I did has tested to work, but I am concerned that Google and Yahoo are getting confused with all the different commands.

I hope someone can help me out with this. It's become really stressful for me and a cause for a lot of worry.

Thanks in advance,

Michael

jdMorgan

5:21 pm on Nov 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's the same code re-arranged to correspond more closely with how it is actually most likely to be processed. Your code will be scanned by each Apache module in turn, in the reverse order specified by the LoadModule list on Apache 1.x, or according to an internal priority scheme on Apache 2.x. The result of this is that directives handled by any given module will be processed in the order you specify, but you cannot control the module execution order in .htaccess.

A)AddHandler x-httpd-php5 .php
B)AddHandler x-httpd-php .php4
#
J)ErrorDocument 404 /404page.php
K)Redirect 301 /widget.php http://www.example.com/directory/widget.php
#
C)Options +FollowSymLinks
D)RewriteEngine on
#
E)RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.php?\ HTTP/ [NC]
F)RewriteRule ^(.*)index.php?$ http://www.example.com/$1 [R=301,L]
#
G)RewriteCond %{HTTP_HOST} ^example\.com
H)RewriteRule ^(.*)$ http://www.example.com/$1 [R=permanent,L]
#
I)RewriteRule ^(.+)\.html$ http://www.example.com/$1.php [R=301,L]

I'm pointing this out because understanding this point may clarify further points.
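
One way to take module order out of the picture, offered as a sketch rather than a requirement, is to express the one-off page redirects in mod_rewrite instead of mod_alias, so they sit in the same ordered rule list as everything else. The fragment below assumes the .htaccess file is in the document root and that the old page also existed as /widget.html; both are assumptions, not something stated above.

```apache
# mod_alias form, handled in mod_alias's own pass:
# Redirect 301 /widget.php http://www.example.com/directory/widget.php
#
# mod_rewrite sketch: place it after "RewriteEngine on" and before the
# general .html-to-.php rule, so the old .html and .php names both reach
# the new location in a single redirect.
RewriteRule ^widget\.(html|php)$ http://www.example.com/directory/widget.php [R=301,L]
```

With about 100 such entries, each one would become its own RewriteRule line in that same position.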

The first 'real' problem is that your redirects are not in the proper order. It's a bit difficult to explain, but you need to put your rules in order from most-specific to least-specific in order to avoid 'chained redirects' -- one requested URL resulting in multiple sequential redirects. So, for example, you'll want to apply the specific "index.php" redirect first and the more-general, catch-all domain canonicalization redirect last.

So your code should probably be re-arranged like this:


A)AddHandler x-httpd-php5 .php
B)AddHandler x-httpd-php .php4
#
J)ErrorDocument 404 /404page.php
#
K)Redirect 301 /widget.php http://www.example.com/directory/widget.php
#
C)Options +FollowSymLinks
D)RewriteEngine on
#
E)RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.php?\ HTTP/ [NC]
F)RewriteRule ^(.*)index.php?$ http://www.example.com/$1 [R=301,L]
#
I)RewriteRule ^(.+)\.html$ http://www.example.com/$1.php [R=301,L]
#
G)RewriteCond %{HTTP_HOST} ^example\.com
H)RewriteRule ^(.*)$ http://www.example.com/$1 [R=permanent,L]

If internal rewrites are added later, they must all be placed after the external redirects; otherwise, the external redirects may 'expose' the internally-rewritten URL, and this is not usually desirable.
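
As a rough illustration of that ordering, here is a minimal sketch. The short URL "widgets" and its target file are hypothetical names chosen only for the example.

```apache
RewriteEngine on
#
# 1) External redirects first (the client sees a 301 and a new URL)
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1.php [R=301,L]
#
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#
# 2) Internal rewrites last (no redirect; the client never sees the file path)
# "widgets" and "/directory/widget.php" are hypothetical.
RewriteRule ^widgets$ /directory/widget.php [L]
```

In .htaccess context the rules are run again against the rewritten path. If the internal rewrite came first, a request for example.com/widgets would be rewritten to /directory/widget.php and then picked up by the domain redirect on the next pass, sending the client to www.example.com/directory/widget.php and exposing the filename.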

Next, there may be a problem with the RewriteCond shown as "E". The "?" acts as a quantifier meaning "zero or one of the previous character or parenthesized group of characters." So in this case, the pattern matches the requested URL-paths "/<anything>index.php" or "/<anything>index.ph". It doesn't seem likely to me that this is what you wanted. Perhaps you wanted to accept either "php" or "php4" here, in which case, this part of the pattern should read "php4?", not "php?".
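
To make the difference concrete, here is a sketch of the "php4?" variant, which applies only if both extensions were really intended. It uses the tightened subpatterns from the recommended code.

```apache
# As posted: "?" makes the final "p" optional
#   index\.php?\ HTTP/     matches "index.php" and "index.ph"
#
# If both extensions were intended: "?" makes the "4" optional
#   index\.php4?\ HTTP/    matches "index.php" and "index.php4"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.php4?\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)index\.php4?$ http://www.example.com/$1 [R=301,L]
```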

Leaving that possible change out, but cleaning up some minor regex efficiency and consistency problems, I'd recommend:


AddHandler x-httpd-php5 .php
AddHandler x-httpd-php .php4
#
ErrorDocument 404 /404page.php
#
Redirect 301 /widget.php http://www.example.com/directory/widget.php
#
Options +FollowSymLinks
RewriteEngine on
#
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.php?\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)index\.php?$ http://www.example.com/$1 [R=301,L]
#
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1.php [R=301,L]
#
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

All in all, the only major problem was the RewriteRule order -- search engines seem to follow a single redirect fairly reliably, but things get "iffy" when you give them two or more in a row. BTW, an example URL that would cause your original code to issue two sequential redirects is example.com/foo.html -- this URL would first have been redirected to www.example.com/foo.html, and then redirected again to www.example.com/foo.php.

The index page canonicalization rule bug may also have caused you some trouble, but it's hard to say, since I don't know your original intent.

The regular-expression patterns shown should be much more efficient than the easy but greedy and promiscuous ".*" patterns previously used; they allow a straight left-to-right, single-pass evaluation of the input string against the pattern, and avoid the potentially many "back-off-and-retry" passes needed when ".*" patterns are used. Note the conspicuous absence of ".*" patterns in the Apache URL Rewriting Guide.
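
A side-by-side sketch of the two styles, using the index rule with a plain "php" extension for simplicity:

```apache
# Greedy form: ".*" first swallows the whole path, then backs off one
# character at a time until "index.php" fits at the end.
# RewriteRule ^(.*)index\.php$ http://www.example.com/$1 [R=301,L]
#
# Segment form: each pass of ([^/]+/) consumes exactly one "dir/" chunk,
# so the engine walks the path left to right with very little backing off.
RewriteRule ^(([^/]+/)*)index\.php$ http://www.example.com/$1 [R=301,L]
#
# For the path "a/b/index.php" both forms set $1 to "a/b/".
```

The doubled parentheses are deliberate: the outer pair captures the entire directory prefix as $1, while the inner pair only groups one segment for the "*" quantifier (its last match, "b/" here, lands in $2 and is not used).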

For more information, see the documents cited in our forum charter [webmasterworld.com] and the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com].

Jim

mkingsle

8:41 pm on Nov 12, 2007 (gmt 0)

10+ Year Member



Thank you so much Jim:

I appreciate your timely response, and I am actually surprised to receive such a thorough examination and response. It is greatly appreciated.

You made some very interesting points in your response, things I have not found anywhere else in my search for proper .htaccess setup.

Mainly the ordering of commands within the document. I had always wondered if order mattered and how it should be properly listed.

One question I have, since I am copying the rewritten code that you provided, is should I keep the #'s in the file?

Secondly, I just wanted to expand upon my intent with this file:

1.) Solve the Canonicalization issue so that Google doesn't treat each as a separate page, therefore penalizing me with duplicate content. (Which I find hard to believe that they would, but felt I should do it anyway.)

2.) Rewrite old html files to the new PHP files.

3.) Rewrite the new PHP files in #2 to their new home, located in new, better organized directories.

4.) Have a rewrite condition for bad urls to go to my customized 404 page.

***5.) Have the pages redirect and mask the .php extension on the end. I haven't figured this one out yet. I've seen some examples and tried them out, but for some reason they conflict with the canonicalization rewrite. (I'm not sure if this is necessary for better search engine ranking, so I have put it on the back burner.)

I guess my main concern is that, when I check Google and Yahoo with the "site:" command, both still have old pages in their indexes and neither is completely indexing my site's updated pages. What's most alarming is that Yahoo has not updated my site. I made all these changes months ago, so it should have dropped the old pages by now. I also took the step of submitting an XML sitemap to both, so that should have clearly laid out where to go, one would think.

The other interesting issue with all this is trying to access /stats through GoDaddy. I tried getting in but it doesn't work. So I called GoDaddy and they said that it was a problem with my .htaccess file and the index operation. I experimented by taking out the line:
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1.php [R=301,L]

and then it works. Should I leave this out and let Yahoo and Google hit my 404 page, where they will clearly see all the links to my new pages and therefore drop the old ones, or should I keep that line in because it gives the 301 code?

Any thoughts?

Thanks again,

Michael

jdMorgan

9:38 pm on Nov 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's still not clear what the intent of the subpattern "php?" was. Please expand on that.

Rule number one of programming successfully -- Do not add features until what you have now is thoroughly debugged and tested for several days. We will get to extensionless URLs in good time, but doing so now only complicates things unnecessarily.

To fix the /stats problem, you simply need to inhibit the rule if any URL in the /stats path is requested:


# If not a request for stats
RewriteCond %{REQUEST_URI} !^/stats/
# Then redirect html pages to php pages
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1.php [R=301,L]

I added the "#" characters --indicating comment lines-- to enhance readability. You can use either "#" or blank lines, but using blank lines on this forum messes up the formatting, so I used "#".

Jim

[edited by: jdMorgan at 9:43 pm (utc) on Nov. 12, 2007]

mkingsle

9:49 pm on Nov 12, 2007 (gmt 0)

10+ Year Member



" It's still not clear what the intent of the subpattern "php?" was. Please expand on that."

Answer:

I actually found that code in a tutorial (at bottom):
<snip>

quoted text:
If you need to redirect h**p://www.domain.com/index.html to h**p://www.domain.com/ place this code in your .htaccess file.

RewriteEngine on
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /.*index\.html?\ HTTP/ [NC]
RewriteRule ^(.*)index.html?$ h**p://www.domain.com/$1 [R=301,L]
RewriteCond %{HTTP_HOST} ^domain\.com
RewriteRule ^(.*)$ h**p://www.domain.com/$1 [R=permanent,L]

I just replaced the .html? with .php?

I guess I shouldn't have done that without really understanding it.

[edited by: jdMorgan at 9:53 pm (utc) on Nov. 12, 2007]
[edit reason] No URLs, please. See TOS. [/edit]

jdMorgan

9:54 pm on Nov 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The "?" has a function in that code -- it makes the pattern match either "html" or "htm", both common file extensions. So it appears you should remove it from "php".
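
With the "?" removed, the index rule from the recommended code would read roughly as follows. The dot in the RewriteRule pattern is also escaped here so that it matches only a literal period, which is a small consistency fix rather than a change in intent.

```apache
# Redirect direct client requests for index.php, in any directory, to the
# bare directory URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.php\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)index\.php$ http://www.example.com/$1 [R=301,L]
```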

Jim

mkingsle

10:23 pm on Nov 12, 2007 (gmt 0)

10+ Year Member



Thanks Jim for all your help. It is REALLY appreciated.

I'm trying so hard to learn how to do this stuff more efficiently, but it's hard when you are trying to learn how to build a site, learn new ways of doing graphic design (i.e. CSS), work with programming like php and .htaccess, and keep up on SEO. I feel like my head is going to explode now. "Breathe, Michael, breathe :)"

Anyway, thanks so much for the help on this.

One last question for you. Do you think my previous .htaccess file might be the cause of poor indexing by both Google and Yahoo? I know it might be too broad a question without a true analysis of the website, but just thought I'd throw it out there for your thoughts.

Anyways, once again, thank you so much.

Hopefully someday I can somehow return the favor.

Michael

jdMorgan

10:58 pm on Nov 12, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



From above:
All in all, the only major problem was the RewriteRule order -- search engines seem to follow a single redirect fairly reliably, but things get "iffy" when you give them two or more in a row.

This could be maybe 30% of it... I'd be more concerned about the timeframe over which you made changes -- That is, how long it was after redirecting html to php before adding the old.php to new.php URL redirects. If you change too much too fast, the search engines basically have to start over with your site, and it may take a year to recover. Take-home lesson: URLs are not file names -- Using mod_rewrite to internally (silently) rewrite old URLs to new filenames illustrates that clearly. Set up your file structure to be easy to maintain, expandable, easy to 'partition' among employees with different access privileges, etc. Then set up the URL structure to use short, memorable URLs. Then map the URLs to filenames as needed. And most important to preventing similar problems in the future, pick URLs that you will never, ever, have to change [w3.org] again. Ever. :)

This is one reason to go with extensionless URLs; there's no need to change your URLs if you change from .html to .asp to .jsp to .php. There's simply no reason to 'publicize' the technology of your site in its URLs -- especially if that technology may change over time.

Jim

mkingsle

10:43 pm on Nov 13, 2007 (gmt 0)

10+ Year Member



Thanks for the reply again:

I actually did the redirects from .html to .php about 6 months ago. The new redirects, from the old location to the new location of the .php I just did recently, maybe a month ago.

Your insight is dead on. I wish I had planned better from the start, but it was my first website. It's been a learning experience.

I would like to learn how to do extensionless URLs, so I can mask the fact that I am using .php.

Is that hard to do?

Thanks

jdMorgan

2:59 am on Nov 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not hard to do... Have you tried searching WebmasterWorld? :)
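
For a rough idea of the shape such a setup usually takes, here is a minimal sketch. It assumes the .htaccess file sits in the document root and that every page is a .php file, and it has not been tested against this site, so treat it as a starting point for your searching rather than a drop-in answer.

```apache
Options +FollowSymLinks
RewriteEngine on
#
# External: if the client itself asked for /something.php, redirect to the
# extensionless URL. Checking THE_REQUEST keeps this from firing on the
# internal rewrite below. Place this with the other external redirects,
# after the index.php rule.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /[^.\ ]+\.php(\?[^\ ]*)?\ HTTP/
RewriteRule ^([^.]+)\.php$ http://www.example.com/$1 [R=301,L]
#
# Internal: if the extensionless URL is not a directory and a matching
# .php file exists, serve that file silently. Place this after all
# external redirects.
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule ^([^.]+)$ /$1.php [L]
```

If you go this way, the old .html redirect should then point straight at the extensionless URL rather than at the .php URL, or you are back to two redirects in a row.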

Jim

mkingsle

2:36 am on Nov 15, 2007 (gmt 0)

10+ Year Member



Going at it right now. There is so much information on this site that I think my head's going to explode :)

Thanks again

Michael

phranque

4:07 am on Nov 15, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



welcome to WebmasterWorld [webmasterworld.com], mkingsle!

each forum page has a link to a charter page and a library page which will typically lead to excellent resources...

Patrick Taylor

6:36 pm on Nov 15, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Picking up on Jim's point about preventing sequential redirects, I've altered the rule order in an .htaccess file to:

# REDIRECT to non trailing slash if not real directory
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)/$ /$1 [R=301,L]
#
# REDIRECT to canonical url
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

Previously those two RewriteRules were the other way around. Now, when I browse http://example.com/page/ Firefox live headers does indeed show only one redirect is being done by the server, to http://www.example.com/page all in one go.

Great, but I must say I don't understand why, if there are still two independent redirection rules.

jdMorgan

2:03 am on Nov 16, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The reason is that the first rule is also a redirect, and luckily, the canonical name of your server appears to be set correctly and/or UseCanonicalName is set to 'off'. The proper form for that rule to avoid possible future problems when there is a canonical name mismatch (say after moving to a new host) is:

# REDIRECT to non trailing slash if not real directory
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)/$ http://www.example.com/$1 [R=301,L]

So now it should be clear why you only get one redirect, because if the first rule redirects to remove a trailing slash, it also redirects to the canonical domain, so the second rule won't be invoked.
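
Putting the two rules together in that order, with a few traced requests (this assumes "page" is not a real directory):

```apache
# Rule 1: strip the trailing slash and canonicalize the host in one hop
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)/$ http://www.example.com/$1 [R=301,L]
#
# Rule 2: catch any remaining non-www requests
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
#
# http://example.com/page/     -> rule 1 -> http://www.example.com/page
# http://example.com/page      -> rule 2 -> http://www.example.com/page
# http://www.example.com/page  -> no rule fires
```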

Jim

Patrick Taylor

8:56 am on Nov 16, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Jim, that's clear. Many thanks.

Patrick