
Apache Web Server Forum

    
Are 301 redirects to 404 error pages detrimental to rankings?
Sgt_Kickaxe




msg:4527238
 6:00 am on Dec 12, 2012 (gmt 0)

My htaccess file removes index.php/ from URLs.

example.com/index.php/some-random-junk

That request results in a 301 to example.com/some-random-junk, which doesn't exist and returns a 404. Is there a way to avoid this?

 

lucy24




msg:4527241
 6:49 am on Dec 12, 2012 (gmt 0)

Yes, lots of ways, but you haven't given enough information. What is supposed to happen? Where are those URLs with "index.php" in the middle coming from? What is the correct URL for pages with names in this form? Are you using a CMS that does any kind of behind-the-scenes rewriting? Or do you really have ::shudder:: pages named "stuff/index.php/morestuff"?

And didn't we only just get through the same question [webmasterworld.com]?

:: detour to check ::

Nope, we're OK, that was a different issue. At least I think it was.

In normal situations, "/index.xtn" only gets redirected to plain / when it comes at the end of the request. So your problem would never arise, because there would be an ending anchor in the Rule, and further restrictions in the Condition.

Sgt_Kickaxe




msg:4527254
 7:52 am on Dec 12, 2012 (gmt 0)

Hi again Lucy :) Let me elaborate a little more.

What is supposed to happen?
URLs that don't exist should always end up as a 404 on this site, not redirect to remove the index.php/ before doing so.

Where are those URLs with "index.php" in the middle coming from?
Incoming links from various spam sites contain .com/index.php/yadda-yadda in them and, unfortunately, Googlebot sees fit to keep checking them. I don't want to tell Googlebot that they exist by redirecting them before giving the 404.

What is the correct URL for pages with names in this form?
www.example.com/index.php/somepage
should be
www.example.com/somepage
and it is performing that redirect correctly, BUT in this case it doesn't matter: it should be a 404 either way, since the pages in question don't exist. The 404 needs to come first, but right now it doesn't.

Are you using a CMS that does any kind of behind-the-scenes rewriting?
In this case let's assume no; I'm fixing the htaccess file to replace any changes made by any CMS.

Or do you really have ::shudder:: pages named "stuff/index.php/morestuff"?
No, I don't. My index.php nightmare began several years ago, and links containing it have continued to spread with no help from me. The site DID have index.php/ in its URLs for the first couple of months of its life in 2005, because the host didn't allow htaccess use; I found a new host right away. That was, unfortunately, long enough. I do not have even a single link on any page that contains index.php/

And didn't we only just get through the same question
No, this is a continuation of the fixes we covered. Now that redirects are all working as intended, this last pesky little problem needs to be stomped. I'm not sure it's possible to check whether a URL exists before removing the index.php/, though. Here is the current index.php removal htaccess code.

P.S. An example of a URL that needs to return a 404, and not 301 to a 404 page, is:

www.example.com/index.php/this-page-does-not-exist

and it needs to die miserably with a 404 before the following htaccess rules can take effect, but I'm not sure that's even possible, let alone how. While I know I could 404 that single URL easily, there is an infinite number of variations on non-existent URLs, so covering all possibilities of 404 ... yeah.

# REDIRECT REQUESTS CONTAINING INDEX.PHP TO THEIR NON INDEX.PHP VERSION
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.php(/[^\ ]*)?\ HTTP/
RewriteRule ^index\.php(/(.*))?$ http://www.example.com$1 [R=301,L]

# REDIRECT TO WWW IF IT IS MISSING
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

Right now I'm envisioning that a whitelist of URLs that DO exist needs to be created to deal with the problem, but that's going to be some 900 RewriteRules long, one for each real page... unless there is a way to avoid the 301 to 404 in this situation.

g1smd




msg:4527271
 8:44 am on Dec 12, 2012 (gmt 0)

htaccess rules execute way before anything happens with your PHP scripting.

When a site consists of individual .html files for each page, htaccess can check which URL requests will be successful and the rules can be crafted to not redirect when there will be no file to fetch at the end of the process.
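For that kind of static site, a rough sketch of the check might look like this (it assumes one .html file per page and that the usual extensionless-URL rewrite is already in place - not applicable to a database-driven site):

# static-site sketch only: strip index.php/ but redirect only when a matching
# .html file actually exists; everything else falls straight through to the 404
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^index\.php/(.+)$ http://www.example.com/$1 [R=301,L]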

When the site content is actually in a database and a single index.php file is responsible for generating the pages, htaccess cannot possibly know whether any particular URL request will ultimately be successful.

In this case, you have to move the redirecting functions from htaccess rules to PHP scripting. This will slow the site down a little because mod_rewrite is blisteringly efficient and PHP is not so fast.

So to proceed, you rewrite all requests for pages (this is where it is really helpful that the site uses extensionless URLs for pages and only requests for files have an extension) to the index.php file. The PHP script looks at the URL and extracts the page name part. It then looks in the database to see if it can fulfil the request.

If it cannot, then irrespective of whether the URL request was www or non-www, or was index.php/ prefixed or not, the PHP script sends a 404 header and "includes" the error404 file so that the user sees an error message.

If the URL request can be fulfilled with content from the database, the PHP script looks at the URL request, and if the www is missing or there's an index.php/ part present, it instead sends a 301 header and redirects the browser to make a new request for the correct URL.

If the URL request can be fulfilled from the database and the URL is of the correct form, then the PHP script assembles the HTML page and sends it to the browser.
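A bare-bones sketch of that flow in PHP might look something like this (the lookup and page-building functions here are placeholders, not anything from a real CMS):

<?php
// front controller sketch: every page request is rewritten here by htaccess
$host = $_SERVER['HTTP_HOST'];
$uri  = $_SERVER['REQUEST_URI'];

// strip any leading /index.php and the query string to get the page name
$page = trim(preg_replace('#^/index\.php#', '', parse_url($uri, PHP_URL_PATH)), '/');

if (!page_exists_in_database($page)) {        // placeholder lookup function
    header('HTTP/1.1 404 Not Found');
    include 'error404.php';                   // placeholder error page
    exit;
}

// the page exists: if the host is wrong or the index.php/ form was used,
// send a single clean 301 to the canonical URL
if ($host !== 'www.example.com' || strpos($uri, '/index.php') === 0) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.example.com/' . $page);
    exit;
}

render_page($page);                           // placeholder page builder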

htaccess is a pre-processor that can sort out a whole load of stuff before the PHP kicks in. You can dispense with htaccess and move most of that functionality to PHP, but it will be a lot less efficient - partly because you're invoking the PHP engine instead of the fast mod_rewrite process, but mostly because you're hitting the database for both valid and non-valid requests. In particular, for non-valid requests that will ultimately be successful once the URL has been tidied, you're hitting the database twice. The database access is the slowest part of the system.

You'll still need non-www to www redirecting for image requests, but it's not so important for stylesheets and javascript files. You set up the standard non-www to www redirect in htaccess, but instead of (.*) for the pattern, which matches "all" requests, you set the non-www to www redirecting rule so that only requests with an extension (that's why using extensionless URLs for pages really helps here!) are redirected.

You still have the problem that for images that don't exist, there will be a redirect when non-www is requested and the 404 will be served only when www is requested. You get round this by using a preceding RewriteCond and the -f test. This drops the server efficiency even more because htaccess will have to hit the filesystem to see if a file exists. Luckily this happens only when the request is for non-www.

Make sure the -f test is NOT the first test in the non-www/www redirecting rule; test the HTTP_HOST variable first. If you accidentally put the -f test first, the filesystem will be read for www requests too, only for the next test (HTTP_HOST) to then say that this entire rule can be skipped.
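A sketch of that arrangement (the extension list is illustrative, and the host test comes before the filesystem test):

# non-www to www for file requests only; the host is tested first so the
# filesystem is only read for non-www requests, and only existing files redirect
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC]
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule \.(gif|jpe?g|png|css|js|ico)$ http://www.example.com%{REQUEST_URI} [R=301,L,NC]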

So, to answer the question: everything you envisage can be done, but it will be less efficient in terms of server overhead and speed of response. You will have to decide whether a slower site for all requests (valid or otherwise) or an unwanted redirect before the 404 (for non-www and/or index.php/ requests) is the way to go.

lucy24




msg:4527312
 11:00 am on Dec 12, 2012 (gmt 0)

Whoa. You mean #1 you don't have pages with "index.php" in the middle and #2 you don't have desirable links that somehow got "index.php" in the middle?

If so, it really is a non-problem.

Don't redirect anything that contains index.php. Only redirect the ones with index.php at the end. Tweak your existing rule to something like

RewriteCond %{THE_REQUEST} [A-Z]{3,9}\ /([^/.]+/)*index\.php\ HTTP
RewriteRule ^(([^/.]+/)*)index\.php http://www.example.com/$1 [R=301,L]

If your requests contain, or might contain, queries, the Condition needs to say

{blahblah as before} index\.php(\?\S+)?\ HTTP et cetera

The Rule itself is unchanged, because only Conditions can "see" the query.

g1smd




msg:4527322
 11:40 am on Dec 12, 2012 (gmt 0)

No. I read the question as meaning this:

www.example.com/index.php/this-page has a new URL and requests for this should be redirected to
www.example.com/this-page

www.example.com/junk-foo-bar-quux does not exist and serves a 404. Requests for
example.com/junk-foo-bar-quux,
example.com/index.php/junk-foo-bar-quux and
www.example.com/index.php/junk-foo-bar-quux get redirected and are then served a 404 after the redirect.

The site owner wants the duff URL requests to serve a 404 directly, not go through a redirect first - hence the previous long post on how to do it.

Sgt_Kickaxe




msg:4527337
 12:40 pm on Dec 12, 2012 (gmt 0)

Exactly, g1smd. And Lucy, I do have some really good incoming links to pages with index.php/ that I haven't been able to get the site owners to change, although Googlebot is my primary concern.

I really don't want to slow down the site, but some URLs with index.php/ in them even bring in traffic, so... hmmm. I think I'll go over the logs and figure out just how many index.php/ redirects are a must. Redirecting a few important ones in htaccess and letting all the rest, 404s included, get a proper 404 response might be the way to go.

Thanks again

lucy24




msg:4527586
 11:51 pm on Dec 12, 2012 (gmt 0)

Got it. You don't have blahblah/index.php/blahblah pages, but buried among the robots and spammers you have some bona fide blahblah/index.php/blahblah links. If there's a manageable number of them you're definitely better off with RewriteRules that name those specific files. And if the filenames are in the form

www.example.com/index.php/morestuff

it should be pretty painless because there's only one capture per page and you don't have to worry about matching up the "before" and "after" parts. The Rule will look something like

RewriteRule ^index\.php/(goodname1|goodname2|goodname3)$ http://www.example.com/$1 [R=301,L]

and you don't even need to look at the Request because no apache mod will ever insert "index.php" into the middle of a name. Then you make a separate RewriteRule for the ones with index.php at the end, where you do need a Condition.

g1smd




msg:4527596
 12:15 am on Dec 13, 2012 (gmt 0)

If there are only perhaps 20 or 30 URLs, then the list method suggested by Lucy would be just fine.

If the list runs to hundreds, the database method is best.

There is another way, for numbers in between. It's very similar to the long method explained several posts back, but with one big difference: instead of looking in the database to see if the requested pagename is good, you maintain a list in an array.

The rest of the code around the array is just a few lines of PHP. Chuck it all in a special-script.php file and rewrite (that's rewrite, not redirect) URL requests for example.com/index.php/something to the internal special-script.php file. You'll also need to add the various exclusions to the htaccess redirects in the same way as explained in that previous post.

The special-script.php file returns a 301 redirect for page names listed in the array, and the 404 status and error page for stuff not listed in the array.
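A bare-bones sketch of such a script (the array entries and the error page name are placeholders, and it assumes a rewrite along the lines of RewriteRule ^index\.php/ /special-script.php [L] sends the requests here):

<?php
// special-script.php sketch: handles rewritten index.php/ requests
$good_pages = array('goodname1', 'goodname2', 'goodname3');   // real page names go here

$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$page = trim(preg_replace('#^/index\.php#', '', $path), '/');

if (in_array($page, $good_pages, true)) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.example.com/' . $page);
} else {
    header('HTTP/1.1 404 Not Found');
    include 'error404.php';                                   // placeholder error page
}
exit;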

I use both the array method and the database method: it depends on the size of the site and whether new URLs will need to be added to the list in the future. If the list is short and is static, then it's the array. If stuff needs to be added later or the list is long, then the database method always wins.

In getting your head round this lot, never think in terms of files on the server hard drive having URLs. Instead think of this in terms of requested URLs and mapping those requests to internal resources. Both mod_rewrite and Apache deal with URL requests made by browsers and bots.

Sgt_Kickaxe




msg:4527616
 2:40 am on Dec 13, 2012 (gmt 0)

My *short* list has 42 right now, but I don't know about ALL incoming links to pages that are 7 years old, so there might be a few more. Logs helped me find the ones that bring traffic, and I'm using GWT and various SEO services to try to find *must keep* backlinks - not an easy task.

The important pages are not going to change over time - they haven't changed in years - so if the number stays under 50, can I do what Lucy suggested and then also create a static copy that will bypass the database entirely to pick up performance? I'm concerned about performance; I haven't pushed an .htaccess file hard enough to know where the limits start to get crossed. Will 50 of the following be better served with the PHP methods you described, or will htaccess handle this... e.g.

RewriteRule ^index\.php/(goodname1|goodname2|goodname3|goodname4)$ http://www.example.com/$1 [R=301,L]
RewriteRule ^index\.php/(goodname5|goodname6|goodname7|goodname8)$ http://www.example.com/$1 [R=301,L]


followed by

RewriteRule ^goodname1$ /cache/goodname1.html [L]
RewriteRule ^goodname2$ /cache/goodname2.html [L]
RewriteRule ^goodname3$ /cache/goodname3.html [L]


etc. Any loss due to htaccess should be offset by much faster load times... or is this already pushing it?

lucy24




msg:4527654
 5:30 am on Dec 13, 2012 (gmt 0)

Urk. I first pictured a single set of fifty pipe-separated options. That's definitely getting into php-script territory.

UNLESS--
:: insert boilerplate about yawn-provoking coincidence because I've just finished saying something similar in another thread ::
--unless many of your good names have elements in common, so it's really ^index\.php/goodname(\d+)$ or ^index\.php/goodname(foo|bar|sludge)$ redirected to goodname$1.
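In other words, something like (the names are placeholders):

RewriteRule ^index\.php/goodname(foo|bar|sludge)$ http://www.example.com/goodname$1 [R=301,L]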

a static copy that will bypass the database entirely

I missed a step here. What database? The one that was originally involved in generating the pages?

I'm not sure you need the rewrite. Why not rename all the old non-cache versions-- assuming you need to keep them on the server at all-- and let the new static versions have the filename that you've just finished redirecting to?

g1smd




msg:4527692
 8:23 am on Dec 13, 2012 (gmt 0)

The redirects look good, especially with the OR construct arranged in groups of four (and that is the same answer that I saw lucy add in another thread mere minutes ago).

If the URL requests are extensionless, then the rewrite to a static "cache" folder looks good. With some fancy PHP scripting it's also possible to automate the generation/creation of those static files each time the database copy is updated/edited/changed.

Once you know PHP the features you can add are unlimited.

Don't worry about the number of rules in htaccess. As long as the code is efficient, this stuff is blisteringly fast. The one thing to avoid for sure is (.*) at the beginning or in the middle of a RegEx pattern.
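For example (a contrived pair, just to show the shape of the pattern):

# slower: the leading (.*) forces the regex engine to try a match at every position
RewriteRule (.*)index\.php$ http://www.example.com/$1 [R=301,L]

# faster: anchored at the start, with a more specific sub-pattern
RewriteRule ^(([^/.]+/)*)index\.php$ http://www.example.com/$1 [R=301,L]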

The biggest problem with having a lot of redirects in htaccess is not so much the performance but that it becomes difficult to maintain a big file. If you are editing it often, the likelihood of making a typo that brings the whole site down (if you are lucky, so that you spot it quickly) or a part of the site down (which you don't spot for ages until your rankings are drying up) increases rapidly.

lucy24




msg:4527870
 7:07 pm on Dec 13, 2012 (gmt 0)

it becomes difficult to maintain a big file

I hope that was a typo for "big site" because otherwise I am even more confused than my natural state ;)

It's not that long since I followed up every htaccess edit with a quick view of some random page, just to make sure I hadn't introduced a 500-class typo. Now I only do it with large, complicated changes.

g1smd




msg:4527918
 8:57 pm on Dec 13, 2012 (gmt 0)

You should run a couple of tests after every change. One full stop or bracket out of place can bring a whole site down.

I did mean "big file". Some time ago I saw an htaccess file with more than 6000 directives in it. There was little chance of making sure all the code was correct, or of modifying it without breaking something.

lucy24




msg:4527951
 10:26 pm on Dec 13, 2012 (gmt 0)

Oh, OK, I was thinking "file" as in page. So we meant the same thing.

Sgt_Kickaxe




msg:4527974
 12:45 am on Dec 14, 2012 (gmt 0)

You should run a couple of tests after every change. One full stop or bracket out of place can bring a whole site down.


Or worse, leave it up sending all the wrong header codes but *looking* normal.

More than a couple of tests, too: if your htaccess file redirects from non-www to www, that's one test; if it removes index.php, that's another; then a third test to reverse the order and see what happens; then a 4th, 5th and 6th test to check that non-existent pages also behave as expected. Is your host sending soft 404s on some errors? A few more gray hairs on those, err, I mean tests.

In the end it's worth it: a solid htaccess file is like Tylenol for your search headaches. Get your server logs ready, and a cup of strong coffee; a second computer with a different browser helps, as does a user agent switcher... and don't forget about the mobile version of your site. Are you having fun yet? :)

I was just wondering, is there software out there that can do all of the testing related to htaccess changes besides log analyzers?

g1smd




msg:4528054
 10:03 am on Dec 14, 2012 (gmt 0)

Yes, Xenu LinkSleuth will take a plain text file list of URLs and test each one and report the result. You need to set scan depth to 1 (I think) so that it doesn't then go on to traverse the whole site (do that on a second separate test starting at example.com or www.example.com).

The text file list of URLs should include URLs that don't exist, URLs with appended junk, URLs with the wrong case, URLs with incorrect stuff (e.g. in a part of the URL that should be numeric add a letter or a period or comma), URLs with parameters in various orders as well as real URLs that do exist. Each URL should be listed in www and non-www form. Testing a site then becomes a couple of clicks. Keep the text file handy and be sure to add new test URLs to it as you think of them. I have a file with about 800 URLs in for testing one large site that I occasionally work on.
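If you ever want to roll your own, a few lines of PHP can run the same kind of status check against that text file (a rough sketch; urls.txt is whatever list you maintain):

<?php
// crude checker: prints the first status line returned for each URL in urls.txt
// (if a URL redirects, get_headers() also appends the follow-up responses to the array)
$urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($urls as $url) {
    $headers = @get_headers($url);
    echo $url . ' => ' . ($headers ? $headers[0] : 'no response') . "\n";
}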

For human readability, leave a blank line after every RewriteRule and add a comment before each block of code. I also number each block. Take the situation where I am adding a new redirect/rewrite pair of rules. In this case the matching rewrite will be a long way down the file. I will add the redirect as perhaps 2.14 then the matching rewrite will go in at 3.14. When I want to modify those rules at a later date, the pairing is obvious.
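So a redirect/rewrite pair might be laid out something like this (an illustrative sketch of the numbering, not a complete ruleset):

# 2.14 - redirect index.php/ requests for the legacy pages
RewriteRule ^index\.php/(goodname1|goodname2)$ http://www.example.com/$1 [R=301,L]

# 3.14 - matching internal rewrite to the static cached copies
RewriteRule ^(goodname1|goodname2)$ /cache/$1.html [L]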

My htaccess files usually have at least 4 major sections for mod_rewrite code.
0.xx - setting things up
1.xx - blocking access to bots and for malicious requests
2.xx - redirects
3.xx - rewrites

Make sure you add a date and the site name to your htaccess file as a comment. Keep old copies so that you can compare changes. I develop as "htaccess.sitename.01.txt" and increment the number. Once I upload that file it's a simple matter to rename it to .htaccess on the server. It's also a simple matter to go back one version if necessary.

I can also highly recommend using version control (Subversion, Git, etc) for development.

Sgt_Kickaxe




msg:4528406
 1:15 pm on Dec 15, 2012 (gmt 0)

I'm on those as well, g1smd. Although I've cringed at the thought of mastering htaccess et al. for years, once you dig in you want to learn more - it's powerful stuff.

*update: With index.php/ no longer redirecting to the non-index.php/ version, except for the root and a chosen handful of pages, the 404 floodgates have opened as expected. Since I'm now relying solely on htaccess to control redirected URLs instead of allowing WordPress to do it, GWT is already telling me that some existing URLs have up to 64(!) versions that are no longer redirecting. I'd say I had a busy spammer on my hands, but rankings are holding, so far.

Thanks again for the help - the past week I've spent almost exclusively on htaccess is already paying dividends. Knowing how htaccess works is like having an SEO sniper in your arsenal: you only get one shot to get it right, but boy does it take care of business!

[edited by: Sgt_Kickaxe at 1:27 pm (utc) on Dec 15, 2012]

g1smd




msg:4528408
 1:22 pm on Dec 15, 2012 (gmt 0)

With a few dozen lines of htaccess redirects and rewrites and a few dozen lines of PHP, I easily folded 85,000 duplicate content URLs into 1200 real product pages on one site earlier in the year.

Done well, it's a work of art not just a miracle of science. :)

Sgt_Kickaxe




msg:4531216
 10:14 pm on Dec 26, 2012 (gmt 0)

11 days after completing the changes to make the site function as it should, Google has seen fit to reduce traffic by 80%.

- Number of indexed pages remains unchanged
- The pages indexed are the same
- Rankings: same rank as before, though now only appearing 20% of the time

All of my indexed pages had a 301 from the version that contained index.php, and returning a 404 for those URLs instead of a 301 has decimated traffic.

I do not plan on changing anything further for Google's sake. I return a 404 on non-existent URLs (that aren't even indexed), as I should, so it will be interesting to see if the 80% returns after a Panda update or two.

IF that's what happens, it will suggest to me that you can make changes which hurt your rank quickly, but not changes that help it quickly.

Sgt_Kickaxe




msg:4532518
 2:31 pm on Jan 2, 2013 (gmt 0)

Three weeks later - traffic remains 80% down. The pages that don't exist all properly return a 404 error code and the previously indexed pages remain indexed.

Don't for one second believe that pages which aren't indexed can't affect your rankings. This was an eye opening exercise.

Tonearm




msg:4533062
 11:29 pm on Jan 3, 2013 (gmt 0)

Don't for one second believe that pages which aren't indexed can't affect your rankings. This was an eye opening exercise.

I'm not clear on what happened to indicate that the above is true. You had a 301 to a 404, you fixed that to return a 404 without the 301, and your traffic dropped?


- Number of indexed pages remains unchanged
- The pages indexed are the same

How do you go about determining this?
