CMS software uses 404 redirect for friendly URLs

Site not picked up by Yahoo. I probably found out why

         

lammert

8:31 am on Jul 27, 2005 (gmt 0)


Being an HTML addict, I have always coded my websites in plain HTML in a text editor. Some time ago I started a new site using well-known CMS software. I was surprised by how quickly I could add new pages and structure the layout, so I added quite a lot of information.

MSN picked up the site and now spiders it daily for new information. Google comes by occasionally but has not indexed all pages, and Yahoo has indexed only the homepage. Because the site has been online for almost a year, I started investigating why one search engine had no problem spidering the whole site while another stopped at the index.

For this investigation I used an online server header checker. I was really surprised to see that all my content pages returned an HTTP 404 (file not found). Why? I was puzzled, because I had no problems viewing the pages in my browser.
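For anyone who wants to run the same check without an online tool, here is a minimal sketch of a header check in Python. The function name fetch_status is my own invention, and it sends a HEAD request, which some servers answer differently from a normal GET, so treat the result as a first indication rather than proof.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code a server sends for the given URL."""
    parts = urlsplit(url)
    # Pick the connection type from the URL scheme.
    if parts.scheme == "https":
        connection = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        connection = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # HEAD asks for the status line and headers only, not the page body.
        connection.request("HEAD", path)
        return connection.getresponse().status
    finally:
        connection.close()
```

Calling fetch_status("http://www.example.com/param1/param2/") on a content page should give 200 when everything is configured properly. In my case it would have given 404.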

Then I remembered that I had configured the CMS to use "search engine friendly" URLs. The default URLs looked something like www.example.com/index.php?param1&param2&param3&param4, but with SE friendly URLs they become something like www.example.com/param1/param2/param3/param4/. At the time, a small change in the .htaccess file was necessary to make everything work. This change was:

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*) /index.php

Because of the 404 code, I started reading the mod_rewrite manual of the Apache server. There I discovered that the first two lines in the .htaccess file mean something like: "If no physical file or directory exists with this name, then rewrite to /index.php". Presumably the CMS writers had built a clever front end into index.php that translates SE friendly URLs into parameter values internally and then returns the requested page.
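I have not read the CMS source, so the following is only a guess at what such a translation could look like, written in Python rather than PHP for brevity. The function names split_friendly_url and to_query_url are hypothetical and do not come from the CMS.

```python
from typing import List


def split_friendly_url(path: str) -> List[str]:
    """Split '/param1/param2/' into ['param1', 'param2'], skipping empty segments."""
    return [segment for segment in path.split("/") if segment]


def to_query_url(path: str) -> str:
    """Map an SE friendly path back to the default index.php?... form."""
    params = split_friendly_url(path)
    if not params:
        # A request for the site root carries no parameters.
        return "/index.php"
    return "/index.php?" + "&".join(params)
```

With this sketch, to_query_url("/param1/param2/param3/param4/") gives "/index.php?param1&param2&param3&param4", which matches the default URL form described above.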

Because my browser does not act on the 404 status code and simply renders whatever content comes back, it displays the page normally. Search engines, however, do check HTTP status codes, so Yahoo, and maybe Google too, concluded that the content pages didn't exist and that indexing them was unnecessary.
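To convince myself that a 404 response can still carry a complete page, a small local experiment helps. This is a sketch under my own assumptions: the handler class, the page content and the function names are made up, and the server runs only on 127.0.0.1 on a free port.

```python
import http.client
import http.server
import threading

PAGE = b"<html><body><h1>Article</h1></body></html>"


class NotFoundWithContentHandler(http.server.BaseHTTPRequestHandler):
    """Answers every GET with status 404 but still sends a full HTML page."""

    def do_GET(self):
        self.send_response(404)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        # Keep the experiment quiet; no access log on stderr.
        pass


def fetch(port: int, path: str):
    """Return (status, body) for a GET request to the local test server."""
    connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        connection.request("GET", path)
        response = connection.getresponse()
        return response.status, response.read()
    finally:
        connection.close()


def demo():
    """Start the server on a free port, request one page, then shut down."""
    server = http.server.HTTPServer(("127.0.0.1", 0), NotFoundWithContentHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch(server.server_address[1], "/param1/param2/")
    finally:
        server.shutdown()
        server.server_close()
```

Here demo() returns the status 404 together with the full HTML body. A browser shows that body as an ordinary page, while a client that looks at the status code first sees only "not found".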

I have now manually added a few proper rewrite rules to my .htaccess file, and all pages now return the 200 OK code. I hope the remaining pages are picked up by the SEs soon.