Forum Moderators: phranque

Properly configuring 404 to escape dupe content disaster

Asia_Expat

5:07 pm on Apr 16, 2006 (gmt 0)

10+ Year Member



Some time ago, I HAD to change my URLs from www.widgets.com/index.php
to
www.widgets.com/forum/index.php
and I'm starting to roll out my standard HTML content in addition to the forum.

I thought I'd left things long enough to rid Google of all the old URLs, but I was wrong. There are now thousands and thousands of old URLs that show as unique topics in the SERPs, but when you click on a link, my server just serves the main www.widgets.com home page... thousands and thousands of URLs all producing the same content. It's a dupe disaster.
(NOTE: my main page still ranks well in the SERPs: position 3 for the main search term out of 2.5 million results.)

How on Earth do I configure my server to return genuine 404 headers for all those useless URLs? :(

jdMorgan

1:13 pm on Apr 17, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your server should be doing that now. But something else you have done is interfering with the default 404 behaviour. This could be because of some error-handling code that isn't right, or simply because you have redirected all requests to a script on a dynamic site, for example.

Jim

Asia_Expat

3:29 pm on Apr 17, 2006 (gmt 0)

[widgets.com...]

... is an example of a nonexistent URL that Google won't drop. Anything after the question mark, no matter what you type, produces the homepage.
I had a long chat with my server support team. They explained that this is default behaviour of PHP rather than Apache, and that it's something a lot of people struggle to configure. The person I spoke to was not a PHP specialist and had no answer to my problem.

I need these old URLs to produce a genuine 404 error or I'm never going to be rid of them.

Asia_Expat

3:30 pm on Apr 17, 2006 (gmt 0)

Note, there is no CMS config file causing this problem. The only config file on the server belongs to the forum software, but that's in a different directory now, so it can't be the problem.

Asia_Expat

3:46 pm on Apr 17, 2006 (gmt 0)

OK, what if I was to rename the home page from index.php to index.html and use htaccess to run PHP scripts in HTML pages? The old nonexistent URLs would then produce the multiple choice extension error page... how would Google react to that?

Asia_Expat

4:21 pm on Apr 17, 2006 (gmt 0)

Maybe I shouldn't be worrying about this at all? Perhaps Google will eventually drop all those old pages... but will they cause me dupe problems in the meantime?
... Your further comments would be most appreciated.

jdMorgan

5:42 pm on Apr 17, 2006 (gmt 0)

Assuming that *any* query string appended to the index.php URL is to return a 404, then all you'd need is something like:

RewriteCond %{QUERY_STRING} .
RewriteRule ^index\.php$ /this_file_does_not_exist.html? [L]

Any request for index.php that has a query string appended will be internally rewritten to /this_file_does_not_exist.html (the trailing "?" clears the query string). Since that file does not exist, the server returns a 404.
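A variant that sets the status explicitly, rather than relying on a missing file, might look like the sketch below. It assumes Apache 2.2 or later, where mod_rewrite's R flag accepts status codes outside the 3xx range, and it is offered as an alternative rather than as the rule given above. The G flag (410 Gone) is shown commented out as an option, on the assumption that the old URLs are never coming back.

```apache
RewriteEngine On

# Match only requests for index.php that carry a non-empty query string
RewriteCond %{QUERY_STRING} .
# "-" means no substitution; R=404 returns 404 Not Found and stops rewriting
RewriteRule ^index\.php$ - [R=404,L]

# Alternative (assumption: the old URLs are gone for good): answer 410 Gone
# RewriteCond %{QUERY_STRING} .
# RewriteRule ^index\.php$ - [G]
```

Either form avoids depending on a placeholder filename that must stay nonexistent.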

However, if you are also rewriting static URLs to dynamic URLs involving index.php, then you would need to make sure that only client requests --and not previously-rewritten requests-- are rewritten:


RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\?[^\ ]+\ HTTP/
RewriteRule ^index\.php$ /this_file_does_not_exist.html? [L]
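
To illustrate why the THE_REQUEST test matters, here is a sketch that combines the 404 rule with an internal static-to-dynamic rewrite. The URL pattern /topic-123.html and the "topic" parameter are hypothetical, invented purely for illustration.

```apache
RewriteEngine On

# THE_REQUEST holds the original request line sent by the client, e.g.
#   GET /index.php?topic=123 HTTP/1.1
# Internal rewrites do not change it, so this rule fires only when the
# client itself asked for index.php with a query string.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\?[^\ ]+\ HTTP/
RewriteRule ^index\.php$ /this_file_does_not_exist.html? [L]

# Hypothetical internal rewrite: /topic-123.html -> /index.php?topic=123
# In .htaccess the rules run again after this rewrite; a plain
# QUERY_STRING test would then match and wrongly return a 404.
RewriteRule ^topic-([0-9]+)\.html$ /index.php?topic=$1 [L]
```

With these rules, a client request for /topic-123.html should be served by index.php, while a direct client request for /index.php?topic=123 should get a 404.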

Jim

Asia_Expat

6:23 pm on Apr 17, 2006 (gmt 0)

Thanks Jim. That seems to work and I hope it's the right thing for me to do in the long run. 404 errors are now produced for all those old URLs.
My htaccess file now looks like this...

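Options +FollowSymLinks
RewriteEngine On
# Canonical host: 301-redirect example.info to www.example.info
RewriteCond %{HTTP_HOST} ^example\.info$ [NC]
RewriteRule ^(.*)$ http://www.example.info/$1 [R=301,L]
# Return a 404 for any request for index.php that carries a query string
RewriteCond %{QUERY_STRING} .
RewriteRule ^index\.php$ /this_file_does_not_exist.html? [L]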

... Incidentally, I discovered a very strange thing during this. I removed the htaccess file completely for a few moments while I changed it and tested the canonical thing. I was amazed to see that even without the htaccess file, non-www URLs were still being redirected to www ones!?... very odd.

[edited by: jdMorgan at 8:09 pm (utc) on April 17, 2006]
[edit reason] Examplified. [/edit]

Asia_Expat

8:06 pm on Apr 17, 2006 (gmt 0)

I've tested the canonical thing on another website and I'm surprised to see that after adding the 301 RewriteRule to fix the canonical issue for the first time and then removing the htaccess file, the redirect to www still happens!... what does this mean? Why is this happening?

jdMorgan

8:11 pm on Apr 17, 2006 (gmt 0)

It means that you probably forgot to flush your browser cache after changing the configuration file, so your browser returned the previously-cached response for that URL, which was a redirect.

Always flush your browser cache before testing any change to your config files.

Jim

Asia_Expat

8:45 pm on Apr 17, 2006 (gmt 0)

Oh I see... I hit F5 to do that but it obviously didn't work... I feel silly now.
Thanks anyways for your help Jim... I learned a bit more today :-)

jdMorgan

11:38 pm on Apr 17, 2006 (gmt 0)

If you are still using IE (shudder), then try Control-F5 to force a reload from the server.

Jim