Sorry, I know nothing about this subject, so I hope I am asking the right question...
Google has somehow indexed hundreds or maybe thousands of incorrect URLs from my site, and the incorrect URLs return a status code of 200 OK. I have been told that I can get the bad URLs back out of the index by having them return a 404 instead.
One of the problems is a double slash (//).
This would be a correct URL:
example.com/directory/
And this is the bad URL that I would want to 404:
example.com//directory/
The other major problem is an extra trailing slash.
This URL is correct:
example.com/page.html
and this is an incorrect URL:
example.com/page.html/
So how would I do that? And should I do that? I don't want to cause more problems than I fix :)
Advice appreciated. Thanks!
For more information, see the documents cited in our forum charter [webmasterworld.com] and the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com].
I've seen this problem before, so I'll go look and see if I can find the code. But generally, you won't be able to just copy it and expect it to work, so check out those links.
Jim
So is it possible, using a Perl snippet in my Perl program, to return a 404 page? I can spit out 200 OK pages with no problem - I just can't master the tricks to get Perl to output a 404 when I need to. Any particular info I should look up?
Thanks again for your time and help!
mr_lumpy
[b]print ("Status: 404 Not Found\n");[/b]
print ("Cache-Control: no-store\n");
print ("Content-Type: text/html[b]\n\n[/b]");
# Note two linefeeds on line above required to mark end of HTTP headers.
print ("<html>\n<head>\n");
print ("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">\n");
print ("<meta http-equiv=\"Content-Language\" content=\"en-US\">\n");
print ("<title>Resource Not Found</title>\n");
print ("</head>\n");
print ("<body text=\"#000000\" bgcolor=\"#FFFFFF\" link=\"#000099\">\n");
print ("<center><h1><font face=\"Arial,Helvetica\" color=\"#CC0000\">Requested page cannot be found</font></h1></center>\n");
print ("<p><font face=\"Arial,Helvetica\">The page you are looking for is one that has been removed or replaced.</font></p>\n");
print ("<p><font face=\"Arial,Helvetica\">Please visit our <a href=\"/site-map.html\">Site Map</a> or ");
print ("our <a href=\"/\">Home Page</a> to find your way around our site.</font></p>\n");
print ("</body>\n</html>\n");
Jim
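To connect that snippet back to the original question: here's a minimal sketch of a CGI-style check that returns the 404 only for the two bad URL patterns described above. The $ENV{REQUEST_URI} test and the two regexes are my own assumptions for illustration, not code from this thread.

#!/usr/bin/perl
use strict;
use warnings;

# Apache sets REQUEST_URI for CGI scripts; it holds the path exactly as requested.
my $uri = $ENV{'REQUEST_URI'} || '';

# Bad pattern 1: a doubled slash anywhere in the path (example.com//directory/)
# Bad pattern 2: a trailing slash after a filename extension (example.com/page.html/)
if ($uri =~ m{//} or $uri =~ m{\.[^/.]+/$}) {
    print ("Status: 404 Not Found\n");
    print ("Content-Type: text/html\n\n");
    print ("<html><head><title>Resource Not Found</title></head>\n");
    print ("<body><h1>Requested page cannot be found</h1></body></html>\n");
    exit;
}
# ...otherwise fall through to generate the normal 200 OK page...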
Here's what I'm using in my .htaccess. (I got this from this forum; I am in no way an expert or even remotely literate in this stuff.
Oh, and this code needs Apache with mod_rewrite, so it won't work on a Windows/IIS server.)
This goes in .htaccess:
Options +FollowSymLinks
RewriteEngine on
# Remove multiple slashes anywhere in URL
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . http://www.example.com%1/%2 [R=301,L]
#
# Remove trailing slash if filetype present in URL
RewriteRule ^(.+\.[^/]+)/$ http://www.example.com/$1 [R=301,L]
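With these rules in place, a request for example.com//directory/ or example.com/page.html/ should come back as a 301 with a Location header pointing at the clean URL. A note on the design, as I read it: the double-slash rule tests %{REQUEST_URI} in a RewriteCond rather than matching the slashes in the RewriteRule pattern itself because, in .htaccess (per-directory) context, the RewriteRule pattern is matched against a path that has already had the leading slash and directory prefix stripped, so the doubled slash may not be visible to the rule's own pattern, while the condition variable still sees it.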
It's taking a while, but the bots are following the redirects to the proper URLs on my site, and slowly the bad URLs are dropping out of the search results while the good ones are staying in.
If you try this, make sure you test the headers to be sure they really are returning a 301 permanent redirect (you don't want a 302 temporary redirect). Test, test, and re-test every variation you can think of... this is hugely important... you want a 301 response.
If you're a Firefox user, try the 'Live HTTP Headers' extension. It gives the most detailed results I've found. Be sure to clear your cache before testing, otherwise it will still show the old results.
If you're not a Firefox user, search for a header checker and you'll find a few online.
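If you'd rather script the check, here's a rough sketch using Perl's LWP::UserAgent (assuming it's installed; the test URL is just an example):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# max_redirect => 0 stops LWP from following the redirect,
# so we see the 301 itself instead of the final page.
my $ua  = LWP::UserAgent->new(max_redirect => 0);
my $res = $ua->get('http://www.example.com//directory/');

print $res->code, "\n";                           # expect 301
print $res->header('Location') || '(none)', "\n"; # expect http://www.example.com/directory/

Run it against every bad variation you can think of and confirm you get a 301 plus the correct Location each time.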
Other things to watch out for are being indexed both with and without the www (i.e. www.example.com and example.com), which Google has had issues with for ages, and being indexed under both example.com/ and example.com/index.html. Both of those can cause indexing trouble too.
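For completeness, and as an assumption on my part rather than anything posted above, the usual mod_rewrite fix for the with/without-www duplication is another 301; a minimal sketch for the same .htaccess (swap in your own domain):

# Redirect the bare domain to the www hostname:
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

The example.com/ versus example.com/index.html duplication is usually handled by linking consistently to / everywhere and never to index.html directly.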