Forum Moderators: Robert Charlton & goodroi
Google seems to visit old URLs years after they have gone. We are regularly spidered for pages that were removed 3+ years ago.
If you have an old site you may have inbound links using old addresses - in this case you might consider using 301 redirects to retain some link juice.
Not something to worry too much about IMO.
Cheers
Have you deleted the old pages or just orphaned them by removing internal links in your site?
Our old website ran OSCommerce and all the pages were generated from a database, and it was all on our own server. A few weeks ago we switched over to our new website, which uses Yahoo Store and Yahoo's server. I didn't think there was a way to reference all of the dead links from the old website, so I just let it go. I figured Google would figure it out.
Is there something I can/should do at this point?
Oh and another thing...I am now using Google Webmaster Tools. I've never used it before. Under URLs Not Found, it is listing 342 URLs. I never realized how "anti-SEO" the old site was. A lot of these URLs Google is saying it can't find are actually from Googlebot using our site search :o No wonder that old site placed so poorly.
Anyway, I'm just trying to fix what I can now and move on. Any advice is much appreciated!
A lot of these URLs Google is saying it can't find are actually from Googlebot using our site search
We've got a couple of current threads discussing exactly that:
Google Indexed My Site Search Results Pages [webmasterworld.com]
Google indexing large volumes of (unlinked?) dynamic pages [webmasterworld.com]
You're fortunate that these urls are now 404 for you. Once you make sure that old urls either return a 404 status or redirect as you intend, you've done what you can. 404 urls can hang around for quite a while, but they're not going to do you any damage.
Note this sentence, from within a Webmaster Tools account:
...if unrecognizable URLs appear in the 404 report, it's fine to ignore them. However, it's still worthwhile to review any errors you receive, and to examine any affected pages for problems.
Redirect 301 /olddirectory/oldpage.html http://www.example.com/newpage.html
Now one question I have is, do I want to do a "301 redirect" for these old urls? They don't have a 'replacement' url on the new site. I can just redirect them to index.html, but will doing that just keep them alive longer? Would that confuse Googlebot? I don't want to end up making things worse.
[edited by: tedster at 9:19 pm (utc) on Mar. 7, 2008]
[edit reason] use example.com - it can never be owned [/edit]
Is it correct to believe that if GWT has these links marked as 404 that my webserver is indeed reporting them as 404?
Is there a way I can test this? Perhaps there is a feature on one of the web browsers that will show me the code that the webserver returns?
Okay, so here is what we get when I enter in a URL that doesn't exist...
http://www.example.com/urldoesnotexisttest
GET /urldoesnotexisttest HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: __utmz=1.1205003081.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none); __utmb=1; __utma=1.2101287198.1205003081.1205003081.1205003219.2; __utmc=1
HTTP/1.x 302 Found
Date: Sat, 08 Mar 2008 19:09:10 GMT
P3P: policyref="http://aaa.testtesttest.com/aaa/aaa.xml", CP="CAO DSP COR CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE GOV"
Cache-Control: private
Location: http://www.example.com/
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
Expires: Sat, 08 Mar 2008 19:09:10 GMT
[edited by: tedster at 9:22 pm (utc) on Mar. 8, 2008]
[edit reason] switch to example.com [/edit]
OK. You did that while I was reading another post with this one in another tab...
You have a MAJOR problem. The status is "302". It should be "404". You have a mis-configuration of either the server or something in your scripts. It does need to be fixed.
The 302 is a disaster. What it is saying is: "Index this URL that doesn't actually exist, use the content from my root page, and treat it as if it also exists at this address too." That is, you potentially now have an infinite number of Duplicate Content copies of your root page up for indexing.
Ok, hang on one second...as I write this I am still investigating...it appears to me that the Firefox plugin Live HTTP headers is getting a 302 code. And the 302 code means "Moved Temporarily". This isn't what I want, right?
Ignore what the on-page text says about the status of the page, use an HTTP Header Checker to carefully check all the HTTP Header information.
GWT states that all the old website's links are 404. So should I assume that Google was smart enough to realize that when my webserver returned a 302 and redirected to the home page, that for all intents and purposes, it was my site's way of saying that the page you are looking for doesn't exist any longer?
Well, either way, you say that the 302 is indeed a bad way to handle non-existing links, correct? If so, what should be done instead?
GWT states that all the old website's links are 404. So should I assume that Google...
That is not a good assumption to make - in fact, I'd say it's dangerous. If your server is currently returning a 302 status redirect, then the url is no longer returning a 404 status. The http status for any given request can only be one thing.
The best practice is to return either a 301 status redirect to a new url that is really a replacement for that specific page, or let the old urls return a true 404.
Unless the old urls have lots of direct entry traffic, search traffic, or many backlinks, I find the best fix is to let a removed page return a 404 Not Found status. Or some people use 410 Gone, since Google treats that the same as a 404. Google can sort out 404/410 issues quickly. But 301 redirects have been subject to abuse, misuse and manipulation, so Google needs to check them out for trust issues.
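On an Apache server (assuming you have access to the config or an .htaccess file — shared hosting like Yahoo Store may not allow this), those options look roughly like the lines below. The paths are made-up examples:

```apache
# 301: only for an old url that has a real replacement page on the new site
Redirect 301 /old-catalog/widget.html http://www.example.com/widget.html

# 410 Gone: for a page that is removed for good, with no replacement
Redirect gone /old-catalog/discontinued.html

# Any url with no rule simply falls through to a true 404 Not Found
```

The key point is that anything you don't explicitly redirect should return a genuine 404 status, not a catch-all redirect.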
<added>
If you 302 redirect all "bad" urls to the domain root (or any other page, such as a custom error message page) what your server is saying is that ALL your bad urls actually exist and they have the same content. Over time, that can become a huge pile of duplicate content!
Many servers are misconfigured like this, and over the past year I have seen Google trying to test for the situation and accommodate it in some way.
But we should not hand over that responsibility to Google - there's every chance that they might get it wrong in any single case. We should make sure our servers are properly configured - it's in our own best interests to do so.
[edited by: tedster at 10:10 pm (utc) on Mar. 8, 2008]
Go with this for a while.
When you sign up for GWT, Google asks you to put an empty verification file on your server named something like google12a3b4c5d6e7f89.html which is used to verify your account. Google tells you the name to use, and it is different for each account.
Google then comes back every few weeks and asks for a URL like example.com/noexist_12a3b4c5d6e7f89.html which looks like it is testing what your 404 response really looks like.
If they trust the response code and content that is returned, then why not use that as a basis for all the other URLs that they encounter on your site, through internal linking and links from other sites?
In this way, Google can now know that your 302 is really a misconfigured 404. At least I guess that is what they are using it for?
However, you still need to fix this problem, because Yahoo, Microsoft, Ask, and all the others are still getting the wrong message from your site.
One more thing, do you think I should just make a static 404 page that has a link to our home page? Or would it be OK with Google if the 404 page automatically forwarded the visitor to the home page after five seconds? (If the latter might confuse Googlebot then I will definitely scratch that idea.)
Don't try to be 'clever' by making the content the same as the root index page, and especially don't feed redirects back to the browser or bot.
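If the server is Apache and you can use an .htaccess file, serving a static custom error page while keeping the proper status is one line (the filename here is just an example):

```apache
# Serve a static custom 404 page; Apache still sends the 404 status code.
# Use a local path, NOT a full http:// URL -- a full URL makes Apache
# issue a redirect instead, which recreates the problem discussed above.
ErrorDocument 404 /notfound.html
```

Put a normal link back to the home page on that page, and nothing that auto-forwards.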