Forum Moderators: Robert Charlton & goodroi


Why are my old URLs still indexed?

         

budbiss

4:05 pm on Feb 25, 2008 (gmt 0)

10+ Year Member



We launched a completely redesigned version of our website just over two weeks ago. While Google has indexed some of our new site, it still seems to have a lot of the old site in its index. Do I need to contact Google about this? Is there anything I can do, or do I have to simply wait for Googlebot to mark those pages as no longer valid?

pavlovapete

10:45 pm on Feb 25, 2008 (gmt 0)

10+ Year Member



Hi budbiss,

Google seems to visit old URLs years after they have gone. We are regularly spidered for pages that were removed 3+ years ago.

If you have an old site you may have inbound links using old addresses - in this case you might consider using 301 redirects to retain some link juice.

Not something to worry too much about IMO.

Cheers

piatkow

11:03 pm on Feb 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Have you deleted the old pages or just orphaned them by removing internal links in your site?

budbiss

5:01 pm on Feb 28, 2008 (gmt 0)

10+ Year Member



Have you deleted the old pages or just orphaned them by removing internal links in your site?

I suppose you could say we just orphaned them. Let me explain:

Our old website ran osCommerce, all the pages were generated from a database, and it was all on our own server. A few weeks ago we switched over to our new website, which uses Yahoo Store and Yahoo's servers. I didn't think there was a way to reference all of the dead links from the old website, so I just let it go. I figured Google would figure it out.

Is there something I can/should do at this point?

Oh, and another thing...I am now using Google Webmaster Tools. I've never used it before. Under URLs Not Found, it is listing 342 URLs. I never realized how "anti-SEO" the old site was. A lot of these URLs Google is saying it can't find are actually from Googlebot using our site search :o No wonder that old site placed so poorly.

Anyway, I'm just trying to fix what I can now and move on. Any advice is much appreciated!

tedster

5:10 pm on Feb 28, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A lot of these URLs Google is saying it can't find are actually from Googlebot using our site search

We've got a couple of current threads discussing exactly that:
Google Indexed My Site Search Results Pages [webmasterworld.com]
Google indexing large volumes of (unlinked?) dynamic pages [webmasterworld.com]

You're fortunate that these URLs are now 404 for you. Once you make sure that old URLs either return a 404 status or redirect as you intend, you've done what you can. 404 URLs can hang around for quite a while, but they're not going to do you any damage.

Note this sentence, from within a Webmaster Tools account:

...if unrecognizable URLs appear in the 404 report, it's fine to ignore them. However, it's still worthwhile to review any errors you receive, and to examine any affected pages for problems.

budbiss

1:59 pm on Mar 7, 2008 (gmt 0)

10+ Year Member



Google Webmaster Tools is still reporting that 177 of the links from my old site are not found (404). It appears that Googlebot just keeps on visiting my site and looking for these urls. I assume that this is because Google isn't sure if I intentionally removed those urls or if they're just temporarily not available. Is there a way that I can identify each of these urls (perhaps in the robots.txt?) and mark them as intentionally removed?

budbiss

2:08 pm on Mar 7, 2008 (gmt 0)

10+ Year Member



I just came across a website that talks about this issue and it recommends that you go into ".htaccess" and add the following text:

redirect 301 /olddirectory/oldpage.html http://www.example.com/newpage.html

Now one question I have is, do I want to do a "301 redirect" for these old urls? They don't have a 'replacement' url on the new site. I can just redirect them to index.html, but will doing that just keep them alive longer? Would that confuse Googlebot? I don't want to end up making things worse.

[edited by: tedster at 9:19 pm (utc) on Mar. 7, 2008]
[edit reason] use example.com - it can never be owned [/edit]

budbiss

3:10 pm on Mar 7, 2008 (gmt 0)

10+ Year Member



Update: I am using the Yahoo Store platform, and I just discovered that apparently Yahoo doesn't allow me to access the .htaccess file. I don't understand why, and I'm not happy about it, but moving on: what else can I do? Could I create each of these pages physically and put a 301 in the header, perhaps?

g1smd

8:17 pm on Mar 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If the pages return a 404 status then you have nothing further to worry about.

Google will check those URLs almost forever just in case content does one day reappear there.

You do not need to worry about that. At all.

budbiss

3:14 pm on Mar 8, 2008 (gmt 0)

10+ Year Member



Thanks g1smd, it is reassuring to know that.

Is it correct to believe that if GWT has these links marked as 404 that my webserver is indeed reporting them as 404?

Is there a way I can test this? Perhaps there is a feature on one of the web browsers that will show me the code that the webserver returns?

tedster

6:42 pm on Mar 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes - you can use Firefox and install the LiveHTTPHeaders add-on.
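Another way to run the same check is with a short script, independent of any browser add-on. A minimal sketch in Python (the test URL below is made up; the key detail is that the request does not follow redirects, so you see the real status code the server sends):

```python
import http.client
from urllib.parse import urlparse

def status_of(url: str) -> int:
    """Send a HEAD request for `url` and return the raw HTTP status
    code without following redirects, so a 302 shows up as 302
    rather than as the 200 of the page it redirects to."""
    parts = urlparse(url)
    conn = http.client.HTTPConnection(parts.netloc)
    try:
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().status
    finally:
        conn.close()
```

On a correctly configured server, a made-up address like `status_of("http://www.example.com/urldoesnotexisttest")` should come back as 404, not as a 302 or a 200.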

budbiss

7:17 pm on Mar 8, 2008 (gmt 0)

10+ Year Member



Tedster, that is EXACTLY the type of thing I was looking for! Thanks!

Okay, so here is what we get when I enter in a URL that doesn't exist...

http://www.example.com/urldoesnotexisttest

GET /urldoesnotexisttest HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: __utmz=1.1205003081.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none); __utmb=1; __utma=1.2101287198.1205003081.1205003081.1205003219.2; __utmc=1

HTTP/1.x 302 Found
Date: Sat, 08 Mar 2008 19:09:10 GMT
P3P: policyref="http://aaa.testtesttest.com/aaa/aaa.xml", CP="CAO DSP COR CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE GOV"
Cache-Control: private
Location: http://www.example.com/
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html
Expires: Sat, 08 Mar 2008 19:09:10 GMT

[edited by: tedster at 9:22 pm (utc) on Mar. 8, 2008]
[edit reason] switch to example.com [/edit]

g1smd

7:24 pm on Mar 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ignore what the on-page text says about the status of the page; use an HTTP header checker to carefully check all the HTTP header information.

OK. You did that while I was reading another post with this one in another tab...

You have a MAJOR problem. The status is "302". It should be "404". You have a misconfiguration of either the server or something in your scripts. It does need to be fixed.

The 302 is a disaster. What it is saying is: "Index this URL that doesn't actually exist, use the content from my root page, and treat it as if it also exists at this address too." That is, you potentially now have an infinite number of duplicate content copies of your root page up for indexing.

budbiss

7:34 pm on Mar 8, 2008 (gmt 0)

10+ Year Member



I'm confused about the results from "Live HTTP headers", so I did a search for other HTTP header checker utilities and I came across a website that offers the service for free. I typed in a fake URL and it immediately redirected me to the home page and of course the home page returned a 200.

Ok, hang on one second...as I write this I am still investigating...it appears to me that the Firefox plugin Live HTTP headers is getting a 302 code. And the 302 code means "Moved Temporarily". This isn't what I want, right?

g1smd

7:35 pm on Mar 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is this an Apache or IIS server?

The problem may be easier to fix if you're using Apache.

budbiss

7:40 pm on Mar 8, 2008 (gmt 0)

10+ Year Member



Good question. I don't know to be honest. It's built on the Yahoo store platform. I will see if I can find out.

budbiss

7:51 pm on Mar 8, 2008 (gmt 0)

10+ Year Member




Ignore what the on-page text says about the status of the page, use an HTTP Header Checker to carefully check all the HTTP Header information.
OK. You did that while I was reading another post with this one in another tab...

You have a MAJOR problem. The status is "302". It should be "404". You have a misconfiguration of either the server or something in your scripts. It does need to be fixed.

The 302 is a disaster. What it is saying is: "Index this URL that doesn't actually exist, use the content from my root page, and treat it as if it also exists at this address too." That is, you potentially now have an infinite number of duplicate content copies of your root page up for indexing.

GWT states that all the old website's links are 404. So should I assume that Google was smart enough to realize that when my webserver returned a 302 and redirected to the home page, it was, for all intents and purposes, my site's way of saying the page you are looking for doesn't exist any longer?

Well, either way, you say that the 302 is indeed a bad way to handle non-existing links, correct? If so, what should be done instead?

tedster

9:35 pm on Mar 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



GWT states that all the old website's links are 404. So should I assume that Google...

That is not a good assumption to make - in fact, I'd say it's dangerous. If your server is currently returning a 302 status redirect, then the URL is no longer returning a 404 status. The HTTP status for any given request can only be one thing.

The best practice is to return either a 301 status redirect to a new url that is really a replacement for that specific page, or let the old urls return a true 404.

Unless the old urls have lots of direct entry traffic, search traffic, or many backlinks, I find the best fix is to let a removed page return a 404 Not Found status. Or some people use 410 Gone, since Google treats that the same as a 404. Google can sort out 404/410 issues quickly. But 301 redirects have been subject to abuse, misuse and manipulation, so Google needs to check them out for trust issues.

<added>
If you 302 redirect all "bad" urls to the domain root (or any other page, such as a custom error message page) what your server is saying is that ALL your bad urls actually exist and they have the same content. Over time, that can become a huge pile of duplicate content!

Many servers are misconfigured like this, and over the past year I have seen Google trying to test for the situation and accommodate it in some way.

But we should not hand over that responsibility to Google - there's every chance that they might get it wrong in any single case. We should make sure our servers are properly configured - it's in our own best interests to do so.
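To make that concrete: on an Apache server where .htaccess is available (which, per earlier in this thread, Yahoo Store is not), the two clean outcomes described above might be sketched like this with mod_alias - the paths and target URL here are placeholders, not real addresses:

```apache
# Old URL that has a genuine replacement page: permanent redirect
Redirect permanent /olddirectory/oldpage.html http://www.example.com/newpage.html

# Old URL removed with no replacement: 410 Gone
# (Google treats this the same as a 404)
Redirect gone /olddirectory/removedpage.html
```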

[edited by: tedster at 10:10 pm (utc) on Mar. 8, 2008]

g1smd

9:49 pm on Mar 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hang on a minute there Tedster. I have been meaning to write something about this for weeks. Errr. You added more while I was posting.

Go with this for a while.

When you sign up for GWT, Google asks you to put an empty verification file on your server named something like google12a3b4c5d6e7f89.html which is used to verify your account. Google tells you the name to use, and it is different for each account.

Google then comes back every few weeks and asks for a URL like example.com/noexist_12a3b4c5d6e7f89.html which looks like it is testing what your 404 response really looks like.

If they trust the response code and content that is returned, then why not use that as a basis for all the other URLs that they encounter on your site, through internal linking and links from other sites?

In this way, Google can now know that your 302 is really a misconfigured 404. At least I guess that is what they are using it for?

However, you still need to fix this problem, because Yahoo, Microsoft, Ask, and all the others are still getting the wrong message from your site.

budbiss

10:30 pm on Mar 8, 2008 (gmt 0)

10+ Year Member



Thanks a lot, guys. I will ask the web design company that coded this site to make all non-existent URLs return a 404 (if it wasn't built on the Yahoo Store platform, I'd do it myself).

One more thing: do you think I should just make a static 404 page that has a link to our home page? Or would it be OK with Google if the 404 page automatically forwarded the visitor to the home page after five seconds? (If the latter might confuse Googlebot, then I will definitely scratch that idea.)

g1smd

10:37 pm on Mar 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just have a 404 page that explains there is an error, and then has links that the user can click if they want to. Those links will go to the primary areas of your content.

Don't try to be 'clever' by making the content the same as the root index page, and especially don't feed redirects back to the browser or bot.
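On Apache, a custom error page like that is wired up with the ErrorDocument directive. A minimal sketch, assuming a hypothetical error page at /notfound.html (on Yahoo Store you would instead look for an equivalent setting in the store's control panel):

```apache
# Serve a friendly page for missing URLs while still returning
# the 404 status itself. Use a local path: giving ErrorDocument
# a full http:// URL makes Apache answer with a 302 redirect
# instead of a plain 404 - exactly the problem described above.
ErrorDocument 404 /notfound.html
```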

budbiss

6:55 pm on Mar 13, 2008 (gmt 0)

10+ Year Member



What do you guys think? Would a 404 page that has the main page's header and left navigation (including all the navigation links) be OK with Googlebot? I assume that since this will be returning a 404 (and NOT a 302 redirect like before), it should be a significant improvement.

tedster

6:59 pm on Mar 13, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This kind of thing is common practice on most of the sites I work with - it's a good thing. As you noted, the key is returning the 404 HTTP header. From there, you can customize the error page however works best for your users.