More dead links than ever

Forum Moderators: open

Message Too Old, No Replies

Just where are the results coming from?

Arnett

1:36 am on Sep 5, 2003 (gmt 0)

I don't know about anyone else,but since the "new and improved" Google was rolled out I'm finding more and more dead links in the top 100 search results than ever.

Here's something that worries me more than that. Last September I deleted a bunch of files for an affiliate program from the server. After Esmerelda I was surprised to see the index page in the search results. I uploaded the files back to the server. Now the index page is #1 for its targeted search term.

Just where is Google getting its database from? If the data in the search database was gathered by Googlebot as they would have us believe then my pages should have "faded out" of Googles' index months ago. If Googlebots are busy travelling the internet indexing and updating pages why are there so many more dead links in the SERPS?

ciml

6:17 pm on Sep 5, 2003 (gmt 0)

If the Web server is properly configured, and if there's a link to the page, then a 404 should disapear quite quickly.

This is only if Google fetches the page. URL-only listings indicate URLs that haven't been fetched, so Googlebot never sees the 404.

Arnett

11:48 pm on Sep 5, 2003 (gmt 0)

That brings up an issue that I've seen a lot of questions about lately. 404 handling. I redirect 404s using .htaccess. If I delete a page from a site and it redirects to the home page the original url will never be deleted from Google's index even though I want it gone.

PatrickDeese

1:38 am on Sep 6, 2003 (gmt 0)

That brings up an issue that I've seen a lot of questions about lately. 404 handling. I redirect 404s using .htaccess. If I delete a page from a site and it redirects to the home page the original url will never be deleted from Google's index even though I want it gone.

let your site 404 missing files, but make your "missing" page a copy of your homepage - use custom 404 pages.

jdMorgan

1:45 am on Sep 6, 2003 (gmt 0)

Use any method you like. As long as you return 404-Not Found or, for HTTP/1.1 and above, 410-Gone, you'll be OK. Check your server's response to make sure your 404/410 is correct using the Server Headers Checker [webmasterworld.com].

Jim

Arnett

2:00 am on Sep 6, 2003 (gmt 0)

let your site 404 missing files, but make your "missing" page a copy of your homepage - use custom 404 pages.

Now I'm confused. If I delete obsolete-page.html and replace it with a copy of my home page the 404 error won't happen. That could also result in a "duplicate content" penalty from Google.

PatrickDeese

2:02 am on Sep 6, 2003 (gmt 0)

[google.com...]

Arnett

8:47 am on Sep 6, 2003 (gmt 0)

Thanks a BUNCH PatrickDeese!

I've had .htaccess set to send all 404s to my index page. Over the years I have created and deleted hundreds of pages not knowing that Google would follow the 404,index the home page with the missing url and then PENALIZE ME for duplicate content considering all of the pages dupes.

I've created custom 404 pages for all of my domains now. Now Google will follow all of my dead links to the custom page and index them all to the same content and draw the same penalty for the domain.

kaled

9:23 am on Sep 6, 2003 (gmt 0)

I don't consider myself an expert on these matters, however, a duplicate content penalty for 404 pages seems like total nonsense to me. If this existed, all a competitor would have to do in order to ensure you got a penalty from Google would be to create a load of links on a dummy site (for Google to follow) that all point to non-existent pages. Google would then see all these links as resulting in the same page and apply a penalty to the site.

I don't have an especially high opinion of Google technology, however, this scenario is most unlikely to be true.

Kaled.

PatrickDeese

2:22 pm on Sep 6, 2003 (gmt 0)

Um. Arnett in order to comply with the terms of service of the site I will simply state that your opinion is uninformed and utterly incorrect.

have a nice day.

Arnett

2:42 pm on Sep 6, 2003 (gmt 0)

Um. Arnett in order to comply with the terms of service of the site I will simply state that your opinion is uninformed and utterly incorrect. have a nice day.

Sure it is. I deleted a bunch of files from the server last September. I was surprised to see them in the index. When I checked the cache the page shown was the index page of the directory. The listing showed text from that page. That page was the target of the 404 handler. I didn't check all 40 pages from the server. I just uploaded them back to the server in order to avoid a duplicate content penatly which was sure that I already had. Now the index page that was replaced is #1 for its search term. YOU explain it. It's not the only site that I've noted the same thing happening to. I had a parked domain that wound up getting spidered as the 404 target. Same destination. That's one MORE duplicate page getting indexed.

GoogleGuy

3:01 pm on Sep 6, 2003 (gmt 0)

Hi Arnett, a 404 page isn't going to trigger a duplicate content penalty--I wouldn't worry about it. If a page is well and truly gone, I would use a 404 to let spiders know about that. It's fine to show custom content on the page, but your HTTP headers should return 404 so the spider knows the page it gone.

If you just return a 404 for pages that are gone, you'll be fine. It can take a while for spiders to discover the 404, but you wouldn't get any penalties anywhere in the process.

Hope that helps,
GoogleGuy

Yidaki

3:06 pm on Sep 6, 2003 (gmt 0)

Server Header Check [searchengineworld.com]

... rofl. :D

Arnett

2:55 am on Sep 7, 2003 (gmt 0)

Hi Arnett, a 404 page isn't going to trigger a duplicate content penalty--I wouldn't worry about it. If a page is well and truly gone, I would use a 404 to let spiders know about that. It's fine to show custom content on the page, but your HTTP headers should return 404 so the spider knows the page it gone. If you just return a 404 for pages that are gone, you'll be fine. It can take a while for spiders to discover the 404, but you wouldn't get any penalties anywhere in the process.

Thanks,GG. Here's an interesting problem that my webhost found regarding the setting of the header:

"I did some analysis of the HTTP response codes when using ErrorDocument and found this interesting tidbit (edited for readability):

Using this in .htaccess:
============================================================
ErrorDocument 404 [widgets.com...]
============================================================

Produces this:
============================================================
--12:38:50-- [widgets.com...]
=> `notfoundpage'
HTTP request sent, awaiting response...
1 HTTP/1.1 302 Found
Location: [widgets.com...] [following]
--12:38:50-- [widgets.com...]
=> `404.html'
HTTP request sent, awaiting response...
1 HTTP/1.1 200 OK
============================================================

And using this in .htaccess
============================================================
ErrorDocument 404 /404.html
============================================================

Produces this:
============================================================
--12:39:48-- [widgets.com...]
=> `notfoundpage'
HTTP request sent, awaiting response...
1 HTTP/1.1 404 Not Found
12:39:48 ERROR 404: Not Found.
============================================================

From an end-user point of view, the actual content is IDENTICAL, but from a robot standpoint, one is a 302 Found, and the other is a 404 Not Found -- big difference!

In the first case a web browser automatically follows the redirect to the page with the content. In the second, the page with the content is directly served as the error page."

In the first 404 setting does the Googlebot see BOTH the 302 found (which I'm guessing comes from a confirmation that the custom 404 file has been found) and the 404 not found?

I opted to go with the second case which just returns a 404 not found and displays the custom 404 error page.

jdMorgan

3:17 am on Sep 7, 2003 (gmt 0)

Arnett,

In the first case (ErrorDocument 404 [widgets.com...] the browser sees only a 302-Found response, because specifying the canonical URL in ErrorDocument invokes an external redirect, terminating the current request; A server cannot respond to one HTTP request with two status codes, so since the canonical URL must be passed to the browser, it is done with a 302-Found status.

This behaviour of ErrorDocument is documented here [httpd.apache.org].

With a 404 response, no URL is passed back to the browser, since there is (by definition) no replacement for the requested (missing) resource.

Jim

Arnett

3:20 am on Sep 7, 2003 (gmt 0)

With a 404 response, no URL is passed back to the browser, since there is (by definition) no replacement for the requested (missing) resource.

Thanks. So,by using the second case which only returns a 404 the visitor will see the custom error page,the spider will see the 404 and the search engine will delete the url from its index. Correct?

jdMorgan

4:12 am on Sep 7, 2003 (gmt 0)

Yes, Google will delete the page after a few spider visits, as will most others. Some of them never seem to get the message, though... Slurp keeps coming back for old stuff, even after six months!

But note that a 404 means "Not Found." It does not say that the page has been removed, and it does not say that the page will or won't ever return. This "hole" in the HTTP/1.0 server response codes was fixed in HTTP/1.1 [w3.org] by the addition of the 410-Gone status code.

Because of the ambiguity of the 404 response, you can't blame spiders for being conservative and checking several times before deleting a page -- The 404 might be because of a webmaster mistake or a faulty script or something. Because of this, they give you a fair chance to fix the problem before removing your pages.

In this thread [webmasterworld.com], I posted some code to return 410-Gone status for pages which are explicitly declared to be gone, but to check that the request is HTTP/1.1 before doing so; HTTP/1.0 clients won't understand a 410, so a 404-Not Found status is returned to them instead. You will have to add a RewriteCond qualifier to the code so that it discriminates between domains (I assume this thread is related to the same pages-listed-in-multiple-domains problem you've posted in another thread).

A review of the "Server Response Codes" sections (6.1.1 and 10.1 - 10.5.6) of the HTTP/1.1 document cited above is a good idea for anyone who wants to know what clients - including spiders - will do with 301, 302, 404, 410, or other server response codes. Often, the default server response is "safe", but not what you really want. A classic example being the plain-vanilla Redirect directive - People use it trying to get a spider to switch a page to a new URL. But it won't work, because it produces a 302 response by default, which means that the resource (page) has been moved to a new URL temporarily(!) Redirect 301, Redirect permanent, or RedirectPermanent is what is needed in this case.

Jim

GoogleGuy

5:15 am on Sep 7, 2003 (gmt 0)

I believe that's correct, Arnett. I'd serve up the 404 directly, not a 302 to a different page which returns a 404.

Arnett

5:36 am on Sep 7, 2003 (gmt 0)

I believe that's correct, Arnett. I'd serve up the 404 directly, not a 302 to a different page which returns a 404.

Thanks again. At last I can sleep!

abates

7:00 am on Sep 7, 2003 (gmt 0)

Inktomi keeps hitting pages which haven't existed on my site for a year or two now - Googlebot's a lot better behaved in this respect.

Arnett

7:27 am on Sep 7, 2003 (gmt 0)

Inktomi

I'm not even in Inktomi anymore. I put a NOINDEX in ONE of my pages. They deleted every page in my site. Yahoo is going to replace Google with Iktomi? ROFLMAO.

Arnett

8:41 am on Sep 7, 2003 (gmt 0)

Oh yeah GoogleGuy,

I've been meaning to mention this for awhile if I ever got hold of you. Google really should consider a Google Auction. eBay makes tons of money just taking a small percentage of each item that sells. Google could do better with an auction than with Froogle.

The big problem with eBay is,that since AOL took it over they charge too much for most people to be able to sell ther e profitably. There are too many fraudulent sellers now and now that they've bought paypal the scammers are taking over there too.

A well run Google auction with fraud protection,Google online paymenents and seller verification would go over well and make you people a TON of money.