Google and 404's

finally cleans up database this month

         

Visi

12:08 am on Mar 4, 2003 (gmt 0)

10+ Year Member



Some interesting things I have noticed this month. Although the time between updates has been longer than normal, activity from crawlers 11 and 12 has been extensive this month. More important to me is that the crawlers were actually being served 404 errors for a change. In other words, Google was finally checking its database against my site. I have heard that pages are supposed to be out of the database after 2-3 months, but for us it has been closer to 6 months. I have noticed some other posts in the past few months also noting this. Google has very effectively updated our links this month, removing the pages that are no longer on our site. Call this everflux if you want, but for us it is an update to reflect what is now on our site. The freshbot has gone deeper this month than we have ever seen before, including over 80 visits in the last 2 days.

Is it possible that Google is attempting to verify its database this time around, and finally cleaning out the 404 pages? It sure seems like that to us. We are also seeing a high degree of Slurp activity, which, as we indicated earlier, seems to be a confirmation of their database as well. Slurp's activity during the past month has been even greater than Google's.

Comments from those who follow this more closely than I do?

Brett_Tabke

5:28 am on Mar 13, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



You may have been right. There was some discussion last month about an abnormal amount of 404's in the db. I've not run into a single 404 at this point with the new index.

How did you fare since the update?

JeremyL

5:50 am on Mar 13, 2003 (gmt 0)

10+ Year Member



I did a search for "404 error" on Google and came back with about 80K results. Not all of them will be actual 404 pages, as some will be pages discussing 404s, but for the most part they almost all seemed to be error pages. Anybody know the number before the new index?

On a funny side note, check out [homestarrunner.com...] . That's the funniest 404 error page I've found so far :)

Visi

10:02 pm on Apr 9, 2003 (gmt 0)

10+ Year Member



Since the last update most of the previous 404 errors are gone, and freshbot was hitting pretty deep this month on the pages we added as replacements. That was early in the crawl; since then there has been much less activity by the freshbot, only 1-2 layers deep, with few 404 errors being tabulated. Seems Google cleaned up a lot in the database last month. Big changes from the previous 6 months. From what I have seen recently, last month's update, at least in our area, was a major refresh and verification. Whatever the reason, their database is much cleaner now :)

killroy

1:41 am on Apr 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm, on one of my sites I simply set the homepage as the 404 error page. How will Google deal with that? Will it honour the 404 server response?

SN

jdMorgan

2:07 am on Apr 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There are a lot of screwed-up sites out there - sites which return a 404 page, but not a 404-Not Found server response code. A common error in configuring Apache ErrorDocuments can cause this, as can incorrectly-written custom error-handling scripts. It's always a very good idea to check your server response [webmasterworld.com] to requests for missing pages after setting up custom 404 handlers. I'd be willing to bet that a lot of those 404 pages have this problem.
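For anyone who would rather script that check than use a web tool, here is a minimal Python sketch. It is only an illustration: `fetch_status` and `diagnose_missing_page` are made-up helper names, and the wording of the verdicts is an assumption, not anything the engines publish. The idea is to request a URL you know does not exist, without following redirects, and look at the raw status code the server sends.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the raw HTTP status code for url.

    http.client never follows redirects, so a 302 shows up as 302
    instead of being hidden behind the page it points to.
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()


def diagnose_missing_page(status: int) -> str:
    """Classify the status code returned for a page known to be missing."""
    if status in (404, 410):
        return "ok: the server reports the page as missing"
    if status in (301, 302, 303, 307, 308):
        return "problem: the server redirects instead of reporting an error"
    if 200 <= status < 300:
        return "problem: an error page is served with a success code"
    return "check manually: unexpected status"
```

Usage would look something like `diagnose_missing_page(fetch_status("http://www.example.com/no-such-page-12345"))`, where the host and path are placeholders for your own site and a file you know is not there.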

Jim

Visi

3:58 pm on Apr 11, 2003 (gmt 0)

10+ Year Member



Brett, for your question in the Cassandra thread... the 64.68.82.78 bot is definitely rerunning the database, checking the 404 pages again. These pages were long gone before the last 4 crawls. It started from a database at least two crawls old. The cycle looks like: crawl, update, then a check of pages from the previous crawl by the IPs we call freshbots.
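For readers who want to make the same kind of observation from their own logs, a rough Python sketch of tallying 404 responses per client IP follows. It assumes the access log is in the usual Apache common/combined format; the sample lines (paths, timestamps, byte counts, and the second IP) are invented for illustration, and only the 64.68.82.78 address comes from the post above.

```python
import re
from collections import Counter

# Matches the start of an Apache common/combined log line (assumed format).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

# Hypothetical sample lines, not real log data.
SAMPLE_LINES = [
    '64.68.82.78 - - [11/Apr/2003:15:40:01 +0000] "GET /old/a.html HTTP/1.0" 404 512',
    '64.68.82.78 - - [11/Apr/2003:15:41:09 +0000] "GET /old/b.html HTTP/1.0" 404 512',
    '64.68.82.78 - - [11/Apr/2003:15:42:30 +0000] "GET /index.html HTTP/1.0" 200 8192',
    '192.0.2.10 - - [11/Apr/2003:15:43:12 +0000] "GET /old/a.html HTTP/1.0" 404 512',
]


def count_404s_by_ip(lines):
    """Count how many 404 responses each client IP received."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("status") == "404":
            counts[match.group("ip")] += 1
    return counts
```

Calling `count_404s_by_ip(open("access.log"))` on a real log (the file name is a placeholder) and looking at `.most_common(10)` gives a quick picture of which crawler IPs are rechecking dead pages.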

chiyo

4:33 pm on Apr 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK jdMorgan, when I tried the tool on a non-existent file on our host, where we have set up custom 404 error pages via their control panel (Pair Networks), we got the "302-Found" response. Does this mean that if this was once a legitimate page but is now removed, Google will never know? Any experts out there (especially ones using Pair) who can tell me how to sort this out so Google and others really DO know it's gone?

Visi

1:05 am on May 21, 2003 (gmt 0)

10+ Year Member



Knew this post was around here somewhere:)

April 11, 2003:

Have to say no at this time. Have commented on this before and am convinced that this is now a function of freshbot. Seems that the deep bot just crawls and then, during the following month, the freshbot confirms the 404 errors. Have seen this cycle over the past couple of months, and it tends to take 8-10 weeks after the deep crawl for the database to shed all the old links. On the other hand, these links may not be displayed, but freshbot seems to be one cycle behind the deep crawl (maybe even 2 cycles?) in terms of which database it is checking 404 errors against. Just our observations on what we have been seeing in our logs since a major site update last year. Also noticed that Slurp has been rehitting the 404s extensively over the last 12 weeks. Yes, our 404 page is correctly serving up a 404 code. In this area, since the announcement with Yahoo, it has become repetitive and frequent. Not quite sure what that means yet, but no doubt here that both engines are trying ways to get their databases current. Just my 2 cents worth.

May 20, 2003
Sounds pretty similar to what we are reading about Dominic? See the questions about a 2-month-old database being used. Perhaps we just didn't realize what it meant at the time. From what I have seen, IMHO Google was testing out the system beforehand and is finally implementing it. Results were improving as far as dead pages go at that time (read: better database), so perhaps Dominic is just an extension of this? If so, this is a planned event, not a "screwup" as is being implied in a number of the threads.

So just some food for thought.

steve128

1:29 am on May 21, 2003 (gmt 0)



I am one of them. I have 36 404s myself, and through no fault of mine; Google decided to re-include them.

jdMorgan

2:54 am on May 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



chiyo,
I missed your post. On Apache, a very common mistake in setting up ErrorDocument directives can cause the server to return a 302 instead of the desired 403, 404, or 410. See the Apache ErrorDocument documentation - there is a specific warning about this.
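To illustrate the mistake: per the Apache documentation, pointing ErrorDocument at a full URL (anything with a scheme such as http:// in front of it) makes the server send the client a redirect, so the original error code is lost, whereas a local path starting with a slash keeps it. The small Python check below is only a sketch for scanning your own config lines; `errordocument_redirects` is a made-up name and the example.com directives are hypothetical.

```python
import re

# A target that starts with a scheme, e.g. "http://", is a remote URL.
_REMOTE_URL = re.compile(r'^[a-z][a-z0-9+.-]*://', re.IGNORECASE)


def errordocument_redirects(directive: str) -> bool:
    """Return True if an ErrorDocument line would cause a redirect.

    Apache redirects the client when the target is a full URL, even if
    that URL is on the same server, so the client sees a 302 instead of
    the intended error code.
    """
    parts = directive.split(None, 2)
    if len(parts) < 3 or parts[0].lower() != "errordocument":
        raise ValueError("not an ErrorDocument directive: %r" % directive)
    target = parts[2].strip().strip('"')
    return bool(_REMOTE_URL.match(target))


# Hypothetical directives for illustration:
BAD = "ErrorDocument 404 http://www.example.com/404.html"   # client gets a redirect
GOOD = "ErrorDocument 404 /404.html"                         # client gets the 404
```

If your host's control panel writes the directive for you, it is worth looking at the generated .htaccess to see which of the two forms it used.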

All,
Most engines will do a lot better job of removing your obsolete pages if you tell them the pages are gone with a 410-Gone server response. 404 errors are defined very vaguely, and most search engines assume that there might just be a temporary problem with your site if they get a 404; they'll keep trying for a few weeks or months. This is a reasonable assumption, if you read the 404 server code description in the HTTP/1.1 RFC. So don't be mad at the engines for trying to give you a break. If you remove a resource intentionally and want it de-listed faster, then set up a 410-Gone response for it.
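As a rough sketch of the distinction, here is a tiny Python server that answers 410 for pages removed on purpose and 404 for everything else it does not know about. This is not how you would do it on Apache; it is a standard-library illustration only, and the page table, the removed-path list, and the port are all hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGES = {"/": b"home page\n"}        # hypothetical live pages
REMOVED = {"/old-catalog.html"}      # hypothetical pages removed on purpose


def status_for(path: str) -> int:
    """Pick the status code: 200 if live, 410 if removed on purpose, else 404."""
    if path in PAGES:
        return 200
    if path in REMOVED:
        return 410
    return 404


class GoneAwareHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status = status_for(self.path)
        # Live pages get their content; errors get a short plain-text body.
        body = PAGES.get(self.path, ("%d\n" % status).encode())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Run the demo server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), GoneAwareHandler).serve_forever()
```

Calling `serve()` and requesting `/old-catalog.html` should show the 410 in the response headers, while any other unknown path falls through to a plain 404.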

Another advantage of returning correct server headers is that your 404 error log will not be cluttered up with junk. A 404 error will once again be a call to immediate action, because it will indicate a real problem.

Jim