Google crawling old 404 pages again and again?

         

Dan_Vendel

10:28 am on May 24, 2003 (gmt 0)



Hi,

Perhaps wrong forum, if so, I apologize.
I'm just curious how come Google is still crawling and trying to reach pages I deleted from the site 8-10 months ago. It's getting 404's, but still coming back after a week or so.

Dan

Shak

10:34 am on May 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Dan,

Welcome to Webmasterworld [webmasterworld.com]

Are there still backlinks "somewhere" pointing to those pages?

Shak

Dan_Vendel

12:29 pm on May 24, 2003 (gmt 0)



Thanks Shak,

No, AFAIK, there's no links anywhere to the pages in question. They were mock-ups for a design job I had last year, and I kept them on my own server for 2-3 months.

I really don't care, just wondering how many 404s Googlebot needs before it understands that the pages are gone...

BTW: Actually, I've been here since August 2001, but not that frequently. Then I lost my pass/log, so I had to create a new identity.

Cheers,
Dan

dmorison

12:33 pm on May 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi,

You're not serving a custom 404 page that doesn't actually return a 404 status, are you?

That would certainly have Googlebot coming back time and time again because it has no idea it is really a 404...
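One way to check this is to request a URL that should be missing and look at the status code that comes back, not at the page you see. The following is a minimal sketch in Python using only the standard library; the function names `fetch_status` and `looks_like_soft_404` are made up for illustration, and the URL in the usage note is a placeholder.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server finally answers with for `url`."""
    request = urllib.request.Request(url)
    try:
        # urlopen follows redirects, so a redirect to a "not found" page
        # that is served with 200 will show up here as 200.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the code is still the server's answer.
        return err.code


def looks_like_soft_404(url: str) -> bool:
    """True if a URL that should be missing is NOT answered with 404 or 410."""
    return fetch_status(url) not in (404, 410)
```

For example, `looks_like_soft_404("http://www.example.com/no-such-page.html")` returning True would suggest the error page is being served with a non-error status.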

Critter

12:40 pm on May 24, 2003 (gmt 0)

10+ Year Member



Actually, if the custom page is through Apache or IIS (in the configuration for error pages) the web server will return a "HTTP/1.X 404 Not Found" header.

If it's your own custom 404 page made with, for example, PHP, you will need to put this header in yourself, like header('HTTP/1.1 404 Not Found');
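The same idea applies to any hand-rolled error page, whatever the language: the friendly page body and the 404 status have to be sent together. Here is a rough sketch using Python's built-in `http.server`; the `PAGES` table, `CUSTOM_404` body, and `build_response` helper are hypothetical names for illustration, not part of any real site.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical site content, for illustration only.
PAGES = {"/": b"<html><body>Home</body></html>"}
CUSTOM_404 = b'<html><body><h1>Not found</h1><a href="/">Home</a></body></html>'


def build_response(path: str) -> tuple[int, bytes]:
    """Return (status, body); unknown paths get the custom page AND a 404."""
    if path in PAGES:
        return 200, PAGES[path]
    return 404, CUSTOM_404


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = build_response(self.path)
        # The status line is what a crawler acts on, not the page text.
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, pass `Handler` to `http.server.HTTPServer(("127.0.0.1", 8000), Handler)` and call `serve_forever()`.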

(My 404s say "HTTP/1.x 404 Page is buh-bye") :)

Peter

Dan_Vendel

12:57 pm on May 24, 2003 (gmt 0)



dmorison, Critter,

I think you might be on to something. I *think* <blush> it's Red Hat and Apache (host is <snip>). I do have a custom made 404, in plain html (but suffix .shtml) since I need a link to my site's home page.

Should I add that snippet in the head of that page? And what about the custom made 400, 401, 403 and 500?

D

[edited by: NFFC at 1:14 pm (utc) on May 24, 2003]
[edit reason] URL snipped [/edit]

Visi

12:59 pm on May 24, 2003 (gmt 0)

10+ Year Member



The purpose of the freshbot, and its change to verifying 404 pages from older indexes, has been noted previously. The change was seen in March and April, with freshbot working from previous indexes, perhaps as much as 3 months old. We verified the 404 pages then, and they were giving the correct response.

With everything happening this month at Google, the purpose of the freshbot has changed, IMHO, to verification of the database as much as anything else. It is working not only from the last crawl data but also from previous crawl data in an attempt to verify. Discussions on how to attract the freshbot, whether it relates to PR, actual page changes, etc. seem off base compared to what we are seeing in our stats.

The cycle prior to this month was: deep crawl, freshbot verifying the older database, then fresh tags on pages that existed. The tendency of freshbot to act as a deep crawler may just be the above happening. It is not following links like the deepbot, but following the database that exists. The key question seems to be which version of the database it is following. Hence the 404 errors.

At the time, we noticed a marked improvement in the accuracy of our listings (404s removed), and yet freshbot still wanted to check non-listed pages. Our conclusion, after plowing through the Dominic update posts, is that the above is what we are seeing today. Google has reverted to an older database and is verifying that pages still exist in an attempt to make its results current. Late last year there was a problem with the removal of older pages from their database. From what we have seen with our site, if they have brought this strategy forward, it will be a positive result.

Dan_Vendel

1:04 pm on May 24, 2003 (gmt 0)



BTW:

<quote>
you will need to put this header in yourself
</quote>

Isn't it better if I put it in the page? (Sorry, just couldn't resist) :-)

D

Critter

1:13 pm on May 24, 2003 (gmt 0)

10+ Year Member



Ack, Eek, Groan!

Peter

Clark

1:13 pm on May 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



G is definitely crawling real old pages. Real old...

annej

3:53 pm on May 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just put <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> on my 404 not found page and it seems to have solved the problem. I haven't had Google listing 404 pages since.

I must admit WebmasterWorld gets the prize for its 404 page. It cracks me up every time I get it.

jdMorgan

11:40 pm on May 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you have removed a page, and have no replacement for it, then a 410-Gone is the proper server response. If you use a 404, the spider will assume that your server is having problems, and will "give you a break" by trying to retrieve that file for a few months.

404-Not Found means the file was not found for unspecified reasons, but this condition is not necessarily permanent.

410-Gone means it's really, really gone, and the condition is permanent.
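To make the distinction concrete, here is a small sketch of how a crawler could treat the two codes differently. This is purely hypothetical: the function name `recrawl_delay_days` and the 7-day retry value are invented for illustration and say nothing about how Googlebot actually schedules its visits.

```python
from http import HTTPStatus
from typing import Optional


def recrawl_delay_days(status: int, retry_days: int = 7) -> Optional[int]:
    """Days until a hypothetical crawler asks again, or None to drop the URL."""
    if status == HTTPStatus.GONE:
        # 410: the server says the removal is permanent, so stop asking.
        return None
    if status == HTTPStatus.NOT_FOUND:
        # 404: no reason given and possibly temporary, so try again later.
        return retry_days
    raise ValueError(f"this sketch only covers 404 and 410, got {status}")
```

With these made-up numbers, `recrawl_delay_days(404)` gives 7 and `recrawl_delay_days(410)` gives None.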

Ref: RFC 2616 HTTP/1.1 [w3.org] Hypertext Transfer Protocol/1.1

HTH,
Jim

Newman

11:56 pm on May 24, 2003 (gmt 0)

10+ Year Member



My site has been completely deleted from the index. All I have left is the home page and one page with a 404 error.

404 Not Found
Not Found. The requested URL was not found on
this server. Apache Server at mysite.com.
www. mysite.com/page.html

But the point is, my pages are in .htm format, not .html.
I have a DMOZ link and a link in the Google directory. Lots of fresh pages… but nothing…
I’m tired.

Please help!

Dan_Vendel

6:42 am on May 25, 2003 (gmt 0)



jd, Thanks! Will try that instead. But how do I tell the server to bring up a 410 instead of a 404?

Newman, sorry to hear that. But please don't hijack a thread! Start your own!

D

jdMorgan

4:51 pm on May 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Dan_Vendel,

> But how do I tell the server to bring up a 410 instead of a 404?

That depends on what server you are hosted on.

For Apache, you declare the page, resource, or directory name (or a wildcard), and tell the server to respond with 410-Gone. For example, using mod_rewrite:


RewriteRule ^defunct\.html$ - [G]
RewriteRule ^removed-directory - [G]
RewriteRule ^discounts/discount-(.*-)+widgets\.html$ - [G]
RewriteRule \.mp3$ - [G]

Note that the second rule does not have an end-anchor; this allows any request for either that directory or its contents to match and return a 410-Gone status. The third one returns 410-Gone for "discounts/discount-<anything>-widgets.html", but not for "discounts/discount-widgets.html", and the last will return a 410 for any request for an mp3 file. These are just examples; the point is that you don't have to specify each and every removed resource individually.
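If you want to sanity-check patterns like these before putting them on a live server, you can run them against sample paths offline. The sketch below uses Python's `re` module, which handles these particular patterns the same way; it assumes the rules live in a per-directory context such as .htaccess, where the path is matched without a leading slash. The helper name `is_gone` and the sample paths are made up for illustration.

```python
import re

# The four patterns from the rules above, without the "- [G]" part.
GONE_PATTERNS = [
    r"^defunct\.html$",
    r"^removed-directory",
    r"^discounts/discount-(.*-)+widgets\.html$",
    r"\.mp3$",
]


def is_gone(path: str) -> bool:
    """True if any pattern matches, i.e. the request would get a 410."""
    return any(re.search(pattern, path) for pattern in GONE_PATTERNS)


if __name__ == "__main__":
    for sample in (
        "defunct.html",
        "removed-directory/old/page.html",
        "discounts/discount-blue-widgets.html",
        "discounts/discount-widgets.html",
        "music/song.mp3",
        "index.html",
    ):
        print(f"{sample}: {'410' if is_gone(sample) else 'not matched'}")
```

This only checks the regular expressions; it does not emulate the rest of mod_rewrite's processing.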

If you wish to serve a custom 410 error page, you can declare that in the same way you declare a custom 404 error page:


ErrorDocument 410 /custom410.html

On IIS and other MS-based servers, some combination of .asp scripts and control panel settings can likely be used to do the same thing.

HTH,
Jim