Google Does Not Give Up on Crawling Old URLs

Still fetching dead page after four years.

         

lierduh

3:11 am on Jul 18, 2005 (gmt 0)

10+ Year Member



[17/Jul/2005:16:19:46 +1000] "GET /links.html HTTP/1.0" 410 305 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
[18/Jul/2005:06:48:03 +1000] "GET /links.html HTTP/1.0" 410 305 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

That page had been returning 404 for a number of years. According to the log, the bot even ignores the 410 it gets now. What can I do about this, other than individually adding the URLs to robots.txt and using the "automatic URL removal system"?
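For anyone wanting to serve a 410 like that, a single Apache mod_alias line does it (if you are on Apache; the path here is just the one from my logs):

# answer every request for the dead page with "410 Gone"
Redirect gone /links.html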

I read somewhere that 410 is only implemented in HTTP/1.1, not HTTP/1.0. The log shows Googlebot is using HTTP/1.0. Is this the actual protocol the bots use?

I have also found that adding files and directories to robots.txt does not stop Googlebot from indexing existing content, even though this is the method recommended by Google. The bot just stops visiting those pages, without deindexing the old content. This should be nothing new to you guys, but I am not sure why Google on the one hand asks webmasters to put the files/directories they want deindexed into robots.txt and wait, while no pages are actually deindexed.
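To be concrete, I mean entries of this form (the path is just the example from my logs):

# robots.txt - keeps Googlebot from fetching the page,
# but does not remove content that is already indexed
User-agent: *
Disallow: /links.html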

The same goes for 301 redirects. Google has already visited the old pages and fetched the new page, yet the old URL is still being indexed.
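For concreteness again, the redirect I mean is the usual one-liner (Apache mod_alias; both paths here are placeholders):

# permanent (301) redirect from the old page to its replacement
Redirect permanent /links.html http://www.example.com/new-links.html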

Another simple question: how long before Google considers removing a duplicate content penalty? I have already fed Google a sitemap and killed all the old links. Is this just a waiting game, or is it easier to move the whole thing to a new domain and start over again? :)

Lastly, would I be considered too naive if I emailed Google asking them to remove the duplicate content penalty? :)

ciml

8:46 am on Jul 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I read somewhere that 410 is only implemented in HTTP/1.1, not HTTP/1.0. The log shows Googlebot is using HTTP/1.0. Is this the actual protocol the bots use?

The HTTP/1.0 vs HTTP/1.1 distinction is largely irrelevant these days, due to widespread partial implementation of 1.1.

I'm fairly sure that Google sees all 4xx replies as dead.

> sticking files and directories to robots.txt does not stop Googlebot indexing existing contents

/robots.txt asks Google not to fetch the URL, but the URL can still be listed due to old data or current links. You can use the URL-removal tool, or use noindex in a META robots tag and remove the /robots.txt block (in that case, Google will remove the URL after it fetches the page and sees that it shouldn't be there).
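For reference, the tag in question goes in the page's <head>:

<meta name="robots" content="noindex">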

> will I be considered too naive to email Google asking them to remove duplicate contents penalty? :)

With eight billion pages, and many support emails each day, I wouldn't rely on being able to get Google to intervene to help your Web site.

Duplicate (or near duplicate) content is really material for a separate thread (we had one recently). There are quite a few things involved.

lierduh

9:23 am on Jul 18, 2005 (gmt 0)

10+ Year Member



Thanks for the reply. :)

I wondered about this and could not find a straight answer in the WebmasterWorld search. I have now confirmed that, when using the "automatic URL removal system", the following format works:

Disallow: /product_* # this will match /product_(.*).html

"Disallow: /product_*.html" does not work with the removal system, even though Googlebot does honour that wildcard form during a normal crawl.

g1smd

10:37 am on Jul 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If there is an incoming link to any URL that no longer exists, Google will continue to request that page for evermore - this is to check the status of the URL.

They keep a list of every URL that has ever existed, and of every URL found in every link they have ever seen (the two are not the same - the latter includes links with typos in them), and they check the status of them all from time to time.

There is a difference between Google checking the status of a URL, and actually including that URL in the SERPs.

You would be annoyed if you actually put some content back on that page and Google then failed to find it. That is why they keep checking every link and every URL.

The system is broken if those URLs continue to appear in the SERPs.

claus

10:15 pm on Jul 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've seen reports that serving a 410 can make Google remove a page that was not removed in spite of a 404.

As for HTTP protocol versions, Googlebot is not a full HTTP/1.1 client, but it is not a pure "1.0" client either. It does "speak" HTTP/1.1, although it's not its mother tongue, so to speak ;)
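If you want to check what your own server sends back to a 1.0-style request, curl can force the older protocol version (the URL is a placeholder):

curl -I --http1.0 http://www.example.com/links.html

-I requests the headers only, and --http1.0 makes the request as HTTP/1.0, so you can confirm that the "410 Gone" status line comes back either way.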



Btw: Welcome to WebmasterWorld, lierduh :)

g1smd

10:38 pm on Jul 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Even after removing a page from the SERPs, the bot will still test your website from time to time to check that it is still gone.

To me, that is the correct behaviour. No longer listed because it has gone. Continue testing to see if it ever comes back.

lierduh

12:24 am on Jul 19, 2005 (gmt 0)

10+ Year Member



I can understand it if a 404 is served, but there has to be some difference in treatment between 404 and 410. 410 means "Gone", which means there is no point in probing. If the webmaster decides to put the old file back, then it is up to him/her to let the SE know again.

It is, however, understandable that Google follows the external links in. I can only put this down to:

1) Google uses HTTP/1.0, which does not know what 410 means.
2) Google does not, but ideally should, keep a 410 record for web sites, and/or consult the sitemap if one is available. One would think it is easier to check Google's own database than to keep hitting the dead link for years to come.
---
Thanks, claus, for the welcome. :)
I have been reading here for a while and have learnt a great deal.
PS. I think one of the moderators might have added the subtitle "Still fetching deal page after four years" to my initial post. I note that "deal" should perhaps be "dead". Could someone please correct that?

g1smd

10:26 am on Jul 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> If the webmaster decides to put the old file back, then it is up to him/her to let the SE know again. <<

How would the webmaster do that? I assume by linking to it - but the "previously gone" page may already have old broken links pointing to it from years ago (some of which Google may not yet have found). So how would Google distinguish between an old, previously-broken link that they have only just discovered and a newly placed link to the newly recreated page? There is no way to check the status of that link other than attempting to access the page it points to.

Imagine the problem if someone takes down their old website, serves every request as "410 Gone", and then sells the domain to you. If the entire site is "410 banned" then how the hell are you ever going to get it reindexed?

lierduh

11:30 am on Jul 19, 2005 (gmt 0)

10+ Year Member



Well, I think (a rough sketch in code follows this list):

1) If an internal link points to the previously-410 page, then it means the page is back.
2) If the sitemap says so, then remove the 410 reference from the record.
3) If [site...] returns 410, then ignore it.
4) If the said page is submitted to the SE for indexing, then probe it. (This will overcome the entry page being 410.)

Edit: added 3).
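In rough Python terms, the policy I have in mind would be something like this (purely a sketch to make the idea concrete; every name here is made up):

from dataclasses import dataclass
from typing import Optional

@dataclass
class UrlRecord:
    status: int               # last HTTP status seen for this URL
    internally_linked: bool   # a live page on the same site links here now
    in_sitemap: bool          # listed in the site's current sitemap
    resubmitted: bool         # webmaster resubmitted the URL for indexing

def should_probe(record: Optional[UrlRecord]) -> bool:
    """Decide whether a crawler should fetch a URL again."""
    if record is None or record.status != 410:
        return True  # unknown URL, or not known-gone: normal crawl rules apply
    # re-probe a known-410 URL only on a fresh signal that it is back
    return record.internally_linked or record.in_sitemap or record.resubmitted

print(should_probe(UrlRecord(410, False, False, False)))  # False: leave it alone
print(should_probe(UrlRecord(410, True, False, False)))   # True: probably back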

g1smd

11:40 am on Jul 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Rinse and repeat 8 billion times...? Google hasn't got the server power for that.

MichaelCrawford

12:25 pm on Jul 19, 2005 (gmt 0)



Why don't you put up a page at that URL that serves some useful purpose? Maybe there are some incoming links and it would get some pagerank. If it then linked to other pages on your site, the pagerank would flow onto them.

It doesn't have to have the same content or even the same purpose as the original page did.

claus

8:38 pm on Jul 20, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> 1) Google uses HTTP/1.0 which does not know what 410 means.

From all the reports I've seen, Googlebot does understand a code 410 even though it registers as HTTP/1.0 in the server logs. That was what I intended to say with the above post: even though a 1.0 user agent would normally not recognize that server status code, Googlebot seems to handle it anyway.

However, I have not been able to find any mention of this on official Google pages, and I don't recall it being mentioned by Google representatives either. Google always seems to recommend a 404.

So it will recognize that the file is now "Gone" and the file will be removed from the results pages, but it might still try to fetch the page sometimes (especially if there are strong external links to it).