
Ancient URLs being crawled...

Googlebot feeling nostalgic?

5:11 am on Oct 29, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:May 27, 2003
posts:503
votes: 0


Mostly an FYI post, but the tinfoil hat crowd might find something interesting...

A little over a year ago I did a massive renaming of most of the pages on one of my sites. With careful planning and judicious use of mod_rewrite, by the end of the year Google was crawling the new URLs and had (seemingly) forgotten the old ones, my rankings were not negatively affected, and the change was virtually transparent to new and old visitors.
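For anyone curious how a renaming like that is usually handled, the standard approach is a permanent (301) redirect from each old URL to its new one. A minimal .htaccess sketch, with made-up paths standing in for a real naming scheme:

```apache
# Turn on the rewrite engine
RewriteEngine On

# Map an old URL pattern to its renamed equivalent with a 301
# (hypothetical paths -- substitute your own old/new scheme)
RewriteRule ^old-widgets/(.*)\.html$ /widgets/$1.html [R=301,L]

# A one-off rename can also use a plain Redirect directive
Redirect 301 /about-us.htm /about.html
```

The 301 status is what tells Googlebot the move is permanent, so it transfers the listing (and, eventually, stops requesting the old URL).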

Over two years ago, I deleted several pages and let them die a 404 death. Google (and others) caught on to this pretty quickly, and of course the pages were dropped.

Earlier today, in a real "blast from the past," Google requested all these old URLs, filling my error log with 404s.

FWIW, all the bad requests came from the "new" "Mozilla/5.0 (compatible; Googlebot/2.1;[...]" bot, coming from the 66.249.65.x range. This new bot has previously scraped the site successfully.
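If you want to confirm which bot and IP range is behind the 404s in your own logs, a quick scan does it. A minimal sketch, assuming the common Apache combined log format (the sample entries below are invented):

```python
import re

# Regex for the Apache combined log format:
# IP, identd, user, [timestamp], "request line", status, bytes, "referer", "user agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_404s(lines):
    """Return (ip, url) pairs for 404s from Googlebot in the 66.249.65.x range."""
    hits = []
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        if (m.group("status") == "404"
                and "Googlebot" in m.group("agent")
                and m.group("ip").startswith("66.249.65.")):
            hits.append((m.group("ip"), m.group("url")))
    return hits

# Invented sample entries for illustration
sample = [
    '66.249.65.37 - - [29/Oct/2004:05:11:02 +0000] "GET /old-page.htm HTTP/1.1" '
    '404 209 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.5 - - [29/Oct/2004:05:12:44 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/4.0"',
]
print(googlebot_404s(sample))
```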

5:41 am on Oct 29, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 31, 2003
posts:386
votes: 0


Yep, same here.

Over 40,000 old pages crawled. Most of them are still valid (I didn't use 301s, just let the old links die a slow death), so they didn't cause 404s, but they're lots of old pages nonetheless.

6:00 am on Oct 29, 2004 (gmt 0)

Full Member

10+ Year Member

joined:Oct 6, 2004
posts:216
votes: 0


Same here. Thousands of 404s from pages deleted around January this year. What is Google doing?

6:16 am on Oct 29, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Apr 14, 2003
posts:438
votes: 0


It did the exact same thing around the same time last month, but with the plain "Googlebot" user agent.

regards,
Mark

6:40 am on Oct 29, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member powdork is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Sept 13, 2002
posts:3347
votes: 0


Perhaps Google is preparing to unveil the new higher-capacity index and they're grabbing as many indexable URLs as possible (fingers crossed). Brett has made reference to the likelihood of a massive update this fall. It is about three weeks until the Vegas search conference, which falls roughly on the anniversary of Florida. Even if they were just to suddenly include all the sites they have ignored since March, the results would be tumultuous. If any of this is true, we are in for some big fun!
Bring it on!

>sets his tin foil hat back on the desk.

6:51 am on Oct 29, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Apr 14, 2003
posts:438
votes: 0


Long, long longshot, but... maybe they finally perfected reading webpages as users do, as stated by GoogleGuy here [webmasterworld.com...], and are building the new index with the new bot.

regards,
Mark

11:28 am on Oct 29, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:Oct 1, 2002
posts:96
votes: 0


Could be Google checking to see whether pages in their "Supplemental Results" index are still valid.

11:39 am on Oct 29, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member billys is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 1, 2004
posts:3181
votes: 0


I see a very similar pattern to the one described in the Googlebot running hard thread (http://www.webmasterworld.com/forum3/25897.htm). Grabbing MANY pages.

I've seen this happen after a SERPs update, but no one seems to be reporting anything.

12:07 pm on Oct 29, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 21, 2002
posts:1051
votes: 0


I have seen something similar. Quite a few of my obsolete pages are now appearing in the index, usually marked "Supplemental". These all had 301 redirects and had disappeared from the index, but since the last toolbar PR update they are back, with caches typically dating from February.

Also, my index page is showing again (as well as www.domain.com). It's just a URL without a snippet, but it has a PR3. I had earlier gone to a lot of trouble to get rid of it by ensuring internal links pointed to root. And anyway, I thought Google had fixed this double-entry problem long ago.

1:48 pm on Oct 29, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 31, 2003
posts:386
votes: 0


Googlebot's at it again today. The surprise is the rate of crawl on my end... I'm seeing a constant 30 pages per second right now. No big whoop for my site, which isn't even breathing hard, but I've never seen such a rapid crawl.

1:53 pm on Oct 29, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 31, 2003
posts:386
votes: 0


Make that 50 pages per second.

Hoo hoo. That's-a spicy meat-a-ball.

2:00 pm on Oct 29, 2004 (gmt 0)

Full Member

10+ Year Member

joined:July 9, 2003
posts:233
votes: 0


I have had 70,000 hits so far this month, and still going strong!

2:23 pm on Oct 29, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 31, 2003
posts:386
votes: 0


Over 120,000 hits *today*. :)

6:55 am on Oct 30, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 3, 2003
posts:963
votes: 0


I'm noticing tons of URLs being crawled on almost all the websites I monitor. The speed is mind-blowing! Way to go, Googlebot...

davegee

9:13 am on Oct 30, 2004 (gmt 0)

Inactive Member
Account Expired

I had noticed this strange behaviour of Googlebot as well. It was trying to dredge up long-deleted URLs of all my biggest naming mistakes from when I first created my site ;-) ... (things like capitals in folder/htm filenames, spaces in folder/htm filenames, etc.!)

I thought they were long removed from Google's index, but I wondered whether they might just be doing a massive "spring clean," double-checking all their old URLs before deleting them for good.

10:31 am on Oct 30, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 13, 2004
posts:833
votes: 12


"Mozilla/5.0 (compatible; Googlebot/2.1;[...]" bot, coming from the 66.249.65.x range. This new bot has previously scraped the site successfully.

Don't forget this version of Googlebot now requests GZIP-compressed pages using HTTP 1.1 instead of 1.0, so it can typically go four times faster. The old Googlebot did not request GZIP'd pages.

On Sep 30th, Oct 6th and Oct 28th I noticed the new bot requesting GZIP'd pages in my logs.
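The bandwidth difference is easy to demonstrate: gzip typically cuts HTML to a fraction of its raw size, so the bot can pull far more pages through the same pipe. A quick sketch (the sample page is invented; a real bot signals support by sending an `Accept-Encoding: gzip` request header):

```python
import gzip

# A made-up chunk of repetitive HTML, standing in for a real page
html = ("<html><body>"
        + "<p>widget widget widget</p>" * 500
        + "</body></html>").encode("utf-8")

# What the server would send back to a gzip-capable client
compressed = gzip.compress(html)

ratio = len(compressed) / len(html)
print(f"raw: {len(html)} bytes, gzipped: {len(compressed)} bytes "
      f"({ratio:.0%} of original)")
```

Real-world HTML isn't quite this repetitive, but 70-80% savings are common, which lines up with the "roughly four times faster" figure above.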

See thread:
[webmasterworld.com...]

Google's also got to be working on the hijacked-websites problem, so perhaps this is somehow related: crawling old pages to figure out who owned the material first?

2:41 pm on Oct 30, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member powdork is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Sept 13, 2002
posts:3347
votes: 0


I thought they were long removed from google's directory, but I wondered whether they may have just been doing a massive "spring clean" and double checking all their old urls before deleting them for good?
Hmmmm, perhaps they are on a crusade to delete as many old URLs as possible as a stopgap measure to allow some newer pages into the main index.
Just another in a series of wild guesses.

4:00 pm on Oct 30, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 23, 2004
posts:102
votes: 0


Around 10,000 old pages of my website are listed with Google. I think Google has loaded up an old database.

To me, it's like Google is planning a major update to the SERPs.

10:43 pm on Oct 30, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Google has a record of the URL of every page they have ever seen, as well as a record of every link they have ever seen (whether or not there was a typo in that link, and whether or not the page it points to still exists), so I guess they go through and re-check old URLs to see what their status is now. They cannot know whether a page you removed a year ago has come back until they ask for it again. There may also be a page somewhere that still points to it, so they will want to check that out.

If they are just asking for old pages, then it is a status check of their old data and nothing to worry about at all. If, however, they are putting references back into their index for pages that don't actually exist, then they have a big problem.
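That re-check boils down to a simple decision per stored URL. A hypothetical sketch of the logic (the action labels are my own, not anything Google has documented):

```python
def recheck_action(was_indexed, current_status):
    """Decide what to do with a stored URL after re-fetching it.

    was_indexed    -- whether the URL is currently in the index
    current_status -- HTTP status code returned by the fresh fetch
    """
    if current_status == 200:
        # Page exists: restore it if it had been dropped, otherwise keep it
        return "index" if not was_indexed else "keep"
    if current_status in (301, 302):
        # Moved: follow the redirect and credit the target URL instead
        return "follow_redirect"
    if current_status in (404, 410):
        # Gone: drop it from the index, but keep the URL on record
        # so it can be re-checked again later (as this thread shows)
        return "drop"
    # Server errors, timeouts, etc.: try again another day
    return "retry_later"

# A page removed a year ago that has come back should be re-indexed
print(recheck_action(False, 200))
```

Under this model, a crawl of ancient URLs is harmless bookkeeping; the "big problem" case above corresponds to returning "index" for URLs that actually 404.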

11:24 pm on Oct 30, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 21, 2002
posts:1051
votes: 0


If however, they are putting pages references back into their index for pages that don't actually exist then they have a big problem.

I think they have already done so, and are now trying to sort it out. Many old URLs seemed to appear at the time of the last PR update, and as has been noted in other threads, there were anomalies in the updated PR.

There was a massive crawl just before (or during?) the update, and now there's another massive crawl. Possibly they are attempting to repeat the process and this time get it right.

I think we should expect hiccups like this. The number of pages and links has grown enormously and is still growing, so Google will probably have to continuously modify its procedures in order to cope.

8:01 am on Nov 2, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:May 27, 2003
posts:503
votes: 0


> I think they have already done so [...]

Yes, they most definitely have added some of my long-gone pages back into the index as supplemental results. Three dead pages that weren't there yesterday are there today. No title or ransom-note snippet, just a URL.

This situation makes me wish I had allowed Google to cache my pages; I'd love to see what date they'd report.

I can find the pages with a plain site:www.example.com, but site:www.example.com UniqueWidgetFromDeadPage fails to bring them up. That's a small comfort, since it means the dead pages shouldn't show up in anyone's regular SERPs. (On the other hand, those searchers will miss out on my nice 404 page.)