
Ancient URLs being crawled...

Googlebot feeling nostalgic?

5:11 am on Oct 29, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:May 27, 2003
posts:503
votes: 0


Mostly an FYI post, but the tinfoil hat crowd might find something interesting...

A little over a year ago I did a massive renaming of most of the pages on one of my sites. With careful planning and judicious use of mod_rewrite, by the end of the year Google was crawling the new URLs and had (seemingly) forgotten the old ones, my rankings were not negatively affected, and the change was virtually transparent to new and old visitors.
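For anyone curious how a renaming like that is usually handled, the standard approach is a permanent (301) redirect from each old URL to its new one. A minimal .htaccess sketch, with made-up paths standing in for a real naming scheme:

```apache
# Turn on the rewrite engine
RewriteEngine On

# Map an old URL pattern to its renamed equivalent with a 301
# (hypothetical paths -- substitute your own old/new scheme)
RewriteRule ^old-widgets/(.*)\.html$ /widgets/$1.html [R=301,L]

# A one-off rename can also use a plain Redirect directive
Redirect 301 /about-us.htm /about.html
```

The 301 status is what tells Googlebot the move is permanent, so it transfers the listing (and, eventually, stops requesting the old URL).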

Over two years ago, I deleted several pages and let them die a 404 death. Google (and others) caught on to this pretty quickly, and of course the pages were dropped.

Earlier today, in a real "blast from the past," Google requested all these old URLs, filling my error log with 404s.

FWIW, all the bad requests came from the "new" "Mozilla/5.0 (compatible; Googlebot/2.1;[...]" bot, coming from the 66.249.65.x range. This new bot has previously scraped the site successfully.
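If you want to confirm which bot and IP range is behind the 404s in your own logs, a quick scan does it. A minimal sketch, assuming the common Apache combined log format (the sample entries below are invented):

```python
import re

# Regex for the Apache combined log format:
# IP, identd, user, [timestamp], "request line", status, bytes, "referer", "user agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_404s(lines):
    """Return (ip, url) pairs for 404s from Googlebot in the 66.249.65.x range."""
    hits = []
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        if (m.group("status") == "404"
                and "Googlebot" in m.group("agent")
                and m.group("ip").startswith("66.249.65.")):
            hits.append((m.group("ip"), m.group("url")))
    return hits

# Invented sample entries for illustration
sample = [
    '66.249.65.37 - - [29/Oct/2004:05:11:02 +0000] "GET /old-page.htm HTTP/1.1" '
    '404 209 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.5 - - [29/Oct/2004:05:12:44 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/4.0"',
]
print(googlebot_404s(sample))
```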

5:41 am on Oct 29, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 31, 2003
posts:386
votes: 0


Yep, same here.

Over 40,000 old pages crawled. Most of them are still valid (I didn't use 301s, just let the old links die a slow death), so they didn't cause 404s, but they're lots of old pages nonetheless.

6:00 am on Oct 29, 2004 (gmt 0)

Full Member

10+ Year Member

joined:Oct 6, 2004
posts:216
votes: 0


Same here. Thousands of 404s from pages deleted around January this year. What is Google doing?

6:16 am on Oct 29, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Apr 14, 2003
posts:438
votes: 0


It did the exact same thing around the same time last month, but with the plain "Googlebot" user agent.

regards,
Mark

6:40 am on Oct 29, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member powdork is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Sept 13, 2002
posts:3347
votes: 0


Perhaps Google is preparing to unveil the new higher-capacity index and they're grabbing as many indexable URLs as possible (fingers crossed). Brett has made reference to the likelihood of a massive update this fall. It is about three weeks until the Vegas search conference, which falls roughly on the anniversary of Florida. Even if they were just to suddenly include all the sites they have ignored since March, the results would be tumultuous. If any of this is true, we are in for some big fun!
Bring it on!

>sets his tin foil hat back on the desk.

6:51 am on Oct 29, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Apr 14, 2003
posts:438
votes: 0


Long, long longshot, but... maybe they finally perfected reading webpages as users do, as stated by GoogleGuy here [webmasterworld.com...], and are building the new index with the new bot.

regards,
Mark

11:28 am on Oct 29, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:Oct 1, 2002
posts:96
votes: 0


Could be Google checking to see whether pages in their "Supplemental Results" index are still valid.

11:39 am on Oct 29, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member billys is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 1, 2004
posts:3181
votes: 0


I see a very similar pattern to the one described in the Googlebot running hard thread (http://www.webmasterworld.com/forum3/25897.htm). Grabbing MANY pages.

I've seen this happen after a SERPs update, but no one seems to be reporting anything.

12:07 pm on Oct 29, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 21, 2002
posts:1051
votes: 0


I have seen something similar. Quite a few of my obsolete pages are now appearing in the index, usually marked "Supplemental". These all had 301 redirects and had disappeared from the index, but since the last toolbar PR update they are back, with caches typically dating from February.

Also, my index page is showing again (as well as www.domain.com). It's just a URL without a snippet, but it has a PR3. I had earlier gone to a lot of trouble to get rid of it by ensuring internal links pointed to root. And anyway, I thought Google had fixed this double-entry problem long ago.

1:48 pm on Oct 29, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 31, 2003
posts:386
votes: 0


Googlebot's at it again today. The surprise is the rate of crawl on my end... I'm seeing a constant 30 pages per second right now. No big whoop for my site, which isn't even breathing hard, but I've never seen such a rapid crawl.

1:53 pm on Oct 29, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 31, 2003
posts:386
votes: 0


Make that 50 pages per second.

Hoo hoo. That's-a spicy meat-a-ball.

2:00 pm on Oct 29, 2004 (gmt 0)

Full Member

10+ Year Member

joined:July 9, 2003
posts:233
votes: 0


I have had 70,000 hits so far this month, and still going strong!

2:23 pm on Oct 29, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 31, 2003
posts:386
votes: 0


Over 120,000 hits *today*. :)

6:55 am on Oct 30, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 3, 2003
posts:963
votes: 0


I'm noticing tons of URLs being crawled on almost all the websites I monitor. The speed is mind-blowing! Way to go, Googlebot...

davegee

9:13 am on Oct 30, 2004 (gmt 0)

Inactive Member
Account Expired

I had noticed this strange behaviour of Googlebot as well. It was trying to dredge up long-deleted URLs of all my biggest naming mistakes from when I first created my site ;-) ... (things like capitals in folder/htm filenames, spaces in folder/htm filenames, etc.!)

I thought they were long removed from Google's index, but I wondered whether they might just be doing a massive "spring clean," double-checking all their old URLs before deleting them for good.

10:31 am on Oct 30, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 13, 2004
posts:833
votes: 12


"Mozilla/5.0 (compatible; Googlebot/2.1;[...]" bot, coming from the 66.249.65.x range. This new bot has previously scraped the site successfully.

Don't forget this version of Googlebot now requests GZIP-compressed pages using HTTP 1.1 instead of 1.0, so it can typically go four times faster. The old Googlebot did not request GZIP'd pages.

On Sep 30th, Oct 6th and Oct 28th I noticed the new bot requesting GZIP'd pages in my logs.
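The bandwidth difference is easy to demonstrate: gzip typically cuts HTML to a fraction of its raw size, so the bot can pull far more pages through the same pipe. A quick sketch (the sample page is invented; a real bot signals support by sending an `Accept-Encoding: gzip` request header):

```python
import gzip

# A made-up chunk of repetitive HTML, standing in for a real page
html = ("<html><body>"
        + "<p>widget widget widget</p>" * 500
        + "</body></html>").encode("utf-8")

# What the server would send back to a gzip-capable client
compressed = gzip.compress(html)

ratio = len(compressed) / len(html)
print(f"raw: {len(html)} bytes, gzipped: {len(compressed)} bytes "
      f"({ratio:.0%} of original)")
```

Real-world HTML isn't quite this repetitive, but 70-80% savings are common, which lines up with the "roughly four times faster" figure above.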

See thread:
[webmasterworld.com...]

Google's also got to be working on the hijacked-websites problem, so perhaps this is somehow related: crawling old pages to figure out who owned the material first?

2:41 pm on Oct 30, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member powdork is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Sept 13, 2002
posts:3347
votes: 0


I thought they were long removed from google's directory, but I wondered whether they may have just been doing a massive "spring clean" and double checking all their old urls before deleting them for good?
Hmmmm, perhaps they are on a crusade to delete as many old URLs as possible as a stopgap measure to allow some newer pages into the main index.
Just another in a series of wild guesses.

4:00 pm on Oct 30, 2004 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 23, 2004
posts:102
votes: 0


Around 10,000 old pages of my website are listed with Google. I think Google has loaded up an old database.

To me, it's like Google is planning a major update to the SERPs.

10:43 pm on Oct 30, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Google has a record of the URL of every page they have ever seen, as well as a record of every link they have ever seen (whether or not there was a typo in that link, and whether or not the page it points to still exists), so I guess they go through and re-check old URLs to see what their status is now. They cannot know whether a page you removed a year ago has come back until they ask for it again. There may also be a page somewhere that still points to it, so they will want to check that out.

If they are just asking for old pages, then it is a status check of their old data and nothing to worry about at all. If, however, they are putting references back into their index for pages that don't actually exist, then they have a big problem.
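That re-check boils down to a simple decision per stored URL. A hypothetical sketch of the logic (the action labels are my own, not anything Google has documented):

```python
def recheck_action(was_indexed, current_status):
    """Decide what to do with a stored URL after re-fetching it.

    was_indexed    -- whether the URL is currently in the index
    current_status -- HTTP status code returned by the fresh fetch
    """
    if current_status == 200:
        # Page exists: restore it if it had been dropped, otherwise keep it
        return "index" if not was_indexed else "keep"
    if current_status in (301, 302):
        # Moved: follow the redirect and credit the target URL instead
        return "follow_redirect"
    if current_status in (404, 410):
        # Gone: drop it from the index, but keep the URL on record
        # so it can be re-checked again later (as this thread shows)
        return "drop"
    # Server errors, timeouts, etc.: try again another day
    return "retry_later"

# A page removed a year ago that has come back should be re-indexed
print(recheck_action(False, 200))
```

Under this model, a crawl of ancient URLs is harmless bookkeeping; the "big problem" case above corresponds to returning "index" for URLs that actually 404.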

11:24 pm on Oct 30, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 21, 2002
posts:1051
votes: 0


If however, they are putting pages references back into their index for pages that don't actually exist then they have a big problem.

I think they have already done so, and are now trying to sort it out. Many old URLs seemed to appear at the time of the last PR update, and as has been noted in other threads, there were anomalies in the updated PR.

There was a massive crawl just before (or during?) the update, and now there's another massive crawl. Possibly they are attempting to repeat the process and this time get it right.

I think we should expect hiccups like this. The number of pages and links has grown enormously and is still growing, so Google will probably have to continuously modify its procedures in order to cope.

8:01 am on Nov 2, 2004 (gmt 0)

Preferred Member

10+ Year Member

joined:May 27, 2003
posts:503
votes: 0


> I think they have already done so [...]

Yes, they most definitely have added some of my long-gone pages back into the index as supplemental results. Three dead pages that weren't there yesterday are there today. No title or ransom-note snippet, just a URL.

This situation makes me wish I had allowed Google to cache my pages; I'd love to see what date they'd report.

I can find the pages with a plain site:www.example.com, but site:www.example.com UniqueWidgetFromDeadPage fails to bring them up. That's a small comfort, since it means the dead pages shouldn't show up in anyone's regular SERPs. (On the other hand, those searchers will miss out on my nice 404 page.)