Forum Moderators: Robert Charlton & goodroi
Over the last 7 days I’ve seen a validated Googlebot requesting thousands of URLs that haven’t existed on my website for years.
These all return 404s now, and have for a long time; some of the URLs being requested are up to four years old.
Ok, some could be from external links, but the quantity of these requests leads me to believe that there is something else going on.
My assumptions are:
a) Google’s building new indices.
b) A major change is coming to the algorithm in the next few weeks.
c) My site’s in deep doo-doo.
We've got sitemap files, so why ask for these URLs all of a sudden?
Anyone else seeing this type of activity?
Vimes.
The removal tool, if I'm correct, only works for a limited period (I think six months) before Google tries to re-index the URLs.
The requests hitting my server are for URLs that haven't existed in years. It looks like Google is re-crawling every known URL it has ever had for my site.
I've checked the site and it's clean; these requests aren't coming from anything on my end.
Vimes.
Anyway, there's a second "sample" for your theory -- I doubt that we're the only ones, but maybe we're just among the first here to post about noticing it.
Jim
I recently added a 301 redirect at my root to stop any issues with the trailing-dot hostname www.example.com./ , but I wouldn’t have thought this would cause a flood of requests for URLs that redirect to 404 pages; a page that isn’t there just isn’t there.
I guess it might have prompted the bot to recheck, though. I’m going through my 301 redirect logs now and so far haven’t found any www.example.com./ 301 redirects landing on a 404 page.
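For reference, the sort of trailing-dot canonicalization described above is typically done with mod_rewrite. This is a minimal sketch under assumed Apache + .htaccess conditions; the hostname and paths are illustrative, not taken from the poster's actual configuration:

```apache
# Hypothetical .htaccess sketch: 301-redirect any request whose Host
# header ends in a trailing dot (e.g. "www.example.com.") to the same
# path on the bare hostname.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^(.+)\.$
RewriteRule ^(.*)$ http://%1/$1 [R=301,L]
```

Here %1 captures the hostname minus the trailing dot from the RewriteCond, and $1 carries the original URL path through the redirect.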
I really get nervous when Googlebot does funky stuff like this; for me it's never been a good sign.
Vimes.
Google sees the 404 and stops showing the URL in their results.
They test the URL again, from time to time, to see if it gets re-used.
Months later, they find a link to that page from a page they had never spidered before. What to do? Is this a new link to you, because your page has now come back? Is this an old link they hadn't previously noticed?
Whatever, once a URL "exists" it will be checked from time to time, forever, in case the status of the URL has changed in any way.
.
I don't think 410 can mean "forever".
Think about it.
I 410 www.domain.com/index.html and 5 years later let the domain lapse.
Someone else buys it a year or two later. Should the "410 Gone Forever" still apply?
No. Of course not.
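For anyone wanting to signal "gone" rather than "not found" for a removed page, mod_alias can send the 410 with a one-line directive; a sketch, with a made-up example path:

```apache
# Hypothetical sketch: answer requests for a permanently removed page
# with "410 Gone" instead of "404 Not Found" (Apache mod_alias).
Redirect gone /old-section/index.html
```

The "gone" keyword takes no target URL, since there is nowhere to send the client.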
Discovery of new links to a previously 404 or 410 URL may lead this process, as may change of ownership information.
.
This is why from Day Zero you should not let your website respond to *any* "stray" URL requests.
.
Now, say someone has linked to you as www.domain.com/index.hmtl, then that URL "exists", and will be internally indexed as a 404.
Google has to keep a record of that URL and the fact that it is "bad", otherwise they will have to go on discovery every time they spider the page the duff link is on.
What if that page has a large number of such duff links? Do you think they might have a routine to mark *that* page as bad instead/as well, and save some crawler work?
[edited by: Robert_Charlton at 7:44 am (utc) on Sep. 10, 2008]
[edit reason] fixed example per poster [/edit]
I recently added a 301 redirect at my root to stop any issues with the trailing-dot hostname www.example.com./ , but I wouldn’t have thought this would cause a flood of requests for URLs that redirect to 404 pages; a page that isn’t there just isn’t there.
I guess it might have prompted the bot to recheck, though. I’m going through my 301 redirect logs now and so far haven’t found any www.example.com./ 301 redirects landing on a 404 page.
This would not have "caused" Googlebot to do anything, since Gbot would have to request example.com./ in order to "discover" this redirect. Otherwise, the addition of this redirect is invisible to Gbot, since your server only responds to client requests -- there is nothing in a server that will "send a notice" to search engine spiders about such changes; how would the server know whom to notify?
More likely Gbot is just checking through its historical "dead link" data for each of our sites, and in a few cases, might have found an obsolete link out on the Web somewhere.
It's interesting to me that they're doing this all at once in a noticeably large "batch" -- so possibly there is some kind of clean-up or archiving process taking place.
Jim
At the moment I don't have any sites I could track for the same pattern. In any case, though, the Supplemental refresh is supposed to be a much more frequent and ongoing thing, if I correctly understood what Matt Cutts said on the topic about a year ago.
Now, say someone has linked to you as www.domain.com/index.hmtl, then that URL "exists", and will be internally indexed as a 404.
So what to do if someone links to you as www.example.com/keywor/index.html - would it be sensible to create a 301 redirect to /keyword/index.html ?
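A per-URL 301 for a single misspelled inbound link like that can be a one-liner in .htaccess; a sketch, assuming Apache with mod_alias and the example paths from the question above:

```apache
# Hypothetical sketch: permanently redirect the misspelled inbound
# URL-path to the real page (Apache mod_alias).
Redirect 301 /keywor/index.html /keyword/index.html
```

Note that Redirect matches on a leading path prefix; if other URLs happen to share that prefix, RedirectMatch with an anchored regex is the tighter tool.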
[edited by: Robert_Charlton at 7:45 am (utc) on Sep. 10, 2008]
[edit reason] updated reference to earlier example [/edit]