Forum Moderators: Robert Charlton & goodroi
Over the last 7 days I’ve seen a validated Googlebot requesting thousands of URLs that haven’t existed on my website for years.
These all return 404s now, and have for a long time; some of the URLs being requested are up to four years old.
Ok, some could be from external links, but the quantity of these requests leads me to believe that there is something else going on.
My assumptions are:
a) Google’s building new indices.
b) A major change is coming to the algorithm in the next few weeks.
c) My site’s in deep doo-doo.
We've got sitemap files, so why ask for these URLs all of a sudden?
Anyone else seeing this type of activity?
Vimes.
The removal tool, if I'm correct, only works for a limited period (I think six months) before Google tries to re-index the URLs.
The requests hitting my server are for URLs that haven't existed in years. It looks like Google is re-crawling every known URL it has ever had for my site.
I've checked the site and it's clean; these requests aren't coming from anything on my end.
Vimes.
Anyway, there's a second "sample" for your theory -- I doubt that we're the only ones, but maybe we're just among the first here to post about noticing it.
Jim
I recently added a 301 redirect at my root to stop any issues with the trailing-dot hostname www.example.com./ , but I wouldn’t have thought this would cause a flood of requests for URLs that redirect to 404 pages; a page that isn’t there just isn’t there.
I guess it might have prompted the bot to recheck, though. I’m going through my 301 redirect logs now and so far haven’t found any www.example.com./ 301 redirects landing on a 404 page.
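For reference, the sort of trailing-dot canonicalization described above is typically done with mod_rewrite. This is a minimal sketch under assumed Apache + .htaccess conditions; the hostname and paths are illustrative, not taken from the poster's actual configuration:

```apache
# Hypothetical .htaccess sketch: 301-redirect any request whose Host
# header ends in a trailing dot (e.g. "www.example.com.") to the same
# path on the bare hostname.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^(.+)\.$
RewriteRule ^(.*)$ http://%1/$1 [R=301,L]
```

Here %1 captures the hostname minus the trailing dot from the RewriteCond, and $1 carries the original URL path through the redirect.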
I really get nervous when Googlebot does funky stuff like this; for me it's never been a good sign.
Vimes.
Google sees the 404 and stops showing the URL in their results.
They test the URL again, from time to time, to see if it gets re-used.
Months later, they find a link to that page from a page they had never spidered before. What to do? Is this a new link to you, because your page has now come back? Is this an old link they hadn't previously noticed?
Whatever, once a URL "exists" it will be checked from time to time, forever, in case the status of the URL has changed in any way.
.
I don't think 410 can mean "forever".
Think about it.
I 410 www.domain.com/index.html and 5 years later let the domain lapse.
Someone else buys it a year or two later. Should the "410 Gone Forever" still apply?
No. Of course not.
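For anyone wanting to signal "gone" rather than "not found" for a removed page, mod_alias can send the 410 with a one-line directive; a sketch, with a made-up example path:

```apache
# Hypothetical sketch: answer requests for a permanently removed page
# with "410 Gone" instead of "404 Not Found" (Apache mod_alias).
Redirect gone /old-section/index.html
```

The "gone" keyword takes no target URL, since there is nowhere to send the client.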
Discovery of new links to a previously 404 or 410 URL may lead this process, as may change of ownership information.
.
This is why from Day Zero you should not let your website respond to *any* "stray" URL requests.
.
Now, say someone has linked to you as www.domain.com/index.hmtl, then that URL "exists", and will be internally indexed as a 404.
Google has to keep a record of that URL and the fact that it is "bad", otherwise they will have to go on discovery every time they spider the page the duff link is on.
What if that page has a large number of such duff links? Do you think they might have a routine to mark *that* page as bad instead/as well, and save some crawler work?
[edited by: Robert_Charlton at 7:44 am (utc) on Sep. 10, 2008]
[edit reason] fixed example per poster [/edit]
I recently added a 301 redirect at my root to stop any issues with the trailing-dot hostname www.example.com./ , but I wouldn’t have thought this would cause a flood of requests for URLs that redirect to 404 pages; a page that isn’t there just isn’t there.
I guess it might have prompted the bot to recheck, though. I’m going through my 301 redirect logs now and so far haven’t found any www.example.com./ 301 redirects landing on a 404 page.
This would not have "caused" Googlebot to do anything, since Gbot would have to request example.com./ in order to "discover" this redirect. Otherwise, the addition of this redirect is invisible to Gbot, since your server only responds to client requests -- there is nothing in a server that will "send a notice" to search engine spiders about such changes; how would the server know whom to notify?
More likely Gbot is just checking through its historical "dead link" data for each of our sites, and in a few cases, might have found an obsolete link out on the Web somewhere.
It's interesting to me that they're doing this all at once in a noticeably large "batch" -- so possibly there is some kind of clean-up or archiving process taking place.
Jim
At the moment I don't have any sites I could track for the same pattern. In any case, though, the Supplemental refresh is supposed to be a much more frequent and ongoing thing, if I correctly understood what Matt Cutts said on the topic about a year ago.
Now, say someone has linked to you as www.domain.com/index.hmtl, then that URL "exists", and will be internally indexed as a 404.
So what to do if someone links to you as www.example.com/keywor/index.html - would it be sensible to create a 301 redirect to /keyword/index.html ?
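A per-URL 301 for a single misspelled inbound link like that can be a one-liner in .htaccess; a sketch, assuming Apache with mod_alias and the example paths from the question above:

```apache
# Hypothetical sketch: permanently redirect the misspelled inbound
# URL-path to the real page (Apache mod_alias).
Redirect 301 /keywor/index.html /keyword/index.html
```

Note that Redirect matches on a leading path prefix; if other URLs happen to share that prefix, RedirectMatch with an anchored regex is the tighter tool.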
[edited by: Robert_Charlton at 7:45 am (utc) on Sep. 10, 2008]
[edit reason] updated reference to earlier example [/edit]