In the days of the monthly Google update, most of Site B's pages were successfully added to the index. Now, since the "continuous update" started, I have watched a gradual removal of Site B's pages from the index.
I can't work out what is going on. I am not using any spamming techniques; Site B was built using the same techniques I have used on a lot of other sites, all of which are sitting happily in Google's index.
Any ideas how I go about sorting this out? At its peak Site B had about 80 pages in the index; now it's down to about 5!
You say: "I created a new site (Site B) which I linked to the home page of my PR6 site. I did this in April this year."
Did you link from the PR6 page to the new page, or from the new page to the PR6 page?
I assume you mean you linked from the PR6 page to the new page ....
Is that the only link into the new site, or are there other credible links into the new site as well?
It seems like Google is favoring... well, whatever... perhaps it really is a new emphasis on fresh content, as in recently updated pages, and the oldies & goodies get pushed out or put in the background for a while.
It could have something to do with the reconfiguration of the Gbot - now that it's deepfreshbot it has to do more things, and there are also more of them out spidering all the time. With all this fresh content constantly being added, it seems that other (older, but no less valuable) content is sliding out the back door due to limited capacity.
I've seen some posts mention that there are scalability issues being worked on, and I personally hope this is something they will be looking at, as it is not really good for the quality of their searches - some of the sites dropping out are valuable sources that have been around for ages. Personally, I keep finding (updated but irrelevant) email lists, discussion boards and blogs when I'm looking for quality (static) reference material, and it's simply getting harder to find the good stuff.
I don't think this comment is of much use to you, but it's something I've been thinking about a bit lately. I admit, though, that it sounds like the usual "wait and see, it'll get better eventually" that has been aired quite a few times - sorry about that. Fact is, I haven't got a clue whether it's going to get better; I just hope so.
/claus
Any chance of duplicate content on Site B?
Also, did you try adding "&filter=0" to the end of the search query you might be using to check Site B's pages?
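For example (with www.example.com standing in for your actual domain), the check would look something like:

    http://www.google.com/search?q=site:www.example.com&filter=0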
see also: [webmasterworld.com...] msg4
In any case - getting some extra external deep links pointing to Site B's inner pages should also help.
Did you check your logs to see if Googlebot is chewing on your pages?
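If you want a quick way to look, something along these lines would do it - a rough Python sketch run against a combined-format access log (the log path is just a placeholder, adjust it for your own server):

    # Minimal sketch: count Googlebot hits per URL in a combined-format access log.
    # The log path and format are assumptions - adjust to your own server setup.
    import re
    from collections import Counter

    LOG_PATH = "access.log"  # placeholder path
    hits = Counter()

    with open(LOG_PATH) as log:
        for line in log:
            if "Googlebot" not in line:
                continue
            # combined log format: the request is the quoted "GET /path HTTP/1.x" field
            match = re.search(r'"(?:GET|HEAD) (\S+) HTTP', line)
            if match:
                hits[match.group(1)] += 1

    for url, count in hits.most_common(20):
        print(count, url)

That at least tells you whether the pages are still being fetched, even if they're dropping out of the index.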
/claus
Now I'm beginning to wonder.
The observations people are reporting sound very similar to the old freshbot behaviour, but applied to the entire index over a longer timescale.
If the index were reaching capacity, then a slow churn of pages would be one short-term workaround. My guess is that this isn't a bug but a deliberately applied piece of sticking plaster.
They have recently increased the published index size too, although the increase itself need not be recent just because its publication is. Still, it is at least a signal that they have increased capacity.
The "index" afaik is just one huge file that is already distributed across several machines, and the querying consist of identifying pointers to the relevant part(s) of this file. In principle just like when you make a zipfile that spans a couple of floppies, although somewhat more complicated and much larger. It's a very unconventional setup, but it has proved very efficient and scalable sofar.
So, it's probably not a capacity problem in "the index" as such. It's more likely a decision made somewhere - an altered or new "weight" to certain pages of the SERPS imho.
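Just to make the floppy-spanning analogy a bit more concrete, here's a toy sketch of the idea - a small pointer table telling you which piece of the split-up index holds a given word. This is only my own illustration of the principle, not how Google actually lays things out:

    # Toy illustration of the analogy: one logical index split across shards,
    # with a pointer table saying where each word's data lives.
    # The shard names and layout are made up for the example.
    index_shards = {
        "shard-a": {"widgets": [101, 102, 205]},   # word -> list of docIDs
        "shard-b": {"gadgets": [102, 309]},
    }
    pointers = {"widgets": "shard-a", "gadgets": "shard-b"}  # word -> shard holding it

    def lookup(word):
        shard = pointers.get(word)
        if shard is None:
            return []
        return index_shards[shard].get(word, [])

    print(lookup("widgets"))  # [101, 102, 205]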
/claus
It's not helped by some of the daft things they've done in recent months... the Spamazon debacle probably being the most noticeable and the most heavily criticized. I would have expected them to clean it up by now, but they don't seem to see it as a problem.
Perhaps some of the proper content pages lost above were replaced by this stuff. If there is a finite lid on the index, the logic is that something has to go to make room for it. Who knows.
Regarding "document last modified": such a signal would give all the dynamically generated pages an edge over static pages. The problem is, you just can't trust that a page is fresh simply because it's dynamically generated the minute you request it.
That's not a bug; it's an issue that's being handled programmatically but that can't really be solved programmatically. A regular error (40), that is.
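For what it's worth, you can see for yourself what a server claims with a quick check of the Last-Modified header. A minimal sketch (the URL is just a placeholder) - many dynamic pages send nothing at all, or simply the current time, which is exactly why the header alone proves nothing about freshness:

    # Minimal sketch: report what Last-Modified (if anything) a server sends.
    import urllib.request

    url = "http://www.example.com/"  # placeholder URL
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        print(response.headers.get("Last-Modified", "no Last-Modified header"))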
/claus
The "index" afaik is just one huge file that is already distributed across several machines, and the querying consist of identifying pointers to the relevant part(s) of this file. In principle just like when you make a zipfile that spans a couple of floppies, although somewhat more complicated and much larger. It's a very unconventional setup, but it has proved very efficient and scalable sofar.
From The Anatomy of a Large-Scale Hypertextual Web Search Engine [www-db.stanford.edu]:
"Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page."
Actually.... I may have to retreat a little... they seem to be scaling back on this, thank goodness!
Looking across the data centers, most of them seem to have pushed it back somewhat from where it was a week or so ago. Much better quality, of course, if they keep it that way - or better still, push it further. Anyone else notice the change?