This 249 message thread spans 9 pages.
Pages Dropping Out of Big Daddy Index
Continued from: [webmasterworld.com...]
internetheaven, you said:
|I had 20,300 pages showing for a site:www.example.com search yesterday and for the past month. Today it dropped to 509 but my traffic is still pretty constant. I normally get around 4,500 - 5,000 to that site per day and today I've already got 4,000. |
So, either Google doesn't account for even a small percentage of my traffic (which I doubt), or the way Google stores information about my site has changed. i.e., the 20,300 pages are still there, but Google will only tell me about 509 of them. As far as I can tell, the other pages have gone supplemental.
That resonated with something I was discussing with the crawl/index team. internetheaven, was that post about the site in your profile, or a different site? Your post aligns exactly with one thing I've seen in a couple of ways. It would align even more if you were talking about a different site than the one in your profile. :) If you were talking about a different site, would you mind sending the site name to bostonpubcon2006 [at] gmail.com with the subject line "crawlpages" and the name of your site, plus the handle "internetheaven"? I'd like to check the theory.
Just to give folks an update, we've been going through the feedback and noticed one thing. We've been refreshing some (but not all) of the supplemental results. One part of the supplemental indexing system didn't return any results for [site:domain.com] (that is, a site: search with no additional terms). So that would match with fewer results being reported for site: queries but traffic not changing much. The pages are available for queries matching the supplemental results, but just adding a term or stopword to site: wouldn't automatically access those supplemental results.
I'm checking with the crawl/index folks on whether this might factor into what people are seeing, and I should hear back later today or tomorrow. In the meantime, interested folks might want to check if their search traffic has gone up/down by a major amount, and see if there are fewer/more supplemental results for a site: search for their domain. Since folks outside Google couldn't force the supplemental results to return site: results, it took a crawl/index person to notice that fact based on the feedback that we've gotten.
Anyone that wants to send more info along those lines to bostonpubcon2006 [at] gmail.com with the subject line "crawlpages" is welcome to. So you might send something like "I originally wrote about domain.com. I looked at my logs and haven't seen a major decrease in traffic; my traffic is about the same. I used to have about X% supplemental results, and now I hardly see any supplemental results with a site:domain.com query."
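For anyone who wants to check the traffic side of this before writing in, one rough approach is to count Google-referred visits per day in your raw access logs. Here's a minimal sketch in Python; the log path and the combined-log format are assumptions (adjust for your own server setup), not anything Google asks for:

```python
import re
from collections import Counter

# Combined Log Format puts the date in brackets and the referrer in quotes:
# 1.2.3.4 - - [08/May/2006:10:00:00 +0000] "GET /a HTTP/1.1" 200 123 "referrer" "agent"
DATE = re.compile(r'\[(\d+/\w+/\d+):')

def google_visits_per_day(path):
    """Count hits per day whose HTTP referrer is a Google search page."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            if '"http://www.google.' not in line:
                continue  # keep only Google-referred hits
            m = DATE.search(line)
            if m:
                counts[m.group(1)] += 1  # key looks like 08/May/2006
    return counts
```

If the daily counts are roughly flat while your site: result count has collapsed, that matches the "supplemental results not returned for site:" explanation above.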
I've still got someone reading the bostonpubcon email alias, and I've worked with the Sitemaps team to exclude that as a factor. The crawl/index folks are reading portions of the feedback too; if there's more that I notice, I'll stop by to let you know.
[edited by: Brett_Tabke at 8:07 pm (utc) on May 8, 2006]
> To keep it short, how come a page that is not indexed in google to have its own PR?
The process which updates the toolbar PR is always referred to as an export, suggesting that the value is stored separately from the index. So, it could get out of step.
Thanks for sharing your information with us.
You said that some of the sites have a spam penalty.
Could you please give us webmasters some outline of what spam is? I don't mean in detail.
Maybe you can just answer these questions by adding a yes/no comment?
1. Are pages with mostly similar content (about 80%) spam?
2. Are stand-alone product pages spam?
3. Is linking from deeper pages to top pages spam?
4. Is there a spam factor based on PageRank (i.e. higher PageRank, lower danger of being caught by a spam penalty)?
Maybe someone can add more questions?
Thanks in advance, GG.
I believe that if we webmasters know more about how spam is defined, we can do some work to help make your index more relevant again.
My main page does not appear on some datacenters now!
|So that would match with fewer results being reported for site: queries but traffic not changing much. The pages are available for queries matching the supplemental results, but just adding a term or stopword to site: wouldn't automatically access those supplemental results. |
That is not what is happening here - the pages are gone from query matches as well as site:domain.com matches.
However, I can believe that for some of us this may be related to a cleanup of supplementals.
GG - what do you mean by a penalty? If you do a site:domain.com search and your homepage is not on top, is that likely to be caused by a penalty? How are we supposed to know if we are penalized, if we are in the SERPs and whenever we contact Google they reply that we are not penalized because the site can be found with a search like site:domain.com?
Is the penalty that Google applied to sites that had/have canonical problems grouped into the above? Is Google still looking into a way of fixing that problem? (The penalty went hand in hand with the issue, as much as with improved canonicalization.)
I get exactly the same results from:
Could someone tell me if this is good or bad?
The more I think about it the more convinced I am that the missing pages problem is being caused by a Backlink/PR issue (see Msg #15).
Tying together all of the evidence from my own experience, and that of others gleaned from the forums, erroneous or out-of-date backlinks would explain all of the missing pages.
The erroneous, or simply out-of-date, backlink information (which we cannot see) leads to insufficient PR (which we cannot see) and hence deep pages are not indexed.
We all know that a "link:www.mysite.com" does not show you the complete picture. But, since Big Daddy, it now shows just a tiny proportion of backlinks. Way less than it used to show before Big Daddy. Why? Because either the backlink index hasn't been updated (and now dates back to mid 2005), or it has been updated but the update process is buggy. Only a small handful of Google employees know which of these two possibilities is the case.
We know that the missing pages problem cannot be due to any kind of duplicate content filter, as some people are suggesting. If that were the case, then affected sites would see a proportion of their pages disappear. Some would lose 10%, some would lose 40%, and some would lose 95%. But that's not what we see. We see sites losing the vast majority of their pages, or else losing no pages at all. The reason affected sites lose such high percentages of their pages is the hierarchical nature of a site. The number of pages increases with depth, and the artificially low PRs (based on inaccurate and/or out-of-date backlink data) prevent the deeper content from being indexed.
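To make the "pages increase with depth" point concrete: in a hypothetical site where every page links to b child pages, the number of pages at depth d is b^d, so almost all pages sit at the deepest level. The branching factor here is an illustrative assumption, not a claim about any particular site:

```python
def pages_per_depth(branching, max_depth):
    """Pages at each depth of a uniform tree-shaped site: branching**d at depth d."""
    return [branching ** d for d in range(max_depth + 1)]

# With 10 links per page and 4 levels below the homepage, the deepest
# level alone holds 10,000 of the 11,111 total pages -- so a PR cutoff
# that excludes just the deepest level drops ~90% of the site at once.
levels = pages_per_depth(10, 4)
total = sum(levels)
```

That arithmetic is why a depth-based indexing cutoff would look like "lose almost everything or lose nothing" rather than a smooth spread of percentages.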
The fact that Big Daddy was kick-started from an index dating back to the middle of last year, not only explains why the backlink data might be stale, but it also explains why ancient pages keep popping up on various data centres.
As further evidence: try a "link:www.mysite.com" and compare it to a search for "www.mysite.com". In my case, the "link:" search shows just 6 results, only one of which is external to my site. The one external backlink probably pre-dates when Big Daddy's index was seeded. The "www.mysite.com" search, on the other hand, finds hundreds of results representing hundreds of internal and external backlinks. Why aren't these showing up in the "link:" search? Is it because "link:" searches are well known for not showing you the complete picture? Or, has that well-known fact simply been obscuring the true cause of all of the problems? Namely, that the backlinks are simply missing from Google's backlink index.
Sorry for waffling on...I think I've finally run out of steam now.
>>>...search traffic has gone up/down by a major amount, and see if there are fewer/more supplemental results for a site: search for their domain...
For at least one of my sites the number of supplemental results has increased dramatically and traffic is about 50% what it was before all the pages went supplemental. This happened in the middle of April.
One of my websites has also had Big Daddy issues since early May... here's the low-down:
- Traffic post May 2 is only 30% of previous
- Pages began dropping May 1-2; now between 25% and 40% of pages are left in the index, depending on the DC
- Supplemental Results began to show around May 5th, previously no Supplementals showed
- has had a 301 www/non-www redirect for years, doesn't appear to be an issue with the site: command
- allintext: searches have our site missing, other searches fine
- searches for unique text from our homepage have many scraper sites coming before ours; for some searches we still top the scrapers! (sarcastic-cheer!)
Anyone else have any similar problems? I'm thinking this website's problem is due to scrapers and how G's reworking their BD index. Part of me would like to stay hands off and let it work itself out (over the coming weeks? months?), however, I'm wondering if re-writing a lot of the site's content would perhaps lift the problem that we seem to have due to the scrapers... it's a small site so a rewrite isn't out of the question.
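As an aside on the www/non-www 301 mentioned above: on Apache, that kind of host canonicalization is usually a short mod_rewrite rule. A minimal sketch, assuming Apache with mod_rewrite enabled and a hypothetical example.com (your host name will differ):

```apache
# .htaccess: 301-redirect non-www requests to the www host
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

The R=301 flag makes it a permanent redirect, which is what tells the crawler to consolidate the two hostnames onto one.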
"linking from deeper pages to top pages is spam"
I would appreciate it if you could explain. Try giving an example so I understand exactly what this refers to.
I just want to make sure I understand this correctly.
>"linking from deeper pages to top pages is spam"
I really don't see how that could be classed as spam; all of my deep pages link back to the main pages/index within the navigation on the page.
I mean that you have a hierarchical architecture with 3 or more levels, like an internet shop. Normally your product pages are at the lowest level of the architecture. Imagine you have a shop with 15,000 products, and each of the pages has random links to second-level pages or the first page, simply to show the user some related products or products to use together, e.g. a digicam and a smart card. (Remember: we should build pages for users, not for SEs, and that is in fact a good thing for users.) Is it spam, because you are linking back up the architecture?
The target pages gain a lot of PR from that!
Please don't tell me that doesn't work. I always wondered why the one "Basket" page had a higher PageRank than all the other pages. All the product pages pointed to that page, and those were the only links it had.
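The "Basket page gets high PR" observation is exactly what the classic PageRank formula predicts when thousands of pages all link to one target. A toy illustration follows: a simplified PageRank iteration on a made-up graph, with invented page names and sizes, not Google's actual algorithm:

```python
def pagerank(links, iters=50, d=0.85):
    """Simplified PageRank. links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start with uniform rank
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}  # baseline (1-d)/N for every page
        for p, outs in links.items():
            if not outs:
                continue                       # dangling page passes nothing on
            share = d * pr[p] / len(outs)      # rank split evenly over outlinks
            for q in outs:
                new[q] += share
        pr = new
    return pr

# 100 product pages that each link only to "basket":
graph = {'product%d' % i: ['basket'] for i in range(100)}
graph['basket'] = []  # basket links nowhere
ranks = pagerank(graph)
# basket ends up with far more PR than any single product page
```

With 15,000 product pages instead of 100, the effect is correspondingly larger, which matches the basket-page observation above.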
OK, in the absence of any coherent information about what's causing the problem of my pages disappearing from the Google index, here are a couple of (only slightly paranoid/conspiracy) theories:
1. Because I advertise my site with Adwords, a large number of MFA sites have managed to get pages indexed with Google that link to my site (through the cached Adwords ad). This makes it look like I've been hanging out in bad link neighborhoods.
Am I being penalized by Google for being a good Adwords customer?
2. A slight variation of (1): A good inbound link to my site appears on a well established authority site. This page has been scraped numerous times by MFAs, which repeat the link as well as the surrounding copy.
Am I being penalized because I have multiple inbound links from MFA pages with duplicate copy?
Kind of makes you wonder whether it's really worth writing original, useful pages for actual readers, since it's impossible to attract any of those readers.
It seems everyone I talk to is suffering from this problem.
My sites are basically being deindexed. Every day for the past 2 weeks, about 5-10k pages have been dropped. I am now at about 10% of where I was at the beginning of April in terms of pages indexed. A friend of mine runs a small niche site that ranked number 1 for its keyword for 3 years. It's now number 7 on some DCs, without the title of the site showing, and on other DCs it's 21 with the title showing. I also see my sites being displayed without a title or description, but only on some DCs. The strange thing is that the cache date is May 7th.
|Am I being penalized because I have multiple inbound links from MFA pages with duplicate copy? |
That's been a theory for some time: links outside a webmaster's control causing bad-neighborhood associations. Yes, the association is believed to be passed on from the scraper sites, and there's nothing you can do. Ever try to contact a scraper site?
One belief is that associations of incoming links can make or break a site's rankings; again, the conspiracy theory that no one can hurt another site's rankings comes into play here...
It's common practice for scrapers to link to good-quality sites, so there can be many arguments for or against this theory. It would be pretty stupid to penalize sites for MFA page links that webmasters obviously have no control over, but the algorithm would suggest that these links do come into play, since it's all automated.
Clint, i once again agree with you.
It has to be about backlinks. The duplicate-filter theory is bogus, given the massive page drops on relevant sites. I am strongly anti-dup-content, and I am still losing pages on new sites.
EVERYONE: Are the sites/domains that are dropping pages fairly new?
Have you used link directories as your main source of link development?
I've been wondering what you guys are talking about regarding pages disappearing from the index and today, for kicks and out of curiosity, I did a check on my sites.
Guess what? I've joined your club. Gone from 150 pages or more to roughly 50 pages.
The site is quite new. Less than 1 year old, launched in August last year so close to 9 months old.
Some directory links of course but others as well have been obtained.
>EVERYONE: Are the sites/domains that are dropping pages fairly new?
>>Have you used link directories as your main source of link development?
Nope, mostly one-way links from friends.
But a friend who is experiencing the same problems has a site that is 3 years old; I don't know about his links, though.
Do you know if your friend has changed his registrar info recently? If so, the domain can be seen as new again.
Plus, even if you don't get directory links, that doesn't mean there isn't a web of directories linking to your friends' sites, and therefore the links from their sites are not as powerful anymore.
[edited by: Relevancy at 5:31 pm (utc) on May 9, 2006]
My site is 16 months old--all hand-written, no dupes, no re-directs, nothing wrong I can think of. Pages started peeling away about a month ago. The dropped pages were all added within about a three-month period starting in December, but not all pages added during this period were dropped. Pages that link to my home page were not dropped. On some servers, they've begun coming back. My default server now shows 167 pages out of 789 (including forum posts). The "new results" servers mentioned earlier claim to have 789 results but I can only get to about 165 of them and the last few are supplemental. Traffic has declined only slightly but it should be shooting up at this time of the year. This is such a disheartening mess. I want to add new content but what's the point if nothing is going to show up.
For the past few weeks we had loads of supplemental results, G listed old pages that had either 301 redirects or 404 error pages.
As of today G got rid of our supplemental results.
However, they are not indexing most of our pages.
Whichever way I do site: (with or without www), it comes up with 24 pages (as of today).
Now that G has got rid of the supplementals, should I expect them to start indexing more pages of our site?
All my pages that are going are new ones within the last 4 months too
>Do you know if your friend has changed his registar info recently? If so the domain can then again be seen as newish
nothing has changed
>The dropped pages were all added within about a three-month period starting in December
That's about the only thing I'm picking up from this: the dropped pages are pages that were added over the last 4 months, and that applies to both our sites. It's almost like anything newish has been dropped.
Which shows it could be a rollback to pre-March '06, as discussed in the Big Daddy thread; my missing pages are also approx. 4-6 months old. I have seen an improvement over the last 3 days in pages indexed. Anyone else getting any back?
I hear some sites are only now starting to lose pages; mine have been gone for almost 2 months, and have been up and down about 5% each way over the last month. Currently on a high of 432; last week it was 375; it should be 990-ish.
Seems like a staged thing to me: our moans and groans have been the same, just at different times over the last 2 months or so.
I have a 2 year old site and most of my deeper pages have not been indexed since the start of BD (lost about 1800 pages). Oddly though, it ranks better than ever for the pages that are indexed!
Another site, 8 months old, pages gone from 60,000 down to 252. Pages are coming back at a rate of 10 a week...at this rate I might have to get a job!
As I don't want to get a job I have been doing some testing. I find that I can get small new sites, to rank well within a week and stay there. OK I would have to make a lot of new sites to get over this Google indexing problem but I've got to do something. How can new sites have pages indexed before established sites? It just doesn't make any sense.
Come on Google, sort it out.
Just wanted to add that although my main site ranks better than ever (with fewer pages), my home page, which used to be re-indexed every day, now only gets re-indexed every two weeks! Something is definitely VERY wrong here!
Someone mentioned this before, but it could be a sandbox type of thing - "sandbox" being the delay in calculating (or, in this case, recalculating) new pages since the starting checkpoint.
If that's the case, hopefully they're doing it quickly!
Not only do some sites have fewer pages appearing in the index (these are the "experimental" and "cleanup" datacentres, as far as I can tell), but some sites are also falling victim to Google showing the same snippet for every page of the site in a site: search (instead of showing the meta description, or whatever), and hence getting a result like "1 to 3 of about x,000". Previously a result like that would be an indication of a duplicate content problem, but in this case I guess Google is just working from old data for the snippet. It has long been apparent that the data for indexing and ranking, for the snippet, and for the cache itself all come from separate databases.
The "same snippet for every page" appears to be happening in most (if not all) datacentres.
Interesting. I've had about 20k pages reindexed on half of the DCs. Numbers are actually going up, something that hasn't happened in about 4 weeks.
I also once saw, while doing a search, a section on the left side breaking down where my pages were indexed. It had Web, Froogle, Images, etc., with status bars showing where the majority of my pages were indexed. I saw this once but couldn't recreate it.
|Not only are some sites having less pages appear in the index (these are the "experimental" and "cleanup" datacentres as far as I can tell) |
Just to be crystal clear: the missing-pages problem is across all datacentres. That is, sites that are affected by the bug see 95%+ of their pages dropped from Google's index on all datacentres (obviously with the usual slight variations from DC to DC).