Forum Moderators: Robert Charlton & goodroi


Pages Dropping Out of Big Daddy Index


GoogleGuy

6:11 am on Apr 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Continued from: [webmasterworld.com...]


One thing to bear in mind is that Bigdaddy will have different crawl priorities. That can account for some of it. If you've run into any spam problems in the past, you might also want to do a reinclusion request. Otherwise, please send an email to bostonpubcon2006 at gmail.com with the subject line "crawlpages" (all one word), and I'll ask someone to see if they notice any commonalities.

tigger

6:56 am on Apr 30, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's some drop! At least I'm still holding onto about 30% of my site, although the DC I'm hitting right now seems to be showing slightly fewer pages.

reseller

7:52 am on Apr 30, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



youfoundjake

How about this one?

[64.233.161.104...]

Armi

1:41 pm on Apr 30, 2006 (gmt 0)

10+ Year Member



I had some improvements last week, but today, it went strongly downhill again.

However, it is very different on the different datacenters.

wanderingmind

6:06 am on May 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just to chip in -- I haven't lost pages, but both my sites are frozen solid. One had its last homepage refresh on March 29, and the other on April 22. New pages on both have been crawled at one time or another - and vanished into nothingness. The crazy thing is, I am aware how easily I could be among those who lost pages without knowing why.

tigger

6:40 am on May 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think we all know why: the new BD bot has a problem, and the sooner G sorts it out the better. I'm sure they must be aware of this - heck, if we are seeing sites losing pages on G, they must be seeing their DB getting smaller and wondering why?

Relevancy

7:05 am on May 1, 2006 (gmt 0)

10+ Year Member



One thing I noticed is that the sites I control that have participated in link-directory link building are the ones losing pages. Does this mean G would penalize for link building? No, but... this could mean that they killed off all the crap link-directory power, so the indexing gain once received is no longer there, and the pages are starting to drop until you have enough good link credit to get more pages indexed.

A great way to say: get trusted links or we don't care about indexing your site any further. This kills "evil" link building, but hurts relevancy (no pun intended).

wanderingmind

12:53 pm on May 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Relevancy,

I am not losing pages - just no new pages getting crawled at all.

And definitely no artificial link building - the nature of the site is such that natural links come by easily.

Sites with link directories may be facing the same problem, but to a greater degree perhaps...

cbartow

1:05 pm on May 1, 2006 (gmt 0)

10+ Year Member



Relevancy:

I thought the same thing, which seems like a good theory.

But the issue I have now is that Googlebot has been SLAMMING my sites since last week, yet none of it makes it into the index. Whether it's old pages being re-crawled or new pages crawled for the first time, they don't show up.

trinorthlighting

1:07 pm on May 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have lost 50% of my pages..

Porter5Forces

4:26 pm on May 1, 2006 (gmt 0)

10+ Year Member



A question: does link tagging on services like Delicious, Shadows, etc. also get categorised as "directory linking"? Will Google penalise sites that tag themselves?

old_expat

5:16 am on May 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



6-year-old original content site, from 115 pages indexed down to 4.

Incoming links shown as 10, down from 200+.

All content pages, all static, all unique content, 1 free tool download... 80% is personally written original content; the balance was written by a third party and checked via Copyscape. The site is linked to by 3 universities, from departments related to the topic.

allinurl:www.mysite.com/ changed from several thousand to zero!

texasville

2:26 pm on May 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, following the DC watch threads and seeing what certain posters have added in so many major threads on WW for the past few weeks, particularly things that g1smd has observed, I added in the changes and metamorphosis my site has made over the past 12 months, and I think I now perceive what Google is up to.
The changes occurring are definitely being done to resolve the piracy and canonicalization issues.
It was g1smd's observations about the two different caches that did it. Google is using Wayback-style info to determine where a page originated and, if it has been duplicated, where it originally came from. And in the future it will only be shown once, on the original site.
Watching the results for a site:mysite search over the past few weeks shows the pages disappearing and reappearing (as supplemental) in almost the same order as I created and added them. Some pages that disappeared completely were almost identical in URL to pages I deleted in the Google console; the only differences were spaces between words.
In short, the hammering, supplementaling, and deindexing have almost exactly followed my learning curve as a webmaster.
And I bet if people look at their sites and study them, all the pages that have disappeared were added after or before a certain date. Or pages that have gone supplemental were changed at a certain time, such as changing meta description and keyword tags because they were identical on all those pages.
Or sites that went through a 301 redirect that wasn't correct, or was corrected after a while.
Currently on my default Google, all my pages still in the index are supplemental except for the main page. I wonder if this reflects the period when I asked my host to redirect non-www to www with a 301 and they did a 302 through ignorance. Two months later, I discovered this and had them correct it to a 301.

I have some basis for this observation from watching my competitors. One is a shady, shady site that has changed only a couple of pages in two years. Both of those pages are now supplemental. None of their other pages have been touched by either them or Big Daddy. They are still doing fine.
I have a site I launched 7 months ago that I have made NO changes to since it went live, except for correcting a couple of meta tags on two pages. Those pages went briefly supplemental but are now back to normal, and the entire site is virtually untouched. All the learning curve was on the main site.
I am not saying that it's all my fault. I am just saying that Google is indexing, reindexing, and changing my site in almost exactly the same order as the changes I myself made to it.
The questions are:
When will G be finished? And will my site be normal again and fully indexed?
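Editor's aside: the 301-vs-302 mix-up described above matters because the two status codes tell a crawler different things. The sketch below is a minimal, hypothetical illustration of how a crawler might pick which URL to keep; the hostnames are placeholders and this is not Google's actual logic.

```python
# Illustrative sketch: how a crawler might treat a 301 vs a 302 when
# deciding which URL to keep in its index. Hostnames are placeholders.

def url_to_index(status: int, requested: str, target: str) -> str:
    """Return the URL a crawler would treat as canonical after a redirect.

    A 301 (permanent) says the page has moved for good, so the crawler
    should index the target. A 302 (temporary) says the move may be
    undone, so the originally requested URL is kept -- which is why a
    mistaken 302 on a non-www -> www redirect can split a site's indexing.
    """
    if status == 301:
        return target
    if status in (302, 307):
        return requested
    raise ValueError(f"not a redirect status: {status}")

# A correct permanent redirect consolidates on the www host:
print(url_to_index(301, "http://example.com/", "http://www.example.com/"))
# A temporary redirect leaves the non-www URL in place:
print(url_to_index(302, "http://example.com/", "http://www.example.com/"))
```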

g1smd

2:38 pm on May 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I am aware of a 20 000 page osCommerce site that used to show as 60 000 pages in the old Google index last year. This was probably a combination of duplicate content (the same page reachable by more than one dynamic URL), as well as mis-reported (inflated) numbers by Google.

In BigDaddy the site shows only 3500 pages for the last few months, but on 72.14.207.99 and related DCs it is now down to less than 30 pages.

The URL structure was also changed in the middle of last year, and Google doesn't seem to want to index any of it now.
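Editor's aside: the "same page reachable by more than one dynamic URL" problem above is why an osCommerce site's page count can balloon. A rough sketch of how duplicate dynamic URLs collapse once session IDs are stripped and query parameters are sorted (`osCsid` is osCommerce's session parameter; treat the other names and URLs as examples):

```python
# Sketch: stripping session IDs and sorting query parameters collapses
# dynamic URL variants of the same page onto one canonical key.
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

SESSION_PARAMS = {"osCsid", "sid", "PHPSESSID"}

def canonical_key(url: str) -> str:
    parts = urlsplit(url)
    # Drop session parameters, then sort the rest so order doesn't matter.
    params = sorted((k, v) for k, v in parse_qsl(parts.query)
                    if k not in SESSION_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc.lower(),
                       parts.path, urlencode(params), ""))

variants = [
    "http://www.example.com/product_info.php?products_id=42&osCsid=abc123",
    "http://WWW.example.com/product_info.php?osCsid=zzz&products_id=42",
]
print({canonical_key(u) for u in variants})  # one key, not two
```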

.

Yes, maybe Google has de-indexed one set of URLs, and would add the other set back in again, but as they no longer exist, they cannot do that.

It is possible, but surely that many PhDs would have spotted that flaw and already designed it out, no?

[edited by: g1smd at 2:53 pm (utc) on May 2, 2006]

tedster

2:47 pm on May 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In regard to the two questions that end your post -- well, you might take them to a crystal ball reader, because we just cannot know. However, your observations are quite suggestive to me.

Google is indexing and reindexing and changing my site almost
identical to the changes in order I myself made to the site.

If this pattern is widely observed, it could well mean that Google is using Big Daddy to integrate a large amount of historical data so they can better use the approaches outlined in their 2005 patent - Information retrieval based on historical data [appft1.uspto.gov].

And that would indeed be a HUGE undertaking, well worthy of the name "Big Daddy." Time will tell - but I'd love to hear from anyone who also sees (or can refute) the observation.

ClintFC

4:03 pm on May 2, 2006 (gmt 0)

10+ Year Member



I'm afraid my own experience would refute this theory, at least to some degree. There is definitely a historical element to this - BigDaddy certainly seems to have been kick-started from an index built sometime last year. However, as to the madness of pages appearing and disappearing at the moment, I do not believe that this is following some historical path.

I can get any page I want into the Google index within 24 hours by simply linking to it from our Home page (PR5). If I take the link out again the page is deleted from Google's index within a couple of days.

My own conclusion from this is that Google has introduced some kind of filter/pruning mechanism that is deleting any pages it deems "unworthy" because of the PR of the pages that link to them. My guess is that this pruning mechanism was designed to remove dead links, but instead is running amok and deleting anything without relatively high-PR links to it. In my case an internal link from a PR5 page seems to be sufficient, whereas an internal link from a PR4 page is not.

So, if I'm right, the problem is related to depth and PR. This would certainly explain the sheer number of dropped pages. Affected sites do not seem to be losing a few pages here and there; instead they are losing 95%+ of their pages (because most well-designed web sites have a hierarchical structure that fans out with depth).
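Editor's aside: the pruning mechanism described above is speculation, but the arithmetic behind "95%+ of pages" is easy to illustrate. This toy model (all numbers made up) assumes PR drops one point per level below the home page and pages survive only if their parent clears a threshold:

```python
# Toy illustration of the speculation above: if pages are kept only when
# linked from a page of sufficient PR, a hierarchical site that fans out
# with depth loses most of its pages. All numbers are invented.

def kept_pages(depth: int, fanout: int, home_pr: int, min_pr: int) -> int:
    """Count pages surviving a PR threshold, assuming PR falls one
    point per level below the home page."""
    kept = 0
    for level in range(depth + 1):
        parent_pr = home_pr - (level - 1) if level > 0 else None
        # The home page is always kept; children survive if their
        # parent's PR meets the threshold.
        if level == 0 or parent_pr >= min_pr:
            kept += fanout ** level
    return kept

total = sum(10 ** d for d in range(4))              # 4 levels, fanout 10 -> 1111 pages
surviving = kept_pages(3, 10, home_pr=5, min_pr=5)  # only the home page clears the bar
print(total, surviving)  # most of the site is pruned
```

With these made-up numbers, only the home page and its direct children survive: 11 of 1111 pages, i.e. a 99% loss, consistent with the "95%+" pattern described.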

It is possible that Google is doing this on purpose, in which case Google's index - and hence, effectively, "the Web as most people know it" - is set to become a whole lot smaller in the coming weeks. Hopefully they are not this misguided, and instead it is just a horrendous bug. The evidence against this being a bug is that you would have expected them to fix it by now, given that the bug was clearly introduced with Big Daddy. The fact that the problem seems to be getting worse rather than better suggests to me that Google remains oblivious to the problem.

Anyway...that's just my 2 cents worth.

Atomic

4:12 pm on May 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I was really worried for a while but most of my sites are beginning to recover in the indexes.

tigger

4:12 pm on May 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>suggests to me that Google remains oblivious to the problem

I can't see how G is unaware of this problem; surely they must be seeing their own DB reducing in size as thousands of pages get dumped. Or am I just not understanding the way a large SE works?

LuckyGuy

4:22 pm on May 2, 2006 (gmt 0)

10+ Year Member



"The evidence against this being a bug is that you would have expected them to fix it by now - given that the bug was clearly introduced with Big Daddy. "

But if it's not a bug, wouldn't Matt Cutts already have praised this however-unbelievable new progress in search engine technics? Every time something new comes up at Google, he gives us his two pence - but now? Nothing. I hope it's a bug!

Freedom

4:33 pm on May 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The fact that he hasn't said anything, and that the whole thing is seemingly being downplayed or ignored, is a strong indicator that this is another Google FUBAR (bug) that they are just trying to figure out.

Cutts is purposely ignoring this problem on his blog because it's just such a huge screw-up they weren't expecting. Until this problem came along, he was chatting away about the new crawler method. After the problem became obvious and webmasters pushed him into a corner to make the Original Post comment, he hasn't said a thing about either the new spider or the bug.

Their silence on the matter speaks volumes.

g1smd

4:36 pm on May 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



NO, the database doesn't necessarily get any smaller. Google keeps a copy of every page they have ever seen, but filtering means that they don't show all of them in the results at any one time.

For instance, when you use the Google URL Removal Tool the URLs remain in their database, and are merely hidden from search results for 3 or 6 months. As soon as the time is up, the pages reappear in the SERPs even if the pages, or even the whole site, no longer exist at all.

Another example: when a robots.txt file was deleted from a particular site last month, Google added 23 000 previously disallowed pages back into their index within just a few days. All of them now had a cache date from January 2004. That data must have been IN their database for all of that time, but just marked as "do not show this" due to the robots exclusion that was applied. As soon as it was no longer marked as "do not show this", they started showing it again. As soon as the pages were noticed in the SERPs, the robots file was added back into the site. Within 48 hours of adding the robots file back on the site, Google pulled all of the disallowed URLs from the SERPs again - but they are all still in their database, and will probably remain there forever.
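Editor's aside: the robots.txt behaviour described above - a Disallow rule hiding URLs from a compliant crawler, and its removal exposing them again - can be sketched with Python's standard robots-exclusion parser. The file contents and URLs here are invented examples:

```python
# Sketch of robots exclusion: a Disallow rule blocks a compliant
# crawler, and removing it allows the same URLs again -- the
# underlying pages never changed.
from urllib.robotparser import RobotFileParser

def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

blocked = "User-agent: *\nDisallow: /private/\n"
open_txt = "User-agent: *\nDisallow:\n"   # empty Disallow = allow everything

print(can_crawl(blocked, "http://www.example.com/private/page.html"))   # False
print(can_crawl(open_txt, "http://www.example.com/private/page.html"))  # True
```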

[edited by: g1smd at 4:37 pm (utc) on May 2, 2006]

tigger

4:37 pm on May 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Let's hope so, Freedom - but that does make sense.

thanks g1smd

[edited by: tigger at 4:45 pm (utc) on May 2, 2006]

Atomic

4:45 pm on May 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I agree with Freedom, too. What's happening can't be by design.

texasville

5:17 pm on May 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>>>I can get any page I want into the Google index within 24 hours by simply linking to it from our Home page (PR5). If I take the link out again the page is deleted from Google's index within a couple of days. <<<

Actually, that would fit. And it isn't deleted - it's just not being shown. The pruning is definitely happening with the "shown index".
When you do a search on most sites, it will return "150 of 475 pages from www.mysite", but if you click through to the last page and click the omitted results, you will see Google only returns a final 220 pages or so. The rest they have in their data but just won't show you.
And I would bet that Google isn't saying anything in order to prevent gaming. I think they would rather let a few webmasters hate them and let this anti-Google sentiment flourish than give up what they are really doing. And believe me, it's hard for me to admit this. I have gotten to really despise them for what they have evidently done to my site. Hopefully it will all shake out in the end. And scrapers might be gone forever.
And one more thing. Remember the first sites that got hit by this? dmoz clone directories.

tedster

5:20 pm on May 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My current assumption is that what we see is by design (mostly) -- but it's a long arc design that sometimes makes short term troubles. We tend to look at this moment's results as "what Google intends" but what they intend more than any one moment's results is where they are headed, as they actually USE their new infrastructure.

Since this morning, when I posted above about historical data, I've been thinking more and more about that possibility. How would you go about making a huge amount of already sharded historical data available for real time use? I think we're seeing exactly that, and it sometimes generates odd symptoms on the visible end.

Based on Google's reported 1Q financial outlay for new equipment, there's been some speculation about exactly what the new Big Daddy infrastructure entails. Somewhere between 100,000 and 200,000 new boxes was one educated guess.

It's still very hard to decide whether to respond to any changes we see in this moment. In fact, we know that Google does NOT want us to respond, they want us to simply build a website for visitors, not them. But when the data shows major causes for concern, it's both hard to sit and wait, and hard to know what if anything might help.

calculated

5:32 pm on May 2, 2006 (gmt 0)

10+ Year Member




In fact, we know that Google does NOT want us to respond, they want us to simply build a website for visitors, not them.

Well, they are being supremely naive if they really think this is going to happen. This is business, as they well know.

Relevancy

6:38 pm on May 2, 2006 (gmt 0)

10+ Year Member



So the bottom line is: we think Google is doing some major clean-up of duplicate pages in their index, but in the process they have cleaned out a lot of quality pages.

The question is: will they re-index what was dropped, will those pages be lost in the Google archives madness forever, or will we have to rename these older pages to get them back?

LuckyGuy

6:41 pm on May 2, 2006 (gmt 0)

10+ Year Member



"Since this morning, when I posted above about historical data, I've been thinking more and more about that possibility. How would you go about making a huge amount of already sharded historical data available for real time use? I think we're seeing exactly that, and it sometimes generates odd symptoms on the visible end. "

What does historical data have to do with the disappearance of the 3rd-level pages? A company with reserves of millions of dollars should do better testing! And did you notice that the SERPs didn't get better? Yahoo and MSN are doing better in relevancy terms, just because they still have these 3rd-level pages in their index. I don't know what Google intends, but since October '05 they have been heading down the wrong way, called Big Daddy.

Edit: is there anyone here who can tell us that his pages didn't drop out of the index?

mt1955

7:32 pm on May 2, 2006 (gmt 0)

10+ Year Member



Today I am seeing an upswing in page impressions from a lot of my older pages, some of which hadn't been visited by anyone for weeks. Anyone else notice anything similar?

I've been following this thread daily, but to be honest the net result is still confusing to me. Just to clarify: is it the consensus that one attribute of Big Daddy is that, in some cases, older pages are appearing (or appearing higher?) in the search results than would have been the case in previous versions?

Joy6320

9:21 pm on May 2, 2006 (gmt 0)

10+ Year Member



I have a small site with around 35 pages. It went down to 13 indexed pages with many supplementals; it's now up to 19, including several that were supplemental. There is no order to the dates when the pages that went supplemental were created, nor is there any order in the dates for the pages that went supplemental and are now back in the index. I can say, however, that my pages are coming back only 2 at a time, and Googlebot appears to be crawling my site only 2 pages at a time over many weeks. Is it possible that, if many sites' pages were lost by Google, the bot would not crawl deep, so that the most important pages of a site would be reindexed first?

dazzlindonna

9:24 pm on May 2, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is there anyone here who can tell us that his pages didn't drop out of the index?

LuckyGuy, none of my pages have dropped out of the index (approx. 30 sites).

This 254-message thread spans 9 pages.