Great way to say: get trusted links or we don't care about indexing your site any further. This kills "evil" link building, but it hurts relevancy (no pun intended).
incoming links shown as 10, down from 200+
All content pages, all static pages, all unique content, one free tool download. About 80% is personally written original content; the balance was written by a third party and checked via Copyscape. The site is linked to by three universities, from departments related to the topic.
allinurl:www.mysite.com/ changed from several thousand to zero!
I have some basis for this observation from watching my competitors. One is a shady, shady site that has changed only a couple of pages in two years. Both of those pages are now supplemental. None of their other pages have been touched by either them or Big Daddy. They are still doing fine.
I have a site I launched 7 months ago that I have made NO changes to since it went live, except for correcting a couple of meta tags on two pages. Those pages went briefly supplemental but are now back to normal, and the entire site is virtually untouched. All the learning curve was on the main site.
I am not saying that it's all my fault. I am just saying that Google is indexing, reindexing, and changing my site in almost exactly the same order as the changes I myself made to it.
The questions are:
When will Google be finished? And will my site be normal again and fully indexed?
In Big Daddy the site has shown only 3,500 pages for the last few months, but on 72.14.207.99 and related DCs it is now down to fewer than 30 pages.
The URL structure was also changed in the middle of last year, and Google doesn't seem to want to index any of it now.
Yes, maybe Google has de-indexed one set of URLs and intends to add the other set back in again, but as those URLs no longer exist, they cannot do that.
It is possible, but surely that many PhDs would have spotted that flaw and already designed it out, no?
Google is indexing, reindexing, and changing my site in almost exactly the same order as the changes I myself made to it.
If this pattern is widely observed, it could well mean that Google is using Big Daddy to integrate a large amount of historical data so they can better use the approaches outlined in their 2005 patent - Information retrieval based on historical data [appft1.uspto.gov].
And that would indeed be a HUGE undertaking, well worthy of the name "Big Daddy." Time will tell - but I'd love to hear from anyone who also sees (or can refute) the observation.
I can get any page I want into the Google index within 24 hours by simply linking to it from our Home page (PR5). If I take the link out again the page is deleted from Google's index within a couple of days.
My own conclusion from this is that Google have introduced some kind of filter/pruning mechanism that deletes any page it deems "unworthy" because of the PR of the pages linking to it. My guess is that this pruning mechanism was designed to remove dead links, but instead it is running amok and deleting anything without relatively high-PR links pointing to it. In my case an internal link from a PR5 page seems to be sufficient, whereas an internal link from a PR4 page is not.
So, if I'm right, the problem is related to depth and PR, which would certainly explain the sheer number of dropped pages. Affected sites do not seem to be losing a few pages here and there; instead they are losing 95%+ of their pages (because most well-designed web sites have a hierarchical structure that fans out with depth).
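To make the idea concrete, here is a toy model (Python) of the pruning rule I'm hypothesising. The PR threshold of 5 and the site structure are pure assumptions on my part, chosen only to match what I see on my own site:

```python
# Toy model of the hypothesised pruning rule: a page survives only if
# some page with PR >= THRESHOLD links to it (or it has that PR itself).
# The threshold of 5 and the site structure below are my assumptions.

THRESHOLD = 5  # hypothetical: a PR5 link keeps a page in; a PR4 link does not

# site: {url: (toolbar_pr, [urls it links to])}
site = {
    "/":             (5, ["/cat-a", "/cat-b"]),
    "/cat-a":        (4, ["/cat-a/page-1", "/cat-a/page-2"]),
    "/cat-b":        (4, ["/cat-b/page-1", "/cat-b/page-2"]),
    "/cat-a/page-1": (3, []),
    "/cat-a/page-2": (3, []),
    "/cat-b/page-1": (3, []),
    "/cat-b/page-2": (3, []),
}

def surviving_pages(site):
    """Pages kept by the hypothetical PR-threshold rule."""
    kept = set()
    for url, (pr, outlinks) in site.items():
        if pr >= THRESHOLD:
            kept.add(url)          # a high-PR page survives on its own
            kept.update(outlinks)  # ...and keeps everything it links to
    return kept

print(sorted(surviving_pages(site)))
# ['/', '/cat-a', '/cat-b'] -- 3 of 7 pages survive; everything below
# the first level is gone, and the loss grows with each level of fan-out.
```

On a site that fans out ten-fold per level, a rule like this wipes out nearly everything below the level the high-PR links reach, which would match the 95%+ losses people are reporting.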
It is possible that Google are doing this on purpose, in which case Google's index, and hence effectively "the Web as most people know it", is set to become a whole lot smaller in the coming weeks. Hopefully they are not that misguided, and instead it is just a horrendous bug. The evidence against it being a bug is that you would have expected them to fix it by now, given that the bug was clearly introduced with Big Daddy. The fact that the problem seems to be getting worse rather than better suggests to me that Google remain oblivious to the problem.
Anyway...that's just my 2 cents worth.
But if it's not a bug, then wouldn't Matt Cutts already have praised this new, however unbelievable, progress in search engine technology? Every time something new happens at Google, he gives us his two pence, but now? Nothing. I hope it's a bug!
Cutts is purposely ignoring this problem on his blog because it's just such a huge screw-up that they weren't expecting. Until this problem came along, he was chatting away about the new crawler method. After the problem became obvious and webmasters pushed him into a corner to make the Original Post comment, he hasn't said a thing about either the new spider or the bug.
Their silence on the matter speaks volumes.
For instance, when you use the Google URL Removal Tool, the URLs remain in their database and are merely hidden from search results for 3 or 6 months. As soon as the time is up, the pages reappear in the SERPs, even if the pages, or the whole site, no longer exist at all.
Another example: when a robots.txt file was deleted from a particular site last month, Google added 23,000 previously disallowed pages back into their index within just a few days. All of them had a cache date from January 2004. That data must have been in their database for all of that time, just marked "do not show this" due to the robots exclusion that was applied. Once it was no longer marked that way, they started showing it again. As soon as the pages were noticed in the SERPs, the robots.txt file was added back to the site, and within 48 hours Google pulled all of the disallowed URLs from the SERPs again. But they are all still in the database, and will probably remain there forever.
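If that's right, each index record presumably carries something like a "hide until" flag rather than ever being deleted. A minimal sketch in Python, with the field names and the expiry window entirely my invention, based only on the behaviour observed above:

```python
from datetime import datetime, timedelta

# Hypothetical "hide, don't delete" index record. The field names and
# the 6-month window are assumptions, not anything Google has confirmed.

class IndexEntry:
    def __init__(self, url, cache_date):
        self.url = url
        self.cache_date = cache_date    # frozen while the page is hidden
        self.suppressed_until = None    # None = eligible to show in SERPs

    def suppress(self, months=6):
        """URL Removal Tool / robots.txt disallow: flag, don't purge."""
        self.suppressed_until = datetime.utcnow() + timedelta(days=30 * months)

    def visible(self, now=None):
        now = now or datetime.utcnow()
        return self.suppressed_until is None or now >= self.suppressed_until

entry = IndexEntry("http://example.com/old-page", datetime(2004, 1, 15))
entry.suppress()
print(entry.visible())  # False -- hidden from the SERPs, not deleted
# When the flag expires, the January 2004 record reappears exactly as
# stored -- old cache date and all -- without any re-crawl.
```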
Actually, that would fit. And it isn't deleted. It's just not being shown. The pruning is definitely happening with the "shown index".
When you do a search on most sites, it will return something like "150 of 475 pages from www.mysite", but if you click through to the last page and then repeat the search with the omitted results included, you will see Google only returns a final 220 pages or so. The rest they have in their data, but they just won't show it to you.
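In other words, the headline count and the visible list seem to come from different layers. A toy illustration in Python, with all the cut-offs made up purely to match the numbers above:

```python
# Toy model of the three numbers you see: "475" reported, ~150 shown at
# first, ~220 reachable via "omitted results". The cut-offs here are
# illustrative; Google has never published how the shown index is sized.

stored = [f"/page-{i}" for i in range(475)]  # full database for the site

def serp(pages, include_omitted=False):
    """A hypothetical 'shown index' filter in front of the database."""
    cap = 220 if include_omitted else 150
    return pages[:cap]

print(f"{len(serp(stored))} of {len(stored)} pages")    # 150 of 475 pages
print(len(serp(stored, include_omitted=True)))          # 220
# The remaining 255 pages drive the "of 475" count but are never
# returned either way -- stored, just not shown.
```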
And I would bet that Google isn't saying anything in order to prevent gaming. I think they would rather let a few webmasters hate them and this anti-Google sentiment flourish than give up what they are really doing. And believe me, it's hard for me to admit this. I have come to really despise them for what they have evidently done to my site. Hopefully it will all shake out in the end. And scrapers might be gone forever.
And one more thing. Remember the first sites that got hit by this? DMOZ clone directories.
Since this morning, when I posted above about historical data, I've been thinking more and more about that possibility. How would you go about making a huge amount of already-sharded historical data available for real-time use? I think we're seeing exactly that, and it sometimes generates odd symptoms on the visible end.
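Purely as a thought experiment, the simplest read path I can imagine for that looks something like this (Python). The shard layout and field names are my guesswork, not anything Google has described:

```python
# Thought experiment: a real-time read path falling through from a live
# shard to a stack of historical shards, newest first. All names and
# structures below are assumptions for illustration only.

live_shard = {"example.com/new-page": {"first_seen": "2006-03"}}

history_shards = [  # newest first
    {"example.com/old-page":     {"first_seen": "2004-01"}},
    {"example.com/ancient-page": {"first_seen": "2002-07"}},
]

def lookup(url):
    """Check the live shard first, then fall through the history stack."""
    if url in live_shard:
        return live_shard[url]
    for shard in history_shards:
        if url in shard:
            return shard[url]  # historical record surfaces unchanged
    return None

print(lookup("example.com/old-page"))  # {'first_seen': '2004-01'}
# While a merge like this is in progress, lookups can surface stale
# records -- which would look like odd symptoms on the visible end.
```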
Based on Google's reported 1Q financial outlay for new equipment, there's been some speculation about exactly what the new Big Daddy infrastructure entails. Somewhere between 100,000 and 200,000 new boxes was one educated guess.
It's still very hard to decide whether to respond to any of the changes we're seeing at the moment. In fact, we know that Google does NOT want us to respond; they want us to simply build a website for visitors, not for them. But when the data shows major cause for concern, it's both hard to sit and wait, and hard to know what, if anything, might help.
The question is: will they re-index what was dropped, will those pages be lost forever in the madness of the Google archives, or will we have to rename these older pages to get them back?
What does historical data have to do with the disappearance of third-level pages? A company with reserves of millions of dollars should do better testing! And did you notice that the SERPs didn't get any better? Yahoo and MSN are doing better in relevancy terms, precisely because they still have these third-level pages in their index. I don't know what Google is trying to do, but since October '05 they have been heading down the wrong road called Big Daddy.
Edit: is there anyone around here who can tell us that their pages didn't drop out of the index?
I've been following this thread daily, but to be honest the net result is still confusing to me. Just to clarify: is it the consensus that one attribute of Big Daddy is that, in some cases, older pages are appearing (or appearing higher?) in the search results than would have been the case in previous versions?