Forum Moderators: Robert Charlton & goodroi


Pages Dropping Out of Big Daddy Index

         

GoogleGuy

6:11 am on Apr 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Continued from: [webmasterworld.com...]


One thing to bear in mind is that Bigdaddy will have different crawl priorities. That can account for some of it. If you've run into any spam problems in the past, you might also want to do a reinclusion request. Otherwise, please send an email to bostonpubcon2006 at gmail.com with the subject line "crawlpages" (all one word), and I'll ask someone to see if they notice any commonalities.

tigger

7:44 am on Apr 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>via the email address that GoogleGuy gave in the first post of this thread

talking of "that" email address anyone had a response

ambiorix_rb

7:47 am on Apr 28, 2006 (gmt 0)

10+ Year Member



10,000 plus now down to 162 and still dropping. Doesn't seem to matter whether they have any similarity or not.

It's been a long time since I posted, but I guess this problem is important enough to mention.
We also suffer from the non-indexing, or indexed pages dropping.

We've created an SEO-clean WordPress site, not over-SEO'd, just clean, with lots of original press releases we get from big aviation companies. We even got into Google News.

The first 3 weeks of February all went well; we got over 16,000 pages indexed. But from March on it all went down the drain... over the last 2 weeks down to 180, and dropping to 144 today.

I don't panic (never have with Google), but I have the strange impression something weird is going on. I didn't send a GG mail (they'll get enough of them after GG's post), but I really hope they'll find a solution...

yves1

7:59 am on Apr 28, 2006 (gmt 0)

10+ Year Member



kiwiwm, I am not seeing any specific issue with my "noarchive" pages. My old ones are still doing well in Google.

For the recent ones (created since mid-March) it's as weird for the "noarchive" pages as for the ones I allow Google to cache: some are indexed, some are not (yet?). And I can't find any reason why. Both indexed and non-indexed pages are the same kind and are linked in a similar way.

The only common point I am seeing is that the pages Google refuses to index all display RSS feeds. So it came to my mind that the indexing problem might have something to do with content similarity.

This still needs to be confirmed with more pages.

cleanup

8:05 am on Apr 28, 2006 (gmt 0)

10+ Year Member



Kiwiwm
" I can honestly say there is a lot of material on the site which can't be found anywhere else on the web "

That implies that the -majority- of your site is not original. No surprise perhaps then if Google found your pages to be -supplemental- to the index.

The whole of my site was written by me, but even that does not protect you from going supplemental if the content is copied/scraped by others.

Blueshadow

8:33 am on Apr 28, 2006 (gmt 0)

10+ Year Member



Google, can you please find out what is going on and fix it, please?

kiwiwm

8:55 am on Apr 28, 2006 (gmt 0)

10+ Year Member



Thanks Tedster & Yves! Guess we'll just have to wait and see.

Cleanup, you've got the wrong end of the stick: the majority of our material is original and/or unique to our site - we haven't scraped it. To give an example of how material can be unique but not original: say there is an article from 1957 in a hard-to-find magazine. We will search high and low to find the magazine, buy it, scan the article and put it online. We didn't write the article in 1957 - we weren't alive in 1957 - but I believe the article should not be "supplemental". There are people who want to see it and cannot find that magazine, and it is not online anywhere but our site, so I would call it unique to our site, though not original. Hope that clarifies.

edit: sorry for going a little off-topic. The thing is, we're not even in the supplementals any more, and I don't believe content is the issue - something else is up!

Aimee

9:01 am on Apr 28, 2006 (gmt 0)

10+ Year Member



No response from that email address.

Steph_R

11:07 am on Apr 28, 2006 (gmt 0)

10+ Year Member



Don't hold your breath either. He offered an email addy after SES in New York, but never replied to anyone's emails. Then he offered an email addy again after Boston. Still no replies. Everyone I know has experienced the same.

Not sure why he announces his email address and offers to help, but then just does nothing. No reply, nothing.

tigger

11:19 am on Apr 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I agree - why post something if they're not at least going to respond in some way, even if it's just an acknowledgement that they will investigate further? After all, it's not as though it was a widely published email address, so they won't have gotten hundreds of emails.

Freedom

11:36 am on Apr 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google/Cutts are being pretty quiet about this. He hasn't mentioned anything on his blog.

My guess: This is a FUBAR that has them befuddled and they are still trying to sort it out, and will be for a long time.

Come back in 6 months and they might have it solved, but will probably hit you with another FUBAR.

internetheaven

1:07 pm on Apr 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



... unless Google has raised their unique content filter to "must have at least 90% unique content" I don't have an explanation.

If it has become 90%, is it that unreasonable? It would be a very effective way of cleaning out pages that aren't genuinely unique.

Yes, it would be unreasonable. Considering the number of web pages talking about any one individual subject, I'd imagine that hundreds of pages could have 70%-90% matching words whilst being totally "writer unique". It's the "give enough monkeys typewriters and one will write Shakespeare" scenario.

Consider 500 news articles on the same story: even though each of those stories may be written by a totally different journalist, with too high a duplication filter the majority of them will not pass. How does Google justify banning 400 unique articles in favour of 100 "more unique" articles? I think Nabeel's comments:

Dropped from 9,000 pages to 74 pages for my main website. All of my other website pages also dropped. I write articles & my content is unique.

justify/emphasise my point. Duplication filters are wrong in my opinion; they arbitrarily remove good content based on nothing more than "some of the words match", when the decision should be left up to the other algorithmic factors of linking, age, authority, layout, etc. Allowing a software program to remove pages from the index based on duplication verges on "negligent censorship" (a new legal term I've just made up), whereby information is being held back in favour of other information without real justification.
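As a rough illustration of the word-overlap scenario internetheaven describes - this is not Google's actual duplicate filter, which is not public, just a naive "percentage of matching words" check - here is a minimal Python sketch. The two example articles are hypothetical.

# Naive word-overlap check, only to illustrate how two independently written
# reports of the same story can look heavily "duplicated" to a filter that
# merely counts matching words. Not Google's algorithm.

def word_overlap(text_a: str, text_b: str) -> float:
    """Fraction of distinct words in text_a that also appear in text_b."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a:
        return 0.0
    return len(words_a & words_b) / len(words_a)

# Two hypothetical wire-style reports of the same event, written independently.
article_1 = "The aviation company announced record quarterly results on Tuesday"
article_2 = "On Tuesday the company announced record results for the quarter"

print(f"overlap: {word_overlap(article_1, article_2):.0%}")   # roughly 78%

Short news items covering identical facts routinely share most of their vocabulary, which is exactly the 70%-90% "writer unique" overlap described above.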

Steph_R

2:29 pm on Apr 28, 2006 (gmt 0)

10+ Year Member



I bet the news agencies are not happy about this either.

For example, if a history-making event occurs then, of course, each responsible news agency is going to write about it. But if those articles have *anything* that is slightly similar, they get dumped from G's index. That can't be good for the public, because one of the advantages of our society is the ability to read many different points of view.

mattg3

2:41 pm on Apr 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Consider 500 news articles on the same story: even though each of those stories may be written by a totally different journalist, with too high a duplication filter the majority of them will not pass. How does Google justify banning 400 unique articles in favour of 100 "more unique" articles?

Thoughts about duplicate content:

In my country many sites buy news content from dpa [the German press agency].

We have, or had, links from dpa reporting about us. This is somehow also duplicate content. I wonder if these inbound links are now discounted from counting towards your PR.

Or let's take the offline world: a local radio station here buys national news from a provider and adds local news. If they had only local news they would have a lot fewer listeners, so they are forced to carry some form of duplicate content.

Today you'd essentially have to run some Wikipedia clone by default and add your own stuff, so you don't get outcompeted by random wild editing that often makes no sense whatsoever.

Yesterday I typed a species name into G and came across Wikispecies on page 1, which has a wee picture of the animal plus some nomenclature, that is all.

Since the nomenclature is basically a tree [with much information repeated], it very much resembles a spammer site:

Regnum: Animalia
Subregnum: Metazoa
Superphylum: Bilateria: Deuterostomia
Phylum: Chordata
Subphylum: Vertebrata
Classis: Mammalia
Subclassis: Theria
Ordo: Perissodactyla

Something like this is repeated on every page, mostly with a picture of a lion.

But it's Wikipedia in some form, so apparently that's why it seems to be great.

site:species.wikipedia.org = 395,000 on g.co.uk

So why is this spam-like tree still there - because Wikipedia can't do wrong? Strange.

mariella

2:07 pm on Apr 28, 2006 (gmt 0)

10+ Year Member



Hi,

I am the webmaster for a relatively new website (launched in September 2005) that, up until mid-March, was correctly indexed in Google. Over the course of the last few weeks I have noticed the following:

- Most of our pages have disappeared. NB: our website has slightly more than 400 pages, and all were correctly indexed up to mid-March. Over the course of the past few weeks the number of pages indexed has declined. Today only 37 pages are listed on data center [64.233.167.104...] I have singled out this center for two reasons: 1. All of the other data centers are showing a different problem, as set out below. 2. The 360-odd pages no longer indexed on data center [64.233.167.104...] have disappeared from the top 1000 keyword search results regardless of the data center searched.

- Old URLs, notwithstanding a 301 redirect in place since December 2005, are now listed on all data centers except [64.233.167.104...] Again, this change was noticed around mid-March. At first, the listing of the old URLs was somewhat erratic, but for roughly the past 2 weeks they appear to have taken root. I have not cross-checked the 2800 pages across the various data centers, but given my statement in point 2 above I would hazard a guess that only 37 of the 2800 listings shown are correct. (A quick way to double-check the redirects themselves is sketched below.)

- A sharp decline in traffic from Googlebot. The traffic for the month of April was approximately 3000 hits, or roughly half of our monthly average of 6000.

Our site is an original content website. Pages are frequently updated and new pages are slowly but continuously added. We are not a blog or forum and we have not made anything other than cosmetic changes to the site since early December 2005. I promise. I have nothing to gain by an improper description of our problem.

Of note, we deactivated the Google Sitemap generator in early January of this year and reactivated it roughly 1 week ago. The log suggests it is working fine. Finally, we have not seen a similar effect in Yahoo or MSN.

Any suggestions you may have are greatly appreciated.
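Since mariella mentions 301 redirects that appear to be ignored, here is a minimal sketch for double-checking that the old URLs really do answer with a 301 pointing at the new location. The URL pair below is a placeholder; this only verifies server behaviour and says nothing about how Google treats the redirect.

# Check that each old URL returns a 301 with the expected Location header.
# http.client never follows redirects, so we see the raw response.
import http.client
from urllib.parse import urlparse

# Hypothetical old -> new pairs; substitute the real ones from your redirect map.
OLD_TO_NEW = {
    "http://www.example.com/old-page.html": "http://www.example.com/new-page.html",
}

for old_url, expected in OLD_TO_NEW.items():
    parts = urlparse(old_url)
    conn = http.client.HTTPConnection(parts.netloc)
    conn.request("GET", parts.path or "/")
    resp = conn.getresponse()
    location = resp.getheader("Location", "")
    verdict = "OK" if resp.status == 301 and location == expected else "CHECK THIS"
    print(f"{old_url} -> {resp.status} Location: {location} [{verdict}]")
    conn.close()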

Lorel

4:31 pm on Apr 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




Not sure why he announces his email address and offers to help, but then just does nothing. No reply, nothing.

Just because there are no replies forthcoming from the emails doesn't mean they are doing nothing about it.

They asked for emails from anyone affected by this problem, so they are probably trying to figure out what's happening. And as for why they never tell us what happened -- Google is not about to betray their secrets. They will probably fix it and not tell us what went wrong.

I would encourage anyone who is having this problem to write to that email address. The more samples they have to work with, the more likely they are to find the problem.

Steph_R

5:07 pm on Apr 28, 2006 (gmt 0)

10+ Year Member



Yes, maybe. But a "got your email" response would be nice. Instead it seems like a black hole.

Pico_Train

5:11 pm on Apr 28, 2006 (gmt 0)

10+ Year Member



I agree with Lorel. Send your samples. Replying to thousands of "I lost my ranking or pages" emails is not really a productive pastime. Better to spend the time fixing the problem than answering your email just to comfort you.

Sorry if I'm being rude, I don't mean to be. Maybe it's the beers in me already...

tigger

5:18 pm on Apr 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



But this would be assuming thousands of emails were sent in! Considering the email address was only posted here, I can't really see that many webmasters responding to it, other than the odd few who have posted here saying they sent something in.

A simple reply would have been nice - after all, how many times have we supported G when they have asked for spam reports after updates and for testing of algo switches?

Steph_R

5:19 pm on Apr 28, 2006 (gmt 0)

10+ Year Member



I agree -- definitely work on the problem. But if you invite people to email you, common courtesy dictates some type of reply. Not asking for details, just "got it".

I think a beer sounds good at this point. =)

SteveJohnston

5:21 pm on Apr 28, 2006 (gmt 0)

10+ Year Member



>>> the healthier your link development, the more appetite G has for your pages

>> so how can new sites ever get off the ground with the fear of the sandbox & now poor crawling due to low LP. plus this also doesn't explain how PR4/5 sites have dropped 1000's of pages

Hey tigger, sorry for the delay from message 47,

In my experience it's not about high LP when I refer to link development, but simply 'development', as opposed to a blast of artificial links or a stale situation with nothing changing. Small sites often have slow and steady link growth, but I reckon the steady bit is what matters. Regular quality rather than irregular quantity.

Anyway, as you say, it doesn't explain how these sites have dropped 1000s of pages, although it might be a growing influence in Google's quality signals. But then you go and read something like kiwiwm's post and it sounds more like a bug than anything.

Steve

steveb

8:51 pm on Apr 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The new bot is a very dumb bot, and it seems that may be by design. The new bot certainly is no improvement over the old, and the recent comment about how it saves webmasters bandwidth (as if anything more than .01% of webmasters would prioritize that over having a site well crawled) probably reflects that.

Okay so now here is the secret to keeping pages in the Google index...

Ready?

Here it comes...

Delete them.

If you delete them Google will keep all the pages in its index forever as supplemental.

Have an active page online, Google drops it from the index. Delete a page, Google will keep it in its index forever. War is peace. 1984 is here.

Stefan

2:07 am on Apr 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't think it's site specific or anything else. I think it is a meltdown in the d.c.'s. I think google lost huge chunks of data. I think it is one huge earthquake moving thru their dc's and they don't know how to stop it and they started this new bot thing as a smokescreen.
I think google has blown a fuse!

Texasville, I hope you bounce back, man, but I think you're wrong on the above. I check a fair selection of serps that are related to my field on a regular basis, and none of the "authority" sites budged, and none of them lost most of their pages. It's not a general Google meltdown; it might be category specific, or site specific, but it's not across the board. Those who are hoping for a sudden recovery could be disappointed, and it wouldn't hurt to establish a few new test domains, using entirely different methods, to see what happens.

Just so people know where I'm coming from when I post in these G threads: I'm not a big fan of G, or Y, or any of the SE's - I think they're all crap. As a user, I look forward to the day when SE's actually deliver what they promise, rather than just presenting long lists of scrapers, MFA's, pseudo-directories, and all the other dross that pollutes the net. Of course, in their defense, the internet at present is 90% crap, so that's a ball and chain right off the bat. Garbage in, garbage out...

Lorel

5:46 pm on Apr 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The website I reported when this thread started has regained PR on almost all of its pages that were affected. They had dropped from PR4 to 0. We did nothing to the site. Some of the pages now have higher PR than originally.

BillyS

6:49 pm on Apr 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Okay so now here is the secret to keeping pages in the Google index...

Ready?

Here it comes...

Delete them.

If you delete them Google will keep all the pages in its index forever as supplemental.

I just deleted my entire site, hope you're right!

Steph_R

7:03 pm on Apr 29, 2006 (gmt 0)

10+ Year Member



"The website I reported when this thread started has regained PR on almost all it's pages that were affected. They had dropped from 4 pr to 0. We did nothing to the site. Some of the pages now have higher PR than originally."

Lorel, did the site you are referring to also lose pages in the G index or did they just lose PR?

I am watching a site that has lost most of its pages in the G index, and the remaining interior pages went from PR5 or 6 to PR0 about a month ago. There is no spam on that site that we are aware of, and it has been checked, double-checked and triple-checked. Nothing going on there, but the pages are dropping from the G index like a rock.

[edited by: Brett_Tabke at 10:31 pm (utc) on April 29, 2006]
[edit reason] fixed typo per poster request [/edit]

djmick200

7:36 pm on Apr 29, 2006 (gmt 0)

10+ Year Member



I wouldn't expect any answers from the email addy that is being talked about. GG said he would pass on the URLs/sites for someone to look into. I guess if they find a common fault they will address it; otherwise the problem is with our websites. No response so far would lead me to guess the problem is at our end.

Which brings me back to the similar-pages theory, which many have been quick to dismiss.

I am of the opinion it is a combination of new spidering routines, low-PR pages and similar pages.

All of this is, of course, my opinion, which as always will be blown out within a matter of 2 posts.

cbartow

10:19 pm on Apr 29, 2006 (gmt 0)

10+ Year Member



djmick200:

I thought the same thing. The problem with this is: why is Googlebot crawling pages but not putting them in the index? If it were crawling lower-PR pages less, that would be one thing, but it seems to be crawling a ton of pages and then either not putting them in the index or, if a page is already in the index, not updating it.
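One way to put numbers on the "lots of crawling, nothing indexed" pattern cbartow describes is to count Googlebot requests per day in the server access log and compare that against the site: count. A rough sketch, assuming an Apache/nginx combined-format log; the log path is hypothetical, and matching on the user-agent string alone can be fooled by spoofed bots, so treat the totals as approximate.

# Count Googlebot requests per day from a combined-format access log.
import re
from collections import Counter
from datetime import datetime

LOG_PATH = "/var/log/apache2/access.log"          # hypothetical path; adjust
DATE_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4})")    # e.g. [28/Apr/2006:07:44:00 ...

hits_per_day = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        match = DATE_RE.search(line)
        if match:
            day = datetime.strptime(match.group(1), "%d/%b/%Y").date()
            hits_per_day[day] += 1

for day, hits in sorted(hits_per_day.items()):
    print(f"{day}: {hits} Googlebot requests")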

dramstore

10:40 pm on Apr 29, 2006 (gmt 0)

10+ Year Member



cbartow
Agreed - the problem is plenty of crawling and no indexing.

I think (as I think Matt said in his blog) there are 2 different issues here: one is a lack of spidering, and the other is normal spidering but simply losing pages (most, if not all, of them).

I get the impression the email address GoogleGuy mentioned was actually intended for people who are not being crawled (hence the "crawlpages" subject line).

I hope I'm wrong about that - I think these are 2 different issues and both need looking at.

Lorel

11:51 pm on Apr 29, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Lorel, did the site you are referring to also lose pages in the G index or did they just lose PR?

Yes, it lost pages and also PR. However, the tool I use to check PR must be hitting a different Google data center than either the owner or I use (we're in different parts of the country), as this restoration of PR does not show up on our Google Toolbars (his on IE and mine on Firefox), so this change may be moving across the country. I don't know how to tell which data center the site where I checked PR uses.

I just checked the index for this site (on my local G data center) and all but 6 of the pages are back in the index.

Re the duplication issues, the pages that had disappeared from this site were all the main pages, all with totally unique titles, descriptions and content. Totally white hat - no JavaScript URLs, etc.
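For the "which data center am I actually looking at" question Lorel raises, a quick check is to see which IPs www.google.com resolves to from each machine; different locations and ISPs will often get different answers. A minimal sketch using only the standard library:

# Print the IP addresses (i.e., the data center) your machine resolves
# www.google.com to. Run it from each location you want to compare.
import socket

hostname = "www.google.com"
canonical, aliases, ip_addresses = socket.gethostbyname_ex(hostname)

print(f"{hostname} (canonical name: {canonical})")
for ip in ip_addresses:
    print(f"  resolves to {ip}")

This only shows DNS resolution; the numeric data-center addresses quoted elsewhere in this thread were reached by querying those IPs directly.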

youfoundjake

6:47 am on Apr 30, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not that I'm a big player, but I lost 99 out of 100 pages; only my index page is still showing up. I compared 2 data centers using cnn.com:

[64.233.185.104...] 12,400,000
[66.249.93.104...] 14,300,000
that's gotta hurt...

sigh
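A very rough sketch of the comparison youfoundjake is doing: ask each data-center IP for the same site: query and pull out the reported result count. This assumes the 2006-era IPs quoted above still answer plain HTTP with a Host: www.google.com header, and that the results page reports the count in an "of about N" form the regex can find; neither is guaranteed, and automated querying may be disallowed by the search engine's terms, so treat it purely as an illustration.

# Compare the reported result count for the same site: query across data centers.
import re
import urllib.request
from urllib.parse import quote

DATACENTERS = ["64.233.185.104", "66.249.93.104"]    # the two IPs quoted above
QUERY = "site:cnn.com"
COUNT_RE = re.compile(r"of about <b>([\d,]+)</b>")   # assumed 2006-era result markup

for ip in DATACENTERS:
    url = f"http://{ip}/search?q={quote(QUERY)}"
    request = urllib.request.Request(url, headers={"Host": "www.google.com"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            html = response.read().decode("utf-8", errors="replace")
        match = COUNT_RE.search(html)
        print(f"{ip}: {match.group(1) if match else 'count not found in page'}")
    except OSError as err:
        print(f"{ip}: request failed ({err})")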
