Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Pages dropping out of the index - in two months' time it will be 0

Number of pages indexed drops from 112,000 to 270!

         

The_Tank

8:07 am on Apr 13, 2006 (gmt 0)

10+ Year Member



Has anyone else suffered from this? Has anyone else got their pages back? Did anyone make changes to their site? If so, what did you do?

I can't be the only one; I know some forum sites that have had similar experiences - but what about other sites?

moTi

3:30 am on Apr 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



maybe some of you are referring to my post (300,000 pages down to 80,000 - template driven)

first of all: it is certainly possible to have this huge number of pages with unique content. if you look at your own tiny website (no offence) and assume it can't be legitimate because it mustn't be, and that every webmaster with 100,000+ pages is a spammer, then you should definitely broaden your horizons about content generation.

just to clarify, my website does not contain the conventional product pages you are concentrating on, but published press e-mails. it's a news site and i put tens to hundreds of pages on the web each day just to stay up to date for my visitors. that's how this industry works. all perfectly legitimate press releases, no scraped content. if you imagine a site that has been online for years, you get an idea of how massive the database can grow. some of my competitors (though some of them i'd call black hat) have multiples of that number of pages indexed.

secondly, user-generated content. any big forum webmaster who receives thousands of entries each day knows what i'm talking about. i also rely on this content type to a certain extent - not a forum, but an option for users to self-publish articles.

of course you can't come anywhere near this amount of info by typing in the content all by yourself. and naturally, i'd also assume that most website owners with huge numbers of pages are spammers deliberately inflating the index with crap. but anyhow, i hope you now get an idea that staying on the white hat path is possible if you get outside support. maybe add a few grey sprinkles with clever seo/linkage and there you go.

so, now each article gets its own url. concerning the theory of too much similarity, you have a point there. but here's the question. same template, different description, different articles:

Sites with many thousands of very similar pages might be getting flagged as spam.

but google would be ill-advised to judge this by the size of the raw template code compared to the size of the article - or, worse still, to deindex third-level pages outright. the only thing that matters is the size of the article; that is the real, valuable content, not the other stuff. if the article is too small, google should kick it out of the index:

if your product descriptions are short, google is asking: why does the product deserve its own page with so little information?

this is reasonable.

the question remains: at what similarity ratio between articles are they considered unique versus duplicate, and how does google define similarity?
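(nobody outside google knows, of course. but just to illustrate the kind of measure people usually mean by "similarity" - this is purely a textbook sketch, not google's actual method, and all the names are made up for the example - here is a jaccard ratio over 5-word shingles in c#:)

// purely illustrative: split each article into overlapping 5-word "shingles"
// and compare the two shingle sets with a jaccard ratio.
using System;
using System.Collections.Generic;

static class SimilaritySketch
{
    static HashSet<string> Shingles(string text, int size)
    {
        string[] words = text.ToLowerInvariant()
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        var shingles = new HashSet<string>();
        for (int i = 0; i + size <= words.Length; i++)
            shingles.Add(string.Join(" ", words, i, size));
        return shingles;
    }

    // 0.0 = nothing in common, 1.0 = identical shingle sets
    public static double Jaccard(string articleA, string articleB)
    {
        HashSet<string> a = Shingles(articleA, 5);
        HashSet<string> b = Shingles(articleB, 5);
        if (a.Count == 0 && b.Count == 0) return 1.0;

        var intersection = new HashSet<string>(a);
        intersection.IntersectWith(b);
        var union = new HashSet<string>(a);
        union.UnionWith(b);

        return (double)intersection.Count / union.Count;
    }
}

a ratio near 1.0 means two pages share almost all their wording; where a search engine draws the duplicate line is exactly the open question.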

by the way: google seems to be reindexing! +20,000 pages in recent days :)

nippi

4:17 am on Apr 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



mattg3

Google does not know the pages are product pages; I'm not saying it's a product-page-specific filter.

I see it happening to all sorts of pages - sites that have databases of similar stuff.

tigger

6:43 am on Apr 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, so we are saying it "could" be a duplicate content filter that's removing these pages. If so, hopefully once G realises it's removing valuable content from its database it will loosen the belt on the filter. But what I can't understand is why new content just isn't getting in - and isn't even being crawled!

We all loved G because we knew that if we added content to our sites it would normally be crawled within a few days and maybe ranking - but I'm now weeks into adding good content to my site that's just not getting crawled! Or if these new pages are getting crawled, they're being removed straight away, so when we check the cache nothing comes up!

>Stefan

I'm holding out but not for much longer!

iProgram

6:50 am on Apr 24, 2006 (gmt 0)

10+ Year Member



>but what I can't understand is why new content just isn't getting in - and isn't even being crawled!

Maybe it's "crawl caching proxy".

tigger

6:55 am on Apr 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>Maybe it's "crawl caching proxy".

sorry, maybe I'm not awake yet - what do you mean?

Freedom

9:15 am on Apr 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, so we are saying it "could" be a duplicate content filter that's removing these pages. If so, hopefully once G realises it's removing valuable content from its database it will loosen the belt on the filter. But what I can't understand is why new content just isn't getting in - and isn't even being crawled!

I don't understand why new pages aren't getting in either, and I see that anomaly as related to the lost-pages problem, as Tigger must be concluding as well.

I really don't see why Google would want to end up losing millions of pages, since they once publicized how they wanted to index more of the web, and the Size Wars were in the news just last September.

[battellemedia.com...]

Google's silence on this matter (losing pages) also supports my interpretation that this is a "bug," and not some kind of intentional spam or black-hat filter.

If losing this many pages was intentional, Google would talk about it - either via Cutts' blog or here via GG.

If this is accidental, Google would circle the wagons and go into radio silence until the last possible moment - which is what they appear to have done so far.

I have 100s (not 1,000s) of pages built off the same template, each with unique content, that were dropped. I also have all kinds of pages on other sites that have not been indexed yet, which have been waiting anywhere from a few days to a month.

I'd like to see this thread go into a rational analysis of this problem if we could.

Thanks,

Freedom

tigger

9:25 am on Apr 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I do hope you're right about this "bug" theory, and a lot would point to it - why else would all this content get dropped, or the good deep crawls to 3rd-level pages stop? I've had one page up for 3 weeks now that's linked from 20 themed pages, and the page they link to has good on-theme content, so it really should be at least ranking by now - but it's still not even cached.

Freedom

9:27 am on Apr 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, the problem has been brought to Matt's attention, at least in the comments section of his blog. Let's hope it garners some feedback.

Dayo_UK

9:31 am on Apr 24, 2006 (gmt 0)



Of course there may be lots of different things going on at the same time, and Google may be upping the duplicate content filter (although from my observations this normally results in pages with substantially duplicated content not appearing in the serps rather than not being crawled - of course things could change).

But I still have not seen a site with this problem that does not have canonical url problems, either currently or in the very recent past.

I personally think it is still a problem with canonical urls and the new calculation of PR - which does not yet appear to have been applied to the serps, IMO.

My site, which had canonical url problems prior to Big Daddy, had crawling issues as follows:

200-300 pages in the index (from a site of a few thousand pages) - these pages seemed to have been added randomly, ie not the homepage followed by one-level-deep and then two-level-deep pages. (Also at this time the homepage PR was 0, probably due to canonical issues - some internal pages had PR, and that was probably where the small crawls started from, ie the 200-300 pages.)

Big Daddy comes into play: still 200-300 pages (different pages though) in the index, but these pages are added logically - homepage, one level deep, etc.

So the same depth of crawling, but crawling logically from the correct place (the homepage) - hmm. (What is this depth of crawling based on? An internal page's PR rather than the homepage's?) Homepage TBPR still 0 - some internals 2-3...

PR update, Homepage gets PR6, internal pages that had PR get PR0 (they were supplemental at the time of the PR export) - crawling virtually ceases.

JuniorOptimizer

12:08 pm on Apr 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



3 days ago: 700 pages. 197,000 over the weekend. Today the total is 36. LOL. Google is comical.

McClaw

12:18 pm on Apr 24, 2006 (gmt 0)

10+ Year Member



One of my clients' sites seems to be suffering from the same problem you guys have been noticing.

I checked on:

[64.233.185.104...]

and [66.249.93.104...]

On the first DC I see 1-2 of 4 pages (should be about 200).

On the second DC I see 22 pages, all supplemental except the home page, and all the supplemental pages were deleted in the middle of 2004.

There does seem to be a canonical issue on this site, looking at the first DC.

When I show duplicates, I see that there are two extra pages (one called Default.aspx and one called Home.aspx).

I have already 301-redirected home.aspx to /, but I'm having trouble writing code to do the 301 for default.aspx.

Basically, I can't find a (server-side) header that will tell me that the requested url was www.widgets.co.za/default.aspx instead of www.widgets.co.za/

Obviously I'm running IIS and .NET C#.

Any code help would be greatly appreciated.
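A minimal sketch of one way to do it, assuming ASP.NET Web Forms with a Global.asax, and assuming Request.RawUrl still contains "default.aspx" when the visitor requested it explicitly (the internal default-document mapping for "/" normally leaves RawUrl as "/"; behaviour can vary by IIS version and configuration):

// Global.asax.cs - hypothetical sketch, not a tested drop-in fix
using System;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        string rawUrl = Request.RawUrl;

        // Only touch explicit requests for /default.aspx (any casing, no query string).
        if (rawUrl.EndsWith("/default.aspx", StringComparison.OrdinalIgnoreCase))
        {
            // Keep everything before the file name, e.g. "/" or "/somefolder/".
            string target = rawUrl.Substring(0, rawUrl.Length - "default.aspx".Length);

            Response.Clear();
            Response.StatusCode = 301; // Moved Permanently
            Response.AddHeader("Location", target);
            Response.End();
        }
    }
}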

RobinK

7:12 pm on Apr 24, 2006 (gmt 0)

10+ Year Member



We too see pages being dropped and added on our site.

I decided to see if I could spot a trend in the pages being dropped. I noticed yesterday that we had 2 fewer pages (out of 300+), so I researched and figured out which two pages were not showing up in the site: search. (They don't show up with or without the omitted results, and we have far fewer than 1,000 pages currently indexed in google now.)

The two pages that no longer show up can be found on google by searching for the url or with the site name and keywords. They are not supplemental and have current caches.

I just found out I can do a site: search and then use "search within these results" to get both pages to come up that way too. But if I do just the site: search and look at what pages are listed, they do not show up.

Anyone have any thoughts on this?

Shurik

7:30 pm on Apr 24, 2006 (gmt 0)

10+ Year Member



McClaw, disallow default.aspx in robots.txt
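(For reference, the rule Shurik means would look like this, assuming robots.txt sits in the site root:

User-agent: *
Disallow: /default.aspx

Note that this only stops the url from being crawled; as McClaw points out below, it does not pass anything on to the correct page.)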

LuckyGuy

8:24 pm on Apr 24, 2006 (gmt 0)

10+ Year Member



nippi,

I figured that out too and did the same thing just 5 weeks ago. It worked great, but it hit my site hard and it lost about 4,000 (from 5,800) pages. Maybe they were not unique enough. But my site: search still shows very old pages that have been deleted for over a year and a half now. There's no cache for these pages. These non-existent pages were fully template-driven, with maybe 10% unique content.

So your theory doesn't fully match.

I think they tightened the duplicate content filter, but maybe at the same time they released the new bot caching technology and ran into unexpected problems.

Freedom

9:18 pm on Apr 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Despite 4 separate requests on Matt's blog for more information on our shared concern here, he's avoided it while answering fluff questions on other matters.

The longer they stay silent on the matter, the more it tells me that this is a bug not within their control or influence, and that they are dodging the question.

Is this a temporary bug? I don't think even they know.

Armi

11:02 pm on Apr 24, 2006 (gmt 0)

10+ Year Member



Feedback from *Googleguy* would be very helpful!

GG - please help us ;-)

arbitrary

5:15 am on Apr 25, 2006 (gmt 0)

10+ Year Member



I am confident this is not a canonical issue.

I have three sites that have canonical problems; none have lost pages. I have a site that does not have canonical problems - it has lost 75% of its pages.

Yes, I know what canonical problems are - my three sites have been in the tank for a year now. I also know that the site experiencing the page loss does not have canonical problems.

McClaw

5:36 am on Apr 25, 2006 (gmt 0)

10+ Year Member



>>McClaw, disallow default.aspx in robots.txt

Thanks for the reply dude.

Disallowing that file in robots.txt could cause some other issues - I would like the PR to be passed to the correct page instead of just being thrown away. :(



Continued: [webmasterworld.com...]

[edited by: Brett_Tabke at 4:37 pm (utc) on April 26, 2006]

This 168-message thread spans 6 pages.