|Number of indexed pages weirdness.|
About 2 months ago we launched an online communities site <edited> (our company had previously used the domain as our main marketing site). The pages of the site were indexed almost immediately, up until last week, when Google had about 1,000 pages in its index. Since then the number of indexed pages has dropped consistently each day and is now about 450. GoogleBot hits every page on our site almost hourly; there are currently around 5,000 pages.
Because of the nature of the site, it is very heavy with unique content that is part of user profiles and their blogs. We also get a ton of inbound links (20+ a day) due to mentions in the press, outside bloggers talking about us, and members linking to this site. We are growing at a fairly exponential rate right now, so the amount of fresh content on the site is also increasing exponentially. I worry Google is severely penalizing us for what is actually natural growth.
Anybody have any ideas on why the number of pages indexed is consistently dropping at the same time the amount of content and links into the site is growing? Is it worth submitting a Sitemap to Google if the number of pages is growing so rapidly? Why does GoogleBot absolutely pound the site (7,000+ hits/hour), yet they don't index the pages?
<Sorry, no specifics.
See Forum Charter [webmasterworld.com]>
[edited by: tedster at 5:42 pm (utc) on July 12, 2006]
Yes, I would say the Sitemaps program is a good idea -- even if you don't create a Sitemap, you will get information back from Google about possible problems.
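For what it's worth, a Sitemap file is easy enough to generate by script if your pages live in a database. Here's a minimal sketch following the sitemaps.org protocol -- the URLs and dates are made-up placeholders, not details of the site under discussion:

```python
# Minimal sketch of generating a Sitemap XML file per the
# sitemaps.org protocol. All URLs/dates below are placeholders.
from xml.sax.saxutils import escape

def build_sitemap(urls):
    """Return sitemap XML for a list of (loc, lastmod) tuples."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for loc, lastmod in urls:
        lines.append('  <url>')
        lines.append('    <loc>%s</loc>' % escape(loc))
        lines.append('    <lastmod>%s</lastmod>' % lastmod)
        lines.append('  </url>')
    lines.append('</urlset>')
    return '\n'.join(lines)

pages = [('http://www.example.com/', '2006-07-12'),
         ('http://www.example.com/blogs/', '2006-07-11')]
print(build_sitemap(pages))
```

On a rapidly growing site you'd regenerate and resubmit the file on a schedule rather than by hand.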
The rest of the strangeness is actually not so strange these days, not that anyone has a solid explanation for this kind of thing. Twice in recent weeks Google reps have spoken about "bad data pushes", so that may also be playing into what you see.
I've never heard of a "penalty" that keeps regularly spidered pages from showing up in the index. Penalties tend to hurt ranking position, but the url is still indexed. However, since you are using some form of community driven software, it's well worth checking that you are not creating more than one url to access the same content -- that can cause chaos. You may want to check on some important best practices, such as not allowing session id urls to be spidered.
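One simple check along those lines is to normalize URLs before they are ever emitted in links or reported in logs, so each piece of content has exactly one canonical URL. A sketch, assuming the session id travels as a query parameter -- the parameter names here ('sid', 'sessionid', 'PHPSESSID') are common examples, not known details of the site being discussed:

```python
# Sketch: strip session-id style query parameters so every piece of
# content resolves to a single canonical URL. The parameter names in
# SESSION_PARAMS are illustrative assumptions, not from the original post.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {'sid', 'sessionid', 'phpsessid'}

def canonicalize(url):
    parts = urlsplit(url)
    # Keep only query parameters that are not session identifiers.
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), ''))

print(canonicalize('http://www.example.com/profile?user=42&sid=abc123'))
# http://www.example.com/profile?user=42
```

The same idea applies to any other URL variation (trailing slashes, www vs. non-www, tracking parameters) that lets two urls reach one page.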
Here are some good reference threads. Lots of technical issues are mentioned in them, and I suggest you check through them pretty closely for ideas. While you currently may be seeing some Google-generated problems, there may also be steps you can take to make things better.
Checklist for Sudden Drops in Rank [webmasterworld.com]
Dropped from Google - a checklist to find out why [webmasterworld.com]
Dropped Site Checklist [webmasterworld.com]
The url-only problem [webmasterworld.com]
You're probably right about penalties. The pages we do see indexed rank very high, so they are probably not being penalized in any way. No session IDs in the urls, and we don't have multiple URLs with the same content. There are, though, a lot of pages for user navigation purposes which may end up having pretty similar content, but we didn't expect that Google would add those to its index anyway, because they are mostly links to other areas of the site.
Will have to try the sitemap thing...
Check your pages that fell out of the index, and see if they are supplemental. I've heard a number of people complaining over the last month about losing a large number of pages, and most of the time those pages are supplemental, or have ZERO incoming links.
Don't know what your site is, but I bet most incoming links are coming to the index page, and not to your other 5000 pages, hence no reason for Google to index them.
Also I heard others reporting that a week later the pages come back, or when they check across different Google DCs (data centers) they get different page number results, sometimes varying greatly.
My $.02 says Google is looking to improve the quality, and do away with the 5 billion page web site scenario that embarrassed them last month.
A site with thousands of pages is likely to have much the same info on all of them, and maybe they view that as lower quality than, say, a real informative article or a page that is updated daily.
I noticed this just the other day. I launched a new site and it had 19 pages indexed. Later that day, I did a search to see where my listings were for the keywords I got hits from. I couldn't find them. I did a site: search again and now it is only 13.
The weird thing is that it used to have a new root page cached. It's like it reverted to an older index for some reason. I should note that even though the site is new, I am getting first page listings on my major keywords, and I see those constantly moving up in rank every couple of days. Because of this, I don't think it's any penalty, and it is definitely not sandboxed.
Googlebot hits the site every day -- 450 hits so far since launch on the 7th. Can't figure out why it would pick pages up, cache a new version of the root, then all of a sudden drop those pages and go back to an older cached root.
Sometimes, too, people will get Google CRAWLING confused with INDEXING.
Big difference. They can crawl your site 100 times a day, but they only change the index maybe once a week, depending on your site. If your pages are supplemental, all bets are off.
I was investigating some scammer scraper pages 2 weeks ago, and saw one of them doing some dubious crap with our web site url. His page is supplemental, and Google has not updated the cache on that scammer's page since August 2005!
So rest assured, Google KNOWS what is on your site, and they KNOW you have made changes. They just decide when to show the world you made these changes.
When we make changes to our pages, we see it reflected in the index 3-4 days later. This is where the Google Sitemaps program comes in handy, by telling you the crawl dates etc.
[edited by: JeffOstroff at 5:17 am (utc) on July 14, 2006]