Forum Moderators: open
The pages haven't been totally removed; instead, it seems the pages still exist in the Google SERPs but have no title and no snippet, and therefore no longer appear for any searches.
At first I thought this was some sort of penalty or filter to remove some of the more controversial search sites from the index, but it seems to apply to other large sites too, e.g. dmoz. I would estimate that dmoz has had around 200,000 pages "nuked".
Has anyone noticed this phenomenon on any other sites?
No title / no snippet in my eyes just means that Google knows about the page but is not considering it for the SERPs at present. There are many possible reasons for it:
- Google is not able to crawl the page anymore (or was not able to crawl it)
- page has content that is very similar to other content
- technical problem of any kind
...
There must be a valid reason why this happens. It isn't random - that would be ridiculous. There must be some factor that stops Google from spidering these pages, or from indexing them.
I think it is a combination of needing a higher PR value overall throughout the site (more inbound links) and possibly reducing the similarity of the pages throughout the site.
Dan
Perhaps Google has a duff robot running amok (either a hardware or software fault). Each robot presumably has a list of URLs to visit. All that would be required to create this effect is for one of those robots to report (incorrectly) that the pages are offline.
Just a thought.
Kaled.
this is a lot more plausible than some complicated theories of filters.
in fact let's do a survey:
your total number of pages
total number of pages in the online serp
total number of url-only pages
That's a totally different issue. Google has never completely forgotten about them and never will. Nor about disallowed pages either. But they will never show in the results for standard queries. They are just NOT INDEXED, and they never will be. That's what the meta tag says: DO NOT INDEX ME. That's totally different from the observations of url-only listings for pages that should normally be indexed.
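To make the distinction concrete: the "DO NOT INDEX ME" instruction is a robots meta tag in the page itself, and you can check for it with a few lines of code. A minimal sketch using Python's stdlib HTML parser (the function names here are my own, purely illustrative):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of any <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if d.get("name", "").lower() == "robots":
                self.directives.append(d.get("content", "").lower())

def is_noindex(html):
    """Return True if the page asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in c for c in parser.directives)
```

A page returning True here is deliberately excluded, which is a completely different situation from a crawlable page that shows up url-only.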
>google just runs out of space in the online index.
Yawn ... [webmasterworld.com]
>Perhaps Google has a duff robot running amok
>report (incorrectly) that the pages are offline.
Amazing. Where do you get those funny ideas?
are you acknowledging or denying that google is running out of index space? this is a lot more plausible than some of the funny theories you've been posting. let's see some of your proof. how do you explain the stats above on msn, yahoo, cnn, amazon? i can show you more if you want.
don't yawn and be intellectually lazy. think!
Amazing. Where do you get those funny ideas?
Daft as it seems (and I was not being serious), it would explain the problem.
Clearly, this phenomenon is either by accident or by design (or perhaps the result of problem management, e.g. if some index capacity had to be taken offline, but that doesn't seem likely).
If by design, then I for one am at a loss to understand the logic.
If by accident (i.e. a bug), a faulty robot is plausible except for one factor: it ought to have been spotted and fixed within 24 hours.
I suppose there is another possibility. Perhaps someone has let a virus loose that is blocking Googlebot. Again, not likely and, in any case, Google should have spotted it immediately.
Kaled.
Mod note:
There can be absolutely no quotes from emails posted at the board. Paraphrased, it indicated that they are pages included in the index but not fully crawled by robots and only partially indexed.
[edited by: Marcia at 4:03 am (utc) on May 21, 2004]
It could be a convenient explanation for a big problem or an intentional way to keep their index large while cutting costs.
this is evidenced by google creating a separate supplemental index, url-only entries in the serps, and mysteriously disappearing pages.
can the constraint be time as implied by GG and the canned replies as well as the google faqs? hardly, since this can easily be solved by simply keeping the old info of existing pages.
is it a space constraint? hardly, since memory and disk are cheap.
is it a docid/index problem? GG has vehemently denied this in one of the posts.
whatever the constraint is, the problem must be insidious and very difficult to solve, particularly as this has been happening for several months now.
GG - this problem has not been mentioned in the risk section of the IPO papers. tell the gods in the plex that if this problem is uncovered after the IPO has been launched, this is very serious grounds for stock fraud and manipulation!
Why don't you want to be told the obvious answer? Google hasn't recently crawled pages that it either crawled a while ago or merely saw the link to but didn't crawl through.
Duplicate content, relative links, and poorly constructed websites are the ones mostly being hit by this. Huge sites with pages that only have one or a few super-deep links to them also get hit for exactly the reason Google says: they haven't crawled the pages since they last dumped the master cache.
Big sites do have a hard time keeping a crawlable structure consistent throughout their sites, but that gets more important as Google is apparently depending more on us to tell them what is important via linking.
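The "super deep links" point can be illustrated with a toy model, not anything Google has confirmed: a breadth-first crawl from the home page with a depth budget leaves deep pages discovered (URL seen on a crawled page) but never fetched, which is exactly the url-only situation. A hypothetical sketch:

```python
from collections import deque

def crawl(links, start, max_depth):
    """BFS over a site's link graph with a depth budget.
    Returns (crawled, url_only): pages fetched within the budget,
    and pages whose URL was discovered but never fetched."""
    crawled, seen = set(), {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        crawled.add(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                if depth < max_depth:
                    queue.append((nxt, depth + 1))
                # else: URL is known but the page is never fetched
    return crawled, seen - crawled

# a daisy chain home -> a -> b -> c: with a depth budget of 2,
# page c ends up known-but-uncrawled
site = {"home": ["a"], "a": ["b"], "b": ["c"], "c": []}
crawled, url_only = crawl(site, "home", 2)
```

Under this model, flattening the structure (linking deep pages from higher up) moves pages from the url-only set into the crawled set without Google changing anything.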
i disagree. almost all sites are affected. just look at any large enough site. a few more examples:
site/fully indexed pages/url-only pages
webmasterworld.com/116,000/29,000
google.com/196,000/29,300
mtv.com/162,000/86,500
cisco.com/187,000/107,000
guardian.co.uk/309,000/235,000
whitehouse.gov/36,800/38,100
as i said, almost all sites are affected. and my speculation is that google chooses pages randomly as the fairest way to allocate its precious space in the full index!
Webmasterworld has thousands of pages that can only be accessed via a daisy chain of linking. Many of the URL pages are years old, PR0 pages like [webmasterworld.com...]
Google is crawling more, but less deeply. Perhaps it isn't unreasonable to think that they should crawl every ancient page with one link to it every month, but it's no surprise that they don't.
I have added NOINDEX to these pages, and now I have only got about 20 URL-only listings for pages that haven't been crawled. So I have reduced the number of URL-only listings on my site by marking these pages noindex; 20 out of 990 is OK to me, compared to 200 out of 1,100 or so three weeks ago...
i truly believe that google has just added a few more factors into the mix of its crawlers that determine whether a page is even 'worth' indexing properly, including dupe content and long parameter urls, i.e. ?product=2222&department=345344..
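the "long parameter url" idea can be made concrete with a small heuristic. to be clear, the thresholds below are my own guesses for illustration, not anything google has published:

```python
from urllib.parse import urlsplit, parse_qs

def looks_parameter_heavy(url, max_params=2, max_digits=5):
    """Crude heuristic for catalogue/session-style URLs that crawlers
    were said to be wary of: many query parameters, or parameters
    with very long numeric values. Thresholds are guesses."""
    params = parse_qs(urlsplit(url).query)
    if len(params) > max_params:
        return True
    return any(v.isdigit() and len(v) > max_digits
               for values in params.values() for v in values)
```

with these (made-up) thresholds, the example url above trips the check because of the long numeric department id, while a plain static page does not.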
A space problem that isn't publicly known just before they go public... I think they are a little smarter than to try to rip off their investors; it's not like they are selling a dodgy used car here...
I don't know why the disagreement either. There is not just one reason is there?
Pages can be URL only when they're first discovered, and they can ALSO turn URL only when they're being removed for penalties or otherwise.
If you track a site getting pages removed you can watch the number of URL only pages gradually increase over a number of days, especially if you keep watch on a couple of different data centers. And no, they are not being removed for lack of room.
steveb,
i am disagreeing with the above statement. all sites are being hit, not just sites with duplicate content, relative links, or poor construction. do you have any facts to support this assertion? or are you just speculating?
<snip>
[edited by: Marcia at 3:52 am (utc) on May 21, 2004]
[edit reason] No pointing out specific sites, please. [/edit]
marcia,
look at the evidence:
site/fully indexed pages/url-only pages
webmasterworld.com/116,000/29,000
google.com/196,000/29,300
mtv.com/162,000/86,500
cisco.com/187,000/107,000
guardian.co.uk/309,000/235,000
whitehouse.gov/36,800/38,100
msn.com (1,580,000 / 1,830,000)
yahoo.com (5,460,000 / 3,290,000)
cnn.com (501,000 / 207,000)
amazon.com (2,590,000 / 2,350,000)
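for what it's worth, the raw counts above are easier to compare as url-only shares. a quick sketch using the figures quoted in this thread (these are the posted numbers, not fresh queries):

```python
# (fully indexed, url-only) counts as quoted in the thread
counts = {
    "msn.com": (1_580_000, 1_830_000),
    "yahoo.com": (5_460_000, 3_290_000),
    "cnn.com": (501_000, 207_000),
    "amazon.com": (2_590_000, 2_350_000),
}

def url_only_share(indexed, url_only):
    """Fraction of all listings for a site that are url-only."""
    return url_only / (indexed + url_only)

for site, (indexed, url_only) in counts.items():
    print(f"{site}: {url_only_share(indexed, url_only):.0%} url-only")
```

on these numbers the url-only share ranges from roughly a quarter (cnn) to over half (msn), which is at least consistent with the effect hitting big sites across the board rather than only badly built ones.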
you think these are newly discovered pages? are they being removed for penalties?
if we propose any theories, let's try to support them with facts. otherwise we're just spinning old wives' tales.
This phenomenon occurs on lots of large sites, but certainly not on all sites on the Internet. That's just silly.
Forget this running-out-of-space junk. Google is crawling more actively than ever before, with fresh tags appearing every single day.