Google Doesn't Index Everything

Not Even Major Sites, Not Even Close


androidtech

4:21 am on Aug 1, 2005 (gmt 0)

10+ Year Member



I'm not naive; I know the search engines miss most of the pages on the web. But I just had an experience that I found very enlightening.

I was under the impression that the major search engines, at least Google, did do a good job of covering the major sites. I now have to change my mental map.

I've been working on a little applet that searches a major site using the Google API. Now this site is light years beyond anything you'd call a medium-size site, in both page count and traffic. It's a big site. It has a PageRank of 8 and an Alexa traffic rank of 503. It's a well-respected site, nothing fishy about it.

While doing testing, I tried some "reverse engineered" lookups. I would take pages that I knew about and use a combination of distinct keywords found on the page, and then try to get back to the page by punching those keywords into Google.

I don't have an exact figure yet, but a large percentage of the pages were not in Google's index. Even when I restricted my attempts to pages from last year, to rule out pages that might not have been crawled yet, many pages were still missing.
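To give an idea of the method, here's a rough Python sketch of one of these lookups. The URL is a placeholder and the stopword list is a guess; this isn't the actual applet code, just the shape of it:

import re
import urllib.request
from collections import Counter

# Words too common to pin down any one page; purely illustrative.
COMMON = {"the", "and", "for", "that", "with", "this", "from", "have", "were", "page"}

def distinctive_terms(url, count=4):
    # Fetch the page, strip tags crudely, and keep longish words.
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    text = re.sub(r"<[^>]+>", " ", html)
    words = [w.lower() for w in re.findall(r"[A-Za-z]{4,}", text)]
    freq = Counter(w for w in words if w not in COMMON)
    # The least frequent words are the likeliest to identify this exact page.
    return [w for w, _ in sorted(freq.items(), key=lambda kv: kv[1])[:count]]

url = "http://www.example.com/known-page.html"  # placeholder for a page you already know
print(" ".join(distinctive_terms(url)))         # punch the output into Google

If the page doesn't come back for a query built that way, either it isn't indexed or something odd is going on.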

I have to change a lot of my opinions about the web, about the opaque window I have onto it, and especially about how I navigate it.

Thanks.

theBear

1:24 pm on Aug 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Let's see.

Normal things first.

1. Robots meta tag exclusion, either a noindex on the page itself or a nofollow on the page linking to it.
2. robots.txt exclusion (a quick test for items 1 and 2 is sketched after this list).
3. Too many results returned for the search, and the page exists but sits at position 1001 or higher.
4. Page is indexed, but the API doesn't allow uncovering omitted results.
5. Pages were removed via the URL removal console.
6. Page is indexed but supplemental, and no supplemental results were needed to answer the query.
7. Pages were denied to the robots by other means.
8. Use of rel="nofollow" in all link chains to the page.
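For items 1 and 2, a quick and dirty Python check looks something like this. The URL is made up, and this only covers those two cases:

import re
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

def robots_txt_allows(url, user_agent="Googlebot"):
    # Item 2: does the site's robots.txt let this user-agent fetch the URL?
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

def has_noindex_meta(url):
    # Item 1: does the page carry a robots meta tag containing "noindex"?
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    pattern = r'<meta[^>]+name=["\']robots["\'][^>]+content=["\'][^"\']*noindex'
    return re.search(pattern, html, re.IGNORECASE) is not None

url = "http://www.example.com/some/page.html"  # made-up page to test
print("robots.txt allows crawl:", robots_txt_allows(url))
print("meta noindex present:", has_noindex_meta(url))

The rest of the list (omitted results, the URL console, supplemental status) you can't test from the outside this way.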

But it still wouldn't surprise me if pages on a major site didn't get indexed, just due to the nature of the link churn on a site that size.

BigDave

5:56 pm on Aug 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Instead of trying to jump through those weird hoops, why not just do a search like [inurl:forum30/30000.htm site:webmasterworld.com]?
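If you want to script that, a few lines of Python will build the query from any URL you already know. This is just string munging, not any official API:

from urllib.parse import urlsplit

def index_check_query(page_url):
    # Turn a known URL into an "is this exact page indexed" query.
    parts = urlsplit(page_url)
    return "inurl:" + parts.path.lstrip("/") + " site:" + parts.netloc

print(index_check_query("http://www.webmasterworld.com/forum30/30000.htm"))
# prints: inurl:forum30/30000.htm site:www.webmasterworld.com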

It doesn't really matter what the PR of the "site" is when it comes to whether a page will be indexed; it has more to do with the PR of the pages that link to that particular page. Yahoo is PR9, and they have 46 million pages in the index, but they probably have a couple hundred million more that are buried so deep, and are so many levels removed from that PR9 page, that Google never goes near them. Not to mention all those pages that are hidden, or are only found through SE-unfriendly links.

If you have a PR8 home page, less than 100,000 pages, and a good linking strategy (both internal and external), you will almost certainly be completely crawled. If you have PR8, 10,000,000 pages and a bad linking strategy, then you might be lucky to get 10% crawled.
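To make the "levels removed" point concrete, here's a toy Python sketch. The link graph is invented, and a real crawler is obviously far more complicated than a breadth-first walk:

from collections import deque

def click_depths(links, start):
    # Breadth-first walk: how many clicks is each page from the entry page?
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

links = {                          # made-up site structure
    "home": ["section-a", "section-b"],
    "section-a": ["article-1", "article-2"],
    "section-b": ["archive"],
    "archive": ["old-post-1"],
    "old-post-1": ["old-post-2"],  # every extra hop dilutes PR and crawl priority
}

for page, depth in sorted(click_depths(links, "home").items(), key=lambda kv: kv[1]):
    print(depth, page)

Any page that isn't reachable from "home" at all never shows up in the output, which is the programmatic version of an orphan page.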

androidtech

4:01 am on Aug 3, 2005 (gmt 0)

10+ Year Member



BigDave,

re: Yahoo

Exactly! But in the past I would have assumed that those "buried so deep" pages were not indexed because they had "low content value". I now know that to be absolutely untrue.

I think there's going to be a rise of starkly vertical niche search engines, possibly licensing or using Google's technology as the backbone. Too much useful information is getting lost. Just an opinion.

Thanks.