Since October, I have been developing a site (cxp.paterra.com) containing free class-expanded abstracts of U.S. patents. The class expansions of the patent classification codes are in English, Chinese, and Japanese. The site now contains more than 4 million pages.
As part of this development, I have been monitoring the activity of search engine spiders and the appearance of pages on the various search engines. The patterns have turned out to be quite surprising. Please see the graphs at [paterra.com...]
1) Essentially only Google was allowed to spider the site until January, so all of the Ask and Baidu records are the result of subsequent spidering.
2) The site was filled to essentially its current size (i.e., U.S. pregrants back to 2001) at the end of January. At that time, some of the linkage structure was rearranged to fit the 4 million pages within Google's recommended maximum of 100 links per page. All of the preexisting URLs remained valid, but many links changed.
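For scale, here is my own back-of-the-envelope arithmetic for why a restructuring of this kind is forced by the 100-link limit (a hypothetical sketch, not the site's actual link layout): with at most 100 links per page, three levels of index pages can reach only 100^3 = 1 million pages, so 4 million pages require a fourth level.

```python
# Back-of-the-envelope: how many levels of index pages are needed to
# reach n_pages leaf pages when each page carries at most max_links
# links. Hypothetical arithmetic, not the actual cxp.paterra.com scheme.
def levels_needed(n_pages, max_links=100):
    """Smallest depth d such that max_links**d >= n_pages."""
    d = 0
    reach = 1
    while reach < n_pages:
        reach *= max_links
        d += 1
    return d

print(levels_needed(4_000_000))  # 4: 100**3 = 1,000,000 is not enough
```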
Observations and questions raised:
1) Google's content is very volatile. It may be that Google's page-ranking system, coupled with its monthly recrawl cycle, is inherently unable to index large sites. The peak and drop in Google's indexing of cxp.paterra.com roughly correspond to the linkage restructuring propagating through a month-long crawl cycle. Links (as opposed to URLs alone) must remain valid for content to remain in Google's database.
2) Ask and Baidu spider aggressively and systematically. Not being dependent on links (for link-based page ranking) may make them inherently more stable and more capable of thorough indexing. As long as the URLs remain valid, the content remains in their databases.
3) While Microsoft, Yahoo and several other engines crawl the site, they still don't rise above the baselines for coverage.
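The hypothesis in observations 1 and 2 can be illustrated with a toy model (page names and link structure invented for illustration): a link-following spider retains only pages still reachable from the root, while a URL-list spider retains every URL that still responds, so a link restructuring orphans pages for the former but not the latter.

```python
# Toy contrast: link-following crawl (reachability from the root)
# versus a URL-list crawl (every known, still-valid URL).
# All page names here are invented for illustration.
def crawl_by_links(graph, root):
    """Return the set of pages a link-following spider can reach."""
    seen, stack = set(), [root]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        stack.extend(graph.get(page, []))
    return seen

# Before restructuring: root -> p1 -> p2.
before = {"root": ["p1"], "p1": ["p2"]}
# After restructuring: root links only to p3; p2's URL is still
# valid, but no link points to it any more.
after = {"root": ["p3"]}

known_urls = {"root", "p1", "p2", "p3"}  # a URL-list spider's frontier

print(crawl_by_links(before, "root"))  # {'root', 'p1', 'p2'}
print(crawl_by_links(after, "root"))   # {'root', 'p3'}: p2 drops out
print(known_urls)                      # the URL-list spider keeps all four
```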
In designing the site, I took several SEO principles into account, with one notable exception: the external links on which Google's page ranking is based. Otherwise, the class-expanded abstract approach is intended to take advantage of relevance-ranking methods; in other words, the documents are short while still containing meaningful terms. Another bow to SEO was to follow Google's posted advice to keep the number of links per page to 100 or fewer.
I did not try to optimize page ranking based on external links, for two reasons. First, it is simply impractical: the site contains millions of patent documents, and while some third-party sites may link to the top-level pages, there is no way of generating links to the records themselves in a meaningful way. Second, page ranking would skew the relevance of records within the site and lessen its usefulness to searchers.
What I think I am observing in the spidering patterns is an inherent instability in Google's spidering, apparently a result of page ranking. While I have not investigated page-ranking technology, relational database principles suggest it must record both the target URL _and_ the linking URL in a table; a "group by" together with an "order by" in the SQL query would then generate a page rank. While page ranking may be Google's current basis for fame and fortune, it may also be their Achilles' heel.
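The relational sketch above can be made concrete (a minimal illustration of my guess, with invented URLs; Google's actual PageRank iterates over the link graph rather than merely counting in-links):

```python
# Minimal sketch of the hypothesized link table: each row records a
# (linking_url, target_url) pair, and a GROUP BY / ORDER BY query
# yields a crude in-link count per target. This illustrates the
# relational idea only; it is not Google's actual algorithm.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (linking_url TEXT, target_url TEXT)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [("a.example", "cxp.paterra.com/p1"),
     ("b.example", "cxp.paterra.com/p1"),
     ("a.example", "cxp.paterra.com/p2")],
)

rows = conn.execute(
    "SELECT target_url, COUNT(*) AS inlinks "
    "FROM links GROUP BY target_url ORDER BY inlinks DESC"
).fetchall()
print(rows)  # [('cxp.paterra.com/p1', 2), ('cxp.paterra.com/p2', 1)]
```

Note that if the rank lives only in such a table of links, a page whose inbound links disappear also disappears from the ranking, which matches the volatility observed above.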
Of course, the above is an oversimplification, and Google's algorithms are much more sophisticated. Google can sell ads based on the URL string itself, with no page content, so it is profitable for them to have many content-free URLs in their database. (Check out the query [google.com...] to see what I mean.) Google's AdSense algorithm itself has at least three levels of response: 1) ads based on the actual page content (probably only up to the first few kilobytes), 2) ads based on the URL string by itself, and 3) public-service ads. The level used for a particular page access depends on response time. I posted an analysis of Google's spidering patterns earlier on CHMINF-L (https://listserv.indiana.edu/cgi-bin/wa-iub.exe?A2=ind0502&L=CHMINF-L&P=R3823&I=-3).
In contrast, the Ask and Baidu spiders are much more methodical, and their database content is much more stable (at least with respect to our site). One reason for the current post is to plant in information professionals' minds the possibility that Google may not index the entire Web, or even entire sites, and may be inherently incapable of doing so. Are the Internet search engines really doing what they claim to do? I would hope that other information professionals with detailed knowledge of particular content-focused web sites would run tests on search engines' coverage of those sites. If anyone knows of other critical testing of search engines, I would appreciate pointers to the results. (Any master's dissertations out there?)
My own searches have changed after seeing these results. Now if I really want to search a site, I will 'Ask' it before 'Googling' it.