Scarecrow - 6:29 pm on Sep 29, 2004 (gmt 0)
But what is PageRank in the larger scheme of ranking? It's a number that ranks the importance of the page, that is assigned without respect to the search terms that may be used to pull up the page. The key thing about this number is that it can be precomputed. Then the docIDs in the inverted indexes can be sorted by this number. That means you only have to scrape off the top of the docIDs for a search term -- just deep enough to satisfy the searcher's request for 10 to 100 links. You don't have to look at 99 percent of your index for most searches. After you scrape off the top docIDs for a search, then you look at how each document relates to the search terms, using other algorithms. But this initial sort in the inverted indexes is probably the most crucial efficiency algorithm in Google's entire system.

Now this initial "PageRank" number certainly does not have to be the pure link calculation it was originally. Links are an obvious indication of importance, but the calculation doesn't have to be pure or recursive. If you did a seat-of-your-pants link calculation, you might want to consider other factors also. Remember, all these factors would blend into a number that is precomputed -- before you even construct the inverted indexes for searching. The inverted indexes are sorted on this number.

One thing that comes to mind is some measurement of the quality of a page in the context of the site. The original PageRank never looked at the site as a whole. But the more you know about the site, the more you know about the quality of pages that make up the site. Is the site spammy? Is it a .gov, .edu, or .org where the spam problem is less? Is it a new site? If new, does it have thousands of pages already? Is the site commercial or informational? If commercial, is it an affiliate site? What if Google started keeping information on the nature of sites, and used this to weight the "PageRank" of the pages on that site?
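To make the sorted-index idea concrete, here's a rough sketch in Python. It is only an illustration of the general technique, not anything known about Google's actual code. The function names, the tiny document set, the link scores, and especially the site weights are all made up for the example, and multiplying the link score by a site weight is just one assumed way the factors might be blended.

```python
from collections import defaultdict

def blended_score(link_score, site_weight):
    # Hypothetical blend: a link-based score scaled by a per-site weight.
    return link_score * site_weight

def build_index(docs, scores):
    """Build an inverted index whose posting lists are sorted by a
    precomputed, query-independent score (highest first)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # The sort happens once, at index-build time, not at query time.
    return {
        term: sorted(ids, key=lambda d: (-scores[d], d))
        for term, ids in postings.items()
    }

def top_candidates(index, term, k):
    """Scrape off only the first k docIDs for a term."""
    return index.get(term.lower(), [])[:k]

# Made-up data: docID -> page text, link score, and site-level weight.
docs = {
    1: "cheap widgets sale",
    2: "widgets history archive",
    3: "widgets buy cheap now",
    4: "history of scarecrows",
}
link_scores = {1: 0.9, 2: 0.5, 3: 0.7, 4: 0.3}
site_weights = {1: 0.2, 2: 1.0, 3: 0.5, 4: 1.0}  # assumed: low = spammy site

scores = {d: blended_score(link_scores[d], site_weights[d]) for d in docs}
index = build_index(docs, scores)

print(top_candidates(index, "widgets", 2))  # [2, 3]
```

The query only touches the front of the posting list; any query-dependent relevance scoring would then run on those few candidates. Note that doc 1 has the highest raw link score but ends up last for "widgets" because of its low site weight, which is the kind of site-level adjustment I'm asking about.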
This would probably be the best approach to fighting spam. In the Florida update, they tried to do something on the other end of the pipeline. Florida was an on-the-fly filter that was applied after the search terms were collected from the searcher. It didn't work too well. Maybe the semantic stuff was overrated internally at Google, by some engineers who had influence. Now they may be working on the pre-computed part of the algorithm. I think they'll still call it "PageRank" (at least until all the lockups expire in five months and they all dump their stock), but it's going to be something more than PageRank. I suspect the logical direction is to evaluate the page as a member of a site. There are many fewer sites than there are pages, and it might be workable.

Something else I'll throw in here. My site, a 129,000-page nonprofit site, got a special crawl over the Labor Day weekend. It was special because it was manually dispatched. I know this because they grabbed all the pages, didn't ask for anything that was 404, and didn't ask for any of the sitemap pages. Every crawled page was sorted -- they crawled from the shortest URL to the longest URL. The only way they could have done a crawl this clean would be to either study my sitemap pages, or take my CSV dump of the deep page URLs, parse out that field, and re-sort it. I've never seen a crawl like this in four years. They crawled for 36 hours. Only two IP addresses were used. About every 25 minutes, they'd hit the site for around 2 minutes only. It was very methodical. The peak fetch rate I recorded was 40 pages per second. Yes, per second -- even though almost all pages are very small, and are all static, this tripped my load alarms. I survived and let them do their thing.

Why did I go off-topic to mention this? Because I'm not sure it's off-topic. I think it might be evidence that Google is no longer exclusively looking at the web as a bunch of pages, but as pages that belong to sites.
This could explain the sandbox effect. So far, by the way, there is no evidence that this special crawl has kicked in.
If your definition of PageRank is that insane recursive formula defined by Page and Brin in 1998, which by 2002 took days to calculate after a full crawl of the entire web, then that's already gone. It disappeared in April 2003.