Here's one issue - what appears to be a single "site" can consist of page elements drawn from more than one domain. Check out the URL for the images on this page, for a basic example.
I have one client whose website has become so entangled with their parent company's site that it's hard to say whether it's one site or two. And it's served from six different domains and three different servers in three different locations.
There are site grabber tools that will spider links beginning on a certain page and then save all the pages they find on the crawl. But they would miss anything that isn't reachable by following links from that starting page: orphaned pages, doorway pages that link into the site but are never linked from it, and so on.
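To make the orphaned-page problem concrete, here is a rough sketch of what a site grabber does: a breadth-first crawl that only follows links it finds. The three-page site, the example.com URLs and the fetch callable are all made up for illustration, and it reads from a dict rather than the network so it can be run as-is.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl from start_url, staying on the same host.

    fetch is any callable that returns the HTML for a URL, or None
    if the page cannot be retrieved (an assumption of this sketch).
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site: orphan.html exists but nothing links to it.
PAGES = {
    "http://example.com/": '<a href="/about.html">About</a>',
    "http://example.com/about.html": '<a href="/">Home</a>',
    "http://example.com/orphan.html": '<a href="/">Home</a>',
}

found = crawl("http://example.com/", PAGES.get)
missed = set(PAGES) - found
print(sorted(missed))
```

The crawl never learns that orphan.html exists, because no page it visits links to it. The same-host check also means page elements pulled from a second domain would be skipped.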
Just type "site:<site.com> <some-common-word>" in Google.
Here site.com is the site in question, and the common word is some word or phrase that appears on every page (like a copyright notice)...
Of course it will give only the pages indexed by Google :)
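If you want to build that query from a script, something like the following would do. The function name and the example.com / "copyright" values are placeholders of mine, and I'm assuming the usual google.com/search?q= URL form. It only builds the URL; it doesn't fetch or parse any results.

```python
from urllib.parse import urlencode


def site_query_url(domain, common_word):
    """Build a Google search URL for a site: query (hypothetical helper)."""
    query = f"site:{domain} {common_word}"
    # urlencode escapes the colon and turns the space into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query})


url = site_query_url("example.com", "copyright")
print(url)
```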
(See tedster's caveats though - these are only the things that FAST knows about)
WebmasterWorld investigated [alltheweb.com] - lists the number of pages as 114,000+
Or type www.webmasterworld.com into the search box and find out about the 42,000 incoming links :)