OK, just kidding. Netcraft [news.netcraft.com] runs a well-known survey on the number of servers on the internet. How many pages each server delivers is hard to tell, especially when you take the increasingly popular dynamically generated pages into account.
I would assume the average to be around 10-20 pages per server.
But again, given the increasing "dynamicity" of modern websites, one is more inclined to start a discussion on "WHAT is a webpage?"
The Netcraft study is for individual domains and subdomains:
1) Microsoft.com (G: 644,000) is listed with 360 sites (microsoft.com and subdomains thereof)
2) Bbc.co.uk (G: 1,350,000) is listed with 95 sites (bbc.co.uk and subdomains thereof)
3) Tripod.com (G: 1,790,000) is listed with more than 500 sites (as they provide subdomains for their users).
4) Aol.com (G: 874,000) is listed with more than 500 sites as well.
By the way, i recall that i have done that bbc search before and got a much larger number of pages... here it is (msg #6): [webmasterworld.com...]
October 12, 2003: "bbc site:bbc.co.uk" returned 3.1 million pages, now it's down to 823,000. I don't suppose they've deleted two thirds of their pages since October, so Google must have dropped a lot of them from the index.
And/or high PR. The above examples are all PR9 except Tripod, which redirects to a Lycos subdomain (it's "only" PR8). Recent threads seem to indicate that the googlebot still does go deep, but the deep pages don't always make it to the index.
That BBC would lose around 2 million indexed pages in six months tells me that google indexes a smaller percentage of large sites now than it did before.
Still, the google.com homepage statistic went from 3-something to 4-something billion pages during this period, so if this is not due to deeper spidering it must be broader spidering. So, what's new - pages with eastern character sets? more blogs? Froogle catalog pages? News sources? email lists? forums? dynamic pages? I don't know.
>> A lot of pages are not getting spidered by the major search engines.
If that 66% gap (assuming that bbc.co.uk is no larger than 3.1 million pages) is valid across sites of different sizes, then google would index around one third of the pages available, which puts the total figure at around 12-15 billion pages.
Personally i think the Google index of 4,285,199,774 holds no more than 10-20% of the total number of individual pages available, but your guess is as good as mine.
Added: Some percentage, perhaps a large one, of the pages currently not indexed will be duplicates or near-duplicates due to session IDs, customization, and the like - another percentage will be password-protected pages, e.g. dating sites.
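The extrapolation in this post is easy to replay in a few lines of Python. This is only a sketch: the index size and the two BBC counts are the figures quoted above, while the function name and the list of coverage fractions are made up for illustration, and "one third" is the post's own rounding rather than a measured value.

```python
# Figures quoted in the thread.
GOOGLE_INDEX = 4_285_199_774      # pages reported on the google.com homepage
BBC_OCT_2003 = 3_100_000          # "bbc site:bbc.co.uk", October 12, 2003
BBC_NOW = 823_000                 # same query at the time of the post


def estimated_total(index_size: int, coverage: float) -> float:
    """Total pages implied if the index holds `coverage` (0-1) of the web."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    return index_size / coverage


# Share of the October BBC count that is still in the index.
bbc_coverage = BBC_NOW / BBC_OCT_2003

for label, coverage in [
    ("one third (post's rounding)", 1 / 3),
    ("BBC ratio", bbc_coverage),
    ("20%", 0.20),
    ("10%", 0.10),
]:
    total = estimated_total(GOOGLE_INDEX, coverage)
    print(f"{label:28s} -> about {total / 1e9:.1f} billion pages")
```

At one-third coverage this lands near 12.9 billion pages, the low end of the 12-15 billion range; at 10-20% coverage it lands between roughly 21 and 43 billion.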
If we take the 2,069,188,504 IPs and divide them by the 38,115,793 active domains, we get about 54 - so let's keep that as the total pages per site (even though I don't know how I got it).
Now taking the 38,115,793, if we take out a quarter (due to a quarter of them being "duplicatish" domains), it equals 28,586,845. Taking the 28,586,845, if we use my theory and multiply it by 54, it will equal 1,543,689,630. Taking that 1,543,689,630, we could also take out a quarter (due to duplicate web pages), and it will equal 1,157,767,222.
So according to my theory, there are approximately 1,157,767,222 "unique" web pages in the world, even though this may not be anywhere near the "actual" amount, which only God knows.
Sid
PS; I'm no Maths geek - I just wanted to show-off ;)
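For anyone who wants to check Sid's numbers, the chain of steps can be replayed like this. It is a sketch of the arithmetic only; the variable names are invented here, and the "pages per site" value is simply the IPs-per-domain ratio that the post reuses, not a measured page count.

```python
ACTIVE_DOMAINS = 38_115_793   # Whois.sc figure quoted in the thread
TOTAL_IPS = 2_069_188_504     # Whois.sc figure quoted in the thread

# IPs per domain (whole number), which the post reuses as "pages per site".
pages_per_site = TOTAL_IPS // ACTIVE_DOMAINS

# Drop a quarter of the domains as "duplicatish".
unique_domains = round(ACTIVE_DOMAINS * 0.75)

# Pages across the remaining domains.
total_pages = unique_domains * pages_per_site

# Drop a quarter of the pages as duplicates (integer division truncates,
# which matches the figure given in the post).
unique_pages = total_pages * 3 // 4

print(pages_per_site, unique_domains, total_pages, unique_pages)
```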
>> So, what's new - pages with eastern character sets? more blogs? Froogle catalog pages? News sources? email lists? forums? dynamic pages? I don't know.
I've certainly seen an immense increase in eBay-redirect-pages. There seems to be almost ALL remotely popular keywords covered! Only today I tried to estimate the value of an old Denon amplifier by searching the web for its specs. Doing so, I stumbled across slightly less than 20(!) linklist-to-eBay "sites".
Also, Kelkoo-and-the-like sites have certainly increased in number.
sidyadav:
According to Whois.sc, there are approximately 38,115,793 active domains and 2,069,188,504 IPs in the World.
Small mistake. In domains, they only count COM/NET/ORG/INFO/BIZ/US domains - all the country specific domains like .uk, .de, .nz, ... are NOT counted.
The IP-List however DOES list the countries as well!
Anyway, this seems like a nice thing to do on a Sunday morning:
1) Let's assume that 80% are for email, ftp, dialup, and various ISP-stuff. That figure might be too high, but it leaves us with:
(a): 0.20 x 2,069,188,504 = 413,837,701 web-page holding IPs (shared and dedicated)
2) Let's calculate the estimated number of shared IPs and dedicated ones first:
(b): 0.80 x 413,837,701 = 331,070,161 shared web page holding IPs
(c): 0.20 x 413,837,701 = 82,767,540 dedicated web page holding IPs
3) So, how many web pages per IP on shared IPs? First, let's recalculate our 20 pages from before, to see what that would correspond to, as we have removed 80% of the IPs in step one:
(d): 20 / 0.20 = (20/20) x 100 = 100 pages on average per "web IP" (when 80% of IPs are not hosting web pages).
That still seems like a fair number to me, i.e. not too high. It corresponds to around 85 pages per shared IP and around 170 pages per dedicated IP, assuming a dedicated IP holds twice as many pages as a shared one (*).
But, let's be careful. Let's say:
4) Get that calculator of yours. Nevermind, i got it already:
And, "the real figure is very probably higher" - but i definitely don't think it's lower than case 1.
5) Solving for the Google index of 4,285,199,774 pages:
So, i still think that Google has no more than 10-20% of the available pages indexed.
(If you find errors in the above, blame me)
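Steps 1 to 3 above can be replayed as a short sketch. The inputs are the figures from the post; the constant names are invented here, and reading "double size" as "a dedicated IP holds twice the pages of a shared one" is an assumption. The specific cases from steps 4 and 5 are not shown in the thread, so this only uses the 100-pages-per-web-IP figure from step 3.

```python
TOTAL_IPS = 2_069_188_504      # Whois.sc figure quoted in the thread
GOOGLE_INDEX = 4_285_199_774   # Google index size quoted in the thread
WEB_SHARE = 0.20               # step 1: assume 80% of IPs serve no web pages
SHARED_SHARE = 0.80            # step 2: assumed share of web IPs that are shared
PAGES_PER_IP_ALL = 20          # earlier guess, averaged over all IPs
DEDICATED_FACTOR = 2           # assumption: dedicated IP = 2x pages of a shared IP

web_ips = round(TOTAL_IPS * WEB_SHARE)            # (a)
shared_ips = round(web_ips * SHARED_SHARE)        # (b)
dedicated_ips = web_ips - shared_ips              # (c)

# (d): the same pages spread over only 20% of the IPs.
pages_per_web_ip = PAGES_PER_IP_ALL / WEB_SHARE

# Solve SHARED_SHARE*s + (1 - SHARED_SHARE)*DEDICATED_FACTOR*s = pages_per_web_ip
shared_pages = pages_per_web_ip / (
    SHARED_SHARE + (1 - SHARED_SHARE) * DEDICATED_FACTOR
)
dedicated_pages = DEDICATED_FACTOR * shared_pages

# Total pages at the step-3 average, and the share Google would hold of it.
total_pages = web_ips * pages_per_web_ip
google_share = GOOGLE_INDEX / total_pages

print(f"web IPs: {web_ips:,}  shared: {shared_ips:,}  dedicated: {dedicated_ips:,}")
print(f"pages per shared IP: {shared_pages:.1f}, per dedicated IP: {dedicated_pages:.1f}")
print(f"total: {total_pages / 1e9:.1f} billion, Google share: {google_share:.1%}")
```

The exact split comes out near 83 and 167 pages, which the post rounds to 85 and 170. At 100 pages per web IP, the Google index would be a little over 10% of the total.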
But I think the best and most accurate way to calculate this is to buy a good server, a great server actually, which is on a high-speed internet connection and has a decent amount of space, install Larbin, and leave the computer on for let's say 2 or 3 months.
Check then, and you'll find the real amount :)
You could actually do that and also make a website about it, name it the "Internet Size-Calculation Project (ISCP)" and then, when you've got the information - sell the results to whoever wants them and also take donations through PayPal - whoever donates (during the buying/crawling period) more than $100 gets the info for free ;)
Not a bad idea to earn money I guess, also, when you've done it:
a) give me credit: As the guy who gave you the crazy idea to download the web.
b) share the info with me for free - for giving you the idea ;)
hehehe... like this will ever happen.. ;)
Sid
I think this can probably be done by gathering data from ISPs. Use surfers as the spiders and log pages: if a page is already recorded, drop it; if not, add it. They can then combine all the different data into one source.
Then the problem is, there are probably millions of pages out there that never get any visits.
Mack.
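The "if it's already recorded drop it, if not add it" rule is just a set of seen URLs. A minimal sketch, assuming the visits arrive as a plain list of URL strings: the function names, the example URLs, and the session parameter names are all hypothetical, and the normalization step is an addition here, prompted by the earlier point about session IDs creating near-duplicates.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names; real sites use many others.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def normalize(url: str) -> str:
    """Lower-case scheme and host, drop the fragment and session parameters."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/",
         urlencode(query), "")
    )


def count_unique_pages(visited_urls) -> int:
    """Record each URL once: a set ignores anything already added."""
    seen = set()
    for url in visited_urls:
        seen.add(normalize(url))
    return len(seen)


# Made-up log of surfer visits.
log = [
    "http://example.com/",
    "http://EXAMPLE.com",
    "http://example.com/page?id=7&sid=abc123",
    "http://example.com/page?id=7&sid=zzz999",
    "http://example.com/page?id=8",
]
print(count_unique_pages(log))
```

As Mack notes, this still misses every page that nobody visits, so the count is a lower bound at best.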
Google has a cached copy of each of the web pages. Do you count all those?
What about email .. if a user can click and view all of his emails, each as their own web page, do you count all those?
The deep web becomes a very very big place when you start to think about each web page as a database entry.
For example, usenet alone is probably a few billion pages when you count the fact that it is mirrored in a lot of different places.
Add a billion to that, there's my best guess. :)
>>I think the real answer to this post is that this question is quite purposeless<<
On the contrary. When selling an SEO service or a type of website that connects buyers with sellers, informing the client or potential client that they are up against billions or conservatively "24,830,262,050" pages, all of a sudden having first page exposure becomes a phenomenal value.
I know, I know...but there are only 607,800 results in your query. That's not the point. The point is out of easily over 24 billion web pages on the Internet, I made the world find yours. That's a great value proposition. That's powerful marketing!
That's the point, for me, of this thread.