Amazing figures, Jim, thanks for sharing.
What are your criteria for establishing country relatedness - whois data?
Wow, that is interesting Jmccormac.
When will you be able to estimate numbers of websites by country?
Do you check whois for country of the domain owner?
owww good information Jim :)
thank you for sharing - however i am not quite sure about the figures - it would certainly help to document your selection/computing/tracing criteria -
as for Germany you might want to look at the
Denic (Germany's cc registry)
there they say :
Domains gesamt (total): 5.938.340 (and that is cc ".de" alone ...) - the real number will be much higher -
adding all the German .com's, .net's, .org's ... and all the other tld's.
|What are your criteria for establishing country relatedness - whois data? |
At the ultimate level it would be whois data, Heini. The algorithm for the basic search is simple enough in that it is based on identifying key nameservers for each country and working downwards. The registrar effect does create an error margin for the countries, in that the domains registered on a particular registrar's nameservers (joker.com/gandi.net etc) will not necessarily be sites in those particular countries. These are the domains that tend to give the most problems. The rules for the rough count are a: is the domain on a nameserver associated with the country; b: is the domain on an IP associated with the country.
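As a rough illustration of rules (a) and (b), here is a minimal Python sketch. The nameserver suffixes and IP ranges are made-up placeholders (reserved test names and documentation address blocks), not the real data set, and the two rules are reported separately because the post does not say how they are combined.

```python
import ipaddress

# Hypothetical lookup tables for illustration only; real ones would be
# built up from surveys of each country's key nameservers and IP space.
COUNTRY_NAMESERVER_SUFFIXES = {
    "IE": (".example-ie-host.test",),
    "UK": (".example-uk-host.test",),
}
COUNTRY_IP_RANGES = {
    "IE": [ipaddress.ip_network("192.0.2.0/25")],
    "UK": [ipaddress.ip_network("198.51.100.0/24")],
}


def countries_by_nameserver(nameservers):
    """Rule (a): countries whose known nameservers serve the domain."""
    found = set()
    for ns in nameservers:
        ns = ns.lower().rstrip(".")
        for country, suffixes in COUNTRY_NAMESERVER_SUFFIXES.items():
            if ns.endswith(suffixes):  # str.endswith accepts a tuple
                found.add(country)
    return found


def countries_by_ip(ip):
    """Rule (b): countries whose IP space contains the hosting address."""
    addr = ipaddress.ip_address(ip)
    return {
        country
        for country, networks in COUNTRY_IP_RANGES.items()
        if any(addr in net for net in networks)
    }


def rough_count_match(nameservers, ip):
    """Report both rules separately so that disagreements stay visible."""
    return {
        "nameserver": countries_by_nameserver(nameservers),
        "ip": countries_by_ip(ip),
    }
```

A domain on a placeholder UK nameserver but hosted on a placeholder Irish IP comes back with `{"nameserver": {"UK"}, "ip": {"IE"}}`, which is exactly the kind of domain that needs a closer look.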
The crossover between registrars in one country and customers in another country sharing the same language is the source of a lot of the confusion. Many Irish domain owners use UK hosting/registrars and as a result, the simple algorithm tags these domains as being UK. However, by developing a more complex algorithm that produces a model of a country's internet business, these errors can be reduced. The big challenge is on domains that are registered on US nameservers and are hosted on US servers. The more complex internet business model for a country will identify a lot of these but in many cases, the whois data is necessary to accurately identify whether they really belong to a country's domains. Even the whois data can be wrong - I have seen Irish domains registered with addresses of "Dublin, Ireland, UK" or in some cases "Dublin, UK". Again, this kind of iffy whois data would require a manual decision or a very good parsing of the whois data.
After a while, from building up an image based on nameservers/SOAs/website data, it becomes possible to refine the data to produce a fairly accurate set of figures for each country and patterns of domain registration become more apparent. It is probably the closest to a precise figure without individually checking the whois data for each domain. Though at 5.6 M domains, that would take a while. :)
|Domains gesamt (total): 5.938.340 (and that is cc ".de" alone ...) - the real number will be much higher - adding all the German .com's, .net's, .org's ... and all the other tld's. |
Probe, the German cctld is probably the biggest in the world at the moment. (I am not sure how .us will grow.) The use of US hosting by registrants in each country means that these figures will be on the low side. However the patterns are clear: Where there is a reasonably priced cctld and good internet connectivity, there will be more cctld domains registered in that country. One of the best examples for that is Belgium - it has about 200K .be domains but in CNO, it only has approximately 45076 domains. The Irish (.ie) cctld is a very good example of what happens when the cost of a cctld is too high and the connectivity is poor. The official figure for .ie registrations is 32488 but only 29801 of these domains had valid SOAs. The cctld is so badly run that it had not been actively deleting dead domains.
The next step in this process is to check the delegation of the domains (has it a Start Of Authority (SOA) record, and then has the domain an associated website). Then the website and other associated data are checked. Patterns of domain registration tend to become more apparent as the data on each particular domain increases. In some respects, it is all a question of reducing the margin of error on the present dataset before going for the big US hosted dataset. By using a crawler to detect links on some of the identified websites, it is possible to pick out some of the US hosted domains/sites. It has a lot of parallels with codebreaking - most of the work is really traffic analysis. However, it is the magnitude of the problem that can be, at times, quite terrifying. With codebreaking, you are either right or wrong. With this kind of work, it is all about reducing the possibility of 'wrong' and increasing the probability of 'right'. :)
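The delegation-then-website check can be pictured as a simple funnel. The sketch below is only an assumption about how such a pass might be organised: `has_soa` and `has_website` are hypothetical callables standing in for the real DNS and HTTP checks, which are not described in this thread.

```python
def delegation_funnel(domains, has_soa, has_website):
    """Count domains, delegated domains (SOA found), and domains with a website.

    has_soa and has_website are caller-supplied checks (hypothetical here);
    a domain is only tested for a website once its SOA check has passed.
    """
    counts = {"domains": 0, "soa": 0, "websites": 0}
    for domain in domains:
        counts["domains"] += 1
        if not has_soa(domain):
            continue  # not delegated, so skip the website check
        counts["soa"] += 1
        if has_website(domain):
            counts["websites"] += 1
    return counts


# Usage with stand-in data instead of live lookups:
delegated = {"a.example", "b.example"}
with_site = {"a.example"}
result = delegation_funnel(
    ["a.example", "b.example", "c.example"],
    has_soa=lambda d: d in delegated,
    has_website=lambda d: d in with_site,
)
```

Passing the checks in as functions keeps the counting logic separate from whatever resolver or crawler actually does the lookups.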
thank you - that explains some of the questions -
by the way - we are currently developing an ip2ll
service (internet protocol address/domain to country/
city/street/longitude-latitude translation) ...
going the hard way - via whois data ... and following the
"next machine" verification model ...
(actually used for mapping/visual routing of
w-lan networks ... coupled with gps based tracking and recording).
if interested - please feel free to sticky mail me ...
|When will you be able to estimate numbers of websites by country? |
I should have the French figures and the smaller country counts later today (when I wake up). The larger ones (Germany, UK, France, Spain, Italy) will take up to a day each to process. The results should be completed in the next few days. At a guess, about 75% would have websites, though only spidering/linkswamp analysis will determine whether these websites are active.
One step is identifying IPs with a large number of websites. These are often 'on hold' or 'coming soon' websites or redirection sites.
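Spotting IPs that carry a large number of websites could be sketched roughly as below. The input mapping and the `threshold` value are assumptions for illustration; the thread does not say what cut-off is used in practice.

```python
from collections import defaultdict


def crowded_ips(domain_to_ip, threshold):
    """Group domains by hosting IP and keep IPs serving at least `threshold` domains."""
    by_ip = defaultdict(list)
    for domain, ip in domain_to_ip.items():
        by_ip[ip].append(domain)
    return {
        ip: sorted(domains)
        for ip, domains in by_ip.items()
        if len(domains) >= threshold
    }


# Usage with placeholder data (documentation IP addresses):
sample = {
    "parked-one.example": "203.0.113.7",
    "parked-two.example": "203.0.113.7",
    "parked-three.example": "203.0.113.7",
    "real-site.example": "198.51.100.20",
}
suspects = crowded_ips(sample, threshold=3)
```

The IPs returned are only candidates for 'on hold', 'coming soon' or redirection pages; shared hosting also puts many genuine sites on one address, so each one would still need checking.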
Update on some of the counts (domains > SOA > websites):
France: 715570 SOA: 632788 Websites: 599228 *
Greece: 5432 SOA: 4194 Websites: 3913
Luxembourg: 2664 SOA: 2121 Websites: 2028
Iceland: 1359 SOA: 1146 Websites: 1116
These are preliminary website figures - the in-depth analysis will take a while due to following website redirects, historical nameserver migration etc. The initial figure for France is a bit high due to the existence of registrars used by clients in other countries. The free hosting packages offered by these registrars often mean that a website of a user in country A will actually appear on the servers of the free hosting server in country B. (Sorting this kind of thing out produces a headache worse than a hangover. :) )
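For anyone who wants to compare these preliminary counts with the earlier guess of about 75%, the shares can be worked out with a few lines of Python. The figures are the ones posted above; the function name is just an illustrative choice.

```python
# (domains, domains with SOA, domains with websites), as posted above
counts = {
    "France": (715570, 632788, 599228),
    "Greece": (5432, 4194, 3913),
    "Luxembourg": (2664, 2121, 2028),
    "Iceland": (1359, 1146, 1116),
}


def funnel_ratios(domains, soa, websites):
    """Return the SOA share and the website share of all counted domains."""
    return soa / domains, websites / domains


for country, (domains, soa, websites) in counts.items():
    soa_share, web_share = funnel_ratios(domains, soa, websites)
    print(f"{country}: SOA {soa_share:.1%}, websites {web_share:.1%}")
```

Both shares are taken against the total domain count, so the website share can be read directly against the 75% guess.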
> a website of a user in country A will actually appear on the servers of the free hosting server in country B.
I (in the UK) host dozens of sites in the US (not free hosting, but a good service). The names are all registered with a UK address but the sites are on US servers. I bet there are a lot of people who do this kind of thing. How does that affect your figures?
It can skew the figures considerably for some countries, kapow.
The effect has two extremes: the country with more sites outside its IP space than hosted locally and, at the other extreme, countries with one or more registrars in their IP space. The simplistic IP/hosted sites/cctld geofiltering model used by the big SEs will be badly affected by this problem.
What I am working on is a set of algorithms and domain usage and registration models for each country/area that can work around the geofiltering issues. The CNO sites hosted in the US are the hardest to find and tie to a particular country. This is what makes it more of a codebreaking problem than a simple domains problem.
The brute force attack (BFA) method of checking all whois data for every domain is one solution but it is not the most effective way of solving the problem. There are other techniques that can be used to identify domains relevant to each country. From work on Irish-owned CNO domains, certain clustering patterns emerge quickly from the data and as a result you can begin to assign probabilities to particular registrars. Some rules are pretty easy to establish, such as linkswamps and regions from the domains to be checked for a particular country. I haven't enough data yet to estimate the local:non-local percentage for the UK. But for Ireland, the percentage is probably 40% and upwards on websites. Hopefully this research will produce better search results.
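One simple way to picture assigning probabilities to registrars is to take a sample of domains whose owner country has already been confirmed (by whois or by hand) and compute, per registrar, the fraction that belongs to the country of interest. This is only a sketch under that assumption, with invented registrar names; the actual models behind these figures are not described in this thread.

```python
from collections import Counter


def registrar_probabilities(labelled_sample, country):
    """Estimate, per registrar, the share of sampled domains owned in `country`.

    labelled_sample is an iterable of (registrar, owner_country) pairs taken
    from domains whose ownership has already been verified.
    """
    totals = Counter()
    hits = Counter()
    for registrar, owner_country in labelled_sample:
        totals[registrar] += 1
        if owner_country == country:
            hits[registrar] += 1
    return {registrar: hits[registrar] / totals[registrar] for registrar in totals}


# Usage with invented registrars and labels:
sample = [
    ("registrar-a", "IE"),
    ("registrar-a", "IE"),
    ("registrar-a", "UK"),
    ("registrar-b", "US"),
]
probabilities = registrar_probabilities(sample, "IE")
```

A registrar with a high share could then have its unverified domains prioritised for the more expensive whois check, rather than checking every domain by brute force.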
Just to throw some more info in:
1 & 1, the largest German hoster, announced today that it has hit the 3 million mark and claims to be the largest hoster worldwide:
3 million domains hosted
2.1 million .de
The rest is .org, .ch, .at and some .co.uk.
I'd suggest this is pretty representative of the German market, where 1 & 1 holds some 35%.