Examples:
[linux.org...] => linux.org
[gprep.pvt.k12.md.us...] => gprep.pvt.k12.md.us
[resn23.resnet.cornell.edu...] => cornell.edu
[we.dev.pub.cookies.co.uk...] => cookies.co.uk
Does anyone know a regular expression that can get this? It's pretty complicated once you add country extensions.
what are you calling a bona fide domain name?
co.uk.? uk.? or .?
i think all of those are bona fide, depending on who you ask.
i suggest you read the o'reilly book on dns for help solving this puzzle.
i believe that book is the only computer book i have read every single page from cover to cover ;-) (so it is recommended reading indeed).
you could possibly use the geo ip database and guess at what you are looking for (just an idea), or explode the host name into an array and start at the end going backwards.
my best guess is you will get close to what you are after but never 100% perfect ;-)
take care,
If string contains,
[www....]
[www....]
https://
http://
ftp://
remove from string:
we.dev.pub.cookies.co.uk/some_dir
If string contains a '/', make end of string:
we.dev.pub.cookies.co.uk
Take last 2 dots:
.co.uk
Open data file and find the line that matches the
dot string '.co.uk':
.sch.uk
.police.uk
.plc.uk
.org.uk
.nhs.uk
.net.uk
.mod.uk
.ltd.uk
.gov.uk
.co.uk
.ac.uk
.uk
Found '.co.uk', remove:
we.dev.pub.cookies
reverse string:
seikooc.bup.ved.ew
If string contains a '.', make end of string:
seikooc
reverse string:
cookies
add dot string '.co.uk' from above:
cookies.co.uk
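The steps above can be sketched in PHP roughly as follows. This is only a sketch under assumptions: the function name registered_domain() and the short $suffixes array are made up for illustration (the real data file is about 800 lines), and it picks the longest listed suffix the host ends with instead of literally taking the last two dots and reversing the string.

```php
<?php
// Hypothetical, abbreviated suffix list; a real data file would be much longer.
$suffixes = array('.co.uk', '.org.uk', '.ac.uk', '.gov.uk', '.uk',
                  '.com', '.org', '.net', '.edu');

// Hypothetical helper: reduce a URL to "label + known suffix".
function registered_domain($url, array $suffixes)
{
    // Strip a leading scheme (http://, https://, ftp://) and a leading "www.".
    $host = preg_replace('#^[a-z][a-z0-9+.-]*://#i', '', $url);
    $host = preg_replace('#^www\.#i', '', $host);

    // If the string contains a '/', make that the end of the string.
    $slash = strpos($host, '/');
    if ($slash !== false) {
        $host = substr($host, 0, $slash);
    }
    $host = strtolower($host);

    // Find the longest listed suffix that the host ends with.
    $best = '';
    foreach ($suffixes as $suffix) {
        $len = strlen($suffix);
        if (strlen($host) > $len
            && substr($host, -$len) === $suffix
            && $len > strlen($best)) {
            $best = $suffix;
        }
    }
    if ($best === '') {
        return null; // suffix is not in the list
    }

    // Remove the suffix, then keep only the label right before it.
    $rest  = substr($host, 0, -strlen($best));
    $dot   = strrpos($rest, '.');
    $label = ($dot === false) ? $rest : substr($rest, $dot + 1);

    return $label . $best;
}

echo registered_domain('http://we.dev.pub.cookies.co.uk/some_dir', $suffixes), "\n";
```

A host whose suffix is missing from the list (gprep.pvt.k12.md.us with the short list above, for instance) comes back as null, so the result is only as good as the data file.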
Are you working on a 'whois' page?
If you need it, my data file is about 800 lines
and looks more like:
.sch.uk¦whois.nic.uk
.police.uk¦whois.nic.uk
etc.
GGG
i was being a bit of a smart ass in my previous post ;-)
anyhow, i think he is after the domain name.
so, my guess is that you would first explode between '//', then explode between '/', then explode between '.', then iterate backwards until you get the whole domain portion.
if the end of the fqdn is two letters it is probably a country domain, etc. so you check to see if the next part is more than two letters. please note this isn't foolproof, someone could have 'gg.uk' or 'gg.tv' or something like that.
but for a good number of them, this should work
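<?php
// Sample URLs covering several schemes and suffix styles.
$specimins = array();
$specimins[] = 'http://willey.freakinexample.co.uk/wonderbar/foo.php?awefawefawefawef';
$specimins[] = 'http://freakinexample.org.uk/weflkjaef/jekjk/awefawef/aefe.html';
$specimins[] = 'https://freakinexample.org/';
$specimins[] = 'ftp://freakinexample.net';
$specimins[] = 'gopher://wunder.bra.xxx.freakinexample.com';

$domain = array();
foreach ($specimins as $specimin)
{
    // Drop the scheme, then the path, then split the host into its parts.
    $pcs = @explode('//', $specimin);
    $pcs = @explode('/', $pcs[1]);
    $pcs = @explode('.', $pcs[0]);
    // Work from the top-level domain backwards.
    $pcs = array_reverse($pcs);
    if (strlen($pcs[0]) < 3)
    {
        // Two-letter ending: probably a country domain, so com/org/net
        // in the next part count as suffix (org.uk), not as the name.
        $dotcommers = array('com', 'org', 'net');
    } else {
        // Placeholder that never occurs in a host name, so nothing is stripped.
        $dotcommers = '**';
    }
    $rcs = array();
    $rcs[] = array_shift($pcs);
    $foundit = false;
    $len = count($pcs);
    for ($i = 0; $i < $len; $i++)
    {
        if (!$foundit)
        {
            // A part longer than two letters (after removing com/org/net)
            // is taken to be the domain name itself; stop after it.
            if (strlen(str_replace($dotcommers, '', $pcs[0])) > 2)
            {
                $foundit = true;
            }
            $rcs[] = array_shift($pcs);
        }
    }
    $domain[] = join('.', array_reverse($rcs));
}
echo '<pre>';
print_r($domain);
echo '</pre>';
?>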
results:
Array
(
[0] => freakinexample.co.uk
[1] => freakinexample.org.uk
[2] => freakinexample.org
[3] => freakinexample.net
[4] => freakinexample.com
)
then you can do your mx lookups. whatever..
take care,
k, maybe i've not read this right, but, wouldn't this work?
btw- are you only wanting http:// addresses?
preg_match("'^http://([^/]+)'",$the_url,$match);
echo $match[1];
That regex should match anything that's not the forward slash, so if there's not a forward slash (i.e. just the domain), it will match the rest of the string.
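If other schemes matter too, PHP's built-in parse_url() can pull the host out without a hand-written pattern. A minimal sketch; the wrapper name url_host() is made up for illustration:

```php
<?php
// Hypothetical helper: return the host part of a URL for any scheme.
function url_host($url)
{
    $host = parse_url($url, PHP_URL_HOST);
    // parse_url() finds no host when the scheme is missing, e.g. for a bare
    // "linux.org/foo", which it treats as a path. Return null in that case.
    return is_string($host) ? $host : null;
}

echo url_host('gopher://wunder.bra.xxx.freakinexample.com'), "\n";
```

Unlike the `[^/]+` pattern, this also leaves out a port number or user:password part if the URL has one.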
The purpose of this exercise is not to build a WHOIS database or anything. Rather, I was attempting to find a way to arbitrarily build a regular expression to describe a domain that would permit any and all subdomains. I am using this for a data mining project I am doing for independent study at school.
My new solution is to stem two arbitrary URLs to find a similarity which can then map (recursively if need be) to a final domain.
For example:
item 1: sub1.gprep.pvt.k12.md.us
item 2: sub5.gprep.pvt.k12.md.us
would create the expression *.gprep.pvt.k12.md.us
Furthermore, suppose you have these entities:
item 1: resn1.sub1.resnet.cornell.edu
item 2: admin.resnet.cornell.edu
item 3: telematics.hev.cornell.edu
item 4: dyn-234.dhcp.redrover.cornell.edu
item 1 and 2 would map to the expression (i.e. wildcard) *.resnet.cornell.edu, and a continued test against item 3 would rebuild this to *.cornell.edu, and this regular expression would accept item 4.
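A rough PHP sketch of the stemming idea described above. The function names (common_suffix_labels, stem_hosts, wildcard_matches) are hypothetical and this is one possible reading of the method, not the actual project code:

```php
<?php
// Hypothetical helper: the trailing labels two host names have in common.
function common_suffix_labels($a, $b)
{
    $x = array_reverse(explode('.', strtolower($a)));
    $y = array_reverse(explode('.', strtolower($b)));
    $shared = array();
    $n = min(count($x), count($y));
    // Compare label by label, starting at the end of each host name.
    for ($i = 0; $i < $n && $x[$i] === $y[$i]; $i++) {
        $shared[] = $x[$i];
    }
    return array_reverse($shared);
}

// Hypothetical helper: fold a list of hosts into one wildcard expression.
function stem_hosts(array $hosts)
{
    $stem = array_shift($hosts);
    foreach ($hosts as $host) {
        $stem = implode('.', common_suffix_labels($stem, $host));
    }
    return '*.' . $stem;
}

// Hypothetical helper: does "*.cornell.edu" accept the given host?
function wildcard_matches($pattern, $host)
{
    $suffix = substr($pattern, 1); // ".cornell.edu"
    return substr(strtolower($host), -strlen($suffix)) === $suffix;
}

$pattern = stem_hosts(array(
    'resn1.sub1.resnet.cornell.edu',
    'admin.resnet.cornell.edu',
    'telematics.hev.cornell.edu',
));
echo $pattern, "\n";
var_dump(wildcard_matches($pattern, 'dyn-234.dhcp.redrover.cornell.edu'));
```

Note that two unrelated hosts such as foo.com and bar.com would stem all the way down to *.com with this sketch, so some minimum number of shared labels is probably needed in practice.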
This method is fast in that it requires no external resources. Initially I was worried it would require O(n) lookup time, but parsing the domain components one by one from the end to the start can provide O(lg n) or O(1) lookups, depending on the database system.
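One way the end-to-start lookup might look, assuming the known stems sit in a PHP array keyed by suffix (standing in for whatever database is actually used). The $stems contents and the find_stem() name are hypothetical:

```php
<?php
// Hypothetical store of known stems, keyed by suffix without the "*." part.
$stems = array(
    'cornell.edu'         => true,
    'gprep.pvt.k12.md.us' => true,
);

// Hypothetical helper: return the first (shortest) known stem a host falls under.
function find_stem($host, array $stems)
{
    $labels = explode('.', strtolower($host));
    $suffix = '';
    // Walk from the last label toward the first, growing the suffix each
    // step. Stop before the first label so at least one label is left for
    // the wildcard to cover.
    for ($i = count($labels) - 1; $i >= 1; $i--) {
        $suffix = ($suffix === '') ? $labels[$i] : $labels[$i] . '.' . $suffix;
        if (isset($stems[$suffix])) {
            return $suffix;
        }
    }
    return null;
}

echo find_stem('dyn-234.dhcp.redrover.cornell.edu', $stems), "\n";
```

The number of probes here depends on how many labels the host name has, not on how many stems are stored; each probe is a single keyed lookup.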
Thank you for your responses though!
Mike
would have to experiment to see how accurate it would be...
[root@v019 /]# dig pvt.k12.md.us
; <<>> DiG <<>> pvt.k12.md.us
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20101
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;pvt.k12.md.us. IN A
;; AUTHORITY SECTION:
k12.md.us. 600 IN SOA NS1.UMD.EDU. HOSTMASTER.k12.md.us. 2003227743 18000 900 720000 600
;; Query time: 78 msec
;; SERVER: \#53(\)
;; WHEN: Sun Sep 7 00:48:21 2003
;; MSG SIZE rcvd: 89