Extracting a Domain NOT through parse_url?

         

mcs37

8:24 pm on Aug 8, 2003 (gmt 0)

10+ Year Member



I'm working on a project where I need to parse a URL and extract the actual bona fide domain, not the full hostname of the machine that's storing the code. It's a tricky thing to do, since DNS can get pretty complicated and WHOIS lookups are slow.

Examples:

[linux.org...] => linux.org
[gprep.pvt.k12.md.us...] => gprep.pvt.k12.md.us
[resn23.resnet.cornell.edu...] => cornell.edu
[we.dev.pub.cookies.co.uk...] => cookies.co.uk

Does anyone know a regular expression that can get this? It's pretty complicated once you add country extensions.

bakedjake

9:09 pm on Aug 8, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



[edit: read the titles]

Never mind. Useless suggestion.

waitman

2:46 am on Aug 14, 2003 (gmt 0)

10+ Year Member



i don't believe a quick and easy regular expression exists, however i am not an expert ;-)

what are you calling a bona fide domain name?

co.uk.? uk.? or .?

i think all of those are bona fide, depending on who you ask.

i suggest you read the o'reilly book on dns for help solving this puzzle.

i believe that book is the only computer book i have read every single page from cover to cover ;-) (so it is recommended reading indeed).

you could possibly use the geo ip database and guess at what you are looking for (idear), explode the array and start at the end going backwards.

my best guess is you will get close to what you are after but never 100% perfect ;-)

take care,

GeorgeGG

4:10 am on Aug 15, 2003 (gmt 0)

10+ Year Member



I don't know much about regular expressions, but I do something like this, starting from:
http://we.dev.pub.cookies.co.uk/some_dir

If string contains,
[www....]
[www....]
https://
http://
ftp://

remove from string:
we.dev.pub.cookies.co.uk/some_dir

If string contains a '/', make end of string:
we.dev.pub.cookies.co.uk

Take last 2 dots:
.co.uk

Open the data file and find the line that matches
the dot string '.co.uk':
.sch.uk
.police.uk
.plc.uk
.org.uk
.nhs.uk
.net.uk
.mod.uk
.ltd.uk
.gov.uk
.co.uk
.ac.uk
.uk

Found '.co.uk', remove:
we.dev.pub.cookies

reverse string:
seikooc.bup.ved.ew

If string contains a '.', make end of string:
seikooc

reverse string:
cookies

add dot string '.co.uk' from above:
cookies.co.uk
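
The steps above amount to a longest-suffix match against a list of known suffixes. Here is a minimal sketch in PHP, assuming a tiny hard-coded list in place of the 800-line data file. The function name registrableDomain and the $suffixes array are made up for illustration and are not from the original post; ports and user:pass@ prefixes are not handled.

```php
<?php
// Hypothetical, heavily abbreviated suffix list; a real data file is far longer.
$suffixes = array('co.uk', 'org.uk', 'ac.uk', 'uk', 'com', 'org', 'net', 'edu');

function registrableDomain(string $url, array $suffixes): ?string
{
    // Drop the scheme, then anything after the first '/'.
    $host = preg_replace('~^[a-z][a-z0-9+.-]*://~i', '', $url);
    $host = strtolower(explode('/', $host)[0]);
    $labels = explode('.', $host);

    // Try the longest candidate suffix first so 'co.uk' wins over 'uk'.
    for ($i = 0; $i < count($labels); $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            // A host that is itself a bare suffix has no registrable part.
            if ($i === 0) {
                return null;
            }
            // Keep exactly one label in front of the matched suffix.
            return $labels[$i - 1] . '.' . $candidate;
        }
    }
    return null; // no known suffix found
}

echo registrableDomain('http://we.dev.pub.cookies.co.uk/some_dir', $suffixes), "\n"; // cookies.co.uk
```

The result is only as good as the list: a suffix missing from $suffixes falls through to a shorter match or to null.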

Are you working on a 'whois' page?
If you need it, my data file is about 800 lines
and looks more like:
.sch.uk|whois.nic.uk
.police.uk|whois.nic.uk
etc.

GGG

jaski

5:10 am on Aug 15, 2003 (gmt 0)

10+ Year Member



Is this what you are looking for? [php.net...]

waitman

5:47 am on Aug 15, 2003 (gmt 0)

10+ Year Member



naw, don't think so.

i was being a bit of a smart ass in my previous post ;-)

anyhow, i think he is after the domain name.

so, my guess is that you would first explode between '//', then explode between '/', then explode between '.', then iterate backwards until you get the whole domain portion.

if the end of the fqdn is two letters it is probably a country domain, etc. so you check to see if the next part is more than two letters. please note this isn't foolproof, someone could have 'gg.uk' or 'gg.tv' or something like that.

but for a good number of them, this should work

<?php

// Sample URLs to test against. Every URL is assumed to include a scheme.
$specimens = array(
    'http://willey.freakinexample.co.uk/wonderbar/foo.php?awefawefawefawef',
    'http://freakinexample.org.uk/weflkjaef/jekjk/awefawef/aefe.html',
    'https://freakinexample.org/',
    'ftp://freakinexample.net',
    'gopher://wunder.bra.xxx.freakinexample.com',
);

$domain = array();

foreach ($specimens as $specimen) {
    // Strip the scheme, then the path, then split the host into labels.
    $pcs = explode('//', $specimen);
    $pcs = explode('/', $pcs[1]);
    $pcs = explode('.', $pcs[0]);

    // Work from the TLD backwards.
    $pcs = array_reverse($pcs);

    // A two-letter TLD is probably a country code, so 'com', 'org' and 'net'
    // directly under it count as part of the suffix, not as the name.
    if (strlen($pcs[0]) < 3) {
        $dotcommers = array('com', 'org', 'net');
    } else {
        $dotcommers = array();
    }

    // Start the result with the TLD itself.
    $rcs = array(array_shift($pcs));

    // Keep taking labels until one is still longer than two characters
    // after removing com/org/net; treat that label as the registered name.
    while ($pcs) {
        $label = array_shift($pcs);
        $rcs[] = $label;
        if (strlen(str_replace($dotcommers, '', $label)) > 2) {
            break;
        }
    }

    $domain[] = join('.', array_reverse($rcs));
}

echo '<pre>';
print_r($domain);
echo '</pre>';
?>

results:

Array
(
[0] => freakinexample.co.uk
[1] => freakinexample.org.uk
[2] => freakinexample.org
[3] => freakinexample.net
[4] => freakinexample.com
)

then you can do your mx lookups. whatever..

take care,

waitman

5:52 am on Aug 15, 2003 (gmt 0)

10+ Year Member



well, except for the school domains. i just reread his post....

maybe build a special case for those, keep going until you hit 'www' ...

;-)

waitman

5:55 am on Aug 15, 2003 (gmt 0)

10+ Year Member



and it looks like you need that other fellow's data file....

perfection will be difficult.

brotherhood of LAN

7:40 am on Aug 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>>>If string contains a '/'

k, maybe i've not read this right, but wouldn't this work?

btw- are you only wanting http:// addresses?

preg_match("'^http://([^/]+)'",$the_url,$match);
echo $match[1];

That regex should match anything that's not the forward slash, so if there's not a forward slash (i.e. just the domain), it will match the rest of the string.
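
The same idea can be loosened to accept any scheme, not just http://. This is a minimal sketch; the helper name hostFromUrl is made up for illustration, and it returns the full hostname, not the registered domain. It also assumes the URL carries no user:pass@ prefix.

```php
<?php
function hostFromUrl(string $url): ?string
{
    // A scheme followed by '://', then everything up to the first
    // '/', ':', '?' or '#' (so a port or query string is left out).
    if (preg_match('~^[a-z][a-z0-9+.-]*://([^/:?#]+)~i', $url, $match)) {
        return $match[1];
    }
    return null; // no scheme, so nothing to anchor on
}

echo hostFromUrl('gopher://wunder.bra.xxx.freakinexample.com'), "\n"; // wunder.bra.xxx.freakinexample.com
```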

waitman

2:27 pm on Aug 15, 2003 (gmt 0)

10+ Year Member



hmmm, what the guy is after is the domain name.

for example:

machine dns name is foo.bar.willey.maximinimum.goo

he wants maximinimum.goo

dmorison

2:31 pm on Aug 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How long have you got?

I mean is this an interactive thing that has to work within 2 or 3 seconds; or can you chew on it?

mcs37

9:44 pm on Sep 6, 2003 (gmt 0)

10+ Year Member



I am definitely looking for the domain, not the host name. The best solution is to use some of the code posted here, and then look up the records through some WHOIS web service. I was hoping it could be done instantly in O(n) time rather than relying on some external service. However, I think this is pretty much impossible, considering the flexibility provided in domains, UNLESS some list of all possible TLDs can be provided, so the algorithm knows that .pvt.k12.md.us is an acceptable component of a domain but not the domain itself.

The purpose of this exercise is not to build a WHOIS database or anything. Rather, I was attempting to find a way to arbitrarily build a regular expression to describe a domain that would permit any and all subdomains. I am using this for a data mining project I am doing for independent study at school.

My new solution is to stem two arbitrary URLs to find a similarity which can then map (recursively if need be) to a final domain.

For example:

item 1: sub1.gprep.pvt.k12.md.us
item 2: sub5.gprep.pvt.k12.md.us

would create the expression *.gprep.pvt.k12.md.us

Furthermore, suppose you have these entities:

item 1: resn1.sub1.resnet.cornell.edu
item 2: admin.resnet.cornell.edu
item 3: telematics.hev.cornell.edu
item 4: dyn-234.dhcp.redrover.cornell.edu

item 1 and 2 would map to the expression (i.e. wildcard) *.resnet.cornell.edu, and a continued test against item 3 would rebuild this to *.cornell.edu, and this regular expression would accept item 4.
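
The stemming step described above can be sketched as a comparison of labels from the right-hand end. This is only an illustration under assumptions: the function name commonWildcard is hypothetical, and nothing here stops two unrelated hosts from collapsing to something as broad as *.com.

```php
<?php
// Returns a wildcard such as '*.resnet.cornell.edu' covering both inputs,
// or null when they share no trailing labels.
function commonWildcard(string $a, string $b): ?string
{
    // Strip a leading '*.' so an existing wildcard can be widened further.
    $la = array_reverse(explode('.', preg_replace('/^\*\./', '', $a)));
    $lb = array_reverse(explode('.', preg_replace('/^\*\./', '', $b)));

    // Collect labels from the TLD inwards for as long as they agree.
    $shared = array();
    $n = min(count($la), count($lb));
    for ($i = 0; $i < $n && $la[$i] === $lb[$i]; $i++) {
        $shared[] = $la[$i];
    }
    if (!$shared) {
        return null;
    }
    return '*.' . implode('.', array_reverse($shared));
}

$pattern = commonWildcard('resn1.sub1.resnet.cornell.edu', 'admin.resnet.cornell.edu');
echo $pattern, "\n"; // *.resnet.cornell.edu
echo commonWildcard($pattern, 'telematics.hev.cornell.edu'), "\n"; // *.cornell.edu
```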

This method is fast in that it requires no external resources. Initially I was worried it would require O(n) lookup time, but parsing the domain components one by one from the end to the start can give O(lg n) or O(1) lookups, depending on the database system.
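
One way to read that lookup: store each wildcard base as a key and probe the hostname's suffixes starting from the end. The sketch below uses a PHP array as a stand-in for the database; $known and matchKnownDomain are illustrative names, not part of any existing code.

```php
<?php
// Hypothetical store of wildcard bases, keyed so each probe is one isset().
$known = array('cornell.edu' => true, 'gprep.pvt.k12.md.us' => true);

function matchKnownDomain(string $host, array $known): ?string
{
    $labels = explode('.', strtolower($host));
    $suffix = '';
    // Grow the suffix one label at a time, starting from the TLD.
    for ($i = count($labels) - 1; $i >= 0; $i--) {
        $suffix = $suffix === '' ? $labels[$i] : $labels[$i] . '.' . $suffix;
        if (isset($known[$suffix])) {
            return $suffix;
        }
    }
    return null; // no stored base matches this host
}

echo matchKnownDomain('dyn-234.dhcp.redrover.cornell.edu', $known), "\n"; // cornell.edu
```

The number of probes is bounded by the number of labels in the hostname, not by the size of the store.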

Thank you for your responses though!

Mike

waitman

7:51 am on Sep 7, 2003 (gmt 0)

10+ Year Member



just a thought, you might be able to use 'dig' and check out the 'authority' section.

would have to experiment to see how accurate it would be...

[root@v019 /]# dig pvt.k12.md.us

; <<>> DiG <<>> pvt.k12.md.us
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20101
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;pvt.k12.md.us. IN A

;; AUTHORITY SECTION:
k12.md.us. 600 IN SOA NS1.UMD.EDU. HOSTMASTER.k12.md.us. 2003227743 18000 900 720000 600

;; Query time: 78 msec
;; SERVER: \#53(\)
;; WHEN: Sun Sep 7 00:48:21 2003
;; MSG SIZE rcvd: 89
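
If dig is used this way, the piece of interest is the owner name of the SOA record in the authority section. Here is a sketch that pulls it out of captured dig output. The function name soaOwnerFromDig is illustrative and the sample text is trimmed from the output above. Note that the SOA owner is only the zone apex, which may or may not coincide with the registered domain.

```php
<?php
function soaOwnerFromDig(string $digOutput): ?string
{
    $inAuthority = false;
    foreach (preg_split('/\R/', $digOutput) as $line) {
        $line = trim($line);
        if (strpos($line, ';; AUTHORITY SECTION:') === 0) {
            $inAuthority = true;
            continue;
        }
        if ($inAuthority) {
            // A blank line or another ';' comment line ends the section.
            if ($line === '' || $line[0] === ';') {
                break;
            }
            // Owner name, TTL, class IN, record type SOA.
            if (preg_match('/^(\S+)\s+\d+\s+IN\s+SOA\s/', $line, $m)) {
                return rtrim($m[1], '.');
            }
        }
    }
    return null; // no SOA record in an authority section
}

$sample = <<<'TXT'
;; QUESTION SECTION:
;pvt.k12.md.us. IN A

;; AUTHORITY SECTION:
k12.md.us. 600 IN SOA NS1.UMD.EDU. HOSTMASTER.k12.md.us. 2003227743 18000 900 720000 600

;; Query time: 78 msec
TXT;

echo soaOwnerFromDig($sample), "\n"; // k12.md.us
```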