
PHP Server Side Scripting Forum

    
Extract Domain Name and Extension from a URL
could be a very interesting and tricky algo...
Anyango




msg:1272693
 8:01 am on Nov 3, 2005 (gmt 0)

Yo brothers!

You have the variable $_SERVER['HTTP_HOST'] and you want to find out exactly two things:

1) what's the domain name with extension, i.e. something.com
2) what's the subdomain name, e.g. mail, anysub

It's very easy to extract that information from the variable, either using PHP's functions or by writing 2-3 lines of your own that split the variable on the dots (.) and then work out from the tokens what the required values are.

for example

$domainarray = explode('.', $_SERVER['HTTP_HOST']);
$index=count($domainarray)-1;
$domainname= $domainarray[$index-1].".".$domainarray[$index];

for this url [anything.com...]

it shall return

$domainname=anything.com

similarly another code could easily extract subdomain name too.
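
A minimal sketch of that "similar code", assuming the extension is always a single label. The function name splitHostNaive is made up for illustration and is not from the thread. It also shows where this approach breaks down:

```php
<?php
// Hypothetical helper: treats the last two labels as the domain
// and everything in front of them as the subdomain.
function splitHostNaive($host)
{
    $labels = explode('.', $host);
    return array(
        'domain'    => implode('.', array_slice($labels, -2)),
        'subdomain' => implode('.', array_slice($labels, 0, -2)),
    );
}

$ok  = splitHostNaive('mail.anything.com');   // domain "anything.com", subdomain "mail"
$bad = splitHostNaive('www.anything.co.uk');  // domain "co.uk", which is not what we want
```

The second call is exactly the two-part-extension problem described next.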

Now the problem, which I am sure many others have faced too, is when the domain name itself has 2 dots (.) in it, like anything.co.uk, anything.me.uk, anything.org.au

In this case, neither does my code get me correct values in

1) $domainname
2) $subdomainname

nor does any PHP function built for the same purpose give correct values for those variables.

Do you guys have any logic? I don't need code exactly, just the logic if possible.

Exact inputs and expected outputs to test your algo are

[case1]

[input]
url= [examplesite.com...]
[output]
domain=examplesite.com
subdom=www

[case2]

[input]
url= [examplesite.com...]
[output]
domain=examplesite.com
subdom=

[case3]

[input]
url= [examplesite.com.pk...]
[output]
domain=examplesite.com.pk
subdom=www

and now the worst case

[worst case]

[input]
url= [secure.email.website.co.uk...]
[output]
domain=website.co.uk
subdom=secure.email

Tricky? ;)

Kami

 

Hester




msg:1272694
 9:39 am on Nov 3, 2005 (gmt 0)

Why not just split the subdomain variable down further with another split? Then just test to see if the second value exists or not. If it does, then grab whichever word you want. If not, then proceed as normal.

Eg: using your example:

[worst case]

[input]
url= [secure.email.website.co.uk...]
[output]
domain=website.co.uk
subdom=secure.email

list($word1, $word2) = array_pad(explode('.', $subdomain, 2), 2, '');

(explode takes the delimiter first; array_pad keeps list() happy when there is no dot.) Then:

if ($word2 != '') {
$subdomain = $word1;
}

I haven't tested this, and it's not the most elegant code, but hopefully you get the idea.

The result should then be:

subdomain = secure (since word2 was 'email').

Anyango




msg:1272695
 9:48 am on Nov 3, 2005 (gmt 0)

Hey

Thanks for your words, but the problem is that we can only split the subdomain variable further once it has been populated. In my example the value in the subdomain variable was the "expected output" from our algo; it was not populated already. We have to come up with an algo that simply looks at a URL and says: this is the subdomain name and this is the domain name, no matter if the domain is .com or .co.uk or .me.uk or anything.

Thanks
Kami

NomikOS




msg:1272696
 1:21 pm on Nov 3, 2005 (gmt 0)

If it is true that TLD labels (gTLD and ccTLD: generic and country code top level domains) are not used as ordinary domain names (ex: normal.com), then:

1) you make a list of all these gTLD/ccTLD labels: $tldarray
2) and consider domain = $subs.$domainName.$tld

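<?php
// $tldarray must be filled beforehand with every gTLD and ccTLD label
$subs = '';
$domainName = '';
$tld = '';
$tld_isReady = false; // set once the first non-TLD label has been found

$http_host = str_replace('http://', '' , $_SERVER['HTTP_HOST']);
$domainarray = explode('.', $http_host);
$top = count($domainarray);

// walk the labels from right to left
for ($i = 0; $i < $top; $i++)
{
$_domainPart = array_pop($domainarray);
if (!$tld_isReady)
{
if (in_array($_domainPart, $tldarray))
{
// still inside the extension: prepend this label to it
$tld = ".$_domainPart".$tld;
}
else
{
// first label that is not a TLD: this is the registered name
$domainName = $_domainPart;
$tld_isReady = true;
}
}
else
{
// everything left of the registered name is subdomain
$subs = ".$_domainPart".$subs;
}
}

echo 'subdomain is '.substr($subs ,1);
echo " and domain is $domainName$tld. Beautiful.";
?>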

This code can deal with any depth of subdomains and top level domains.

See this: [iana.org...] for a TLD list. You have work to do.

Please tell me if you test this code.

[edited by: jatar_k at 4:20 pm (utc) on Nov. 3, 2005]
[edit reason] swapped link [/edit]

Anyango




msg:1272697
 5:13 pm on Nov 3, 2005 (gmt 0)

Sounds cool! Let me try this, I'll get back here shortly with my findings.

ciao

Anyango




msg:1272698
 5:56 pm on Nov 3, 2005 (gmt 0)

Nope, that won't work, I think we've got to make some modifications to the algo.

I populated $tldarray with all available ccTLDs and gTLDs, and here are my findings after running that code on 3 test cases.

-------
[input url]
[testing.com...]
[output]
subdomain is www.testing
and domain is com.

[input url]
[testing.co.uk...]
[output]
subdomain is www.testing.co
and domain is uk.

[input url]
[testing.com.pk...]
[output]
subdomain is www.testing.com
and domain is pk.
-------

Although apparently I couldn't get it to work, this has helped me explain another important question related to the main algo. The question is:

the .uk TLD works with .co

I don't think one can have abc.com.uk, one would have abc.co.uk, right? The same is the case with .in

and the .pk TLD works with .com: you can have testing.com.pk but you can't have testing.co.pk, right?

How to tackle that issue too?

And by the way, forum88's bosses aren't taking part in the discussion... isn't this question worth your time, folks? Jatar_k?

jatar_k




msg:1272699
 6:26 pm on Nov 3, 2005 (gmt 0)

geez, I am just having coffee and breakfast it's early in the morning here, I am not up in the wee hours checking threads ;)

I don't know your answer but take a look at
[php.net...]

reliably acquiring the full domain is your first step.

from there you need to establish rules

- the full domain name will have 2, 3 or 4 parts
- there can be 1,2 or 3 periods
- the subdomain will be first or missing
- the domain will be second or first
- the tld will be 1 part or two (separated by a period) and will always include the last part and possibly the two last parts

there may be other rules as well but that will be based on profiling

if 4 parts and part 3 is in [array of terms] then part1 is subdomain

if only 2 parts are present then tld is 2 and domain is 1

so on and so on
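
The rules above could be sketched roughly as follows. This is only an illustration under assumptions: the function name splitByRules and the default $secondLevel list (the "array of terms") are invented here and the list is far from complete. The sketch also lets the subdomain span several labels, which goes slightly beyond the 2 to 4 part rule.

```php
<?php
// Sketch of the part-counting rules. $secondLevel is an assumed, incomplete
// "array of terms" that can form the first half of a two-part tld.
function splitByRules($host, $secondLevel = array('co', 'com', 'org', 'net', 'ac', 'me', 'gov', 'edu'))
{
    $parts = explode('.', strtolower($host));
    $count = count($parts);
    $result = array('subdomain' => '', 'domain' => '', 'tld' => '');

    if ($count < 2) {
        return $result; // not enough parts to apply any rule
    }

    // The tld always includes the last part, and possibly the last two.
    $tldParts = 1;
    if ($count >= 3 && in_array($parts[$count - 2], $secondLevel)) {
        $tldParts = 2;
    }

    $result['tld'] = implode('.', array_slice($parts, -$tldParts));
    $result['domain'] = $parts[$count - $tldParts - 1];
    // Whatever is left in front of the domain is the subdomain (may be empty).
    $result['subdomain'] = implode('.', array_slice($parts, 0, $count - $tldParts - 1));
    return $result;
}
```

One known weakness of this sketch: a three-part host whose middle label happens to be in the list (www.co.com, say) is misread as a two-part tld, so the list has to be profiled against real data.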

py9jmas




msg:1272700
 6:30 pm on Nov 3, 2005 (gmt 0)

Since the differentiation between 'domain' and 'subdomain' is a purely arbitrary one made up by humans, I can't see how you can expect an algorithm to do this, without a complete list of all the tlds and how you have chosen to define (sub)domains for each one.

Anyango




msg:1272701
 6:41 pm on Nov 3, 2005 (gmt 0)

Have a nice breakfast Jatar ;) I'll get back to your post later.

py9jmas

Thanks for your post. If it is true that the difference between domain and subdomain is not actually defined and is only perceived by us, then that's really new and valuable information for me; I really used to think that you could easily tell subdomain and domain apart just by looking at any URL. I am still reluctant to go with that idea, but I would trust your skills and experience more than mine ;).

Question is simple

[this-is-something.anywebsite.co.uk...]

3 questions

1) whats the tld(.com,.org,.co.uk)
2) whats the domain which you registered with registrar (abc,testing,mywebsite)
3) and whats the remaining part (i.e subdomain)

Simple!

Anyango




msg:1272702
 7:07 pm on Nov 3, 2005 (gmt 0)

Yes Jatar_K, it looks like I will have to do it the way you suggested. Let me do some research on it and write some code; I'll get back later, hopefully with some solution ;)

Ciao

Anyango




msg:1272703
 7:13 pm on Nov 3, 2005 (gmt 0)

Sorry to keep on about this, but none of the lists of ccTLDs and gTLDs that I am finding answers one important question, which would help me in making the rules, Jatar.

The question is:

which ccTLDs work with .co
and which ccTLDs work with .com

example is again

1).co.in
2).co.uk

3).com.pk
4).com.hk

py9jmas




msg:1272704
 8:41 pm on Nov 3, 2005 (gmt 0)

The top level domain is easy - it's uk. Beyond that it depends on how you define things. I get the impression you want to break the FQDN up by who 'manages' it - ie
mail.example.co.uk
co.uk as the tld as that's managed by Nominet
example as that is what the organisation registered
mail as the subdomain as that's what's left.

The problem is each domain is handled differently. ac.uk is managed by someone different to co.uk, with different rules that might affect your decision as to how to split the FQDN. What would you do with parliament.uk? There is no website there (that's at www.parliament.uk), but it is a domain. You could end up with as many rules as there are domains.
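
One way to encode "a rule per domain" is a table of every suffix under which names get registered, matched longest first. The sketch below is an assumption-laden illustration: the function name splitBySuffixList and the tiny $suffixes table are invented, and a real table would need to come from a maintained source (the Public Suffix List is one such source today).

```php
<?php
// Hypothetical, hand-maintained suffix table. This short list is an
// assumption for illustration only and is nowhere near complete.
$suffixes = array('com', 'org', 'uk', 'co.uk', 'ac.uk', 'me.uk', 'pk', 'com.pk', 'org.au', 'tk');

function splitBySuffixList($host, array $suffixes)
{
    $labels = explode('.', strtolower($host));
    $count = count($labels);

    // Try the longest candidate suffix first, always leaving at least one
    // label in front of it to serve as the registered name.
    for ($start = 1; $start < $count; $start++) {
        $candidate = implode('.', array_slice($labels, $start));
        if (in_array($candidate, $suffixes, true)) {
            return array(
                'subdomain' => implode('.', array_slice($labels, 0, $start - 1)),
                'domain'    => $labels[$start - 1] . '.' . $candidate,
                'suffix'    => $candidate,
            );
        }
    }
    return null; // no known suffix: the table needs another entry
}
```

With this table, mail.example.co.uk matches "co.uk" before "uk", while www.parliament.uk falls through to the plain "uk" entry, so both shapes are handled by data rather than by extra code.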

Hester




msg:1272705
 10:20 pm on Nov 3, 2005 (gmt 0)

3 questions

1) whats the tld(.com,.org,.co.uk)
2) whats the domain which you registered with registrar (abc,testing,mywebsite)
3) and whats the remaining part (i.e subdomain)

1. .com, .org or .uk. (But not .co.uk.) See this breakdown of how Nominet decipher a URL:

[nominet.org.uk...]

2. In the example "www.mysite.com" it is the bit after the www's, eg: "mysite.com".

3. The subdomain can be added later. Eg:

friends.mysite.com
fastcars.mysite.com
flowers.mysite.com

Etc.

Anyango




msg:1272706
 6:28 am on Nov 4, 2005 (gmt 0)

This discussion is getting more and more informative, at least for me. All three of you guys, please stay in touch, I'll update this board with my findings shortly.

Currently working on:

1) The algo in Jatar_K's style
2) The domain breakup as per py9jmas's suggestions
3) Research on the domain structure as referred to by Hester

Ciao

NomikOS




msg:1272707
 7:01 am on Nov 4, 2005 (gmt 0)

OK :( It is night here now and I am a little (very) sleepy. I will check this out.
And yeah, this discussion is getting very useful and interesting.
Regards.

Anyango




msg:1272708
 11:22 am on Nov 4, 2005 (gmt 0)

Yo Folks!

Figured it out, the mystery seems to be solved, at least for me. Before posting my findings I would like to make some points.

1) I was wondering which ccTLDs work with .co and which work with .com. I noticed that with a nice algo in place you don't need to bother about that; it will itself tell you which ccTLD works with what.

2) I would also like to explain why I need such a thing, maybe it could help someone. Guess what? Catch-all subdomains ;) . I won't go into detail, but to explain my point: this script will be used in a codebase of ours which is set up such that you can point hundreds of domains at it and it will show content based upon which domain was requested by the user/client ;)

3) Now, in order for the script to be compatible with all types of domains, there had to be a way for the script to match the requested domain name with the one stored in its database so that it can show content accordingly.

4) Above all, all the domains that point to that script have the catch-all subdomain property enabled, so there can literally be very, very strange subdomain names requested from that script too.

5) Some of you might feel that the terms I used in the discussion were probably not accurate; something I am calling a "sub-domain" might not be a sub-domain in your eyes, but my scenario was exactly as I mentioned.

6) Each of you guys, all four of you, gave me at least one good piece of advice that helped me write this. I am grateful to all four of you for your help and your time.

7) Our final rule is pretty simple:
"Remove http:// from the URL, explode it on (.) and take the last segment. If it's a gTLD then the domain extension MUST have only one part; in this case the second-last part is the domain host and whatever comes before it (possibly nothing) is the subdomain. If it's a ccTLD then the domain extension will usually have two parts, and the same applies one index back."

8) The above rule will FAIL if there is a ccTLD without a gTLD-style second level in front of it. For example, this cannot handle

[testing.tk...]

Here is what I just wrote, and it works perfectly for me.
-----------------------

<?php

function getHostDetails($url)
{

$extension="";
$domain="";
$subdomain="";
$host="";
$segments="";

$gTlds="aero, biz, com, coop, info, jobs, museum, name, net, org, pro, travel, gov, edu, mil, int";

$cTlds="ac, ad, ae, af, ag, ai, al, am, an, ao, aq, ar, as, at, au, aw, az, ax, ba, bb, bd, be, bf, bg, bh, bi, bj, bm, bn, bo, br, bs, bt, bv, bw, by, bz, ca, cc, cd, cf, cg, ch, ci, ck, cl, cm, cn, co, cr, cs, cu, cv, cx, cy, cz, de, dj, dk, dm, do, dz, ec, ee, eg, eh, er, es, et, eu, fi, fj, fk, fm, fo, fr, ga, gb, gd, ge, gf, gg, gh, gi, gl, gm, gn, gp, gq, gr, gs, gt, gu, gw, gy, hk, hm, hn, hr, ht, hu, id, ie, il, im, in, io, iq, ir, is, it, je, jm, jo, jp, ke, kg, kh, ki, km, kn, kp, kr, kw, ky, kz, la, lb, lc, li, lk, lr, ls, lt, lu, lv, ly, ma, mc, md, mg, mh, mk, ml, mm, mn, mo, mp, mq, mr, ms, mt, mu, mv, mw, mx, my, mz, na, nc, ne, nf, ng, ni, nl, no, np, nr, nu, nz, om, pa, pe, pf, pg, ph, pk, pl, pm, pn, pr, ps, pt, pw, py, qa, re, ro, ru, rw, sa, sb, sc, sd, se, sg, sh, si, sj, sk, sl, sm, sn, so, sr, st, sv, sy, sz, tc, td, tf, tg, th, tj, tk, tl, tm, tn, to, tp, tr, tt, tv, tw, tz, ua, ug, uk, um, us, uy, uz, va, vc, ve, vg, vi, vn, vu, wf, ws, ye, yt, yu, za, zm, zw";

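// The lists above are separated by ", ", so trim each entry;
// otherwise in_array() would be comparing against " biz", " uk", etc.
$gArray=array_map('trim',explode(",",$gTlds));
$cArray=array_map('trim',explode(",",$cTlds));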
$url=str_replace("http://","",$url);
$segments=explode('.',$url);
$lastIndex=count($segments)-1;

if (in_array($segments[$lastIndex],$gArray))
{
$extension=$segments[$lastIndex];
$domain=$segments[$lastIndex-1];
$host=$domain.".".$extension;
if(count($segments)>2)
{
$subdomain=str_replace(".".$host,"",$url);
}
}

else

if (in_array($segments[$lastIndex],$cArray))
{
$extension=$segments[$lastIndex-1].".".$segments[$lastIndex];
$domain=$segments[$lastIndex-2];
$host=$domain.".".$extension;
if(count($segments)>3)
{
$subdomain=str_replace(".".$host,"",$url);
}
}

$rArray['url']=$url;
$rArray['subdomain']=$subdomain;
$rArray['domain']=$domain;
$rArray['extension']=$extension;
$rArray['host']=$host;

return $rArray;
}

?>
-----------------------

Now a Test Code for that function

<?php
$url="http://testing123.anywebsite.co.uk";
$a=getHostDetails($url);
// array keys are quoted: bare words like $a[url] are an error in current PHP
echo "<table width=50% border=0>";
echo "<tr><td>Input url:<td>http://".$a['url']."</tr>";
echo "<tr><td>Sub-domain:<td>".$a['subdomain']."</tr>";
echo "<tr><td>Domain:<td>".$a['domain']."</tr>";
echo "<tr><td>Extension:<td>".$a['extension']."</tr>";
echo "<tr><td>Full Host:<td>".$a['host']."</tr>";
echo "</table><hr>";
?>

Try this function using this input url.

$url="http://this.is.a.worst.example.of.a.possible.subdomain.thisIsMyMainWebsite.com";

it will still work as expected ;)

I have a solution to point #8 above, but that solution would make the code complicated. Does anyone have a smart solution to point #8?

Thanks a lot, all of you,

waiting for your words.

Regards,

Kami

P.S.: The code was properly indented; it might not appear properly indented here, please ignore that.

[edited by: jatar_k at 4:10 pm (utc) on Nov. 4, 2005]
[edit reason] fixed sidescroll [/edit]

rpking




msg:1272709
 11:32 am on Nov 4, 2005 (gmt 0)

Just to play devil's advocate...

How about this as the trickiest input URL:

[username:password@this.is.a.worst.example.of.a.possible.subdomain.thisIsMyMainWebsite.com...]

You need to strip out the login details before splitting the domain.

Anyango




msg:1272710
 11:34 am on Nov 4, 2005 (gmt 0)

Excellent Point!

Yes, I should take care of that too.

Thanks a lot, boss.

jatar_k




msg:1272711
 4:13 pm on Nov 4, 2005 (gmt 0)

very nice work Anyango

>> with a nice algo in place, you don't need to bother about that

exactly

Anyango




msg:1272712
 4:40 pm on Nov 4, 2005 (gmt 0)

Thanks Brotha!

Your messages have always helped me a lot one way or another.

;)

NomikOS




msg:1272713
 8:41 pm on Nov 4, 2005 (gmt 0)


<?php
require_once('../core/utils/debug.inc');
$debug_vars = true;

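$subs = '';
$domainName = '';
$tld = '';
$tld_isReady = false; // set once the first non-TLD label has been found

// strip spaces AND line breaks, since the list is wrapped over several lines
$gTlds = explode(',', str_replace(array(' ', "\r", "\n"), '', "aero, biz, com, coop, info,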
jobs, museum, name, net, org, pro, travel, gov, edu, mil, int"));

$cTlds = explode(',', str_replace(array(' ', "\r", "\n"), '', "ac, ad, ae, af, ag, ai, al,
am, an, ao, aq, ar, as, at, au, aw, az, ax, ba, bb, bd, be, bf, bg, bh,
bi, bj, bm, bn, bo, br, bs, bt, bv, bw, by, bz, ca, cc, cd, cf, cg, ch,
ci, ck, cl, cm, cn, co, cr, cs, cu, cv, cx, cy, cz, de, dj, dk, dm, do,
dz, ec, ee, eg, eh, er, es, et, eu, fi, fj, fk, fm, fo, fr, ga, gb, gd,
ge, gf, gg, gh, gi, gl, gm, gn, gp, gq, gr, gs, gt, gu, gw, gy, hk, hm,
hn, hr, ht, hu, id, ie, il, im, in, io, iq, ir, is, it, je, jm, jo, jp,
ke, kg, kh, ki, km, kn, kp, kr, kw, ky, kz, la, lb, lc, li, lk, lr, ls,
lt, lu, lv, ly, ma, mc, md, mg, mh, mk, ml, mm, mn, mo, mp, mq, mr, ms,
mt, mu, mv, mw, mx, my, mz, na, nc, ne, nf, ng, ni, nl, no, np, nr, nu,
nz, om, pa, pe, pf, pg, ph, pk, pl, pm, pn, pr, ps, pt, pw, py, qa, re,
ro, ru, rw, sa, sb, sc, sd, se, sg, sh, si, sj, sk, sl, sm, sn, so, sr,
st, sv, sy, sz, tc, td, tf, tg, th, tj, tk, tl, tm, tn, to, tp, tr, tt,
tv, tw, tz, ua, ug, uk, um, us, uy, uz, va,
vc, ve, vg, vi, vn, vu, wf, ws, ye, yt, yu, za, zm, zw"));

$tldarray = array_merge($gTlds,$cTlds);

$testUrl = 'www.examplesite.com.pk';
$testUrl = 'http://secure.email.website.co.uk';
$testUrl = 'http://username:password@this.is.a.worst.shortly.subdomain.thisIsMyMainWebsite.com.cl';

if (!strstr($testUrl, 'http://'))
{
$testUrl = "http://$testUrl";
}

$testUrlParsed = parse_url(trim($testUrl));
$testUrlHost = $testUrlParsed['host'];

$domainarray = explode('.', $testUrlHost);
$top = count($domainarray);

for ($i = 0; $i < $top; $i++)
{
$_domainPart = array_pop($domainarray);

if (!$tld_isReady)
{
if (in_array($_domainPart, $tldarray))
{
$tld = ".$_domainPart".$tld;
}
else
{
$domainName = $_domainPart;
$tld_isReady = 1;
}
}
else
{
$subs = ".$_domainPart".$subs;
}
}

echo 'subdomain is: <code>'.substr($subs ,1).'</code><br>';
echo " and domain is: <code>$domainName$tld</code><br>Beautiful.";

/*
Anyango, note the str_replace call added to the $gTlds and $cTlds definitions. The errors were coming from there.
Interesting point from rpking and jatar_k (parse_url).
It is definitely necessary to go deeper into how URLs are constructed.
It would be interesting to run a battery of tests and then sell it :)

Good thread! (This a drug?)

NomikOS.-
*/
?>


AlexK




msg:1272714
 6:22 am on Nov 5, 2005 (gmt 0)

Don't forget that the port can be placed on the end of a url:

[mysite.com:80...]

Anyango




msg:1272715
 6:30 am on Nov 5, 2005 (gmt 0)

The simplest possible solution to cope with

1) a port at the end of the URL
2) username:password@ at the beginning

is to first use one of PHP's built-in functions (parse_url, for example) to parse out those elements and then pass the remaining value as the $url parameter to getHostDetails($url).

Thanks AlexK
;)
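
A minimal sketch of that pre-processing step, assuming parse_url is the built-in function used. cleanHost is a hypothetical helper name, not part of the code posted earlier in the thread:

```php
<?php
// Sketch: reduce any URL to its bare host before calling getHostDetails().
function cleanHost($url)
{
    $url = trim($url);
    // parse_url() only recognises the host reliably when a scheme is present.
    if (strpos($url, '://') === false) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    // parse_url() gives null or false when it cannot find a host.
    return is_string($host) ? strtolower($host) : '';
}

// Credentials, port and path are all dropped:
$host = cleanHost('http://username:password@sub.mysite.com:80/page.php');
```

The result can then be passed on, for example getHostDetails(cleanHost($url)), since that function only needs the dotted host name.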

NomikOS




msg:1272716
 6:03 pm on Nov 5, 2005 (gmt 0)

Good point AlexK.
In msg #21 above: somewhere before the for loop, scan the last component of $domainarray for any character that is not a letter, take its position, and cut the string off from there.

What do you think, Anyango?

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved