
Forum Moderators: coopster & jatar k


finding out base href through php curl


2:54 pm on Oct 21, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Oct 20, 2009
votes: 0

I've been searching the net for about 3 hours, looking for a method to properly format links that are indexed through curl.

Is there a way in PHP to obtain the base href?
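One possible reading of the question is "how do I read a page's <base href> once curl has fetched it". A minimal sketch of that, assuming the fetched HTML is already in a string and that PHP's DOM extension is available. The function name find_base_href is made up for illustration:

```php
<?php
// Hypothetical helper: returns the href of the first <base> tag that has one,
// or null when the page declares no base href.
function find_base_href(string $html): ?string {
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('base') as $base) {
        $href = trim($base->getAttribute('href'));
        if ($href !== '') {
            return $href;
        }
    }
    return null;
}
```

If this returns null, the usual fallback is the URL of the page itself, since relative links are then resolved against the document's own address.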

5:01 pm on Oct 21, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member rocknbil is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Nov 28, 2004
votes: 0

I'm not sure what you mean, but I can have a guess at what the problem is . . .

If you're curl-ing links from "some location", say, directory crons:


The "base href" you are looking for, I am guessing, is actually httpdocs, which is also the domain root. index.html in httpdocs is your main page.

Your curl returns links:

<a href="some-file.html">Some file</a>

If you want those at the domain root, just use a PHP string function or preg_replace to add a leading slash to it:

<a href="/some-file.html">Some file</a>
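A rough sketch of the string-function version, assuming the href value has already been pulled out of the tag. The function name add_leading_slash is hypothetical, and the checks for what to leave alone are my own guess at sensible defaults:

```php
<?php
// Hypothetical helper: turn a document-relative href into a root-relative one.
function add_leading_slash(string $href): string {
    // Leave alone: empty values, hrefs already starting with "/", fragment-only
    // links, and anything with a scheme such as http:, https:, mailto:, javascript:.
    if ($href === '' || $href[0] === '/' || $href[0] === '#'
        || preg_match('@^[a-z][a-z0-9+.-]*:@i', $href)) {
        return $href;
    }
    return '/' . $href;
}

echo add_leading_slash('some-file.html'), "\n"; // prints /some-file.html
```

Note that this only gives the right answer when the linked file really does live at the domain root.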

Hope that works, crystal ball is a little cloudy today . . .

5:14 pm on Oct 21, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Oct 20, 2009
votes: 0

Thank you, rocknbil. What I'm pulling my hair out about is this: I'm creating a spider bot to try to make a search engine. The process is below:
1. Go to the page using curl (external site)
2. Parse the HTML into a DOM
3. Access all links on the page, ignoring javascript, mailto, and some other algorithm.
4. Properly structure the links provided
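Steps 1 to 3 could be sketched roughly as follows. This is only an illustration under assumptions: the function names fetch_page and collect_hrefs are made up, the curl options shown are one reasonable choice rather than the poster's actual settings, and only javascript:, mailto: and fragment-only links are filtered because the "other algorithm" is not described.

```php
<?php
// Step 1 (assumed settings): fetch a page and report the final URL after redirects.
function fetch_page(string $url): array {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    $html = curl_exec($ch);
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    return [$html === false ? '' : $html, $finalUrl];
}

// Steps 2 and 3: parse the HTML and collect hrefs, skipping links a spider cannot follow.
function collect_hrefs(string $html): array {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid markup quietly
    $dom->loadHTML($html);
    libxml_clear_errors();

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '' || $href[0] === '#' || preg_match('/^(javascript|mailto):/i', $href)) {
            continue;
        }
        $hrefs[] = $href;
    }
    return $hrefs;
}
```

The effective URL from curl is worth keeping, because after a redirect it is that URL, not the one originally requested, that relative links belong to.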

some links are example:

I'm trying to structure each link so that it becomes a lead the spider can follow automatically. If I were to just take the links as they are, it would result in errors in the curl script.

I have this so far....if anyone can add on to how to properly structure href links, I would greatly appreciate it.

I've even thought of only displaying the links then utilizing javascript to get them all. They would be properly formatted, but I really want to try to keep it server side so I can make it a cron job.

/*FORMAT URL LINKS*/ function newleads($url, $focusurl) { /* $url: href from the DOM; $focusurl: page fetched by curl */
    /*base non secure or secure: already absolute*/
    if (substr($url, 0, 7) == "http://" || substr($url, 0, 8) == "https://") { return $url; }
    /*skip fragments, mailto: and javascript: links*/
    if (preg_match("/#/", $url) || preg_match("/mailto:/i", $url) || preg_match("/javascript:/i", $url)) { return null; }
    /*1st letter or number: append to the current page URL, adding a / if it has none at the end*/
    if (preg_match("/[A-Za-z0-9\'\"]/", substr($url, 0, 1))) {
        return (substr($focusurl, -1, 1) == "/") ? $focusurl . $url : $focusurl . '/' . $url;
    }
    /*1st / : prepend scheme and host of the current page (https? so secure pages also work)*/
    if (substr($url, 0, 1) == "/") {
        preg_match('@^(?:https?://)?[^/]+@i', $focusurl, $matches);
        return $matches[0] . $url;
    }
    return null; /*anything else is not handled*/
}

When calling the above function, $url is the link obtained from the DOM and $focusurl is the current page accessed by curl.
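One case the function above does not cover is a focus URL that points at a file rather than a directory: appending "other.html" to "http://example.com/dir/page.html" gives a path under page.html instead of under /dir/. Below is a simplified sketch of resolving a link against a base URL with parse_url, assuming $base is either the page URL or the page's <base href> if it has one. The name resolve_link is hypothetical, and this is not a full RFC 3986 resolver (for example, a query-only href such as "?page=2" is not handled correctly).

```php
<?php
// Hypothetical helper: resolve $href against $base and return an absolute URL,
// or null for links a spider cannot follow.
function resolve_link(string $href, string $base): ?string {
    if ($href === '' || $href[0] === '#' || preg_match('/^(javascript|mailto):/i', $href)) {
        return null;
    }
    $href = explode('#', $href, 2)[0];          // drop any trailing #fragment
    if (preg_match('@^https?://@i', $href)) {   // already absolute
        return $href;
    }
    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    if (substr($href, 0, 2) === '//') {         // protocol-relative: //host/path
        return $parts['scheme'] . ':' . $href;
    }
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');

    if ($href[0] === '/') {
        $path = $href;                          // root-relative
    } else {
        // Relative to the directory of the base, not to the file itself.
        $basePath = $parts['path'] ?? '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }

    // Collapse "." and ".." segments in the path, leaving any query string alone.
    $query = '';
    $q = strpos($path, '?');
    if ($q !== false) {
        $query = substr($path, $q);
        $path = substr($path, 0, $q);
    }
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    return $root . '/' . ltrim(implode('/', $out), '/') . $query;
}

echo resolve_link('other.html', 'http://example.com/dir/page.html'), "\n";
// prints http://example.com/dir/other.html
```

Returning null for unusable links lets the caller drop them with a simple check before queueing the rest for curl.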