Forum Library, Charter, Moderators: coopster & jatar k

PHP Server Side Scripting Forum

    
finding out base href through php curl
miketheman




msg:4010850
 2:54 pm on Oct 21, 2009 (gmt 0)

I've been searching the net for about 3 hours, looking for a method to properly format links that are indexed through curl.

Is there a way in PHP to obtain the base href?
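One possible reading of the question is "which URL should relative links on a fetched page be resolved against?". A rough sketch of that, assuming the DOM and curl extensions are available: look for a <base href> tag in the HTML, and fall back to the URL curl ended up at when the page declares none. The names findBaseHref and fetchPage are made up for illustration and are not part of PHP or curl.

```php
<?php
// Hypothetical helper: return the <base href> declared in $html, or fall
// back to $pageUrl (the URL curl fetched) when the page declares none.
function findBaseHref(string $html, string $pageUrl): string
{
    if (trim($html) === '') {
        return $pageUrl;            // nothing to parse
    }
    $dom = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('base') as $base) {
        $href = trim($base->getAttribute('href'));
        if ($href !== '') {
            return $href;           // the first <base> with an href wins
        }
    }
    return $pageUrl;
}

// Hypothetical helper: fetch a page with curl and also report the final
// URL after redirects, which is the fallback base for relative links.
function fetchPage(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    return [$html === false ? '' : $html, $finalUrl];
}
```

Usage would be along the lines of list($html, $finalUrl) = fetchPage($someUrl); followed by findBaseHref($html, $finalUrl). Note that a <base href> value can itself be relative, which this sketch does not handle.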

 

rocknbil




msg:4010933
 5:01 pm on Oct 21, 2009 (gmt 0)

I'm not sure what you mean, but I can have a guess at what the problem is . . .

If you're curl-ing links from "some location", say, directory crons:

/var/www/virtuals/example.com/httpdocs/crons

The "base href" you are looking for, I am guessing, is actually httpdocs, which is also the domain root. index.html in httpdocs is your main page.

Your curl returns links:

<a href="some-file.html">Some file</a>

If you want those at the domain root, just use a PHP string function or preg_replace to add a leading slash to it:

<a href="/some-file.html">Some file</a>
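As a rough sketch of the preg_replace route, assuming the hrefs are double-quoted and the page really does live at the domain root (the function name rootRelativeHrefs is made up for illustration):

```php
<?php
// Hypothetical helper: prefix document-relative hrefs with "/" so they
// point at the domain root. Anything that already starts with "/", "#",
// "?" or a scheme (http:, mailto:, javascript:, ...) is left alone.
// Only double-quoted href attributes are handled.
function rootRelativeHrefs(string $html): string
{
    return preg_replace(
        '~href="(?![/#?]|[a-z][a-z0-9+.-]*:)([^"]+)"~i',
        'href="/$1"',
        $html
    );
}

echo rootRelativeHrefs('<a href="some-file.html">Some file</a>'), "\n";
// prints: <a href="/some-file.html">Some file</a>
```

For a page that sits in a subdirectory, a leading slash alone would point the link at the wrong place, so this only fits the simple case described above.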

Hope that works, crystal ball is a little cloudy today . . .

miketheman




msg:4010945
 5:14 pm on Oct 21, 2009 (gmt 0)

Thank you, rocknbil. What I'm pulling my hair out about is this: I'm creating a spider bot to try to make a search engine. The process is below:
1. go to the page using curl (external site)
2. parse the HTML into a DOM
3. access all links on the page, ignoring javascript, mailto, and a few other kinds of link

THIS IS THE PROBLEM
4. properly structuring the links provided
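Steps 2 and 3 above could be sketched roughly like this, assuming the HTML has already been fetched with curl and the DOM extension is available. The function name extractLinks and the exact list of skipped schemes are assumptions for illustration, not the poster's actual filter.

```php
<?php
// Hypothetical helper: parse $html and return the raw href values that
// are worth following, in document order and without duplicates.
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $dom = new DOMDocument();
    // Collect parser warnings quietly; scraped markup is often invalid.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        // Skip empty links, in-page fragments, and non-HTTP schemes.
        if ($href === '' || $href[0] === '#'
            || preg_match('~^(javascript|mailto|tel|data):~i', $href)) {
            continue;
        }
        $links[] = $href;
    }
    return array_values(array_unique($links));
}
```

The hrefs come back exactly as written in the page, so they still need step 4 before curl can use them.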

Some example links:
/
here.html
/here.html
?gp=1

I'm trying to structure each link to turn it into a lead so the spider can run automatically. If I were to just take the links as they are, the curl script would produce errors.

I have this so far. If anyone can add to it and show how to properly structure href links, I would greatly appreciate it.

I've even thought of only displaying the links and then using JavaScript to get them all. They would be properly formatted, but I really want to keep it server side so I can run it as a cron job.

SNIPPET OF CURRENT LINK FORMATTING:
/*FORMAT URL LINKS*/ function newleads($url,$focusurl){
/*note: ../ segments and //host links are not resolved here*/
/*already absolute, secure or non secure*/ if (preg_match('@^https?://@i', $url)){return $url;}
/*drop any #fragment*/ $url = preg_replace('/#.*$/', '', trim($url));
/*nothing left, mailto or javascript*/ if ($url === '' || preg_match('/^(mailto|javascript):/i', $url)){return null;}
/*get scheme and host name from the current page*/ if (!preg_match('@^https?://[^/?#]+@i', $focusurl, $matches)){return null;}
/*current page without query or fragment*/ $page = preg_replace('/[?#].*$/', '', $focusurl);
/*no path at all, add the root / */ if ($page === $matches[0]){$page .= '/';}
/*1st / */ if (substr($url, 0, 1) == "/"){return $matches[0].$url;}
/*1st ? */ if (substr($url, 0, 1) == "?"){return $page.$url;}
/*anything else is relative to the current directory*/ return substr($page, 0, strrpos($page, '/') + 1).$url;
}

When calling the above function, $url is the link obtained from the DOM, and $focusurl is the current page fetched by curl.
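For comparison, here is a sketch of a more general resolver built on parse_url, loosely following the reference-resolution steps in RFC 3986. It covers the four example links from earlier (/, here.html, /here.html, ?gp=1) plus ../ segments and //host links. The names resolveLink and removeDotSegments are made up for illustration, only http and https links are kept, and user:pass@ parts of the base URL are ignored. Treat it as a starting point under those assumptions rather than a complete implementation.

```php
<?php
// Hypothetical helper: collapse "." and ".." segments in an absolute path.
function removeDotSegments(string $path): string
{
    $segments = explode('/', $path);
    $out = [];
    foreach ($segments as $segment) {
        if ($segment === '..') {
            if (count($out) > 1) {      // never pop the leading empty segment
                array_pop($out);
            }
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }
    $last = $segments[count($segments) - 1];
    if ($last === '.' || $last === '..') {
        $out[] = '';                    // keep the trailing slash
    }
    return implode('/', $out);
}

// Hypothetical helper: turn $href into an absolute URL using $baseUrl
// (the page's <base href> if it has one, otherwise the URL curl fetched).
// Returns null for links a spider should not follow.
function resolveLink(string $href, string $baseUrl): ?string
{
    $base = parse_url($baseUrl);
    if ($base === false || !isset($base['scheme'], $base['host'])) {
        return null;                    // the base must be absolute
    }
    $href = preg_replace('/#.*$/s', '', trim($href));   // drop the fragment
    if ($href === '') {
        return null;                    // empty or fragment-only: same page
    }
    if (preg_match('~^([a-z][a-z0-9+.-]*):~i', $href, $m)) {
        $scheme = strtolower($m[1]);
        // Already absolute: keep http(s), skip mailto:, javascript:, etc.
        return ($scheme === 'http' || $scheme === 'https') ? $href : null;
    }
    if (substr($href, 0, 2) === '//') {
        return $base['scheme'] . ':' . $href;           // //host/path
    }
    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');
    $basePath = $base['path'] ?? '/';

    $parts = explode('?', $href, 2);
    $path  = $parts[0];
    $query = isset($parts[1]) ? '?' . $parts[1] : '';

    if ($path === '') {
        return $origin . $basePath . $query;            // ?gp=1
    }
    if ($path[0] !== '/') {
        // Relative to the directory of the base page.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }
    return $origin . removeDotSegments($path) . $query;
}
```

With a base of http://example.com/dir/page.html, the four example links resolve to http://example.com/, http://example.com/dir/here.html, http://example.com/here.html and http://example.com/dir/page.html?gp=1.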

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved