Thank you rock n bill. What I'm pulling my hair out about is that I'm creating a spider bot to try to build a search engine. The process is below:
1. Go to the page using curl (external site)
2. Parse the HTML into a DOM
3. Access all links on the page, ignoring javascript:, mailto:, and a few other kinds of link. THIS IS THE PROBLEM
4. Properly structure the links found
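Steps 1 to 3 above could be sketched roughly like this. It is only a minimal sketch: `fetch_page()` and `extract_hrefs()` are hypothetical helper names (not part of my current script), and it assumes the curl and DOM extensions are enabled.

```php
<?php
// Hypothetical helper for step 1: download a page with curl.
function fetch_page($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // give up after 15 seconds
    $html = curl_exec($ch);
    return $html === false ? null : $html;
}

// Hypothetical helper for steps 2 and 3: parse the HTML into a DOM
// and collect the raw href values worth crawling.
function extract_hrefs($html) {
    if ($html === null || $html === '') {
        return array();
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        // skip empty links, in-page anchors, and javascript:/mailto: links
        if ($href === '' || $href[0] === '#' || preg_match('/^(javascript|mailto):/i', $href)) {
            continue;
        }
        $hrefs[] = $href;
    }
    return $hrefs;
}
```

The hrefs that come back are still in whatever raw form the page used, which is why step 4 is needed.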
Some example links:
/
here.html
/here.html
?gp=1
I'm trying to turn each link into a lead (a full URL) so the spider can run automatically. If I just took the links as they are, the curl script would fail with errors.
I have this so far... If anyone can add to how to properly structure href links, I would greatly appreciate it.
I've even thought of just displaying the links and then using JavaScript to get them all. They would be properly formatted, but I really want to keep it server side so I can run it as a cron job.
SNIPPET OF CURRENT LINK FORMATTING:
/* FORMAT URL LINKS: turn an href ($url) found on page $focusurl into a full URL */
function newleads($url, $focusurl) {
    /* already a full http:// or https:// URL: use as-is */
    if (preg_match('@^https?://@i', $url)) { return $url; }
    /* skip anchors, mailto: and javascript: links */
    if (preg_match('/#|mailto:|javascript:/i', $url)) { return null; }
    /* starts with a letter or number: append to the current page URL */
    if (preg_match("/[A-Za-z0-9\'\"]/", substr($url, 0, 1))) {
        if (substr($focusurl, -1, 1) == "/") { return $focusurl . $url; } else { return $focusurl . '/' . $url; }
    }
    /* starts with /: append to the scheme and host of the current page (http or https) */
    if (substr($url, 0, 1) == "/") {
        preg_match('@^(?:https?://)?[^/]+@i', $focusurl, $matches);
        return $matches[0] . $url;
    }
    /* anything else (for example ?gp=1) is not handled yet */
    return null;
}
When calling the above function, $url is the link obtained from the DOM and $focusurl is the URL of the current page fetched by curl.
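One possible way to cover all four example link forms (`/`, `here.html`, `/here.html`, `?gp=1`) is to split the current page URL with `parse_url()` and rebuild from its parts. This is only a sketch under some assumptions: `resolve_link()` is a hypothetical name, the base must be a full http or https URL, and it does not collapse `../` segments or recognise schemes other than javascript: and mailto:.

```php
<?php
// Hypothetical alternative to newleads(): resolve $href against the page URL $base.
// Returns a full URL, or null for links the spider should skip.
function resolve_link($href, $base) {
    $href = preg_replace('/#.*$/', '', trim($href)); // drop any #fragment
    if ($href === '' || preg_match('/^(javascript|mailto):/i', $href)) {
        return null;
    }
    if (preg_match('@^https?://@i', $href)) {
        return $href; // already a full URL
    }
    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // the base must itself be a full URL
    }
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path = isset($parts['path']) ? $parts['path'] : '/';

    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href; // protocol-relative link
    }
    if ($href[0] === '/') {
        return $root . $href; // relative to the site root
    }
    if ($href[0] === '?') {
        return $root . $path . $href; // same page, new query string
    }
    // otherwise relative to the directory of the current page
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}
```

With the current page set to `http://example.com/dir/page.html`, the four example links would resolve to `http://example.com/`, `http://example.com/dir/here.html`, `http://example.com/here.html`, and `http://example.com/dir/page.html?gp=1`. Note that a bare name such as `here.html` is resolved against the directory of the current page rather than appended to the full page URL.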