I have a complete site map for this site, which is broken down into 13 pages, each with 100 or fewer URLs. I would like to use some sort of script to pull the URLs and put them in a big list in Excel or Notepad.
Is there an easy way to do this? I posted this in PHP because I understand it well; if there is another language that does it better I may need some help.
MQ
Put your initial URLs into an array and then use something like the following function to download each one:
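function download($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);          // body only, no response headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the page instead of printing it
    $result = curl_exec($ch);                     // page contents, or false on failure
    curl_close($ch);
    return $result;
}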
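To tie this back to the 13 site map pages in the question, a loop along these lines could run each page through the downloader. Treat it as a rough sketch: the collect_pages name is made up for illustration, and the fetcher is passed in as a callable so that download() above (or anything else that returns the page body or false) can be plugged in.

```php
<?php
/**
 * Fetch every URL in $urls with $fetch and return [url => page body].
 * Pages that fail to download (false or empty result) are skipped.
 * collect_pages() is a hypothetical helper name, not part of the thread.
 */
function collect_pages(array $urls, callable $fetch): array
{
    $pages = [];
    foreach ($urls as $url) {
        $body = $fetch($url);
        if ($body === false || $body === '') {
            continue; // download failed; move on to the next page
        }
        $pages[$url] = $body;
    }
    return $pages;
}
```

With the 13 site map URLs in an array called, say, $sitemap_pages, the call would be collect_pages($sitemap_pages, 'download').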
A lazier way to do it would be to download webproxy from www.atstake.com, install it, point IE to its proxy address (localhost, port 5111), and let it do the spidering for you (a built-in feature).
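Here is a function that pulls the href attribute out of each <a> element. Basically, it takes a URL in the variable $url, grabs the links off that page, and prints each one out as a link, with the option of going to that page and doing it again.
I haven't messed with this in a while. I was working with register_globals ON, feeding the URL in through a form that posts to PHP_SELF, and running the script to print the links out. (register_globals has since been removed from PHP, so on a current version you would read the value from $_GET['url'] or $_POST['url'] instead.) Obviously you could print the output in a format you could import into Excel, such as comma-separated values.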
------------------------------
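function print_links($url)
{
    $fp = fopen($url, "r")
        or die("Could not contact $url");
    $page_contents = "";
    while (!feof($fp)) {
        $page_contents .= fread($fp, 8192);
    }
    fclose($fp);

    // Site root (scheme://host[:port]) for links that start with "/", and
    // the directory of the current page for other relative links.
    $parts = parse_url($url);
    $root = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $dir = $root . rtrim(dirname(($parts['path'] ?? '/') . 'x'), '/\\');

    // Match <a ... href="..." ...>text</a>, case-insensitively.
    preg_match_all('/<a\s[^>]*href="([^"]+)"[^>]*>([^<]*)<\/a>/i',
        $page_contents, $match_array, PREG_SET_ORDER);

    foreach ($match_array as $entry) {
        $href = html_entity_decode($entry[1]);
        $anchortext = $entry[2];
        if (preg_match('#^https?://#i', $href)) {
            $target = $href;              // already absolute
        } elseif (substr($href, 0, 1) == "/") {
            $target = $root . $href;      // relative to the site root
        } else {
            $target = "$dir/$href";       // relative to the current page
        }
        // Escape for HTML output, and URL-encode the value passed back to the script.
        $link = htmlspecialchars($target);
        $crawl = htmlspecialchars("/yourpage.php?url=" . urlencode($target));
        print("<a href=\"$link\">$anchortext</a> OR --> <a href=\"$crawl\">Crawl this page</a><br>\n");
    }
}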
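For the "big list in Excel" part of the question, something like the following might be a starting point. It is only a sketch under a couple of assumptions: it swaps the regular expression for PHP's DOMDocument parser (a different technique from the function above, chosen because it copes with single-quoted attributes and extra attributes on the tag), and the names extract_links and links_to_csv are invented for this example. Relative links are returned exactly as they appear in the page.

```php
<?php
/**
 * Return [href, anchor text] pairs for every <a href="..."> in $html.
 * extract_links() is a hypothetical name; relative links are left as-is.
 */
function extract_links(string $html): array
{
    $links = [];
    if (trim($html) === '') {
        return $links; // nothing to parse
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href !== '') { // skip named anchors without an href
            $links[] = [$href, trim($a->textContent)];
        }
    }
    return $links;
}

/** Turn the pairs into CSV text with a header row. */
function links_to_csv(array $links): string
{
    $out = fopen('php://memory', 'w+');
    fputcsv($out, ['url', 'anchor_text'], ',', '"', '');
    foreach ($links as $row) {
        fputcsv($out, $row, ',', '"', '');
    }
    rewind($out);
    $csv = stream_get_contents($out);
    fclose($out);
    return $csv;
}
```

Running each downloaded site map page through extract_links(), merging the results, and saving the output of links_to_csv() with file_put_contents() should give a single file that opens in Excel or Notepad.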