Forum Moderators: coopster


Extract URLs from a Web Page

Simple way to do this


mquarles

5:54 pm on Jan 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I am setting up a trusted feed for Ink and the third-party provider seems to be having a bit of a problem getting the full list of my URLs extracted.

I have a complete site map for this site, which is broken down into 13 pages, each with 100 or fewer URLs. I would like to use some sort of script to pull the URLs and put them in a big list in Excel or Notepad.

Is there an easy way to do this? I posted this in PHP because I understand it well; if there is another language that does it better I may need some help.

MQ

mquarles

8:33 pm on Jan 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



bump

defanjos

8:37 pm on Jan 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have used a program called Links Suite 4 that works very well. They offer a free download and an evaluation period.

dubmeier

8:43 pm on Jan 27, 2004 (gmt 0)

10+ Year Member



Write a script using cURL to retrieve the pages and then parse them for any hrefs.

Put your initial URLs into an array and then use something like the following function to download each one:

// Fetch a URL with cURL and return the page body as a string
function download($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 0);          // don't include response headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the page instead of printing it
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
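
Then loop your sitemap pages through it and pull out the hrefs. A rough sketch (the sitemap URLs and the regex here are placeholders you'd adjust for your own site):

$sitemap_pages = array(
    'http://www.example.com/sitemap1.html',
    'http://www.example.com/sitemap2.html'
    // ...all 13 sitemap pages
);

$all_urls = array();
foreach ($sitemap_pages as $page) {
    $html = download($page);
    // grab the value of every href attribute on the page
    if (preg_match_all('/href\s*=\s*"([^"]+)"/i', $html, $matches)) {
        $all_urls = array_merge($all_urls, $matches[1]);
    }
}

// one URL per line, ready to paste into Notepad or Excel
print implode("\n", array_unique($all_urls));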

A lazier way to do it would be to download WebProxy from www.atstake.com, install it, point IE at its proxy address (localhost, port 5111), and let it do the spidering for you (a built-in feature).

coopster

10:23 pm on Jan 27, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Similar thread that may provide another idea/option. This one was looking for the href attribute of the <a> element:
[webmasterworld.com...]

slade7

10:10 pm on Jan 28, 2004 (gmt 0)

10+ Year Member



You'll have to fiddle with this...

Basically, it takes a URL variable $url, grabs the links off that page, and prints each one out as a link, with the option of going to that page and doing it again.

I haven't messed with this in a while. I was working with register_globals ON, feeding the URL in through a form posting to PHP_SELF, and running the script to print the results. Obviously you could print them out in such a way that you could import into Excel (see the sketch after the function).

------------------------------
// Fetch $url, pull out the <a href="...">...</a> links, and print each one
// alongside a "crawl this page" link that feeds it back through the script
function print_links($url)
{
    $fp = fopen($url, "r")
        or die("Could not contact $url");

    // read the whole page into a string
    $page_contents = "";
    while ($new_text = fread($fp, 1024)) {
        $page_contents .= $new_text;
    }
    fclose($fp);

    // match <a href="...">anchor text</a>, case-insensitively
    $match_result = preg_match_all(
        '/<\s*A\s+HREF="([^"]+)"\s*>([^<]*)<\/A>/i',
        $page_contents,
        $match_array,
        PREG_SET_ORDER);

    foreach ($match_array as $entry) {
        $href = $entry[1];
        $anchortext = $entry[2];
        $lcheck = substr($href, 0, 1);
        if ($lcheck == "h") {
            // absolute URL: use it as-is
            print("<a href=\"$href\">$anchortext</a> OR --> <a href=\"/yourpage.php?url=$href\">Crawl this page</a><br>\n");
        } elseif ($lcheck == "/") {
            // root-relative URL: strip the leading slash, prepend the base URL
            $hreffix = substr($href, 1, 250);
            print("<a href=\"$url/$hreffix\">$anchortext</a> OR --> <a href=\"/yourpage.php?url=$url/$hreffix\">Crawl this page</a><br>\n");
        } else {
            // relative URL: prepend the base URL
            print("<a href=\"$url/$href\">$anchortext</a> OR --> <a href=\"/yourpage.php?url=$url/$href\">Crawl this page</a><br>\n");
        }
    }
}
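
For the Excel import mentioned above, one untested variation is to replace the print() lines inside the foreach loop and append each link to a tab-separated file instead (urls.txt is just an example name); Excel opens tab-separated text directly:

$out = fopen("urls.txt", "a")
    or die("Could not open urls.txt");
foreach ($match_array as $entry) {
    // URL, then anchor text, tab-separated, one link per line
    fwrite($out, $entry[1] . "\t" . $entry[2] . "\n");
}
fclose($out);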

killroy

1:10 pm on Jan 29, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What I usually do is run a SELECT DISTINCT on the path_info column of my logs database. This extracts every page that has ever been visited. Usually, after a few weeks of running, that's a pretty complete list.
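
For example, a minimal sketch (the connection details and the access_log table and path_info column names are just placeholders for whatever your logging setup uses):

// connect to the database that holds your access logs
mysql_connect('localhost', 'dbuser', 'dbpass') or die(mysql_error());
mysql_select_db('logs') or die(mysql_error());

// one row per distinct page ever requested
$result = mysql_query("SELECT DISTINCT path_info FROM access_log ORDER BY path_info");
while ($row = mysql_fetch_row($result)) {
    echo $row[0] . "\n";   // one URL path per line, ready for Notepad or Excel
}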

SN