Perl link extraction

modifying largish directory of links

         

jeremy goodrich

10:39 pm on Jan 20, 2002 (gmt 0)



The following Perl code sits inside a foreach loop I'm writing to extract links, to make it easier to rewrite a directory I'm building.

$line =~ m/http\:\/\/([^:\/]+.*\")/i;
$url = $1;
$wholeLink = "http\:\/\/" . "$url\n";

Trouble is, I'm still not very good with regex :) so I only half understand what I'm doing here. The link that gets extracted from the target file includes the trailing quote from the HTML, and I want to capture everything up to, but not including, that quote.

Any tips or pointers? (Must be something silly I'm missing, I know it!)

amoore

11:26 pm on Jan 20, 2002 (gmt 0)



Whoa. This is a good time to stop using "/" as your regex delimiter. Picking a different delimiter avoids "leaning toothpick syndrome", because the slashes in the URL no longer need escaping. I'll use "{" and "}".

Try something like this:

m{href="(.*?)"}ig

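I use "href" instead of "http" to find my links so I don't miss the relative ones. The "g" after the regex is in case there's more than one link per line. The "?" in the regex makes the ".*" non-greedy, so the match stops at the first closing quote instead of running on to the last one on the line.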
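For what it's worth, here is a rough sketch of how that pattern might behave next to a greedy one. The sample line and the variable names ($greedy, @links) are made up for illustration, not taken from the real directory file, and the pattern still assumes double-quoted href attributes.

```perl
use strict;
use warnings;

# Hypothetical sample line; the real input would come from the directory file.
my $line = '<a href="http://example.com/a">A</a> <a href="/local/b.html">B</a>';

# Greedy: .* runs to the last quote on the line, so the capture
# swallows the first link's closing quote and everything after it.
my ($greedy) = $line =~ m{href="(.*)"}i;

# Non-greedy with /g in list context: returns one capture per link,
# each ending just before its own closing quote.
my @links = $line =~ m{href="(.*?)"}ig;

print "greedy: $greedy\n";
print "link:   $_\n" for @links;
```

In list context the /g match hands back every captured group at once, so there's no need to loop over $1 by hand. For anything beyond simple, regular HTML, a real parser is probably the safer bet.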

Hope it helps.

jeremy goodrich

12:18 am on Jan 21, 2002 (gmt 0)



Thanks, I'll try that soon.