Regex for matching links

Forum Moderators: coopster

What do you use

brotherhood of LAN

5:29 pm on Feb 8, 2004 (gmt 0)

When looking to match text within <a> tags, I have this REGEX, bearing in mind that the input $tag is a single <a> tag on its own line;

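// First try: a quoted value, with " or ' on either side
preg_match("'href\s*=\s*([\"\'])([^\"\']+)([\"\'])'i",$tag,$link);
// Fallback: an unquoted value, up to the next whitespace or ">"
if(empty($link[1]))
    preg_match("'href\s*=\s*([^\s>]+)'i",$tag,$link);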

Would this be OK, or are there exceptions? I'm not too keen on having ["'] at both ends. Preferably it would find either the " or the ' character and then find the matching end character, rather than accepting either/or at both sides.

I've never been sure about using backreferences, couldn't get this code working, though if you look at it you can probably see what I want to do in light of the above...

preg_match("'href\s*=\s*([\"\'])([^\\1]+)([\\1])'i",$tag,$link);

Is that the correct way to use the backreferences? Putting some examples through my code and the output seems to be a bit flaky. I'm using similar REGEX with a preg_match_all to look for alt/title text within tags so hints about using backreferences would also be appreciated.
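
On the backreference question: as far as I know, \1 only acts as a backreference outside a character class. Inside square brackets PCRE reads it as an octal character escape, so [^\\1] does not mean "anything but the opening quote". A minimal sketch of the usual form, with a lazy (.*?) in place of the negated class; the sample tags are made up for illustration:

```php
<?php
// Sketch: match a quoted href value, with the closing quote tied to the opening one.
// Group 1 captures the opening quote; \1 outside a character class must repeat it.
// $tag and the sample values are illustrative only.
$pattern = '/href\s*=\s*(["\'])(.*?)\1/i';

$tag = '<a href="/stuff.htm" title=\'it works\'>';
preg_match($pattern, $tag, $link);
// $link[1] is the quote character, $link[2] is the value between the quotes
$doubleQuoted = $link[2];

$tag = "<a href='/say-\"hi\".htm'>";
preg_match($pattern, $tag, $link);
// A double quote inside a single-quoted value no longer ends the match
$singleQuoted = $link[2];
```

The same shape should carry over to alt/title with preg_match_all by swapping the attribute name.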

brotherhood of LAN

10:41 pm on Feb 8, 2004 (gmt 0)

No one ever matched a link before? ;)

coopster

4:34 pm on Feb 10, 2004 (gmt 0)

Are you trying to match the href attribute's value [webmasterworld.com] or the actual text between the <a>elements</a> [webmasterworld.com]?

brotherhood of LAN

5:56 pm on Feb 11, 2004 (gmt 0)

Cheers for posting coopster,

Yes, I'm matching the href attribute, as well as the alt|title attributes in another REGEX.

Does the second example look OK to you?

ikbenhet1

11:07 pm on Feb 11, 2004 (gmt 0)

Hi, I am not good at regexes, but this is what I use:

preg_match_all("/(href[= = ])(.*?)(>)(.*?)(<\/a>+)/i",$string, $matches);

Maybe you can use that. Can you make anything out of it?

brotherhood of LAN

3:55 pm on Feb 12, 2004 (gmt 0)

ikbenhet1, cheers.

>> (.*?)

If I remember right, these are called lookahead assertions. Any chance you can tell me why you use them in the REGEX? :)

>> [= = ]

Wasn't too sure about this either.

ikbenhet1

11:16 pm on Feb 12, 2004 (gmt 0)

I took these scripts from a site. I can try to explain, but, hehe, it's just a guess.

>> (.*?)

I believe that just means: grab everything in between the two "matches" that surround it, so it can be put in a PHP variable. (I believe...)

>> [= = ]

Me neither. I think it says: start matching the (.*?) from a "=" sign or a " " sign (space) until it encounters a ">" sign, then continue the regex.
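
A note on the two fragments, as a hedged sketch rather than a definitive reading of that script: (.*?) is a capturing group around a lazy (non-greedy) quantifier, not a lookahead assertion, which would be written (?=...). And [= = ] is a character class that matches one character, either "=" or a space. The sample strings below are made up:

```php
<?php
// Sketch only; $html is made-up sample input.
$html = '<b>one</b> and <b>two</b>';

// Greedy: .* grabs as much as it can, so it runs to the LAST </b>
preg_match('/<b>(.*)<\/b>/', $html, $greedy);

// Lazy: .*? grabs as little as it can, so it stops at the FIRST </b>
preg_match('/<b>(.*?)<\/b>/', $html, $lazy);

// [= = ] is a character class: exactly one character, either "=" or a space.
// Repeating "=" inside the brackets changes nothing, so it equals [= ].
$a = preg_match('/href[= = ]/', 'href=x');   // "=" follows href
$b = preg_match('/href[= ]/',   'href x');   // a space follows href
$c = preg_match('/href[= = ]/', 'href:x');   // ":" is not in the class
```

Here $greedy[1] spans both bold sections while $lazy[1] holds only the first one, which is why the lazy form is the one used between "href" and ">".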

Here is the whole function, and please don't forget to inform me when you have a better regex; this one doesn't extract 100% of the links.

brotherhood of LAN

10:26 am on Feb 16, 2004 (gmt 0)

OK, delayed response, but responding ;)

>>grab everything in between the 2 "matches"

OK, so something like

preg_match("'9(.*?)'") would be the same as preg_match("'9([^9]+)'")? I'm used to using the character class method....almost looks like they're identical in use.
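
A small sketch of how the two forms compare, using made-up input. They are close but not identical: with nothing after it, the lazy group can match zero characters, and the dot does not match a newline by default.

```php
<?php
// Sketch with made-up input; the digit 9 acts as the delimiter.
$text = 'a9bcd9e';

// Nothing follows the lazy group, so it is satisfied with zero characters
preg_match('/9(.*?)/', $text, $m1);

// The negated class is greedy: it runs up to the next 9 (or the end)
preg_match('/9([^9]+)/', $text, $m2);

// With a closing delimiter both forms capture the same text here
preg_match('/9(.*?)9/', $text, $m3);
preg_match('/9([^9]+)9/', $text, $m4);

// They differ on newlines: "." skips "\n" unless the s modifier is set
$multi = "9b\ncd9";
$dot   = preg_match('/9(.*?)9/', $multi);
$class = preg_match('/9([^9]+)9/', $multi);
```

So the two are interchangeable only when something follows the group to stop the lazy match, and when the value cannot contain a newline.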

>>doesn't extract 100% of the links

This is why I posted, here was me thinking with all these PHP users there would be a universal REGEX, or even someone else, maybe even a Perl User :)

Examples of "different" ways of making an href link

<a href="/stuff.htm"> - OK
<a href='/stuff.htm'> - OK
<a href=/stuff.htm - OK
<a href = /stuff.htm - OK
<a href=/ stuff.htm - only matches the "/"
- Links inside javascript

etc etc etc

As shown in my first post, I'm stuck with using 2 regexes....and I'm thinking it's due to some REGEX syntax ignorance.
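
One way to fold the two regexes into one, sketched under the assumption that each $tag holds a single <a> tag: put the quoted and unquoted forms in one alternation. The helper name href_value() is hypothetical, and links inside javascript are not handled.

```php
<?php
// Sketch: one pattern for quoted and unquoted href values.
// href_value() is a hypothetical helper name, not part of PHP.
function href_value($tag)
{
    // Branch 1: a quote, the value, the same quote again (groups 1 and 2)
    // Branch 2: an unquoted value running up to whitespace or ">" (group 3)
    $pattern = '/href\s*=\s*(?:(["\'])(.*?)\1|([^\s>]+))/i';
    if (!preg_match($pattern, $tag, $m)) {
        return null;                        // no href attribute found
    }
    // Group 3 only exists when the unquoted branch matched
    return isset($m[3]) ? $m[3] : $m[2];
}

$results = array(
    href_value('<a href="/stuff.htm">'),
    href_value("<a href='/stuff.htm'>"),
    href_value('<a href=/stuff.htm>'),
    href_value('<a href = /stuff.htm>'),
    href_value('<a href=/ stuff.htm>'),     // still only "/", as in the list
    href_value('<a name="top">'),
);
```

The isset() check is there because preg_match() leaves trailing groups out of the result array when they did not take part in the match.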

ikbenhet1

12:46 pm on Feb 16, 2004 (gmt 0)

I don't know much about REGEX, as you can see. You probably already know packages like LWP...
I was thinking of making a function like the one below:

$totalstring = $website_content;
$url = $anchor = array();
for ($link = 0; $link < 100; $link++) {
    // stripos() is case-insensitive, so no lowercasing is needed
    $pos = stripos($totalstring, "href");
    if ($pos === false) break;
    $pos1 = stripos($totalstring, "</a>", $pos);
    if ($pos1 === false) break;
    // everything between "href" and the closing "</a>"
    $link1 = substr($totalstring, $pos + 4, $pos1 - $pos - 4);
    $endurl = strpos($link1, ">");
    if ($endurl !== false) {
        // trim() stands in for eregi_replace(), which was removed in PHP 7
        $url[$link] = trim(substr($link1, 0, $endurl), " =\"'");
        $anchor[$link] = trim(substr($link1, $endurl + 1), "\"");
    }
    // continue searching after this "</a>"
    $totalstring = substr($totalstring, $pos1 + 4);
}

I never did, but it might be an idea to work out if there is no universal regex available.
So what about this regex, would it match the URLs better?
$matchstr = '/<a\s+.*?href=[\"\'\s]?(.*?)>(.*?)<\/a>/i';

JavaScript links are difficult: when a script doesn't use a single document.write for the entire link, the regex won't match. You can try it without parsing the JavaScript, as long as the script doesn't use variables like document.write(a+b);. Put all the document.write strings in a PHP variable, then remove every "document.write('" and "');" and you are left with the HTML links only, so you can match that string with your regex.
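
A rough sketch of that document.write idea, assuming each call passes one single-quoted string literal. The $script content is made up, and calls that use variables, like document.write(a+b);, are simply skipped:

```php
<?php
// Sketch of the idea above; $script is made-up input and only the
// simple document.write('...'); form with a literal string is handled.
$script = "document.write('<a href=\"/one.htm\">One</a>');\n"
        . "document.write(a+b);\n"
        . "document.write('<a href=\"/two.htm\">Two</a>');";

// Pull out the literal string argument of each document.write('...') call
preg_match_all("/document\.write\('(.*?)'\);/i", $script, $writes);

// Glue the pieces together so they can be matched like ordinary HTML
$html = implode("\n", $writes[1]);

preg_match_all('/<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/i', $html, $links);
// $links[1] holds the URLs, $links[2] the anchor text
```

A link that is split across several document.write calls, or built from variables, would still slip through.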

Excuse me for the long, maybe useless post, but I noticed I became a senior member in this thread, so, hehe, I gave it a try. I'd love to have a good regex, so if anybody has one, please do post.

brotherhood of LAN

1:15 pm on Feb 16, 2004 (gmt 0)

Congrats on senior membership, keys in stickymail to the posh wine and cheese tasting room ;)

I see what you're doing in the function you have there, thanks for posting it.

Here's an alternative you might find useful. (It's a cut and paste from a script, but it's the general idea of it for matching links and other stuff..it's the way I've been heading with my REGEXes for a while so I hope not to abandon it too quickly!)

$page = 'page with stuff';
$page = preg_split("'<body[^>]*>'i",$page);
// do stuff with <head> section here before deleting it
$page = preg_split("'(<[^>]+>)'",preg_replace("'</(body|html)\s*>'i","",$page[1]),-1,PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);

I also have a function that cleans the page, so there really shouldn't be any need for PREG_SPLIT_NO_EMPTY, as there is no space between the tags. Every tag, and every piece of text between tags, ends up in an array element of its own, so it's easier to match stuff. That leaves me only bothered about how to match stuff within a single tag.

$matchstr = '/<a\s+.*?href=[\"\'\s]?(.*?)>(.*?)<\/a>/i';

I think this (.*?) syntax could help me out here. The 2nd regex in my first post didn't seem to work right, as far as I remember. I'll have to throw both our slices of code together and figure out what does what :)

brotherhood of LAN

10:16 am on Feb 17, 2004 (gmt 0)

Some related stuff:

www.notestips.com/80256B3A007F2692/1/NAMO5RNV2S

I'm not one for knowing the ins and outs of the W3C "standard" but it seems you can use single quotes for attributes too.

ikbenhet1

12:17 pm on Feb 17, 2004 (gmt 0)

But is their REGEX better than the ones we have (combined)?
If so, then it's time to switch.

That page is really useful, except that it only extracts links, and not the anchor text as well.

brotherhood of LAN

1:36 pm on Feb 17, 2004 (gmt 0)

Not sure about better. The code we have looks OK, but I started the thread wondering about the alternatives ;)

AFAIK my code gets all the links, I just didn't like the fact that I had to use two regexes.

//added
There's something about \\backreferences in the regex O'Reilly book I have. It seems a lot of regex examples like the ones above don't quite cover the big picture.

coopster

1:40 pm on Feb 17, 2004 (gmt 0)

You can use [w3.org] single quotes, double quotes, numeric character references and the character entity reference &quot;. In certain cases, authors may specify the value of an attribute without any quotation marks (this is, however, invalid XHTML [w3.org]). But if you are searching web pages, especially older, non-XHTML-standards-conforming ones, you'll have to include the absence of any quotation marks in your REGEX.
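
Since a quoted value can contain &quot; or numeric character references, it may help to decode the captured value afterwards. A minimal sketch using PHP's html_entity_decode(); the sample tag is made up:

```php
<?php
// Sketch: decode character references in a captured attribute value.
// The sample tag is made up for illustration.
$tag = '<a href="/search?q=a&amp;b" title="say &quot;hi&quot; &#38; bye">';

preg_match('/title\s*=\s*(["\'])(.*?)\1/i', $tag, $m);
$rawTitle = $m[2];

// ENT_QUOTES also converts &#039; back to a single quote
$title = html_entity_decode($rawTitle, ENT_QUOTES, 'UTF-8');

preg_match('/href\s*=\s*(["\'])(.*?)\1/i', $tag, $m);
$href = html_entity_decode($m[2], ENT_QUOTES, 'UTF-8');
```

Decoding after the match, rather than before, keeps an encoded &quot; from being mistaken for the closing quote of the attribute.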

brotherhood of LAN

1:55 pm on Feb 17, 2004 (gmt 0)

Cheers for the post

>numeric character references and the character entity reference &quot;

Didn't know that, thanks, time to slip it into the regex(es)

slade7

2:48 pm on Feb 17, 2004 (gmt 0)

This works tolerably well for me.

function print_links ($url)
{
    // Fetch the page in small chunks and build up one string
    $fp = fopen($url, "r")
        or die("Could not contact $url");
    $page_contents = "";
    while ($new_text = fread($fp, 100)) {
        $page_contents .= $new_text;
    }
    fclose($fp);

    // Only matches <a href="..."> with double quotes and no other attributes
    $match_result =
        preg_match_all('/<\s*A\s*HREF="([^\"]+)"\s*>([^>]*)<\/A>/i',$page_contents,$match_array,PREG_SET_ORDER);

    foreach ($match_array as $entry) {
        $href = $entry[1];
        $anchortext = $entry[2];
        $lcheck = substr($href, 0, 1);
        if($lcheck == "h"){
            // Starts with "h": assume an absolute http URL
            print("<a href=\"$href\">$anchortext</a> ");
        }elseif($lcheck == "/"){
            // Root-relative: drop the leading slash and prefix the base URL
            $hreffix = substr($href, 1, 250);
            print("<a href=\"$url/$hreffix\">$anchortext</a> \n");
        }else{
            // Anything else is treated as relative to $url
            print("<a href=\"$url/$href\">$anchortext</a> ");
        }
    }
}