Regex to find URLs in <a> tags

         

Marked

1:11 pm on Jul 3, 2011 (gmt 0)

10+ Year Member



Hi all,

I spent like a day working on this. Basically what I am after is a regex to find all the href URLs in <a> tags and, if a URL contains a certain string, replace that part. It definitely has to pick up both single and double quotes.

To illustrate, take the following example:
<a id='user_link' class='' href="http://mysite.com/forums_real_path/index.php?showuser=1" title='Your Profile'>Username &nbsp;<span id='user_link_dd'></span></a>

I want to use regex to change 'forums_real_path' to just 'forums'. But it must be only in <a></a> tags.

Here's the code I was using, and it was working, but for some reason it didn't work on a bunch of new links:
$pattern='/<\s*A\s*HREF=(\'|")(.*?)forums_real_path(.*?)(\'|")\s*>(.*?)<\/A>/i';
$replacement='<a href=$1$2forums$3$4>$5</a>';
$final_string=preg_replace($pattern,$replacement,$string);


It's kinda messy :/

If anyone could write a much better pattern for me, I'd be very grateful :)

brotherhood of LAN

2:50 pm on Jul 3, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Using regex to match HTML can get messy. Using PHP's DOM [php.net] functions might be easier. I found them a bit awkward to get to grips with but they bypass a lot of hassle in trying to parse documents.

$dom = new DOMDocument;
$dom->loadHTML($htmlstring);
// echo Links and their anchor text
echo '<pre>';
echo "Link\tAnchor\n";
foreach($dom->getElementsByTagName('a') as $link)
{
    $href = $link->getAttribute('href');
    $anchor = $link->nodeValue;
    echo $href,"\t",$anchor,"\n";
    // Do something here
}
echo '</pre>';

lucy24

8:50 pm on Jul 3, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you posted this bit
(.*?)
next door in the Apache forums, you would immediately be read the riot act (not necessarily by me), because it means "continue capturing until further notice or until you meet a close quote, whichever comes first", so the computer is essentially racing through your file while holding its breath. Figure of speech. It will probably run "cleaner" if you express it as ([^'"]*) in the first place. And cleaner still if you've got a finite number of patterns to search for. You probably do; few things are genuinely random. Even in a URL.
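For what it's worth, here is a minimal sketch of the original task using negated character classes instead of (.*?). The function name rewrite_forum_links is made up for illustration, and the sketch assumes attribute values never contain a ">" or a quote of the other kind; links that break those assumptions are simply left untouched.

```php
<?php
// Hypothetical helper, not from the thread: rewrites 'forums_real_path'
// to 'forums', but only inside the href value of an <a> tag.
function rewrite_forum_links($html)
{
    return preg_replace_callback(
        // Group 1: '<a', any other attributes, then 'href='.
        //          [^>]*? cannot run past the end of the tag.
        // Group 2: the opening quote, single or double.
        // Group 3: the URL; [^"']* stops at the first quote it meets.
        // \2     : the closing quote must match the opening one.
        '/(<a\b[^>]*?\shref\s*=\s*)(["\'])([^"\']*)\2/i',
        function (array $m) {
            $url = str_replace('forums_real_path', 'forums', $m[3]);
            return $m[1] . $m[2] . $url . $m[2];
        },
        $html
    );
}
```

Because href no longer has to be the first attribute, a link like the example above (id and class before href) should be picked up, and text outside <a> tags is not touched.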

Marked

12:45 pm on Jul 4, 2011 (gmt 0)

10+ Year Member



@brotherhood of LAN: Ah, I just recently came across that (PHP's DOMDocument) the other day in one of my many google searches on this very matter. Is it possible to get an example of actually manipulating the html? On the surface it seems to be for getting rather than changing?

@lucy24: Haha yeah, I'm kinda very bad at regex as you have probably concluded by now having posted in 2 of my topics. Regex just frustrates me.. but noted on the (.*?), I shall not use it again lol it was just an easy fix.

g1smd

1:04 pm on Jul 4, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Handy tips:

Use "exactly"
(.*)
ONLY when the very next thing is a $ "end" anchor, or when it is the ONLY thing in the pattern. Never use (.*) at the start or in the middle of a pattern.

Use "exactly"
.*
if it is the ONLY thing in the RegEx pattern and the value is NOT being captured for re-use. Never use
.*
at the start or in the middle of a pattern.

The (.*?) pattern is less greedy but can still be problematical.
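A small sketch of the difference, on a made-up two-link string (the file names and attributes are only for illustration):

```php
<?php
$html = '<a href="a.html" id="q">A</a> <a href="b.html" class="x">B</a>';

// Greedy: .* runs to the end of the string, then backs up to the LAST quote.
preg_match('/href="(.*)"/', $html, $greedy);
// $greedy[1] is: a.html" id="q">A</a> <a href="b.html" class="x

// Lazy: starts at the first link and keeps expanding until the rest of the
// pattern fits, so it swallows the first link and part of the second.
preg_match('/<a href="(.*?)" class="x">/', $html, $lazy);
// $lazy[1] is: a.html" id="q">A</a> <a href="b.html

// Negated class: cannot cross a quote, so only the second link can match.
preg_match('/<a href="([^"]*)" class="x">/', $html, $negated);
// $negated[1] is: b.html
```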

[edited by: g1smd at 1:21 pm (utc) on Jul 4, 2011]

brotherhood of LAN

1:04 pm on Jul 4, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Marked, you can manipulate the HTML as well. Hopefully this example is clear enough. If "getattribute" is matched in the link's href, it's going to switch the link. You can use the first example to echo out the new values; this example will show you the original document with the altered HTML nodes.

<?php
$htmlstring = '<a href="…getattribute…">getattribute()</a> ';
$dom = new DOMDocument;
$dom->loadHTML($htmlstring);
foreach($dom->getElementsByTagName('a') as $link)
{
    $href = $link->getAttribute('href');
    if(preg_match("/getattribute/",$href))
    {
        $link->setAttribute('href','http://www.php.net/manual/en/domelement.setattribute.php'); // Change href attribute
        $link->nodeValue = 'setAttribute()'; // Change 'inner HTML'
    }
}
echo $dom->saveHTML();
?>

There are some good examples lurking about online. As I say, I found the manual a touch confusing, but otherwise the functions are extremely useful.
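Applied to the original question, the same approach might look roughly like this. It is only a sketch: the function name rewrite_links_with_dom is made up, and it assumes you are happy to get back the whole document as saveHTML() serialises it (doctype and html/body wrappers included, attributes re-quoted).

```php
<?php
// Hypothetical helper, not from the thread: swaps 'forums_real_path'
// for 'forums' in the href of every <a> element.
function rewrite_links_with_dom($html)
{
    // Collect libxml parse warnings instead of printing them.
    libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (strpos($href, 'forums_real_path') !== false) {
            // Only the attribute changes; the link text is left alone.
            $link->setAttribute('href', str_replace('forums_real_path', 'forums', $href));
        }
    }

    libxml_clear_errors();
    return $dom->saveHTML();
}
```

Quote style stops mattering here, because the parser hands back the attribute value with the quotes already stripped.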

Marked

7:38 am on Jul 5, 2011 (gmt 0)

10+ Year Member



@brotherhood of LAN: that is super cool :D I got it working, except that it messes with quoted html elements inside javascript.

For example, the html has the following in it (this is the ORIGINAL html):
<script type="text/javascript">
var FAVE_TEMPLATE = new Template( "<h3>Unfollow this forum</h3><div class='ipsPad'><span class='desc'>If you unfollow this forum this you will no longer receive any notifications</span><br /><p class='ipsForm_center'><input type='button' value='Unfollow this forum' class='input_submit _funset' /></p></div>");
</script>


And this is transformed by $dom->saveHTML();, even without any changes to the html, into the following:
<script type="text/javascript">
var FAVE_TEMPLATE = new Template( "<h3>Unfollow this forum<div class='ipsPad'><span class='desc'>If you unfollow this forum this you will no longer receive any notifications<br /><p class='ipsForm_center'><input type='button' value='Unfollow this forum' class='input_submit _funset' /></script></div>");


If that's a bit long, just look at the end of both of these, you can see the ending div is moved outside of the script tags.

Any idea what I can do about it?

brotherhood of LAN

9:51 am on Jul 5, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm not sure why it's doing that. My version produces an altered document too, but with all closing tags removed. I tried looking around for feedback on it but no joy (took too long :o)).. it likely has a lot to do with the doctype and default settings for the DOM functions.

Change $dom->loadHTML to $dom->loadXML, and likewise $dom->saveHTML to $dom->saveXML; it should return the desired output.

There are a few decent comments in the manual but some trial and error can get you there too.

Marked

8:48 am on Jul 6, 2011 (gmt 0)

10+ Year Member



Hmm, I see, I see. Yeah, it takes a long time doesn't it :P I'll spend like a day on like 50 lines of code, but learning is all about getting stuck in and coding properly.

Well, I guess I'm on my own from here, I'll do more reading and experimenting and see if I get lucky.

I appreciate the help :)

brotherhood of LAN

10:12 am on Jul 6, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>I guess I'm on my own from here

Not at all! You have a working version so far, let us know if you have any more questions.

g1smd

10:33 pm on Jul 6, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>I'll spend like a day on like 50 lines of code.

Yes. That happens to me on a regular basis too.