Regex to find URLs in <a> tags

     
1:11 pm on Jul 3, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:June 30, 2009
posts:74
votes: 0


Hi all,

I spent like a day working on this. Basically what I am after is a regex to find all the href URLs in <a> tags and, if a URL contains a certain string, replace that part of it. It definitely has to pick up both single and double quotes.

To illustrate, take the following example:
<a id='user_link' class='' href="http://mysite.com/forums_real_path/index.php?showuser=1" title='Your Profile'>Username &nbsp;<span id='user_link_dd'></span></a>

I want to use regex to change 'forums_real_path' to just 'forums'. But it must be only in <a></a> tags.

Here's the code I was using, and it was working, but for some reason it didn't work on a bunch of new links:
$pattern='/<\s*A\s*HREF=(\'|")(.*?)forums_real_path(.*?)(\'|")\s*>(.*?)<\/A>/i';
$replacement='<a href=$1$2forums$3$4>$5</a>';
$final_string=preg_replace($pattern,$replacement,$string);


It's kinda messy :/

If anyone could write a much better pattern for me, I'd be very grateful :)
2:50 pm on July 3, 2011 (gmt 0)

Moderator from GB 

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4842
votes: 1


Using regex to match HTML can get messy. Using PHP's DOM [php.net] functions might be easier. I found them a bit awkward to get to grips with but they bypass a lot of hassle in trying to parse documents.

$dom = new DOMDocument;
$dom->loadHTML($htmlstring);
// echo Links and their anchor text
echo '<pre>';
echo "Link\tAnchor\n";
foreach($dom->getElementsByTagName('a') as $link)
{
    $href = $link->getAttribute('href');
    $anchor = $link->nodeValue;
    echo $href,"\t",$anchor,"\n";
    // Do something here
}
echo '</pre>';
8:50 pm on July 3, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:12999
votes: 289


If you posted this bit
(.*?)
next door in the Apache forums, you would immediately be read the riot act (not necessarily by me), because it means "continue capturing until further notice or until you meet a close quote, whichever comes first", so the computer is essentially racing through your file while holding its breath. Figure of speech. It will probably run "cleaner" if you express it as ([^'"]*) in the first place. And cleaner still if you've got a finite number of patterns to search for. You probably do; few things are genuinely random. Even in a URL.
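For anyone who wants to see the negated-class idea applied to the original question, here is a minimal sketch. The function name rewriteForumLinks() and the sample markup are made up for illustration; they are not from the thread. Note that the pattern in the first post only matches when href is the first attribute and the tag closes right after it, which is probably why links carrying id, class or title attributes were skipped. This version allows other attributes on either side and uses a backreference so the closing quote has to match the opening one.

```php
<?php
// Hypothetical helper (not from the thread): swaps one path segment inside
// the href value of <a> tags and leaves every other attribute untouched.
function rewriteForumLinks(string $html): string
{
    // $1 = everything from "<a" up to "href=", $2 = the opening quote,
    // $3/$4 = the URL text before/after the segment, \2 = the same quote again.
    $pattern = '/(<a\b[^>]*?\bhref\s*=\s*)(["\'])([^"\']*?)forums_real_path([^"\']*)\2/i';

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return preg_replace($pattern, '$1$2$3forums$4$2', $html) ?? $html;
}

$html = "<a id='user_link' class='' href=\"http://mysite.com/forums_real_path/index.php?showuser=1\" title='Your Profile'>Username</a>"
      . '<img src="http://mysite.com/forums_real_path/logo.png">';

// The <a> href is rewritten; the <img> src is left alone.
echo rewriteForumLinks($html), "\n";
```

This is still regex-on-HTML, so it has blind spots: it replaces only the first occurrence of the segment in each href, and it would also match an attribute such as data-href.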
12:45 pm on July 4, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:June 30, 2009
posts:74
votes: 0


@brotherhood of LAN: Ah, I just recently came across that (PHP's DOMDocument) the other day in one of my many Google searches on this very matter. Is it possible to get an example of actually manipulating the HTML? On the surface it seems to be for getting rather than changing?

@lucy24: Haha yeah, I'm kinda very bad at regex as you have probably concluded by now having posted in 2 of my topics. Regex just frustrates me.. but noted on the (.*?), I shall not use it again lol it was just an easy fix.
1:04 pm on July 4, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Handy tips:

Use "exactly" (.*) ONLY when the very next thing is a $ "end" anchor, or when it is the ONLY thing in the pattern. Never use (.*) at the start or in the middle of a pattern.

Use "exactly" .* if it is the ONLY thing in the RegEx pattern and the value is NOT being captured for re-use. Never use .* at the start or in the middle of a pattern.

The (.*?) pattern is less greedy but can still be problematical.

[edited by: g1smd at 1:21 pm (utc) on Jul 4, 2011]
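A small demonstration of the tips above, as a sketch. The two-link sample string is invented for illustration. It shows the greedy form swallowing everything up to the last quote, the lazy form behaving in the simple case but overrunning once more pattern follows it, and a negated character class staying inside one attribute value.

```php
<?php
$html = '<a href="one.html">One</a> <a href="two.html">Two</a>';

// Greedy: (.*) runs to the end of the string, then backs up to the LAST quote.
preg_match('/href="(.*)"/', $html, $greedy);

// Lazy: (.*?) stops at the first quote in this simple case...
preg_match('/href="(.*?)"/', $html, $lazy);

// ...but it keeps expanding past quotes when the rest of the pattern
// only matches further along.
preg_match('/href="(.*?)">Two/', $html, $lazyOverrun);

// Negated class: [^"]* can never cross a quote, so the engine has to
// move on to the second link to find a match.
preg_match('/href="([^"]*)">Two/', $html, $negated);

echo $greedy[1], "\n";      // one.html">One</a> <a href="two.html
echo $lazy[1], "\n";        // one.html
echo $lazyOverrun[1], "\n"; // one.html">One</a> <a href="two.html
echo $negated[1], "\n";     // two.html
```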

1:04 pm on July 4, 2011 (gmt 0)

Moderator from GB 

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4842
votes: 1


Marked, you can manipulate the HTML as well. Hopefully this example is clear enough. If "getattribute" is matched in the link, it's going to switch the link. You can use the first example to echo out the new values, this example will show you the original document with the altered HTML nodes.

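<?php
// The opening of this sample was lost from the post; the $htmlstring line below is an assumed reconstruction.
$htmlstring = '<a href="http://www.php.net/manual/en/domelement.getattribute.php">getattribute()</a>';
$dom = new DOMDocument;
$dom->loadHTML($htmlstring);
foreach($dom->getElementsByTagName('a') as $link)
{
    $href = $link->getAttribute('href');
    if(preg_match("/getattribute/",$href))
    {
        $link->setAttribute('href','http://www.php.net/manual/en/domelement.setattribute.php'); // Change href attribute
        $link->nodeValue = 'setAttribute()'; // Change 'inner HTML'
    }
}
echo $dom->saveHTML();
?>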

There are some good examples lurking about online. As I say, I found the manual a touch confusing, but otherwise the functions are extremely useful.
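Applied to the question in the first post, the same approach might look like the sketch below. The function name rewriteAnchorHrefs() and the sample markup are hypothetical, not from the thread. It uses a plain str_replace() on the href value, so no regex is needed and the quote style in the source stops mattering.

```php
<?php
// Hypothetical helper (not from the thread): replaces $search with $replace
// in the href of every <a> element and returns the re-serialised HTML.
function rewriteAnchorHrefs(string $html, string $search, string $replace): string
{
    $dom = new DOMDocument;

    // Collect parser warnings internally instead of printing them;
    // real-world markup is rarely clean.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (strpos($href, $search) !== false) {
            $link->setAttribute('href', str_replace($search, $replace, $href));
        }
    }

    return $dom->saveHTML();
}

$html = '<p><a class="x" href=\'http://mysite.com/forums_real_path/index.php?showuser=1\'>User</a>'
      . '<img src="/forums_real_path/logo.png"></p>';

// Only the <a> href changes; the <img> src keeps forums_real_path.
echo rewriteAnchorHrefs($html, 'forums_real_path', 'forums');
```

Keep in mind that saveHTML() returns a whole document, so a fragment comes back wrapped in a doctype plus html and body tags, and attribute values are written out with double quotes.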
7:38 am on July 5, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:June 30, 2009
posts:74
votes: 0


@brotherhood of LAN: that is super cool :D I got it working, except that it messes with quoted HTML elements inside JavaScript.

For example, the HTML has the following in it (this is the ORIGINAL HTML):
<script type="text/javascript">
var FAVE_TEMPLATE = new Template( "<h3>Unfollow this forum</h3><div class='ipsPad'><span class='desc'>If you unfollow this forum this you will no longer receive any notifications</span><br /><p class='ipsForm_center'><input type='button' value='Unfollow this forum' class='input_submit _funset' /></p></div>");
</script>


And this is transformed after $dom->saveHTML();, even without any changes to the HTML, into the following:
<script type="text/javascript">
var FAVE_TEMPLATE = new Template( "<h3>Unfollow this forum<div class='ipsPad'><span class='desc'>If you unfollow this forum this you will no longer receive any notifications<br /><p class='ipsForm_center'><input type='button' value='Unfollow this forum' class='input_submit _funset' /></script></div>");


If that's a bit long, just look at the end of both of these, you can see the ending div is moved outside of the script tags.

Any idea what I can do about it?
9:51 am on July 5, 2011 (gmt 0)

Moderator from GB 

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4842
votes: 1


I'm not sure why it's doing that. My version produces an altered document too, but with all the closing tags removed. I tried looking around for feedback on it but no joy (took too long :o)). It likely has a lot to do with the doctype and default settings for the DOM functions.

Change $dom->loadHTML to $dom->loadXML and do the same for the save function (saveXML); it should return the desired output.

There are a few decent comments in the manual but some trial and error can get you there too.
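One way to narrow this down is to ask libxml what it objected to while parsing. The sketch below is an illustration only: htmlParseMessages() is a made-up helper name and the sample markup is a cut-down version of the script block above. My understanding, which I have not checked against every libxml2 version, is that older versions of the HTML parser follow the HTML 4 rule that "</" ends the content of a script element, which would explain closing tags inside the JavaScript string being dropped or moved. If the messages point that way, a commonly suggested workaround is to write those closing tags as <\/h3> and so on inside the string. Also bear in mind that loadXML() only accepts well-formed XML, so typical forum HTML (unclosed tags, entities such as &nbsp;) may be rejected outright.

```php
<?php
// Hypothetical helper (not from the thread): loads HTML and returns the
// parser's complaints as strings instead of letting PHP print warnings.
function htmlParseMessages(string $html): array
{
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError object with line and message properties.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

$html = '<html><body><script type="text/javascript">'
      . 'var t = "<h3>Unfollow this forum</h3>";'
      . '</script></body></html>';

// What gets reported (if anything) depends on the installed libxml2 version.
print_r(htmlParseMessages($html));
```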
8:48 am on July 6, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:June 30, 2009
posts:74
votes: 0


Hmm, I see, I see. Yeah, it takes a long time, doesn't it :P I'll spend like a day on like 50 lines of code, but learning is all about getting stuck in and coding properly.

Well, I guess I'm on my own from here, I'll do more reading and experimenting and see if I get lucky.

I appreciate the help :)
10:12 am on July 6, 2011 (gmt 0)

Moderator from GB 

WebmasterWorld Administrator brotherhood_of_lan is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 30, 2002
posts:4842
votes: 1


>I guess I'm on my own from here

Not at all! You have a working version so far, let us know if you have any more questions.
10:33 pm on July 6, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


>I'll spend like a day on like 50 lines of code.

Yes. That happens to me on a regular basis too.
 
