
PHP Server Side Scripting Forum

    
Regex to find URLs in <a> tags
Marked




msg:4334409
 1:11 pm on Jul 3, 2011 (gmt 0)

Hi all,

I spent like a day working on this. Basically, what I am after is a regex to find all the href URLs in <a> tags and, if a URL contains a certain part, replace that part. It definitely has to pick up both single and double quotes.

To illustrate, take the following example:
<a id='user_link' class='' href="http://mysite.com/forums_real_path/index.php?showuser=1" title='Your Profile'>Username &nbsp;<span id='user_link_dd'></span></a>
I want to use regex to change 'forums_real_path' to just 'forums'. But it must be only in <a></a> tags.

Here's the code I was using, and it was working, but for some reason it didn't work on a bunch of new links:
$pattern='/<\s*A\s*HREF=(\'|")(.*?)forums_real_path(.*?)(\'|")\s*>(.*?)<\/A>/i';
$replacement='<a href=$1$2forums$3$4>$5</a>';
$final_string=preg_replace($pattern,$replacement,$string);


It's kinda messy :/

If anyone could write a much better pattern for me, I'd be very grateful :)

 

brotherhood of LAN




msg:4334423
 2:50 pm on Jul 3, 2011 (gmt 0)

Using regex to match HTML can get messy. Using PHP's DOM [php.net] functions might be easier. I found them a bit awkward to get to grips with but they bypass a lot of hassle in trying to parse documents.

<?php

$dom = new DOMDocument;
$dom->loadHTML($htmlstring);
 
// echo Links and their anchor text
echo '<pre>';
echo "Link\tAnchor\n";
foreach ($dom->getElementsByTagName('a') as $link)
{
    $href = $link->getAttribute('href');
    $anchor = $link->nodeValue;
    echo $href, "\t", $anchor, "\n";
    // Do something here
}
echo '</pre>';

?>

lucy24




msg:4334571
 8:50 pm on Jul 3, 2011 (gmt 0)

If you posted this bit
(.*?)
next door in the Apache forums, you would immediately be read the riot act (not necessarily by me), because it means "continue capturing until further notice, or until you meet a close quote, whichever comes first", so the computer is essentially racing through your file while holding its breath. Figure of speech. It will probably run "cleaner" if you express it as ([^'"]*) in the first place. And cleaner still if you've got a finite number of patterns to search for. You probably do; few things are genuinely random. Even in a URL.
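
As a rough illustration of the negated-class idea applied to the original question, here is a minimal sketch. The function name replace_in_anchor_hrefs is made up for this example, and the pattern assumes no attribute value contains a ">" character. It allows other attributes before href and accepts either quote style.

```php
<?php
// Hypothetical helper (not from the thread): replaces $search with $replace,
// but only inside the href value of <a> tags.
function replace_in_anchor_hrefs(string $html, string $search, string $replace): string
{
    // Group 1: "<a", any other attributes, then "href=" (assumes no ">" inside attributes).
    // Group 2: the opening quote, single or double.
    // Group 3: the URL, i.e. everything up to the next quote or ">".
    // \2 requires the closing quote to be the same character as the opening one.
    $pattern = '/(<a\b[^>]*?\shref\s*=\s*)([\'"])([^\'">]*)\2/i';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($search, $replace): string {
            return $m[1] . $m[2] . str_replace($search, $replace, $m[3]) . $m[2];
        },
        $html
    );

    // preg_replace_callback returns null on a regex engine error; fall back to the input.
    return $result ?? $html;
}

$string = "<a id='user_link' class='' href=\"http://mysite.com/forums_real_path/index.php?showuser=1\" title='Your Profile'>Username</a>";
$final_string = replace_in_anchor_hrefs($string, 'forums_real_path', 'forums');
echo $final_string, "\n";
```

Only the quoted href value is touched, so the same text appearing in the link's anchor text or elsewhere on the page is left alone.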

Marked




msg:4334771
 12:45 pm on Jul 4, 2011 (gmt 0)

@brotherhood of LAN: Ah, I just recently came across that (PHP's DOMDocument) the other day in one of my many Google searches on this very matter. Is it possible to get an example of actually manipulating the HTML? On the surface it seems to be for getting rather than changing?

@lucy24: Haha yeah, I'm kinda very bad at regex, as you have probably concluded by now, having posted in 2 of my topics. Regex just frustrates me... but noted on the (.*?), I shall not use it again lol. It was just an easy fix.

g1smd




msg:4334778
 1:04 pm on Jul 4, 2011 (gmt 0)

Handy tips:

Use "exactly"
(.*) ONLY when the very next thing is a $ "end" anchor, or when it is the ONLY thing in the pattern. Never use (.*) at the start or in the middle of a pattern.

Use "exactly"
.* if it is the ONLY thing in the RegEx pattern and the value is NOT being captured for re-use. Never use .* at the start or in the middle of a pattern.

The (.*?) pattern is less greedy but can still be problematical.

[edited by: g1smd at 1:21 pm (utc) on Jul 4, 2011]
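
To make the difference between these forms concrete, here is a small sketch comparing a greedy capture, a lazy capture, and a negated character class on the same input. The sample HTML string is invented for the illustration.

```php
<?php
// Sample input made up for this comparison: two links on one line.
$html = '<a href="one.html">One</a> <a href="two.html">Two</a>';

// Greedy: (.*) runs to the end of the string, then backtracks to the LAST quote.
preg_match('/href="(.*)"/', $html, $greedy);

// Lazy: (.*?) stops at the first closing quote, but gets there by
// re-testing the rest of the pattern after every single character.
preg_match('/href="(.*?)"/', $html, $lazy);

// Negated class: ([^"]*) can never cross a quote, so it stops there directly.
preg_match('/href="([^"]*)"/', $html, $negated);

echo $greedy[1], "\n";   // one.html">One</a> <a href="two.html
echo $lazy[1], "\n";     // one.html
echo $negated[1], "\n";  // one.html
```

The lazy and negated-class versions capture the same text here; the negated class simply states the stopping condition up front instead of leaving the engine to discover it.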

brotherhood of LAN




msg:4334779
 1:04 pm on Jul 4, 2011 (gmt 0)

Marked, you can manipulate the HTML as well. Hopefully this example is clear enough. If "getattribute" is matched in the link's href, it's going to switch the link. You can use the first example to echo out the new values; this example will show you the original document with the altered HTML nodes.

<?php

$htmlstring = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a href="http://www.php.net/manual/en/domelement.getattribute.php">getattribute()</a>
</body></html>';

$dom = new DOMDocument;
$dom->loadHTML($htmlstring);
 
foreach ($dom->getElementsByTagName('a') as $link)
{
    $href = $link->getAttribute('href');
    if (preg_match("/getattribute/", $href))
    {
        $link->setAttribute('href', 'http://www.php.net/manual/en/domelement.setattribute.php'); // Change href attribute
        $link->nodeValue = 'setAttribute()'; // Change 'inner HTML'
    }
}

echo $dom->saveHTML();
?>

There are some good examples lurking about online. As I say, I found the manual a touch confusing, but otherwise the functions are extremely useful.
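
Applied to the original question, the same approach might look like the sketch below. The sample HTML is invented, and it assumes the DOM extension is available; only the href of each <a> element is rewritten, so the same text elsewhere in the page is untouched.

```php
<?php
// Sample document made up for this sketch: the path appears in a paragraph and in a link.
$html = '<html><body><p>See forums_real_path for details.</p>'
      . '<a id="user_link" href="http://mysite.com/forums_real_path/index.php?showuser=1">Username</a>'
      . '</body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    // A plain substring check is enough here; no regex needed.
    if (strpos($href, 'forums_real_path') !== false) {
        $link->setAttribute('href', str_replace('forums_real_path', 'forums', $href));
    }
}

$result = $dom->saveHTML();
echo $result;
```

Because the DOM functions hand back the attribute value directly, the quote style used in the source markup stops mattering.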

Marked




msg:4334985
 7:38 am on Jul 5, 2011 (gmt 0)

@brotherhood of LAN: that is super cool :D I got it working, except that it messes with quoted HTML elements inside JavaScript.

For example, the HTML has the following in it (this is the ORIGINAL HTML):
<script type="text/javascript">
var FAVE_TEMPLATE = new Template( "<h3>Unfollow this forum</h3><div class='ipsPad'><span class='desc'>If you unfollow this forum this you will no longer receive any notifications</span><br /><p class='ipsForm_center'><input type='button' value='Unfollow this forum' class='input_submit _funset' /></p></div>");
</script>


And this is transformed by $dom->saveHTML();, even without any changes to the HTML, into the following:
<script type="text/javascript">
var FAVE_TEMPLATE = new Template( "<h3>Unfollow this forum<div class='ipsPad'><span class='desc'>If you unfollow this forum this you will no longer receive any notifications<br /><p class='ipsForm_center'><input type='button' value='Unfollow this forum' class='input_submit _funset' /></script></div>");


If that's a bit long, just look at the end of both of these: you can see the closing div is moved outside of the script tags.

Any idea what I can do about it?

brotherhood of LAN




msg:4335053
 9:51 am on Jul 5, 2011 (gmt 0)

I'm not sure why it's doing that. My version produces an altered document too, but with all closing tags removed. I tried looking around for feedback on it but no joy (took too long :o)). It likely has a lot to do with the doctype and default settings for the DOM functions.

Change $dom->loadHTML to $dom->loadXML, and do the same for the save function ($dom->saveXML); it should return the desired output.

There are a few decent comments in the manual but some trial and error can get you there too.
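
Note that loadXML expects well-formed XML, so it may reject real-world HTML. One alternative workaround, offered only as a sketch, is to lift the script blocks out before parsing and put them back afterwards, so the HTML parser never sees the markup inside the JavaScript strings. The helper names protect_scripts and restore_scripts and the placeholder format are invented here, and the pattern assumes no script contains the literal text </script> inside a string.

```php
<?php
// Hypothetical helpers: names and placeholder format are invented for this sketch.

// Swap every <script>...</script> block for an HTML comment and remember the original.
function protect_scripts(string $html): array
{
    $scripts = [];
    $safe = preg_replace_callback(
        '~<script\b[^>]*>.*?</script>~is',
        function (array $m) use (&$scripts): string {
            $key = '<!--SCRIPT_PLACEHOLDER_' . count($scripts) . '-->';
            $scripts[$key] = $m[0];
            return $key;
        },
        $html
    );
    return [$safe, $scripts];
}

// Put the original script blocks back in place of their placeholders.
function restore_scripts(string $html, array $scripts): string
{
    return str_replace(array_keys($scripts), array_values($scripts), $html);
}

$script = '<script type="text/javascript">var FAVE_TEMPLATE = "<h3>Unfollow</h3><div class=\'ipsPad\'></div>";</script>';
$html = '<html><body>' . $script . '<a href="/forums_real_path/index.php">Forum</a></body></html>';

[$safe, $scripts] = protect_scripts($html);

$dom = new DOMDocument;
$dom->loadHTML($safe);
foreach ($dom->getElementsByTagName('a') as $link) {
    $link->setAttribute('href', str_replace('forums_real_path', 'forums', $link->getAttribute('href')));
}

$result = restore_scripts($dom->saveHTML(), $scripts);
echo $result;
```

The lazy .*? is used here because the terminator is a multi-character tag, which a negated character class cannot express; each script block comes back byte for byte as it went in.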

Marked




msg:4335627
 8:48 am on Jul 6, 2011 (gmt 0)

Hmm, I see, I see. Yeah, it takes a long time, doesn't it :P I'll spend like a day on like 50 lines of code, but learning is all about getting stuck in and coding properly.

Well, I guess I'm on my own from here, I'll do more reading and experimenting and see if I get lucky.

I appreciate the help :)

brotherhood of LAN




msg:4335668
 10:12 am on Jul 6, 2011 (gmt 0)

>I guess I'm on my own from here

Not at all! You have a working version so far, let us know if you have any more questions.

g1smd




msg:4336058
 10:33 pm on Jul 6, 2011 (gmt 0)

>I'll spend like a day on like 50 lines of code.

Yes. That happens to me on a regular basis too.
