
Forum Moderators: coopster & jatar k


regex help

     
3:52 am on Jul 23, 2013 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 20, 2006
posts: 109
votes: 0


hello.. can anybody help me?

using php regex, for example:

FROM

<br><b> <i>LoA - </b>only applies to housing, work camps<br>



TO

<br><div class="xxx"><b> <i>LoA - </b>only applies to housing, work camps</div><br>



anybody?
4:56 am on July 23, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:July 19, 2013
posts:1097
votes: 0


How to do it depends on the structure of the entire string. It would help me more if there's a longer example that showed the <p>s and things along those lines, but if you want to change any <br><b> to have a div with a class in the middle, then from what you've posted I'd do it in 2 parts.

First, I'd:
$the_string = str_replace('<br><b>','<br><div class="the_class"><b>',$the_string);

Second, I'd:
$the_string = preg_replace('#(<br><div class="the_class"><b>.*?)(<br>)#m',"$1</div>$2",$the_string);

Or something along those lines.

* Note: My regexes look a bit odd to most people because I use a # delimiter, but one day when I was matching URLs and got tired of escaping every bleeping / I switched and it's really made things easier since there aren't #s used anywhere near as often as /s in the expressions I normally write.
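
Here's a minimal sketch of the two steps run together on the sample from the first post. The s modifier (so that . can cross line breaks) is my assumption, not something the original string is known to need, and the_class is just a placeholder name.

```php
<?php
// Sample input taken from the first post in this thread.
$the_string = '<br><b> <i>LoA - </b>only applies to housing, work camps<br>';

// Step 1: open a div in front of every <b> that directly follows a <br>.
$the_string = str_replace('<br><b>', '<br><div class="the_class"><b>', $the_string);

// Step 2: close the div just before the next <br>.
// The s modifier is an assumption: it lets . match newlines as well.
$the_string = preg_replace(
    '#(<br><div class="the_class"><b>.*?)(<br>)#s',
    '$1</div>$2',
    $the_string
);

echo $the_string, "\n";
```

Both functions return the new string rather than changing $the_string in place, so the result has to be assigned back each time.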

[edited by: phranque at 5:10 am (utc) on Jul 23, 2013]
[edit reason] disabled graphic smileys [/edit]

5:23 am on July 23, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13062
votes: 305


Now, about the sequence of tags...

<b> <i>text </b> more-text-here


;)
8:35 pm on July 23, 2013 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 20, 2006
posts: 109
votes: 0


It will already be broken down by \n, so you don’t need to worry about that. What I need is for it to find every occurrence of “LoA” and then wrap it in a div that runs until it gets to a <br>.

Basically /<b>LoA<\/b>.*<br>/ is an example, however he’s putting <b>, <u> and all sorts of other stuff in there and the format isn’t consistent.

Example:

<b> 5. G3.3.2 - </b> Do spaces used for food preparation and utensil washing have: <br> <b>a) </b>interior linings and work surfaces that are impervious and easily cleaned? <br> <b>(b) </b>all building elements constructed with materials which are free from hazardous substances which could cause contamination to the building contents? <br><b> LoA </b>- only applies to housing, work camps, old people's homes and early childhood centres and where appropriate Commercial and Industrial buildings whose intended use includes the manufacture, preparation, packaging or storage of food. <br> <b>(c) </b> exposed building elements located & shaped to avoid accumulation of dirt? <br><b> <i>LoA </b> only applies to housing</i>

Becomes:

<b> 5. G3.3.2 - </b> Do spaces used for food preparation and utensil washing have: <br> <b>a) </b>interior linings and work surfaces that are impervious and easily cleaned? <br> <b>(b) </b>all building elements constructed with materials which are free from hazardous substances which could cause contamination to the building contents? <br><div class="xxxxxx"><b> LoA </b>- only applies to housing, work camps, old people's homes and early childhood centres and where appropriate Commercial and Industrial buildings whose intended use includes the manufacture, preparation, packaging or storage of food.</div><br> <b>(c) </b> exposed building elements located & shaped to avoid accumulation of dirt? <br><div class="xxxxxx"><b> <i>LoA </b> only applies to housing</i></div>
9:13 pm on July 23, 2013 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 20, 2006
posts: 109
votes: 0


i think i got it...


function fixLoA($content) {
    // Treat every <br> and <br /> as a line break, then work line by line.
    $lines = explode("\n", str_replace('<br />', "\n", str_replace('<br>', "\n", $content)));
    for ($i = 0; $i < count($lines); $i++) {
        $lines[$i] = str_replace('? ', "? \n", $lines[$i]);
        $lines[$i] = str_replace('. ', ". \n", $lines[$i]);
        $lines[$i] = preg_replace_callback(
            //"#\<b\>(s+|.*LoA.*|s+)(\n|<br>)#s",
            //"/^<b>(?=.*)(.*LoA.*)(?=.*)<br>$/",
            '/[<b>|<b>|<b><i>].*LoA.*<\/b>.*(<br>|<br \/>|<\/i>|\n)/',
            // Anonymous function in place of create_function(), which was removed in PHP 8.
            function ($matches) {
                return '<div class="xxx" style="color: RED">' . $matches[0] . '</div>';
            },
            $lines[$i]
        );
    }
    return implode("\n", $lines);
}



anyone can improve my method? please help.. thanks..
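
One possible simplification, as a rough sketch only: split the content on the <br> tags, keep the tags, and wrap any piece that contains LoA. The function name wrap_loa_segments and the $class parameter are made up for illustration, and it assumes every LoA item really does end at a <br> (or at the end of the string).

```php
<?php
// Hypothetical helper: wraps each <br>-delimited segment that mentions LoA.
function wrap_loa_segments($content, $class = 'xxx') {
    // Split on <br> or <br />, keeping the delimiters in the result.
    // With one capturing group, even indexes are text and odd indexes are the <br> tags.
    $parts = preg_split('#(<br\s*/?>)#i', $content, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0 && preg_match('#\bLoA\b#', $part)) {
            $parts[$i] = '<div class="' . $class . '">' . $part . '</div>';
        }
    }

    // Put the pieces back together exactly as they were, tags included.
    return implode('', $parts);
}

echo wrap_loa_segments('a<br><b> LoA </b>- x. <br> <b>(c) </b> y <br><b> <i>LoA </b> z</i>'), "\n";
```

This leaves the tags inside each segment alone, so it does not care whether the LoA sits in <b>, <u>, <i> or some mix of them.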
9:20 pm on July 23, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:July 19, 2013
posts:1097
votes: 0


So it sounds like we need to find <br> followed by anything or nothing (space, <u>, etc.) followed by <b>, followed by anything or nothing (space, <u>, etc.), followed by exactly LoA, followed by anything, followed by <br>...

$the_string = preg_replace('#(<br>.*)(<b>.*LoA.*)(<br>)#U',"$1<div class=\"the_class\">$2</div>$3",$the_string);

* Note: I added the U modifier, which makes every quantifier such as .* "ungreedy" by default. With U set, writing .*? makes that quantifier "greedy" again, the reverse of the normal behavior.

** Added Note: This would probably be a good place to use a "look-ahead" to find the LoA or "break and move on" for efficiency, but didn't write that in.
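
A small sketch of what the U modifier does to .* and .*?, using a made-up two-tag string rather than the real data:

```php
<?php
$html = '<b>one</b><b>two</b>';

// Default: .* is greedy and runs to the last </b>.
preg_match('#<b>.*</b>#', $html, $greedy);

// Default with a trailing ?: that one quantifier becomes lazy.
preg_match('#<b>.*?</b>#', $html, $lazy);

// With U the defaults swap, so a plain .* is lazy ...
preg_match('#<b>.*</b>#U', $html, $ungreedy);

// ... and .*? is greedy again.
preg_match('#<b>.*?</b>#U', $html, $flipped);

echo $greedy[0], "\n";   // <b>one</b><b>two</b>
echo $lazy[0], "\n";     // <b>one</b>
echo $ungreedy[0], "\n"; // <b>one</b>
echo $flipped[0], "\n";  // <b>one</b><b>two</b>
```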
10:42 pm on July 23, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13062
votes: 305


[<b>|<b>|<b><i>]

Is that a typo for

[<b>|<i>|<b><i>]
?

Seems like what you'd want is something based on

LoA([^<]*(</?\w>[^<]*)+)<br>

Will the source text ever contain anything beyond simple inline markup like <b>? Either a multi-letter tag like <em> or <wbr>, or something still more complex like <span blahblah or <a blahblah containing non-word characters.

It is extremely inconvenient that <b> and <br> start with the same letter, so you can't simply say </?[^b][^>]* and be done with it. It would have to be
</?(?:b|[^b][^>]*)>
where I had
</?\w>
above.
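
A quick way to see which tags that longer pattern lets through, as a sketch: the ^ and $ anchors and the sample tags are my additions so that each string is tested as one whole tag.

```php
<?php
// The tag pattern from above, anchored so each sample must be a single complete tag.
$inline_tag = '#^</?(?:b|[^b][^>]*)>$#';

$samples = ['<b>', '</b>', '<i>', '<span class="x">', '<br>', '<br />', '<blockquote>', '</blockquote>'];

foreach ($samples as $tag) {
    printf("%-18s %s\n", $tag, preg_match($inline_tag, $tag) ? 'matches' : 'no match');
}
```

<b>, </b>, <i> and the <span> match, while <br>, <br /> and <blockquote> do not. Note that </blockquote> does match: the optional slash can be skipped, and then [^b] picks up the / itself.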
11:28 pm on July 23, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:July 19, 2013
posts:1097
votes: 0


<?php

$the_string="<b> 5. G3.3.2 - </b> Do spaces used for food preparation and utensil washing have: <br> <b>a) </b>interior linings and work surfaces that are impervious and easily cleaned? <br> <b>(b) </b>all building elements constructed with materials which are free from hazardous substances which could cause contamination to the building contents? <br><b> LoA </b>- only applies to housing, work camps, old people's homes and early childhood centres and where appropriate Commercial and Industrial buildings whose intended use includes the manufacture, preparation, packaging or storage of food. <br> <b>(c) </b> exposed building elements located & shaped to avoid accumulation of dirt? <br><b> <i>LoA </b> only applies to housing</i><br>";

$the_string=preg_replace('#(<br>.*?)(<b>\s*(<(i|u)>)?\s*\bLoA\b.*?)(<br>)#',"$1<div class=\"the_class\">$2</div>$5",$the_string);

echo $the_string;

?>

* Obviously a "non-capturing" grouping of the <i> or <u> would be a bit more efficient.
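
For what it's worth, here is a sketch of the same expression with the <i>/<u> part made non-capturing, so the closing <br> becomes $3 instead of $5. The wrapper function wrap_loa and the shortened test string are only there for illustration.

```php
<?php
// Hypothetical wrapper around the expression above, with (?:<[iu]>)? instead of (<(i|u)>)?.
function wrap_loa($html) {
    return preg_replace(
        '#(<br>.*?)(<b>\s*(?:<[iu]>)?\s*\bLoA\b.*?)(<br>)#',
        '$1<div class="the_class">$2</div>$3',
        $html
    );
}

// Shortened version of the example string from this thread.
$the_string = '<br> <b>(b) </b>text <br><b> LoA </b>- only applies <br> '
            . '<b>(c) </b> dirt? <br><b> <i>LoA </b> housing</i><br>';

echo wrap_loa($the_string), "\n";
```

One thing to watch: each match uses up its closing <br>, so two LoA items directly after each other would share a <br> and the second one would probably be skipped.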

[edited by: phranque at 11:34 pm (utc) on Jul 23, 2013]
1:16 am on July 24, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13062
votes: 305


(i|u)

Is this more efficient than
[iu]
?
1:31 am on July 24, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:July 19, 2013
posts:1097
votes: 0


That's a good question, and one I've had for years. The answer is: I really don't know. I've asked experts, and no one I've asked has had an answer either.

Fortunately, I'm guessing since RAM has become so cheap and processing power so much better over the last few years we're likely in "6 of one, half dozen of the other" territory as far as "speed impact" of either is concerned. I actually switch back and forth, even in the middle of a single expression sometimes, because I just use whichever pops into my head first as "will work here".

If I had to guess it would be yours by a "blip", because [ui] is fewer characters than (?:u|i), and (u|i) "stores the match for back-reference", so "not storing + fewer characters" should have a slight advantage.
7:22 pm on July 28, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:July 19, 2013
posts:1097
votes: 0


Found the answer:
Certain items that may appear in patterns are more efficient than others. It is more efficient to use a character class like [aeiou] than a set of alternatives such as (a|e|i|o|u).

Haven't read that page in years, so I don't know when it got updated to include the preceding, but it's right there in #000 and #FFF.
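
If anyone wants to check it on their own server, a rough timing sketch might look like this. The subject string, repeat counts and labels are arbitrary choices of mine, and the numbers will vary with the PHP version and whether the PCRE JIT is on, so treat any difference as a hint rather than a measurement.

```php
<?php
// Arbitrary test subject: one <u> tag per repetition.
$subject = str_repeat('<b> <u>LoA </b> text <br>', 2000);

$patterns = [
    'character class' => '#<[iu]>#',
    'alternation'     => '#<(i|u)>#',
    'non-capturing'   => '#<(?:i|u)>#',
];

$counts = [];
foreach ($patterns as $label => $pattern) {
    $start = microtime(true);
    for ($n = 0; $n < 200; $n++) {
        // preg_match_all() returns the number of full matches found.
        $counts[$label] = preg_match_all($pattern, $subject);
    }
    $elapsed = microtime(true) - $start;
    printf("%-16s %d matches per pass, %.4f s\n", $label, $counts[$label], $elapsed);
}
```

All three patterns find the same matches; only the time taken can differ.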

[php.net...]