Preg_match syntax help

Matching within html tags

         

pixel_juice

11:50 pm on Nov 12, 2005 (gmt 0)

10+ Year Member



I'm trying to match data within html tags, which I've got working fine (as long as the tags are very basic) with the code below:

function find_in_tags ($tag, $needle, $haystack) {
if (preg_match( "/<$tag>(.*$needle.*)<\/$tag>/si", $haystack, $needle )) {echo "found";
} else {echo "Not found";}
}

The problem I've encountered is that some tags will have additional info in them, such as <body bgcolor="000000"> etc.

I think I can get around this by matching everything but the > symbol, with something like


preg_match( "/<$tag(.*^>)>

but I can't seem to get the correct syntax. Can anyone point me in the right direction?

jd01

1:35 am on Nov 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Something like this should do it:

if (preg_match( "/<$tag[^>]*>(.*$needle.*)<\/$tag>/si", $haystack, $needle ))

[^>]* matches zero or more characters that are not a >, so it consumes any attributes; the > that follows it then matches the end of the opening tag.

Just make sure you have all your tags closed...
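To see what the [^>]* part does on its own, here is a minimal sketch with the tag name hard-coded as "body" (an assumption made only to keep the example short; the sample strings are made up too):

```php
<?php
// Literal pattern for illustration; the tag name "body" is hard-coded here.
$pattern = '/<body[^>]*>(.*)<\/body>/si';

$samples = [
    '<body>plain</body>',
    '<body bgcolor="000000">with attribute</body>',
];

foreach ($samples as $html) {
    if (preg_match($pattern, $html, $matches)) {
        // $matches[1] holds whatever sat between the opening and closing tag.
        echo $matches[1], "\n";
    }
}
```

Both sample strings match, because [^>]* is free to match nothing at all or the whole bgcolor attribute.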

Justin

pixel_juice

1:48 am on Nov 13, 2005 (gmt 0)

10+ Year Member



Unfortunately, that syntax gives me an error:

parse error, unexpected T_ENCAPSED_AND_WHITESPACE, expecting T_STRING or T_VARIABLE or T_NUM_STRING

I've had a look around to try and understand this without much luck - any ideas?

ergophobe

6:12 pm on Nov 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Try

1. using single quotes

or

2. escaping the quotes in your string

$tag = addslashes($tag);
if (preg_match( "/<$tag[^>]*>(.*$needle.*)<\/$tag>/si", $haystack, $needle ))
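For what it's worth, the parse error reported above most likely comes from PHP's string interpolation rather than from the regex itself: inside double quotes, $tag[ is read as the start of an array index, and ^ is not a valid index. A hedged sketch of two ways around that (the sample values are assumptions for illustration):

```php
<?php
$tag = 'p';
$needle = 'this';
$haystack = "<p class='someclass'>this</p>";

// Option A: curly braces end the variable name before the "[".
$patternA = "/<{$tag}[^>]*>(.*{$needle}.*)<\/{$tag}>/si";

// Option B: single-quoted pieces joined by concatenation.
$patternB = '/<' . $tag . '[^>]*>(.*' . $needle . '.*)<\/' . $tag . '>/si';

// Both options build the same pattern string.
var_dump($patternA === $patternB);
var_dump(preg_match($patternA, $haystack));
```

Either form should work; the braces keep the pattern readable, while concatenation avoids any doubt about what PHP will interpolate.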

pixel_juice

7:04 pm on Nov 13, 2005 (gmt 0)

10+ Year Member



Ah thanks - single quotes fixed the error. Unfortunately the function no longer seems to find anything.

I changed [^>]*>( to [^>].*? in order to try and match <p>keyword</p> and also <p class='someclass'>keyword</p>. But this change doesn't seem to help.


function find_in_tags ($tag, $needle, $haystack) {
if (preg_match( '/<$tag[^>].*?>(.*$needle.*)<\/$tag>/si', $haystack, $needle )) {echo "found";
} else {echo "not found";}
}
find_in_tags ("p", "this", "<p class='someclass'>this</p>");

Grrr @ regex ;)

pixel_juice

11:04 pm on Nov 13, 2005 (gmt 0)

10+ Year Member



My problem seemed to be that the variables in my preg_match line are not evaluated inside single quotes, only inside double quotes (which then prevent the more advanced pattern from parsing). I need to close the string and concatenate the variable, as in '.$tag.', for it to work. My final code is below in case it proves useful to anyone in future:


function find_in_tags ($tag, $needle, $haystack) {
if (preg_match( '/<'.$tag.'[^>]*>(.*'.$needle.'.*)<\/'.$tag.'>/si', $haystack, $needle )) {echo "found";} else {echo "not found";}
}
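A slightly more defensive variant of the same idea, as a sketch only: find_in_tags_safe is a made-up name, and the changes (preg_quote on both inputs, a separate $matches variable, a boolean return instead of echo) are my assumptions about what would be useful, not part of the code above.

```php
<?php
// find_in_tags_safe() is a hypothetical name, not from the thread.
function find_in_tags_safe(string $tag, string $needle, string $haystack): bool
{
    // preg_quote() escapes regex metacharacters (and the / delimiter) in the inputs.
    $t = preg_quote($tag, '/');
    $n = preg_quote($needle, '/');
    $pattern = '/<' . $t . '[^>]*>(.*' . $n . '.*)<\/' . $t . '>/si';
    // A separate $matches variable avoids overwriting $needle.
    return preg_match($pattern, $haystack, $matches) === 1;
}

echo find_in_tags_safe('p', 'this', "<p class='someclass'>this</p>") ? "found" : "not found", "\n";
echo find_in_tags_safe('p', 'a.b', '<p>axb</p>') ? "found" : "not found", "\n";
```

Without preg_quote, a needle such as a.b would be treated as a pattern and would also match axb.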

Many thanks for the help guys!

pixel_juice

11:07 pm on Nov 21, 2005 (gmt 0)

10+ Year Member



I'm stuck again on syntax for this function unfortunately!

As is, the function will partially match words, which I don't want. I've tried \b for word boundaries and also not matching letters or numbers, but neither seems to work properly.

Can anyone help me with the syntax for matching 0-N characters that are not letters or numbers (or closing tag symbols)? [^a-zA-Z0-9].* doesn't seem to do it.

i.e. to match <$tag>$needle</$tag> but also <$tag>bleh $needle,</$tag> etc.

jd01

11:14 am on Nov 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It looks like you have a mistake in your syntax.

[^a-zA-Z0-9].* = exactly one character that is not a-zA-Z0-9, followed by 'any character that is not the end of a line' 0 or more times.

The problem is the misuse of the . (dot) meta character with a 0 to N quantifier...

Quantifiers work on the immediately preceding character or group of characters, so the correct syntax you asked for would be:
[^a-zA-Z0-9]*
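A small sketch of the difference between the two forms, anchored with ^ and $ so the whole string has to fit (the test strings are made up for illustration):

```php
<?php
$dotStar   = '/^[^a-zA-Z0-9].*$/';  // one non-alphanumeric, then anything
$classStar = '/^[^a-zA-Z0-9]*$/';   // zero or more non-alphanumerics, nothing else

foreach ([', soup', ', ', ''] as $s) {
    // preg_match() returns 1 for a match and 0 for no match.
    printf(
        "%-8s dotStar=%d classStar=%d\n",
        var_export($s, true),
        preg_match($dotStar, $s),
        preg_match($classStar, $s)
    );
}
```

The first form accepts ', soup' because the dot happily consumes the letters; the second rejects it and, unlike the first, accepts the empty string.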

If you continue to have issues, please post some real examples of what you are matching and trying to accomplish -- some of us are *very* visual and if we cannot see the pattern it is tough for us to help you.

EG from above:
<$tag>bleh $needle,</$tag>

to me 'bleh' looks like letters or numbers, so I am not sure why you are asking for the expression you are, and I cannot offer any real advice on efficiency, or what might be missing.

Justin

pixel_juice

11:37 am on Nov 22, 2005 (gmt 0)

10+ Year Member



Apologies if I'm not being particularly clear.

I want to match the occurrence of a particular word (or number of words) within identified html tags.

So if $needle is 'chicken soup' (without the quotes) and the $tag is p (<p>), I want the match to be true for <p>chicken soup</p> and <p>i like chicken soup for tea</p>

The script above achieves this; however, it will also be true for <p> notchicken soup</p> and <p>notchicken souple</p>. I figured the easiest way to get around this was to check for the existence of a character that was not a letter or number before and after the string. But I also need to continue matching <p>chicken soup</p>. So basically, if there's a character at the beginning or end of the string, as long as that's not a letter or number that's OK. Although perhaps my approach is flawed ;)

jd01

10:52 pm on Nov 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



$haystack="<title>Untitled Document</title></head><body></body></html><title>Untitled Document</title></head><body></body></html><p class=\"test\">Untitled Document</p><title>Untitled Document</title></head><body></body></html><title>Untitled Document</title></head><body></body></html>";

$tag="p";
$needle="Document";
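// preg_match() returns 1 on a match, so test its return value rather than $result[1],
// which is undefined (and raises a notice) when nothing matched.
if (preg_match("/<".$tag."[^>]*>[^\b]*(\b".$needle."\b)[^<]*<\/".$tag.">/im", $haystack, $result)) {
echo "Success! ".$result[1];
}
else {
echo "No Match";
}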

I'll let you play with the efficiency. The m (multi-line) modifier only changes how ^ and $ behave, and this pattern uses neither, so you can delete the m after the closing slash without changing the result.
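One detail worth knowing: inside a character class, \b means the backspace character rather than a word boundary, so [^\b]* simply matches almost anything. A hedged sketch of a whole-word check that relies only on \b outside a class and on [^<]* to stay within one tag's text (has_whole_word is a made-up name):

```php
<?php
// has_whole_word() is a hypothetical helper, not from the thread.
function has_whole_word(string $tag, string $needle, string $haystack): bool
{
    $t = preg_quote($tag, '/');
    $n = preg_quote($needle, '/');
    // \b after the tag name stops "p" from also matching "<pre>".
    // [^<]* keeps the match inside one tag's text; \b marks the word boundaries.
    $pattern = '/<' . $t . '\b[^>]*>[^<]*\b' . $n . '\b[^<]*<\/' . $t . '>/i';
    return preg_match($pattern, $haystack) === 1;
}

var_dump(has_whole_word('p', 'chicken soup', '<p>i like chicken soup for tea</p>'));
var_dump(has_whole_word('p', 'chicken soup', '<p>notchicken souple</p>'));
```

This assumes the needle starts and ends with a letter, digit or underscore; \b behaves differently next to punctuation.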

Justin

pixel_juice

11:24 pm on Nov 23, 2005 (gmt 0)

10+ Year Member



Many thanks Justin :)

I hate to be posting here again, but while this new code does exactly what I asked for (!), it's thrown up a separate problem - keywords in nested tags can no longer be found, so while <p class="test">Untitled Document</p> finds 'untitled document', <p class="test"><span>Untitled Document</span></p> or <p class="test"><span>Untitled</span> Document</p> does not.

I've tried some different approaches and none seem to get the desired result. Any help?

(If you're a drinking man, a pint is yours!)

jd01

7:10 pm on Nov 26, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



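My guess is that it is not matching because of the closing tag: if $tag carries attributes, the pattern ends up looking for something like </p class="end">, which will never match </p>, so you will need to extract the bare tag name from the original for the closing tag.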

$tag="p";
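// Keep only the leading tag name, dropping any attributes.
// (\w+ is used because [^\b] inside a character class means "not a backspace", not a word boundary.)
$close=preg_match("/^(\w+)/", $tag, $closing_tag);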

$needle="Document";
preg_match("/<".$tag."[^>]*>[^\b]*(\b".$needle."\b)[^<]*<\/".$closing_tag[1].">/im", $haystack, $result);

Hope this helps.
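Another angle on the nested-tag case, offered as a sketch rather than a tested fix: the [^<]* parts of the pattern refuse to cross any inner tag such as <span>, so one option is to capture the element's contents first, strip the inner markup with strip_tags(), and only then look for the whole word. The function name find_in_nested_tags is an assumption made for this example.

```php
<?php
// find_in_nested_tags() is a hypothetical helper, not from the thread.
function find_in_nested_tags(string $tag, string $needle, string $haystack): bool
{
    $t = preg_quote($tag, '/');
    $n = preg_quote($needle, '/');
    // Lazy .*? stops at the first closing tag, so each element is checked separately.
    $outer = '/<' . $t . '\b[^>]*>(.*?)<\/' . $t . '>/is';
    if (!preg_match_all($outer, $haystack, $matches)) {
        return false;
    }
    foreach ($matches[1] as $inner) {
        // strip_tags() drops <span> and similar markup, leaving only the text.
        $text = strip_tags($inner);
        if (preg_match('/\b' . $n . '\b/i', $text)) {
            return true;
        }
    }
    return false;
}

var_dump(find_in_nested_tags('p', 'Untitled Document', '<p class="test"><span>Untitled</span> Document</p>'));
```

This still breaks when the same tag is nested inside itself (a div within a div, say); for that kind of markup a real HTML parser such as PHP's DOM extension is probably the safer route.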

Justin