Preg_match syntax help

Matching within html tags

pixel_juice

11:50 pm on Nov 12, 2005 (gmt 0)

I'm trying to match data within html tags, which I've got working fine (as long as the tags are very basic) with the code below:

function find_in_tags ($tag, $needle, $haystack) {
if (preg_match( "/<$tag>(.*$needle.*)<\/$tag>/si", $haystack, $needle )) {echo "found";
} else {echo "Not found";}
}

The problem I've encountered is that some tags will have additional info in them, such as <body bgcolor="000000"> etc.

I think I can get around this by matching everything but the > symbol, with something like


preg_match( "/<$tag(.*^>)>

but I can't seem to get the correct syntax. Can anyone point me in the right direction?

jd01

1:35 am on Nov 13, 2005 (gmt 0)

Something like this should do it:

if (preg_match( "/<$tag[^>]*>(.*$needle.*)<\/$tag>/si", $haystack, $needle ))

[^>]* matches 0 to N characters that are not a >, and the > that follows it then matches the end of the opening tag.
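As a minimal sketch of the [^>]* idea on its own, the snippet below hard-codes the tag name so that no variables are involved. The helper name has_body_tag is hypothetical and only there for illustration.

```php
<?php
// Hypothetical helper: true when $html contains an opening <body> tag,
// with or without attributes.
function has_body_tag($html) {
    // [^>]* consumes any attribute text up to, but not including, the closing >
    return preg_match('/<body[^>]*>/i', $html) === 1;
}

var_dump(has_body_tag('<body>'));                  // plain tag
var_dump(has_body_tag('<body bgcolor="000000">')); // tag with an attribute
var_dump(has_body_tag('<p>no body here</p>'));     // no opening body tag at all
```

Note that this simple form would also accept a longer tag name that merely starts with "body", so it is a sketch of the character class rather than a strict tag matcher.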

Just make sure you have all your tags closed...

Justin

pixel_juice

1:48 am on Nov 13, 2005 (gmt 0)

Unfortunately, that syntax gives me an error:

parse error, unexpected T_ENCAPSED_AND_WHITESPACE, expecting T_STRING or T_VARIABLE or T_NUM_STRING

I've had a look around to try and understand this without much luck - any ideas?

ergophobe

6:12 pm on Nov 13, 2005 (gmt 0)

Try

1. using single quotes

2. escaping the quotes in your string

$tag = addslashes($tag);
if (preg_match( "/<$tag[^>]*>(.*$needle.*)<\/$tag>/si", $haystack, $needle ))

pixel_juice

7:04 pm on Nov 13, 2005 (gmt 0)

Ah thanks - single quotes fixed the error. Unfortunately the function no longer seems to find anything.

I changed [^>]*>( to [^>].*? in order to try to match the keyword both in a plain tag and in a tag with attributes. But this change doesn't seem to help.


function find_in_tags ($tag, $needle, $haystack) {
if (preg_match( '/<$tag[^>].*?>(.*$needle.*)<\/$tag>/si', $haystack, $needle ))  {echo "found";
} else {echo "not found";}
}
find_in_tags ("p", "this", "<p class='someclass'>this</p>");

Grrr @ regex ;)

pixel_juice

11:04 pm on Nov 13, 2005 (gmt 0)

My problem seemed to be that the variables in my preg_match line are not evaluated unless I use double quotes (which in turn prevents the more advanced pattern from working). I need to close the single-quoted string and concatenate each variable ('.$tag.') for it to work. My final code is below in case it proves useful to anyone in future:


function find_in_tags ($tag, $needle, $haystack) {
// $matches receives the captured text, so $needle is no longer overwritten
if (preg_match( '/<'.$tag.'[^>]*>(.*'.$needle.'.*)<\/'.$tag.'>/si', $haystack, $matches ))  {echo "found";} else {echo "not found";}
}
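For anyone wondering why the double-quoted version failed: inside double quotes PHP reads $tag[ as the start of an array index, and ^ is not a valid index, which is the likely source of the T_ENCAPSED_AND_WHITESPACE parse error. Single quotes avoid the error but do not expand variables at all. The sketch below shows two ways around both problems. The helper name build_tag_pattern is hypothetical, and the use of preg_quote() is an addition that the thread's code does not have.

```php
<?php
// Hypothetical helper that builds the thread's pattern in two equivalent ways.
function build_tag_pattern($tag, $needle) {
    // preg_quote() escapes regex metacharacters (and the / delimiter) in the inputs.
    $t = preg_quote($tag, '/');
    $n = preg_quote($needle, '/');

    // 1. Concatenation between single-quoted pieces, as in the final code above.
    $concatenated = '/<' . $t . '[^>]*>(.*' . $n . '.*)<\/' . $t . '>/si';

    // 2. Curly braces mark where the variable name ends, so the [ that follows
    //    is treated as literal text and not as an array index.
    $interpolated = "/<{$t}[^>]*>(.*{$n}.*)<\\/{$t}>/si";

    return array($concatenated, $interpolated);
}

list($a, $b) = build_tag_pattern('p', 'this');
var_dump($a === $b);                                       // both forms give the same string
var_dump(preg_match($a, "<p class='someclass'>this</p>")); // 1 when the pattern matches
```

Either form should work; concatenation is arguably easier to read when the pattern itself is full of brackets.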

Many thanks for the help guys!

pixel_juice

11:07 pm on Nov 21, 2005 (gmt 0)

I'm stuck again on syntax for this function unfortunately!

As is, the function will partially match words, which I don't want. I've tried /b for word endings and also not matching letters or numbers, but neither seems to work properly.

Can anyone help me with the syntax for matching 0-N characters that are not letters or numbers (or closing tag symbols)? [^a-zA-Z0-9].* doesn't seem to do it.

i.e. to match <$tag>$needle</$tag> but also <$tag>bleh $needle,</$tag> etc.

jd01

11:14 am on Nov 22, 2005 (gmt 0)

It looks like you have a mistake in your syntax.

[^a-zA-Z0-9].* = exactly one character that is not a-zA-Z0-9, followed by 'any character that is not the end of a line' 0 or more times.

The problem is the misuse of the .(dot) meta character with a 0 to N modifier...

Modifiers (quantifiers such as *) work only on the immediately preceding character, class, or group of characters, so the correct syntax you asked for would be:
[^a-zA-Z0-9]*
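A small sketch of the difference between the two forms, anchored with ^ and $ so the whole string has to fit the pattern. The test strings and variable names are made up for illustration.

```php
<?php
// The two patterns discussed above, anchored to the whole string.
$with_dot   = '/^[^a-zA-Z0-9].*$/'; // one non-alphanumeric, then anything at all
$class_only = '/^[^a-zA-Z0-9]*$/';  // zero or more non-alphanumerics and nothing else

foreach (array('', ', ', ',abc') as $s) {
    printf("%-8s with_dot=%d class_only=%d\n",
        var_export($s, true),
        preg_match($with_dot, $s),
        preg_match($class_only, $s));
}
```

The first pattern needs at least one character and then lets letters through via the dot; the second accepts an empty string but rejects any letter or digit.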

If you continue to have issues, please post some real examples of what you are matching and trying to accomplish -- some of us are *very* visual and if we cannot see the pattern it is tough for us to help you.

EG from above:
<$tag>bleh $needle,</$tag>

to me 'bleh' looks like letters or numbers, so I am not sure why you are asking for the expression you are, and I cannot offer any real advice on efficiency, or what might be missing.

Justin

pixel_juice

11:37 am on Nov 22, 2005 (gmt 0)

Apologies if I'm not being particularly clear.

I want to match the occurrence of a particular word (or number of words) within identified html tags.

So if $needle is 'chicken soup' (without the quotes) and the $tag is p (<p>), I want the match to be true for <p>chicken soup</p> and <p>i like chicken soup for tea</p>.

The script above achieves this; however, it will also be true for <p>notchicken soup</p> and <p>notchicken souple</p>. I figured the easiest way to get around this was to check for a character that is not a letter or number immediately before the start and after the end of the string. But I also need to continue matching <p>chicken soup</p>. So basically, if there's a character directly before or after the string, that's OK as long as it's not a letter or number. Although perhaps my approach is flawed ;)

jd01

10:52 pm on Nov 22, 2005 (gmt 0)

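$haystack="<title>Untitled Document</title></head><body></body></html><title>Untitled Document</title></head><body></body></html><p>Untitled Document</p><title>Untitled Document</title></head><body></body></html><title>Untitled Document</title></head><body></body></html>";

$tag="p";
$needle="Document";
// Outside a character class \b is a word boundary, which stops partial-word matches.
// Inside a class it means a backspace character, so [^\b]* matches almost anything.
preg_match("/<".$tag."[^>]*>[^\b]*(\b".$needle."\b)[^<]*<\/".$tag.">/im", $haystack, $result);
if(!empty($result[1])) { // avoids an undefined-offset notice when nothing matched
echo "Success! ".$result[1];
}
else {
echo "No Match";
}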

I'll let you play with the efficiency -- to remove the multi-line function, delete the m after the right slash. (The m modifier only changes how ^ and $ behave, and this pattern uses neither.)
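To see the word-boundary part in isolation, here is a minimal sketch using the 'chicken soup' examples from earlier in the thread. The helper name contains_whole_phrase is hypothetical, and the sketch assumes the needle starts and ends with a letter or digit, since \b only marks the edge between a word character and a non-word character.

```php
<?php
// Hypothetical helper: true when $needle occurs in $text as a whole word or phrase.
function contains_whole_phrase($needle, $text) {
    // \b matches between a word character (letter, digit, underscore)
    // and a non-word character or the edge of the string.
    $pattern = '/\b' . preg_quote($needle, '/') . '\b/i';
    return preg_match($pattern, $text) === 1;
}

$needle = 'chicken soup';
$samples = array(
    'chicken soup',
    'i like chicken soup, for tea',
    'notchicken soup',
    'chicken souple',
);
foreach ($samples as $text) {
    echo $text, ' => ', contains_whole_phrase($needle, $text) ? 'match' : 'no match', "\n";
}
```

The first two samples match and the last two do not, because in those the phrase runs straight into another letter.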

Justin

pixel_juice

11:24 pm on Nov 23, 2005 (gmt 0)

Many thanks Justin :)

I hate to be posting here again, but while this new code does exactly what I asked for (!), it's thrown up a separate problem - keywords in nested tags can no longer be found. So while a plain <p>Untitled Document</p> finds 'untitled document', the same text wrapped in another tag nested inside the <p> does not.

I've tried some different approaches and none seem to get the desired result. Any help?

(If you're a drinking man, a pint is yours!)

jd01

7:10 pm on Nov 26, 2005 (gmt 0)

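My guess is that it is not matching because of the closing tags: if $tag carries anything beyond the bare tag name, the closing tag built from it will not match the real one, so you will need to find the closing tag name from the original.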

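$tag="p";
// Capture only the leading tag name. Inside a character class \b means backspace,
// so [^\b]+ would capture all of $tag; \w+ stops at the first non-word character.
preg_match("/^(\w+)/", $tag, $closing_tag);

$needle="Document";
preg_match("/<".$tag."[^>]*>[^\b]*(\b".$needle."\b)[^<]*<\/".$closing_tag[1].">/im", $haystack, $result);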
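Another possible cause is the [^<]* after the needle: it cannot cross a nested closing tag that sits between the keyword and the outer closing tag. The sketch below is one way to allow for that, not a tested fix for the original site. The function name find_word_in_tag is hypothetical, it assumes $tag is a bare tag name such as p, and it does not handle a tag nested inside another tag of the same name.

```php
<?php
// Hypothetical variant of the pattern above that tolerates nested tags.
function find_word_in_tag($tag, $needle, $haystack) {
    $t = preg_quote($tag, '/');
    $n = preg_quote($needle, '/');

    // Any run of characters that does not pass the outer closing tag.
    $inner = '(?:(?!<\/' . $t . '>).)*?';

    // (?:\s[^>]*)? allows optional attributes but stops <p from matching <pre>.
    // The s modifier lets the dot inside $inner match newlines as well.
    $pattern = '/<' . $t . '(?:\s[^>]*)?>' . $inner
             . '\b(' . $n . ')\b' . $inner . '<\/' . $t . '>/is';

    return preg_match($pattern, $haystack, $m) === 1 ? $m[1] : null;
}

var_dump(find_word_in_tag('p', 'Document', '<p>Untitled Document</p>'));
var_dump(find_word_in_tag('p', 'Document', '<p class="x"><b>Untitled Document</b></p>'));
var_dump(find_word_in_tag('p', 'Document', '<title>Untitled Document</title>'));
```

Regular expressions only go so far with HTML; for anything more deeply nested, a real parser is probably the safer route.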

Hope this helps.

Justin