Forum Moderators: coopster

Regex Help Needed from a Guru (long)

Need to match all code between tags (even nested tags of same tag name)

         

skyeflye

10:54 pm on Mar 13, 2005 (gmt 0)

10+ Year Member



Hopefully this will be easy to answer for someone who has faced a similar situation before. I'll try to keep it as straightforward and simple as possible.

Simply put, I am parsing a string of well-formed HTML code. I am trying to extract a given tag, and all the code contained within it. I do not know how to determine (or specify) the position of a given tag's actual closing tag if the tag contains nested tags of the same name (such as nested <div> tags).

For example, let's say you have the following string of HTML code:

<h1>Title of page</h1>

<p>An introductory paragraph.</p>

<div id="myDiv">
<p>Some text here.</p>
<div id="anotherDiv">
<p>Some text inside the first nested div.</p>
<div class="omgAnother">
<p>Text inside the second double-nested div</p>
</div>
</div>
<p>Some more text.</p>
</div>

<div id="separateDiv">
<p>Here is a separate second div tag.</p>
</div>

<p>Here is a paragraph of text</p>

OK, so let's say you wanted to write a regular expression pattern that matched the entire <div id="myDiv"> tag, its closing tag, and everything in between.

It's not as simple as it seems at first because of the nested <div> tags. The nested closing div tags (</div>) will throw off any pattern that I can come up with.

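For example, if I do an ungreedy match, the pattern stops at the first closing </div> it encounters (the one that closes the innermost nested <div>), so the match runs only from <div id="myDiv"> down to the first </div> in the listing below.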


<?php
$regex_pattern = '#\<div\s.*id="myDiv".*\>.*\</div\>#Us';
?>

<h1>Title of page</h1>

<p>An introductory paragraph.</p>

<div id="myDiv">
<p>Some text here.</p>
<div id="anotherDiv">
<p>Some text inside the first nested div.</p>
<div class="omgAnother">
<p>Text inside the second double-nested div</p>
</div>

</div>
<p>Some more text.</p>
</div>

<div id="separateDiv">
<p>Here is a separate second div tag.</p>
</div>

<p>Here is a paragraph of text</p>

And conversely, if I do a "greedy" match it matches all the way down to the very last closing </div> tag:


<?php
$regex_pattern = '#\<div\s.*id="myDiv".*\>.*\</div\>#s';
?>

<h1>Title of page</h1>

<p>An introductory paragraph.</p>


<div id="myDiv">
<p>Some text here.</p>
<div id="anotherDiv">
<p>Some text inside the first nested div.</p>
<div class="omgAnother">
<p>Text inside the second double-nested div</p>
</div>
</div>
<p>Some more text.</p>
</div>

<div id="separateDiv">
<p>Here is a separate second div tag.</p>
</div>

<p>Here is a paragraph of text</p>

I think I am just stumbling on the logic. I just can't quite seem to wrap my mind around the actual logical procedure that would be used to determine which </div> is the correct one to stop at. For example, it would be great if I could make it skip over any closing </div> tag that it finds if it previously found an opening <div tag. (And it would need to repeat that as many times as necessary, and keep track of how many opening <div tags it encountered so it would 'know' to skip the same number of closing </div> tags). Does that make sense?

So, is there any reasonable way to make sure that a different regex pattern will always match the actual closing tag of a given HTML tag?

Or is there some other method (not relying only on regex) in which I would be able to determine the exact position of a given html tag's actual closing tag? I can easily determine the location of the starting tag. If I could determine the position of its actual closing tag, then I could potentially just use the built-in substr() function.

I realize that I could potentially use some kind of XML parsing functions, but I have never looked into them before, and (for portability) I would like to avoid utilizing any PHP extensions that are not part of the default PHP 4.3.10 install on secure (production) *nix servers.

Thanks in advance if anyone can shed some light on this for me.

Cheers!
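On the regex question in the post above: PCRE can recurse into a subpattern, which is one way a single pattern might keep nested tags balanced. The following is only a sketch under assumptions: it presumes a PCRE build that supports the (?1) recursion syntax (not verified here for PHP 4.3.10), markup with no stray "<" characters in text, and no "<div" strings hidden in comments or attribute values. The sample string is a shortened version of the example HTML, and the variable names are illustrative.

```php
<?php
// Shortened version of the sample markup from the post above.
$html = '<p>Intro</p>'
      . '<div id="myDiv"><p>Some text here.</p>'
      . '<div id="anotherDiv"><div class="omgAnother"><p>Deep</p></div></div>'
      . '<p>Some more text.</p></div>'
      . '<div id="separateDiv"><p>Separate</p></div>';

// Group 1 is the body of a div; (?1) re-enters group 1 for each nested div.
$pattern = '~<div\b[^>]*\sid="myDiv"[^>]*>'
         . '('
         .   '(?:'
         .     '(?>[^<]+)'                // plain text between tags
         .     '|<(?!/?div\b)[^>]*>'      // any tag other than <div> or </div>
         .     '|<div\b[^>]*>(?1)</div>'  // a nested div, matched recursively
         .   ')*'
         . ')'
         . '</div>~';

// $matched holds the whole myDiv block, or null if nothing matched.
$matched = preg_match($pattern, $html, $m) ? $m[0] : null;
```

Because every nested <div> has to be consumed together with its own </div> by the recursive branch, the final </div> in the pattern can only line up with the tag that closes myDiv.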

skyeflye

4:09 pm on Mar 14, 2005 (gmt 0)

10+ Year Member



Thanks to anyone who may have given my original question (above) any thought. I came to the conclusion that without a Mensa I.D. card, I wasn't going to be able to come up with a regex solution in any reasonable amount of time...and time is money.

I did create a solution for this problem last night without relying solely on regex. I just wrote a PHP function that "manually" scans through the string of HTML code in order to locate the desired tag's actual closing tag.

I will post the function I wrote to this thread when I get home tonight just in case it might ever be of some use to anybody.

Cheers!

andyat11

4:13 pm on Mar 14, 2005 (gmt 0)

10+ Year Member



CONGRATS

jatar_k

10:17 pm on Mar 14, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



the thing that comes to mind is to mark the position where you find the opening tag and start reading from there

could use a counter for open divs: if you find another <div, you need to read through one more </div

which equates to $divcount++ and $divcount-- (starting from 0 just after the opening tag), and if $divcount<0 then you just found your closing tag ;)
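The counter idea above can be sketched roughly as follows. This is not the function skyeflye wrote (that was never posted here); extract_balanced_tag() and its parameters are hypothetical names. It assumes lowercase tags, a closing tag written exactly as "</div>", and no other tag name that begins with "div". It sticks to strpos() and substr(), so it should not need any extension.

```php
<?php
// Hypothetical helper: returns the markup from the opening tag at offset
// $start through its matching closing tag, or false if the tags are unbalanced.
function extract_balanced_tag($html, $start, $tag = 'div')
{
    if ($start === false) {
        return false;                 // caller did not find the opening tag
    }
    $open     = '<' . $tag;           // naive prefix match, e.g. "<div"
    $close    = '</' . $tag . '>';
    $divcount = 0;                    // nested tags currently open
    $pos      = $start + strlen($open);

    while (true) {
        $nextClose = strpos($html, $close, $pos);
        if ($nextClose === false) {
            return false;             // ran out of closing tags
        }
        $nextOpen = strpos($html, $open, $pos);
        if ($nextOpen !== false && $nextOpen < $nextClose) {
            $divcount++;              // a nested opening tag comes first
            $pos = $nextOpen + strlen($open);
        } else {
            $divcount--;              // consume one closing tag
            $pos = $nextClose + strlen($close);
            if ($divcount < 0) {      // this one closes the starting tag
                return substr($html, $start, $pos - $start);
            }
        }
    }
}
```

A call might look like extract_balanced_tag($html, strpos($html, '<div id="myDiv"')), which returns the whole myDiv block including its nested divs.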

skyeflye

1:13 am on Mar 15, 2005 (gmt 0)

10+ Year Member



Thanks for the input, Jatar. That is almost exactly what my function does. Unfortunately, my hard drive failed and I am recovering everything off of the backup drive and reinstalling everything.

But once I have a moment, I will go ahead and post the PHP function that does essentially what you said. It's probably not the most efficient thing in the world, but it seems to be working for me.

I'll post it in this thread once I put my 'life' back together. :)

Cheers!

gliff

2:39 pm on Mar 15, 2005 (gmt 0)

10+ Year Member



Parsing more than trivial HTML with regular expressions is a sure way to drive yourself crazy. Perl has several HTML parsers that can make this kind of task a lot easier (I've yet to find a good one for PHP, but haven't looked very hard).

Also, if you know the code is well formed and you're using XHTML, you could probably use the expat parser to pull out what you need.

Or, you know, use the function you wrote. (-:
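For the expat route mentioned above, a rough sketch using PHP's xml_parser_* functions might look like this. extract_element_by_id() is a hypothetical name, and there are several assumptions: the input is well-formed XHTML with no undefined entities such as &nbsp;, the closures need PHP 5.3 or later (on PHP 4 you would use named handler functions instead), and the result is rebuilt from parser events, so it is not byte-identical to the source (an empty tag like <br /> comes back as <br></br>).

```php
<?php
// Hypothetical helper: rebuilds the markup of the first element whose id
// attribute equals $id, using the expat-based XML parser functions.
function extract_element_by_id($xhtml, $id)
{
    $out   = '';     // rebuilt markup of the target element
    $depth = 0;      // 0 = not capturing; otherwise nesting depth in target
    $done  = false;  // true once the target element has been closed

    $parser = xml_parser_create();
    // Keep tag and attribute names in their original case.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$out, &$depth, &$done, $id) {
            if ($done) {
                return;
            }
            if ($depth === 0 && (!isset($attrs['id']) || $attrs['id'] !== $id)) {
                return;  // still looking for the target element
            }
            $depth++;
            $out .= '<' . $name;
            foreach ($attrs as $k => $v) {
                $out .= ' ' . $k . '="' . htmlspecialchars($v) . '"';
            }
            $out .= '>';
        },
        function ($p, $name) use (&$out, &$depth, &$done) {
            if ($done || $depth === 0) {
                return;
            }
            $out .= '</' . $name . '>';
            if (--$depth === 0) {
                $done = true;  // the target's own closing tag was reached
            }
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$out, &$depth, &$done) {
            if (!$done && $depth > 0) {
                $out .= htmlspecialchars($data);
            }
        }
    );

    // Wrap in a dummy root so a fragment with several top-level tags parses.
    $ok = xml_parse($parser, '<root>' . $xhtml . '</root>', true);
    return ($ok && $done) ? $out : false;
}
```

The parser does the tag matching, so the handlers only need to count depth while inside the target element.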

gettopreacherman

9:06 pm on Mar 15, 2005 (gmt 0)

10+ Year Member



<div>([\S*]^<)<p>