Regex Help Needed from a Guru (long)

Hopefully this will be easy to answer for someone who has faced a similar situation before. I'll try to keep it as straightforward as possible.

Simply put, I am parsing a string of well-formed HTML code. I am trying to extract a given element: its opening tag, its closing tag, and all the code contained between them. The problem is that I do not know how to determine (or specify) the position of a tag's actual closing tag when the tag contains nested tags of the same name (such as nested <div> tags).

For example, let's say you have the following string of HTML code:

<h1>Title of page</h1> 
<p>An introductory paragraph.</p> 
<div id="myDiv"> 
<p>Some text here.</p> 
<div id="anotherDiv"> 
<p>Some text inside the first nested div.</p> 
<div class="omgAnother"> 
<p>Text inside the second double-nested div</p> 
</div> 
</div> 
<p>Some more text.</p> 
</div> 
<div id="separateDiv"> 
<p>Here is a separate second div tag.</p> 
</div> 
<p>Here is a paragraph of text</p>

OK, so let's say you wanted to write a regular expression pattern that matched the opening <div id="myDiv"> tag, its closing tag, and everything in between.

It's not as simple as it seems at first because of the nested <div> tags. The nested closing div tags (</div>) will throw off any pattern that I can come up with.

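For example, if I do an ungreedy match, it stops at the first closing </div> it finds, which is the one that closes the innermost nested <div class="omgAnother">. So I get only the portion from <div id="myDiv"> down to that tag: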


<?php
$regex_pattern = '#\<div\s.*id="myDiv".*\>.*\</div\>#Us';
?>

<h1>Title of page</h1>
<p>An introductory paragraph.</p>

<div id="myDiv">
<p>Some text here.</p>
<div id="anotherDiv">
<p>Some text inside the first nested div.</p>
<div class="omgAnother">
<p>Text inside the second double-nested div</p>
</div>

</div>
<p>Some more text.</p>
</div>
<div id="separateDiv">
<p>Here is a separate second div tag.</p>
</div>
<p>Here is a paragraph of text</p>

Conversely, if I do a "greedy" match, it runs all the way down to the very last closing </div> tag, so the separate <div id="separateDiv"> is swallowed as well:


<?php
$regex_pattern = '#\<div\s.*id="myDiv".*\>.*\</div\>#s';
?>

<h1>Title of page</h1> 
<p>An introductory paragraph.</p> 

<div id="myDiv"> 
<p>Some text here.</p> 
<div id="anotherDiv"> 
<p>Some text inside the first nested div.</p> 
<div class="omgAnother"> 
<p>Text inside the second double-nested div</p> 
</div> 
</div> 
<p>Some more text.</p> 
</div> 
<div id="separateDiv"> 
<p>Here is a separate second div tag.</p> 
</div> 
 
<p>Here is a paragraph of text</p>

I think I am just stumbling on the logic. I can't quite work out the procedure that would determine which </div> is the correct one to stop at. For example, it would be great if I could make the pattern skip one closing </div> tag for every opening <div tag it has already passed after the starting tag. It would need to repeat that as many times as necessary, keeping count of the opening <div tags it encounters so that it 'knows' to skip the same number of closing </div> tags. Does that make sense?
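To make that counting idea concrete, here is a rough sketch of what I have in mind. It is not tested against real-world pages. The function name extract_element() and its parameters are made up for illustration, and it assumes well-formed HTML in which the tag name never appears inside comments, scripts, or attribute values. It collects every opening and closing tag of the given name with preg_match_all(), then walks them with a depth counter until the counter returns to zero.

```php
<?php
// Hypothetical helper (illustration only): returns the whole element that
// begins at the first tag matched by $open_pattern, or false if the tags
// never balance. Assumes well-formed HTML.
function extract_element($html, $open_pattern, $tag)
{
    // Find the opening tag we care about and remember its byte offset.
    if (!preg_match($open_pattern, $html, $m, PREG_OFFSET_CAPTURE)) {
        return false;
    }
    $start = $m[0][1];

    // Collect every opening and closing tag of this name, with offsets.
    // Group 1 is "/" for a closing tag and empty for an opening tag.
    $tag_pattern = '#<(/?)' . preg_quote($tag, '#') . '\b[^>]*>#i';
    preg_match_all($tag_pattern, $html, $tags, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $depth = 0;
    foreach ($tags as $t) {
        $offset = $t[0][1];
        if ($offset < $start) {
            continue; // this tag lies before the element we want
        }
        if ($t[1][0] === '/') {
            $depth--; // a closing tag cancels one opening tag
        } else {
            $depth++; // an opening tag means one more closing tag to skip
        }
        if ($depth === 0) {
            // Back to zero: this closing tag belongs to the starting tag.
            $end = $offset + strlen($t[0][0]);
            return substr($html, $start, $end - $start);
        }
    }
    return false; // unbalanced markup
}

// Usage with a compact version of the example markup.
$html = '<p>intro</p><div id="myDiv"><p>a</p><div id="anotherDiv">'
      . '<div class="omgAnother">b</div></div><p>c</p></div>'
      . '<div id="separateDiv">d</div>';
$result = extract_element($html, '#<div\b[^>]*\bid="myDiv"[^>]*>#i', 'div');
echo $result, "\n";
```

The counter goes 1, 2, 3, 2, 1, 0 on this input, so the function should stop at the third closing </div> and leave <div id="separateDiv"> out.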

So, is there any reasonable way to make sure that a different regex pattern will always match the actual closing tag of a given HTML tag?
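One candidate I have read about is PCRE's recursive subpattern syntax, where (?1) re-runs capturing group 1 from inside itself. The sketch below is only that, a sketch: I am assuming the PCRE library bundled with the PHP build supports recursion (I have not confirmed this for PHP 4.3.10), that tag names are lowercase, and that no "<div" text appears inside comments or attribute values. The variable names $filler and $nested are just for readability.

```php
<?php
// Compact version of the example markup.
$html = '<p>intro</p><div id="myDiv"><p>a</p><div id="anotherDiv">'
      . '<div class="omgAnother">b</div></div><p>c</p></div>'
      . '<div id="separateDiv">d</div>';

// Text that is not the start of a <div> or </div> tag: either a run of
// non-"<" characters, or a single "<" that does not begin a div tag.
$filler = '(?>[^<]+)|<(?!/?div\b)';

// Group 1: one balanced nested <div>...</div>. The (?1) inside it calls
// group 1 again, so any depth of nesting is consumed as a unit.
$nested = '(<div\b[^>]*>(?:' . $filler . '|(?1))*</div>)';

// The target opening tag, then any mix of filler and balanced nested divs,
// then the closing tag that belongs to the target.
$pattern = '#<div\b[^>]*\bid="myDiv"[^>]*>(?:' . $filler . '|' . $nested . ')*</div>#';

$matched = false;
if (preg_match($pattern, $html, $m)) {
    $matched = $m[0];
    echo $matched, "\n";
}
```

Because nested <div> elements can only be consumed as complete, balanced units, the final </div> in the pattern should only ever line up with the closing tag of <div id="myDiv">.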

Or is there some other method (not relying only on regex) in which I would be able to determine the exact position of a given html tag's actual closing tag? I can easily determine the location of the starting tag. If I could determine the position of its actual closing tag, then I could potentially just use the built-in substr() function.
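For what it's worth, here is roughly how I picture the non-regex version, using only strpos() and substr(). The helper name find_div_end() is made up, and the sketch assumes lowercase tag names, well-formed markup, and no other tag whose name begins with "div". Starting from the opening tag's offset, it repeatedly looks for the next "<div" and the next "</div>", counts whichever comes first, and returns the offset just past the closing tag that brings the count back to zero.

```php
<?php
// Hypothetical helper (illustration only): $start is the offset of an
// opening <div ...> tag in $html. Returns the offset just past its matching
// </div>, or false if the markup is unbalanced.
function find_div_end($html, $start)
{
    $depth = 0;
    $pos = $start;
    while (true) {
        $next_open  = strpos($html, '<div', $pos);
        $next_close = strpos($html, '</div>', $pos);
        if ($next_close === false) {
            return false; // ran out of closing tags before balancing
        }
        if ($next_open !== false && $next_open < $next_close) {
            $depth++;               // an opening tag comes first
            $pos = $next_open + 4;  // step past "<div"
        } else {
            $depth--;               // a closing tag comes first
            $pos = $next_close + 6; // step past "</div>"
            if ($depth === 0) {
                return $pos;
            }
        }
    }
}

// Usage with a compact version of the example markup.
$html = '<p>intro</p><div id="myDiv"><p>a</p><div id="anotherDiv">'
      . '<div class="omgAnother">b</div></div><p>c</p></div>'
      . '<div id="separateDiv">d</div>';
$start = strpos($html, '<div id="myDiv"');
$end = find_div_end($html, $start);
$element = substr($html, $start, $end - $start);
echo $element, "\n";
```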

I realize that I could potentially use some kind of XML parsing functions, but I have never looked into them before, and (for portability) I would like to avoid utilizing any PHP extensions that are not part of the default PHP 4.3.10 install on secure (production) *nix servers.

Thanks in advance if anyone can shed some light on this for me.

Cheers!
