Simply put, I am parsing a string of well-formed HTML code. I am trying to extract a given tag, and all the code contained within it. I do not know how to determine (or specify) the position of a given tag's actual closing tag if the tag contains nested tags of the same name (such as nested <div> tags).
For example, let's say you have the following string of HTML code:
<h1>Title of page</h1>
<p>An introductory paragraph.</p>
<div id="myDiv">
<p>Some text here.</p>
<div id="anotherDiv">
<p>Some text inside the first nested div.</p>
<div class="omgAnother">
<p>Text inside the second double-nested div</p>
</div>
</div>
<p>Some more text.</p>
</div><div id="separateDiv">
<p>Here is a separate second div tag.</p>
</div><p>Here is a paragraph of text</p>
OK, so let's say you wanted to write a regular expression pattern that matched the entire <div id="myDiv"> tag, its closing tag, and everything in between.
It's not as simple as it seems at first because of the nested <div> tags. The nested closing div tags (</div>) will throw off any pattern that I can come up with.
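For example, if I do an ungreedy match, the pattern stops at the first closing </div> it encounters, which is the one that closes the innermost nested <div>. The match covers only the portion from <div id="myDiv"> down to that first </div> in the listing below.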
<?php
$regex_pattern = '#\<div\s.*id="myDiv".*\>.*\</div\>#Us';
?>
<h1>Title of page</h1>
<p>An introductory paragraph.</p>
<div id="myDiv">
<p>Some text here.</p>
<div id="anotherDiv">
<p>Some text inside the first nested div.</p>
<div class="omgAnother">
<p>Text inside the second double-nested div</p>
</div>
</div>
<p>Some more text.</p>
</div><div id="separateDiv">
<p>Here is a separate second div tag.</p>
</div><p>Here is a paragraph of text</p>
And conversely, if I do a "greedy" match it matches all the way down to the very last closing </div> tag:
<?php
$regex_pattern = '#\<div\s.*id="myDiv".*\>.*\</div\>#s';
?>
<h1>Title of page</h1>
<p>An introductory paragraph.</p>
<div id="myDiv">
<p>Some text here.</p>
<div id="anotherDiv">
<p>Some text inside the first nested div.</p>
<div class="omgAnother">
<p>Text inside the second double-nested div</p>
</div>
</div>
<p>Some more text.</p>
</div><div id="separateDiv">
<p>Here is a separate second div tag.</p>
</div>
<p>Here is a paragraph of text</p>
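To reproduce both results side by side, here is a minimal sketch. The $html string is a shortened stand-in for the sample above (an assumption for brevity, with the same nesting shape), and the variable names are just for illustration.

```php
<?php
// Shortened, hypothetical stand-in for the sample HTML (same nesting shape).
$html = '<p>Intro</p>'
      . '<div id="myDiv"><p>A</p><div id="anotherDiv">'
      . '<div class="omgAnother"><p>B</p></div></div><p>C</p></div>'
      . '<div id="separateDiv"><p>D</p></div><p>End</p>';

// The same two patterns as in the post: U flips the quantifiers to ungreedy,
// s lets "." match newlines.
$ungreedy = '#\<div\s.*id="myDiv".*\>.*\</div\>#Us';
$greedy   = '#\<div\s.*id="myDiv".*\>.*\</div\>#s';

preg_match($ungreedy, $html, $m_ungreedy);
preg_match($greedy, $html, $m_greedy);

// Stops too early: ends at the first </div>, the one closing "omgAnother".
echo $m_ungreedy[0], "\n";

// Runs too far: ends at the last </div>, the one closing "separateDiv".
echo $m_greedy[0], "\n";
```

Neither match ends at the </div> that actually closes myDiv, because neither pattern has any notion of nesting depth.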
I think I am just stumbling on the logic. I just can't quite seem to wrap my mind around the actual logical procedure that would be used to determine which </div> is the correct one to stop at. For example, it would be great if I could make it skip over any closing </div> tag that it finds if it previously found an opening <div tag. (And it would need to repeat that as many times as necessary, and keep track of how many opening <div tags it encountered so it would 'know' to skip the same number of closing </div> tags). Does that make sense?
So, is there any reasonable way to make sure that a different regex pattern will always match the actual closing tag of a given HTML tag?
Or is there some other method (not relying only on regex) in which I would be able to determine the exact position of a given html tag's actual closing tag? I can easily determine the location of the starting tag. If I could determine the position of its actual closing tag, then I could potentially just use the built-in substr() function.
I realize that I could potentially use some kind of XML parsing functions, but I have never looked into them before, and (for portability) I would like to avoid utilizing any PHP extensions that are not part of the default PHP 4.3.10 install on secure (production) *nix servers.
Thanks in advance if anyone can shed some light on this for me.
Cheers!
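On the regex-only question: PCRE does support recursive subpatterns, so a pattern can in principle balance nested tags. What follows is only a sketch of that technique, not something posted in this thread. The function name match_div_by_id is hypothetical, and it assumes well-formed markup, a double-quoted id attribute, and no ">" characters inside attribute values.

```php
<?php
// Hypothetical helper: returns the whole balanced <div>...</div> whose opening
// tag carries the given id, or false if there is no such element.
function match_div_by_id($html, $id)
{
    $pattern =
        // Lookahead: only start at a <div ...> whose attributes include id="...".
        '#(?=<div\b[^>]*\bid="' . preg_quote($id, '#') . '")'
        // Group 1: one balanced div. Its body is any mix of plain text,
        // a "<" that does not begin a div tag, or a nested balanced div
        // matched by recursing into group 1 via (?1).
        . '(<div\b[^>]*>(?:(?>[^<]+)|<(?!/?div\b)|(?1))*</div>)#i';

    if (preg_match($pattern, $html, $m)) {
        return $m[1];
    }
    return false;
}

// Shortened, hypothetical stand-in for the sample HTML.
$html = '<p>Intro</p>'
      . '<div id="myDiv"><p>A</p><div id="anotherDiv">'
      . '<div class="omgAnother"><p>B</p></div></div><p>C</p></div>'
      . '<div id="separateDiv"><p>D</p></div><p>End</p>';

$result = match_div_by_id($html, 'myDiv');
echo $result, "\n";
```

On very large or deeply nested documents a recursive pattern may run into PCRE's backtracking or recursion limits, so a manual scan is probably the more predictable route.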
I did create a solution for this problem last night without relying solely on regex. I just wrote a PHP function that "manually" scans through the string of HTML code in order to locate the desired tag's actual closing tag.
I will post the function I wrote to this thread when I get home tonight just in case it might ever be of some use to anybody.
Cheers!
You could use a counter for open divs: each time you find another <div, you need to read through one more </div.
That equates to $divcount++ and $divcount--, and if $divcount < 0 then you just found your closing tag ;)
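A minimal sketch of that counter idea, in case it helps anyone. This is not the function mentioned earlier in the thread; find_closing_tag_end and its parameters are hypothetical names. It assumes well-formed markup and that $start points at the opening tag you care about. It counts the opening tag itself, so the closing tag is found when the depth returns to zero rather than dropping below it.

```php
<?php
// Hypothetical helper: returns the offset just past the closing tag that
// matches the opening tag at $start, or false if no match is found.
function find_closing_tag_end($html, $tag, $start)
{
    // Matches "<tag ...>" or "</tag>"; group 1 holds "/" for a closing tag.
    $pattern = '#<(/?)' . preg_quote($tag, '#') . '(?=[\s>/])[^>]*>#i';
    $depth = 0;
    $offset = $start;

    while (preg_match($pattern, $html, $m, PREG_OFFSET_CAPTURE, $offset)) {
        $tag_text = $m[0][0];
        $offset = $m[0][1] + strlen($tag_text);   // continue after this tag

        if ($m[1][0] === '/') {
            $depth--;                             // closing tag
            if ($depth == 0) {
                return $offset;                   // back to the starting level
            }
        } elseif (substr($tag_text, -2) !== '/>') {
            $depth++;                             // opening tag (not self-closed)
        }
    }
    return false;
}

// Shortened, hypothetical stand-in for the sample HTML.
$html = '<p>Intro</p>'
      . '<div id="myDiv"><p>A</p><div id="anotherDiv">'
      . '<div class="omgAnother"><p>B</p></div></div><p>C</p></div>'
      . '<div id="separateDiv"><p>D</p></div><p>End</p>';

$start = strpos($html, '<div id="myDiv"');
$end = find_closing_tag_end($html, 'div', $start);
$element = substr($html, $start, $end - $start);
echo $element, "\n";
```

Once both positions are known, substr() does the extraction, which is the approach the original question was hoping for.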
But once I have a moment, I will go ahead and post the PHP function that does essentially what you said. It's probably not the most efficient thing in the world, but it seems to be working for me.
I'll post it in this thread once I put my 'life' back together. :)
Cheers!
Also, if you know the code is well formed and you're using XHTML, you could probably use the expat parser to pull out what you need.
Or, you know, use the function you wrote. (-:
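For the expat route, here is a rough sketch using PHP's XML parser functions. It assumes the markup really is well-formed XHTML, and the handler names (on_start, on_end, on_text) and globals are made up for the example. Note that it rebuilds the element from parser events, so the result is a re-serialization rather than a byte-for-byte copy; an empty element such as <br/> would come back as <br></br>.

```php
<?php
$capture_depth = 0;   // 0 = outside the target; otherwise depth inside it
$captured = '';       // rebuilt markup of the target element

function on_start($parser, $name, $attrs)
{
    global $capture_depth, $captured;
    $is_target = isset($attrs['id']) && $attrs['id'] == 'myDiv';
    if ($capture_depth == 0 && !$is_target) {
        return;                                   // not inside the target yet
    }
    $capture_depth++;
    $captured .= '<' . $name;
    foreach ($attrs as $key => $value) {
        $captured .= ' ' . $key . '="' . htmlspecialchars($value) . '"';
    }
    $captured .= '>';
}

function on_end($parser, $name)
{
    global $capture_depth, $captured;
    if ($capture_depth > 0) {
        $captured .= '</' . $name . '>';
        $capture_depth--;                         // reaches 0 at the real closing tag
    }
}

function on_text($parser, $data)
{
    global $capture_depth, $captured;
    if ($capture_depth > 0) {
        $captured .= htmlspecialchars($data);     // re-escape text content
    }
}

// Shortened, hypothetical stand-in for the sample HTML.
$xhtml = '<p>Intro</p>'
       . '<div id="myDiv"><p>A</p><div id="anotherDiv">'
       . '<div class="omgAnother"><p>B</p></div></div><p>C</p></div>'
       . '<div id="separateDiv"><p>D</p></div><p>End</p>';

$parser = xml_parser_create();
// Keep tag and attribute names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($parser, 'on_start', 'on_end');
xml_set_character_data_handler($parser, 'on_text');

// XML allows only one top-level element, so wrap the fragment in a dummy root.
if (!xml_parse($parser, '<root>' . $xhtml . '</root>', true)) {
    die(xml_error_string(xml_get_error_code($parser)));
}
echo $captured, "\n";
```

The parser does the nesting bookkeeping here, so there is no need to hunt for the closing tag's position at all.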