Forum Moderators: coopster


how to get the text after the div tag using preg match in php

preg_match, getting the text after the div tag

         

freshfromseo

10:48 am on Jan 24, 2008 (gmt 0)

10+ Year Member



hello everyone,

I am new to web development, and my problem is that I cannot get the contents after the <div id="article_text">. I used this regex: preg_match('/(<div id="article_text">) (.*) (<\/div>)',$html,$body); What's wrong with this expression? Please help me. Thanks a lot :)


PHP_Chimp

7:50 pm on Jan 24, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How are you trying to echo $body? $body is an array, equivalent to $matches in the example in the manual [uk3.php.net] -
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.

So -


preg_match('/(<div id="article_text">) (.*) (<\/div>)/',$html,$body); // note the final / that closes the pattern; were you getting a warning about no final / in your pattern?
echo '<pre>';
print_r($body);
echo '</pre>';

should give you what you are looking for. If you are just looking for the text between those tags then $body[2] may well be what you want.
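
As a minimal sketch of how the $body array is laid out, here is the same pattern run against a made-up $html string (the sample string is an assumption, not the poster's real page). Note that the literal spaces in the pattern must also be present in the HTML for it to match.

```php
<?php
// Hypothetical sample input; the spaces around the text are needed because
// the pattern contains literal spaces between its groups.
$html = '<div id="article_text"> Hello world </div>';

if (preg_match('/(<div id="article_text">) (.*) (<\/div>)/', $html, $body)) {
    echo $body[0], "\n"; // the full match
    echo $body[1], "\n"; // first group: the opening tag
    echo $body[2], "\n"; // second group: the text between the tags
    echo $body[3], "\n"; // third group: the closing tag
} else {
    echo "no match\n";
}
```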

You can also use (?:<div ) if you want to use the brackets to group things, but don't want them to be returned in the $body array.
Also remember with regex that if you have -

<div id='article_text'> // single not double quotes
<div id="article_text" > // space after final "
<div id='article_text' style="color:fff"> // style attribute

NONE of these will match your pattern, as they are all different, so you may need to expand the pattern to cover more possibilities. If nothing matches, you will get an empty array back. You also need to be careful, as the .* will match as much as possible, so the (.*) may well grab everything from after your div tag until the final </div> in the code. If the page has more than 1 div, you may well need to amend the pattern so that it stops at the first </div>, not the last.
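
One way to loosen the pattern along those lines is sketched below. It assumes the id attribute comes first in the tag and that the div contains no nested divs; the sample $html is invented for illustration.

```php
<?php
// Hypothetical input with two divs; the first uses single quotes and an
// extra style attribute.
$html = "<div id='article_text' style=\"color:#fff\">first</div>"
      . '<div class="other">second</div>';

// \s+ allows any whitespace after "div", ["\'] accepts either quote style,
// [^>]* skips any further attributes or spaces before the closing >,
// and the lazy .*? stops at the first </div> instead of the last.
$pattern = '/<div\s+id=["\']article_text["\'][^>]*>(.*?)<\/div>/is';

if (preg_match($pattern, $html, $body)) {
    echo $body[1], "\n"; // the text between the tags
} else {
    echo "no match\n";
}
```

With a greedy (.*) in place of (.*?), the capture would run on to the last </div> in the string.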

<edit>
O ye, welcome to WebmasterWorld :)


freshfromseo

7:08 am on Jan 25, 2008 (gmt 0)

10+ Year Member



Thank you, PHP_Chimp. Thanks to your reply I can now get the text between the div tags. Sorry if I did not explain the problem very well.

I now have a new problem with preg_match: I want to display the text between the div tags. I'm using this regex -->
<---
$orig_link=$rss->items[$id]['feedburner']['origlink'];
$html = getfile($orig_link);

preg_match('/<div class="leftColStoryPhoto"">(?>(?:[^<]++|<(?!\/?div\b[^>]*>))+|(?R))*<\/div>/is',$html,$body);

echo $body[0]; // i did not see any display for the value of $body
--->

Below is the HTML code. I want to get the text between the <div class="leftColStoryPhoto"> tags.
//html code
-----------------------------------------------------------------------------------
<div class="leftColStoryPhoto">
<!-- temp image holder only below -->
<img src="http://www.example.com/i/s4/illo/photos/2008/Jan/Cool%20Apple/Macintosh128k.bmp" border="0" alt="" />
<p><p>Apple's long history of innovative and occasionally quirky design was reinforced with <a href="http://hardware.silicon.com/desktops/0,39024645,39169694,00.htm">last week's Macworld launch of the wafer-thin MacBook Air laptop</a>. But could this new machine one day figure in the pantheon of Apple's greatest creations? Seb Janacek sifts out his favourite Apple products from more than 30 years of lemons, blind alleys and sheer genius.</p>

<p><strong>1: Macintosh 128K</strong></p>

<p>Arguably the computer that had the biggest influence on personal computing. The Macintosh was launched in 1984 and everything about it was revolutionary, from the consumer-oriented graphical user interface to the mouse - both firsts for a commercially successful computer. Even the history of its development has passed into Mac mythology. The Macintosh set a new paradigm for Apple and the industry as a whole. Without the Mac there'd be no Windows. It was rather cute too - an all-in-one case with a nine-inch monitor. It was also marketed via the famous <em>1984</em> commercial directed by Ridley Scott. It's interesting to speculate about where personal computing would be now without the original Mac - the computer for the "rest of us" that Steve Jobs promised would put a dent in the universe.</p>

<p><strong>Photo credit: <a href="http://example.org/licenses/by-sa/2.5/">Creative Commons licence</a></strong></p>
</p>
</div>

------------------------------------------------------------------------

What should my regex be for that? Can you teach me about regex, and explain what (?>(?:[^<]++|<(?!\/?div\b[^>]*>))+|(?R))*<\/div>/is and all of the characters used in it mean, especially in preg_match? I found that code on the net and am using it without understanding it. Can you help me, please? Thank you.


PHP_Chimp

10:45 pm on Jan 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wow that is a serious regex :)
preg_match('/<div class="leftColStoryPhoto"">(?>(?:[^<]++|<(?!\/?div\b[^>]*>))+|(?R))*<\/div>/is',$html,$body);

The full explanations can be found in the manual [uk.php.net]; here is a basic breakdown of what you have there -
The first bit -
<div class="leftColStoryPhoto""> should be <div class="leftColStoryPhoto">, as there wouldn't be a second " floating around at the end of the class attribute (a typo, I guess).
(?>(?:[^<]++|<(?!\/?div\b[^>]*>))+

means -
I made it smaller so it doesn't take up so much room ;)

(?> starts a once-only (atomic) sub-pattern
(?: starts a group that is not captured, so it won't appear in the $body array
[^<] means any character except <
++ means 1 or more, possessively: once matched, the characters are not given back when the regex backtracks. It is not a typo; a plain + also means 1 or more but does allow backtracking
| or
< literal character
(?! negative lookahead
\/ literal / character
? 0 or 1 (so the / may or may not be there)
div literal characters (this goes with the < literal)
\b outside a character class this is a word boundary
[^>] not a >
* 0 or more
> literal character to end the div tag
)) end of negative lookahead, end of non-capturing pattern
+ 1 or more of those

|(?R))*<\/div>/is


| or, so this is part of the original once-only pattern
(?R) means a recursive match of the whole pattern
) end of once-only pattern
* 0 or more times
<\/div> a literal </div>

i modifier: case insensitive
s modifier: makes the . match everything, including newlines; normally it matches everything except a newline. You would need it for a . on a multi-line string, but this regex contains no ., and a negated character class such as [^<] matches \n anyway, so you don't actually need it here.
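
To see a few of these pieces in isolation, here is a small sketch using made-up strings (none of them come from the page being scraped). It shows what the s modifier changes, that a negated character class matches newlines regardless, and how a possessive ++ differs from a plain +.

```php
<?php
// Hypothetical multi-line input.
$text = "<b>one\ntwo</b>";

// Without s, the . stops at the newline, so the closing tag is never reached.
$withoutS = preg_match('/<b>(.*)<\/b>/', $text);
// With s, the . matches the newline as well.
$withS = preg_match('/<b>(.*)<\/b>/s', $text);
// A negated character class matches newlines with or without s.
$negated = preg_match('/<b>([^<]*)<\/b>/', $text);

// A plain + can give back one "a" so that the following "ab" can match.
$plain = preg_match('/a+ab/', 'aaab');
// A possessive ++ keeps every "a" it took, so "ab" can no longer match.
$possessive = preg_match('/a++ab/', 'aaab');

var_dump($withoutS, $withS, $negated, $plain, $possessive);
// int(0) int(1) int(1) int(1) int(0)
```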

Wow, that was a long explanation. The regex that you have may well work with the extra " removed.
There is an easier regex, although the one you are using is meant to cope with nested tags. Bear in mind that (?R) repeats the whole pattern, opening tag included, so it only handles a nested div that has the same opening tag as the outer one; a nested div with a different class will make the match fail.
If removing the additional " doesn't work then I, or someone else, can put together a simpler regex that will work for you.
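
Because (?R) re-runs the whole pattern, including the fixed opening tag, one possible variation is to recurse into a capture group instead, so that a nested div with any attributes is accepted. The sketch below is an illustration under assumptions, not the code from the thread: the sample $html is invented, the class name is just the one discussed here, and the pattern still relies on the div tags being well balanced.

```php
<?php
// Hypothetical sample: the target div contains a nested div of another class.
$html = '<div class="leftColStory"><p>Intro</p>'
      . '<div class="leftColStoryQuote">Quote</div>'
      . '<p>More</p></div><div class="footer">x</div>';

$pattern = '~<div class="leftColStory">'   // fixed opening tag of the outer div
         . '('                             // group 1: the body of a div
         .   '(?>'
         .     '[^<]++'                    // text that contains no <
         .     '|<(?!/?div\b)'             // a < that does not start a div tag
         .     '|<div\b[^>]*>(?1)</div>'   // any nested div; (?1) re-runs group 1
         .   ')*'
         . ')'
         . '</div>~i';

if (preg_match($pattern, $html, $body)) {
    echo $body[1], "\n"; // contents of the outer div, nested div included
} else {
    echo "no match\n";
}
```

Using ~ as the delimiter avoids having to escape every / in the pattern.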


freshfromseo

2:47 am on Jan 26, 2008 (gmt 0)

10+ Year Member



Yes, there was a typo in <div class="leftColStoryPhoto"">; there was an extra " in my code. Thank you, PHP_Chimp.

Can you help me again with the code below?

<!-- Main Story content START -->
<div class="leftColStory">
<script type="text/javascript">

document.write("<div class=\"findBox\" id=\"resultRelatedContainer39169779\">" +
"<span id=\"resultRelated39169779\"><a href=\"javascript: om_ctrack(this, 'ra-open', 'ra - story - open'); article_loadSimDocs(39169779, '', 'http://management.silicon.com/careers/0,39024671,39169779,00.htm')\">" +
"<img src=\"/i/s4/gl/ico/related-articles.gif\" width=\"13\" height=\"14\" alt=\"\" title=\"\" border=0/>Show related <br />" +
"<span class=\"alignSecondLine\">articles</span></a></span><div id=\"resultRelatedContent39169779\" style=\"display: none; clear: both\"></div></div>");

</script>

<p>Stress at work could be more of a pain than you think. New research shows strong evidence of a direct biological link between workplace worry and coronary heart disease.</p>
<p><p class="textBox" style="width:200px;"><strong>Office insights… </strong> <br /> <br /> &diams;&nbsp <a href="http://management.silicon.com/careers/0,39024671,39169708,00.htm">Bored and underpaid? You're not alone…</a> <br /> <br /> &diams;&nbsp <a href="http://management.silicon.com/smedirector/0,39024679,39169683,00.htm">Health warning to overweight IT managers</a><br /> <br /> &diams;&nbsp <a href="http://management.silicon.com/careers/0,39024671,39169593,00.htm">Demand for tech workers hits six-year high</a> <br /><br /> &diams;&nbsp <a href="http://software.silicon.com/webservices/0,39024657,39168079,00.htm">How the staffing crisis is deepening</a> <br /><br /> &diams;&nbsp <a href="http://management.silicon.com/careers/0,39024671,39168129,00.htm">How techie salaries are faring</a><br /><br /> &diams;&nbsp <a href="http://networks.silicon.com/mobile/0,39024665,39168118,00.htm">Is the office getting you down?</a></p>
<p>The research, which was carried out by scientists at University College London (UCL) and is part of a long-running study following 10,308 London-based civil servants, has found evidence that workplace stress directly affects the biological mechanisms underlying coronary heart disease (CHD), rather than by simply encouraging unhealthy, heart-disease inducing habits in stress sufferers.</p>
<p>Dr Tarani Chandola, a senior lecturer in UCL's Department of Epidemiology and Public Health and one of the authors of the study, said for the first time the research sheds light on the mechanisms underlying the association between stress and heart disease, which "have remained unclear until now".</p>
<p>He said in a statement: "During 12 years of follow-up, we found that chronic work stress was associated with CHD and this association was stronger among both men and women aged under 50 - their risk of CHD was an average of 68 per cent more than for people who reported no stress at work."</p>
<p>Chandola added the association is less pronounced among people of retirement age who are therefore less likely to be exposed to work stress.</p>
<!-- Main Quote at Top START -->
<div class="leftColStoryQuote">
<div class="leftColStoryQuoteImg"><img src="/i/s4/gl/quote-left-dark.gif" width="21" height="18" border="0" alt="" /></div>
<div class="leftColStoryQuoteText">Workers suffering from stress had higher than normal levels of cortisol - the so-called 'stress' hormone.<img src="/i/s4/gl/quote-right-dark.gif" width="21" height="18" border="0" alt="" /></div>
<br clear="all" />
</div>
<!-- Main Quote at Top END -->
<p>Workers suffering from stress had higher than normal levels of cortisol - the so-called 'stress' hormone. The scientists also found evidence that stress disturbs the hypothalamic-pituitary-adrenal axis, part of the body's neuroendocrine system. Such disruptions to the nervous system can affect the signals being sent to the heart and could thus lead to cardiac instability, according to Chandola.</p>
<p>But the biological impact of a hair-tearing workplace is not the only negative effect of stress. The study also found stressed workers are more likely to engage in unhealthy behaviour that can lead indirectly to heart disease, such as having a poor diet or taking less exercise. This accounted for around 32 per cent of the effect of work stress on CHD, said Chandola.</p>
<p>A recent survey by the Policy Studies Institute found 'Big Brother'-style electronic surveillance systems <a href="http://management.silicon.com/government/0,39024677,39169595,00.htm">can fuel stress at work</a>. Job-related stress can also increase employee churn: 10 per cent of respondents to a silicon.com reader poll who are <a href="http://management.silicon.com/careers/0,39024671,39169708,00.htm">looking to change jobs this year</a> said their aim is to reduce workplace stress.</p>

</div>
<!-- Main Story content END -->
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

I'm using this:
preg_match('/<div class="leftColStory">(?>(?:[^<]++|<(?!\/?div\b[^>]*>))+|(?R))*<\/div>/is',$html,$body);
preg_match('/<div class="leftColStoryQuoteText">(?>(?:[^<]++|<(?!\/?div\b[^>]*>))+|(?R))*<\/div>/is',$html,$body1);

$text = $body[0] . "<p" . $body1[0];

echo $text;

But it doesn't work. I want to display the text between <div class="leftColStory"> and <div class="leftColStoryQuoteText">. Please help me; I hope somebody can give me the right code for that. Thank you very much.
