help with regex - PHP Server Side Scripting forum at WebmasterWorld - WebmasterWorld

help with regex

Includes something it shouldn't

lorax

6:35 pm on Nov 22, 2013 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I'm pulling in an XML feed and pushing it all into a variable $data.

Then I look for a string in the form of "XX(XX) 5##" where the X are uppercase letters min 2 but max 4 and the 5## are numbers that start with a 5 but the # can be anything from 0-9 (three numbers total).

ex: HI 520

Then I look for the beginning of the XML tag "<course" by locating the position of the string above and go backward 25 characters which puts me slightly before the opening XML tag.

Then I locate the position of the first instance of the opening tag. And do the same for the closing tag "</course>" with offsets.

Then I run a substring replace to remove all courses that have a numeric code that starts with 5 (like 500 or 595) EXCEPT those that start with AP (like AP 500).

The code works great in general, BUT for some reason it also removes a course with the code "EN 005", which leads me to think I've got an issue with my regex?

$i=0;
$graduate = '/([A-Z]{2,4}\s(5[0-9][0-9]))/';
preg_match_all ($graduate,$data,$matches);
// echo count($matches[0]);
while($i<=count($matches[0])){
    $startpos = strpos($data,$matches[0][$i])-25;
    // echo $startpos."<br/>";
    if (!stripos($data, "AP 5")) {
        $beginning = "<course";
        $ending = "</course>";
        $beginning_pos = strpos($data, $beginning, $startpos);
        $middle_pos = $beginning_pos + strlen($beginning)-7;
        $ending_pos = strpos($data, $ending, $beginning_pos + 1)+9;
        $data = substr_replace($data, "", $middle_pos, $ending_pos - $middle_pos);
        // echo "i'm in";
    }
    $i++;
}

lorax

7:06 pm on Nov 22, 2013 (gmt 0)

Here's a snippet of the problem source XML

<courseinfo>
<course code="EN 005">
<![CDATA[
<div class="courseblock"> <p class="courseblocktitle"><strong>EN 005. Basic English. 3 Credits.</strong></p> <p class="courseblockdesc"> A review of the fundamentals of composition designed to raise the student's command of English to the college level. Required for those whose tests and records demonstrate weakness in diction, spelling, grammar, punctuation and organization. Offered fall semester only. Students assigned to EN 005 must successfully complete the course before enrolling in EN 101. This course will not meet any degree requirements and cannot be used as an elective.<br /> </p> </div>
]]>
</course>
<course code="EN 101">
<![CDATA[
<div class="courseblock"> <p class="courseblocktitle"><strong>EN 101. Composition and Literature I. 3 Credits.</strong></p> <p class="courseblockdesc"> EN 101 is devoted chiefly to the principles of written organization, exposition, argumentation, and research.<br /> </p> </div>
]]>
</course>

lucy24

10:45 pm on Nov 22, 2013 (gmt 0)

"XX(XX) 5##" where the X are uppercase letters min 2 but max 4 and the 5## are numbers that start with a 5 but the # can be anything from 0-9 (three numbers total).

Stopping here without reading another word:

\b[A-Z]{2,4} 5\d\d\b

The \b is only necessary if you need to exclude patterns that might have additional characters at beginning or end.
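To see what the word boundaries buy you, here is a minimal sketch. The sample string is made up for illustration and is not taken from the feed.

```php
<?php
// Hypothetical sample text, not from the real XML feed.
$sample = 'EN 005, EN 101, HI 520, AP 500, CHEM 595, ABCDE 512, HI 5201';

// Two to four capitals, one literal space, then a three-digit number starting with 5.
$pattern = '/\b[A-Z]{2,4} 5\d\d\b/';

preg_match_all($pattern, $sample, $matches);
$codes = $matches[0];

// "EN 005" is skipped because the digit after the space is not a 5.
// "ABCDE 512" and "HI 5201" are skipped because of the \b anchors.
print_r($codes);
```

Without the two \b anchors the same pattern would also pick up "BCDE 512" and the "HI 520" inside "HI 5201".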

:: reading on ::

Yup, that's what you've got, except that \s would also include tabs, hard spaces (which it seems you do have), and maybe line breaks (depending on setup). Probably safer to force it to a character class that holds only the two kinds of spaces, an ordinary space and a hard (non-breaking) space, as literal characters.
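A small sketch of the difference, assuming the feed is UTF-8 (the variable names and test strings are hypothetical). It uses the \x{00A0} escape instead of a literal hard space so the character stays visible in the source:

```php
<?php
$loose  = '/[A-Z]{2,4}\s5\d\d/';  // \s: any whitespace character
$strict = '/[A-Z]{2,4} 5\d\d/';   // ordinary space only
// Ordinary space or U+00A0 (non-breaking space); /u assumes UTF-8 input.
$either = '/[A-Z]{2,4}[ \x{00A0}]5\d\d/u';

$tabbed = "HI\t520";        // tab between letters and digits
$nbsp   = "HI\xC2\xA0520";  // UTF-8 bytes for U+00A0

$looseMatchesTab   = preg_match($loose, $tabbed);   // tab counts as \s
$strictMatchesTab  = preg_match($strict, $tabbed);  // tab is not a space
$strictMatchesNbsp = preg_match($strict, $nbsp);    // hard space is not a plain space
$eitherMatchesNbsp = preg_match($either, $nbsp);    // explicitly allowed
$eitherMatchesTab  = preg_match($either, $tabbed);  // still rejected

var_dump($looseMatchesTab, $strictMatchesTab, $strictMatchesNbsp, $eitherMatchesNbsp, $eitherMatchesTab);
```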

stripos

Why do you have this here? It probably doesn't affect the code, but isn't the whole point that they have to be capital letters? php dot net says terrifyingly that "stripos"

may also return a non-Boolean value which evaluates to FALSE

It's possible they say this all the time (they do for strpos) but the first thing I'd check is what exact value is in fact being returned. It sure looks as if somewhere along the line 005 is getting reinterpreted as the integer value 5, doesn't it?

penders

12:03 am on Nov 23, 2013 (gmt 0)

(Just caught this thread in passing...)

My first (niggling) thought... XML file being parsed/manipulated with regex - why? Why not use an XML parser? I'm curious.
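For comparison, a rough sketch of the parser route using PHP's DOM extension. The feed below is a made-up stand-in shaped like the snippet posted earlier, and the "graduate unless AP" rule is taken from the description in the first post; treat it as an illustration, not a drop-in replacement.

```php
<?php
// Hypothetical feed, shaped like the snippet posted in this thread.
$xml = <<<XML
<?xml version="1.0"?>
<courseinfo>
<course code="EN 005"><![CDATA[<div>EN 005. Basic English.</div>]]></course>
<course code="HI 520"><![CDATA[<div>HI 520. Graduate seminar.</div>]]></course>
<course code="AP 500"><![CDATA[<div>AP 500. Kept on purpose.</div>]]></course>
</courseinfo>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Copy the live node list into an array first, so that removing
// nodes does not disturb the iteration.
$courses = iterator_to_array($doc->getElementsByTagName('course'));

foreach ($courses as $course) {
    $code = $course->getAttribute('code');
    $isGraduate = preg_match('/^[A-Z]{2,4} 5\d\d$/', $code) === 1;
    $isAp = strpos($code, 'AP ') === 0;
    if ($isGraduate && !$isAp) {
        $course->parentNode->removeChild($course);
    }
}

// Collect the codes that survived.
$remaining = [];
foreach ($doc->getElementsByTagName('course') as $course) {
    $remaining[] = $course->getAttribute('code');
}
print_r($remaining);
```

Because the test runs against the code attribute only, a course number mentioned inside a description can never trigger a removal, and no character offsets are needed.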

...and go backward 25 characters

Sorry, probably missed something, but where does 25 come from? Is the XML so well formed that this is consistent?

may also return a non-Boolean value which evaluates to FALSE

If the string/needle you are looking for matches at the start of the string (ie. position 0 - zero) then this will evaluate to false. TBH it doesn't look like this will happen in the above though.
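A minimal sketch of that pitfall, with a made-up haystack: a match at position 0 is indistinguishable from "not found" under a loose test, so the strict comparison against false is the safe one.

```php
<?php
// Hypothetical haystack where the needle sits at the very start.
$haystack = 'AP 500 and EN 101';

$pos     = stripos($haystack, 'AP 5'); // found at offset 0
$missing = stripos($haystack, 'HI 5'); // not found at all

// Loose test: treats offset 0 the same as "not found".
$looseSaysMissing = !$pos;

// Strict test: only a real false means "not found".
$strictSaysMissing = ($pos === false);

var_dump($pos, $missing, $looseSaysMissing, $strictSaysMissing);
```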

lorax

9:13 pm on Nov 25, 2013 (gmt 0)

Thanks for the replies. I was able to solve it on my own before EOB on Friday.

Here's the current working version

// Remove the XML declaration and the <courseinfo> wrapper tags
$findthese = array('<?xml version="1.0"?>','<courseinfo>','</courseinfo>');
$data = trim(str_replace($findthese,'',$data));

// Remove graduate courses (codes 500-599), but keep AP courses
$graduate = '/<course\scode="([A-Z]{2,4}\s5[0-9][0-9])">/';
$ending = "</course>";
preg_match_all($graduate, $data, $matches);
foreach ($matches[0] as $k => $tag) {
    // Test this match's own code for the AP exception, not the whole feed
    if (strpos($matches[1][$k], "AP ") === 0) {
        continue;
    }
    // strpos() returns false (never a negative number) when nothing is found
    $startpos = strpos($data, $tag);
    if ($startpos === false) {
        continue;
    }
    $ending_pos = strpos($data, $ending, $startpos);
    if ($ending_pos === false) {
        continue;
    }
    // Cut from the opening tag through the end of its closing tag
    $data = substr_replace($data, "", $startpos, $ending_pos + strlen($ending) - $startpos);
}

$target = '/<course\scode="([A-Z]{2,4})\s([0-5][0-9][0-9])">/';
$clean1 = preg_replace($target,'',$data);
$findthese = array('<![CDATA[',']]>','<course>','</course>');
$output = str_replace('</div>','<a href="#top" class="inlinebtt">back to top</a></div>',str_replace($findthese,'',$clean1));
$output = substr($output,0, strrpos($output,"div>")+4);

// Clean up before returning; statements placed after return never run
fclose($fp);
unset($data);

return $output;

Thanks!