Forum Library, Charter, Moderators: coopster & jatar k

PHP Server Side Scripting Forum

    
help with regex
Includes something it shouldn't
lorax

WebmasterWorld Administrator lorax us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4625436 posted 6:35 pm on Nov 22, 2013 (gmt 0)

I'm pulling in an XML feed and pushing it all into a variable $data.

Then I look for a string in the form of "XX(XX) 5##" where the X are uppercase letters min 2 but max 4 and the 5## are numbers that start with a 5 but the # can be anything from 0-9 (three numbers total).

ex: HI 520

Then I look for the beginning of the XML tag "<course" by locating the position of the string above and go backward 25 characters which puts me slightly before the opening XML tag.

Then I locate the position of the first instance of the opening tag. And do the same for the closing tag "</course>" with offsets.

Then I run a substring replace to remove all courses that have a numeric code that starts with 5 (like 500 or 595) EXCEPT those that start with AP (like AP 500).

The code works great in general, BUT for some reason it removes a course with the code "EN 005", which leads me to think I've got an issue with my regex.


$i = 0;
$graduate = '/([A-Z]{2,4}\s(5[0-9][0-9]))/';
preg_match_all($graduate, $data, $matches);
// echo count($matches[0]);
while ($i <= count($matches[0])) {
    $startpos = strpos($data, $matches[0][$i]) - 25;
    // echo $startpos."<br/>";
    if (!stripos($data, "AP 5")) {
        $beginning = "<course";
        $ending = "</course>";
        $beginning_pos = strpos($data, $beginning, $startpos);
        $middle_pos = $beginning_pos + strlen($beginning) - 7;
        $ending_pos = strpos($data, $ending, $beginning_pos + 1) + 9;
        $data = substr_replace($data, "", $middle_pos, $ending_pos - $middle_pos);
        // echo "i'm in";
    }
    $i++;
}
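
Two details of the loop above can be checked in isolation and may be relevant to the stray removal. The sketch below is not the original script: $found is a hypothetical stand-in for $matches[0], and $data is a made-up two-course string. It shows that a `<=` bound runs the body one extra time with no element to read, and that stripos($data, "AP 5") asks about the whole document rather than the current match.

```php
<?php
// Hypothetical stand-in for $matches[0] as filled by preg_match_all().
$found = ['HI 520', 'EN 530'];

// With `<=` the body runs count($found) + 1 times; the last pass has no element.
$visited = [];
for ($i = 0; $i <= count($found); $i++) {
    $visited[] = isset($found[$i]) ? $found[$i] : null;
}

// Hypothetical document containing one AP course and one graduate course.
$data = '<course code="AP 500">x</course><course code="HI 520">y</course>';

// This answers "does AP 5 occur anywhere in $data?", not "is this match an AP course?".
$anywhere = stripos($data, 'AP 5') !== false;

// Testing the matched code itself is one way to make the check per course.
$isAp = strpos($found[0], 'AP ') === 0;
```

Using `<` as the loop bound and testing the matched code instead of $data would keep each pass tied to one real match.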

 

lorax




 
Msg#: 4625436 posted 7:06 pm on Nov 22, 2013 (gmt 0)

Here's a snippet of the problem source XML

<courseinfo>
<course code="EN 005">
<![CDATA[
<div class="courseblock"> <p class="courseblocktitle"><strong>EN&#160;005. Basic English. 3 Credits.</strong></p> <p class="courseblockdesc"> A review of the fundamentals of composition designed to raise the student's command of English to the college level. Required for those whose tests and records demonstrate weakness in diction, spelling, grammar, punctuation and organization. Offered fall semester only. Students assigned to EN 005 must successfully complete the course before enrolling in EN 101. This course will not meet any degree requirements and cannot be used as an elective.<br /> </p> </div>
]]>
</course>
<course code="EN 101">
<![CDATA[
<div class="courseblock"> <p class="courseblocktitle"><strong>EN&#160;101. Composition and Literature I. 3 Credits.</strong></p> <p class="courseblockdesc"> EN 101 is devoted chiefly to the principles of written organization, exposition, argumentation, and research.<br /> </p> </div>
]]>
</course>

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4625436 posted 10:45 pm on Nov 22, 2013 (gmt 0)

"XX(XX) 5##" where the X are uppercase letters min 2 but max 4 and the 5## are numbers that start with a 5 but the # can be anything from 0-9 (three numbers total).

Stopping here without reading another word:

\b[A-Z]{2,4} 5\d\d\b

The \b is only necessary if you need to exclude patterns that might have additional characters at beginning or end.
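
To see what the \b anchors change, here is a small sketch run against a made-up sample string (the string is an assumption, not taken from the feed). Without the boundaries, the pattern can match inside a longer run of letters or digits.

```php
<?php
// Hypothetical sample text; only "HI 520" and "AP 500" have the described form.
$text = 'HI 520, EN 005, AP 500, ABCDE 512, EN 5201';

// With word boundaries, as suggested above.
preg_match_all('/\b[A-Z]{2,4} 5\d\d\b/', $text, $bounded);

// Same pattern without the boundaries.
preg_match_all('/[A-Z]{2,4} 5\d\d/', $text, $unbounded);

$withBoundaries = $bounded[0];
$withoutBoundaries = $unbounded[0];
```

Here $withBoundaries holds "HI 520" and "AP 500", while $withoutBoundaries also picks up "BCDE 512" and "EN 520" from the longer tokens.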

:: reading on ::

Yup, that's what you've got, except that \s would also include tabs, hard spaces (which it seems you do have), and maybe line breaks (depending on setup). Probably safer to force it to [  ] using the two kinds of spaces as literal characters.
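
A quick sketch of the difference between \s and a literal space, using made-up input strings. It only covers tabs and line breaks; whether \s also matches a hard space depends on encoding and pattern modifiers, so that case is left out here.

```php
<?php
// Hypothetical inputs: the same code separated by a tab, a newline, and a plain space.
$tabbed = "EN\t520";
$broken = "EN\n520";
$plain  = 'EN 520';

$classPattern   = '/[A-Z]{2,4}\s5\d\d/'; // \s: any whitespace character
$literalPattern = '/[A-Z]{2,4} 5\d\d/';  // literal space only

$results = [
    'class_tab'       => preg_match($classPattern, $tabbed),
    'class_newline'   => preg_match($classPattern, $broken),
    'class_plain'     => preg_match($classPattern, $plain),
    'literal_tab'     => preg_match($literalPattern, $tabbed),
    'literal_newline' => preg_match($literalPattern, $broken),
    'literal_plain'   => preg_match($literalPattern, $plain),
];
```

The \s version matches all three inputs; the literal-space version matches only the plain one.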

stripos
Why do you have this here? It probably doesn't affect the code, but isn't the whole point that they have to be capital letters? php dot net says, terrifyingly, that stripos "may also return a non-Boolean value which evaluates to FALSE".

It's possible they say this all the time (they do for strpos) but the first thing I'd check is what exact value is in fact being returned. It sure looks as if somewhere along the line 005 is getting reinterpreted as the integer value 5, doesn't it?

penders

WebmasterWorld Senior Member penders us a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



 
Msg#: 4625436 posted 12:03 am on Nov 23, 2013 (gmt 0)

(Just caught this thread in passing...)

My first (niggling) thought... XML file being parsed/manipulated with regex - why? Why not use an XML parser? I'm curious.
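
For comparison, a minimal sketch of what the parser route could look like with PHP's DOM extension. The feed below is hypothetical, modelled on the snippet posted above, and the rule (drop codes of the form XX 5## unless they start with AP) is taken from the first post; this is not the poster's script.

```php
<?php
// Hypothetical feed modelled on the snippet posted earlier in the thread.
$xml = <<<'XML'
<courseinfo>
<course code="EN 005"><![CDATA[<div>Basic English</div>]]></course>
<course code="HI 520"><![CDATA[<div>Graduate History</div>]]></course>
<course code="AP 500"><![CDATA[<div>AP course</div>]]></course>
</courseinfo>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Copy the live node list first so removals do not disturb the iteration.
$courses = iterator_to_array($doc->getElementsByTagName('course'));

foreach ($courses as $course) {
    $code = $course->getAttribute('code');
    $isGraduate = preg_match('/^[A-Z]{2,4} 5\d\d$/', $code) === 1;
    $isAp = strpos($code, 'AP ') === 0;
    if ($isGraduate && !$isAp) {
        $course->parentNode->removeChild($course);
    }
}

// Collect the codes that survived, for inspection.
$remaining = [];
foreach ($doc->getElementsByTagName('course') as $course) {
    $remaining[] = $course->getAttribute('code');
}
```

Because the test runs against the code attribute only, text inside the CDATA (such as "EN 005" mentioned in a description) never enters the match, and no character offsets are needed.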

...and go backward 25 characters

Sorry, probably missed something, but where does 25 come from? Is the XML so well formed that this is consistent?

may also return a non-Boolean value which evaluates to FALSE


If the string/needle you are looking for matches at the start of the string (i.e. position 0), then stripos returns 0, which evaluates to false. TBH it doesn't look like this will happen in the code above, though.
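
A short sketch of that pitfall, with a made-up haystack that happens to begin with the needle. Comparing against false with === is the usual way to tell "found at position 0" from "not found".

```php
<?php
// Hypothetical string that begins with the needle.
$haystack = 'AP 500 Special Topics';

$pos     = stripos($haystack, 'AP 5'); // int 0: found at the very start
$missing = stripos('EN 005', 'AP 5');  // false: not found

$looseSaysAbsent  = !$pos;             // true, although the needle is present
$strictSaysAbsent = ($pos === false);  // false, the correct answer
```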

lorax




 
Msg#: 4625436 posted 9:13 pm on Nov 25, 2013 (gmt 0)

Thanks for the replies. I was able to solve it on my own before EOB on Friday.

Here's the current working version


// Remove the XML declaration and the <courseinfo> wrapper tags
$findthese = array('<?xml version="1.0"?>', '<courseinfo>', '</courseinfo>');
$data = trim(str_replace($findthese, '', $data));

// Remove Graduate courses
$i = 0;
$graduate = '/\<course\scode\=\"([A-Z]{2,4}\s5[0-9][0-9])\"\>/';
preg_match_all($graduate, $data, $matches);
while ($i <= count($matches[0])) {
    $startpos = strpos($data, $matches[0][$i]);
    if ($startpos < 0) $startpos = 0;
    if (!stripos($data, "AP 5")) {
        $beginning = "<course";
        $ending = "</course>";
        $beginning_pos = strpos($data, $beginning, $startpos);
        $middle_pos = $beginning_pos + strlen($beginning) - 8;
        $ending_pos = strpos($data, $ending, $beginning_pos + 1);
        $data = substr_replace($data, "", $middle_pos, $ending_pos - $middle_pos);
    }
    $i++;
}

// Strip the remaining <course code="..."> opening tags, then the CDATA markers and closing tags
$target = '/<course\scode="([A-Z]{2,4})\s([0-5][0-9][0-9])">/';
$clean1 = preg_replace($target, '', $data);
$findthese = array('<![CDATA[', ']]>', '<course>', '</course>');
$output = str_replace('</div>', '<a href="#top" class="inlinebtt">back to top</a></div>', str_replace($findthese, '', $clean1));
$output = substr($output, 0, strrpos($output, "div>") + 4);

return $output;

fclose($fp);
unset($data);


Thanks!
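
For anyone landing here later: anchoring the pattern on the opening tag, as the working version does, also makes a single-pass variant possible. The sketch below is an alternative, not the code posted above. It assumes each course block looks like the posted snippet and that "</course>" never appears inside the CDATA; the sample data is made up.

```php
<?php
// Hypothetical data in the same shape as the posted snippet.
$data = <<<'XML'
<course code="EN 005">
<![CDATA[<div>Basic English</div>]]>
</course>
<course code="HI 520">
<![CDATA[<div>Graduate seminar</div>]]>
</course>
<course code="AP 500">
<![CDATA[<div>AP course</div>]]>
</course>
XML;

// (?!AP ) skips AP codes; .*? with the s modifier runs lazily to the first closing tag.
$graduate = '~<course code="(?!AP )[A-Z]{2,4} 5\d\d">.*?</course>\s*~s';

$filtered = preg_replace($graduate, '', $data);
```

With this sample, only the HI 520 block is removed; the EN 005 and AP 500 blocks stay as they were.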


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved