Parsing Text With Begin and End Token Strings

How to parse text in between tokens into an array

         

geckofuel

11:39 pm on Feb 6, 2007 (gmt 0)

10+ Year Member



This is probably a basic question, but I've done some digging around and haven't found an answer.

Say I have a text file with the following data:

<begintoken>username<endtoken>
<begintoken>username2<endtoken>
<begintoken>username3<endtoken>

I'd like to get an array where each element of the array consists of the text between the token strings.

Something like this:

$array=extractData($string,$begin,$end);

echo $array[1] would return "username"
echo $array[2] would return "username2"
echo $array[3] would return "username3"

Are there any functions that do this out of the box? Am I just overlooking one of the basic PHP String functions?

cameraman

1:05 am on Feb 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There are several ways to do it; here's one:
$s = "<begintoken>username<endtoken><begintoken>username2<endtoken><begintoken>username3<endtoken>";
$sa = explode('<endtoken>',$s);
array_pop($sa); // Because of the last <endtoken> we'll end up with one extra, empty array element so discard it
$b = strlen('<begintoken>'); // Get length of beginning token
// Here, <begintoken> is on beginning of each element, so strip it off
while(list($i,$s) = each($sa))
$sa[$i] = substr($s,$b);

It would be easier if the string were in the form:
username<token>username2<token>username3

Then to convert to array it's just one line:
$ary = explode('<token>',$s);

geckofuel

10:56 am on Feb 7, 2007 (gmt 0)

10+ Year Member



Cameraman,
I follow everything to here:

while(list($i,$s) = each($sa))
$sa[$i] = substr($s,$b);

What's going on here? Is $s in the while loop the same $s that contains the original string earlier in the function?

cameraman

11:09 am on Feb 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oh, no, I re-used the variable so it got reassigned (losing the original string). I apologize; I wasn't paying enough attention.

The while loop iterates through each of the elements in the array. $i gets the index of the element and $s gets the contents of the element, so first pass:
$i = 0
$s = '<begintoken>username'

Then the next line uses the substr function to skip the first $b characters (the length of the begin token) and puts the result back into the array generated by the explode.
So on first pass,
$sa[0] = 'username'
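
For anyone running this on a current PHP version: each() was deprecated in PHP 7.2 and removed in PHP 8, so the loop described above needs foreach instead. A minimal sketch of the same explode-and-strip idea follows. The function name extractData comes from the original question; the parameter names and the ltrim() call (which assumes the entries may be separated by newlines, as in the sample file) are my own additions, not part of the code posted above.

```php
<?php
// Sketch only: extractData() is the name used in the question, not a PHP built-in.
function extractData(string $string, string $begin, string $end): array
{
    $parts = explode($end, $string);
    array_pop($parts);               // discard the piece after the last end token
    $beginLength = strlen($begin);   // length of the begin token
    $result = [];
    foreach ($parts as $part) {
        // Assumes every piece starts with the begin token (no junk data);
        // ltrim() drops a leading newline left over from the previous line.
        $result[] = substr(ltrim($part), $beginLength);
    }
    return $result;
}

$data = "<begintoken>username<endtoken>\n<begintoken>username2<endtoken>\n<begintoken>username3<endtoken>\n";
$names = extractData($data, '<begintoken>', '<endtoken>');
print_r($names);
```

Note that the result is indexed from 0, so $names[0] holds "username". A separate variable is used for each piece, so the input string is not overwritten inside the loop.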

geckofuel

11:16 am on Feb 7, 2007 (gmt 0)

10+ Year Member



Ah, so I got it to work, but in the process some needed functionality emerged that I hadn't thought of before. What if the end token is shared, and the only unique part of each begin/end pair is the begin token?

In other words, what if my file contains:

<begintoken1>username</endtoken>
<begintoken2>description</endtoken>
<begintoken1>username</endtoken>
<begintoken2>description</endtoken>

So notice that the endtokens are the same, and the only indicator of difference is the begin token.

Unfortunately, this is the situation I'm in (I wish the begin and end tokens were both unique).

geckofuel

11:21 am on Feb 7, 2007 (gmt 0)

10+ Year Member



Another question:
What if you have junk data in your original string:

$s = "<begintoken>username<endtoken>junkdatajunkdata<begintoken>username2<endtoken>somemorejunkdataofunknownlength<begintoken>username3<endtoken>";

cameraman

11:23 am on Feb 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If the begintoken is actually surrounded by <> and not just for your example, you could explode on the > character. If not, and all of the different begintokens are the same length, you could use the substr function to break them apart. If they're not the same length but you know what all of the possible ones are, you could use the strpos function to determine which one is specified, then break it apart.
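
The last option (known begin tokens of possibly different lengths) might look roughly like the sketch below. It assumes there is no junk data between entries and that the list of possible begin tokens is known up front. The function name splitByBeginToken and the variable names are made up for illustration; the tokens and values are the ones from the example file.

```php
<?php
// Sketch only: splitByBeginToken() is an illustrative helper, not a built-in.
function splitByBeginToken(string $string, array $knownBegins, string $end): array
{
    $result = [];
    foreach (explode($end, $string) as $piece) {
        $piece = ltrim($piece);      // drop the newline between entries
        foreach ($knownBegins as $begin) {
            // strpos() === 0 means the piece starts with this begin token.
            if (strpos($piece, $begin) === 0) {
                $result[$begin][] = substr($piece, strlen($begin));
                break;
            }
        }
    }
    return $result;
}

$data = "<begintoken1>username</endtoken>\n<begintoken2>description</endtoken>\n"
      . "<begintoken1>username</endtoken>\n<begintoken2>description</endtoken>\n";
$fields = splitByBeginToken($data, ['<begintoken1>', '<begintoken2>'], '</endtoken>');
print_r($fields);
```

The values end up grouped under the begin token that introduced them, so $fields['<begintoken1>'] holds the usernames and $fields['<begintoken2>'] the descriptions. Pieces that start with none of the known tokens are skipped.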

cameraman

11:26 am on Feb 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For your last post, if the junk data is in between differing begintokens, you'll pretty much have to search for each of them. Someone didn't make it easy for you.

geckofuel

11:36 am on Feb 7, 2007 (gmt 0)

10+ Year Member



By searching for each one, do you mean:

Find the first begintoken, then find the first endtoken immediately following it, then extract the data between the two points?

In PHP is there a built in function that returns the first instance of any given string token (rather than the last instance)? Or is there something that achieves this pseudocode:

foreach($begintoken in $string){

extractStringBetween($begintoken and $firstoccurrenceofendtoken within $string);

}
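
On the built-in question: strpos() returns the position of the first occurrence of a string (strrpos() returns the last), and it takes an optional offset to start searching from. The pseudocode above could be sketched with it as follows. The helper name extractBetween is invented for this example, and the sketch assumes non-empty tokens and a single begin token.

```php
<?php
// Sketch only: extractBetween() is a made-up helper name, not a built-in.
function extractBetween(string $string, string $begin, string $end): array
{
    $result = [];
    $offset = 0;
    // strpos() finds the first occurrence at or after $offset, or returns false.
    while (($start = strpos($string, $begin, $offset)) !== false) {
        $start += strlen($begin);                 // first character after the begin token
        $stop = strpos($string, $end, $start);    // first end token after that point
        if ($stop === false) {
            break;                                // begin token with no matching end token
        }
        $result[] = substr($string, $start, $stop - $start);
        $offset = $stop + strlen($end);           // continue searching after the end token
    }
    return $result;
}

$junk = "<begintoken>username<endtoken>junkdatajunkdata<begintoken>username2<endtoken>"
      . "somemorejunkdataofunknownlength<begintoken>username3<endtoken>";
$found = extractBetween($junk, '<begintoken>', '<endtoken>');
print_r($found);
```

Because the search always jumps from an end token to the next begin token, anything in between (the junk data) is ignored regardless of its length.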

Nutter

12:42 pm on Feb 7, 2007 (gmt 0)

10+ Year Member



How about this? (I haven't tested it, but I've done something similar.)

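$arr_matches = array();
// PHP has no "g" modifier (preg_match_all already finds every match); .*? is non-greedy, and "s" lets it span newlines
preg_match_all('/<begintoken>(.*?)<\/endtoken>/is', $string, $arr_matches);
print_r($arr_matches); // $arr_matches[1] holds the text between the tokens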
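
The same idea can be stretched to the case with several begin tokens and one shared end token. The sketch below assumes the begin tokens follow the begintoken1 / begintoken2 naming from the earlier example (the \d+ part of the pattern encodes that assumption) and uses PREG_SET_ORDER so each match arrives as one row. The variable names are illustrative.

```php
<?php
// Sketch only: the pattern assumes begin tokens named "begintoken" plus a number.
$input = "<begintoken1>username</endtoken>junkdata\n<begintoken2>description</endtoken>";

// Group 1 captures which begin token matched; group 2 captures the text
// up to the nearest end token (non-greedy). Junk between entries is skipped.
preg_match_all('/<(begintoken\d+)>(.*?)<\/endtoken>/is', $input, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $match) {
    $pairs[] = [$match[1], $match[2]];   // [token name, extracted text]
}
print_r($pairs);
```

Capturing the token name alongside the text makes it possible to tell usernames from descriptions even though both end with the same token.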