group similar data in a string?

   
10:20 pm on Apr 25, 2008 (gmt 0)




Hello all
I didn't know a good way of naming my question, so I am getting straight to the point:
I have a string like the following:

-----MMMMM------IIIII----MMM---OOOO---I---MMMM-----

Is there any way I can group the runs of letters and get their positions, for example:


6-10:M
17-21:I
26-28:M
32-35:O
39:I
43-46:M

Thank you in advance.
9:54 pm on Apr 26, 2008 (gmt 0)

coopster (WebmasterWorld Administrator)



The pattern you need depends on what you are trying to locate in the string, but I would probably use preg_match_all [php.net] with the PREG_OFFSET_CAPTURE flag. It sounded like you want only runs of a single repeated letter, so although a simple [a-z]+ pattern would work on your original string, it would fail on a string where the grouped characters are not all the same. Note my addition to the subject string to clarify:
//$subject = '-----MMMMM------IIIII----MMM---OOOO---I---MMMM-----';
//$pattern = '/[a-z]+/i'; // Works fine on the original subject string
$subject = '-----MMMMM------IIIII----MMM---OOOO---I---MMMM-----XXZZXX---MMMM';
//$pattern = '/[a-z]+/i'; // Fails in this case
$pattern = '/\b((\w)\2+|\w)(?:\b)/';
preg_match_all(
    $pattern,
    $subject,
    $matches,
    PREG_SET_ORDER | PREG_OFFSET_CAPTURE
);
print "$subject\n";
print str_repeat('1234567890', 7) . "\n"; // column ruler under the subject
print_r($matches); // print_r prints directly, so nothing is concatenated here
foreach ($matches as $match) {
    // $match[0] is the full match as a [text, offset] pair
    $length = strlen($match[0][0]);
    $start = $match[0][1] + 1; // adjust for zero-based indexing
    $end = $match[0][1] + $length;
    print "$start - $end:" . $match[0][0][0] . "\n"; // first character of the run
}

The pattern reads:
Find a word boundary, then either a run of one repeated character or a single character, then another word boundary. Because of the boundaries, a mixed group such as XXZZXX produces no match at all. The ?: simply marks the subpattern as non-capturing. I could have left it off in this particular case with no ill effects.
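
If the runs inside a mixed group such as XXZZXX should be reported as well, a pattern without word boundaries is one option. The following is only a sketch of that alternative, not the pattern from the post above: the helper name groupRuns is hypothetical, it assumes ASCII letters only, and it prints a single position (39:I) for a one-letter run, as in the original question.

```php
<?php
// Hypothetical helper (not from the thread): report every run of one
// repeated letter as "start-end:letter", using 1-based positions.
function groupRuns(string $subject): array
{
    // ([A-Za-z]) captures one letter; \1* extends the match while that
    // same letter repeats, so "XXZZXX" is split into XX, ZZ and XX.
    preg_match_all('/([A-Za-z])\1*/', $subject, $matches, PREG_OFFSET_CAPTURE);

    $lines = [];
    // In the default pattern order, $matches[0] lists the full matches,
    // each as a [text, offset] pair.
    foreach ($matches[0] as [$run, $offset]) {
        $start = $offset + 1;              // offsets are zero-based
        $end   = $offset + strlen($run);
        $range = ($start === $end) ? "$start" : "$start-$end";
        $lines[] = "$range:" . $run[0];
    }
    return $lines;
}

print implode("\n", groupRuns('-----MMMMM------IIIII----MMM---OOOO---I---MMMM-----')) . "\n";
```

Whether a mixed group should be split like this or skipped entirely depends on what the data means, so treat the choice of pattern as an assumption to check against real input.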

The other printing code in the middle was left there so you could analyze how the patterns are captured and how the offsets work. Details are on the manual pages in the link.
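
To make the shape of $matches easier to follow, here is a small sketch on a shortened subject string (the string '--MMM--I--' is made up for illustration). It shows how the combined flags nest the result: one entry per match, and inside it one [text, offset] pair per captured group.

```php
<?php
// Same alternation as in the post, applied to a short, made-up subject.
$pattern = '/\b((\w)\2+|\w)\b/';
preg_match_all($pattern, '--MMM--I--', $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

// PREG_SET_ORDER: $matches[0] is everything belonging to the first match.
// PREG_OFFSET_CAPTURE: each element is a [text, zero-based byte offset] pair.
[$firstText, $firstOffset] = $matches[0][0];   // full text of the first match
[$repeatedChar] = $matches[0][2];              // group 2: the repeated character
[$secondText, $secondOffset] = $matches[1][0]; // full text of the second match

print "$firstText starts at offset $firstOffset and repeats '$repeatedChar'\n";
print "$secondText starts at offset $secondOffset\n";
```

The offsets count from zero, which is why the loop in the post adds 1 before printing the start position.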

Note: The forum breaks the pipe symbol (it may display as ¦), so rekey it as | if you copy and paste the code.