Forum Moderators: coopster
This has bugged me for some time: let's say I have a huge text. No HTML, just plain text with punctuation, say 20,000 characters. How would you slice it into smaller chunks so that each chunk is at most, say, 1000 characters but ends with a complete sentence?
Explode the text on punctuation (like dot-space or question-mark-space) and then rebuild arrays with string length <= 1000?
Any tips would be appreciated.
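The explode-and-rebuild idea could be sketched roughly like this (the helper name and example are mine, not an established function; `preg_split` with a lookbehind keeps the punctuation attached to each sentence):

```php
<?php
// Hypothetical sketch of the explode-and-rebuild approach. Splits the text
// after sentence-ending punctuation, then greedily packs whole sentences
// into chunks of at most $max characters.
function chunk_sentences(string $text, int $max = 1000): array {
    // Split after . ! or ? followed by whitespace; the lookbehind keeps
    // the punctuation with the sentence it ends.
    $parts = preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $chunks = [];
    $current = '';
    foreach ($parts as $sentence) {
        $candidate = ($current === '') ? $sentence : $current . ' ' . $sentence;
        if (strlen($candidate) <= $max || $current === '') {
            // An oversized single sentence still becomes its own chunk.
            $current = $candidate;
        } else {
            $chunks[] = $current;
            $current = $sentence;
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }
    return $chunks;
}
```

With `chunk_sentences("One. Two! Three?", 10)` this yields two chunks, `"One. Two!"` and `"Three?"`, since adding the third sentence would exceed the limit.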
[pre]
$pattern = "/[^ \r\n][^.!?]{0,999}[.?!]/";
$matches = array();
preg_match_all($pattern, $string, $matches);
echo '<pre>'; print_r($matches); echo '</pre>';
[/pre]
This should grab all sentences UNDER 1000 characters and then you can do whatever you want with them. I'm quite busy right now, but I'll play around with this later. :)
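As a quick sanity check of that pattern on a short input (the example text is mine):

```php
<?php
// Sanity check: each match is one sentence, with leading whitespace
// excluded by the [^ \r\n] at the start of the pattern.
$pattern = "/[^ \r\n][^.!?]{0,999}[.?!]/";
preg_match_all($pattern, "Hello world. How are you? Fine!", $matches);
print_r($matches[0]);
// $matches[0] holds: "Hello world.", "How are you?", "Fine!"
```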
A note: I'm not asking because I desperately want to earn money with a script like this. It's just a friendly conversation, because this is an interesting side of text manipulation, I believe, and seemingly no one has written such a script yet :)... Alternatively, I picked the wrong keywords for Google; it wouldn't be the first time.
[pre]
function ssplit($s) {
    $end = array('.', '!', '?');
    $max = 1000; # max characters per chunk
    $sentences = array();

    $last = 0; $beg = 0; $count = 0;
    $s = str_replace(array("\n", "\r", "\t"), ' ', $s);
    $length = strlen($s);
    for ($i = 0; $i < $length; $i++) {
        $c = $s[$i];
        if (in_array($c, $end)) {
            $last = $i;
        }
        if ($count >= $max || $i == $length - 1) {
            $sentences[] = trim(substr($s, $beg, $last - $beg + 1));
            $beg = $last + 1;
            $count = 0;
        }
        $count++;
    }
    return $sentences;
}
echo '<pre>'; print_r(ssplit($string)); echo '</pre>';
[/pre]
See if that helps. :)
[pre]
function ssplit($s) {
    $end = array('.', '!', '?');
    $max = 1000; # max characters per chunk
    $sentences = array();
    $last = 0; $beg = 0; $i = 0;
    $s = str_replace(array("\n", "\r", "\t"), ' ', $s);
    while (isset($s[$i])) {
        if (in_array($s[$i++], $end)) {
            $last = $i;
        }
        if (($i - $beg) >= $max || !isset($s[$i])) {
            $sentences[] = trim(substr($s, $beg, $last - $beg + 1));
            $beg = $last + 1;
        }
    }
    return $sentences;
}
[/pre]
[edited by: eelixduppy at 1:36 am (utc) on Jan. 23, 2009]
[pre]
function ssplit_reg($s, $max = 1000) {
    $pattern = "/[^ \r\n][^.!?]{0,999}[.?!]/";
    preg_match_all($pattern, $s, $matches);
    $sentences = $matches[0];
    $return = array();
    $block = 0;
    foreach ($sentences as $sent) {
        if (!isset($return[$block])) {
            $return[$block] = $sent . ' ';
        } else if (strlen($return[$block] . $sent) <= $max) {
            $return[$block] .= $sent . ' ';
        } else {
            $return[$block] = trim($return[$block]);
            $block++;
            $return[$block] = $sent . ' ';
        }
    }
    return $return;
}
[/pre]
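For reference, here is that function exercised with a small `$max` (the example text and limit are mine). One quirk worth noting: only a block that overflows gets trimmed, so the final block keeps its trailing space.

```php
<?php
// ssplit_reg as posted above, run against a short example (text is mine).
function ssplit_reg($s, $max = 1000) {
    $pattern = "/[^ \r\n][^.!?]{0,999}[.?!]/";
    preg_match_all($pattern, $s, $matches);
    $sentences = $matches[0];
    $return = array();
    $block = 0;
    foreach ($sentences as $sent) {
        if (!isset($return[$block])) {
            $return[$block] = $sent . ' ';
        } else if (strlen($return[$block] . $sent) <= $max) {
            $return[$block] .= $sent . ' ';
        } else {
            $return[$block] = trim($return[$block]);
            $block++;
            $return[$block] = $sent . ' ';
        }
    }
    return $return;
}

// The first two sentences fit in 30 characters; the third starts a new block.
print_r(ssplit_reg("One sentence here. Two here. Three now.", 30));
```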
I compared the two functions on a 13,000-word text block, shuffling the sentence order on each iteration so that every call had new data to process, and got these results:
ssplit function completed 50 iterations in ~9.47 seconds
ssplit_reg function completed 50 iterations in ~1.68 seconds
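For anyone wanting to reproduce this kind of comparison, a minimal timing harness might look like the following (the helper name is mine, and `str_shuffle` is only a crude stand-in; the actual test shuffled whole sentences, not characters):

```php
<?php
// Hypothetical timing harness (helper name is mine). Calls $fn on $text for
// $iterations rounds and returns elapsed wall-clock time in seconds.
function time_calls(callable $fn, string $text, int $iterations): float {
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn($text);                 // result discarded; we only time the call
        $text = str_shuffle($text); // crude way to feed new data each round
    }
    return microtime(true) - $start;
}

// Example use, assuming ssplit/ssplit_reg are defined as above:
// printf("ssplit_reg: 50 iterations in %.2f seconds\n",
//        time_calls('ssplit_reg', $string, 50));
```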
$s = str_replace(array("\n","\r","\t"), ' ', $s);
That line was only there for formatting issues. Try removing it and testing again; I'm curious to see what the results are.
BTW, thanks for writing that function with regex. Pretty neat. :)
I named your first function eelixduppy, named the duplicate with just that one line removed eelixduppy2, and ran the test:
eelixduppy script completed 50 iterations in 9.526716 seconds
eelixduppy2 script completed 50 iterations in 9.472521 seconds
eelixduppy2 script completed in 0.5721% less time