Text explosion to specific sized array elements

Forum Moderators: coopster

methode

10:55 am on Jan 21, 2009 (gmt 0)

Hi,

This has been bugging me for some time: let's say I have a huge text. No HTML, just plain text with punctuation, say 20,000 characters. How would you slice it into smaller chunks so that each resulting element is at most, say, 1000 characters long but always ends with a full sentence?

Explode the text on punctuation (like dot-space or question-mark-space) and then rebuild array elements with a string length <= 1000?

Any tips? Would be appreciated
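
The idea in the question (split on sentence-ending punctuation, then regroup the pieces so that no element exceeds the limit) could be sketched roughly as follows. This is only an illustration: the function name chunk_sentences and the sample text are made up for the example, and it assumes single-byte text, since strlen() counts bytes rather than characters.

```php
<?php
// Hypothetical helper, not from the thread: split on whitespace that follows
// '.', '!' or '?', then greedily pack whole sentences into chunks of <= $max bytes.
function chunk_sentences($text, $max = 1000) {
    // The lookbehind keeps the punctuation attached to its sentence.
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $chunks = array();
    $current = '';
    foreach ($sentences as $sentence) {
        if ($current === '') {
            // A single sentence longer than $max still gets its own chunk.
            $current = $sentence;
        } elseif (strlen($current) + 1 + strlen($sentence) <= $max) {
            $current .= ' ' . $sentence;
        } else {
            $chunks[] = $current;
            $current = $sentence;
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }
    return $chunks;
}

$sample = "One two. Three four! Five six? Seven.";
print_r(chunk_sentences($sample, 20));
```

With a limit of 20, the sample comes back as two elements: "One two. Three four!" and "Five six? Seven.".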

eelixduppy

6:06 pm on Jan 21, 2009 (gmt 0)

Interesting question. Short of iterating through the text one character at a time (which is entirely possible and, I think, the only solution that will work perfectly), here is a regex solution that comes close:


$pattern = "/[^ \r\n][^.!?]{0,999}[.?!]/";
$matches = array();
#
preg_match_all($pattern, $string, $matches);
echo '<pre>'; print_r($matches); echo '</pre>';

This should grab all sentences of roughly 1000 characters or fewer (1001 at most: one leading character, up to 999 more, and the closing punctuation), and then you can do whatever you want with them. I'm quite busy right now, but I'll play around with this later. :)
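
To see what the pattern actually captures, here is a small sketch with the repeat limit shrunk from 999 to 20 so the effect shows up on a short sample. The sample string is invented for the illustration.

```php
<?php
// Same shape as the pattern above, but with the repeat limit lowered to 20
// so a short sample shows how an over-long sentence is handled.
$pattern = "/[^ \r\n][^.!?]{0,20}[.?!]/";
$string  = "Short one. This sentence is clearly longer than the limit. Ok!";
$matches = array();
preg_match_all($pattern, $string, $matches);
print_r($matches[0]);
```

The over-long middle sentence is not skipped: the match simply starts partway through it, so only its tail ("longer than the limit.") is captured, between "Short one." and "Ok!".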

methode

6:24 pm on Jan 21, 2009 (gmt 0)

Hm, I didn't think about regexp. Interesting approach, really; much shorter than trying to explode the text on different types of punctuation, for example.
However, this will fill the $matches array with all the sentences, and what I believe is the harder part of the script, rebuilding an array whose elements are no more than $x characters long, is still missing. Not complaining, just saying :)

A note: I'm not asking this because I desperately want to earn money with this script. It's just a friendly conversation, because I believe this is an interesting side of text manipulation and seemingly no one has made such a script yet :)... Alternatively, I picked the wrong keywords for Google; it wouldn't be the first time.

eelixduppy

2:08 am on Jan 22, 2009 (gmt 0)

Threw something together real quick:


[pre]
function ssplit($s) {
  $end = array('.', '!', '?');
  $max = 1000; #max characters
  $sentences = array();
  #
  $last=0; $beg=0; $i=0; $count=0;
  $s = str_replace(array("\n","\r","\t"), ' ', $s);
  $length = strlen($s);
  for($i = 0; $i < $length; $i++) {
    $c = $s[$i];
    if(in_array($c, $end)) {
      $last = $i;
    }
    if($count >= $max || $i == $length-1) {
      $sentences[] = trim(substr($s, $beg, $last-$beg+1));
      $beg = $last+1;
      $count = 0;
    }
    $count++;
  }
 return $sentences;
}
echo '<pre>'; print_r(ssplit($string)); echo '</pre>';
[/pre]

See if that helps. :)
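
One way to check the output of a splitter like the one above is to test each element against the two requirements from the original question: no longer than the limit, and ending in sentence punctuation. The sketch below is an assumption-laden illustration; bad_chunks is a hypothetical helper, not part of the thread's code, and the sample array is made up.

```php
<?php
// Hypothetical checker, not part of the thread's code: returns the indexes of
// chunks that are too long or do not end in '.', '!' or '?'.
function bad_chunks(array $chunks, $max = 1000) {
    $bad = array();
    foreach ($chunks as $i => $chunk) {
        $tooLong  = strlen($chunk) > $max;               // strlen() counts bytes
        $endsWell = preg_match('/[.!?]$/', $chunk) === 1;
        if ($tooLong || !$endsWell) {
            $bad[] = $i;
        }
    }
    return $bad;
}

$sample = array('Fine sentence.', 'No ending here', 'This one is way too long!');
print_r(bad_chunks($sample, 14));
```

For the sample, elements 1 and 2 are reported: the first has no closing punctuation and the second exceeds the 14-byte limit.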

eelixduppy

6:05 am on Jan 22, 2009 (gmt 0)

Here, this version gets rid of one pass through the whole text (the one that finds its length) to speed things up. I also removed the unneeded $count variable. I like my code efficient ;)


[pre]
function ssplit($s) {
  $end = array('.', '!', '?');
  $max = 1000; #max characters
  $sentences = array();
  $last=0; $beg=0; $i=0;
  $s = str_replace(array("\n","\r","\t"), ' ', $s);
  while(isset($s[$i])) {
    if(in_array($s[$i++], $end)) {
      $last = $i;
    }
    if(($i-$beg) >= $max || !isset($s[$i])) {
      $sentences[] = trim(substr($s, $beg, $last-$beg+1));
      $beg = $last+1;
    }
   }
 return $sentences;
}
[/pre]

[edited by: eelixduppy at 1:36 am (utc) on Jan. 23, 2009]

methode

7:13 pm on Jan 22, 2009 (gmt 0)

OK, nifty function. Now wondering why the script I got is more than 150 lines long :s

Thanks. Anybody else with another approach? I'd be really interested in how others would solve this.

whoisgregg

1:02 am on Jan 28, 2009 (gmt 0)

I actually think you were much better off using the regex in the first place, eelixduppy. It's faster to loop through the sentence array just once, using strlen to do the calculations:

function ssplit_reg($s, $max = 1000){
  $pattern = "/[^ \r\n][^.!?]{0,999}[.?!]/";
  preg_match_all($pattern, $s, $matches);
  $sentences = $matches[0];
  $return = array();
  $block = 0;
  foreach($sentences as $sent){
    if(!isset($return[$block])){
      $return[$block] = $sent.' ';
    } else if((strlen($return[$block].$sent)) <= $max){
      $return[$block] .= $sent.' ';
    } else {
      $return[$block] = trim($return[$block]);
      $block++;
      $return[$block] = $sent.' ';
    }
  }
  return array_map('trim', $return); # also trims the final block
}

I compared the two functions on a 13,000-word text block, shuffling the sentence order on each iteration so that every call had new data to process, and got these results:

ssplit function completed 50 iterations in ~9.47 seconds
ssplit_reg function completed 50 iterations in ~1.68 seconds
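
For anyone wanting to repeat this kind of comparison, a timing harness along these lines would do. It is a sketch, not the script used for the numbers above: the names bench and shuffled_text are hypothetical, strtoupper merely stands in for the function under test, and whether the original test included the shuffle in the timed region isn't stated.

```php
<?php
// Hypothetical harness: time $fn over $iterations runs, giving it a freshly
// shuffled copy of the sentences on every run.
function shuffled_text(array $sentences) {
    shuffle($sentences);              // shuffles the local copy only
    return implode(' ', $sentences);
}

function bench($fn, array $sentences, $iterations = 50) {
    $elapsed = 0.0;
    for ($i = 0; $i < $iterations; $i++) {
        $text  = shuffled_text($sentences);   // shuffle outside the timed region
        $start = microtime(true);
        call_user_func($fn, $text);
        $elapsed += microtime(true) - $start;
    }
    return $elapsed;
}

$sentences = array('First sentence.', 'Second one!', 'A third?');
// Swap 'strtoupper' for 'ssplit' or 'ssplit_reg' once those are defined.
printf("completed %d iterations in ~%.4f seconds\n", 50, bench('strtoupper', $sentences));
```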

eelixduppy

2:27 am on Jan 28, 2009 (gmt 0)

Thanks for that. I wonder what the time would have been if you removed this line, reducing the iteration through the string to just one pass:


$s = str_replace(array("\n","\r","\t"), ' ', $s);

This was only there for formatting. Try removing it and testing again. I'm curious to see what the results are.

BTW, thanks for writing that function with regex. Pretty neat. :)

whoisgregg

1:15 am on Jan 29, 2009 (gmt 0)

Don't thank me, I used your regex code. ;)

I named your first function eelixduppy, named the duplicate with just that one line removed eelixduppy2, and ran the test:

eelixduppy script completed 50 iterations in 9.526716 seconds
eelixduppy2 script completed 50 iterations in 9.472521 seconds
eelixduppy2 script completed in about 0.57% less time