Scraping headlines from a text file

encyclo

12:31 am on Sep 5, 2004 (gmt 0)

I've got a programming logic problem (as is often the case with me!) about exactly how to go about the following:

I've got a set of headlines and accompanying stories stored in a text file. The file looks like this:

++ Headline 1 
 
A few lines of text 
 
++ Headline 2 
 
A few lines of text 
 
++ Headline 3 
 
A few lines of text 
 
++ Headline 4 
 
A few lines of text 
(etc. etc.)

What I need to do is end up with an array of the first three headlines from the page, without the leading plus signs, so I can display them on another page of the site.

I know I can use fopen to open the file and substr($value, 3) to remove the plus signs, but I'm not sure of the best procedure for putting it all together.

How would you go about this process in the simplest way? (Pseudo code would be just fine - I can look up the functions, I just need to get the logic!)

Many thanks!

jatar_k

12:17 pm on Sep 5, 2004 (gmt 0)

open the file
as long as I don't have 3 headlines get another line

if line has ++ 
 strip the ++ 
 store it in my array 
else get next line

repeat until there are 3 headlines

Does that make any sense?
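
The loop described above could be sketched in PHP roughly as follows. The function name first_headlines, the file path passed to it, and the assumption that every headline line starts with exactly "++ " (two plus signs and a space) are made up for illustration; fopen, fgets, strncmp, substr and trim are standard PHP functions.

```php
<?php
// Hypothetical helper: collect the first $limit headlines from the file at $path.
// Assumes each headline line starts with exactly "++ " (two plus signs, one space).
function first_headlines(string $path, int $limit = 3): array
{
    $headlines = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $headlines; // file could not be opened
    }
    // Keep reading lines until we have enough headlines or reach end of file.
    while (count($headlines) < $limit && ($line = fgets($handle)) !== false) {
        if (strncmp($line, '++ ', 3) === 0) {
            // substr($line, 3) drops the "++ " prefix; trim() removes the line ending.
            $headlines[] = trim(substr($line, 3));
        }
    }
    fclose($handle);
    return $headlines;
}
```

Reading line by line means the loop can stop as soon as the third headline is found, without loading the rest of the file.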

encyclo

8:57 pm on Sep 5, 2004 (gmt 0)

Does that make any sense?

Perfect sense - thanks, jatar_k, that's just what I was looking for.

Now I've just got to code the thing!

mincklerstraat

4:54 pm on Sep 6, 2004 (gmt 0)

You could also use regexes:
$file = file_get_contents('textfile.txt');
// The m flag makes ^ and $ match at the start and end of each line.
preg_match_all('#^\+\+(.*)$#m', $file, $matches, PREG_SET_ORDER);
$i = 0;
foreach ($matches as $v) {
    echo trim($v[1]).'<br />';
    $i++;
    if ($i >= 3) break; // stop after three headlines
}

- No promises the regex will work. The plus signs do need escaping (\+\+), because an unescaped + is a quantifier in a regex.

vincevincevince

5:16 pm on Sep 6, 2004 (gmt 0)

Use explode() with "++" as the delimiter.

Then loop through each member of the new array, splitting at the first \n to give [1] the headline and [2] the content. You could easily use preg_match() with the pattern "/(.*?)\n(.*)/" to do that split.
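
A rough sketch of this explode() idea, assuming the file contents are already in a string. The sample $file text and the variable names are made up for illustration. For the headline/content split it uses explode() with a limit of 2 instead of preg_match(), which is a swapped-in simplification. Note that splitting on a bare "++" would also split on any line starting with "+++".

```php
<?php
// Sample data standing in for the real file contents (assumption for this sketch).
$file = "++ Headline 1\nA few lines of text\n++ Headline 2\nMore text\nand more\n++ Headline 3\nLast story\n";

$stories = [];
foreach (explode('++', $file) as $chunk) {
    $chunk = trim($chunk);
    if ($chunk === '') {
        continue; // skip empty chunks, such as the one before a leading "++"
    }
    // Split at the first newline only: [0] is the headline, [1] the content.
    $parts = explode("\n", $chunk, 2);
    $stories[] = [
        'headline' => trim($parts[0]),
        'content'  => isset($parts[1]) ? trim($parts[1]) : '',
    ];
}

// Keep only the first three headlines.
$headlines = array_slice(array_column($stories, 'headline'), 0, 3);
```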

encyclo

6:02 pm on Sep 6, 2004 (gmt 0)

OK, trying some of the ideas here, I've got this very basic setup:

<?php 
$file = file_get_contents($_SERVER["DOCUMENT_ROOT"].'/data/file.txt'); 
$headlines = explode("\n++ ", $file); 
$result = preg_match("/(.*?)\n(.*)/",$headlines); 
echo $result[0]; 
echo "<br>"; 
echo $result[1]; 
echo "<br>"; 
echo $result[2]; 
?>

It seems to block with the preg_match, which I seem to have got wrong. If I just echo $headlines, I get the array printed OK, but with all the text on the intermediate lines (which is what the preg_match is supposed to stop)... With the preg_match, I get nothing.

One other thing, I have some lines in the text file which start with three plus signs, followed by a space (+++ ): is it possible to select only those lines which start with 2 plus signs followed by a space?
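
Two things seem to be going on in the snippet above. preg_match() expects a string as its subject, so passing the whole $headlines array will not work, and its return value is only a match count (the captured groups go into a third argument). For the second question, one possible approach is to anchor the pattern at the start of a line and require a space straight after the two plus signs, so that a "+++ " line cannot match. A minimal sketch, with made-up sample data:

```php
<?php
// Sample data (assumption): includes a "+++ " line that must be ignored.
$file = "++ Headline 1\nSome text\n+++ Not a headline\n++ Headline 2\nMore text\n++ Headline 3\nText\n++ Headline 4\n";

// ^ with the m flag matches at the start of every line.
// "\+\+ " needs a space right after the second plus, so "+++ " lines fail.
// [^\r\n]* captures the rest of the line without the line ending.
preg_match_all('/^\+\+ ([^\r\n]*)/m', $file, $matches);

// $matches[1] holds the captured headline text; keep the first three.
$headlines = array_slice(array_map('trim', $matches[1]), 0, 3);

foreach ($headlines as $headline) {
    echo htmlspecialchars($headline), "<br>\n";
}
```

Because preg_match_all() collects every match, the number of headlines is set by the array_slice() call rather than by the pattern itself.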

vincevincevince

6:37 pm on Sep 6, 2004 (gmt 0)


<?php 
$file = file_get_contents($_SERVER["DOCUMENT_ROOT"].'/data/file.txt'); 
preg_match("/\+\+(.*?)\n.*?\+\+(.*?)\n.*?\+\+(.*?)\n/ms",$file,$headlines);
echo $headlines[1]; 
echo "<br>"; 
echo $headlines[2]; 
echo "<br>"; 
echo $headlines[3]; 
?>

mincklerstraat

8:05 pm on Sep 6, 2004 (gmt 0)

Erhm... I see that my snippet above, as originally posted, rather sloppily missed some stuff, like closing the regular expression with #, and yes indeed, it's file_get_contents(). Uh, sorry!

encyclo

8:26 pm on Sep 6, 2004 (gmt 0)

Thanks vincevincevince - that works just perfectly, and I've adapted it successfully into my page.

I altered the regex to look for a newline followed by two plus signs and a space:

preg_match("/\n\+\+ (.*?)\n.*?\n\+\+ (.*?)\n.*?\n\+\+ (.*?)\n/ms",$file,$headlines);

That took care of the lines starting with three plus signs, which the previous pattern also matched.