Scraping headlines from a text file

         

encyclo

12:31 am on Sep 5, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've got a programming logic problem (as is often the case with me!) about exactly how to go about the following:

I've got a set of headlines and accompanying stories stored in a text file. The file looks like this:

++ Headline 1 

A few lines of text

++ Headline 2

A few lines of text

++ Headline 3

A few lines of text

++ Headline 4

A few lines of text
(etc. etc.)

What I need to end up with is an array of the first three headlines from the file, without the leading plus signs, so I can display them on another page of the site.

I know I can use fopen to open the file and substr($value, 3) to remove the plus signs, but I'm not sure of the best procedure for putting it all together.

How would you go about this process in the simplest way? (Pseudo code would be just fine - I can look up the functions, I just need to get the logic!)

Many thanks!

jatar_k

12:17 pm on Sep 5, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



open the file
as long as I don't have 3 headlines get another line

if line has ++ 
strip the ++
store it in my array
else get next line

repeat until there are 3 headlines

Does that make any sense?
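The steps above could be sketched in PHP roughly as follows. This is only one possible reading of the pseudocode: the function name first_headlines() and its $limit parameter are made up for illustration, and it assumes a headline line begins with exactly two plus signs and a space.

```php
<?php
// Hypothetical helper: collect the first $limit headlines from a text file.
// Assumes a headline line starts with exactly "++ " (two plus signs, one space).
function first_headlines(string $path, int $limit = 3): array
{
    $headlines = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $headlines; // the file could not be opened
    }
    // Keep reading lines while fewer than $limit headlines have been stored.
    while (count($headlines) < $limit && ($line = fgets($handle)) !== false) {
        if (strncmp($line, '++ ', 3) === 0) {
            // substr($line, 3) drops the leading "++ "; trim() removes the newline.
            $headlines[] = trim(substr($line, 3));
        }
    }
    fclose($handle);
    return $headlines;
}
```

Reading line by line means the loop can stop as soon as the third headline is found, without loading the rest of the file.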

encyclo

8:57 pm on Sep 5, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Does that make any sense?

Perfect sense - thanks, jatar_k, that's just what I was looking for.

Now I've just got to code the thing!

mincklerstraat

4:54 pm on Sep 6, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member


You could also use regexes:
$file = filegetcontents(textfile.txt);
$matches = preg_match('#^++([.]*)$', $file);
$i = 0;
foreach($matches as $v){
echo $v[1].'<br />';
$i++;
if($i > 3) break;
}

- no promises the regex will work, you probably need to escape them plus signs, that'd give you:
$matches = preg_match('#^\+\+([.]*)$', $file); instead.
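For reference, a working variant of this regex idea might look like the sketch below. It differs from the snippet above in a few ways: preg_match() only returns a match count, so preg_match_all() is used to collect every match; the pattern is closed with a second # delimiter and given the m flag so that ^ and $ match at each line; and [.]* (which matches only literal dots) becomes .* to match any text. The function name headlines_by_regex() and the $limit parameter are assumptions for illustration.

```php
<?php
// Hypothetical helper: return up to $limit headlines found in $text.
function headlines_by_regex(string $text, int $limit = 3): array
{
    // '#' delimiters at both ends; \+ matches a literal plus sign;
    // the m flag makes ^ and $ match at the start and end of each line.
    preg_match_all('#^\+\+(.*)$#m', $text, $matches);
    // $matches[1] holds the captured text of every matching line.
    return array_slice(array_map('trim', $matches[1]), 0, $limit);
}
```

Slicing the result with array_slice() also sidesteps the counter in the original loop, which tests $i > 3 after echoing and so would print four headlines rather than three.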

vincevincevince

5:16 pm on Sep 6, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



use explode() by "++"

then loop through each member of the new array, splitting at the first \n to give [1] the headline and [2] the content, you could easily use preg_match() and pattern "/(.*?)\n(.*)/" to do that split
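A minimal sketch of this explode() approach, under a couple of assumptions: the helper name split_stories() is invented here, and the split at the first newline is done with explode() and a limit of 2 rather than the preg_match() pattern suggested above (without the s modifier, the second group of that pattern would stop at the next newline). Note that any "++" inside a story's text would also act as a separator.

```php
<?php
// Hypothetical helper: split $text into [headline, content] pairs.
function split_stories(string $text): array
{
    $stories = [];
    // Whatever precedes the first "++" is not a story, so skip element 0.
    foreach (array_slice(explode('++', $text), 1) as $chunk) {
        // A limit of 2 splits only at the first newline.
        $parts = explode("\n", $chunk, 2);
        $stories[] = [trim($parts[0]), trim($parts[1] ?? '')];
    }
    return $stories;
}
```

The first three headlines would then be the first element of each of the first three pairs, for example via array_slice(split_stories($file), 0, 3).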

encyclo

6:02 pm on Sep 6, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, trying some of the ideas here, I've got this very basic setup:

<?php 
$file = file_get_contents($_SERVER["DOCUMENT_ROOT"].'/data/file.txt');
$headlines = explode("\n++ ", $file);
$result = preg_match("/(.*?)\n(.*)/",$headlines);
echo $result[0];
echo "<br>";
echo $result[1];
echo "<br>";
echo $result[2];
?>

It seems to block with the preg_match, which I seem to have got wrong. If I just echo $headlines, I get the array printed OK, but with all the text on the intermediate lines (which is what the preg_match is supposed to stop)... With the preg_match, I get nothing.

One other thing, I have some lines in the text file which start with three plus signs, followed by a space (+++ ): is it possible to select only those lines which start with 2 plus signs followed by a space?
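One likely reason the attempt above prints nothing: preg_match() expects a single string as its subject, not the array returned by explode(), and it returns a match count, with the captured text placed in an optional third argument. The sketch below applies it to each chunk in turn. The function name headlines_from_chunks() and the $limit parameter are assumptions for illustration. Because the separator is a newline followed by two plus signs and a space, a line starting with "+++ " does not trigger a split.

```php
<?php
// Hypothetical helper: headlines from lines starting with exactly "++ ".
function headlines_from_chunks(string $text, int $limit = 3): array
{
    $headlines = [];
    // Prefix a newline so a headline on the very first line is split off too.
    $chunks = explode("\n++ ", "\n" . $text);
    // Chunk 0 is whatever precedes the first headline, so skip it.
    foreach (array_slice($chunks, 1) as $chunk) {
        // Captures go into the third argument; the return value is a match count.
        if (preg_match('/^(.*)/', $chunk, $m)) {
            $headlines[] = trim($m[1]); // first line of the chunk is the headline
        }
        if (count($headlines) >= $limit) {
            break;
        }
    }
    return $headlines;
}
```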

vincevincevince

6:37 pm on Sep 6, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




<?php
$file = file_get_contents($_SERVER["DOCUMENT_ROOT"].'/data/file.txt');
preg_match("/\+\+(.*?)\n.*?\+\+(.*?)\n.*?\+\+(.*?)\n/ms",$file,$headlines);
echo $headlines[1];
echo "<br>";
echo $headlines[2];
echo "<br>";
echo $headlines[3];
?>

mincklerstraat

8:05 pm on Sep 6, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



erhm ... see that above I rather sloppily missed some stuff, like closing the regular expression with #, and yes indeed, it's file_get_contents() ... uh, sorry!

encyclo

8:26 pm on Sep 6, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks vincevincevince - that works just perfectly, and I've adapted it successfully into my page.

I altered the regex to look for a newline followed by two plus signs and a space:

preg_match("/\n\+\+ (.*?)\n.*?\n\+\+ (.*?)\n.*?\n\+\+ (.*?)\n/ms",$file,$headlines);

That took care of the lines starting with three plus signs, which the previous pattern also matched.