
Home / Forums Index / Code, Content, and Presentation / Perl Server Side CGI Scripting
Forum Library, Charter, Moderators: coopster & jatar k & phranque

Perl Server Side CGI Scripting Forum

    
Counting instances of word in text files.
Davo1977




msg:3804188
 11:55 am on Dec 10, 2008 (gmt 0)

I am new to Perl and learning from the basics. I'm trying to devise some code that counts the number of instances of 'the' in a file.

#third.pl FILE
open(MYINPUTFILE,"database.txt") or die "database.txt not found!\n";
while (<MYINPUTFILE >) {
chop;
tr/;:,.!?-//d;
foreach $w (split) {
if ($w eq 'the') {
print "$.\n";
$score++;
}
}
}
print "\n'the' occurs $score++ times\n";

Assume database.txt is a file that is an essay with 35 instances of 'the' in the text.

When I run this from the command line, it prints:
'the' occurs times.

Why doesn't the output show the number of instances?

Thanks for helping me.

[edited by: phranque at 10:56 am (utc) on Dec. 11, 2008]
[edit reason] disabled smileys ;) [/edit]

 

janharders




msg:3804406
 5:00 pm on Dec 10, 2008 (gmt 0)

A few things, if you don't mind:

open(MYINPUTFILE,"database.txt") or die "database.txt not found!\n";

It's better to use the three-argument form of open, e.g.
open(FH, '<', 'file.txt')
It's clearer, and you won't run into problems when the filename comes from a variable ('<' opens for reading, which is also the default if you omit the mode).
Also: it's good that you check for errors, but why not report what the error really was? If the file cannot be opened, the reason will be in $!, so something like this is better:
open(MYINPUTFILE,'<', "database.txt") or die "database.txt could not be opened: " . $! . "!\n";
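A minimal sketch of what $! gives you, assuming a made-up file name (no-such-file-for-demo.txt) that does not exist, so that open fails on purpose. The exact wording of the message depends on the operating system, so none is shown here.

```perl
use strict;
use warnings;

# Hypothetical name, chosen only so that open() fails for the demo.
my $missing = 'no-such-file-for-demo.txt';
my $error   = '';

if (open(my $fh, '<', $missing)) {
    close($fh);
} else {
    # $! holds the system's reason for the failure as a string.
    $error = "$missing could not be opened: $!";
}
print "$error\n";
```

The same message is what die would print if you used the "or die" form instead of the if/else.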

while (<MYINPUTFILE >) {

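Here's the major problem (and it's just a typo): the space before the closing >. With the space, Perl treats <MYINPUTFILE > as a filename glob rather than a read from the filehandle, so the file is never read.
I'd recommend not relying on $_ too much, but rather reading each line into a named variable, e.g.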
while (my $line = <MYINPUTFILE>) {
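To see what the stray space does, here is a small sketch. It needs no input file, because nothing is read from a filehandle at all; the array name @items is only for illustration.

```perl
use strict;
use warnings;

# With the stray space, the angle brackets no longer name a filehandle,
# so Perl globs the pattern "MYINPUTFILE " instead of reading a line.
my @items = <MYINPUTFILE >;

# A pattern without wildcards comes back as the pattern itself,
# so a while loop over it would run once, with "MYINPUTFILE" in $_.
print scalar(@items), " item: @items\n";
```

That single pseudo-line never contains 'the', which is why the counter in the original script is never incremented.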

chop;

chop cuts off the last character of the string, regardless of what it is, while chomp only removes it if it's a line break. Use chomp; it'll save you a lot of trouble debugging why your lines are mutilated when you read them from a file and forgot you had already chop'ed them ;)
chomp $line;
will do that (or just "chomp;" if you stay with using $_)
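A quick sketch of the difference, using made-up strings. The case that bites is typically the last line of a file when it has no trailing newline.

```perl
use strict;
use warnings;

my $no_newline   = "the end";      # e.g. last line of a file, no "\n"
my $with_newline = "the end\n";

my $chopped = $no_newline;
chop $chopped;                     # removes the final "d"

my $chomped = $no_newline;
chomp $chomped;                    # nothing to remove, string unchanged

my $chomped_nl = $with_newline;
chomp $chomped_nl;                 # removes only the "\n"

print "chop:  '$chopped'\n";
print "chomp: '$chomped'\n";
print "chomp: '$chomped_nl'\n";
```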

tr/;:,.!?-//d;

Personally, although it's probably slower, I'd use a regexp here:
$line =~ s/[;:,.!?-]/ /gis;
but yours will do the job as well, I think.
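One difference worth knowing: tr with /d deletes the punctuation, while the substitution replaces it with a space. That only matters when punctuation sits directly between two words. A small sketch with a made-up line:

```perl
use strict;
use warnings;

my $line = "have a the.directly followed, or a well-known the";

# tr///d removes the characters, gluing neighbouring words together.
(my $deleted = $line) =~ tr/;:,.!?-//d;

# s///g puts a space where each punctuation character was.
(my $spaced = $line) =~ s/[;:,.!?-]/ /g;

# grep in scalar context returns the number of matching words.
my $count_deleted = grep { $_ eq 'the' } split ' ', $deleted;
my $count_spaced  = grep { $_ eq 'the' } split ' ', $spaced;

print "tr///d: $count_deleted\n";
print "s///g:  $count_spaced\n";
```

With tr///d, "the.directly" becomes "thedirectly" and is missed; with the substitution it is counted.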

foreach $w (split) {

You should tell split what to split and where to split it, e.g.
foreach $w (split(/ /, $line)) {
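A side note on the pattern, as a sketch with a made-up line: splitting on / / produces empty fields for leading or doubled spaces, whereas the special pattern ' ' splits on any run of whitespace and skips leading whitespace. For counting 'the' both give the same answer, since the empty fields never match.

```perl
use strict;
use warnings;

my $line = "  a  the";

my @on_single_space = split(/ /, $line);   # keeps empty fields
my @on_whitespace   = split(' ', $line);   # only the real words

print scalar(@on_single_space), " fields with / /\n";
print scalar(@on_whitespace),   " fields with ' '\n";
```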

print "\n'the' occurs $score++ times\n";
From my experience, I'd advise against putting variables inside quotes. Perl interpolates plain variables fine, but it does not evaluate expressions: in your string the ++ is printed literally and $score is not incremented. If you get used to concatenating strings and variables, you'll stay out of trouble:
print "\n'the' occurs " . $score . " times\n";
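A short sketch of what happens to ++ inside a double-quoted string, using a made-up value of 35 for $score:

```perl
use strict;
use warnings;

my $score = 35;

# Only the variable is interpolated; "++" is just two literal characters.
my $interpolated = "'the' occurs $score++ times";

# Concatenation keeps the value and the text clearly apart.
my $concatenated = "'the' occurs " . $score . " times";

print "$interpolated\n";
print "$concatenated\n";
```

$score is still 35 afterwards, because nothing inside the string was executed.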

As a whole, I'd do it like this:
use strict;
use warnings;

my $score = 0;
open(MYINPUTFILE, '<', "database.txt") or die "database.txt could not be opened: " . $! . "!\n";
while (my $line = <MYINPUTFILE>) {
    chomp $line;
    $line =~ s/[;:,.!?-]/ /g;    # turn punctuation into spaces
    foreach my $w (split(/ /, $line)) {
        if ($w eq 'the') {
            print "$.\n";        # line number of each match
            $score++;
        }
    }
}
close(MYINPUTFILE);
print "\n'the' occurs " . $score . " times\n";

With a database.txt that looks like this:
hello this is the file to be read by the script. just to make it interesting, have a the. directly followed by a non-space-character.
it prints:

1
1
1

'the' occurs 3 times

Hope that helps. The cost of the help is the unrequested advice ;)

krugs




msg:3811769
 6:22 am on Dec 20, 2008 (gmt 0)

Not sure if Davo1977 cares anymore, but here is the easy way to do what you are trying to do:


open (my $IN , '<', 'file.txt') or die "$!";
my $text = do {local $/; <$IN>};#slurp file into a scalar
my $n = 0;
$n++ while ($text =~ m/\bthe\b/gi);
print $n;
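Note that the /i flag also counts "The", which the eq 'the' test earlier in the thread does not. Below is a sketch of the same idea wrapped in a sub so the case handling is explicit. The name count_word and its third argument are made up for this example, and the sample text is invented.

```perl
use strict;
use warnings;

# Hypothetical helper: counts whole-word occurrences of $word in $text.
sub count_word {
    my ($text, $word, $ignore_case) = @_;
    # \Q...\E quotes any regex metacharacters in the word.
    my $re = $ignore_case ? qr/\b\Q$word\E\b/i : qr/\b\Q$word\E\b/;
    # Assigning the list of matches to () and then to a scalar yields the count.
    my $count = () = $text =~ /$re/g;
    return $count;
}

my $text = "The cat saw the other cat.\nThen the cat left the";

my $any_case   = count_word($text, 'the', 1);
my $exact_case = count_word($text, 'the', 0);

print "ignoring case: $any_case\n";
print "exact case:    $exact_case\n";
```

"other" and "Then" are not counted in either mode, because \b requires a word boundary on both sides.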

phranque




msg:3811839
 12:12 pm on Dec 20, 2008 (gmt 0)

/\bthe\b/

that won't count a " the" at the end of a line.
you should define or use a more general class of whitespace.

krugs




msg:3811952
 7:28 pm on Dec 20, 2008 (gmt 0)

I suggest that next time you test the code.

Quoted from perlretut:

[perldoc.perl.org...]


An anchor useful in basic regexps is the word anchor \b . This matches a boundary between a word character and a non-word character \w\W or \W\w :
$x = "Housecat catenates house and cat";
$x =~ /cat/; # matches cat in 'housecat'
$x =~ /\bcat/; # matches cat in 'catenates'
$x =~ /cat\b/; # matches cat in 'housecat'
$x =~ /\bcat\b/; # matches 'cat' at end of string

Note in the last example, the end of the string is considered a word boundary.

\b correctly matches the end of a line, with or without a newline or whatever the OS considers an end of record character.

-krugs
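The point about \b is easy to check with a few made-up lines, some ending in a newline and some not:

```perl
use strict;
use warnings;

my @lines = ("at the\n", "at the", "the\n", "then\n", "bathe\n");

# \b matches between a word character and a non-word character,
# and also at the start or end of the string.
my @matched = grep { /\bthe\b/ } @lines;

print scalar(@matched), " of ", scalar(@lines), " lines match\n";
```

The first three lines match, with or without the trailing newline; "then" and "bathe" do not.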

phranque




msg:3812060
 1:20 am on Dec 21, 2008 (gmt 0)

humbly corrected - too tired to read regular expressions properly when i posted that.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved