Counting instances of word in text files.

Forum Moderators: coopster & phranque

Davo1977

11:55 am on Dec 10, 2008 (gmt 0)

I am new to Perl and learning from the basics. I'm trying to devise some code that counts the number of instances of 'the' in a file.

#third.pl FILE
open(MYINPUTFILE,"database.txt") or die "database.txt not found!\n";
while (<MYINPUTFILE >) {
chop;
tr/;:,.!?-//d;
foreach $w (split) {
if ($w eq 'the') {
print "$.\n";
$score++;
}
}
}
print "\n'the' occurs $score++ times\n";

Assume database.txt is a file that is an essay with 35 instances of 'the' in the text.

I run this from the command line and the display reads -
'the' occurs times. ?

Why doesn't the display show the number of instances?

Thanks for helping me.

[edited by: phranque at 10:56 am (utc) on Dec. 11, 2008]
[edit reason] disabled smileys ;) [/edit]

janharders

5:00 pm on Dec 10, 2008 (gmt 0)

A few things, if you don't mind:

open(MYINPUTFILE,"database.txt") or die "database.txt not found!\n";

It's better to use the three-argument form of open, e.g.

open(FH, '<', 'file.txt')

It's clearer, and you won't have to worry about problems with filenames that come from variables, since the mode can never be mixed up with the name ('<' opens for reading, which is also the default if you omit it).
Also: it's good that you check for errors, but why not say what the error really was? If the file cannot be opened, the reason will be in $!, so something like this is better

open(MYINPUTFILE,'<', "database.txt") or die "database.txt could not be opened: " . $! . "!\n";
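To show what $! adds to the message, here is a minimal sketch. The file name is hypothetical and chosen so that the open should fail; the exact wording of the error text depends on the operating system.

```perl
use strict;
use warnings;

# Hypothetical file name, picked so that the open is expected to fail.
my $file = 'no-such-file-for-this-demo.txt';

my $opened = open(my $fh, '<', $file);
# Copy $! right away; later system calls may overwrite it.
my $error = $opened ? '' : "$!";

if ($opened) {
    close $fh;
    print "$file opened fine\n";
}
else {
    # The message now says why the open failed, not only that it did.
    print "$file could not be opened: $error\n";
}
```

In a real script you would normally just write "open(...) or die ..." as above; the separate variables are only there to make the pieces visible.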

while (<MYINPUTFILE >) {

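here's the major problem (and it's just a typo): the space before the closing >. With that space, Perl no longer reads a line from the filehandle MYINPUTFILE; it treats the angle brackets as a filename glob instead, so the loop never sees the contents of database.txt.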
I'd recommend not to use $_ too much, but rather to read the line into a non-special variable, e.g.

while (my $line = <MYINPUTFILE>) {
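To see why the space matters, here is a small sketch that runs both forms side by side. The in-memory string standing in for database.txt is an assumption made so the example needs no file on disk.

```perl
use strict;
use warnings;

# With the space, the angle brackets are a filename glob, not a read
# from a filehandle. No file is opened or read at all.
my @globbed = <MYINPUTFILE >;
print "glob returned: @globbed\n";

# Without the space, the same brackets read lines from the handle.
# An in-memory file stands in for database.txt here.
my $data = "the first line\nthe second line\n";
open(my $in, '<', \$data) or die "cannot open in-memory file: $!";
my @lines = <$in>;
close $in;
print "readline returned ", scalar(@lines), " lines\n";
```

The glob simply hands back the name it was given, which is why the original loop ran without any error and still never found a single 'the'.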

chop;

chop will cut off the last character of the string, regardless of what it is, while chomp will only do this if it's a line break. Use chomp; it'll save you a lot of trouble debugging why your lines are mutilated when you read them from a file and didn't remember you had already chop'ed ;)

chomp $line;

will do that (or just "chomp;" if you stay with using $_)
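A quick sketch of the difference, using made-up strings. The case that bites when counting words is a last line with no trailing newline: chop then eats a real letter.

```perl
use strict;
use warnings;

# chop removes the last character, whatever it is.
my $chopped_ok  = "count the\n";
chop $chopped_ok;                 # "count the"
my $chopped_bad = "count the";    # e.g. last line of a file, no newline
chop $chopped_bad;                # "count th", the word is damaged

# chomp removes a trailing newline and nothing else.
my $chomped_a = "count the\n";
chomp $chomped_a;                 # "count the"
my $chomped_b = "count the";
chomp $chomped_b;                 # still "count the"

print "chop:  '$chopped_ok' and '$chopped_bad'\n";
print "chomp: '$chomped_a' and '$chomped_b'\n";
```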

tr/;:,.!?-//d;

Personally, although it's probably slower, I'd use a regexp here:

$line =~ s/[;:,.!?-]/ /gis;

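but yours will do the job as well, I think. The one difference is that tr with /d deletes the punctuation, while the substitution replaces it with a space.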
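Whether punctuation is deleted or replaced by a space can change the count when a punctuation mark is the only thing between two words. A small sketch with a made-up line:

```perl
use strict;
use warnings;

my $line = "the-end, the start";

# tr with /d deletes the listed characters, gluing "the" and "end" together.
(my $deleted = $line) =~ tr/;:,.!?-//d;     # "theend the start"

# s///g replaces each one with a space, so the words stay separate.
(my $spaced = $line) =~ s/[;:,.!?-]/ /g;    # "the end  the start"

# Count words equal to 'the' in each result.
my $count_deleted = grep { $_ eq 'the' } split(' ', $deleted);
my $count_spaced  = grep { $_ eq 'the' } split(' ', $spaced);

print "after tr: '$deleted' has $count_deleted\n";
print "after s:  '$spaced' has $count_spaced\n";
```

Which of the two counts is the right one depends on whether you want "the-end" to contribute a 'the'.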

foreach $w (split) {

a bare split works on $_, so once the line is in $line you have to tell split what to split (and it's clearer to say where, too), e.g.

foreach $w (split(/ /, $line)) {
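One thing to be aware of when choosing the pattern: the regexp / / and the special string ' ' do not behave the same. A sketch with a made-up line that has a leading space and a double space:

```perl
use strict;
use warnings;

my $line = " a  the";   # one leading space, two spaces in the middle

# A single-space regexp splits at every space, so empty fields appear.
my @by_regex = split(/ /, $line);    # ("", "a", "", "the")

# The string ' ' is special: skip leading whitespace, split on runs of it.
# This is also what a bare split does with $_.
my @by_space = split(' ', $line);    # ("a", "the")

print scalar(@by_regex), " fields with / /\n";
print scalar(@by_space), " fields with ' '\n";
```

For counting 'the' both give the same answer, since an empty field never equals 'the', but split(' ', $line) avoids the empty strings if you do anything else with the words.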

print "\n'the' occurs $score++ times\n";

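the problem here is that Perl interpolates variables inside double quotes but does not evaluate expressions, so the ++ is never executed and is just printed as text after the value of $score. You don't need the ++ at this point anyway, since the counting is already done. From my experience, I'd advise against putting variables inside quotes: Perl handles the simple cases, but not method calls or expressions. If you get used to concatenating strings and variables, you won't run into this:

print "\n'the' occurs " . $score . " times\n";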
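A short sketch of what the quoted version really produces, using a made-up score of 3:

```perl
use strict;
use warnings;

my $score = 3;

# Inside double quotes only the variable is interpolated; the ++ is
# ordinary text and $score is not incremented.
my $interpolated = "'the' occurs $score++ times";

# Outside the quotes the value is simply concatenated.
my $concatenated = "'the' occurs " . $score . " times";

print "$interpolated\n";    # 'the' occurs 3++ times
print "$concatenated\n";    # 'the' occurs 3 times
```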

As a whole, I'd do it like this:

my $score = 0;
open(MYINPUTFILE, '<', "database.txt") or die "database.txt could not be opened: " . $! . "!\n";
while (my $line = <MYINPUTFILE>) {
    chomp $line;
    # replace punctuation with spaces so that "the." still yields "the"
    $line =~ s/[;:,.!?-]/ /gis;
    foreach my $w (split(/ /, $line)) {
        if ($w eq 'the') {
            print "$.\n";    # line number of each hit
            $score++;
        }
    }
}
close(MYINPUTFILE);
print "\n'the' occurs " . $score . " times\n";

With a database.txt that looks like this

hello this is the file to be read by the script. just to make it interesting, have a the. directly followed by a non-space-character.

it prints:

1
1
1
'the' occurs 3 times

hope that helps. the cost of the help is the unrequested advice ;)

krugs

6:22 am on Dec 20, 2008 (gmt 0)

Not sure if Davo1977 cares anymore, but here is the easy way to do what you are trying to do:


open (my $IN , '<', 'file.txt') or die "$!"; 
my $text = do {local $/; <$IN>};#slurp file into a scalar 
my $n = 0; 
$n++ while ($text =~ m/\bthe\b/gi); 
print $n;
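The same count can also be taken in one statement by running the match in list context. This is an alternative sketch, not what the post above uses, and the sample text is made up so that no file is needed:

```perl
use strict;
use warnings;

# Made-up text standing in for the slurped file.
my $text = "The cat saw the dog.\nThen the other cat left the\n";

# Loop version, as in the post above.
my $n_loop = 0;
$n_loop++ while ($text =~ m/\bthe\b/gi);

# List-context version: the empty list assignment counts the matches.
my $n_list = () = $text =~ m/\bthe\b/gi;

print "loop: $n_loop, list: $n_list\n";
```

Because of the /i flag this counts "The" as well, while "Then" and "other" are skipped by the \b anchors. The earlier line-by-line scripts compare with eq 'the' and so count lower-case only.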

phranque

12:12 pm on Dec 20, 2008 (gmt 0)

/\bthe\b/

that won't count a " the" at the end of a line.
you should define or use a more general class of whitespace.

krugs

7:28 pm on Dec 20, 2008 (gmt 0)

I suggest that next time you test the code.

Quoted from perlretut:

[perldoc.perl.org...]

An anchor useful in basic regexps is the word anchor \b . This matches a boundary between a word character and a non-word character \w\W or \W\w :
$x = "Housecat catenates house and cat";
$x =~ /cat/; # matches cat in 'housecat'
$x =~ /\bcat/; # matches cat in 'catenates'
$x =~ /cat\b/; # matches cat in 'housecat'
$x =~ /\bcat\b/; # matches 'cat' at end of string
Note in the last example, the end of the string is considered a word boundary.

\b correctly matches the end of a line, with or without a newline or whatever the OS considers an end of record character.
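This is easy to check with a few made-up strings, including 'the' at the very end of a line with and without a newline:

```perl
use strict;
use warnings;

# Made-up test strings; the first three should match, the last two not.
my @cases = ("at the", "at the\n", "at the.", "bathe", "then");

my @matched = grep { /\bthe\b/ } @cases;

for my $case (@cases) {
    (my $shown = $case) =~ s/\n/\\n/g;    # make the newline visible
    my $result = $case =~ /\bthe\b/ ? "match" : "no match";
    print "'$shown': $result\n";
}
```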

-krugs

phranque

1:20 am on Dec 21, 2008 (gmt 0)

humbly corrected - too tired to read regular expressions properly when i posted that.