
Forum Moderators: coopster & jatar k & phranque


manipulating text files

   
10:09 am on Dec 11, 2008 (gmt 0)

5+ Year Member



I need to gather some statistics about a text file.
I presume a good way to do this is a script with a list of regexes that extracts the following information:
- how many times the word 'and' appears in my text,
- character count, including punctuation marks and excluding whitespace,
- word count (all words in the text file),
- the number of lines,
- the number of paragraphs,
- and the number of sentences (where a full stop ends each one).

I need to understand this before I start manipulating HTML files.

Here is what I have come up with so far:

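use strict;
use warnings;

my $chars = 0;   # non-whitespace characters, punctuation included
my $words = 0;   # words after punctuation is turned into spaces
my $lines = 0;   # lines read from the file
my $and   = 0;   # occurrences of the word 'and'

open(my $in, '<', 'database.txt') or die("Could not open file: $!");
while (my $line = <$in>) {
    chomp $line;
    $lines++;
    # count characters before the punctuation is replaced
    (my $no_ws = $line) =~ s/\s+//g;
    $chars += length $no_ws;
    # treat punctuation as a word separator
    $line =~ s/[;:,.!?-]/ /g;
    foreach my $w (split ' ', $line) {
        $words++;
        if (lc $w eq 'and') {
            print "$.\n";   # line number where the word was found
            $and++;
        }
    }
}
close($in);
print "\n'and' occurs $and times\n";
print "Found $lines lines, $words words and $chars characters.\n";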

Does anyone have any further ideas?
Then I can plan on manipulating text in an HTML file.
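
For the two items on the list that the posted code does not attempt yet, paragraphs and sentences, here is a minimal sketch. It assumes paragraphs are separated by blank lines and that a sentence ends at '.', '!' or '?' followed by whitespace or the end of the text. The names count_paragraphs and count_sentences are made up for illustration, and abbreviations such as "e.g." will be overcounted.

```perl
use strict;
use warnings;

# Hypothetical helper: paragraphs are runs of text separated by blank lines.
sub count_paragraphs {
    my ($text) = @_;
    my @paras = grep { /\S/ } split /\n\s*\n/, $text;
    return scalar @paras;
}

# Hypothetical helper: a sentence ends at '.', '!' or '?' followed by
# whitespace or the end of the text (an assumption, not a real grammar).
sub count_sentences {
    my ($text) = @_;
    my $n = () = $text =~ /[.!?]+(?=\s|\z)/g;
    return $n;
}

my $text = "One. Two!\nStill first paragraph?\n\nSecond paragraph here.\n";
printf "%d paragraphs, %d sentences\n",
    count_paragraphs($text), count_sentences($text);
```

Both helpers work on the whole text as one string rather than line by line, because a paragraph break can only be seen by looking across lines.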

[edited by: phranque at 10:36 am (utc) on Dec. 11, 2008]
[edit reason] disabled smileys ;) [/edit]

10:55 am on Dec 11, 2008 (gmt 0)

phranque (WebmasterWorld Administrator)



you might want to look at some of the Lingua::EN modules on CPAN such as:
Lingua::EN::Fathom
Lingua::EN::Sentence
Lingua::EN::Splitter

11:09 am on Dec 11, 2008 (gmt 0)

5+ Year Member



I'm fairly new to Perl.
So how can I get this information? Any links?

Does the documentation explain the code?

11:27 am on Dec 12, 2008 (gmt 0)

phranque (WebmasterWorld Administrator)



CPAN is your friend!

go to search.cpan.org and enter the names of the modules in the search box.
there is documentation and sample code for each included.

you should always check cpan.org to see if there is an existing tool you can use or extend or at least get some ideas before developing your own.

when you get to manipulating html, you can use one of the modules that parse html, such as HTML::TagParser

having said all that, i don't want to discourage you from doing something simply as a learning experience and we can continue with your posted code.
it's just that many things have already been done in several flavors of perl, so you can do your learning with more power.
=8)

11:48 am on Dec 12, 2008 (gmt 0)

5+ Year Member



Those modules are very helpful, but my problem is merging the various ideas together.
There are modules that work out the number of characters (including punctuation marks, excluding whitespace), sentences, words and lines in a file,
and that show how to count how many times the word 'and' appears and how to replace it.

But my problem is how to merge all of this into one script and use the information.

Maybe I need to look more into data structures in Perl, or more into regular expressions. I don't know.

Can you help me? Maybe with an example.
I don't want to cheat my way to glory!
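
One way to merge the separate counts into a single script is to compute them all in one function and return them in a hash. This is only a sketch under stated assumptions: text_stats is a made-up name, 'and' is matched as a whole word in any case, a word is any run of non-whitespace, paragraphs are separated by blank lines, and a sentence ends at '.', '!' or '?'.

```perl
use strict;
use warnings;

# Hypothetical helper: gather every count for one block of text into a hash.
sub text_stats {
    my ($text) = @_;
    my %s;
    $s{and}   = () = $text =~ /\band\b/gi;   # whole word 'and', any case
    $s{chars} = () = $text =~ /\S/g;         # non-whitespace characters
    $s{words} = () = $text =~ /\S+/g;        # runs of non-whitespace
    $s{lines} = ($text =~ tr/\n//);          # number of newlines
    # a final line without a trailing newline still counts as a line
    $s{lines}++ if length($text) && $text !~ /\n\z/;
    # paragraphs: blocks separated by blank lines (assumption)
    $s{paragraphs} = grep { /\S/ } split /\n\s*\n/, $text;
    # sentences: '.', '!' or '?' before whitespace or end of text (assumption)
    $s{sentences} = () = $text =~ /[.!?]+(?=\s|\z)/g;
    return \%s;
}

# Sample text for illustration. To use a file instead, read it into $text:
#   open(my $fh, '<', 'database.txt') or die "Could not open file: $!";
#   my $text = do { local $/; <$fh> };
my $text  = "Salt and pepper.\nBread AND butter!\n\nThe end.\n";
my $stats = text_stats($text);
printf "%-10s %d\n", $_, $stats->{$_} for sort keys %$stats;
```

Keeping the counts in one hash means the rest of the script can use them by name, for example $stats->{words}, instead of juggling a separate variable per statistic.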

6:38 am on Dec 20, 2008 (gmt 0)

5+ Year Member



You probably want to look at using HTML-aware modules instead of trying to invent a set of tools to manipulate HTML documents.

12:14 pm on Dec 20, 2008 (gmt 0)

phranque (WebmasterWorld Administrator)



welcome to WebmasterWorld [webmasterworld.com], krugs!

4:41 am on Jan 15, 2009 (gmt 0)

5+ Year Member



haha I have the same problem. To count the characters in a file I have...

my $count = 0;
while (my $line = <IN>) {    # IN is a filehandle opened earlier
    $count += length($line); # length includes the newline at the end of each line
}
print "Characters: $count\n";

This works, but I am trying to do it using regular expressions (or pattern matching, whatever the term is). I hope to find a way soon.

[edited by: phranque at 12:22 pm (utc) on Jan. 15, 2009]
[edit reason] disabled graphic smileys ;) [/edit]

5:32 am on Jan 15, 2009 (gmt 0)

5+ Year Member



Using the length() function is the best way if you intend to count everything. You would only want a regular expression if you were counting characters that match some pattern. You should also start a new thread next time; posting a new question in an existing thread is considered poor forum etiquette.
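
To make the difference concrete, here is a small sketch that counts the same line three ways. The sample string and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $line = "Hello, world and more\n";

# Everything, newline included: length() is the simplest tool.
my $all = length $line;

# The same total with a regex: /s lets '.' match the newline too.
my $by_regex = () = $line =~ /./gs;

# A regex earns its keep when only some characters should count,
# for example everything except whitespace.
my $non_space = () = $line =~ /\S/g;

print "all: $all, by regex: $by_regex, non-whitespace: $non_space\n";
```

The "= () =" idiom forces the match into list context and then counts the matches, so it works for any pattern, not just single characters.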