Forum Library, Charter, Moderators: coopster & jatar k & phranque

Perl Server Side CGI Scripting Forum

manipulating text files

 10:09 am on Dec 11, 2008 (gmt 0)

I need to determine some statistics about a text file.
I presume a good way to do this is to use a list of regexes in a script that extracts the following information:
- how many times the word 'and' appears in my text,
- character count (all characters in the text file, including punctuation marks and excluding whitespace),
- word count,
- the number of lines,
- paragraphs,
- and sentences (where a full stop is applied).

I need to understand this before I start manipulating HTML files.

Here is what I have come up with so far:

use strict;
use warnings;

my $chars = 0;
my $words = 0;
my $lines = 0;
my $and   = 0;

open(my $in, '<', 'database.txt') or die("Could not open file: $!");
while (my $line = <$in>) {
    chomp $line;
    $lines++;
    # count characters (punctuation included, whitespace excluded)
    (my $nospace = $line) =~ s/\s+//g;
    $chars += length $nospace;
    # turn punctuation into spaces so that "and," still counts as "and"
    $line =~ s/[;:,.!?-]/ /g;
    foreach my $w (split ' ', $line) {
        $words++;
        $and++ if lc $w eq 'and';
    }
}
close($in);

print "'and' occurs $and times\n";
print "Found $lines lines, $words words and $chars characters.\n";

Does anyone have any further ideas?
Then I can plan on manipulating text in an HTML file.

[edited by: phranque at 10:36 am (utc) on Dec. 11, 2008]
[edit reason] disabled smileys ;) [/edit]



 10:55 am on Dec 11, 2008 (gmt 0)

you might want to look at some of the Lingua::EN modules on CPAN such as:


 11:09 am on Dec 11, 2008 (gmt 0)

I'm fairly new to Perl.
So how can I get this information? Any links?

Does it explain the code?


 11:27 am on Dec 12, 2008 (gmt 0)

CPAN is your friend!

go to search.cpan.org and enter the names of the modules in the search box.
there is documentation and sample code for each included.

you should always check cpan.org to see if there is an existing tool you can use or extend or at least get some ideas before developing your own.

when you get to manipulating html, you can use one of the modules that parses html such as HTML::TagParser

having said all that, i don't want to discourage you from doing something simply as a learning experience and we can continue with your posted code.
it's just that many things have already been done in several flavors of perl, so you can do your learning with more power.


 11:48 am on Dec 12, 2008 (gmt 0)

Those modules are very helpful, but my problem is trying to merge various ideas together.
There are modules which work out file statistics such as the number of characters (including punctuation marks, excluding whitespace), sentences, words and lines.
There are also modules which count how many times the word 'and' appears and show how to replace it.
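
A minimal sketch of the counting and replacing step with plain regexes, for comparison with what the modules offer. The helper names count_word and replace_word are made up for this illustration and do not come from any CPAN module.

```perl
use strict;
use warnings;

# Hypothetical helper: count whole-word, case-insensitive matches of $word.
sub count_word {
    my ($text, $word) = @_;
    # A global match in list context returns every match;
    # assigning that list to () in scalar context yields the count.
    my $count = () = $text =~ /\b\Q$word\E\b/gi;
    return $count;
}

# Hypothetical helper: replace whole-word matches of $word.
# Returns the new text and the number of replacements made.
sub replace_word {
    my ($text, $word, $replacement) = @_;
    # s///g returns the number of substitutions, or false if there were none.
    my $changed = ($text =~ s/\b\Q$word\E\b/$replacement/gi);
    return ($text, $changed || 0);
}

my $sample = "Sand and sea, and sun. And more.";
print count_word($sample, 'and'), "\n";        # prints 3 ("Sand" is not a match)
my ($new, $n) = replace_word($sample, 'and', '&');
print "$new ($n replaced)\n";                  # prints "Sand & sea, & sun. & more. (3 replaced)"
```

The \b anchors keep words like "Sand" from being counted, and \Q...\E protects against any regex metacharacters in the word being searched for.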

But my problem is how to merge all of this into one script and then use the information.

Maybe I need to look more into data structures in Perl, or more into regular expressions. I don't know.

Can you help me, maybe with an example?
I don't want to cheat my way to glory!
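
One possible way to merge the individual counts into a single script is to collect them in a hash. This is only a sketch under stated assumptions: the sub name text_stats and the hash keys are invented for illustration, a "word" is taken to be a run of letters, digits, underscores or apostrophes, paragraphs are assumed to be separated by blank lines, and a sentence is naively anything ending in a full stop.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash of statistics for a block of text.
sub text_stats {
    my ($text) = @_;
    my %stats;

    # Non-whitespace characters, punctuation included.
    $stats{chars} = () = $text =~ /\S/g;

    # Words: runs of word characters and apostrophes.
    $stats{words} = () = $text =~ /[\w']+/g;

    # Lines: newline count, plus one if the last line has no trailing newline.
    $stats{lines} = ($text =~ tr/\n//);
    $stats{lines}++ if length($text) && $text !~ /\n\z/;

    # Paragraphs: non-empty chunks separated by one or more blank lines.
    $stats{paragraphs} = grep { /\S/ } split /\n\s*\n/, $text;

    # Sentences: full stops followed by whitespace or the end of the text.
    $stats{sentences} = () = $text =~ /\.+(?=\s|\z)/g;

    # Whole-word, case-insensitive occurrences of 'and'.
    $stats{and} = () = $text =~ /\band\b/gi;

    return %stats;
}

my $sample = "Tom and Jerry ran. They ran and ran.\n\nThe end.\n";
my %s = text_stats($sample);
print "$_: $s{$_}\n" for sort keys %s;
```

For the sample string this reports 36 characters, 10 words, 3 lines, 2 paragraphs, 3 sentences and 2 occurrences of 'and'. To use it on a file such as database.txt, the whole file would first need to be read into one string. Abbreviations like "e.g." will be miscounted as sentence ends, which is where the Lingua::EN modules are likely to do better.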


 6:38 am on Dec 20, 2008 (gmt 0)

You probably want to look at using HTML aware modules instead of trying to invent a set of tools to try and manipulate HTML documents.


 12:14 pm on Dec 20, 2008 (gmt 0)

welcome to WebmasterWorld [webmasterworld.com], krugs!


 4:41 am on Jan 15, 2009 (gmt 0)

haha I have the same problem. To count the characters in a file I have...

while ($line = <IN>) {
    $count += length($line);
}
print "Characters: $count\n";

This works, but I am trying to do it using regular expressions or pattern matching, whatever the term is. I hope to find a way soon.

[edited by: phranque at 12:22 pm (utc) on Jan. 15, 2009]
[edit reason] disabled graphic smileys ;) [/edit]


 5:32 am on Jan 15, 2009 (gmt 0)

using the length() function is the best way if you intend to count everything. You would only want to use a regular expression if you were counting some particular type of pattern. You should also start a new thread next time. Posting a new question in an existing thread is considered poor forum etiquette.
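
A small sketch of the difference, assuming a short sample string rather than a real file. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $text = "Hello, world!\nBye.\n";

# Counting everything, whitespace and newlines included: length() is enough.
my $all = length $text;

# Counting only certain characters: a global match in list context, counted.
my $non_space = () = $text =~ /\S/g;
my $punct     = () = $text =~ /[[:punct:]]/g;

# tr/// returns the number of matching characters,
# which is handy for counting a single character such as a newline.
my $newlines = ($text =~ tr/\n//);

print "all=$all non_space=$non_space punct=$punct newlines=$newlines\n";
# prints "all=19 non_space=16 punct=3 newlines=2"
```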
