Welcome to WebmasterWorld Guest from

Forum Moderators: coopster & jatar k & phranque

Message Too Old, No Replies

parsing a FASTA file with Perl

help needed



7:17 pm on Jul 14, 2009 (gmt 0)

5+ Year Member

hi guys

i want to parse a collection of GenBank records of DNA sequence ... i want to end up with the ID (NM_\d+), and the capital letters in the record ... plus I want a count of each group of capital letters and a count of the lower case letters ... from records like the following:

> NM_001614 range=chr17:79476999-79479827 strand=-
(MOD'S NOTE: the above 2 lines should actually be one string without the newline)
> NM_005052 range=chr17:79989532-79992079 strand=+
(MOD'S NOTE: the above 3 lines should actually be one string without the newlines)

the output for the first record would look something like:
35 base exon
12 base intron
40 base exon
40 base intron
12 base exon

im hoping you guys find this interesting enough to help with?

with fond memories

[edited by: phranque at 9:01 pm (utc) on July 17, 2009]
[edit reason] fixed sidescroll [/edit]


7:07 pm on Jul 19, 2009 (gmt 0)

5+ Year Member

Example of processing this big string:


use 5.006;
use strict;
use warnings;

my $str='AAAaaAAaBB';
while ($str=~/([A-Z]+)([a-z]*)/g) {
my $exon=$1;
my $intron=$2;
print "$exon\n";
print length($exon)." base exon\n";
print length($intron)." base intron\n";


7:39 pm on Jul 19, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member

have you looked at the modules avaibale at cpan.org [search.cpan.org]?


8:18 pm on Jul 19, 2009 (gmt 0)

5+ Year Member

thanks chorny ... my feeble efforts yielded:

#!/usr/bin/perl -w

$line = 'CGTAAcgtggttGGTTaaacctGTGTTCCCT';
$linelength = length($line);
$i = 0;
print "the genome sequence is \n$line\n";
print "there are $linelength bases\n\n";

#find exons ... UPPERCASE A,C,G or T

while($line =~ /([ACGT]+)/g)
$i += 1;

#print position of end of match and print match

print "exon $i ends at base " . pos($line) . ".\n";
print "$1\n";
$exonlength = length($1);
print "exon is $exonlength bases\n";
print "there were $i exons\n\n";

#now do same for introns ... lowercase a,c,g or t

$i = 0;
while($line =~ /([acgt]+)/g)
$i += 1;

#print position of end of match

print "intron $i ends at base " . pos($line) . ".\n";
print "$1\n";
$intronlength = length($1);
print "intron is $intronlength bases\n";
print "there were $i introns\n";

the assignment of $1 and $2 sequentially in the same loop is what i needed ... much appreciated ... will let you know how it works with my real records ...

janharders: i will look for cpan modules ... thanks


5:34 pm on Jul 22, 2009 (gmt 0)

5+ Year Member

i am getting pretty much what i need to make progress on designing pcr assays ... not so elegant, but it does the job ... thanks again


use 5.006;
use warnings;

# multirecords.txt contains UCSC records from the Genome Browser Tables ...
# [genome.ucsc.edu...] (for human genome)
# each record is two lines ... the Fasta descriptor line with the refseq ID
# should have "NM_(digits)" for curated transcripts ... otherwise edit this script
# for example, change /NM_\d+/ to /N\w_\d+/ to include provisional and other transcripts
# The second line contains the exon and intron sequences
# exons are in UPPERCASE ... introns are in lowercase

open(INLINE, '<multirecords.txt');
open(CHORNY, '>chornyparsed.txt');
open(CHORNY2, '>chornyparsed2.txt');

# chornyparsed3 will contain only the minimal information on transcript structure
# for purposes of selecting optimal transcripts for designing assays

open(CHORNY3, '>chornyparsed3.txt');

while ($line = <INLINE>)
if($line =~ /NM_\d+/)
$ID = $&;
print CHORNY "\n\n>$ID\n";
print CHORNY2 "\n\n>$ID\n";
print CHORNY3 "\n\n>$ID\n";
print CHORNY "$exon\n";
print CHORNY length($exon)." base exon\n";
print CHORNY length($intron)." base intron\n";
print CHORNY2 "$exon\t\t\t";
print CHORNY2 length($intron);
print CHORNY3 length($exon)."\t";
print CHORNY3 length($intron)."\n";

[edited by: phranque at 6:09 am (utc) on July 23, 2009]
[edit reason] disabled graphic smileys ;) [/edit]


Featured Threads

Hot Threads This Week

Hot Threads This Month