homepage Welcome to WebmasterWorld Guest from 54.196.201.253
register, free tools, login, search, pro membership, help, library, announcements, recent posts, open posts,
Pubcon Platinum Sponsor 2014
Home / Forums Index / Code, Content, and Presentation / Perl Server Side CGI Scripting
Forum Library, Charter, Moderators: coopster & jatar k & phranque

Perl Server Side CGI Scripting Forum

    
parsing a FASTA file with Perl
help needed
RudyS




msg:3952466
 7:17 pm on Jul 14, 2009 (gmt 0)

hi guys

i want to parse a collection of GenBank records of DNA sequence ... i want to end up with the ID (NM_\d+), and the capital letters in the record ... plus I want a count of each group of capital letters and a count of the lower case letters ... from records like the following:

> NM_001614 range=chr17:79476999-79479827 strand=-
TTCTTCCGCCGCTCCGCCGTCGCGTTTCTCTGCCGgtgagcgccccgATGGAAGAAGAGATCGCCGCGCTGGTCATTGACAATGGCG
gtgagtggatggcgccgcggggctccgtcgtcccctgcagGGCGTGACTCAG
(MOD'S NOTE: the above 2 lines should actually be one string without the newline)
> NM_005052 range=chr17:79989532-79992079 strand=+
TGTCTCCGGCCGATCGCTCGGCGCTCGGGTCCCGGgtgagtgcgtggcttgccctgtccgcagCGCCGTGGGGAAGACATGCTTGCTGATCA
GCTACACGACCAACGCCTTCCCCGGAGAGTACATCCCCACCGTgtgagtgtcccaagacacaggccagcagggctgggggtttctgggacccct
cccaagccctgaccctgccctcactgctctgcagTTTTGACAACTACTCTGCCAACGTGATGGTGGACGGGAAACCAGTCAACTTGGGGCTGTG
(MOD'S NOTE: the above 3 lines should actually be one string without the newlines)

the output for the first record would look something like:
NM_001614
TTCTTCCGCCGCTCCGCCGTCGCGTTTCTCTGCCG
35 base exon
12 base intron
ATGGAAGAAGAGATCGCCGCGCTGGTCATTGACAATGGCG
40 base exon
40 base intron
GGCGTGACTCAG
12 base exon

im hoping you guys find this interesting enough to help with?

thanks
with fond memories
RudyS

[edited by: phranque at 9:01 pm (utc) on July 17, 2009]
[edit reason] fixed sidescroll [/edit]

 

chorny




msg:3955374
 7:07 pm on Jul 19, 2009 (gmt 0)

Example of processing this big string:


#!/usr/bin/perl

use 5.006;
use strict;
use warnings;

my $str='AAAaaAAaBB';
while ($str=~/([A-Z]+)([a-z]*)/g) {
my $exon=$1;
my $intron=$2;
print "$exon\n";
print length($exon)." base exon\n";
print length($intron)." base intron\n";
}

janharders




msg:3955395
 7:39 pm on Jul 19, 2009 (gmt 0)

have you looked at the modules avaibale at cpan.org [search.cpan.org]?

RudyS




msg:3955405
 8:18 pm on Jul 19, 2009 (gmt 0)

thanks chorny ... my feeble efforts yielded:

#!/usr/bin/perl -w

$line = 'CGTAAcgtggttGGTTaaacctGTGTTCCCT';
$linelength = length($line);
$i = 0;
print "the genome sequence is \n$line\n";
print "there are $linelength bases\n\n";

#find exons ... UPPERCASE A,C,G or T

while($line =~ /([ACGT]+)/g)
{
$i += 1;

#print position of end of match and print match

print "exon $i ends at base " . pos($line) . ".\n";
print "$1\n";
$exonlength = length($1);
print "exon is $exonlength bases\n";
}
print "there were $i exons\n\n";

#now do same for introns ... lowercase a,c,g or t

$i = 0;
while($line =~ /([acgt]+)/g)
{
$i += 1;

#print position of end of match

print "intron $i ends at base " . pos($line) . ".\n";
print "$1\n";
$intronlength = length($1);
print "intron is $intronlength bases\n";
}
print "there were $i introns\n";

the assignment of $1 and $2 sequentially in the same loop is what i needed ... much appreciated ... will let you know how it works with my real records ...

janharders: i will look for cpan modules ... thanks

RudyS




msg:3957439
 5:34 pm on Jul 22, 2009 (gmt 0)

chorny
i am getting pretty much what i need to make progress on designing pcr assays ... not so elegant, but it does the job ... thanks again
rudy

#!/usr/bin/perl

use 5.006;
use warnings;

# multirecords.txt contains UCSC records from the Genome Browser Tables ...
# [genome.ucsc.edu...] (for human genome)
# each record is two lines ... the Fasta descriptor line with the refseq ID
# should have "NM_(digits)" for curated transcripts ... otherwise edit this script
# for example, change /NM_\d+/ to /N\w_\d+/ to include provisional and other transcripts
# The second line contains the exon and intron sequences
# exons are in UPPERCASE ... introns are in lowercase

open(INLINE, '<multirecords.txt');
open(CHORNY, '>chornyparsed.txt');
open(CHORNY2, '>chornyparsed2.txt');

# chornyparsed3 will contain only the minimal information on transcript structure
# for purposes of selecting optimal transcripts for designing assays

open(CHORNY3, '>chornyparsed3.txt');

while ($line = <INLINE>)
{
if($line =~ /NM_\d+/)
{
$ID = $&;
print CHORNY "\n\n>$ID\n";
print CHORNY2 "\n\n>$ID\n";
print CHORNY3 "\n\n>$ID\n";
print CHORNY3 "EXON\tINTRON\n";
}
while($line=~/([ACGT]+)([acgt]*)/g)
{
$exon=$1;
$intron=$2;
print CHORNY "$exon\n";
print CHORNY length($exon)." base exon\n";
print CHORNY length($intron)." base intron\n";
print CHORNY2 "$exon\t\t\t";
print CHORNY2 length($intron);
print CHORNY3 length($exon)."\t";
print CHORNY3 length($intron)."\n";
}
}

[edited by: phranque at 6:09 am (utc) on July 23, 2009]
[edit reason] disabled graphic smileys ;) [/edit]

Global Options:
 top home search open messages active posts  
 

Home / Forums Index / Code, Content, and Presentation / Perl Server Side CGI Scripting
rss feed

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Home ¦ Free Tools ¦ Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About ¦ Library ¦ Newsletter
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved