
Perl Server Side CGI Scripting Forum

    
Splitting a txt file into smaller text files.
brlinga




msg:3844804
 1:52 am on Feb 8, 2009 (gmt 0)

I am trying to split text files of the format below into 17 new text files using Perl, based on headings such as {Abstract, Inventors, Current US Class, Current International Class, Appl. No., Filed, Field of Search, US Patent Documents, Claims, Description}.

Below is a sample of the file. I have added *CUT* to mark where each new file should start.
*******************************************************************
[EDIT: deleted several hundred lines of data dump including specifics]
**********************************************************************

I could code the first few fields since they appear on exactly the same line every time. For the rest, I don't see how to do the pattern matching and write to files at the same time. I would greatly appreciate some help.

while(($intext=<FILE>))
{
    $count++;
    #print "$count\n";
    #print $intext;

    if ($count==17)
    {
        open(country, ">>Country.txt");
        print country "$Text\t\t$intext\n";
        close(country);
    }
    if ($count==19)
    {
        open(patent, ">>PatentNo.txt");
        print patent "$Text\t\t$intext\n";
        close(patent);
    }
    if ($count==21)
    {
        open(patentee, ">>Patentee.txt");
        print patentee "$Text\t\t$intext\n";
        close(patentee);
    }
    if ($count==23)
    {
        open(date, ">>Date.txt");
        print date "$Text\t\t$intext\n";
        close(date);
    }
    if($intext =~ /0 patents/)
    {
        print "No Patent Found \n You may want to delete the $Text.txt file that has been Generated\n";
    }
}
close(FILE);

[edited by: phranque at 9:33 am (utc) on Feb. 8, 2009]
[edit reason] massive data dump [/edit]

 

krugs




msg:3844844
 6:00 am on Feb 8, 2009 (gmt 0)

Is this school work?

brlinga




msg:3844845
 6:19 am on Feb 8, 2009 (gmt 0)

No, this is for a professor I am working for.
Did you get what I was trying to convey, or should I repost it?
I would greatly appreciate any input.

krugs




msg:3844850
 6:33 am on Feb 8, 2009 (gmt 0)

Post another example of the file without the *CUT* markers in it.

brlinga




msg:3844852
 6:48 am on Feb 8, 2009 (gmt 0)

Here's another example. I will be processing about 1000 files like this a day, and I need to convert each large text file into 17 smaller text files based on their headings (as explained above):

[US Patent & Trademark Office, Patent Full Text and Image Database]

[Home] [Boolean Search] [Manual Search]
[Number Search] [Help]

[Bottom]

[View Shopping Cart] [Add to Shopping Cart]

[Image]

( 1 of 1 )

--------------------------------------------------

United States Patent

[EDIT: massive data dump with specifics]

The present disclosure includes that contained in
the appended claims as well as that of the
foregoing description. Although this invention has
been described in its preferred form with a certain
degree of particularity, it is understood that the
present disclosure of the preferred form has been
made only by way of example and that numerous
changes in the details of construction and the
combination and arrangement of parts may be
resorted to without departing from the spirit and
scope of the invention.

* * * * *

--------------------------------------------------

[Image]

[View Shopping Cart] [Add to Shopping Cart]

[Top]

[Home] [Boolean Search] [Manual Search]
[Number Search] [Help]

[edited by: phranque at 9:38 am (utc) on Feb. 8, 2009]
[edit reason] removed specifics [/edit]

callivert




msg:3844854
 7:13 am on Feb 8, 2009 (gmt 0)

So, basically, you're trying to scrape the US patents database, then reformat it so that all identifying information (such as the owner of the patent) is stripped away from the description.
Is that right?

brlinga




msg:3844859
 7:26 am on Feb 8, 2009 (gmt 0)

No. I am trying to separate out each field and tag each one with its patent number. For example:

if ($count==19)
{
    open(patent, ">>PatentNo.txt");
    print patent "$Text\t\t$intext\n";
    close(patent);
}

$Text is the patent number.

I am not trying to strip out the identifying information in any way. It's just for some research, to get some statistics.

krugs




msg:3844867
 8:04 am on Feb 8, 2009 (gmt 0)

Is $Text already defined before you start processing the file?

krugs




msg:3844869
 8:09 am on Feb 8, 2009 (gmt 0)

In both files there is a line of dashes, then a blank line, then a description of some kind:

--------------------------------------------------

Sweatband

Is that line (Sweatband) always at the same line number, and is it always just one line?

krugs




msg:3844870
 8:22 am on Feb 8, 2009 (gmt 0)

What about these sections of the file? Do they get printed to files as well?

Assignee: (this heading is in one file but not the other)
Foreign Patent Documents
Primary Examiner:
Assistant Examiner:
Attorney, Agent or Firm:

You did not list them in the headings:


Based on Headings like {Abstract, Inventors,current US Class, Current international class, Appl. No., Filed, Field of Search, US patent Documents,Claims, Description}.

brlinga




msg:3844882
 9:02 am on Feb 8, 2009 (gmt 0)

Yes, $Text is defined beforehand.

The line with Sweatband is always exactly there, but it can sometimes be two lines.

And as you mentioned, Foreign Patent Documents, Primary Examiner, Assistant Examiner, and Attorney, Agent or Firm are each to be printed to a separate file.

I greatly appreciate your time and interest. Thanks.

krugs




msg:3844905
 10:04 am on Feb 8, 2009 (gmt 0)

This is not well tested and will more than likely need additional work, but it seems very close:



use strict;
use warnings;

my %headings = (
    17 => 'Country',
    19 => 'PatentNo',
    21 => 'Patentee',
    23 => 'Date',
    27 => 'Title',
);

my @headings = (
    'Abstract',
    'Inventors',
    'Current U.S. Class',
    'Current International Class',
    'Appl. No.',
    'Filed',
    'Field of Search',
    'U.S. Patent Documents',
    'Claims',
    'Description',
    'Primary Examiner',
    'Assistant Examiner',
    'Attorney, Agent or Firm',
);

my $Text = '123,456,789';
my $isopen;

open(FILE, '<', 'c:/perl_test/patent.txt') or die "$!";
OUTTERLOOP:
while (chomp(my $intext = <FILE>)){
    next OUTTERLOOP if ($intext =~ /^[ -]*$/);
    if ($. == 17 || $. == 19 || $. == 21 || $. == 23 || $. == 27) {
        static_output($.,$intext);
        next OUTTERLOOP;
    }
    INNERLOOP:
    while (chomp(my $intext = <FILE>)){
        foreach my $heading (@headings) {
            if ($intext =~ /^$heading:?/) {
                $isopen = 0 if (close OUT);
                print ">>>>> $heading\n";
                (my $filename = $heading) =~ tr/ /_/;
                $isopen = open(OUT, ">>", "c:/perl_test/dump/$filename.txt") or die "$!";
                print OUT "$Text\t\t$intext\n";
                last;
                next INNERLOOP;
            }
        }
        print OUT "$intext\n" if $isopen;
    }
}
print "++++finished++++\n";

sub static_output {
    my ($heading, $intext) = @_;
    open(my $OUT, ">>", "c:/perl_test/dump/$headings{$heading}.txt") or die "$!";
    print $OUT "$Text\t\t$intext\n";
    if ($heading == 27) {
        chomp(my $next_line = <FILE>);
        if ($next_line =~ /\S/) {
            print $OUT "$Text\t\t$next_line\n";
        }
    }
    return(0);
}

Change the paths to the files before trying it. Make sure to try it on some test files and note any problems. I will check back after getting some sleep.

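***** Check the pipes in the code before running it. The comparisons on $. in the "if" line must be joined with the logical-or operator (two pipe characters), and for some odd reason this forum mangles pipes. This forum also does not format code well, which makes it hard to read.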
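One way to sidestep the mangled-pipes problem altogether is to look the line number up in %headings instead of chaining comparisons. This is only a sketch, assuming the same fixed line numbers as in the script above; the helper name fixed_heading_for and the in-memory sample are made up for illustration:

```perl
use strict;
use warnings;

# Fixed line numbers and their output names, as in the script above.
my %headings = (
    17 => 'Country',
    19 => 'PatentNo',
    21 => 'Patentee',
    23 => 'Date',
    27 => 'Title',
);

# Hypothetical helper: returns the heading name for a fixed line number,
# or undef when the line is not one of the fixed ones.
sub fixed_heading_for {
    my ($line_no) = @_;
    return exists $headings{$line_no} ? $headings{$line_no} : undef;
}

# In-memory sample standing in for a real patent file (assumption).
my $sample = join '', map { "line $_\n" } 1 .. 30;
open(my $in, '<', \$sample) or die "$!";
while (my $intext = <$in>) {
    chomp $intext;
    my $name = fixed_heading_for($.);
    print "$name: $intext\n" if defined $name;
}
close($in);
```

Adding another fixed line then only means adding one entry to %headings, with no condition to keep in sync.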

[edited by: phranque at 10:23 am (utc) on Feb. 8, 2009]
[edit reason] disabled graphic smileys ;) [/edit]

krugs




msg:3845074
 5:45 pm on Feb 8, 2009 (gmt 0)

Any feedback on the code brlinga?

krugs




msg:3845095
 6:47 pm on Feb 8, 2009 (gmt 0)

Well, I had a chance to try the code and spotted a problem or two, so I edited it. Here is the new code.



use strict;
use warnings;

# The fixed headings
my %headings = (
    17 => 'Country',
    19 => 'PatentNo',
    21 => 'Patentee',
    23 => 'Date',
    27 => 'Title',
);

# The non-fixed headings
my @headings = (
    'Assignee',
    'Abstract',
    'Inventors',
    'Current U.S. Class',
    'Current International Class',
    'Appl. No.',
    'Filed',
    'Field of Search',
    'U.S. Patent Documents',
    'Claims',
    'Description',
    'Primary Examiner',
    'Assistant Examiner',
    'Attorney, Agent or Firm',
);

# Just for testing the script
my $Text = '123,456,789';

# A binary flag to determine if a file is opened or closed
my $isopen;

# Open the input file
open(FILE, '<', 'c:/perl_test/patent.txt') or die "$!";

####################################################
# OUTTERLOOP gets the sections of the file
# (%headings) that are always on the same line.
####################################################

OUTTERLOOP:
while (my $intext = <FILE>){
    chomp $intext;
    next OUTTERLOOP if ($intext =~ /^[ -]*$/);# skip blank lines and lines with only dashes
    if ($. == 17 || $. == 19 || $. == 21 || $. == 23 || $. == 27) {
        static_output($.,$intext);
    }
    next OUTTERLOOP if ($. < 28);
    ################################################
    # INNERLOOP gets the sections (@headings)
    # that might occur on different lines of
    # the file and maybe of varying numbers of lines.
    ################################################
    INNERLOOP:
    while (my $intext = <FILE>){
        chomp $intext;
        foreach my $heading (@headings) {
            # \Q...\E so the dots in headings like "Appl. No." match literally
            if ($intext =~ /^\Q$heading\E:?/) {
                # close the previous section's file, if there is one
                close(OUT) if $isopen;
                # Uncomment next line for debugging
                #print ">>>>> $heading\n";
                (my $filename = $heading) =~ tr/ /_/;
                $isopen = open(OUT, ">>", "c:/perl_test/dump/$filename.txt") or die "$!";
                print OUT "$Text\t\t";
                # leave the foreach; the heading line itself is printed below
                last;
            }
        }
        print OUT "$intext\n" if $isopen;
    }
}
print "++++finished++++\n";

#######################################
# sub static_output prints the fixed
# sections to a file
#######################################
sub static_output {
    my ($heading, $intext) = @_;
    # Uncomment next line for debugging
    #print "++++++$headings{$heading}\n";
    open(my $OUT, ">>", "c:/perl_test/dump/$headings{$heading}.txt") or die "$!";
    print $OUT "$Text\t\t$intext\n";
    if ($heading == 27) {
        chomp(my $next_line = <FILE>);
        if ($next_line =~ /\S/) {#appears to have more data
            print $OUT "$Text\t\t$next_line\n";
        }
    }
    return(0);
}
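
Several of the headings contain characters such as "." that are special in a regex, so a heading interpolated unescaped matches a little more loosely than intended. Below is a rough sketch of just the heading-detection step with the heading quoted via \Q...\E, collecting sections into a hash instead of writing files. The sub name collect_sections, the shortened heading list, and the sample text (including the application number) are all made up for illustration:

```perl
use strict;
use warnings;

my @headings = ('Abstract', 'Inventors', 'Appl. No.', 'Filed', 'Claims', 'Description');

# Hypothetical helper: reads lines from a filehandle and groups them
# under the most recently seen heading. Lines before the first heading
# and lines of only blanks or dashes are skipped.
sub collect_sections {
    my ($fh) = @_;
    my (%sections, $current);
    LINE:
    while (my $line = <$fh>) {
        chomp $line;
        next LINE if $line =~ /^[ -]*$/;
        foreach my $heading (@headings) {
            # \Q...\E makes the dots in "Appl. No." match literal dots only
            if ($line =~ /^\Q$heading\E:?/) {
                $current = $heading;
                last;
            }
        }
        push @{ $sections{$current} }, $line if defined $current;
    }
    return \%sections;
}

# Made-up sample input, read through an in-memory filehandle.
my $sample = <<'END';
Some preamble
Abstract
A band worn on the head.
--------------------------------------------------
Appl. No.: 12/345
Filed: January 1, 2000
END

open(my $in, '<', \$sample) or die "$!";
my $sections = collect_sections($in);
close($in);

foreach my $heading (sort keys %$sections) {
    print "$heading => ", join(' | ', @{ $sections->{$heading} }), "\n";
}
```

Note that a body line that happens to start with a heading word (for example a description sentence beginning with "Claims") would still be treated as a new section; the real files would need checking for that.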


[edited by: phranque at 7:32 pm (utc) on Feb. 8, 2009]
[edit reason] disabled graphic smileys ;) [/edit]

brlinga




msg:3845275
 1:47 am on Feb 9, 2009 (gmt 0)

Amazing, man, it's working.
I greatly appreciate your time and interest.
Thanks again.

brlinga




msg:3845339
 4:12 am on Feb 9, 2009 (gmt 0)

Hey krugs, there's a problem.
The code works well for some patents and doesn't work for others. Could there be a problem with the logic?

This is the error that's popping up:
readline() on closed filehandle FILE at C:\Perl\bin\upgrade.pl line 63.

[edited by: phranque at 5:58 am (utc) on Feb. 9, 2009]
[edit reason] specifics [/edit]

krugs




msg:3845711
 5:27 pm on Feb 9, 2009 (gmt 0)

I'll take a look at the code a bit later today and see if I can determine the problem. Right now I am at work.

brlinga




msg:3845893
 8:59 pm on Feb 9, 2009 (gmt 0)

Thanks, buddy, I am counting on you.

krugs




msg:3846013
 11:06 pm on Feb 9, 2009 (gmt 0)

The readline() problem is here:


sub static_output {
    my ($heading, $intext) = @_;
    # Uncomment next line for debugging
    #print "++++++$headings{$heading}\n";
    open(my $OUT, ">>", "c:/perl_test/dump/$headings{$heading}.txt") or die "$!";
    print $OUT "$Text\t\t$intext\n";
    if ($heading == 27) {
        chomp(my $next_line = <FILE>);
        if ($next_line =~ /\S/) {#appears to have more data
            print $OUT "$Text\t\t$next_line\n";
        }
    }
    close (FILE);
    return 0;
}

this line:

close (FILE);

needs to be changed to:

close ($OUT);
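
To make this kind of mix-up harder, the append step can live in a small sub that only ever sees its own lexical output handle, so there is no input handle in scope to close by mistake. A sketch under that assumption; the name append_record is hypothetical, and a temporary directory stands in for the real dump folder:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: append one tab-separated record to a file.
# The output handle is lexical and local to this sub, so it cannot be
# confused with the input handle the main loop is still reading from.
sub append_record {
    my ($path, $patent_no, $text) = @_;
    open(my $out, '>>', $path) or die "Cannot open $path: $!";
    print {$out} "$patent_no\t\t$text\n";
    close($out) or die "Cannot close $path: $!";
    return;
}

# Temporary directory so the sketch does not touch real files.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/Country.txt";

append_record($path, '123,456,789', 'United States');
append_record($path, '123,456,789', 'Second line');

# Read the file back to show what was written.
open(my $check, '<', $path) or die "Cannot open $path: $!";
print while <$check>;
close($check);
```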

brlinga




msg:3846033
 11:44 pm on Feb 9, 2009 (gmt 0)

Thanks, man.
Is there a way to get all the files in CSV format (readable by Excel), so that there are no lines like this: "------------------"?
And can we have the data in 2 columns, like this:

Assignee.txt --> Patent No. Assignee
Country.txt --> Patent No. Country
Description.txt --> Patent No. the description...

And did you check out the full code? Did you like it?
Apologies for being so lame, I am just a beginner.
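
A possible starting point for the two-column CSV output, offered as a sketch rather than a tested solution: quote each value, drop the dashes-only lines, and join a multi-line section into a single field. The sub names csv_field and csv_row and the sample text are made up for illustration; the Text::CSV module from CPAN is the more robust choice if it can be installed.

```perl
use strict;
use warnings;

# Quote one value for CSV: wrap it in double quotes and double any
# embedded double quotes, so commas in the value stay in one column.
sub csv_field {
    my ($value) = @_;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Hypothetical helper: build one two-column row (patent number, text)
# from the already-chomped lines of a section, dropping blank and
# dashes-only lines and joining the rest with single spaces.
sub csv_row {
    my ($patent_no, @lines) = @_;
    my @kept = grep { $_ !~ /^[ -]*$/ } @lines;
    my $text = join ' ', @kept;
    return join(',', csv_field($patent_no), csv_field($text)) . "\n";
}

print csv_row('123,456,789',
    'A "soft" band',
    '--------------------------------------------------',
    'worn on the head.');
# prints: "123,456,789","A ""soft"" band worn on the head."
```

Quoting the patent number matters here because the test value itself contains commas, which would otherwise split it across several columns.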
