
Splitting a txt file into smaller text files.

     

brlinga

1:52 am on Feb 8, 2009 (gmt 0)

5+ Year Member



I am trying to split text files of the format below into 17 new text files using Perl, based on headings like {Abstract, Inventors, Current US Class, Current International Class, Appl. No., Filed, Field of Search, US Patent Documents, Claims, Description}.

Below is a sample of the file. I have added *CUT* to indicate where each new file starts.
*******************************************************************
[EDIT: deleted several hundred lines of data dump including specifics]
**********************************************************************

I could code the first few fields since they appear on exactly the same line every time. For the rest, I don't get how to do pattern matching and write to files at the same time. Would greatly appreciate some help.

while(($intext=<FILE>))
{
    $count++;
    #print "$count\n";
    #print $intext;

    if ($count==17)
    {
        open(country, ">>Country.txt");
        print country "$Text\t\t$intext\n";
        close(country);
    }
    if ($count==19)
    {
        open(patent, ">>PatentNo.txt");
        print patent "$Text\t\t$intext\n";
        close(patent);
    }
    if ($count==21)
    {
        open(patentee, ">>Patentee.txt");
        print patentee "$Text\t\t$intext\n";
        close(patentee);
    }
    if ($count==23)
    {
        open(date, ">>Date.txt");
        print date "$Text\t\t$intext\n";
        close(date);
    }
    if($intext =~ /0 patents/)
    {
        print "No Patent Found \n You may want to delete the $Text.txt file that has been Generated\n";
    }
}
close(FILE);

[edited by: phranque at 9:33 am (utc) on Feb. 8, 2009]
[edit reason] massive data dump [/edit]

krugs

6:00 am on Feb 8, 2009 (gmt 0)

5+ Year Member



Is this school work?

brlinga

6:19 am on Feb 8, 2009 (gmt 0)

5+ Year Member



Nope, this is for a prof I am working for.
Did you get what I was trying to convey, or should I repost it?
Would greatly appreciate any input.

krugs

6:33 am on Feb 8, 2009 (gmt 0)

5+ Year Member



Post another example of the file without the *CUT* stuff in it.

brlinga

6:48 am on Feb 8, 2009 (gmt 0)

5+ Year Member



Here's another example. I will be processing about 1000 files like these a day, and need to convert each huge txt file into 17 smaller txt files based on the headings (as explained above):

[US Patent & Trademark Office, Patent Full Text and Image Database]

[Home] [Boolean Search] [Manual Search]
[Number Search] [Help]

[Bottom]

[View Shopping Cart] [Add to Shopping Cart]

[Image]

( 1 of 1 )

--------------------------------------------------

United States Patent

[EDIT: massive data dump with specifics]

The present disclosure includes that contained in
the appended claims as well as that of the
foregoing description. Although this invention has
been described in its preferred form with a certain
degree of particularity, it is understood that the
present disclosure of the preferred form has been
made only by way of example and that numerous
changes in the details of construction and the
combination and arrangement of parts may be
resorted to without departing from the spirit and
scope of the invention.

* * * * *

--------------------------------------------------

[Image]

[View Shopping Cart] [Add to Shopping Cart]

[Top]

[Home] [Boolean Search] [Manual Search]
[Number Search] [Help]

[edited by: phranque at 9:38 am (utc) on Feb. 8, 2009]
[edit reason] removed specifics [/edit]

callivert

7:13 am on Feb 8, 2009 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



So, basically, you're trying to scrape the US patents database, then reformat it so that all identifying information (such as the owner of the patent) is stripped away from the description.
Is that right?

brlinga

7:26 am on Feb 8, 2009 (gmt 0)

5+ Year Member



Nope. I am trying to separate each field, tagging each of them with its patent number. Note this part of my code:

if ($count==19)
{
    open(patent, ">>PatentNo.txt");
    print patent "$Text\t\t$intext\n";
    close(patent);
}

$Text is the patent number.

I am not trying to strip off the identifying information in any way. It's just for some research, to get some statistics.

krugs

8:04 am on Feb 8, 2009 (gmt 0)

5+ Year Member



Is $Text already defined before you start processing the file?

krugs

8:09 am on Feb 8, 2009 (gmt 0)

5+ Year Member



In both of the files there is a line of dashes, then a blank line, and then a description of some kind:

--------------------------------------------------

Sweatband

Is that line (Sweatband) always at the same line number, and is it always just one line?

krugs

8:22 am on Feb 8, 2009 (gmt 0)

5+ Year Member



What about these areas/sections of the file, do they get printed to a file?

Assignee: (this heading is in one file but not the other)
Foreign Patent Documents
Primary Examiner:
Assistant Examiner:
Attorney, Agent or Firm:

You did not list them in the headings:


Based on Headings like {Abstract, Inventors,current US Class, Current international class, Appl. No., Filed, Field of Search, US patent Documents,Claims, Description}.

brlinga

9:02 am on Feb 8, 2009 (gmt 0)

5+ Year Member



Yes, $Text is already defined before that.

The line with Sweatband is always exactly there, but it can sometimes be two lines.

And as you mentioned, Foreign Patent Documents, Primary Examiner, Assistant Examiner, and Attorney, Agent or Firm are each to be printed to a separate file.

I greatly appreciate your time and interest. Thanks.

krugs

10:04 am on Feb 8, 2009 (gmt 0)

5+ Year Member



This is not well tested and will more than likely need additional work, but it seems very close:



use strict;
use warnings;

my %headings = (
    17 => 'Country',
    19 => 'PatentNo',
    21 => 'Patentee',
    23 => 'Date',
    27 => 'Title',
);

my @headings = (
    'Abstract',
    'Inventors',
    'Current U.S. Class',
    'Current International Class',
    'Appl. No.',
    'Filed',
    'Field of Search',
    'U.S. Patent Documents',
    'Claims',
    'Description',
    'Primary Examiner',
    'Assistant Examiner',
    'Attorney, Agent or Firm',
);

my $Text = '123,456,789';
my $isopen;

open(FILE, '<', 'c:/perl_test/patent.txt') or die "$!";
OUTTERLOOP:
while (chomp(my $intext = <FILE>)){
    next OUTTERLOOP if ($intext =~ /^[ -]*$/);
    if ($. == 17 || $. == 19 || $. == 21 || $. == 23 || $. == 27) {
        static_output($.,$intext);
        next OUTTERLOOP;
    }
    INNERLOOP:
    while (chomp(my $intext = <FILE>)){
        foreach my $heading (@headings) {
            if ($intext =~ /^$heading:?/) {
                $isopen = 0 if (close OUT);
                print ">>>>> $heading\n";
                (my $filename = $heading) =~ tr/ /_/;
                $isopen = open(OUT, ">>", "c:/perl_test/dump/$filename.txt") or die "$!";
                print OUT "$Text\t\t$intext\n";
                last;
                next INNERLOOP;
            }
        }
        print OUT "$intext\n" if $isopen;
    }
}
print "++++finished++++\n";

sub static_output {
    my ($heading, $intext) = @_;
    open(my $OUT, ">>", "c:/perl_test/dump/$headings{$heading}.txt") or die "$!";
    print $OUT "$Text\t\t$intext\n";
    if ($heading == 27) {
        chomp(my $next_line = <FILE>);
        if ($next_line =~ /\S/) {
            print $OUT "$Text\t\t$next_line\n";
        }
    }
    return(0);
}


Change the paths to the files before trying. Make sure to try it on some test files and note any problems. I will check back after getting some sleep.

***** Check the pipes in the code: the line-number test should use the || (logical or) operator between the comparisons. For some odd reason this forum alters pipe characters. This forum also does not format code well, making it hard to read.

[edited by: phranque at 10:23 am (utc) on Feb. 8, 2009]
[edit reason] disabled graphic smileys ;) [/edit]

krugs

5:45 pm on Feb 8, 2009 (gmt 0)

5+ Year Member



Any feedback on the code brlinga?

krugs

6:47 pm on Feb 8, 2009 (gmt 0)

5+ Year Member



Well, I had a chance to try the code and spotted a problem or two so I edited the code. Here is the new code.



use strict;
use warnings;

# The fixed headings
my %headings = (
    17 => 'Country',
    19 => 'PatentNo',
    21 => 'Patentee',
    23 => 'Date',
    27 => 'Title',
);

# The non-fixed headings
my @headings = (
    'Assignee',
    'Abstract',
    'Inventors',
    'Current U.S. Class',
    'Current International Class',
    'Appl. No.',
    'Filed',
    'Field of Search',
    'U.S. Patent Documents',
    'Claims',
    'Description',
    'Primary Examiner',
    'Assistant Examiner',
    'Attorney, Agent or Firm',
);

# Just for testing the script
my $Text = '123,456,789';

# A binary flag to determine if a file is opened or closed
my $isopen;

# Open the input file
open(FILE, '<', 'c:/perl_test/patent.txt') or die "$!";

####################################################
# OUTTERLOOP gets the sections of the file
# (%headings) that are always on the same line.
####################################################

OUTTERLOOP:
while (my $intext = <FILE>){
    chomp $intext;
    next OUTTERLOOP if ($intext =~ /^[ -]*$/);# skip blank lines and lines with only dashes
    if ($. == 17 || $. == 19 || $. == 21 || $. == 23 || $. == 27) {
        static_output($.,$intext);
    }
    next OUTTERLOOP if ($. < 28);
    ################################################
    # INNERLOOP gets the sections (@headings)
    # that might occur on different lines of
    # the file and maybe of varying numbers of lines.
    ################################################
    INNERLOOP:
    while (my $intext = <FILE>){
        chomp $intext;
        foreach my $heading (@headings) {
            # \Q...\E escapes the dots in headings such as 'Appl. No.'
            if ($intext =~ /^\Q$heading\E:?/) {
                # Close the previous section's file, if one is open
                close(OUT) if $isopen;
                # Uncomment next line for debugging
                #print ">>>>> $heading\n";
                (my $filename = $heading) =~ tr/ /_/;
                $isopen = open(OUT, ">>", "c:/perl_test/dump/$filename.txt") or die "$!";
                print OUT "$Text\t\t";
                last;
            }
        }
        print OUT "$intext\n" if $isopen;
    }
}
print "++++finished++++\n";

#######################################
# sub static_output prints the fixed
# sections to a file
#######################################
sub static_output {
    my ($heading, $intext) = @_;
    # Uncomment next line for debugging
    #print "++++++$headings{$heading}\n";
    open(my $OUT, ">>", "c:/perl_test/dump/$headings{$heading}.txt") or die "$!";
    print $OUT "$Text\t\t$intext\n";
    if ($heading == 27) {
        chomp(my $next_line = <FILE>);
        if ($next_line =~ /\S/) {#appears to have more data
            print $OUT "$Text\t\t$next_line\n";
        }
    }
    return(0);
}


[edited by: phranque at 7:32 pm (utc) on Feb. 8, 2009]
[edit reason] disabled graphic smileys ;) [/edit]
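
A possible variation on the heading match, offered as a sketch rather than a tested replacement. Several headings contain dots (Appl. No., U.S.), which a regex treats as "any character" unless they are escaped, so this version escapes each heading with quotemeta and builds one anchored pattern. The name split_sections is hypothetical, and it collects the sections in memory instead of writing files so that the matching can be checked on its own. Like the loop in the script, it treats any line that starts with a heading as the start of a new section.

```perl
use strict;
use warnings;

# Headings as listed in the thread; several contain '.', a regex metacharacter.
my @headings = (
    'Assignee', 'Abstract', 'Inventors', 'Current U.S. Class',
    'Current International Class', 'Appl. No.', 'Filed', 'Field of Search',
    'U.S. Patent Documents', 'Claims', 'Description',
    'Primary Examiner', 'Assistant Examiner', 'Attorney, Agent or Firm',
);

# One anchored alternation with every heading escaped. Longest headings go
# first so a shorter heading cannot shadow a longer one that starts with it.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @headings;
my $heading_re = qr/^($alternation):?\s*(.*)$/;

# split_sections (hypothetical helper): takes a list of lines and returns a
# hash ref mapping each heading found to an array ref of that section's lines.
sub split_sections {
    my @lines = @_;
    my (%sections, $current);
    for my $line (@lines) {
        chomp(my $text = $line);
        next if $text =~ /^[ -]*$/;          # skip blank and dash-only lines
        if ($text =~ $heading_re) {
            $current = $1;                   # a new section starts here
            push @{ $sections{$current} }, $2 if length $2;
            next;
        }
        # Lines before the first heading are ignored.
        push @{ $sections{$current} }, $text if defined $current;
    }
    return \%sections;
}
```

Each section's lines could then be appended to its own file, prefixed with the patent number, in the same way the script above does.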

brlinga

1:47 am on Feb 9, 2009 (gmt 0)

5+ Year Member



Amazing man...its working..
I greatly appreciate your time and interest..
thanks again...

brlinga

4:12 am on Feb 9, 2009 (gmt 0)

5+ Year Member



Hey Krugs, there's a problem.
The code works well for some patents and doesn't work for others. Could there be a problem with the logic?

This is the error that's popping up:
readline() on closed filehandle FILE at C:\Perl\bin\upgrade.pl line 63.

[edited by: phranque at 5:58 am (utc) on Feb. 9, 2009]
[edit reason] specifics [/edit]

krugs

5:27 pm on Feb 9, 2009 (gmt 0)

5+ Year Member



I'll take a look at the code a bit later today and see if I can determine the problem. Right now I am at work.

brlinga

8:59 pm on Feb 9, 2009 (gmt 0)

5+ Year Member



Thanks buddy, I am counting on you.

krugs

11:06 pm on Feb 9, 2009 (gmt 0)

5+ Year Member



The readline() problem is here:


sub static_output {
    my ($heading, $intext) = @_;
    # Uncomment next line for debugging
    #print "++++++$headings{$heading}\n";
    open(my $OUT, ">>", "c:/perl_test/dump/$headings{$heading}.txt") or die "$!";
    print $OUT "$Text\t\t$intext\n";
    if ($heading == 27) {
        chomp(my $next_line = <FILE>);
        if ($next_line =~ /\S/) {#appears to have more data
            print $OUT "$Text\t\t$next_line\n";
        }
    }
    close (FILE);
    return 0;
}

this line:

close (FILE);

needs to be changed to:

close ($OUT);
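
The error follows from the input handle being closed inside the sub: the next read from FILE in the main loop then hits a closed handle. A minimal sketch of the same idea with the output kept in a lexical handle that is the only thing closed. The name append_record and its parameters are hypothetical, not part of the script above.

```perl
use strict;
use warnings;

# append_record (hypothetical helper): append one line of the form
# "patent number <TAB><TAB> value" to the file at $path. Only the lexical
# output handle is closed here; the caller's input handle is never touched.
sub append_record {
    my ($path, $patent_no, $value) = @_;
    open(my $out, '>>', $path) or die "Cannot open $path: $!";
    print {$out} "$patent_no\t\t$value\n" or die "Cannot write to $path: $!";
    close($out) or die "Cannot close $path: $!";
    return;
}
```

Because the handle lives in a lexical variable, it cannot be confused with the input handle by name, and checking the return value of close reports write errors that would otherwise pass silently.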

brlinga

11:44 pm on Feb 9, 2009 (gmt 0)

5+ Year Member



Thanks, man.
Is there a way we can get all the files in CSV format (Excel readable), so that there are no lines like this: "------------------"?
And can we have the data in 2 columns, this way:

Assignee.txt --> Patent No. Assignee
Country.txt --> Patent No. Country
Description.txt --> Patent No. the description...

And did you check out the full code? Did you like it?
Apologies for being so lame, I am just a beginner.
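
One way the two-column CSV output could be approached, as a sketch under assumptions rather than a tested answer. It assumes each section should become a single row (patent number, then the section text joined into one cell), and the helper names csv_field and csv_row are hypothetical. Every field is quoted because the patent number itself contains commas.

```perl
use strict;
use warnings;

# csv_field (hypothetical helper): quote one value for CSV. Embedded double
# quotes are doubled and the whole field is wrapped in double quotes.
sub csv_field {
    my ($value) = @_;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# csv_row (hypothetical helper): build one "patent number, text" row from the
# lines of a section. Blank and dash-only lines are dropped, and the remaining
# lines are trimmed and joined with single spaces.
sub csv_row {
    my ($patent_no, @lines) = @_;
    my @kept;
    for my $line (@lines) {
        (my $text = $line) =~ s/^\s+|\s+$//g;   # trim, including the newline
        next if $text =~ /^[ -]*$/;             # drop separator lines
        push @kept, $text;
    }
    return join(',', csv_field($patent_no), csv_field(join(' ', @kept))) . "\n";
}
```

The row returned by csv_row can be printed to an output file with a .csv extension in place of the tab-separated print. For anything beyond simple cases, the Text::CSV module from CPAN is probably the safer choice.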

 
