Splitting a txt file into smaller text files.

     
1:52 am on Feb 8, 2009 (gmt 0)

New User

5+ Year Member

joined:Feb 8, 2009
posts: 9
votes: 0


I am trying to split text files of the format below into 17 new text files using Perl, based on headings like {Abstract, Inventors, Current US Class, Current International Class, Appl. No., Filed, Field of Search, US Patent Documents, Claims, Description}.

Below is a sample of the file. I have added *CUT* to indicate where each new file starts.
*******************************************************************
[EDIT: deleted several hundred lines of data dump including specifics]
**********************************************************************

I could code the first few fields since they appear on exactly the same line every time. For the rest, I don't get how to do pattern matching and write into files at the same time. Would greatly appreciate some help.

while(($intext=<FILE>))
{
    $count++;
    #print "$count\n";
    #print $intext;

    if ($count==17)
    {
        open(country, ">>Country.txt");
        print country "$Text\t\t$intext\n";
        close(country);
    }
    if ($count==19)
    {
        open(patent, ">>PatentNo.txt");
        print patent "$Text\t\t$intext\n";
        close(patent);
    }
    if ($count==21)
    {
        open(patentee, ">>Patentee.txt");
        print patentee "$Text\t\t$intext\n";
        close(patentee);
    }
    if ($count==23)
    {
        open(date, ">>Date.txt");
        print date "$Text\t\t$intext\n";
        close(date);
    }
    if($intext =~ /0 patents/)
    {
        print "No Patent Found \n You may want to delete the $Text.txt file that has been Generated\n";
    }
}
close(FILE);

[edited by: phranque at 9:33 am (utc) on Feb. 8, 2009]
[edit reason] massive data dump [/edit]

6:00 am on Feb 8, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 20, 2008
posts:92
votes: 0


Is this school work?
6:19 am on Feb 8, 2009 (gmt 0)

New User

5+ Year Member

joined:Feb 8, 2009
posts: 9
votes: 0


Nope, this is for a professor I am working for.
Did you get what I was trying to convey, or should I repost it?
Would greatly appreciate any input you have.
6:33 am on Feb 8, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 20, 2008
posts:92
votes: 0


Post another example of the file without the *CUT* stuff in it.
6:48 am on Feb 8, 2009 (gmt 0)

New User

5+ Year Member

joined:Feb 8, 2009
posts: 9
votes: 0


Here's another example. I would be processing about 1000 files like these a day, and need to convert this huge text file into 17 smaller text files based on their headings (as explained above):

[US Patent & Trademark Office, Patent Full Text and Image Database]

[Home] [Boolean Search] [Manual Search]
[Number Search] [Help]

[Bottom]

[View Shopping Cart] [Add to Shopping Cart]

[Image]

( 1 of 1 )

--------------------------------------------------

United States Patent

[EDIT: massive data dump with specifics]

The present disclosure includes that contained in
the appended claims as well as that of the
foregoing description. Although this invention has
been described in its preferred form with a certain
degree of particularity, it is understood that the
present disclosure of the preferred form has been
made only by way of example and that numerous
changes in the details of construction and the
combination and arrangement of parts may be
resorted to without departing from the spirit and
scope of the invention.

* * * * *

--------------------------------------------------

[Image]

[View Shopping Cart] [Add to Shopping Cart]

[Top]

[Home] [Boolean Search] [Manual Search]
[Number Search] [Help]

[edited by: phranque at 9:38 am (utc) on Feb. 8, 2009]
[edit reason] removed specifics [/edit]

7:13 am on Feb 8, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Nov 30, 2006
posts:685
votes: 0


So, basically, you're trying to scrape the US patents database, then reformat it so that all identifying information (such as the owner of the patent) is stripped away from the description.
Is that right?
7:26 am on Feb 8, 2009 (gmt 0)

New User

5+ Year Member

joined:Feb 8, 2009
posts: 9
votes: 0


Nope. I am trying to separate each field, tagging each one with its patent number. Note this part of my code:

if ($count==19)
{
    open(patent, ">>PatentNo.txt");
    print patent "$Text\t\t$intext\n";
    close(patent);
}

$Text is its patent number.

I am not trying to strip off its identifying information in any way. It's just for some research to get some statistics.

8:04 am on Feb 8, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 20, 2008
posts:92
votes: 0


Is $Text already defined before you start processing the file?
8:09 am on Feb 8, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 20, 2008
posts:92
votes: 0


In both of the files there is a line of dashes, then a blank line, and then a description of some kind:

--------------------------------------------------

Sweatband

Is the line (Sweatband) always on the same line and is it always just one line?

8:22 am on Feb 8, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 20, 2008
posts:92
votes: 0


What about these areas/sections of the file, do they get printed to a file?

Assignee: (this heading is in one file but not the other)
Foreign Patent Documents
Primary Examiner:
Assistant Examiner:
Attorney, Agent or Firm:

You did not list them in the headings:


Based on Headings like {Abstract, Inventors,current US Class, Current international class, Appl. No., Filed, Field of Search, US patent Documents,Claims, Description}.
9:02 am on Feb 8, 2009 (gmt 0)

New User

5+ Year Member

joined:Feb 8, 2009
posts: 9
votes: 0


Yes, $Text is already defined before processing starts.

The line with Sweatband is always exactly there, but it can sometimes be two lines.

And as you mentioned, Foreign Patent Documents, Primary Examiner, Assistant Examiner, and Attorney, Agent or Firm are each to be printed to a separate file.

I greatly appreciate your time and interest. Thanks.

10:04 am on Feb 8, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 20, 2008
posts:92
votes: 0


This is not well tested and will more than likely need additional work, but it seems very close:



use strict;
use warnings;

my %headings = (
17 => 'Country',
19 => 'PatentNo',
21 => 'Patentee',
23 => 'Date',
27 => 'Title',
);

my @headings = (
'Abstract',
'Inventors',
'Current U.S. Class',
'Current International Class',
'Appl. No.',
'Filed',
'Field of Search',
'U.S. Patent Documents',
'Claims',
'Description',
'Primary Examiner',
'Assistant Examiner',
'Attorney, Agent or Firm',
);

my $Text = '123,456,789';
my $isopen;

open(FILE, '<', 'c:/perl_test/patent.txt') or die "$!";
OUTTERLOOP:
while (chomp(my $intext = <FILE>)){
next OUTTERLOOP if ($intext =~ /^[ -]*$/);
if ($. == 17 || $. == 19 || $. == 21 || $. == 23 || $. == 27) {
static_output($.,$intext);
next OUTTERLOOP;
}
INNERLOOP:
while (chomp(my $intext = <FILE>)){
foreach my $heading (@headings) {
if ($intext =~ /^$heading:?/) {
$isopen = 0 if (close OUT);
print ">>>>> $heading\n";
(my $filename = $heading) =~ tr/ /_/;
$isopen = open(OUT, ">>", "c:/perl_test/dump/$filename.txt") or die "$!";
print OUT "$Text\t\t$intext\n";
last;
next INNERLOOP;
}
}
print OUT "$intext\n" if $isopen;
}
}
print "++++finished++++\n";

sub static_output {
my ($heading, $intext) = @_;
open(my $OUT, ">>", "c:/perl_test/dump/$headings{$heading}.txt") or die "$!";
print $OUT "$Text\t\t$intext\n";
if ($heading == 27) {
chomp(my $next_line = <FILE>);
if ($next_line =~ /\S/) {
print $OUT "$Text\t\t$next_line\n";
}
}
return(0);
}


Change the paths to the files before trying. Make sure to try it on some test files and note any problems. I will check back after getting some sleep.

***** You need to change the pipes in the code. For some odd reason this forum changes them to double-pipes. This forum also does not format code well making it hard to read.

[edited by: phranque at 10:23 am (utc) on Feb. 8, 2009]
[edit reason] disabled graphic smileys ;) [/edit]

5:45 pm on Feb 8, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 20, 2008
posts:92
votes: 0


Any feedback on the code brlinga?
6:47 pm on Feb 8, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 20, 2008
posts:92
votes: 0


Well, I had a chance to try the code and spotted a problem or two so I edited the code. Here is the new code.



use strict;
use warnings;

# The fixed headings, keyed by input line number
my %headings = (
    17 => 'Country',
    19 => 'PatentNo',
    21 => 'Patentee',
    23 => 'Date',
    27 => 'Title',
);

# The non-fixed headings
my @headings = (
    'Assignee',
    'Abstract',
    'Inventors',
    'Current U.S. Class',
    'Current International Class',
    'Appl. No.',
    'Filed',
    'Field of Search',
    'U.S. Patent Documents',
    'Claims',
    'Description',
    'Primary Examiner',
    'Assistant Examiner',
    'Attorney, Agent or Firm',
);

# Just for testing the script
my $Text = '123,456,789';

# A binary flag to determine if a file is opened or closed
my $isopen;

# Open the input file
open(FILE, '<', 'c:/perl_test/patent.txt') or die "$!";

####################################################
# OUTTERLOOP gets the sections of the file
# (%headings) that are always on the same line.
####################################################

OUTTERLOOP:
while (my $intext = <FILE>) {
    chomp $intext;
    next OUTTERLOOP if ($intext =~ /^[ -]*$/); # skip blank lines and lines with only dashes
    if ($. == 17 || $. == 19 || $. == 21 || $. == 23 || $. == 27) {
        static_output($., $intext);
    }
    next OUTTERLOOP if ($. < 28);
    ################################################
    # INNERLOOP gets the sections (@headings)
    # that might occur on different lines of
    # the file and maybe of varying numbers of lines.
    ################################################
    INNERLOOP:
    while (my $intext = <FILE>) {
        chomp $intext;
        foreach my $heading (@headings) {
            # \Q...\E keeps the dots in headings such as 'Appl. No.' literal
            if ($intext =~ /^\Q$heading\E:?/) {
                # Close the previous section's file, if one is open
                if ($isopen) {
                    close(OUT);
                    $isopen = 0;
                }
                # Uncomment next line for debugging
                #print ">>>>> $heading\n";
                (my $filename = $heading) =~ tr/ /_/;
                $isopen = open(OUT, ">>", "c:/perl_test/dump/$filename.txt") or die "$!";
                print OUT "$Text\t\t";
                last; # heading found; the heading line itself is printed below
            }
        }
        print OUT "$intext\n" if $isopen;
    }
}
print "++++finished++++\n";

#######################################
# sub static_output prints the fixed
# sections to a file
#######################################
sub static_output {
    my ($heading, $intext) = @_;
    # Uncomment next line for debugging
    #print "++++++$headings{$heading}\n";
    open(my $OUT, ">>", "c:/perl_test/dump/$headings{$heading}.txt") or die "$!";
    print $OUT "$Text\t\t$intext\n";
    if ($heading == 27) {
        # The title can run onto a second line
        chomp(my $next_line = <FILE>);
        if ($next_line =~ /\S/) { # appears to have more data
            print $OUT "$Text\t\t$next_line\n";
        }
    }
    return(0);
}
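
The heading-matching step used in the script can also be tried on its own, without any files. The following is only a minimal sketch: split_sections is a hypothetical helper, and the sample lines are made-up placeholders rather than real patent data. It groups lines under the most recent heading they follow, using \Q...\E so that the dots in a heading like 'Appl. No.' are matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): group lines
# into sections keyed by the most recent heading seen.
sub split_sections {
    my ($lines, $headings) = @_;
    my (%sections, $current);
    for my $line (@$lines) {
        # First heading that this line starts with, if any
        my ($hit) = grep { $line =~ /^\Q$_\E:?/ } @$headings;
        $current = $hit if defined $hit;
        # Lines before the first heading are ignored
        push @{ $sections{$current} }, $line if defined $current;
    }
    return \%sections;
}

# Placeholder input, for illustration only
my @sample = (
    'Inventors: Doe; Jane',
    'Appl. No.: 00/000,000',
    'Abstract',
    'A first line of text.',
    'A second line of text.',
);
my $sections = split_sections(\@sample, ['Abstract', 'Inventors', 'Appl. No.']);
print scalar @{ $sections->{'Abstract'} }, "\n";   # prints 3
```

One caveat with this kind of prefix match: a body line that happens to start with a heading word (for example a description sentence starting with "Claims") would be treated as a new section, so a stricter pattern may be needed for real data.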


[edited by: phranque at 7:32 pm (utc) on Feb. 8, 2009]
[edit reason] disabled graphic smileys ;) [/edit]

1:47 am on Feb 9, 2009 (gmt 0)

New User

5+ Year Member

joined:Feb 8, 2009
posts: 9
votes: 0


Amazing man, it's working.
I greatly appreciate your time and interest.
Thanks again.
4:12 am on Feb 9, 2009 (gmt 0)

New User

5+ Year Member

joined:Feb 8, 2009
posts: 9
votes: 0


Hey Krugs, there's a problem.
The code works well for some patents and doesn't work for others. Could there be a problem with the logic?

This is the error that's popping up:
readline() on closed filehandle FILE at C:\Perl\bin\upgrade.pl line 63.

[edited by: phranque at 5:58 am (utc) on Feb. 9, 2009]
[edit reason] specifics [/edit]

5:27 pm on Feb 9, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 20, 2008
posts:92
votes: 0


I'll take a look at the code a bit later today and see if I can determine the problem. Right now I am at work.
8:59 pm on Feb 9, 2009 (gmt 0)

New User

5+ Year Member

joined:Feb 8, 2009
posts:9
votes: 0


Thanks buddy, I am counting on you.
11:06 pm on Feb 9, 2009 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 20, 2008
posts:92
votes: 0


The readline() problem is here:


sub static_output {
    my ($heading, $intext) = @_;
    # Uncomment next line for debugging
    #print "++++++$headings{$heading}\n";
    open(my $OUT, ">>", "c:/perl_test/dump/$headings{$heading}.txt") or die "$!";
    print $OUT "$Text\t\t$intext\n";
    if ($heading == 27) {
        chomp(my $next_line = <FILE>);
        if ($next_line =~ /\S/) {#appears to have more data
            print $OUT "$Text\t\t$next_line\n";
        }
    }
    close (FILE);
    return 0;
}

this line:

close (FILE);

needs to be changed to:

close ($OUT);
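
One way to make this kind of mix-up harder is to keep each output handle in a lexical variable inside the sub that uses it, so the input handle is never named there at all. This is only a sketch under assumptions: append_record is a hypothetical helper, and it writes to a temporary directory instead of c:/perl_test/dump.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: append one record to a file. The lexical
# handle $out exists only inside this sub, so the caller's input
# handle cannot be closed here by accident.
sub append_record {
    my ($path, $patent_no, $value) = @_;
    open(my $out, '>>', $path) or die "Cannot open $path: $!";
    print {$out} "$patent_no\t\t$value\n";
    close($out) or die "Cannot close $path: $!";
    return;
}

my $dir  = tempdir(CLEANUP => 1);   # scratch directory for the demo
my $path = "$dir/Title.txt";        # illustrative file name
append_record($path, '123,456,789', 'Sweatband');

# Read the record back to show what was written
open(my $in, '<', $path) or die "Cannot open $path: $!";
my $line = <$in>;
close($in);
print $line;
```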

11:44 pm on Feb 9, 2009 (gmt 0)

New User

5+ Year Member

joined:Feb 8, 2009
posts:9
votes: 0


Thanks man.
Is there a way to get all the files in CSV format (readable by Excel), so that there are no lines like this: "------------------"?
And can we have the data in 2 columns, like this:

Assignee.txt --> Patent No. Assignee
Country.txt --> Patent No. Country
Description.txt --> Patent No. the description...

And did you check out the full code? Did you like it?
Apologies for being so lame, I am just a beginner.
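
For the two-column layout asked about here, one possible approach is to quote each field and join the two with a comma, dropping the dash-only separator lines first. This is a rough sketch, not a tested solution for the real files: csv_field and csv_row are hypothetical helpers, and joining a multi-line section into one cell with spaces is an assumption about the desired output.

```perl
use strict;
use warnings;

# Hypothetical helper: quote one field for CSV output.
sub csv_field {
    my ($value) = @_;
    $value =~ s/"/""/g;    # double any embedded quotes
    return qq{"$value"};
}

# Hypothetical helper: build a two-column row (patent number, field text).
sub csv_row {
    my ($patent_no, @lines) = @_;
    # Drop blank lines and lines made only of spaces and dashes
    my @kept = grep { !/^[ -]*$/ } @lines;
    return join(',', csv_field($patent_no), csv_field(join(' ', @kept))) . "\n";
}

print csv_row('123,456,789', 'Sweatband', '----------', '');
# prints "123,456,789","Sweatband"
```

Quoting matters here because the patent number itself contains commas. For anything beyond simple cases, the Text::CSV module from CPAN handles quoting and escaping more thoroughly.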