Apply Regex to complete File or List

foreign character cleaning regex

         

sanuk

6:58 am on Aug 22, 2003 (gmt 0)

10+ Year Member



Hi,

I am running a small local directory (search-engine).
But I am having problems with "foreign characters" and/or "unprintable characters" in a plain-text database.

Because of these characters I cannot split @list into separate $line values: as soon as the Perl script reaches the unprintable character, it simply stops.

WHAT DOES NOT WORK:
open (LIST, "database.txt");
@list = <LIST>;
close(LIST);
foreach $line(@list) {
chomp($line);
$line =~ tr/a-zA-Z0-9\#\&\;\,\.\-\_\:\\\/\|\$\=\?\~\'\%/ /cs;
($url,$title,$description,$keywords)=split(/\|/, $line);
...The Perl script processes the lines
...until it reaches a line that contains
...that foreign or unprintable character,
...then the script stops!
...Even applying a regex to clean the line does not work.

What I need is a way to clean the complete file or list of certain characters first, before splitting it.

CODE I MADE BUT DOES NOT WORK:
open (LIST, "database.txt");
@list = <LIST>;
close(LIST);
@list =~ tr/a-zA-Z0-9\#\&\;\,\.\-\_\:\\\/\|\$\=\?\~\'\%/ /cs;
.... Then write back to a file

Who can help me to apply a regex to a complete file or List?

Thanks and regards,
Sanuk

myself

7:36 am on Aug 22, 2003 (gmt 0)

10+ Year Member



Keep known characters instead of removing unknown:

$str =~ s/([^0-9a-z]+)/ /igm;

Note also that you cannot do this, because tr/// operates on a single scalar, not on an array:


@list =~ tr/.../.../;
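
Since tr/// binds to one scalar at a time, one way to clean a whole list is to loop over its elements. This is only a minimal sketch: the sample lines are made up and stand in for the contents of database.txt, and the character set kept (tab, newline, carriage return, printable ASCII) is an assumption, not the exact set from the question.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the lines of database.txt.
my @list = ("http://example.com/|Title\x1a one|desc|kw\n",
            "http://example.org/|Second\x01|desc|kw\n");

# tr/// works on one scalar at a time; inside the loop $_ is an alias
# for each element, so the array itself is modified in place.
for (@list) {
    # c = complement the set, d = delete what matches:
    # delete everything that is NOT tab, newline, CR, or printable ASCII.
    tr/\t\n\r\x20-\x7e//cd;
}

print @list;
```

The pipe delimiter is printable ASCII, so it survives and the lines can still be split on it afterwards.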

sanuk

8:02 am on Aug 22, 2003 (gmt 0)

10+ Year Member



Hi,

That's the whole problem!
I cannot apply a regex to $str or $line,
because the "foreign characters" in the file (@list)
do not allow me to split the list into strings, whether they are called $str, $line or $anything.

I need a way to apply a regex to the complete file (@list or <LIST>) before splitting it into strings such as $str or $line!

PS: The foreign and/or unprintable characters are shown in the flatfile text-database as "white squares"

Regards,
Sanuk

myself

8:27 am on Aug 22, 2003 (gmt 0)

10+ Year Member



A file can contain any data; no character in an external file can stop Perl. Would you send me that file (or some part of it)?

sanuk

3:20 pm on Aug 22, 2003 (gmt 0)

10+ Year Member



Hi Myself,

I have sent you by Sticky Mail a very small test flatfile of 3 lines; every line includes this "unprintable" or "foreign" character.

If Sticky Mail strips the character, let me know, and I will upload a small test file to my server and Sticky Mail you the URL where you can download it.

It is not that Perl stops!
The script stops splitting the lines when it reaches this character and then goes on executing the rest of the script.

Let's say such a character is on line 3,700 of a 25,000-line flatfile. The script splits lines until it reaches this character, then stops splitting and continues with the rest of the script, which writes to a new file. But it writes only 3,700 lines to the file, not 25,000.
And the last line (3,700) stops exactly where this character was.

Regards,
Sanuk

timster

7:05 pm on Aug 22, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How about this?

open (LIST, "/path/to/file.txt") or die "Can't open input";
while (<LIST>) {
s/[^\s\w\#\&\;\,\.\-\_\:\\\/\\$\=\?\~\'\% ]//g
and print "Deleted suspect character with ordinal value: ",ord($&); # This might not report all whacky characters
push(@list, $_);
}

If it works, let me know what character it blames...
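
To report every suspect character rather than only the last one matched, one option is to count ordinals in a loop. This is a rough sketch, not a drop-in replacement for the script above: the helper name suspect_ordinals and the sample string are made up, and "suspect" is assumed here to mean anything that is neither whitespace nor printable ASCII.

```perl
use strict;
use warnings;

# Return a hash mapping ordinal value => count for every character in
# $text that is neither whitespace nor printable ASCII.
sub suspect_ordinals {
    my ($text) = @_;
    my %count;
    # With /g in scalar context, each pass finds the next match.
    $count{ ord($1) }++ while $text =~ /([^\s\x20-\x7e])/g;
    return %count;
}

# Hypothetical sample containing two chr(26) and one chr(0).
my %seen = suspect_ordinals("a|b\x1ac\n\x1a\x00");

for my $ord (sort { $a <=> $b } keys %seen) {
    printf "ordinal %d seen %d time(s)\n", $ord, $seen{$ord};
}
```

Because the pipe is printable ASCII, it is not counted, so a result of 124 cannot show up here.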

[edited by: jatar_k at 7:15 pm (utc) on Aug. 22, 2003]
[edit reason] turned off smiles [/edit]

sanuk

8:31 pm on Aug 22, 2003 (gmt 0)

10+ Year Member



Hi Timster,

Thanks for the reply.
No, your little script does not work.
At first I thought it worked because it returned the value 124,
but then I saw that the escaped pipe symbol was not in your regex.
As the delimiter of the flat database is the pipe, that is what gave 124.

Anyone who wants a 3-line test database (2 KB) in text format with those characters inside, just Sticky Mail me and I will send you the URL of my site where you can download it.

Regards,
Sanuk

myself

7:58 am on Aug 23, 2003 (gmt 0)

10+ Year Member



Hi, sanuk,
I tried your data. Funnily enough, that unprintable character happens to have the code 26 (Ctrl-Z), which is treated as EOF (end of file) when a file is read in text mode on DOS/Windows. To keep Perl from stopping at that character, just tell it to use binary mode:

binmode FILEHANDLE;

The following code prints all the words:

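open(F, "test") or die "Can't open test: $!";
binmode F; # read raw bytes so chr(26) is not taken as end of file
foreach my $str (<F>){
# replace every run of non-alphanumeric characters with one space
$str =~ s{[^0-9a-zA-Z]+}{ }g;
my @words = split / /, $str;
print join ', ', @words;
print "\n";
}
close F;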

And here is the result:

http, www, intracen, org, photos, picture, frames, and, paintings, modern, antiques
http, www, intracen, org, watches, Watches, and, Clocks, antique, wooden, swiss
http, www, intracen, org, artwork, Art, collections, sculpture, antique, pottery

sanuk

10:31 am on Aug 23, 2003 (gmt 0)

10+ Year Member



Hi,

Thanks Myself,

So how can I take a 20 MB file "database.txt", clean out those characters, and save the database again as, let's say, "database_clean.txt"?

Could you help me with a little Perl code for this?

Regards,
Sanuk

myself

6:35 pm on Aug 23, 2003 (gmt 0)

10+ Year Member




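open(INPUT, "original_file_name.txt") or die "Can't open input: $!";
open(OUTPUT, ">new_file_name.txt") or die "Can't open output: $!";
binmode INPUT; # read raw bytes so chr(26) is not taken as end of file
binmode OUTPUT;
while (my $str = <INPUT>){
# the line ending is replaced too, so a newline is added back on output
$str =~ s{[^0-9a-zA-Z]+}{ }g;
print OUTPUT $str, "\n";
}
close INPUT;
close OUTPUT;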

That will replace every run of characters that are neither digits nor letters with a single space.
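
If the pipe delimiters and the punctuation in the URLs should survive the cleaning, a narrower filter is one possibility. This is only a sketch under assumptions: the names clean_line, clean_file and clean.pl are made up for illustration, and the data is assumed to be plain ASCII, so accented letters would be deleted along with the control characters.

```perl
use strict;
use warnings;

# Delete every byte that is not tab or printable ASCII; the pipe
# delimiter and ordinary punctuation survive. Line endings are removed
# too, so the caller writes its own "\n".
sub clean_line {
    my ($line) = @_;
    $line =~ tr/\t\x20-\x7e//cd;
    return $line;
}

# Copy $in_path to $out_path line by line, cleaning each line.
sub clean_file {
    my ($in_path, $out_path) = @_;
    open my $in,  '<', $in_path  or die "Cannot open $in_path: $!";
    open my $out, '>', $out_path or die "Cannot open $out_path: $!";
    binmode $in;     # do not treat chr(26) as end of file
    binmode $out;
    while (my $line = <$in>) {
        print {$out} clean_line($line), "\n";
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
}

# Hypothetical usage: perl clean.pl database.txt database_clean.txt
clean_file(@ARGV) if @ARGV == 2;

# In-memory demonstration with a made-up line.
my @fields = split /\|/, clean_line("http://example.com/|Old\x1a clocks|antique\r\n");
print join(', ', @fields), "\n";
```

Reading line by line keeps memory use low, which matters for a file of around 20 MB more than it would for the 3-line test file.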

sanuk

7:46 pm on Aug 24, 2003 (gmt 0)

10+ Year Member



Hi,

Thanks Myself!
Works great

Regards,
Sanuk