I am running a small local directory (search-engine).
But I am having problems with "foreign characters" and/or "unprintable characters" in a plain-text database.
Because of these characters I cannot split the @list into separate $lines: as soon as the Perl script reaches an "unprintable character" it simply stops.
WHAT DOES NOT WORK:
open (LIST, "database.txt");
@list = <LIST>;
close(LIST);
foreach $line(@list) {
chomp($line);
$line =~ tr/a-zA-Z0-9\#\&\;\,\.\-\_\:\\\/\|\$\=\?\~\'\%/ /cs;
($url,$title,$description,$keywords)=split(/\|/, $line);
...The Perl script processes the lines
...until it reaches a line that contains
...that foreign or unprintable character
...then the script stops!
...Even applying a regex to clean the line does not work
What I need is a way to first clean the complete file or list of certain characters.
I wrote the following code, but it does not work:
CODE I MADE BUT DOES NOT WORK:
open (LIST, "database.txt");
@list = <LIST>;
close(LIST);
@list =~ tr/a-zA-Z0-9\#\&\;\,\.\-\_\:\\\/\|\$\=\?\~\'\%/ /cs;
.... Then write back to a file
Who can help me apply a regex to a complete file or list?
Thanks and regards,
Sanuk
That's the whole problem!
I cannot apply a regex to $str or $line,
as the presence of "foreign characters" in the file (@list)
does not allow me to split the list into strings, whether they are called $str or $line or $anything.
I need a way to apply a regex to the complete file, @list or <LIST>, before splitting it into strings $str or $line!
PS: The foreign and/or unprintable characters are shown in the flat-file text database as "white squares".
Regards,
Sanuk
I have sent you, by Sticky Mail, a very small test flat file of 3 lines, in which every line includes this "unprintable" or "foreign" character.
If Sticky Mail strips the character, let me know; I will then upload a small test file to my server and Sticky Mail you the URL where you can download it.
It is not that Perl stops!
The script stops splitting the lines when it reaches this character and then goes on executing the rest of the script.
Let's say we have such a character on line 3,700 of a 25,000-line flat file. The script will split lines until it reaches this character, then stop splitting and continue with the rest of the script, which writes to a new file. But it will only write 3,700 lines to that file, not 25,000.
And the last line (3,700) stops exactly where this character was.
Regards,
Sanuk
open (LIST, "/path/to/file.txt") or die "Can't open input: $!";
while (<LIST>) {
s/[^\s\w\#\&\;\,\.\-\_\:\\\/\$\=\?\~\'\% ]//g
and print "Deleted suspect character with ordinal value: ",ord($&),"\n"; # $& holds only the last match, so this might not report all whacky characters
push(@list, $_);
}
If it works, let me know what character it blames...
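A variation on the same idea, offered only as a rough sketch: read the whole file in binary mode and tally every byte that falls outside printable ASCII (plus tab, CR and LF). Because the pipe (ordinal 124) is printable ASCII, it is not reported. The sub name suspect_bytes and the script name suspects.pl are made up for this illustration; nothing here comes from the original posts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every byte outside printable ASCII (plus tab, CR, LF) in a string.
# Returns a hash reference: ordinal value => number of occurrences.
# (suspect_bytes is a hypothetical helper name for this sketch.)
sub suspect_bytes {
    my ($data) = @_;
    my %count;
    $count{ ord $1 }++ while $data =~ /([^\x20-\x7E\t\r\n])/g;
    return \%count;
}

# Hypothetical command-line use: perl suspects.pl database.txt
if (@ARGV) {
    my $file = shift @ARGV;
    open(my $fh, '<', $file) or die "Can't open $file: $!";
    binmode $fh;                          # raw bytes, no text-mode translation
    my $data = do { local $/; <$fh> };    # slurp the whole file at once
    close $fh;

    my $found = suspect_bytes($data);
    printf "byte %3d (0x%02X): %d time(s)\n", $_, $_, $found->{$_}
        for sort { $a <=> $b } keys %$found;
}
```

If the report lists a byte such as 26 (Ctrl-Z), that would be one candidate for the "white square", but the report itself is the evidence; the sketch assumes nothing about which byte is present.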
[edited by: jatar_k at 7:15 pm (utc) on Aug. 22, 2003]
[edit reason] turned off smiles [/edit]
Thanks for the reply.
No, your little script does not work.
At first I thought it worked because it returned the value 124,
but then I saw that the escaped pipe symbol was not in your regex.
And as the delimiter of the flat database is the pipe, that is what gave 124.
Anyone who wants a 3-line test database (2 KB) in text format with those characters inside, just Sticky Mail me and I will send you the URL on my site where you can download it.
Regards,
Sanuk
binmode FILEHANDLE;
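open(F, "test") or die "Can't open test: $!";
binmode F; # read raw bytes so text mode does not treat any byte specially
foreach my $str (<F>){
$str =~ s{[^0-9a-zA-Z]+}{ }g; # collapse each run of non-alphanumerics into one space
my @words = split / /, $str;
print join ', ', @words;
print "\n";
}
close F;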
http, www, intracen, org, photos, picture, frames, and, paintings, modern, antiques
http, www, intracen, org, watches, Watches, and, Clocks, antique, wooden, swiss
http, www, intracen, org, artwork, Art, collections, sculpture, antique, pottery
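Building on the binmode approach above, here is a minimal sketch of what the original question asked for: clean every line of the file first, then split on the pipe and write the result to a new file. The allowed-character list mirrors the tr/// in the question; the sub name clean_line and the output file name are assumptions made for illustration. One possible explanation for the truncation, which the thread does not confirm, is that on Windows a file read in text mode ends at the first Ctrl-Z (byte 26), and binmode switches that behaviour off.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace every run of characters outside the allowed set with one space.
# The allowed set mirrors the tr/// list in the original question.
# (clean_line is a hypothetical helper name for this sketch.)
sub clean_line {
    my ($line) = @_;
    $line =~ tr{a-zA-Z0-9#&;,.\-_:\\/|$=?~'%}{ }cs;
    return $line;
}

# Hypothetical use: perl clean.pl database.txt database.clean.txt
if (@ARGV == 2) {
    my ($in, $out) = @ARGV;
    open(my $src, '<', $in)  or die "Can't open $in: $!";
    open(my $dst, '>', $out) or die "Can't write $out: $!";
    binmode $src;    # read raw bytes, so a stray Ctrl-Z cannot end the read early
    binmode $dst;
    while (my $line = <$src>) {
        $line =~ s/\r?\n\z//;               # strip an LF or CRLF line ending
        my $clean = clean_line($line);
        # The cleaned line can now be split on the pipe delimiter as usual.
        my ($url, $title, $description, $keywords) = split /\|/, $clean;
        print {$dst} "$clean\n";
    }
    close $src;
    close $dst or die "Can't close $out: $!";
}
```

Reading line by line keeps memory use flat on a 25,000-line file; tr/// works on one string at a time, which is why it cannot be bound to @list directly.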