I have a script to create a little search engine,and I would create a filter in order to prevent that certain terms are considered suitables for the search.This in order to avoid to get search results for "stop words" such as "any" "and" "for" and so on.
The script provides a length filter that cuts off all typed words shorter than a certain length,but it can't work at all as there are,in my case,short terms that,instead are suitable for the search,such as "kit" or "cap" and so on...So I must give up this idea.
Alternatively,the script provides also a smut filter that matches the input words with the ones included in an exclusion list (external file)and if a match is found,it censores the results;
Obviously I would need to change this output:I need, instead,if a match is found,simply the input word/s is/are ignored for the search.
This allows that if someone types in a "stop word" it is not considerd for the search but if it is included in a record it is shown (otherwise,all records containing stop words such us "and" or "for" would be censored!...)
Could someone help me to edit properly this filter please?
Any helpful reply will be very appreciated.
Here is the smut filter:
# Get the smut filter informationunless (open (DATA,"$smutfile")) {die (&error);}
if ($uselock eq '1') {
flock DATA, 2;
seek DATA, 0, 0;
}
@smutinfo = <DATA>;
if ($uselock eq '1') {
flock DATA, 8;
}
close (DATA);
foreach $smutline (@smutinfo){
$smutfilter = $smutfilter.$smutline;
@smutwords = split (/::/,$smutfilter);
}
Here is the search routine:
# Routine for 'words' searchif ($FORM{'mode'} eq "words") {
chomp($FORM{'keywords'});
$searchstring=$FORM{'keywords'};
@words = split (/ /,$searchstring);
foreach $word (@words) {
$wordlength = length($word);
if ($wordlength < $minword) {
$word = split (/ /,$searchstring);
}
}
&heading;
$entries = @input;
if ($position == 0) {
$currentline = $entries;
} else {
$currentline = $position;
}
$found="0";
print "<CENTER><FONT $font SIZE=3><B>Search Results : </B>'$FORM{'keywords'}'<P></FONT></CENTER>";
print "<HR WIDTH=400>";
print "<TABLE WIDTH=500><TR><TD ALIGN=LEFT><FONT $font SIZE=2>";
until ($found > 9 ¦¦ $currentline == 0) {
foreach $word (@words) {
if ($input[$currentline] =~ /$word/i) {
@data = split (/::/,$input[$currentline]);
if ($data[4] ne "") {
if ($safekey eq "on" && $match == 0) {
foreach $smutword (@smutwords) {
if ($input[$currentline] =~ /$smutword/i) {
$smut = 1;
}
}
unless ($smut == 1) {
print "<A HREF=\"$data[0]\"><B>$data[1]</B></A><BR>";
print "$data[4]<BR>";
print "<I>$data[0]</I><P>";
++$found;
++$match;
}
$smut = 0;
}
if ($safekey eq "off" && $match == 0) {
print "<A HREF=\"$data[0]\"><B>$data[1]</B></A><BR>";
print "$data[4]<BR>";
print "<I>$data[0]</I><P>";
++$found;
++$match;
}
}
}
}
--$currentline;
$match = 0;
}
}
If you need I also could post the whole script
Thanks in advance guys
Sincerely
I am not trying to sound like a know-it-all, I am far from the best educated perl coder, but your code is very poorly written and has a number of security problems.
Sorry, I do not understand : if you do not search for word "and" and a record contains "blood, sweat and tears", why this record'd be "censored", if you search for "blood" for example?
You touched the central point of the question:
It's just what I don't want happens.But for the current filter setting it's so.This because the smut filter works either ignoring any smut word typed in by the user,and prevents that any record containing smut words is shown.I need,instead,that the filter only stops the typed words,loosing any effectiveness on the records returning.
To better clarify my need,and answering to Perl_diver,I cannot rely on the length filter,because,in my case there are three letters words that could be suitable for the search...
So,ideas on how to edit properly that smut filter?
Thanks again
Sincerely
open(DATA, "<$smutfile") or &error("bla-bla about the reason : $!");
if($uselock) {flock(DATA,1)} # note - no need for "2"; no need for seek() too
@smutinfo = <DATA>;
close(DATA);
That's all you need to keep stop words list in memory. No need in senseless foreach loop provided below (and 2-nd curly bracket of this had to be placed a line above imho).
Keep stopwords in a file and use the simplest format :
word0
word1
word2
.... etc.
= Why use '::' as separator? (or I can not understand the idea, really.) \n is the best separator here.
You need to create an appropriate regular expression for your purposes (do not know your exact conditions and requirements). Suppose : while($search_phrase =~ m/([a-zA-Z0-9]+)/g) or simply (\w+), or use \b, or....
Next, suppose you have a search phrase entered by a user, e.g. "with the beatles" (you may use case sensitive search or ignore this (this regex needs more time to work), of course; we do not touch this question now.)
ok. we have :
while($search_phrase =~ m/([a-zA-Z0-9]+)/g) {
my($next_word) = $1; #better to keep it in temp. privare var.
my($sw) = 0; # tmp flag
#Is this $next_word a stop word?
for(@smutinfo) {
chomp($_);
if($_ eq $next_word) { # we found that $next_word
$sw = 1;#... is a stop word
last
}
}
if($sw) { #... no need to search for this one
next
}
else {
...do your search
}
}
If you have "with", "the"... etc.... in your stopwords file, $sw flag will be raised twice, and you'll search for "beatles" only in your db (or file). So, you'll find "Beatles for sale" in any case too, regardless "for" is a stopword for sure.
If you need to search for exact phrase, you need not check for stop words at all imho.
To better clarify my need,and answering to Perl_diver,I cannot rely on the length filter,because,in my case there are three letters words that could be suitable for the search...
What length filter? If you don't want words shorter than three letters use grep() on your origianl array of search terms:
if ($FORM{'mode'} eq "words") {
chomp($FORM{'keywords'});#<- is this really necessary?
$searchstring=$FORM{'keywords'};
# here $searchstring needs to be validated and html/javascript needs to be escaped
# don't use split(/ /,$searchstring)
# use split(/\s+/,$searchstring)
@words = grep(length($_) > 2, split (/\s+/,$searchstring));
As regard to the stop word list handling,well;
This would be the filter:
open(DATA, "<$smutfile") or &error("bla-bla about the reason : $!");
if($uselock) {flock(DATA,1)} # note - no need for "2"; no need for seek() too
@smutinfo = <DATA>;
close(DATA);
That replaces the current one,right?
And this would be the expression:
while($search_phrase =~ m/([a-zA-Z0-9]+)/g) {
my($next_word) = $1; #better to keep it in temp. privare var.
my($sw) = 0; # tmp flag#Is this $next_word a stop word?
for(@smutinfo) {
chomp($_);
if($_ eq $next_word) { # we found that $next_word
$sw = 1;#... is a stop word
last
}
}if($sw) { #... no need to search for this one
next
}
else {
...do your search
}
}
Right?
Where should I put it precisely in the above search array in order to avoid synthax errors?
I'm a bit confused...
Perl is quite unknown for me,so I need a further help...
here is the whole script:
###############################
#!/usr/local/bin/perl
$base = '/home/bla/public_html/search/base.txt';
# Change this to the PATH (not the URL) of the head.txt file
# (include the filename)
$headfile = '/home/bla/public_html/search/head.txt';
# Change this to the PATH (not the URL) of the foot.txt file
# (include the filename)
$footfile = '/home/bla/public_html/search/foot.txt';
# Change this to the PATH (not the URL) of the respond.txt file
# (include the filename)
$respondfile = '/home/bla/public_html/search/respond.txt';
# Change this to the PATH (not the URL) of the smut.txt file
# Any word found in smut.txt is assumed to be adult material
# therefore you can control what is censored and what isn't
# (include the filename)
$smutfile = '/home/bla/public_html/search/smut.txt';
# Change this to the URL of this script
# (include the filename)
$scripturl = 'http://www.blabla.com/cgi-bin/search.cgi';
# Edit this one to choose the font for the search results
# DO NOT use " or any special characters
# Use below for an example of what is allowed
# Also do not set a font size as the script does this automatically
$font = 'FACE=arial,helvetica COLOR=000000';
# Change this to the minimum search word length
# This is to exclude searches for "the", "and", "a", etc.
$minword = '3';
# Enter the maximum number of characters you want to allow for
# the 'title' field for new site submissions
$maxtitle = '50';
# Enter the maximum number of characters you want to allow for
# the 'description' field for new site submissions
$maxdescription = '150';
# Enter the maximum number of characters you want to allow for
# the 'keywords' field for new site submissions
$maxkeywords = '50';
# How many URLs do you want displayed on the New URLs page
$numnew = '3';
# If you want to use flock to avoid corrupt files by double access
# leave this line as is...if you don't then change the 1 to a 0
$uselock = '1';
# If you want to automatically send an autorespond e-mail to visitors
# who submit their URL to the database then leave this line as is
# If you don't, then change the 1 to a 0
$userespond = '1';
# Get the form variables
if ($ENV{'REQUEST_METHOD'} eq 'GET') {
$buffer = $ENV{'QUERY_STRING'};
}
else {
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
}
# Break em up into a format the script can read
@pairs = split(/&/, $buffer);
foreach $pair (@pairs) {
($name, $value) = split(/=/, $pair);
$value =~ tr/+/ /;
$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
$FORM{$name} = $value;
}
# Get the heading information
unless (open (DATA,"$headfile")) {die (&error);}
if ($uselock eq '1') {
flock DATA, 2;
seek DATA, 0, 0;
}
@headinfo = <DATA>;
if ($uselock eq '1') {
flock DATA, 8;
}
close (DATA);
foreach $headline (@headinfo){
$heading = $heading.$headline;
}
# Get the footer information
unless (open (DATA,"$footfile")) {die (&error);}
if ($uselock eq '1') {
flock DATA, 2;
seek DATA, 0, 0;
}
@footinfo = <DATA>;
if ($uselock eq '1') {
flock DATA, 8;
}
close (DATA);
foreach $footline (@footinfo){
$footer = $footer.$footline;
}
####Get the smut filter information:Should I replace the code below:
unless (open (DATA,"$smutfile")) {die (&error);}
if ($uselock eq '1') {
flock DATA, 2;
seek DATA, 0, 0;
}
@smutinfo = <DATA>;
if ($uselock eq '1') {
flock DATA, 8;
}
close (DATA);
foreach $smutline (@smutinfo){
$smutfilter = $smutfilter.$smutline;
@smutwords = split (/::/,$smutfilter);
#### ...With this one?
open(DATA, "<$smutfile") or &error("bla-bla about the reason : $!");
if($uselock) {flock(DATA,1)} # note - no need for "2"; no need for seek() too
@smutinfo = <DATA>;
close(DATA);
# Determine what part of the script we need
if ($FORM{'action'} eq "showadd") {
&showadd;
}
if ($FORM{'action'} eq "addurl") {
&addurl;
}
if ($FORM{'action'} eq "newurls"){
&newurls;
}
if ($FORM{'action'} eq "randomurl"){
&randomurl;
}
# Assign shorter variable names
# (Laziness on my part - but I find the longer
# a script gets the more work typing long
# variable names becomes.)
$position = $FORM{'code'};
$addshow = 0;
$noshow = 0;
$match = 0;
if ($FORM{'safe'} ne "on") {
$safekey = "off";
} else {
$safekey = "on";
}
# Begin the search process and output the results
unless (open (DATA,"$base")) {die (&error);}
if ($uselock eq '1') {
flock DATA, 2;
seek DATA, 0, 0;
}
@input = <DATA>;
if ($uselock eq '1') {
flock DATA, 8;
}
close (DATA);
#### Routine for 'words' search:Where should I put your code?
if ($FORM{'mode'} eq "words") {
$searchstring=$FORM{'keywords'};
@words = split (/ /,$searchstring);
foreach $word (@words) {##<<<<<<<<<<<<<...here?
$wordlength = length($word);
if ($wordlength < $minword) {
&stringshort;
}
}
&heading;
$entries = @input;
if ($position == 0) {
$currentline = $entries;
} else {
$currentline = $position;
}
$found="0";
print "<CENTER><FONT $font SIZE=3><B>Search Results : </B>'$FORM{'keywords'}'<P></FONT></CENTER>";
print "<HR WIDTH=400>";
print "<FONT $font SIZE=2>";
until ($found > 9 ¦¦ $currentline == 0) {
foreach $word (@words) {
if ($input[$currentline] =~ /$word/i) {
@data = split (/::/,$input[$currentline]);
if ($data[4] ne "") {
if ($safekey eq "on" && $match == 0) {##<<<<<<<<<...here?
foreach $smutword (@smutwords) {
if ($input[$currentline] =~ /$smutword/i) {
$smut = 1;
}
}
unless ($smut == 1) {##<<<<<<<<<<...here?
print "<A HREF=\"$data[0]\"><B>$data[1]</B></A><BR>";
print "$data[4]<BR>";
print "<I>$data[0]</I><P>";
++$found;
++$match;
}
$smut = 0;
}
if ($safekey eq "off" && $match == 0) {##<<<<<<<<<...here?
print "<A HREF=\"$data[0]\"><B>$data[1]</B></A><BR>";
print "$data[4]<BR>";
print "<I>$data[0]</I><P>";
++$found;
++$match;
}
}
}
}
--$currentline;
$match = 0;
}
}
####...Replacing the whole routine?
# Routine for 'phrases' search
if ($FORM{'mode'} eq "phrases") {
$searchstring=$FORM{'keywords'};
$wordlength = length($FORM{'keywords'});
if ($wordlength < $minword) {
&phrase;
}
&heading;
$entries = @input;
if ($position == 0) {
$currentline=$entries;
} else {
$currentline = $position;
}
print "<CENTER><FONT $font SIZE=3><B>Search Results : </B>'$FORM{'keywords'}'<P></FONT></CENTER>";
print "<HR WIDTH=400>";
print "<FONT $font SIZE=2>";
until ($found > 9 ¦¦ $currentline == 0) {
if ($input[$currentline] =~ /$FORM{'keywords'}/i) {
@data = split (/::/,$input[$currentline]);
if ($data[4] ne "") {
if ($safekey eq "on") {
foreach $smutword (@smutwords) {
if ($input[$currentline] =~ /$smutword/i) {
$smut = 1;
}
}
unless ($smut == 1) {
print "<A HREF=\"$data[0]\"><B>$data[1]</B></A><BR>";
print "$data[4]<BR>";
print "<I>$data[0]</I><P>";
++$found;
}
$smut = 0;
}
if ($safekey eq "off") {
print "<A HREF=\"$data[0]\"><B>$data[1]</B></A><BR>";
print "$data[4]<BR>";
print "<I>$data[0]</I><P>";
++$found;
}
}
}
--$currentline;
}
}
print "</FONT>";
&footer;
################# SUBROUTINES ######################
sub heading {
print "Content-type: text/html\n\n";
print "$heading";
}
sub footer {
$keyencode=$FORM{'keywords'};
$keyencode =~ tr/ /+/;
if ($found > 9) {
$position=$currentline;
print "<CENTER><FONT $font size=3><A HREF=\"$scripturl?keywords=$keyencode&code=$position&mode=$FORM{'mode'}&safe=$safekey \"><B>More Results</B></A><BR><HR WIDTH=400></FONT></CENTER>";
}
else {
unless ($addshow == 1) {
print "<CENTER><FONT $font SIZE=2><B>End of Results.</B><BR><HR WIDTH=400></FONT></CENTER>\n";
}
}
unless ($noshow == 1) {
unless ($addshow == 1) {
print "<CENTER><P><FORM METHOD=post ACTION=$scripturl><TABLE><TR><TD VALIGN=TOP><FONT $font SIZE=3><B>Search For :</B></FONT></TD><TD><INPUT TYPE=TEXT NAME=keywords SIZE=25 VALUE=\"$FORM{'keywords'}\"><BR><FONT $font SIZE=2><B>Mode :</B><INPUT TYPE=\"radio\" NAME=\"mode\" VALUE=\"words\" CHECKED>Words<INPUT TYPE=\"radio\" NAME=\"mode\" VALUE=\"phrases\">Phrase<BR><B>Safe : </B><INPUT TYPE=\"checkbox\" NAME=\"safe\" CHECKED>Omit Offensive Slang</FONT></TD><TD VALIGN=TOP ALIGN=CENTER WIDTH=60><INPUT TYPE=SUBMIT VALUE=\"Search!\"></TD></TR></TABLE></FORM><p></CENTER>\n";
}
if ($FORM{'keywords'} ne "") {
print "<CENTER><FONT $font size=2><B>Search for \"$FORM{'keywords'}\" in these search engines...<br><A HREF=\"http://www.altavista.com/cgi-bin/query?pg=q&what=web&q=$keyencode\">AltaVista</A> <A HREF=\"http://search.dejanews.com/dnquery.xp?query=$keyencode&defaultOp=AND&svcclass=dncurrent&maxhits=20\">DejaNews</A> <A HREF=\"http://search.excite.com/search.gw?search=$keyencode\">Excite</A> <A HREF=\"http://guide-p.infoseek.com/Titles/?qt=$keyencode\">GO Network</A> <A HREF=\"http://www.hotbot.com/?MT=$keyencode&SM=MC&DV=0&LG=any&DC=10&DE=2&_v=2&OPs=MDRTP&Search.x=38&Search.y=15\">HotBot</A> <A HREF=\"http://www.lycos.com/cgi-bin/pursuit?query=$keyencode&maxhits=20\">Lycos</A> <A HREF=\"http://www.webcrawler.com/cgi-bin/WebQuery?searchText=$keyencode&maxHits=20\">WebCrawler</A> <A HREF=\"http://search.yahoo.com/bin/search?p=$keyencode\">Yahoo!</A></B><P></CENTER>\n";
}
}
&generate;
print "$footer";
exit;
}
############################################
Thanks for your precious help
Sincerely