homepage Welcome to WebmasterWorld Guest from 54.237.184.242
register, free tools, login, search, pro membership, help, library, announcements, recent posts, open posts,
Become a Pro Member

Home / Forums Index / Code, Content, and Presentation / Perl Server Side CGI Scripting
Forum Library, Charter, Moderators: coopster & jatar k & phranque

Perl Server Side CGI Scripting Forum

    
exclusion list
How should I manage it?
specter




msg:432419
 2:07 pm on Mar 18, 2006 (gmt 0)

Hi everyone,

I have a script to create a little search engine,and I would create a filter in order to prevent that certain terms are considered suitables for the search.This in order to avoid to get search results for "stop words" such as "any" "and" "for" and so on.
The script provides a length filter that cuts off all typed words shorter than a certain length,but it can't work at all as there are,in my case,short terms that,instead are suitable for the search,such as "kit" or "cap" and so on...So I must give up this idea.

Alternatively,the script provides also a smut filter that matches the input words with the ones included in an exclusion list (external file)and if a match is found,it censores the results;
Obviously I would need to change this output:I need, instead,if a match is found,simply the input word/s is/are ignored for the search.
This allows that if someone types in a "stop word" it is not considerd for the search but if it is included in a record it is shown (otherwise,all records containing stop words such us "and" or "for" would be censored!...)
Could someone help me to edit properly this filter please?
Any helpful reply will be very appreciated.

Here is the smut filter:

# Get the smut filter information

unless (open (DATA,"$smutfile")) {die (&error);}
if ($uselock eq '1') {
flock DATA, 2;
seek DATA, 0, 0;
}
@smutinfo = <DATA>;
if ($uselock eq '1') {
flock DATA, 8;
}
close (DATA);
foreach $smutline (@smutinfo){
$smutfilter = $smutfilter.$smutline;
@smutwords = split (/::/,$smutfilter);
}

Here is the search routine:

# Routine for 'words' search

if ($FORM{'mode'} eq "words") {
chomp($FORM{'keywords'});
$searchstring=$FORM{'keywords'};
@words = split (/ /,$searchstring);
foreach $word (@words) {
$wordlength = length($word);
if ($wordlength < $minword) {
$word = split (/ /,$searchstring);
}
}
&heading;
$entries = @input;
if ($position == 0) {
$currentline = $entries;
} else {
$currentline = $position;
}
$found="0";
print "<CENTER><FONT $font SIZE=3><B>Search Results : </B>'$FORM{'keywords'}'<P></FONT></CENTER>";
print "<HR WIDTH=400>";
print "<TABLE WIDTH=500><TR><TD ALIGN=LEFT><FONT $font SIZE=2>";
until ($found > 9 $currentline == 0) {
foreach $word (@words) {
if ($input[$currentline] =~ /$word/i) {
@data = split (/::/,$input[$currentline]);
if ($data[4] ne "") {
if ($safekey eq "on" && $match == 0) {
foreach $smutword (@smutwords) {
if ($input[$currentline] =~ /$smutword/i) {
$smut = 1;
}
}
unless ($smut == 1) {
print "<A HREF=\"$data[0]\"><B>$data[1]</B></A><BR>";
print "$data[4]<BR>";
print "<I>$data[0]</I><P>";
++$found;
++$match;
}
$smut = 0;
}
if ($safekey eq "off" && $match == 0) {
print "<A HREF=\"$data[0]\"><B>$data[1]</B></A><BR>";
print "$data[4]<BR>";
print "<I>$data[0]</I><P>";
++$found;
++$match;

}
}
}
}
--$currentline;
$match = 0;
}
}

If you need I also could post the whole script

Thanks in advance guys

Sincerely

 

perl_diver




msg:432420
 6:56 pm on Mar 18, 2006 (gmt 0)

Personally, I would not allow search words less than three in length (at to it so of a I no etc etc etc), words that are three bytes (or letters) in length are acceptable.

I am not trying to sound like a know-it-all, I am far from the best educated perl coder, but your code is very poorly written and has a number of security problems.

Moby_Dim




msg:432421
 7:26 pm on Mar 18, 2006 (gmt 0)

Sorry, I do not understand : if you do not search for word "and" and a record contains "blood, sweat and tears", why this record'd be "censored", if you search for "blood" for example?

specter




msg:432422
 11:35 pm on Mar 18, 2006 (gmt 0)

Sorry, I do not understand : if you do not search for word "and" and a record contains "blood, sweat and tears", why this record'd be "censored", if you search for "blood" for example?

You touched the central point of the question:

It's just what I don't want happens.But for the current filter setting it's so.This because the smut filter works either ignoring any smut word typed in by the user,and prevents that any record containing smut words is shown.I need,instead,that the filter only stops the typed words,loosing any effectiveness on the records returning.

To better clarify my need,and answering to Perl_diver,I cannot rely on the length filter,because,in my case there are three letters words that could be suitable for the search...

So,ideas on how to edit properly that smut filter?

Thanks again

Sincerely

Moby_Dim




msg:432423
 9:44 am on Mar 19, 2006 (gmt 0)

Well, let me propose my own vision of the solution.

open(DATA, "<$smutfile") or &error("bla-bla about the reason : $!");
if($uselock) {flock(DATA,1)} # note - no need for "2"; no need for seek() too
@smutinfo = <DATA>;
close(DATA);

That's all you need to keep stop words list in memory. No need in senseless foreach loop provided below (and 2-nd curly bracket of this had to be placed a line above imho).

Keep stopwords in a file and use the simplest format :
word0
word1
word2
.... etc.
= Why use '::' as separator? (or I can not understand the idea, really.) \n is the best separator here.

You need to create an appropriate regular expression for your purposes (do not know your exact conditions and requirements). Suppose : while($search_phrase =~ m/([a-zA-Z0-9]+)/g) or simply (\w+), or use \b, or....

Next, suppose you have a search phrase entered by a user, e.g. "with the beatles" (you may use case sensitive search or ignore this (this regex needs more time to work), of course; we do not touch this question now.)

ok. we have :

while($search_phrase =~ m/([a-zA-Z0-9]+)/g) {
my($next_word) = $1; #better to keep it in temp. privare var.
my($sw) = 0; # tmp flag

#Is this $next_word a stop word?

for(@smutinfo) {
chomp($_);
if($_ eq $next_word) { # we found that $next_word
$sw = 1;#... is a stop word
last
}
}

if($sw) { #... no need to search for this one
next
}
else {
...do your search
}
}

If you have "with", "the"... etc.... in your stopwords file, $sw flag will be raised twice, and you'll search for "beatles" only in your db (or file). So, you'll find "Beatles for sale" in any case too, regardless "for" is a stopword for sure.

If you need to search for exact phrase, you need not check for stop words at all imho.

perl_diver




msg:432424
 8:49 pm on Mar 19, 2006 (gmt 0)


To better clarify my need,and answering to Perl_diver,I cannot rely on the length filter,because,in my case there are three letters words that could be suitable for the search...

What length filter? If you don't want words shorter than three letters use grep() on your origianl array of search terms:

if ($FORM{'mode'} eq "words") {
chomp($FORM{'keywords'});#<- is this really necessary?
$searchstring=$FORM{'keywords'};
# here $searchstring needs to be validated and html/javascript needs to be escaped
# don't use split(/ /,$searchstring)
# use split(/\s+/,$searchstring)

@words = grep(length($_) > 2, split (/\s+/,$searchstring));


specter




msg:432425
 11:27 pm on Mar 20, 2006 (gmt 0)

Thanks for your replies guys.

As regard to the stop word list handling,well;

This would be the filter:

open(DATA, "<$smutfile") or &error("bla-bla about the reason : $!");
if($uselock) {flock(DATA,1)} # note - no need for "2"; no need for seek() too
@smutinfo = <DATA>;
close(DATA);

That replaces the current one,right?

And this would be the expression:

while($search_phrase =~ m/([a-zA-Z0-9]+)/g) {
my($next_word) = $1; #better to keep it in temp. privare var.
my($sw) = 0; # tmp flag

#Is this $next_word a stop word?

for(@smutinfo) {
chomp($_);
if($_ eq $next_word) { # we found that $next_word
$sw = 1;#... is a stop word
last
}
}

if($sw) { #... no need to search for this one
next
}
else {
...do your search
}
}

Right?
Where should I put it precisely in the above search array in order to avoid synthax errors?

Moby_Dim




msg:432426
 8:59 pm on Mar 21, 2006 (gmt 0)

Specter, i do not know the meaning of some variables in your snippet (note here that #use strict pragma is very important for code maintainability); some mistakes are there too (e.g. foreach $word (@words) loop). Easier to create a new search procedure to replace "...do your search" words.

specter




msg:432427
 8:22 pm on Mar 22, 2006 (gmt 0)

Well,

I'm a bit confused...
Perl is quite unknown for me,so I need a further help...
here is the whole script:

###############################

#!/usr/local/bin/perl

$base = '/home/bla/public_html/search/base.txt';

# Change this to the PATH (not the URL) of the head.txt file
# (include the filename)

$headfile = '/home/bla/public_html/search/head.txt';

# Change this to the PATH (not the URL) of the foot.txt file
# (include the filename)

$footfile = '/home/bla/public_html/search/foot.txt';

# Change this to the PATH (not the URL) of the respond.txt file
# (include the filename)

$respondfile = '/home/bla/public_html/search/respond.txt';

# Change this to the PATH (not the URL) of the smut.txt file
# Any word found in smut.txt is assumed to be adult material
# therefore you can control what is censored and what isn't
# (include the filename)

$smutfile = '/home/bla/public_html/search/smut.txt';

# Change this to the URL of this script
# (include the filename)

$scripturl = 'http://www.blabla.com/cgi-bin/search.cgi';

# Edit this one to choose the font for the search results
# DO NOT use " or any special characters
# Use below for an example of what is allowed
# Also do not set a font size as the script does this automatically

$font = 'FACE=arial,helvetica COLOR=000000';

# Change this to the minimum search word length
# This is to exclude searches for "the", "and", "a", etc.

$minword = '3';

# Enter the maximum number of characters you want to allow for
# the 'title' field for new site submissions

$maxtitle = '50';

# Enter the maximum number of characters you want to allow for
# the 'description' field for new site submissions

$maxdescription = '150';

# Enter the maximum number of characters you want to allow for
# the 'keywords' field for new site submissions

$maxkeywords = '50';

# How many URLs do you want displayed on the New URLs page

$numnew = '3';

# If you want to use flock to avoid corrupt files by double access
# leave this line as is...if you don't then change the 1 to a 0

$uselock = '1';

# If you want to automatically send an autorespond e-mail to visitors
# who submit their URL to the database then leave this line as is
# If you don't, then change the 1 to a 0

$userespond = '1';

# Get the form variables

if ($ENV{'REQUEST_METHOD'} eq 'GET') {
$buffer = $ENV{'QUERY_STRING'};
}
else {
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
}

# Break em up into a format the script can read

@pairs = split(/&/, $buffer);
foreach $pair (@pairs) {
($name, $value) = split(/=/, $pair);
$value =~ tr/+/ /;
$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
$FORM{$name} = $value;
}

# Get the heading information

unless (open (DATA,"$headfile")) {die (&error);}
if ($uselock eq '1') {
flock DATA, 2;
seek DATA, 0, 0;
}
@headinfo = <DATA>;
if ($uselock eq '1') {
flock DATA, 8;
}
close (DATA);
foreach $headline (@headinfo){
$heading = $heading.$headline;
}

# Get the footer information

unless (open (DATA,"$footfile")) {die (&error);}
if ($uselock eq '1') {
flock DATA, 2;
seek DATA, 0, 0;
}
@footinfo = <DATA>;
if ($uselock eq '1') {
flock DATA, 8;
}
close (DATA);
foreach $footline (@footinfo){
$footer = $footer.$footline;
}

####Get the smut filter information:Should I replace the code below:

unless (open (DATA,"$smutfile")) {die (&error);}
if ($uselock eq '1') {
flock DATA, 2;
seek DATA, 0, 0;
}
@smutinfo = <DATA>;
if ($uselock eq '1') {
flock DATA, 8;
}
close (DATA);
foreach $smutline (@smutinfo){
$smutfilter = $smutfilter.$smutline;
@smutwords = split (/::/,$smutfilter);

#### ...With this one?

open(DATA, "<$smutfile") or &error("bla-bla about the reason : $!");
if($uselock) {flock(DATA,1)} # note - no need for "2"; no need for seek() too
@smutinfo = <DATA>;
close(DATA);

# Determine what part of the script we need

if ($FORM{'action'} eq "showadd") {
&showadd;
}

if ($FORM{'action'} eq "addurl") {
&addurl;
}

if ($FORM{'action'} eq "newurls"){
&newurls;
}

if ($FORM{'action'} eq "randomurl"){
&randomurl;
}

# Assign shorter variable names
# (Laziness on my part - but I find the longer
# a script gets the more work typing long
# variable names becomes.)

$position = $FORM{'code'};
$addshow = 0;
$noshow = 0;
$match = 0;
if ($FORM{'safe'} ne "on") {
$safekey = "off";
} else {
$safekey = "on";
}

# Begin the search process and output the results

unless (open (DATA,"$base")) {die (&error);}
if ($uselock eq '1') {
flock DATA, 2;
seek DATA, 0, 0;
}
@input = <DATA>;
if ($uselock eq '1') {
flock DATA, 8;
}
close (DATA);

#### Routine for 'words' search:Where should I put your code?

if ($FORM{'mode'} eq "words") {
$searchstring=$FORM{'keywords'};
@words = split (/ /,$searchstring);
foreach $word (@words) {##<<<<<<<<<<<<<...here?
$wordlength = length($word);
if ($wordlength < $minword) {
&stringshort;
}
}
&heading;
$entries = @input;
if ($position == 0) {
$currentline = $entries;
} else {
$currentline = $position;
}
$found="0";
print "<CENTER><FONT $font SIZE=3><B>Search Results : </B>'$FORM{'keywords'}'<P></FONT></CENTER>";
print "<HR WIDTH=400>";
print "<FONT $font SIZE=2>";
until ($found > 9 $currentline == 0) {
foreach $word (@words) {
if ($input[$currentline] =~ /$word/i) {
@data = split (/::/,$input[$currentline]);
if ($data[4] ne "") {
if ($safekey eq "on" && $match == 0) {##<<<<<<<<<...here?
foreach $smutword (@smutwords) {
if ($input[$currentline] =~ /$smutword/i) {
$smut = 1;
}
}
unless ($smut == 1) {##<<<<<<<<<<...here?
print "<A HREF=\"$data[0]\"><B>$data[1]</B></A><BR>";
print "$data[4]<BR>";
print "<I>$data[0]</I><P>";
++$found;
++$match;
}
$smut = 0;
}
if ($safekey eq "off" && $match == 0) {##<<<<<<<<<...here?
print "<A HREF=\"$data[0]\"><B>$data[1]</B></A><BR>";
print "$data[4]<BR>";
print "<I>$data[0]</I><P>";
++$found;
++$match;
}
}
}
}
--$currentline;
$match = 0;
}
}

####...Replacing the whole routine?

# Routine for 'phrases' search

if ($FORM{'mode'} eq "phrases") {
$searchstring=$FORM{'keywords'};
$wordlength = length($FORM{'keywords'});
if ($wordlength < $minword) {
&phrase;
}
&heading;
$entries = @input;
if ($position == 0) {
$currentline=$entries;
} else {
$currentline = $position;
}
print "<CENTER><FONT $font SIZE=3><B>Search Results : </B>'$FORM{'keywords'}'<P></FONT></CENTER>";
print "<HR WIDTH=400>";
print "<FONT $font SIZE=2>";
until ($found > 9 $currentline == 0) {
if ($input[$currentline] =~ /$FORM{'keywords'}/i) {
@data = split (/::/,$input[$currentline]);
if ($data[4] ne "") {
if ($safekey eq "on") {
foreach $smutword (@smutwords) {
if ($input[$currentline] =~ /$smutword/i) {
$smut = 1;
}
}
unless ($smut == 1) {
print "<A HREF=\"$data[0]\"><B>$data[1]</B></A><BR>";
print "$data[4]<BR>";
print "<I>$data[0]</I><P>";
++$found;
}
$smut = 0;
}
if ($safekey eq "off") {
print "<A HREF=\"$data[0]\"><B>$data[1]</B></A><BR>";
print "$data[4]<BR>";
print "<I>$data[0]</I><P>";
++$found;
}
}
}
--$currentline;
}
}
print "</FONT>";
&footer;

################# SUBROUTINES ######################

sub heading {
print "Content-type: text/html\n\n";
print "$heading";
}

sub footer {
$keyencode=$FORM{'keywords'};
$keyencode =~ tr/ /+/;
if ($found > 9) {
$position=$currentline;
print "<CENTER><FONT $font size=3><A HREF=\"$scripturl?keywords=$keyencode&code=$position&mode=$FORM{'mode'}&safe=$safekey \"><B>More Results</B></A><BR><HR WIDTH=400></FONT></CENTER>";
}
else {
unless ($addshow == 1) {
print "<CENTER><FONT $font SIZE=2><B>End of Results.</B><BR><HR WIDTH=400></FONT></CENTER>\n";
}
}
unless ($noshow == 1) {
unless ($addshow == 1) {
print "<CENTER><P><FORM METHOD=post ACTION=$scripturl><TABLE><TR><TD VALIGN=TOP><FONT $font SIZE=3><B>Search For :</B></FONT></TD><TD><INPUT TYPE=TEXT NAME=keywords SIZE=25 VALUE=\"$FORM{'keywords'}\"><BR><FONT $font SIZE=2><B>Mode :</B><INPUT TYPE=\"radio\" NAME=\"mode\" VALUE=\"words\" CHECKED>Words<INPUT TYPE=\"radio\" NAME=\"mode\" VALUE=\"phrases\">Phrase<BR><B>Safe : </B><INPUT TYPE=\"checkbox\" NAME=\"safe\" CHECKED>Omit Offensive Slang</FONT></TD><TD VALIGN=TOP ALIGN=CENTER WIDTH=60><INPUT TYPE=SUBMIT VALUE=\"Search!\"></TD></TR></TABLE></FORM><p></CENTER>\n";
}
if ($FORM{'keywords'} ne "") {
print "<CENTER><FONT $font size=2><B>Search for \"$FORM{'keywords'}\" in these search engines...<br><A HREF=\"http://www.altavista.com/cgi-bin/query?pg=q&what=web&q=$keyencode\">AltaVista</A> <A HREF=\"http://search.dejanews.com/dnquery.xp?query=$keyencode&defaultOp=AND&svcclass=dncurrent&maxhits=20\">DejaNews</A> <A HREF=\"http://search.excite.com/search.gw?search=$keyencode\">Excite</A> <A HREF=\"http://guide-p.infoseek.com/Titles/?qt=$keyencode\">GO Network</A> <A HREF=\"http://www.hotbot.com/?MT=$keyencode&SM=MC&DV=0&LG=any&DC=10&DE=2&_v=2&OPs=MDRTP&Search.x=38&Search.y=15\">HotBot</A> <A HREF=\"http://www.lycos.com/cgi-bin/pursuit?query=$keyencode&maxhits=20\">Lycos</A> <A HREF=\"http://www.webcrawler.com/cgi-bin/WebQuery?searchText=$keyencode&maxHits=20\">WebCrawler</A> <A HREF=\"http://search.yahoo.com/bin/search?p=$keyencode\">Yahoo!</A></B><P></CENTER>\n";
}
}
&generate;
print "$footer";
exit;
}

############################################

Thanks for your precious help

Sincerely

perl_diver




msg:432428
 4:14 am on Mar 23, 2006 (gmt 0)


If you need I also could post the whole script

go ahead and post the whole script.

Global Options:
 top home search open messages active posts  
 

Home / Forums Index / Code, Content, and Presentation / Perl Server Side CGI Scripting
rss feed

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Home ¦ Free Tools ¦ Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About ¦ Library ¦ Newsletter
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved