
Forum Moderators: coopster & jatar k & phranque


Perl Code: I want to only spider html pages with my site search

but how?

   
10:01 pm on Nov 23, 2005 (gmt 0)

10+ Year Member



I am using a site search that spiders my site. However, the spider indexes ALL webpages. The spider follows a Perl configuration. I got the Perl code to index only HTML files and files with no extension. Here is an example of a working Perl script:


test_url => sub {
    my $url = shift;
    return 1 if $url->path =~ /\.html?$/; # .html or .htm
    # any files that have a dot are not html:
    return 1 unless $url->path =~ /\./;
    return 0;
},

Great, it works! I can also index only the files that have no extension by using this code:


test_url => sub {
    my $url = shift;
    # any files that have a dot are not html:
    return 1 unless $url->path =~ /\./;
    return 0;
},

However, for some reason, I can't get the script to spider only the HTML files. I tried this:


test_url => sub {
    my $url = shift;
    return 1 if $url->path =~ /\.html?$/; # .html or .htm
    return 0;
},

But the spider indexes ALL garbage pages when I use this script. Can anyone help me figure out what is wrong with the Perl code above? I would like to use a Perl script to spider ONLY html files. The logical approaches aren't working for some reason.
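
One way to narrow this down is to run the three callbacks outside the spider against a handful of sample paths. The sketch below does that. It assumes the spider hands the callback an object with a path() method, as the snippets imply; FakeUrl, run_filters, and the sample paths are made up for illustration and are not part of the spider.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the URL object the spider passes in.
# The snippets only rely on a path() method, so that is all it has.
{
    package FakeUrl;
    sub new  { my ($class, $path) = @_; return bless { path => $path }, $class }
    sub path { return $_[0]{path} }
}

# The three callbacks from the post, with their logic unchanged.
my %filters = (
    html_and_no_ext => sub {
        my $url = shift;
        return 1 if $url->path =~ /\.html?$/;    # .html or .htm
        return 1 unless $url->path =~ /\./;      # no dot anywhere in the path
        return 0;
    },
    no_ext_only => sub {
        my $url = shift;
        return 1 unless $url->path =~ /\./;
        return 0;
    },
    html_only => sub {
        my $url = shift;
        return 1 if $url->path =~ /\.html?$/;
        return 0;
    },
);

# Run every filter against every sample path and collect the verdicts.
sub run_filters {
    my @paths = @_;
    my %result;
    for my $name (keys %filters) {
        for my $path (@paths) {
            $result{$name}{$path} = $filters{$name}->(FakeUrl->new($path));
        }
    }
    return \%result;
}

# Made-up sample paths covering the cases discussed in the post.
my @samples = ('/index.html', '/old.htm', '/about', '/', '/pic.gif', '/cgi/search.pl');
my $result  = run_filters(@samples);

for my $name (sort keys %$result) {
    for my $path (@samples) {
        printf "%-16s %-16s %s\n", $name, $path,
            $result->{$name}{$path} ? 'index' : 'skip';
    }
}
```

In isolation, html_only accepts only the .html and .htm paths and rejects everything else, including "/" and "/about". If the spider still indexes other pages with that callback in place, the regex itself is unlikely to be the cause.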

2:34 am on Nov 24, 2005 (gmt 0)

10+ Year Member



That is not a working script; it is a snippet from one. The pattern match looks fine, so you probably need to look at the rest of the program.
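
A quick way to check the rest of the program is to find out whether the callback is being called at all, and what it returns for each URL. Here is a minimal sketch of a logging wrapper. with_logging and FakeUrl are hypothetical names invented for this example, and it assumes the spider calls test_url with an object that has a path() method, as the snippet suggests.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the spider's URL object (only path() is needed).
{
    package FakeUrl;
    sub new  { my ($class, $path) = @_; return bless { path => $path }, $class }
    sub path { return $_[0]{path} }
}

# Hypothetical helper: wraps a test_url-style callback and prints each
# decision to STDERR, so you can see whether the spider ever calls it.
sub with_logging {
    my ($callback) = @_;
    return sub {
        my $url     = shift;
        my $verdict = $callback->($url) ? 1 : 0;
        print STDERR "test_url: ", $url->path, " => $verdict\n";
        return $verdict;
    };
}

# The html-only filter from the original post.
my $html_only = sub {
    my $url = shift;
    return 1 if $url->path =~ /\.html?$/;    # .html or .htm
    return 0;
};

my $logged = with_logging($html_only);

# Try it on two made-up paths.
$logged->(FakeUrl->new('/index.html'));
$logged->(FakeUrl->new('/pic.gif'));
```

In the real config the wrapped callback would go where the original one is, e.g. test_url => with_logging(sub { ... }). If nothing shows up on STDERR during a crawl, the spider is probably not reading this part of the config.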

2:43 am on Nov 24, 2005 (gmt 0)

10+ Year Member



I swear to you, it doesn't work.

Something is not right with the snippet of code.