test_url => sub {
    my $url = shift;
    return 1 if $url->path =~ /\.html?$/;  # .html or .htm
    # any files that have a dot are not html:
    return 1 unless $url->path =~ /\./;
    return 0;
},
Great, it works! I can also index only the files that have no extension by using this code:
test_url => sub {
    my $url = shift;
    # any files that have a dot are not html:
    return 1 unless $url->path =~ /\./;
    return 0;
},
However, for some reason, I can't get the script to spider only "html" files. I tried this:
test_url => sub {
    my $url = shift;
    return 1 if $url->path =~ /\.html?$/;  # .html or .htm
    return 0;
},
But when I use this version, the spider indexes ALL pages, garbage included. Can anyone help me figure out what is wrong with the Perl code above? I would like the spider to index ONLY html files. The obvious logic isn't working for some reason.
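
To rule out the regex itself, the html-only callback can be run on its own, outside the spider. This is only a minimal sketch: it assumes the URI module is installed and that the spider passes test_url a URI object (which the ->path calls above suggest). The sample URLs are made up for the check.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;    # assumption: the spider hands test_url an object of this kind

# The html-only callback from the post, copied unchanged.
my $html_only = sub {
    my $url = shift;
    return 1 if $url->path =~ /\.html?$/;    # .html or .htm
    return 0;
};

# Hypothetical sample URLs, invented for this check.
my @samples = qw(
    http://example.com/index.html
    http://example.com/page.htm
    http://example.com/docs/readme
    http://example.com/image.gif
    http://example.com/script.php?page=a.html
);

# Record and print the callback's verdict for each sample.
my %accepted;
for my $sample (@samples) {
    my $verdict = $html_only->( URI->new($sample) );
    $accepted{$sample} = $verdict;
    printf "%-45s %s\n", $sample, $verdict ? 'accept' : 'reject';
}
```

If only the .html and .htm URLs come back as "accept" here, the callback logic is probably fine, and the cause is more likely somewhere else in the spider configuration than in this regex.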