Perl Code: I want to only spider html pages with my site search

I am using a site search that spiders my site. However, the spider indexes ALL web pages. The spider follows a Perl configuration file. I got the Perl code to index only HTML pages plus URLs that have no file extension. Here is an example of a working configuration entry:


    test_url => sub {
        my $url = shift;
        return 1 if $url->path =~ /\.html?$/;   # .html or .htm
        # any other path that contains a dot is not html:
        return 1 unless $url->path =~ /\./;
        return 0;
    },

Great, it works! I can also index only the URLs with no extension by using this code:


    test_url => sub {
        my $url = shift;
        # any path that contains a dot is not html:
        return 1 unless $url->path =~ /\./;
        return 0;
    },

However, for some reason, I can't get the script to spider only ".html" and ".htm" files. I tried this:


    test_url => sub {
        my $url = shift;
        return 1 if $url->path =~ /\.html?$/;   # .html or .htm
        return 0;
    },

But the spider indexes ALL the garbage pages when I use this version. Can anyone help me figure out what is wrong with the Perl code above? I would like the spider to index ONLY HTML files. The obvious logic isn't working for some reason.
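
One way to narrow this down might be to run the html-only callback outside the spider and see what it returns for a few paths. The sketch below is only a test harness, not part of any spider: MockURL is a hypothetical stand-in for whatever URL object the spider passes in (it is assumed to need nothing beyond a path method), and the sample paths are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the URL object the spider passes to test_url.
# It only provides the path() method that the callback uses.
package MockURL;
sub new  { my ($class, $path) = @_; return bless { path => $path }, $class }
sub path { return $_[0]->{path} }

package main;

# The html-only callback from the question, unchanged.
my $test_url = sub {
    my $url = shift;
    return 1 if $url->path =~ /\.html?$/;   # .html or .htm
    return 0;
};

# Made-up sample paths, for illustration only.
for my $path ('/index.html', '/about.htm', '/docs/', '/image.gif', '/script.php') {
    my $verdict = $test_url->( MockURL->new($path) ) ? 'index' : 'skip';
    print "$verdict  $path\n";
}
```

If only the .html and .htm paths come back as "index" here, the regex itself is probably fine, and it may be worth checking other things, for example whether the spider is actually reading the edited configuration.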


chopin2256

wruppert

chopin2256
