

Perl Code: I want to only spider html pages with my site search

but how?

     
10:01 pm on Nov 23, 2005 (gmt 0)

Full Member

10+ Year Member

joined:May 25, 2005
posts:220
votes: 0


I am using a site search that spiders my site. However, the spider indexes ALL web pages. The spider follows a Perl configuration. I got the Perl code to index only HTML files and files with no extension. Here is an example of a working Perl script:


test_url => sub {
    my $url = shift;
    return 1 if $url->path =~ /\.html?$/;   # .html or .htm
    # any files that have a dot are not html:
    return 1 unless $url->path =~ /\./;
    return 0;
},

Great, it works! I can also index only the files with no extension by using this code:


test_url => sub {
    my $url = shift;
    # any files that have a dot are not html:
    return 1 unless $url->path =~ /\./;
    return 0;
},

However, for some reason, I can't get the script to spider only HTML files. I tried this:


test_url => sub {
    my $url = shift;
    return 1 if $url->path =~ /\.html?$/;   # .html or .htm
    return 0;
},

But the spider indexes ALL the garbage pages when I use this script. Can anyone help me figure out what is wrong with the Perl code above? I would like to use a Perl script to spider ONLY HTML files. The logical approach isn't working for some reason.

2:34 am on Nov 24, 2005 (gmt 0)

Full Member

10+ Year Member

joined:July 23, 2003
posts:227
votes: 0


That is not a working script; it is a snippet from one. The pattern match looks fine, so you probably need to look at the rest of the program.

2:43 am on Nov 24, 2005 (gmt 0)

Full Member

10+ Year Member

joined:May 25, 2005
posts:220
votes: 0


I swear to you, it doesn't work.

Something is not right with the snippet of code.
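
One way to settle whether the snippet or the rest of the configuration is at fault is to run the callback on its own, outside the spider. The sketch below is only an illustration: MockURL is a hypothetical stand-in for whatever URL object the spider actually passes in (it provides just the path method the callback uses), and the sample paths are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the URL object the spider hands to test_url.
# It only implements the path() method that the callback calls.
package MockURL;
sub new  { my ($class, $path) = @_; return bless { path => $path }, $class }
sub path { return $_[0]{path} }

package main;

# The callback from the thread, unchanged: accept only .html / .htm paths.
my $test_url = sub {
    my $url = shift;
    return 1 if $url->path =~ /\.html?$/;   # .html or .htm
    return 0;
};

# Made-up sample paths: run each one through the callback and print the verdict.
for my $path (qw( /index.html /about.htm /logo.gif /cgi-bin/search.pl /docs/ /readme )) {
    my $verdict = $test_url->( MockURL->new($path) ) ? 'index' : 'skip';
    printf "%-20s %s\n", $path, $verdict;
}
```

If the verdicts come out as expected here, the pattern itself is fine, and the place to look is how the callback is wired into the rest of the configuration. Note that a directory-style path such as /docs/ is rejected by this test, since it does not end in .html or .htm.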

 
