This script lets you replace HTML tags and change the values in attributes. Imagine you wanted to replace all div tags which have a class attribute of quote to p tags. Or you would like to change all links pointing to aaron.html to stevie.php. Using this little script makes such tasks very easy.
Some examples
Here´s how you would do these things:
./tagrep /path/to/html/files div p TAGNAME class=quote
./tagrep /path/to/html/files aaron.html stevie.php href
Syntax
tagrep takes quite a few arguments on the command line.
Usage: ./tagrep <dir> <pattern> <replacement> <attribute¦TAGNAME>
[<(id¦class)=identifier>] [<empty_tags>]
Here is what they are:
Code
Here is the code of this script. Simply save it as tagrep and adjust the path to the Perl [perl.com] binary as needed.
#!/usr/bin/perl -w
# $Id: tagrep,v 1.2 2003/02/28 22:42:45 af Exp $
#
use [perldoc.com] strict [perldoc.com];
#
use [perldoc.com] File::Find [perldoc.com];
use [perldoc.com] HTML::Parser [perldoc.com];
#
my [perldoc.com] ($type, $id) = split [perldoc.com] /=/, $ARGV[4] if $ARGV[4];
my [perldoc.com] $empty = '';
$empty = join [perldoc.com] '', '¦', $ARGV[5] if $ARGV[5];
my [perldoc.com] $replace_end = 0;
#
my [perldoc.com] $p = HTML::Parser->new(api_version => 3);
$p->handler( start => sub {
my [perldoc.com] $tag = shift [perldoc.com];
my [perldoc.com] $attr = shift [perldoc.com];
my [perldoc.com] $changed = 0;
$replace_end++ if $replace_end>=1000 and $tag!~ /br$empty/o;
print [perldoc.com](OUT shift), return [perldoc.com] if $type
and (!exists [perldoc.com] $attr->{$type} or $attr->{$type}!~ /$id/o);
if ($ARGV[3] eq 'TAGNAME') {
$tag = $ARGV[2], $changed++, $replace_end=1000
if $tag =~ /$ARGV[1]/o;
} else {
$attr->{$ARGV[3]} =~ s!$ARGV[1]!$ARGV[2]!go,
$changed++ if exists [perldoc.com] $attr->{$ARGV[3]};
}
print [perldoc.com](OUT shift [perldoc.com]), return [perldoc.com] unless $changed;
print [perldoc.com] OUT '<', $tag, ' ', join [perldoc.com](' ', map [perldoc.com] {
join [perldoc.com] '', $_, '="', $attr->{$_}, '"'
} keys [perldoc.com] %$attr), '>'; }, "tagname,attr,text");
$p->handler( end => sub [perldoc.com] {
print [perldoc.com](OUT "</$ARGV[2]>"), return [perldoc.com]
if $ARGV[3] eq 'TAGNAME' and shift [perldoc.com] =~ /$ARGV[1]/o
and $replace_end==1000;
$replace_end--;
print [perldoc.com] OUT pop [perldoc.com]; }, "tagname,tagname,text");
$p->handler( default => sub [perldoc.com] { print [perldoc.com] OUT shift [perldoc.com] }, "text");
#
find(sub [perldoc.com] {
return [perldoc.com] if -d $File::Find::name;
return [perldoc.com] unless $File::Find::name =~ /\.html?$/;
open [perldoc.com] 'OUT', ">$File::Find::name.temp"
or die [perldoc.com] "Can't open $File::Find::name.temp: $!\n";
print [perldoc.com] "Examining $File::Find::name...\n";
$p->parse_file($File::Find::name)
or die [perldoc.com] "Can't parse file: $File::Find::name.temp: $!\n";
close [perldoc.com] 'OUT';
rename [perldoc.com]($File::Find::name, "$File::Find::name.orig");
rename [perldoc.com]("$File::Find::name.temp", $File::Find::name);
}, $ARGV[0] ¦¦ die [perldoc.com]
"Usage: $0 <dir> <pattern> <replacement> <attribute¦TAGNAME>
[<(id¦class)=identifier>]\n");
Please use at your own risk. I tested this script with quite a few of my HTML files. But they were mine and others may have a different style. The script should work for those as well, but you never know.
However, your original files will be saved as filename.orig. So you can always go back to that version. Just don´t run the script twice in a row since then it will override your original files and you´ll never get those back unless you made some backup.
Please post when this script does not work for you.
Andreas
This script lets you do search and replace operations on the content of HTML elements only. Imagine you wanted to replace a person´s name with a link to their website. Of course this should be done only if that name is not already within the content of an a element. Or suppose you wanted to do this but sometimes you used the person´s full name but sometimes only their first name.
Using this little script makes such tasks very easy.
Some examples
Here´s how you would do these things:
./htmlrep /path/to/html/files "Aaron C\." "<a href=''>Aaron C.</a>" body a
./htmlrep /path/to/html/files "(Aaron(\s+C\.)?)" "<a href=''>\$1</a>" body a
You can make your search case-insensitive by prepending your pattern with (?i). If you wanted to replace every title that contains aaron with Stevie B. you would call htmlrep like so:
./htmlrep /path/to/html/files "^(?i).*aaron.*$" "Stevie B." title
Syntax
htmlrep takes quite a few arguments on the command line.
Usage: ./htmlrep <dir> <pattern> <replacement> <elements+> <elements->
[<attribute=value>]
Here is what they are:
Code
Here is the code of this script. Simply save it as htmlrep and adjust the path to the Perl [perl.com] binary as needed.
#!/usr/bin/perl -w
# $Id: htmlrep,v 1.4 2003/03/01 13:51:37 af Exp $
#
use [perldoc.com] strict;
#
use [perldoc.com] File::Find [perldoc.com];
use [perldoc.com] HTML::Parser [perldoc.com];
#
my [perldoc.com] $sub = eval [perldoc.com] "sub [perldoc.com] { s [perldoc.com]!$ARGV[1]!$ARGV[2]!g }";
die [perldoc.com] $@ if $@;
#
my [perldoc.com] ($type, $id) = split /=/, $ARGV[5] if $ARGV[5];
my [perldoc.com] $r = 0;
my [perldoc.com] $rr = 0;
my [perldoc.com] $p = HTML::Parser [perldoc.com]->new(api_version => 3);
$p->handler( start => sub [perldoc.com] {
my [perldoc.com] $attr = pop [perldoc.com];
print [perldoc.com](OUT pop [perldoc.com]), $r=0, return [perldoc.com] if shift [perldoc.com] =~ /$ARGV[3]/o and $type
and ((exists $attr->{$type} and $attr->{$type}!~ /$id/o)
or!exists $attr->{$type});
$r=1 if shift [perldoc.com] =~ /$ARGV[3]/o;
$rr=$r, $r=0 if $ARGV[4] and shift [perldoc.com] =~ /$ARGV[4]/o;
print [perldoc.com] OUT pop [perldoc.com]; }, "tagname,tagname,tagname,text,attr");
$p->handler( end => sub [perldoc.com] {
$r=0 if shift [perldoc.com] =~ /$ARGV[3]/o;
$r=$rr if $ARGV[4] and shift [perldoc.com] =~ /$ARGV[4]/o;
print [perldoc.com] OUT pop [perldoc.com]; }, "tagname,tagname,text");
$p->handler( text => sub [perldoc.com] {
print [perldoc.com](OUT shift [perldoc.com]), return [perldoc.com] unless $r;
local $_ = shift [perldoc.com];
&$sub;
print [perldoc.com] OUT }, "text");
$p->handler( default => sub [perldoc.com] { print [perldoc.com] OUT shift [perldoc.com] }, "text");
#
find(sub [perldoc.com] {
return [perldoc.com] if -d $File::Find::name;
return [perldoc.com] unless $File::Find::name =~ /\.html?$/;
open [perldoc.com] 'OUT', ">$File::Find::name.temp"
or die [perldoc.com] "Can't open $File::Find::name.temp: $!\n";
$r = 0;
print [perldoc.com] "Examining $File::Find::name...\n";
$p->parse_file($File::Find::name)
or die [perldoc.com] "Can't parse file: $File::Find::name.temp: $!\n";
close [perldoc.com] 'OUT';
rename [perldoc.com]($File::Find::name, "$File::Find::name.orig");
rename [perldoc.com]("$File::Find::name.temp", $File::Find::name);
}, $ARGV[0] ¦¦ die [perldoc.com]
"Usage: $0 <dir> <pattern> <replacement> <elements+> [<elements->]
[<attribute=value>]\n");
Andreas
In Bag-O-Tricks for PHP II - some code snippets that should be helpful for all in creating dynamic sites - Highlighting Search Terms [webmasterworld.com] I explained how to highlight search terms on dynamically generated pages. Now how about highlighting search terms on static pages.
Obviously this will need some kind of script to alter the pages according to the search terms used. Now you could simply change all your static pages to PHP [php.net] and use the above mentioned method. There is a better way that will not require you to change your existing static pages. It uses mod_rewrite [httpd.apache.org] to transparently call a highlighting script when the referrer is a search engine.
.htaccess
The following two rules need to go into your .htaccess file.
RewriteCond [httpd.apache.org] %{HTTP_REFERER} google¦alltheweb¦brisbane [NC]
RewriteRule [httpd.apache.org] \.html$ hst.pl?file=%{REQUEST_FILENAME};%{QUERY_STRING}
script
This is the script that performs the highligting:
#!/usr/bin/Perl [perl.com]
#
use [perldoc.com] strict [perldoc.com];
use [perldoc.com] HTML::Parser [perldoc.com];
#
print [perldoc.com] <<END;
Content-Type: text/html
__EMPTY__LINE__
END
#
my %para = split [perldoc.com] /;¦=¦&/, $ENV{QUERY_STRING};
#
my $i = 'a';
my @rules = ();
foreach (split [perldoc.com](/[\s+]+/, $para{'q'})) {
push [perldoc.com] @rules, "\$x =~ s¦\\b$_\\b¦<em class='$i'>$_</em>¦i;";
$i++;
}
my $rules = join [perldoc.com] '', @rules;
my $m = eval [perldoc.com] "sub {my \$x = shift; $rules; return \$x;}";
#
my $p = HTML::Parser->new(api_version => 3);
$p->handler(default => sub { print shift }, "text");
$p->handler(start => sub
{ print [perldoc.com]('<style type="text/CSS [w3.org]">em.a{}em.b{}em.c{}</style>',
pop);
return [perldoc.com] unless shift eq 'body';
shift->handler(text => sub { print &$m(shift) }, "dtext");
}, "tagname,self,text");
#
$p->parse_file($para{file});
Styling highlighting
By putting some CSS [w3.org] rules into the empty em.a ... em.n container you can style each highlighted search term individually.
Andreas
Care to elaborate on that?
If you can answer these questions and show me that coding in PHP [php.net] is indeed 10 times faster then I will happily abandon Perl [perl.com] forever and switch to PHP [php.net].
I believe that both bags have their pros and cons and that especially the first two scripts in this bag are coded and run way faster in Perl [perl.com] than in PHP [php.net].
I´m looking forward to your answers, Allen.
Andreas
While this is certainly not a very safe way to prevent spam bots from picking up your email addresses it works for the time being since those bots are written with speed in mind so they generally do not decode entities before parsing the html code.
s [perldoc.com]/((?:mailto:)?[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+\.[a-zA-Z]{2,3})/encode_email($1)/ge;
#
sub [perldoc.com] encode_email {
my @email = split [perldoc.com] //, shift [perldoc.com];
for (my $i=0;$i<$#email;$i++) {
$email[$i] = sprintf [perldoc.com] "&#%d;", ord [perldoc.com]($email[$i]);
}
return [perldoc.com] join [perldoc.com] '', @email;
}
The script will look for anything that looks like a mailto: link and convert it into numeric entities. Browsers will decode the link while spam bots will not.
This method should be ok in the EU as well where you are required to put a working and easy to find and use email address on your site. I believe that JavaScript solutions will not suffice since you need JavaScript enabled to view the email address.
Andreas
I'd like to implement that script but I haven't done my Perl [perl.com] homework. Where do I put the script?
encode_email directly as well. Andreas
Here are two ways to improve on the initial version.
If you want to be able to expand variables in your replacement string you can change the following line
$attr->{$ARGV[3]} =~ s [perldoc.com]!$ARGV[1]!$ARGV[2]!go,
to read like so:
$attr->{$ARGV[3]} =~ s [perldoc.com]!$ARGV[1]!$ARGV[2]!g,
$attr->{$ARGV[3]} =~ s [perldoc.com]!(\$[a-zA-Z0-9_:]+)!$1!gee,
You could then do something like this:
./tagrep /var/www/html/aaron-carter/ '^friends.cgi$' \
'friends.cgi?file=$File::Find::name' href
This will add the file name as the file parameter to the URIs in your href [w3.org] attributes.
The most flexibility is gained when you can run arbitrary Perl [perl.com] code to build your replacement string. To allow for that replace these lines in the original script
$attr->{$ARGV[3]} =~ s!$ARGV[1]!$ARGV[2]!go,
$changed++ if exists $attr->{$ARGV[3]};
with these:
if (exists [perldoc.com] $attr->{$ARGV[3]}) {
$changed++;
if ((my $rep) = $ARGV[2] =~ m [perldoc.com]!^sub [perldoc.com]\s*{([^}]+)!) {
$attr->{$ARGV[3]} =~ s [perldoc.com]!$ARGV[1]!$rep!gee;
} else {
$attr->{$ARGV[3]} =~ s [perldoc.com]!$ARGV[1]!$ARGV[2]!g;
$attr->{$ARGV[3]} =~ s [perldoc.com]!(\$[a-zA-Z0-9_:]+)!$1!gee;
}
}
You could then do something like this:
./tagrep /var/www/html/aaron-carter/ '^friends.cgi$' \
'sub{(my $n = $File::Find::name) =~ \
s!^/var/www/html!http://www.aaroncarter.com!; \
return "friends.cgi?file=$n";}' href
This will add the URI of your HTML page as the file parameter to the URIs in your href [w3.org] attributes.
<a href="friends.cgi">email a friend</a>
will be changed to
<a href="friends.cgi?file=http://www.aaroncarter.com/aaron-carter/file.html
">email a friend</a>
Andreas
Have your script send you email
A simple subroutine that you can use in your scripts to have it send you email. This can be a nice feature on some events such as security alerts, people filling in forms, known ip's visiting your site etc..
$mailprogram should be your mailprogram, ie
$mailprogram = "/usr/sbin/sendmail";
and the syntax to use the sub is:
&Email($title,$message);
or
&Email("Title of email","Hi! This is an email from my script to myself");
sub Email {
my ($title, $message) = @_;
open (MAIL,"¦$mailprogram -t");
print MAIL "To: name\@example.com\n";
print MAIL "From: script\@example.com\n";
print MAIL "Subject: $title\n\n";
print MAIL "$message \n";
close(MAIL);
}
change the pipe character ¦ to a 'regular' pipe character, this has to do with the way regular pipes are shown on WebmasterWorld
Andreas
1¦¦774079¦¦Zack¦¦3
2¦¦902345¦¦Alan¦¦8
3¦¦148271¦¦Peter¦¦106
4¦¦008593¦¦Nancy¦¦9
5¦¦954311¦¦Victor¦¦0
open the file and read into @data:
open(FILE, "<thefile.txt");
@data = <FILE>;
close (FILE);
now the loop:
foreach $line(sort{lc((split(/\¦\¦/, $a))[2]) cmp lc((split(/\¦\¦/, $b))[2])} @data){
chomp $line if $line =~ /\n$/i;
($num,$ID,$name,$count) = split(/\¦\¦/, $line);
print "$num $ID $name $count<BR>";
}
lc using cmp sorts alpha regardless of case. To sort in reverse, just swap the $a and $b variables around.
to sort by a numeric column, lets say the second column, remove the lowercase(lc) command, and use the spaceship operator, so the first line of the loop would be this instead:
foreach $line(sort{(split(/\¦\¦/, $a))[1] <=> (split(/\¦\¦/, $b))[1]} @data){
Hope this quick elaboration helps...
Dave
[edited by: jatar_k at 12:01 am (utc) on April 19, 2003]
[edit reason] no sigs thanks [/edit]
Yeah you do, the post has been edited cause I put a sig up. I tested the script before I posted it and it works, that bit is missing now. Should be:
foreach $line(sort{(split(/\¦\¦/, $a))[1] <=> (split(/\¦\¦/, $b))[1]} @data){
I always chomp for safety, so I can use the last column item without wondering if there is a newline or not.
I have quite a few groovy perl bits I've written, I'll post them up one by one.
2 vars to the sub,
&comma($count, ",") first is the number, second is the symbol, depending on what country you are in. Use a "," for US, but a "." for some euro countries.
sub comma {
$count = @_[0];
$commer = @_[1];
if (length($count) =~ /[4-6]/){$count =~ s/(\d{1,3})(\d{3})/$1$commer$2/i;}
elsif (length($count) =~ /[7-9]/){$count =~ s/(\d{1,3})(\d{3})(\d{3})/$1$commer$2$commer$3/i;}
elsif (length($count) =~ /[10-12]/){$count =~ s/(\d{1,3})(\d{3})(\d{3})(\d{3})/$1$commer$2$commer$3$commer$4/i;}
return $count;
}#end sub
$count = "123456789012";
print &comma($count, ",");
prints this:
123,456,789,012
If you need the whole package with images, send me a sticky and I'll send you the zip.
Please replace the broken pipes in this script with the proper ones on your keyboard.
#!/usr/bin/perl -w
$countfile = "counter.txt";
sub comma {
$count = @_[0];
$commer = @_[1];
if (length($count) =~ /[4-6]/){$count =~ s/(\d{1,3})(\d{3})/$1$commer$2/i;}
elsif (length($count) =~ /[7-9]/){$count =~ s/(\d{1,3})(\d{3})(\d{3})/$1$commer$2$commer$3/i;}
elsif (length($count) =~ /[10-12]/){$count =~ s/(\d{1,3})(\d{3})(\d{3})(\d{3})/$1$commer$2$commer$3$commer$4/i;}
return $count;
}#end sub
open (DATA, "<$countfile") ¦¦ &fejl("Cannot open counter");
flock (DATA, 2);
@count = <DATA>;
flock (DATA, 8);
close (DATA);
foreach $counta(@count){
$countaup = $counta + 1;
}
open (DATAB, ">$countfile") ¦¦ &fejl("Cannot open counter");
flock (DATAB, 2);
print DATAB $countaup;
flock (DATAB, 8);
close (DATAB);
@thecount = split(//, &comma($countaup, ","));
print "Content-type: text/html\n\n";
foreach $number (@thecount) {
if ($number eq ","){$number = "comma";}
push (@thenumber, "<IMG SRC=\"$number.gif\" BORDER=\"0\">");
}
print "<nobr>" , @thenumber , "</nobr>";
sub fejl {
print "Content-type: text/html\n\n";
print @_[0];
die;
}
die;
Say $url='www.domain.com/image%20dir/pic%201.jpg?hello=%22blah%20blah%22';
then
my $decoded_url = $url;
$decoded_url =~ s/%([a-f\d]{2})/pack('C', hex $1)/egi;
my $encoded_url = $decoded_url;
$encoded_url =~ s/([^\w\d;\/?:@&=+\$,-.!~*()'])/'%' . unpack('H2', $1)/eg;
Shawn