Bag-O-Tricks for Perl

tag replace - replace HTML tags or attribute values

This script lets you replace HTML tags and change the values of attributes. Suppose you want to replace all div tags that have a class attribute of quote with p tags, or change all links pointing to aaron.html so that they point to stevie.php. This little script makes such tasks easy.

Some examples

Here's how you would do these things:


./tagrep /path/to/html/files div p TAGNAME class=quote 
./tagrep /path/to/html/files aaron.html stevie.php href
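
A note on the second example: the pattern is used as a regular expression on the attribute value, so an unescaped dot matches any character. The following is a minimal sketch of that effect, not part of tagrep itself. The helper name rewrite_value is made up for illustration, and it only mimics the substitution on a single string.

```perl
use strict;
use warnings;

# Hypothetical helper (not in tagrep): apply <pattern> and <replacement>
# to one attribute value with a global substitution.
sub rewrite_value {
    my ($value, $pattern, $replacement) = @_;
    $value =~ s!$pattern!$replacement!g;
    return $value;
}

for my $href ('aaron.html', 'people/aaron.html#bio', 'aaronxhtml') {
    # Unescaped dot: "aaronxhtml" is rewritten as well
    my $loose  = rewrite_value($href, 'aaron.html',  'stevie.php');
    # Escaped dot: only a literal "aaron.html" is rewritten
    my $strict = rewrite_value($href, 'aaron\.html', 'stevie.php');
    print "$href -> $loose (loose), $strict (strict)\n";
}
```

On the command line this would mean passing 'aaron\.html' (quoted for the shell) as the pattern if only the literal file name should match.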

Syntax

tagrep takes quite a few arguments on the command line.


Usage: ./tagrep <dir> <pattern> <replacement> <attribute|TAGNAME>
        [<(id|class)=identifier>] [<empty_tags>]

Here is what they are:

dir: The path to a directory containing the HTML files to be searched for <pattern>, or the path to a single HTML file.
pattern: The pattern to search for. It is used in a Perl regular expression, so it has to be a valid Perl regex pattern. What the pattern is applied to depends on the setting of <attribute|TAGNAME>.
replacement: The string that replaces whatever the pattern matched. You may use backreferences here if your <pattern> creates them with parentheses.
attribute|TAGNAME: If you want to search in an attribute, specify its name. If you want to change tag names, use the keyword TAGNAME; the <pattern> is then matched against the tag name, which is replaced with <replacement>.
(id|class)=identifier: This optional parameter lets you put additional constraints on your replacements. Only elements that have the given attribute (either id or class) with the specified identifier will be searched for <pattern>. For example, id=aaron searches only the start tags of elements that have an id of aaron.
empty_tags: A Perl regular expression pattern matching the elements that appear without an end tag in your HTML documents. This parameter is required when you specify TAGNAME together with an additional constraint via <(id|class)=identifier>. To replace the tag name in an end tag only when its start tag has a certain id or class, the script needs to know which end tag belongs to which start tag. This is done by counting opening and closing tags. Since HTML allows quite a few optional end tags, and even empty elements, this approach only works when the script knows when not to expect an end tag. Figuring that out algorithmically is quite hard, and it is one reason why XML and XHTML abandoned optional end tags. To make it easier on me and the script, you have to tell the script which elements do not have an end tag. If you use li and td elements without end tags, you would specify li|td.
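
The counting idea described for empty_tags can be sketched as follows. This is a simplified illustration and not the logic tagrep uses verbatim: it works on a hand-written list of tag events instead of a parsed document, and the function name matching_end and the event layout are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: find the end tag that closes the start tag at
# index $start by counting nesting depth. Elements matching $empty_re
# never get an end tag, so they must not increase the depth.
sub matching_end {
    my ($events, $start, $empty_re) = @_;
    my $depth = 0;
    for my $i ($start .. $#$events) {
        my ($kind, $name) = @{ $events->[$i] };
        if ($kind eq 'start') {
            next if $name =~ /^(?:$empty_re)$/;   # no end tag expected
            $depth++;
        } else {
            $depth--;
            return $i if $depth == 0;             # back at the outer level
        }
    }
    return;                                       # no matching end tag
}

# <div><p><br></p><li></div> with br and li declared as end-tag-less
my @events = (
    [start => 'div'], [start => 'p'], [start => 'br'], [end => 'p'],
    [start => 'li'],  [end => 'div'],
);
my $idx = matching_end(\@events, 0, 'br|li');
print "matching end tag at index $idx\n";
```

If li were left out of the pattern here, the count would never return to zero and no matching end tag would be found, which is why the script needs to be told about such elements.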

Code

Here is the code of this script. Simply save it as tagrep and adjust the path to the Perl binary as needed.


#!/usr/bin/perl -w
# $Id: tagrep,v 1.2 2003/02/28 22:42:45 af Exp $
#
use strict;
#
use File::Find;
use HTML::Parser;
#
# Optional constraint such as class=quote, split into attribute name and identifier
my ($type, $id) = split /=/, $ARGV[4] if $ARGV[4];
# Optional regex alternatives naming elements that have no end tag
my $empty = '';
$empty = join '', '|', $ARGV[5] if $ARGV[5];
my $replace_end = 0;
#
my $p = HTML::Parser->new(api_version => 3);
$p->handler( start => sub {
    my $tag = shift;
    my $attr = shift;
    my $changed = 0;
    $replace_end++ if $replace_end>=1000 and $tag !~ /br$empty/o;
    print(OUT shift), return if $type
        and (!exists $attr->{$type} or $attr->{$type} !~ /$id/o);
    if ($ARGV[3] eq 'TAGNAME') {
        $tag = $ARGV[2], $changed++, $replace_end=1000
            if $tag =~ /$ARGV[1]/o;
    } else {
        $attr->{$ARGV[3]} =~ s!$ARGV[1]!$ARGV[2]!go,
            $changed++ if exists $attr->{$ARGV[3]};
    }
    print(OUT shift), return unless $changed;
    print OUT '<', $tag, ' ', join(' ', map {
        join '', $_, '="', $attr->{$_}, '"'
    } keys %$attr), '>'; }, "tagname,attr,text");
$p->handler( end => sub {
    print(OUT "</$ARGV[2]>"), return
        if $ARGV[3] eq 'TAGNAME' and shift =~ /$ARGV[1]/o
            and $replace_end==1000;
    $replace_end--;
    print OUT pop; }, "tagname,tagname,text");
$p->handler( default => sub { print OUT shift }, "text");
#
find(sub {
    return if -d $File::Find::name;
    return unless $File::Find::name =~ /\.html?$/;
    open 'OUT', ">$File::Find::name.temp"
        or die "Can't open $File::Find::name.temp: $!\n";
    print "Examining $File::Find::name...\n";
    $p->parse_file($File::Find::name)
        or die "Can't parse file: $File::Find::name.temp: $!\n";
    close 'OUT';
    # Keep the original as *.orig and move the rewritten copy into place
    rename($File::Find::name, "$File::Find::name.orig");
    rename("$File::Find::name.temp", $File::Find::name);
   }, $ARGV[0] || die
   "Usage: $0 <dir> <pattern> <replacement> <attribute|TAGNAME>
        [<(id|class)=identifier>]\n");

Please use at your own risk. I tested this script with quite a few of my HTML files, but they were mine, and others may write HTML in a different style. The script should work for those as well, but you never know.

However, your original files will be saved as filename.orig, so you can always go back to that version. Just don't run the script twice in a row: the second run overwrites the .orig files, and you will never get your originals back unless you made a backup elsewhere.
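
One way to guard against losing the originals is to pick a backup name that is not taken yet. This is a sketch only: tagrep does not do this, and the helper name free_backup_name and the numbered suffix scheme are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not in tagrep): return "$file.orig" if that name
# is free, otherwise "$file.orig.1", "$file.orig.2", and so on.
sub free_backup_name {
    my ($file) = @_;
    my $backup = "$file.orig";
    my $n = 1;
    # The condition is checked before each assignment
    $backup = "$file.orig." . $n++ while -e $backup;
    return $backup;
}

print "Would back up to: ", free_backup_name('index.html'), "\n";
```

In the script, the first rename call could then use the returned name as its target instead of the fixed .orig name.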

Please post if this script does not work for you.

Andreas

#!/usr/bin/perl -w
# $Id: htmlrep,v 1.4 2003/03/01 13:51:37 af Exp $
#
use strict;
#
use File::Find;
use HTML::Parser;
#
my $sub = eval "sub { s!$ARGV[1]!$ARGV[2]!g }";
die $@ if $@;
#
my ($type, $id) = split /=/, $ARGV[5] if $ARGV[5];
my $r = 0;
my $rr = 0;
my $p = HTML::Parser->new(api_version => 3);
$p->handler( start => sub {
    my $attr = pop;
    print(OUT pop), $r=0, return if shift =~ /$ARGV[3]/o and $type
        and ((exists $attr->{$type} and $attr->{$type} !~ /$id/o)
        or !exists $attr->{$type});
    $r=1 if shift =~ /$ARGV[3]/o;
    $rr=$r, $r=0 if $ARGV[4] and shift =~ /$ARGV[4]/o;
    print OUT pop; }, "tagname,tagname,tagname,text,attr");
$p->handler( end => sub {
    $r=0 if shift =~ /$ARGV[3]/o;
    $r=$rr if $ARGV[4] and shift =~ /$ARGV[4]/o;
    print OUT pop; }, "tagname,tagname,text");
$p->handler( text => sub {
    print(OUT shift), return unless $r;
    local $_ = shift;
    &$sub;
    print OUT $_; }, "text");
$p->handler( default => sub { print OUT shift }, "text");
#
find(sub {
    return if -d $File::Find::name;
    return unless $File::Find::name =~ /\.html?$/;
    open 'OUT', ">$File::Find::name.temp"
        or die "Can't open $File::Find::name.temp: $!\n";
    $r = 0;
    print "Examining $File::Find::name...\n";
    $p->parse_file($File::Find::name)
        or die "Can't parse file: $File::Find::name.temp: $!\n";
    close 'OUT';
    rename($File::Find::name, "$File::Find::name.orig");
    rename("$File::Find::name.temp", $File::Find::name);
   }, $ARGV[0] || die
   "Usage: $0 <dir> <pattern> <replacement> <elements+> [<elements->]
        [<attribute=value>]\n");

#!/usr/bin/perl
#
use strict;
use HTML::Parser;
#
print <<END;
Content-Type: text/html

END
#
my %para = split /;|=|&/, $ENV{QUERY_STRING};
#
my $i = 'a';
my @rules = ();
foreach (split(/[\s+]+/, $para{'q'})) {
    push @rules, "\$x =~ s|\\b$_\\b|<em class='$i'>$_</em>|i;";
    $i++;
}
my $rules = join '', @rules;
my $m = eval "sub {my \$x = shift; $rules; return \$x;}";
#
my $p = HTML::Parser->new(api_version => 3);
$p->handler(default => sub { print shift }, "text");
$p->handler(start => sub {
    print('<style type="text/css">em.a{}em.b{}em.c{}</style>',
        pop);
    return unless shift eq 'body';
    shift->handler(text => sub { print &$m(shift) }, "dtext");
}, "tagname,self,text");
#
$p->parse_file($para{file});

s/((?:mailto:)?[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+\.[a-zA-Z]{2,3})/encode_email($1)/ge;
#
sub encode_email {
    my @email = split //, shift;
    for (my $i=0;$i<$#email;$i++) {
        $email[$i] = sprintf "&#%d;", ord($email[$i]);
    }
    return join '', @email;
}

if (exists $attr->{$ARGV[3]}) {
    $changed++;
    if ((my $rep) = $ARGV[2] =~ m!^sub\s*{([^}]+)!) {
        $attr->{$ARGV[3]} =~ s!$ARGV[1]!$rep!gee;
    } else {
        $attr->{$ARGV[3]} =~ s!$ARGV[1]!$ARGV[2]!g;
        $attr->{$ARGV[3]} =~ s!(\$[a-zA-Z0-9_:]+)!$1!gee;
    }
}

foreach $line (sort { lc((split(/\|\|/, $a))[2]) cmp lc((split(/\|\|/, $b))[2]) } @data) {
    chomp $line if $line =~ /\n$/i;
    ($num, $ID, $name, $count) = split(/\|\|/, $line);
    print "$num $ID $name $count<BR>";
}

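sub comma {
    my ($count, $commer) = @_;    # the number and the separator to insert
    my $len = length($count);
    # Compare the length numerically; a character class such as [10-12]
    # matches the single characters 0, 1 and 2, not the range 10 to 12.
    if ($len >= 4 and $len <= 6) {
        $count =~ s/(\d{1,3})(\d{3})/$1$commer$2/;
    } elsif ($len >= 7 and $len <= 9) {
        $count =~ s/(\d{1,3})(\d{3})(\d{3})/$1$commer$2$commer$3/;
    } elsif ($len >= 10 and $len <= 12) {
        $count =~ s/(\d{1,3})(\d{3})(\d{3})(\d{3})/$1$commer$2$commer$3$commer$4/;
    }
    return $count;
} #end sub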
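
A length-independent alternative to branching on the number of digits is the repeated-substitution idiom known from the Perl FAQ. The sketch below is not the poster's code. The name commify and the default separator are choices made for this example, and it assumes the separator is not itself a digit.

```perl
use strict;
use warnings;

# Insert a separator between groups of three digits, working from the
# right. Each pass splits off one more group; the loop ends when the
# leading run of digits is three digits or shorter.
sub commify {
    my ($number, $sep) = @_;
    $sep = ',' unless defined $sep;
    1 while $number =~ s/^([-+]?\d+)(\d{3})/$1$sep$2/;
    return $number;
}

print commify(1234567), "\n";
print commify('1234567890123', '.'), "\n";
```

Because the loop repeats until nothing more matches, numbers longer than twelve digits need no extra branch.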

