Regular expression issues

         

martin

5:13 pm on Aug 29, 2005 (gmt 0)

10+ Year Member



I have a custom template engine (it's too late to switch to a generic one from CPAN or elsewhere). It has to run marked-up content through a function that processes the contents of a tag and/or the params of that tag. The tags are HTML-like.

There are some issues with this code, though. The part that selects the HTML-like attributes, i.e. the first (.*?) block, doesn't work on all versions of Perl. If I split it into two matches, first one without that block and then one with it, it works, but it's about twice as slow. Even when it's not split, it performs miserably on some boxes. Any ideas about optimizing this regex?

my $re = qr/$this->{mask_start}\Q$key\E(.*?)$this->{mask_end}(.*?)$this->{mask_start_close}\Q$key\E$this->{mask_end_close}/is;

while ($this->{template} =~ /$re/) {
    my $params = $this->mask_block_params($1, $2);
    my $html = &$callback($key, $params);
    $this->{template} =~ s/$re/$html/is;
}

Changing the s/$re/$html/; to the same regex w/o grouping didn't make much of a difference.
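
One thing that may be worth trying, as a minimal sketch rather than a drop-in fix: the loop above runs the pattern twice per tag (once in the while condition, once in the substitution) and rescans the template from the beginning on every iteration. A single s///ge pass does the match and the replacement together and continues from where it left off. The delimiters, the sample template and the callback below are made-up stand-ins for $this->{mask_start} and friends, and the mask_block_params step is left out.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for $this->{mask_start}, {mask_end}, etc.
my %mask = (start => '<', end => '>', start_close => '</', end_close => '>');

sub expand_tags {
    my ($template, $key, $callback) = @_;
    my $re = qr/\Q$mask{start}\E\Q$key\E(.*?)\Q$mask{end}\E(.*?)\Q$mask{start_close}\E\Q$key\E\Q$mask{end_close}\E/is;
    # One pass: /g continues after each replacement instead of
    # rescanning from the start, and /e runs the callback in place.
    $template =~ s/$re/$callback->($key, $1, $2)/ge;
    return $template;
}

my $out = expand_tags(
    'a <box color="red">one</box> b <box>two</box> c',
    'box',
    sub { my ($key, $attrs, $body) = @_; return "[$key|$attrs|$body]" },
);
print "$out\n";
```

One difference to keep in mind: with /g the callback's output is not scanned again, whereas the while loop would also expand any matching tags that the callback itself returns.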

wruppert

7:59 pm on Aug 29, 2005 (gmt 0)

10+ Year Member



I'm not sure I follow exactly what you are doing, especially without a concrete example.

However, my eye was caught by the substitution in the last line. My impression is that you find what you want to change, figure out what the new stuff should be, and then replace what was just found with the new stuff. If that is true, then instead of:

$this->{template} =~ s/$re/$html/is;

you could use:

$this->{template} = $` . $html . $';

$` is the part of the string before the match and $' is the part after. $& is the matched value. Using these is expensive, but you have already paid for it by using $1 and $2.

This might help, easy enough to try.
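
A variation on the same idea, sketched under the assumption that the goal is only to splice the new text over the match: the @- and @+ arrays hold the start and end offsets of the last successful match, so four-argument substr can replace the matched span without touching $` or $' at all. The pattern and the uc() call are simplified stand-ins for the real regex and callback.

```perl
use strict;
use warnings;

my $template = 'before <box>inner</box> after';
my $re       = qr{<box>(.*?)</box>}is;   # simplified stand-in for the real pattern

while ($template =~ /$re/g) {
    my $html = uc $1;                    # stand-in for the callback's output
    my ($from, $to) = ($-[0], $+[0]);    # offsets of the whole match
    # Four-argument substr replaces the matched span in place.
    substr($template, $from, $to - $from, $html);
    # Resume scanning just past the inserted text.
    pos($template) = $from + length $html;
}
print "$template\n";
```

Setting pos() by hand keeps the next search from starting over at the beginning of the string, which is where the original loop spends part of its time.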

SeanW

5:23 am on Aug 30, 2005 (gmt 0)

10+ Year Member



What about writing a tokenizer like HTML::TokeParser (you might even be able to use that)? Since it's tuned to HTML you can use fewer regexps, especially ones with .* or .*?

Also, have a look at the "Backtracking" section of the perlre manpage; it explains why the .*? groups may be slow.

Sean
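
To illustrate the backtracking point with a rough sketch: if the tag's closing delimiter is a single character such as > (an assumption, since the real mask_end isn't shown), the attribute part can be written as [^>]* instead of (.*?). The negated class can never run past the end of the opening tag, so the engine has fewer positions to retry. The key and sample text here are invented.

```perl
use strict;
use warnings;

my $key  = 'box';
my $text = '<box a="1">first</box> and <box>second</box>';

# Attribute part restricted to "anything but >", so it stops at the tag end.
my $tight = qr{<\Q$key\E([^>]*)>(.*?)</\Q$key\E>}is;

my @pairs;
while ($text =~ /$tight/g) {
    push @pairs, [$1, $2];               # [attribute string, tag body]
}
printf "attrs='%s' body='%s'\n", @$_ for @pairs;
```

This only holds while attribute values never contain a literal >. The second (.*?) is harder to tighten the same way because the closing delimiter is more than one character.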

martin

11:05 am on Aug 30, 2005 (gmt 0)

10+ Year Member



Thanks for the suggestion. $this->{template} = $` . $html . $'; took the same time to execute, though.

I know that .*? is slow, but it's already coded like that and it's not an easy thing to change. I'll see if I can use the HTML parser; that sounds like a good idea.

SeanW

12:55 pm on Aug 30, 2005 (gmt 0)

10+ Year Member




martin wrote: "but it's already coded like that and not an easy thing to change."

Not knowing your application, it looks like you can swap out your code:

[perl]
while ($this->{template} =~ /$re/) {
    my $params = $this->mask_block_params($1, $2);
    my $html = &$callback($key, $params);
    $this->{template} =~ s/$re/$html/is;
}
[/perl]

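with something like:
[perl]
# MyTokenizer is a hypothetical class; get_token returns one token at a time
my $p = MyTokenizer->new($this->{template});
while (my $t = $p->get_token) {
    if ($t->[0] eq 'S' and $t->[1] eq $key) {
        # parse the special token and append the result
        $this->{html} .= parse_special($t);    # parse_special() is a placeholder
    } else {
        $this->{html} .= $t->as_html;
    }
}
[/perl]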

Come to think of it, you'd want to write your own tokenizer since HTML::TokeParser breaks up each token into components, and you only need to do it for special tokens.

I'm thinking of a simple scanner: you read char by char until you reach the next special character (< or >), depending on whether you're currently in a tag or not.

Sean
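
A minimal sketch of such a scanner, with a few assumptions spelled out: instead of literally reading one character at a time it uses \G with /gc to step through the string, tag names are taken to be \w+, and the token layout ('S', 'E', 'T' with the raw text as the last field) is invented for this example rather than copied from HTML::TokeParser.

```perl
use strict;
use warnings;

# Tokens: ['S', name, attrs, raw] start tag, ['E', name, raw] end tag,
# ['T', raw] plain text. The raw source text is always the last field.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;
    while (pos($src) < length($src)) {
        if ($src =~ /\G<\/(\w+)>/gc) {
            push @tokens, ['E', lc($1), "</$1>"];
        }
        elsif ($src =~ /\G<(\w+)([^>]*)>/gc) {
            push @tokens, ['S', lc($1), $2, "<$1$2>"];
        }
        elsif ($src =~ /\G([^<]+)/gc) {
            push @tokens, ['T', $1];
        }
        else {
            # A stray "<" that starts no tag: pass it through as text.
            $src =~ /\G(.)/gcs;
            push @tokens, ['T', $1];
        }
    }
    return @tokens;
}

my $key    = 'box';
my @tokens = tokenize('x <box a="1">hi</box> <b>y</b>');
my $out    = '';
for my $t (@tokens) {
    if ($t->[0] eq 'S' && $t->[1] eq $key) {
        $out .= '[' . $key . $t->[2] . ']';   # stand-in for the real handling
    }
    else {
        $out .= $t->[-1];                     # everything else is copied as-is
    }
}
print "$out\n";
```

Each character of the template is visited once, so the cost grows linearly with the template size instead of depending on how often the regex has to back up.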

wruppert

11:54 pm on Aug 30, 2005 (gmt 0)

10+ Year Member



Have you profiled the functions used in the loop? I just made my own skeleton version of this with a 500-line template and ran it 1000 times. It took less than a second to run. Perhaps your problem is not in the pattern matching.
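
For anyone who wants to repeat that kind of measurement, here is a rough timing harness. It is not the skeleton described above (that code wasn't posted); the template, the simplified pattern and the run count are all invented, and the callback is replaced by a plain string. A profiler such as Devel::NYTProf would give a per-line breakdown, while this only times the loop as a whole.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical 500-line template with one tag per line.
my $template = join '', map { "line $_ <box n=\"$_\">body $_</box>\n" } 1 .. 500;
my $re       = qr{<box(.*?)>(.*?)</box>}is;

sub expand_loop {
    my ($text) = @_;
    while ($text =~ /$re/) {
        my $html = "[$1|$2]";            # stand-in for mask_block_params + callback
        $text =~ s/$re/$html/;
    }
    return $text;
}

my $runs  = 20;                          # raise this for steadier numbers
my $start = time();
my $result;
$result = expand_loop($template) for 1 .. $runs;
my $elapsed = time() - $start;
printf "%d runs: %.3fs (%.4fs per run)\n", $runs, $elapsed, $elapsed / $runs;
```

If a harness like this runs quickly while the real engine is slow, the time is probably going into the callback or mask_block_params, or the real templates differ from the test data in a way that matters.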

martin

7:41 am on Aug 31, 2005 (gmt 0)

10+ Year Member



I'm trying the parser approach and it looks very promising.

martin

12:44 pm on Aug 31, 2005 (gmt 0)

10+ Year Member



HTML::PullParser - 0.3s
regex - 15s

amazing...

SeanW

12:52 pm on Aug 31, 2005 (gmt 0)

10+ Year Member



Sweeeeeeet :)

Sean