Why does this regexp...

work so mysteriously

         

rainborick

3:42 am on May 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm a casual Perl programmer working on a script to massage HTML pages. One of the tasks I need to implement is stripping out portions of the HTML code before I execute the main task. In particular, I need to strip out embedded JavaScript. So I wrote the following regexp:

$content =~ s/(<script)(.*)(\/script>)/$sDelimStart$scriptCount$sDelimEnd/si;

which detects the script and replaces it with my own placeholder/marker so that I can restore it after processing. Unfortunately, in my initial test, the script isn't coming through intact. The whole script gets matched and replaced with my placeholder, but parentheses are missing or perhaps being replaced with \n's. I suspect that if I tried more complex JavaScript, curly braces and other characters might go too. So, I'd appreciate any hints, fixes, or corrections. Thanks!

rainborick

6:29 pm on May 31, 2005 (gmt 0)

Looks like only parentheses are getting stripped out or altered. Other non-alphanumerics like {}[]\|, etc. are getting processed properly. Am I wrong that (.*?) should match everything except a "\n"? I'm a regex rookie and would appreciate any help. Thanks.

wruppert

12:13 am on Jun 1, 2005 (gmt 0)

10+ Year Member



I would like to help but I'm not clear on what you are trying to do, or what the exact problem is. Can you post a short example of the input and bad output? Something like:

---------------
my $data = "<script>(small script)</script>";
$data =~ (your substitution here);
print "result: $data\n";

result: (the bad output)
---------------

Also, the /s modifier causes the input to be treated as a single line: with it, . matches \n as well.

The .* is greedy and will eat as much as it can - if you have more than 1 script block, it will eat from the start of the first to the end of the last. Use .*? for the non-greedy version. But it seems you only have one script block.
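To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample HTML and the [S] placeholder are made up for illustration and are not from the original script.

```perl
use strict;
use warnings;

# Hypothetical input: two script blocks with ordinary markup between them.
my $html = "<p>a</p><script>one(1);\n</script><p>b</p><script>two(2);</script><p>c</p>";

# Greedy: .* runs to the LAST "/script>", so the <p>b</p> in between is lost.
# /s lets . match the newline inside the first block.
(my $greedy = $html) =~ s/<script.*\/script>/[S]/si;

# Non-greedy: .*? stops at the FIRST "/script>"; /g repeats for every block.
(my $lazy = $html) =~ s/<script.*?\/script>/[S]/gsi;

print "greedy: $greedy\n";   # greedy: <p>a</p>[S]<p>c</p>
print "lazy:   $lazy\n";     # lazy:   <p>a</p>[S]<p>b</p>[S]<p>c</p>
```

With a single script block in the input, both versions give the same result, which is why the greedy form can look fine in a first test.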

rainborick

3:19 am on Jun 1, 2005 (gmt 0)

Thanks for the help, but I'm embarrassed to say the bug was elsewhere in my script, and I should have known it.

Another regexp higher in the script was munging the HTML data where the parentheses resided. I was trying to do an s/xx/yy/ replace on various combinations of \r\n to account for different formats of HTML files, and apparently escaped the parens on a couple of atoms by mistake, so they matched literal parentheses instead of grouping. So, thanks again.
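The faulty regexp itself was never posted, so the following is only a guess at the kind of mistake described, using made-up sample data. It shows how escaped parens in a line-ending substitution can produce the reported symptom, next to a version that uses a non-capturing group.

```perl
use strict;
use warnings;

# Hypothetical sample with mixed line endings (\r\n, bare \r, bare \n).
my $text = "f(1);\r\ng(2);\rh(3);\n";

# Grouping only: (?:...) is regex syntax, so parentheses in the data are untouched.
(my $ok = $text) =~ s/(?:\r\n|\r)/\n/g;

# Escaped parens: \( and \) are literal characters, so this alternation also
# matches every "(" and ")" in the data and replaces each with a newline.
(my $bad = $text) =~ s/\(|\r\n|\r|\)/\n/g;

print "ok:\n$ok";
print "bad:\n$bad";
```

In the second result every parenthesis has become a \n, which matches the "missing or replaced with \n's" behaviour reported earlier in the thread.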

moltar

4:18 am on Jun 1, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Test with two JavaScript blocks in one string, separated by some other text. It will eat both blocks and the text between them. I am not sure if that is what you want. If not, you need to make your regex non-greedy to fix it.
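Putting the pieces of this thread together, here is one possible sketch of the placeholder-and-restore idea from the first post, using a non-greedy match. The delimiter values, the @saved array, and the sample input are assumptions for illustration; only the variable names $content, $sDelimStart and $sDelimEnd come from the original regexp.

```perl
use strict;
use warnings;

# Assumed delimiter values; the original post does not show them.
my ($sDelimStart, $sDelimEnd) = ('<!--SCRIPT_', '-->');

# Hypothetical input with two script blocks.
my $content  = "<b>x</b><script>f(1);\n</script><i>y</i><SCRIPT src=\"a.js\"></SCRIPT>";
my $original = $content;

# Stash each script block and leave a numbered placeholder behind.
# /e evaluates the replacement as code; .*? with /g handles each block separately.
my @saved;
$content =~ s{(<script\b.*?</script>)}{
    push @saved, $1;
    $sDelimStart . $#saved . $sDelimEnd;
}gsie;
my $masked = $content;

# ... the main HTML processing would run here, on $content ...

# Restore each block from its placeholder number.
$content =~ s{\Q$sDelimStart\E(\d+)\Q$sDelimEnd\E}{$saved[$1]}g;

print "masked:   $masked\n";
print "restored: ", ($content eq $original ? "identical" : "different"), "\n";
```

Regexps are only a rough tool for this job: a script whose text contains the string "</script>" inside a JavaScript string literal would be cut short, so a real HTML parser is the safer route for untrusted pages.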