Help with s&r regex

Forum Moderators: coopster & phranque

Joe Belmaati

4:31 pm on Dec 22, 2004 (gmt 0)

Hi,
I am trying to remove the following sequence of text from some documents, but I can't get my regex to work. Any help is MUCH appreciated:


A test
T 1999 Some text spaces and numbers
D 01012002
#

Here's what I have tried:


s/A\s+test\nT\s+(.*)\nD\s+01012002\n#\n//g;
s/A\s+test\nT\s+(.*)\nD\s+01012002\n#\n//g;
s/(A\s+test\nT\s+.*\nD\s+01012002\n#\n)//g;
s/A\A test\nT (.*)\nD 01012002\n#\n//g;

Thank you VERY much :D

Sincerely,
Joe Belmaati
Copenhagen Denmark

wruppert

6:40 pm on Dec 22, 2004 (gmt 0)

By default, "." in a Perl regex does not match a newline, and "^" and "$" only anchor at the very start and end of the string. Your match is failing because it crosses several lines.

You can get it to do what you want by adding the "s" option along with your "g" option. The "s" option makes Perl treat the text as one big string, so "." matches newlines as well.

Look under "s/PATTERN/REPLACEMENT/egimosx" in perlop for details.

Your first attempt plus the "s" should work, although you might want to check for beginning of line, if appropriate:

s/^A\s+test\nT\s+(.*)\nD\s+01012002\n#\n//gs;
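
For illustration, here is a minimal sketch of the multi-line substitution, assuming the whole document is already in a single scalar (a hypothetical $text, filled from a heredoc here rather than from a file; the "keep" lines are made up). It differs from the line above in two ways that are my own choices: it uses the "m" option so that ^ can match at the start of any line, and it uses [^\n]* instead of .* so the match cannot run past the end of the T line. With no "." left in the pattern, the "s" option no longer affects the result.

```perl
use strict;
use warnings;

# Hypothetical input: the whole document held in one scalar.
# A real script would read it from a file in one go instead.
my $text = <<'END';
keep this line
A test
T 1999 Some text spaces and numbers
D 01012002
#
keep this line too
END

# /m lets ^ match at the start of every line, not only at the start of the string.
# [^\n]* stays on the T line, so the match cannot swallow later blocks.
$text =~ s/^A\s+test\nT\s+[^\n]*\nD\s+01012002\n#\n//mg;

print $text;
```

If the text is instead fed to the substitution one line at a time, no single string ever contains all four lines, and a pattern like this cannot match whatever options are used.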

rocknbil

6:51 pm on Dec 22, 2004 (gmt 0)

Hoo boy, I suck at regexps; I usually have to do them several times to get them right. But I'm going to try this anyway in hopes that some of the Perl masters further along the path can help correct my thinking. :-)

Begins with an A, multiple lines, ends with a #. It makes a difference whether you're going through a chunk in a scalar or line by line through a file. I picked line-by-line because sometimes I don't know the right thing to tell the m modifier. :-) First I came up with this quickie.

$content =~ s/^[ATD#].*//g;

Works to strip those lines, but will also kill anything else that begins with an A, T, D, or #. If you have a specific pattern, you could substitute the .* with patterns. I got this to work doing a line by line read:

$content =~ s/^[ATD#](\s*\w{4}|\s*\d+.*)*\s*$//g;

^ beginning of line
[ATD#] followed by any one of these characters

\s*\w{4} maybe followed by whitespace, then 4 word chars
| OR
\s*\d+.* maybe whitespace, followed by one or more digits, followed by anything

* either of these may or may not exist (this gets the # only)
\s*$ followed by optional whitespace (includes newlines) at the end of the line
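
The breakdown above can also be written into the regex itself with the "x" option, which allows whitespace and comments inside the pattern. This is only a sketch: it uses a plain | for the alternation, the names $line_re, @lines and @kept are made up for the example, and the # in the character class is backslashed so that it is not read as the start of a comment.

```perl
use strict;
use warnings;

# The line-by-line pattern, laid out with /x so each part carries its comment.
my $line_re = qr{
    ^               # beginning of line
    [ATD\#]         # one of A, T, D or a literal hash sign
    (
        \s*\w{4}    # optional whitespace, then 4 word chars
      |             # OR
        \s*\d+.*    # optional whitespace, one or more digits, then anything
    )*              # zero or more times (zero covers the bare hash line)
    \s*$            # optional trailing whitespace up to the end of the line
}x;

# Hypothetical lines, as they would arrive from a line-by-line read.
my @lines = (
    "A test\n",
    "T 1999 Some text spaces and numbers\n",
    "D 01012002\n",
    "#\n",
    "Some other line\n",
);

# Keep only the lines that the pattern does not match.
my @kept = grep { !/$line_re/ } @lines;

print @kept;
```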

Anxious also to see a better solution, because I know this one's probably all wrong!

Joe Belmaati

7:22 pm on Dec 22, 2004 (gmt 0)

wruppert, I couldn't get that to work.

rocknbil, your version worked, but it is too greedy: I have some other sequences that should remain, and they contain similar patterns. For example:


A test
T 1999 Text 5747 spaces numbers
M blah blah more text special character #"" and numbers
D 01012002
#

The above-mentioned sequence should remain. That's why I can't really do it line by line...
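
One possible way to handle this, sketched under the assumption that every block ends with a "#" on a line of its own: read the text one block at a time by setting the input record separator $/ to "#\n", and drop a block only when it consists of exactly the A, T, D and # lines. The names $data, $in and $out are made up for the example, and the input comes from an in-memory filehandle rather than a real file.

```perl
use strict;
use warnings;

# Hypothetical input holding both kinds of block.
my $data = <<'END';
A test
T 1999 Some text spaces and numbers
D 01012002
#
A test
T 1999 Text 5747 spaces numbers
M blah blah more text special character #"" and numbers
D 01012002
#
END

open my $in, '<', \$data or die "Cannot open in-memory handle: $!";

my $out = '';
{
    local $/ = "#\n";    # read one "#"-terminated block per iteration
    while (my $record = <$in>) {
        # Skip only blocks made of exactly the A, T, D and # lines.
        next if $record =~ /^A\s+test\nT\s+[^\n]*\nD\s+01012002\n#\n\z/;
        $out .= $record;
    }
}
close $in;

print $out;
```

The block with the extra M line survives because the pattern requires the D line to come directly after the T line. A line inside a block that happens to end in "#" would split the block early, so this only fits data where that cannot occur.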