perl regex question

parsing log files, and I want to reformat them

         

jeremy goodrich

3:20 pm on Jun 23, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



if($line =~ /^\d.+\s/) {print "$`\n";}

I'm trying to match arbitrary numbers at the beginning of lines, e.g.:

400 216.35.116.91 Mozilla/3.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
3 216.239.46.99 Googlebot/2.1(+http://www.googlebot.com/bot.html)
10 199.172.149.204 ArchitextSpider

I'm trying to parse through a whole bunch of these: match the arbitrary number and the whitespace character after it, replace that with nothing, and then print the rest of the line.

if($line =~ /^\d.+\s/) {print "$`\n";}

As a check, I already tried changing the $` to $1 to see if that would print anything, but it didn't, so I know the regex is wrong...

Brett_Tabke

3:39 pm on Jun 23, 2001 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



The easiest way to do it is to start in multiple stages.

($host,$roast,$ghost,$most,$toast) = split(/ /,$line);

Now do what you want to each field. This applies especially to logs, where it is quite tricky to come up with good one-line regexes. What you'll find is that it is easier and far more maintainable if you break it down a step at a time.

After that, start combining the regexes until you get back to that killer one-liner. Not surprisingly, three or four steps can be faster than a single one-line regex. One-liners often have to backtrack and reset while they are parsing, whereas three or four separate regexes can nail the parse on the first pass across the line.
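
For the count/IP/agent lines in the original question, the staged approach might look roughly like this. It is only a sketch: the sub name parse_line is made up for illustration, and it assumes single spaces between the first three fields. The third argument to split is a limit, so everything after the first space stays in one piece.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one "count ip agent" line in two simple stages.
sub parse_line {
    my ($line) = @_;
    chomp $line;

    # Stage 1: split off the leading count; the limit of 2 keeps the rest intact.
    my ($count, $rest) = split / /, $line, 2;
    return unless defined $rest;    # no space on the line: nothing to parse

    # Stage 2: split the IP from the user agent, which may contain spaces.
    my ($ip, $agent) = split / /, $rest, 2;
    return ($count, $ip, $agent);
}

my ($count, $ip, $agent) = parse_line("10 199.172.149.204 ArchitextSpider\n");
print "$ip $agent\n";    # prints "199.172.149.204 ArchitextSpider"
```

Each stage is trivial to check on its own, which is the maintainability point above.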

Brett_Tabke

1:48 pm on Jun 25, 2001 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Related:

Reading Log Files in Perl:
[cs.cf.ac.uk...]

sugarkane

2:06 pm on Jun 25, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



/^\d+\s/ should work.

The problem was with the dot you had before the + sign: you were effectively saying 'match a digit at the start of the line, followed by anything 1 or more times, followed by a space'.

As Perl is 'greedy' when pattern matching (that is, it'll always match as far to the end of the string as possible), your regex matched everything up until the final space.

If that makes any sense at all ;)
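
A small sketch of the difference, using one of the sample lines from the question. The capture parentheses are added here just to show what each pattern grabs. It also shows why the original print came out empty: $` is the text before the match, and with a pattern anchored at ^ there is never anything before it, while $1 stays unset without parentheses.

```perl
use strict;
use warnings;

my $line = '10 199.172.149.204 ArchitextSpider';

# Original pattern: one digit, then greedy ".+", then a whitespace character.
# ".+" runs to the end of the line and backs off only as far as the LAST space.
my ($greedy) = $line =~ /^(\d.+\s)/;
print "greedy: '$greedy'\n";    # '10 199.172.149.204 '

# Corrected pattern: digits only, then one whitespace character.
my ($fixed) = $line =~ /^(\d+\s)/;
print "fixed:  '$fixed'\n";     # '10 '

# Replace the prefix with nothing and keep the rest of the line.
(my $rest = $line) =~ s/^\d+\s//;
print "$rest\n";                # 199.172.149.204 ArchitextSpider
```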

littleman

4:22 pm on Jun 25, 2001 (gmt 0)



If I am understanding you right, you want to capture the log entries with a specific server code, and you want to just capture the whole string from number+space until its end. This will work:
if($line =~ /^\d+\s/) {print "$&$'\n";}
$& -> is the match
$' -> everything after

littleman

4:33 am on Jun 26, 2001 (gmt 0)



Did you get it working?

jeremy goodrich

2:18 pm on Jun 26, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Logs have been parsed and formatted; I used BT's idea.

I'm a newbie to parsing anything with Perl, and it was the first idea I had on how to do the sorting of the files.

Happy to say the output is nice :)

Used this, modified:
($host,$roast,$ghost,$most,$toast) = split(/ /,$line);

into
($notneeded,$ip,$engine,$user,$agent,$info) = split(/ /,$line);

And then I printed every variable but the first. Some user agents, like fast, have more spaces in them, as does the slurp one with mozilla at the start of the string, so I needed to print more of the variables. I notice that having too many variables is okay, but too few doesn't work.
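
The "too few doesn't work" part comes from how list assignment behaves: any fields beyond the last variable are silently dropped, while extra variables are just left undefined. One way around guessing the field count is to split into an array and throw away the first element. This is a sketch only, and strip_first_field is a made-up name:

```perl
use strict;
use warnings;

# Hypothetical helper: drop the leading count, keep every other field.
sub strip_first_field {
    my ($line) = @_;
    chomp $line;
    my @fields = split / /, $line;    # as many fields as the line happens to have
    shift @fields;                    # discard the first one
    return join ' ', @fields;
}

# Single quotes so that the @ in the agent string is not interpolated.
print strip_first_field('400 216.35.116.91 Mozilla/3.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]'), "\n";
```

The agent string can then contain any number of spaces without the variable list having to change.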

Thanks for all the help.

Brett_Tabke

5:05 pm on Jun 26, 2001 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Here is a nice log parser for Agents (couple of spiffy regexes in there):
[nihongo.org...]

Bolotomus

11:23 pm on Jul 17, 2001 (gmt 0)

10+ Year Member



$& -> is the match
$' -> everything after

eeeeek! Don't use $&, it's an abomination! Using it one time in your entire program will make Perl use a different method on ALL of your regexes and incur enormous overheads, even if the code containing the $& isn't actually executed. At least, so says Friedl, the king of regexes.
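
A sketch of two ways to get the same information without $& or $'. The first uses plain capture groups and works on any Perl. The second uses the /p flag with ${^MATCH} and ${^POSTMATCH}, which needs Perl 5.10 or later. The sub name split_count is made up for illustration. As far as I know, Perls from 5.20 onward largely remove the $& penalty anyway, but captures remain the clearer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (count, rest), or an empty list if no match.
sub split_count {
    my ($line) = @_;
    return $line =~ /^(\d+)\s(.*)/;
}

my $line = '3 216.239.46.99 Googlebot/2.1(+http://www.googlebot.com/bot.html)';

# Option 1: explicit capture groups.
if (my ($count, $rest) = split_count($line)) {
    print "count=$count rest=$rest\n";
}

# Option 2 (Perl 5.10+): /p fills ${^MATCH} and ${^POSTMATCH} for this match only.
if ($line =~ /^\d+\s/p) {
    print "${^MATCH}${^POSTMATCH}\n";    # same text as "$&$'", i.e. the whole line
}
```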

When I parse logfiles I like to use \S* to grab the components.

But anyhow, there are too many interesting unsolved problems in the world for you to sweat over this one. Here's an answer from the Perl Cookbook:

while (<LOGFILE>) {
    my ($client, $identuser, $authuser, $date, $time, $tz, $method, $url, $protocol, $status, $bytes) =
        /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.*?) (\S+)" (\S+) (\S+)$/; # keep the regex on 1 line
    # do stuff here
}
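
To try a pattern like this without a real log file, something along these lines could be used. It is a sketch under assumptions: the sample line is a made-up entry in Common Log Format, parse_clf is a hypothetical name, and the pattern includes the \] that closes the bracketed timestamp after the timezone group.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash of named fields, or an empty list on no match.
sub parse_clf {
    my ($line) = @_;
    chomp $line;
    my @f = $line =~
        /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.*?) (\S+)" (\S+) (\S+)$/
        or return;
    my %rec;
    @rec{qw(client identuser authuser date time tz method url protocol status bytes)} = @f;
    return %rec;
}

# Made-up sample entry, for illustration only.
my %rec = parse_clf('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326');
print "$rec{client} $rec{status} $rec{url}\n" if %rec;
```

Returning an empty list on a failed match lets the caller skip malformed lines instead of working with undefined fields.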

Bolotomus

Bolotomus

11:31 pm on Jul 17, 2001 (gmt 0)

10+ Year Member



Hey Jeremy,

Reading over your question I see that my answer was overkill.

If this file you are reading simply looks like this

216.239.46.99 Googlebot/2.1(+http://www.googlebot.com/bot.html)

Then what you want is this:

while (<FILE>) {
    my ($ip,$agent) = /^(\S+) (.*)/;
    # do something
}

The (\S+) part will grab the digits and periods all in one variable, and the (.*) will grab everything else on the rest of the line (but not the newline).

Bolotomus

neil laurance

8:59 am on Aug 1, 2001 (gmt 0)



Or recursively from the command line:

find . -type f | xargs perl -i.old -p -e 's/^[\d\.]+\s+//'

Cheers, Neil

Bolotomus

4:24 pm on Aug 5, 2001 (gmt 0)

10+ Year Member



Welcome Neil! Always good to have more "Unix jocks" among us!

neil laurance

8:37 pm on Aug 5, 2001 (gmt 0)



Thanks for the 'compliment' ;)