I'm trying to match arbitrary numbers at the beginning of each line in a file, e.g.:
400 216.35.116.91 Mozilla/3.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
3 216.239.46.99 Googlebot/2.1(+http://www.googlebot.com/bot.html)
10 199.172.149.204 ArchitextSpider
I'm trying to parse through a whole bunch of these: match the arbitrary number and the whitespace character after it, replace them with nothing, and then print the rest of the line.
if($line =~ /^\d.+\s/) {print "$`\n";}
As a check, I already tried changing the $` to $1 to see if that would print anything, but it didn't, so I know the regex is wrong...
($host,$roast,$ghost,$most,$toast) = split(/ /,$line);
Now do what you want to each field. With logs especially, it is quite tricky to come up with good one-line regexes. You'll find it is easier and far more maintainable to break the job down one step at a time.
After that, start combining the regexes until you get back to that killer one-liner. Not surprisingly, a parse split into 3 or 4 regexes can be faster than a one-line regex. One-liners often have to backtrack and reset while they are parsing, whereas 3 or 4 separate regexes can nail the parse on the first pass across the line.
Reading Log Files in Perl:
[cs.cf.ac.uk...]
The problem was with the dot you had before the + sign: you were effectively saying 'match a digit at the start of the line, followed by anything 1 or more times, followed by a space'.
As Perl is 'greedy' when pattern matching (that is, it'll always match as far towards the end of the string as possible), your regex matched everything up until the final space.
If that makes any sense at all ;)
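For what it's worth, here is a minimal sketch of the fix described above, assuming every line starts with a run of digits followed by whitespace. Swapping .+ for \d+ stops the match at the end of the number, and s/// removes it directly. Note also that $` holds the text before the match, which is always empty when the pattern is anchored with ^, so the original print would have shown nothing even with a corrected regex. The name strip_count is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the leading count and the whitespace after it.
sub strip_count {
    my ($line) = @_;
    # \d+ matches digits only, so it stops at the first space
    # instead of running on to the last one as .+ did.
    $line =~ s/^\d+\s+//;
    return $line;
}

print strip_count('10 199.172.149.204 ArchitextSpider'), "\n";
```

A line that does not start with a number is returned unchanged, since the substitution simply fails to match.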
I'm a newbie to parsing anything with Perl, and it was the first idea I had on how to do the sorting of the files.
Happy to say the output is nice :)
Used this, modified:
($host,$roast,$ghost,$most,$toast) = split(/ /,$line);
into
($notneeded,$ip,$engine,$user,$agent,$info) = split(/ /,$line);
And then I printed every variable but the first. Some user agents, like fast, have more spaces in them, and slurp has Mozilla at the start of the string, so I needed to print more of the variables. I notice that having too many variables is okay, but too few doesn't work.
Thanks for all the help.
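One possible way around the varying number of spaces in the user agent, offered as a sketch rather than as what was actually used above: give split a limit, so that everything after the second field stays in one variable. This assumes the count / IP / agent layout from the sample lines; split_line is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into count, IP and user agent.
sub split_line {
    my ($line) = @_;
    chomp $line;
    # ' ' splits on runs of whitespace; the limit of 3 leaves
    # everything after the second field in $agent, spaces and all.
    my ($count, $ip, $agent) = split ' ', $line, 3;
    return ($count, $ip, $agent);
}

my ($count, $ip, $agent) =
    split_line('400 216.35.116.91 Mozilla/3.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]');
print "$ip $agent\n";
```

With the limit in place there is no need to guess how many variables a given user agent will fill.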
eeeeek! Don't use $& (or its siblings $` and $'), it's an abomination! Using it one time in your entire program will make Perl use a different method on ALL of your regexes and incur enormous overheads, even if the code containing the $& isn't actually executed. At least, so says Friedl, the king of regexes.
When I parse logfiles I like to use \S* to grab the components.
But anyhow, there are too many interesting unsolved problems in the world for you to sweat over this one. Here's an answer from the Perl Cookbook:
while (<LOGFILE>) {
my ($client, $identuser, $authuser, $date, $time, $tz, $method, $url, $protocol, $status, $bytes) =
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+) "(\S+) (.*?) (\S+)" (\S+) (\S+)$/; # keep the regex on 1 line
# do stuff here
}
Bolotomus
Reading over your question I see that my answer was overkill.
If the file you are reading simply looks like this:
216.239.46.99 Googlebot/2.1(+http://www.googlebot.com/bot.html)
Then what you want is this:
while (<FILE>) {
my ($ip,$agent) = /^(\S+) (.*)/;
# do something
}
The (\S+) part will grab the digits and periods all in one variable, and the (.*) will grab everything else on the rest of the line (but not the newline).
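If the lines still carry the leading count shown in the original question, the same idea extends with one more capture group. This is only a sketch under that assumption, and parse_line is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (count, ip, agent), or an empty list on no match.
sub parse_line {
    my ($line) = @_;
    # (\d+) grabs the count, (\S+) the IP, and (.*) the rest of the
    # line up to, but not including, the newline.
    if ($line =~ /^(\d+)\s+(\S+)\s+(.*)/) {
        return ($1, $2, $3);
    }
    return;
}

my ($count, $ip, $agent) = parse_line("10 199.172.149.204 ArchitextSpider\n");
print "$ip $agent\n";
```

Returning an empty list on failure lets the caller skip malformed lines with a plain `next unless` check.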
Bolotomus
find . -type f | xargs perl -i.old -p -e 's/^[\d\.]+\s+//'
Cheers, Neil