Perl exact match operator for useragent string? - Perl Server Side CGI Scripting forum at WebmasterWorld - WebmasterWorld

Forum Moderators: coopster & phranque

Perl exact match operator for useragent string?

JAB Creations

10:01 pm on Jul 24, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I'm currently updating my Awstats script which needs some better useragent filters. For one I'm trying to get rid of fake "Mozilla" strings. The useragents feeding this are mostly like this...

Mozilla/3.0
Mozilla/4.0
Mozilla/5.0
Mozilla/3.0 (compatible)
Mozilla/4.0 (compatible)
Mozilla/5.0 (compatible)

Obviously they aren't descriptive of what the user is actually using. I'd like to figure out how to do an exact match in order to classify these as I desire. I'm not good with advanced operators in any language; I'm getting decent in PHP and it's aiding my understanding of Perl a bit, as I don't usually work with Perl except when I come across scripts written in it that better suit my needs.

Anyway here is a look at some of the useragent filters Awstats uses just in case it helps others understand how the program is handling the useragents.

my $regvermsie=qr/msie([+_ ]|)([\d\.]*)/i;
my $regverfirefox=qr/firefox\/([\d\.]*)/i;

These two filters represent the two general detection filters in the array the script is using. Agent/version or Agent version (with slash or space between agent and version). I don't know what qr/ and /i do exactly but I am guessing they mean 'if you find this match anywhere within the entire string' or something along those lines. I've tried this though it's not matching the exact number of occurrences of this specific useragent...

my $regverexampleMozilla5=="Mozilla/5.0";

Also tried === and I'm just not seeing anything about exact matches on the net for Perl. Just for clarification, I realize this exact match would obviously not match something like...

Mozilla/5.01

...which is fine with me. Thanks!

- John

jdMorgan

10:32 pm on Jul 24, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

The patterns used in the PERL code are nothing more than extended regular expressions -- similar to those used in mod_rewrite, PHP, and many other scripting and "utility" languages. Try a search for PERL regular expressions, and you should find plenty of useful info.

One thing to beware of is that the latest Netscape 9 browser (released as a beta) has been reduced to little more than a 'skin' and a few extensions on top of Mozilla Firefox, and now carries a Firefox User-agent string with "Netscape Navigator" tacked onto the end. Example from a WinXP user in the US:

"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.5pre) Gecko/20070712 Firefox/2.0.0.4 Navigator/9.0b2"

Yet another anomaly to deal with... :)

Jim

JAB Creations

10:59 pm on Jul 24, 2007 (gmt 0)

Now you've mentioned regular expressions...so here is my take on what I've read thus far...

According to Wiki...

^ Matches the beginning of a line or string.
$ Matches the end of a line or string.

So I attempted this...

my $regverexampleMozilla5=^Mozilla/5.0$;

It's obviously not valid if the script breaks.

Is there simply no direct method of detecting an exact string in Perl?

- John

jdMorgan

11:24 pm on Jul 24, 2007 (gmt 0)

Well, one problem is that you'll rarely see exactly "Mozilla/5.0" because the user-agent string has all that other stuff bolted on it, as shown in my Netscape UA example above.
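To illustrate the point, here is a minimal sketch that tests the Netscape 9 string quoted earlier against the text "Mozilla/5.0" with and without anchors. The variable names are made up for this example and are not part of Awstats.

```perl
use strict;
use warnings;

# User-agent string quoted earlier in the thread.
my $ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.5pre) '
       . 'Gecko/20070712 Firefox/2.0.0.4 Navigator/9.0b2';

# Unanchored: succeeds because the text occurs somewhere in the string.
my $contains = ($ua =~ m{Mozilla/5\.0}) ? 1 : 0;

# Anchored at both ends: succeeds only if the whole string is "Mozilla/5.0".
my $exact = ($ua =~ m{^Mozilla/5\.0$}) ? 1 : 0;

print "contains: $contains, exact: $exact\n";   # contains: 1, exact: 0
```

So an anchored pattern only catches agents that send the bare string and nothing else.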

Of course PERL has an exact match, but you've got to satisfy both the required regular expressions syntax, and PERL's own syntax, which is why I suggested a search for PERL regular expressions. The second result on Google, for example, neatly answers your question about "/i" on the very first page...

Jim

JAB Creations

12:22 am on Jul 25, 2007 (gmt 0)

to satisfy both the required regular expressions syntax, and PERL's own syntax,

I can understand this to roughly a third of what it implies. I can understand basic singular characters and what they may imply. I have roughly a third of the required understanding however my brain doesn't pick up patterns like a developer. In regards to programming my brain works best on replication, without a clear cut example to replicate I am only able to accidentally find my answer. However in regards to design I don't need to rely on math, just visuals and thus I'm able to construct what I need from scratch much easier.

So I understand the basic implications of expressions and the basic implications of operators. I am clueless how we are mixing them as I only roughly understand what defines each group from the other and in my head it is again a visual understanding in place of the logic that a developer works with.

My guesses include this...

^ Match the beginning of the line

$ Match the end of the line

So I'd adapt the string from...

Mozilla/5.0

to...

^Mozilla/5.0$

I assume we must escape slashes (I understand that this is a filtering array of some sort as I can create filters for a string but that is only my best guess as the much more general situation)...so I would adapt it as so....

^Mozilla\/5.0$

I understand in PHP that a . connects two things...working from Awstats's (key part here) already working example of this...

my $regfavico=qr/\/favicon\.ico$/i;

...it is my understanding that I must escape the dot as an operator. My adaptation mutates to this...

^Mozilla\/5\.0$

Still, unless the regular expression is doing an exact match with ^ and $, it is completely unclear whether I'm executing an exact match. Is this how to execute an exact match using regular expressions (minus the fact that we're mixing operators)?

Perl's page may describe "/i" to you but it does not to me. "i" is case insensitive...so I don't need this if I'm doing exact matching (to be exact about exact, it implies case sensitivity automatically, as that of course is part of what exact implies). I still also do not understand what "qr" is. "/" is used to escape...so what is the point of "/i"...escaping case sensitivity?

So by my designer's visual logic this is currently my best guess...

my $regverMozilla5=^Mozilla\/5\.0$

...it breaks the script though. So hopefully you'll be able to explain to me what I'm missing, where my visual logic is failing at literal logic, and my understanding will align that way, I hope... Thanks for your help!

- John

phranque

1:03 am on Jul 25, 2007 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

a backslash rarely hurts but the first one isn't necessary.
you can escape any character with a preceding backslash, but it is only required to escape the following special characters:

[\^$.|?*+()
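As a rough sketch of why the dot matters, and of how Perl can do the escaping for you: the built-in quotemeta function and the \Q...\E sequence backslash-escape every non-word character. The variable names below are hypothetical.

```perl
use strict;
use warnings;

my $literal = 'Mozilla/5.0';

# An unescaped dot matches any character, so this pattern is too loose.
my $loose = ('Mozilla/5x0' =~ m{^Mozilla/5.0$}) ? 1 : 0;

# quotemeta escapes every non-word character in the string.
my $escaped = quotemeta($literal);
print "$escaped\n";   # Mozilla\/5\.0

# \Q...\E applies the same escaping inline while the pattern is built.
my $re = qr/^\Q$literal\E$/;

my $exact_ok = ('Mozilla/5.0' =~ $re) ? 1 : 0;
my $x_ok     = ('Mozilla/5x0' =~ $re) ? 1 : 0;

print "loose: $loose, exact: $exact_ok, 5x0: $x_ok\n";   # loose: 1, exact: 1, 5x0: 0
```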

you might want to put your regular expression string in quotes:

my $regverMozilla5 = '^Mozilla/5\.0$';

now it is an exact match regular expression for:

Mozilla/5.0
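A minimal sketch of how such a pattern string behaves when it is interpolated into a match, assuming the calling code does something like `$ua =~ /$pattern/` (how Awstats itself applies it is not shown in this thread). For comparison it also uses Perl's `eq` operator, which tests two strings for equality without any regular expression; `==` is the numeric comparison.

```perl
use strict;
use warnings;

# Pattern kept in a plain single-quoted string, as in the post above.
my $regverMozilla5 = '^Mozilla/5\.0$';

my %result;
for my $ua ('Mozilla/5.0', 'Mozilla/5.01', 'Mozilla/5.0 (compatible)') {
    my $by_regex = ($ua =~ /$regverMozilla5/) ? 'yes' : 'no';
    my $by_eq    = ($ua eq 'Mozilla/5.0')     ? 'yes' : 'no';
    $result{$ua} = "$by_regex/$by_eq";
    print "$ua => regex: $by_regex, eq: $by_eq\n";
}

# One subtlety: $ also matches just before a trailing newline, \z does not.
my $with_newline  = "Mozilla/5.0\n";
my $dollar_allows = ($with_newline =~ /^Mozilla\/5\.0$/)  ? 1 : 0;
my $z_allows      = ($with_newline =~ /^Mozilla\/5\.0\z/) ? 1 : 0;
print "dollar: $dollar_allows, z: $z_allows\n";   # dollar: 1, z: 0
```

Only the first of the three sample strings matches, by either method.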

phranque

1:15 am on Jul 25, 2007 (gmt 0)

I don't know what qr/ and /i do exactly

qr is a perl operator for quoting regular expressions.
the /i is an option to ignore case when matching the letters in the regexp.
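A short sketch of both pieces, using the Firefox pattern quoted at the top of the thread and the Netscape 9 user-agent string from Jim's post. The other variable names are made up for the example.

```perl
use strict;
use warnings;

# qr// compiles a pattern once and stores it in a variable for later use.
my $regverfirefox = qr/firefox\/([\d\.]*)/i;

my $ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.5pre) '
       . 'Gecko/20070712 Firefox/2.0.0.4 Navigator/9.0b2';

# In list context a match returns its captured groups.
my ($version) = $ua =~ $regverfirefox;
print "Firefox version: $version\n";   # Firefox version: 2.0.0.4

# Without /i the lowercase pattern does not match "Firefox".
my $case_sensitive = ($ua =~ qr/firefox/) ? 1 : 0;
print "matches without /i: $case_sensitive\n";   # matches without /i: 0
```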

JAB Creations

2:28 am on Jul 25, 2007 (gmt 0)

Thanks phranque, this works perfectly!

my $regverMozilla5='^Mozilla/5\.0$';

However, it wasn't without a custom string of mine attached!

It's obviously a spoof and spoofing is against my site's TOS. That translates into no access log lines with that useragent AND a normal code (200). So it took me a moment of playing around with a temporary access log, as Awstats does not display non-normal codes for things like browser hits. I changed the response codes around for a specific string (301s). Apache redirects (changed a txt file to php to enforce my TOS on my Adblock filter subscription) before PHP gets a chance to execute (not hard to figure out), so I just changed the 301 redirects to 200s for the sake of testing and it works fine. I have exactly 682 instances in my test case, and it detected exactly 682 instances.
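A count like this can be sanity-checked outside Awstats with a few lines of Perl. This is only a sketch: the list of agents below is a hypothetical stand-in for user-agent fields pulled from an access log, and it is not how Awstats itself does the counting.

```perl
use strict;
use warnings;

my $regverMozilla5 = '^Mozilla/5\.0$';

# Hypothetical sample; in practice these would come from the access log.
my @agents = (
    'Mozilla/5.0',
    'Mozilla/5.0 (compatible)',
    'Mozilla/5.0',
    'Mozilla/4.0',
    'Mozilla/5.01',
);

# grep in scalar context returns the number of matching elements.
my $count = grep { /$regverMozilla5/ } @agents;
print "exact Mozilla/5.0 hits: $count\n";   # exact Mozilla/5.0 hits: 2
```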

Anyway thanks for all the help to both of you. I have a better understanding of regular expressions and I'm a major step closer to exceptionally accurate browser statistics. :)

- John