Forum Moderators: coopster

regex find phrases?

for Google Analytics

         

chewy

2:41 am on Nov 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi,

Please, how would I write a simple regex to filter in all phrases?

For instance, GA will let me use /s to find any search term that is NOT a single word by showing all terms that have 1 space.

So that would return 2 word phrases.

How would I find 3 and 4 and more multi word phrases?

Thanks,

-C

PS - Happy Turkey Day for those celebrating it tomorrow (or today I suppose, wherever you are!)

eelixduppy

3:02 am on Nov 24, 2011 (gmt 0)



I'm not sure I completely understand what you are asking. Is this going to be searching a large string and extracting multi-word phrases, or do you just want to present a single phrase and know whether it is multi-worded or not?

Perhaps providing an example input string and what you'd like to see from it would help.

lucy24

11:39 am on Nov 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



When you say /s do you mean \s ?

By utterly yawn-provoking coincidence I have only just been doing similar searches in my own logs. I didn't have to be too exact so I just said ((\w+) ){5,} to pull up anything above a certain length. That excludes any searches containing punctuation.

Do the phrases occur in some kind of environment where everything but the search terms is already screened out? Otherwise you'd be getting a hell of a lot of false positives by just looking for spaces.

There are lots of ways to do it. For example

\S+\s\S+\s\S+ for three-word phrases (which may be part of longer phrases if you don't anchor them)

or

((\S+) ){9,}\S+

for a string of 10 or more words (assuming no space after the last one in the list). If you want to be really formal, each word would be

\p{Punct}+\w+\p{Punct}+

but probably not in exactly that form, because the {bracketed} part is very dialect-specific.
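
To see what anchoring changes, here is a minimal sketch using Python's re module. The sample queries are made up, and using Python's dialect is an assumption: it may not behave exactly like whatever GA runs.

```python
import re

# Unanchored: finds three words anywhere, so longer phrases match too.
three_unanchored = re.compile(r"\S+\s\S+\s\S+")
# Anchored with ^ and $: the whole query must be exactly three words.
three_anchored = re.compile(r"^\S+\s\S+\s\S+$")
# Ten or more words, each of the first nine (or more) followed by one space.
ten_plus = re.compile(r"^(\S+ ){9,}\S+$")

# Made-up sample search terms, for illustration only.
queries = ["red shoes", "cheap red shoes", "cheap red running shoes"]

unanchored_hits = [q for q in queries if three_unanchored.search(q)]
anchored_hits = [q for q in queries if three_anchored.search(q)]

print(unanchored_hits)  # the 3-word and the 4-word query
print(anchored_hits)    # only the 3-word query
```

Without the anchors, the four-word query is also reported, because a three-word run sits inside it.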

Why do they say \s or possibly /s instead of using the literal space character?

[edited by: lucy24 at 11:48 am (utc) on Nov 24, 2011]

chewy

11:44 am on Nov 24, 2011 (gmt 0)



What I am trying to do is mine GA for long phrases from an extremely large data set - I can see they are there, I just need to auto filter somehow.

Specifically, in the new Google Analytics, I am looking at a massive set of data returned by internal "site search".

I want to set up a filter to show just the phrases.

To get all "one word" type queries, I can use regex to exclude any queries with spaces by using "exclude \s".

To get all multiple word type queries I can use "include \s".

How would I get all 3 word queries? (specifically, how do I issue a simple regex command to filter in anything with 2 spaces?)

How would I get all 4 word queries? (anything with 3 spaces?)
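
One way to read these two questions: an n-word query is one word followed by n-1 repetitions of "space, then word". A rough sketch in Python follows. The helper names exactly_n_words and at_least_n_words are hypothetical, and it is an assumption that the GA filter box accepts {n} quantifiers and ^ $ anchors the same way Python does.

```python
import re

def exactly_n_words(n: int) -> re.Pattern:
    """Hypothetical helper: pattern for a query of exactly n words."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # One word, then (whitespace + word) repeated n-1 times, anchored at both ends.
    return re.compile(r"^\S+(\s\S+){%d}$" % (n - 1))

def at_least_n_words(n: int) -> re.Pattern:
    """Hypothetical helper: pattern for a query of n or more words."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return re.compile(r"^\S+(\s\S+){%d,}$" % (n - 1))

# Made-up sample search terms, for illustration only.
queries = ["shoes", "red shoes", "cheap red shoes", "cheap red running shoes"]

three_word = [q for q in queries if exactly_n_words(3).search(q)]
four_plus = [q for q in queries if at_least_n_words(4).search(q)]

print(exactly_n_words(3).pattern)  # the text you would try in the filter box
print(three_word)
print(four_plus)
```

The printed pattern for three words is ^\S+(\s\S+){2}$, and the four-word version only changes the 2 to a 3.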

I can do this with Excel, but because GA will only export something like 20,000 rows, manually collecting the million-plus queries would take an extremely long time.

Mods can place this over in the Analytics forum - but since things can be pretty quiet over there, I thought I'd start here first.

TX!

lucy24

11:49 am on Nov 24, 2011 (gmt 0)



Oops, we overlapped. Who knew that you would rematerialize at the precise instant that I realized I needed to say more? :)

chewy

1:00 pm on Nov 24, 2011 (gmt 0)



doesn't seem to be working...

I think we'll need some more 'stuff' to make it work.

The GA interface specifically says:

Include Search Term Matching RegExp =

If it was \s+\s, that would simply be any searches that matched exactly "space space", and we know that. What I'm trying to get at is the expression, with some kind of wildcards or whatever we call them now, that will return ANY string that has ANY number of spaces in it.
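
For comparison, a small sketch in Python's re module of how these expressions behave on made-up queries. Python's dialect is an assumption here and may differ from GA's. In this dialect \s+\s needs at least two whitespace characters in a row, a bare \s finds any query containing whitespace, and putting \S+ between the \s parts asks for separate words.

```python
import re

# Requires two or more consecutive whitespace characters.
double_space = re.compile(r"\s+\s")
# Matches any query that contains at least one whitespace character.
any_space = re.compile(r"\s")
# Three or more words: word, gap, word, gap, word (unanchored).
three_plus_words = re.compile(r"\S+\s+\S+\s+\S+")

# Made-up sample search terms; "red  shoes" contains a double space.
samples = ["shoes", "red shoes", "red  shoes", "cheap red shoes"]

double_space_hits = [s for s in samples if double_space.search(s)]
any_space_hits = [s for s in samples if any_space.search(s)]
three_plus_hits = [s for s in samples if three_plus_words.search(s)]

print(double_space_hits)
print(any_space_hits)
print(three_plus_hits)
```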

lucy24

8:19 pm on Nov 24, 2011 (gmt 0)



Regular expressions are case-sensitive, and case is often used as a negator. \w is word, \W is non-word. \s is space, \S is non-space. In my text editor's dialect, \p{Punct} is punctuation and \P{Punct} is non-punctuation. And so on.
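
A quick illustration of the upper/lower-case pairs, sketched with Python's re module on a made-up sample string. Python's built-in re does not support \p{Punct}, so only the \w/\W and \s/\S pairs are shown.

```python
import re

# Made-up sample search term with some punctuation in it.
sample = "cheap, red shoes!"

words = re.findall(r"\w+", sample)       # runs of word characters
non_words = re.findall(r"\W+", sample)   # runs of everything else
spaces = re.findall(r"\s+", sample)      # runs of whitespace
non_spaces = re.findall(r"\S+", sample)  # runs of non-whitespace

print(words)       # ['cheap', 'red', 'shoes']
print(non_words)   # [', ', ' ', '!']
print(spaces)      # [' ', ' ']
print(non_spaces)  # ['cheap,', 'red', 'shoes!']
```

Note that \S+ keeps the punctuation attached to each word, while \w+ drops it.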