How can I extract the search query from a logfile entry

seeking helpful php commands for such analysis

Oliver Henniges

3:41 pm on Mar 1, 2007 (gmt 0)

Hi,

I am currently trying to take a closer look at my logfiles, analyzing them with PHP. It's more a sort of brain-sparring than a true project, while I wait for the busier summer months ;)

Is there any elegant grep-notation to extract search phrases from the referrer entry in the logfiles? Every single search engine seems to have its very own way of doing it, and I found that quite confusing.

So far I have only used some string functions to cut each line into pieces. Maybe I overlooked some more sophisticated PHP functions?

coopster

4:11 pm on Mar 1, 2007 (gmt 0)

Would preg_grep() [php.net] work?
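A rough sketch of how that could look, assuming the logs are in the Apache combined format (the thread doesn't say which format is in use). extract_referrer() is a made-up helper name and the sample lines are invented; preg_grep() only filters whole lines, so preg_match() does the actual extraction here:

```php
<?php
// Hypothetical helper: pull the referrer field out of one log line.
// Assumes combined log format: "request" status size "referrer" "user agent"
function extract_referrer($line) {
    if (preg_match('/"[^"]*" \d{3} \S+ "([^"]*)"/', $line, $m)) {
        return $m[1];
    }
    return false; // line did not look like a combined-format entry
}

// Invented sample lines, for illustration only.
$lines = array(
    '1.2.3.4 - - [01/Mar/2007:15:41:00 +0000] "GET / HTTP/1.1" 200 512 "http://www.google.com/search?q=php+logfile" "Mozilla/5.0"',
    '1.2.3.5 - - [01/Mar/2007:15:42:00 +0000] "GET /a.html HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
);

// preg_grep() keeps only the lines matching the pattern (original keys are preserved).
$with_query = preg_grep('/[?&](q|p|query)=/', $lines);

foreach ($with_query as $line) {
    echo extract_referrer($line), "\n";
}
```

The parameter names in the preg_grep() pattern are just the common ones mentioned in this thread, not a complete list.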

Finger

5:20 pm on Mar 1, 2007 (gmt 0)

As far as I know you're probably gonna have to make special cases for each search engine. You could use something like $url_vars = parse_url($referrer), and check $url_vars['host'] against the search engine domain. Then make something to extract the query for each engine. Here is a ghetto function I made a long time ago that I still use. It works well enough, but it could use some improvement, and it doesn't work for the search engines that separate queries with slashes.

function get_se_query($referrer) {
    $ref = $referrer;
    $query_string = false;
    // each entry: host fragment, query parameter name, display name (not used below)
    $se_stuff = array();
    $se_stuff[] = array("google.", "q", "Google");
    $se_stuff[] = array("ask.com", "q", "Ask.com");
    $se_stuff[] = array("ask.co.uk", "ask", "Ask.co.uk");
    $se_stuff[] = array("comcast.net", "q", "Comcast");
    $se_stuff[] = array("yahoo", "p", "Yahoo");
    $se_stuff[] = array("aol.com", "query", "AOL");
    $se_stuff[] = array("msn.com", "q", "MSN");
    $se_stuff[] = array("netscape.com", "query", "Netscape");
    $se_stuff[] = array("netzero.net", "query", "NetZero");
    $se_stuff[] = array("altavista.com", "q", "Altavista");
    $se_stuff[] = array("mywebsearch.com", "searchfor", "Mywebsearch");
    $se_stuff[] = array("alltheweb.com", "q", "Alltheweb");
    $se_stuff[] = array("cnn.com", "query", "CNN");
    $se_stuff[] = array("myspace.com", "q", "MySpace");
    for ($i = 0, $size = sizeof($se_stuff); $i < $size; $i++) {
        if (stristr($ref, $se_stuff[$i][0])) {
            $symbol = $se_stuff[$i][1];
            $temp1 = explode("$symbol=", $ref, 2);
            // engine matched but the parameter is missing: try the next entry
            if (!isset($temp1[1])) {
                continue;
            }
            $temp2 = explode("&", $temp1[1]);
            $query_string = urldecode($temp2[0]);
            break; // stop at the first engine that yields a query
        }
    }
    return $query_string;
} // end get_se_query
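The parse_url() idea mentioned above could be sketched roughly like this. get_se_query_by_host() and the short $engines map are made up for illustration (the fragments and parameter names are taken from the list in the function above); matching on the host alone avoids false hits when an engine name merely appears somewhere in the query string:

```php
<?php
// Hypothetical variant: match the engine against the host only,
// then let parse_str() split and decode the query string.
function get_se_query_by_host($referrer) {
    // host fragment => query parameter name (illustrative subset, extend as needed)
    $engines = array(
        'google.' => 'q',
        'yahoo.'  => 'p',
        'aol.com' => 'query',
        'msn.com' => 'q',
    );
    $parts = parse_url($referrer);
    if (!is_array($parts) || empty($parts['host']) || empty($parts['query'])) {
        return false; // "-" or anything without a host and query string
    }
    parse_str($parts['query'], $vars); // values come back already URL-decoded
    foreach ($engines as $fragment => $param) {
        if (stripos($parts['host'], $fragment) !== false
            && isset($vars[$param]) && is_string($vars[$param])) {
            return $vars[$param];
        }
    }
    return false;
}

echo get_se_query_by_host('http://www.google.de/search?hl=de&aq=f&q=php+logfile');
```

Like the original, this does nothing for engines that put the query into the path with slashes.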

Oliver Henniges

10:33 am on Mar 2, 2007 (gmt 0)

Thx so far. Yes, some form of grep would work, but I am not familiar with the exact coding (to avoid the word 'lazy' ;) It would take me two days to find an appropriate solution for Google's results alone, and as Finger's posting suggests, that'd be only a tenth of the story.

But thx for the code snippet. I tried to extract via strpos() and substr(). Doing it with explode() will surely be faster.

jatar_k

3:56 pm on Mar 2, 2007 (gmt 0)

how about some overload?

here is some code that worked, made the server cry a bit on monster logfiles though, no regex, we had a dedicated server that we did it on

to be honest I can't remember how all of it works, but it worked ;)

you could always pick it apart and use little bits of it, some wasn't finished yet and some is commented out

if you look at the top of the code the file was named alp.php and there were 2 possible GET params to pass to it

<?php 
if (!isset($_GET['what'])) { 
 echo "<div align=\"center\"><p>&nbsp;<p>Select what to parse for from the list below."; 
 echo "<p><a href=\"alp.php?what=kw\">Parse for Keywords</a>"; 
 echo "<p><a href=\"alp.php?what=dom\">Parse for Domains</a></div>"; 
 die; 
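} 
// the switch below reads $what, so set it explicitly instead of relying on register_globals 
$what = $_GET['what']; 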
// benchmark timing 
function getmicrotime($t) { 
 list($usec, $sec) = explode(" ",$t); 
 return ((float)$usec + (float)$sec); 
} 
$start = microtime(); 
// end start benchmark 
// 
$fp = fopen("currentlog.log","r"); 
$lncnt = 0; 
$lstln = 0; 
$nextpos = 0; 
$towriteln = ""; 
$refrem = array(); 
$refremcnt = array(); 
$iptab = array(); 
// not used yet 
$domexclude = array("www.example.com","example.com","example.ca"); 
// 
while ($line = fgets($fp)) { 
 $qusplit = explode('"',$line); // explode() instead of the regex-based split(), which newer PHP versions dropped 
 $fstsp = strpos($qusplit[0]," "); 
 $qusplit[0] = substr($qusplit[0],0,$fstsp); 
 // if same as last and != "-" (lines without a referrer field are skipped too) 
 if (isset($qusplit[3]) && $qusplit[3] != "-") { 
 $lstln = $lncnt - 1; 
 if (in_array($qusplit[0],$iptab) && $refrem[$lstln] == $qusplit[3]) { 
 } else { 
  if (count($iptab) == 10) { 
  if (!in_array($qusplit[0],$iptab)) { 
   $temp = $iptab[0]; 
   $iptab[0] = $iptab[9]; 
   $po = array_pop($iptab); 
   $pu = array_push($iptab,$qusplit[0]); 
    //if ($pu > 10) die("pushed it over!"); 
  } 
  } else { 
  if (!in_array($qusplit[0],$iptab)) { 
   array_push($iptab,$qusplit[0]); 
  } 
  } 
  //$lastip = $qusplit[0]; 
  $qs = ""; 
  $nextpos = 0; 
  if (!empty($qusplit[3])) { 
  $queryarr = parse_url($qusplit[3]); 
  // switch for which data to grab 
  switch ($what) { 
   case "kw": 
   if (!empty($queryarr['query'])) { 
    parse_str($queryarr['query'],$breakuparr); 
    // now need to find the various vars that contain the search terms 
    // just dump the string to other arr 
    // echo the resulting arr 
    // keywords counted 
    // if inarray keyword ++ else add row 
    if (isset($breakuparr['encquery'])) { 
    $qs = $breakuparr['encquery']; 
    } else if (isset($breakuparr['q'])) { 
    $qs = $breakuparr['q']; 
    } else if (isset($breakuparr['p'])) { 
    $qs = $breakuparr['p']; 
    } else if (isset($breakuparr['source']) && $breakuparr['source']!= "NSCPTop") { 
    $qs = $breakuparr['source']; 
    } else if (isset($breakuparr['query'])) { 
    $qs = $breakuparr['query']; 
    } else if (isset($breakuparr['Keywords'])) { 
    $qs = $breakuparr['Keywords']; 
    } else if (isset($breakuparr['cid']) && isset($breakuparr['s'])) { 
    $qs = $breakuparr['s']; 
    //} else if () { 
    // add purchase count 
    // add cheque receive count 
    } else { 
    $qs = ""; 
    } 
    // end query finding stuff 
    $qs = strtolower(trim($qs)); 
    if (!empty($qs)) { 
    $qs = str_replace('\"','',$qs); 
    $qs = str_replace('+','',$qs); 
    //if ($nextpos === false) { 
    if (in_array($qs,$refrem)) { 
     //$where = "found"; 
     $nextpos = array_search($qs,$refrem); 
     $refremcnt[$nextpos]++; 
    } else { 
     //$where = "not found"; 
     $nextpos = count($refrem); 
     $refrem[$nextpos] = $qs; 
     $refremcnt[$nextpos] = 1; 
    } 
    } 
   }// if (!empty($queryarr['query'])) 
   break; 
   case "dom": 
   //if ($queryarr['host']!= "www.example.com" && $queryarr['host']!= "example.com") { 
   // echo "<br>",$queryarr['host']; 
   //} 
   $qs = str_replace("www.","",$queryarr['host']); 
   if (in_array($qs,$refrem)) { 
    $nextpos = array_search($qs,$refrem); 
    $refremcnt[$nextpos]++; 
   } else { 
    $nextpos = count($refrem); 
    $refrem[$nextpos] = $qs; 
    $refremcnt[$nextpos] = 1; 
   } 
   break; 
  } // switch ($what) 
  }// if (!empty($qusplit[3])) 
  $lncnt++; 
 } 
 } 
} 
//echo "<pre>"; 
//print_r($refrem); 
echo count($refrem); 
//echo "</pre>"; 
// 
$lncnt = 0; 
$fp2 = fopen("results.csv","w+"); 
for($i=0;$i<count($refrem);$i++) { 
 $towriteln = $refrem[$lncnt] . "," . $refremcnt[$lncnt] . "\n"; 
 fwrite($fp2,$towriteln); 
 $lncnt++; 
} 
// 
// benchmark timing 
$end = microtime(); 
$t2 = (getmicrotime($end) - getmicrotime($start)); 
// end benchmark timing 
// 
echo "<p>Total Time: <b>$t2</b>"; 
// 
echo "<p><a href=\"results.csv\">View Results File</a>"; 
?>

Oliver Henniges

7:16 pm on Mar 2, 2007 (gmt 0)

Great, this is what I had been looking for:

parse_url
parse_str

are the core components of your snippet as far as my question is concerned, and thus the PHP functions I was missing. Actually, I lent my manual to a friend for a while, and it is really hard to look for such solutions in the resources available on the internet. Sometimes books are much better.

I tried to understand the code as far as necessary: Are you sure it interprets e.g. queries from Ask correctly? It seems as if these case distinctions accounting for the various SEs are absolutely necessary. I was hoping there was a built-in command to account for them.
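For anyone reading along, here is a minimal illustration of what those two functions return. The referrer URL is an invented example:

```php
<?php
// Invented referrer, for illustration only.
$referrer = 'http://www.google.com/search?hl=en&q=blue+widgets&start=10';

// parse_url() splits the URL into scheme, host, path, query, ...
$parts = parse_url($referrer);
echo $parts['host'], "\n";   // www.google.com
echo $parts['query'], "\n";  // hl=en&q=blue+widgets&start=10

// parse_str() turns the query string into an array and URL-decodes the values.
parse_str($parts['query'], $vars);
echo $vars['q'], "\n";       // blue widgets
```

Passing the second argument to parse_str() matters: without it, older PHP versions create the variables directly in the current scope.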

> made the server cry a bit on monster logfiles though, no regex, we had a dedicated server that we did it on

yes, this is what I also experienced. Sometimes I wish someone had taught me the regex syntax 20 years earlier, when my brain was a bit more flexible...

My weekly logfiles are about 5-8 MB compressed on average and after I added the code-analysis for the query-part it took several minutes on my notebook for a single such file. That won't work, so this little project is also some 'learning by doing' on performance aspects.
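One thing that may be worth trying on the performance side, offered as a guess rather than a measured result: the posted script looks up every phrase with in_array() and array_search(), which scan the whole list each time. Using the phrase itself as the array key avoids that scan. count_phrases() is a hypothetical helper, not part of the code above:

```php
<?php
// Hypothetical helper: count phrases using the phrase as the array key,
// so each lookup is a hash lookup instead of a scan over all phrases seen so far.
function count_phrases(array $phrases) {
    $counts = array();
    foreach ($phrases as $phrase) {
        $phrase = strtolower(trim($phrase));
        if ($phrase === '') {
            continue; // ignore empty queries
        }
        if (isset($counts[$phrase])) {
            $counts[$phrase]++;
        } else {
            $counts[$phrase] = 1;
        }
    }
    arsort($counts); // most frequent phrases first, keys preserved
    return $counts;
}

$counts = count_phrases(array('PHP logfile', ' php logfile ', 'regex', ''));
foreach ($counts as $phrase => $n) {
    echo $phrase, ",", $n, "\n";
}
```

Whether this is the actual bottleneck would need to be checked with the timing code already in the script.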

Anyway, I wouldn't let such a script do any damage to my webserver's performance: I'm just doing some experiments on my local WAMP installation.

jatar_k

7:30 pm on Mar 2, 2007 (gmt 0)

there are more cases than the ones I have there but I built those from interpretation of my own logs and what my requirements were

you could easily add more cases
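As a sketch of one way to keep that manageable, the if/else chain could become a list of parameter names tried in order, so a new case is a single array entry. first_search_param() is a made-up name, and the list below only covers the simple cases from the posted code (not the 'source' or 'cid'/'s' special cases):

```php
<?php
// Hypothetical helper: return the first non-empty value among the given parameter names.
function first_search_param(array $vars, array $names) {
    foreach ($names as $name) {
        if (isset($vars[$name]) && is_string($vars[$name]) && $vars[$name] !== '') {
            return $vars[$name];
        }
    }
    return ''; // no known search parameter present
}

// Order matters: earlier names win, mirroring the order of the if/else chain.
$names = array('encquery', 'q', 'p', 'query', 'Keywords');

parse_str('ei=UTF-8&p=red+shoes', $vars);
echo first_search_param($vars, $names); // red shoes
```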

>> regex

well, I have gotten this far without using them very often so I don't think they are always the answer either, I've seen regex make many a server cry too ;)