I am currently taking a closer look at my logfiles, analyzing them with PHP. It's more a sort of brain-sparring than a true project, while waiting for the busier summer months ;)
Is there an elegant grep-style notation to extract search phrases from the referrer entry in the logfiles? Every single search engine seems to have its very own way of encoding them, and I find that quite confusing.
So far I have only used some string functions to cut each line into pieces. Maybe I overlooked some more sophisticated PHP functions?
function get_se_query($referrer) {
    $ref = $referrer;
    $query_string = false;
    // each entry: host fragment, query parameter name, display name
    $se_stuff = array();
    $se_stuff[] = array("google.", "q", "Google");
    $se_stuff[] = array("ask.com", "q", "Ask.com");
    $se_stuff[] = array("ask.co.uk", "ask", "Ask.co.uk");
    $se_stuff[] = array("comcast.net", "q", "Comcast");
    $se_stuff[] = array("yahoo", "p", "Yahoo");
    $se_stuff[] = array("aol.com", "query", "AOL");
    $se_stuff[] = array("msn.com", "q", "MSN");
    $se_stuff[] = array("netscape.com", "query", "Netscape");
    $se_stuff[] = array("netzero.net", "query", "NetZero");
    $se_stuff[] = array("altavista.com", "q", "Altavista");
    $se_stuff[] = array("mywebsearch.com", "searchfor", "Mywebsearch");
    $se_stuff[] = array("alltheweb.com", "q", "Alltheweb");
    $se_stuff[] = array("cnn.com", "query", "CNN");
    $se_stuff[] = array("myspace.com", "q", "MySpace");
    for ($i = 0, $size = sizeof($se_stuff); $i < $size; $i++) {
        if (stristr($ref, $se_stuff[$i][0])) {
            $symbol = $se_stuff[$i][1];
            $temp1 = explode("$symbol=", $ref, 2);
            // referrer from this engine without a search parameter: try the next entry
            if (count($temp1) < 2) {
                continue;
            }
            $temp2 = explode("&", $temp1[1]);
            $string = $temp2[0];
            $query_string = urldecode($string);
            break;
        }
    }
    return $query_string;
} // end get_se_query
But thanks for the code snippet. I tried to extract the phrases with strpos() and substr(). Doing it with explode() should be faster.
here is some code that worked, made the server cry a bit on monster logfiles though, no regex, we had a dedicated server that we did it on
to be honest I can't remember how all of it works, but it worked ;)
you could always pick it apart and use little bits of it, some wasn't finished yet and some is commented out
if you look at the top of the code the file was named alp.php and there were 2 possible GET params to pass to it
<?php
if (!isset($_GET['what'])) {
echo "<div align=\"center\"><p> <p>Select what to parse for from the list below.";
echo "<p><a href=\"alp.php?what=kw\">Parse for Keywords</a>";
echo "<p><a href=\"alp.php?what=dom\">Parse for Domains</a></div>";
die;
}
// register_globals no longer exists, so read the parameter explicitly
$what = $_GET['what'];
// benchmark timing
function getmicrotime($t) {
list($usec, $sec) = explode(" ",$t);
return ((float)$usec + (float)$sec);
}
$start = microtime();
// end start benchmark
//
$fp = fopen("currentlog.log","r");
$lncnt = 0;
$lstln = 0;
$nextpos = 0;
$towriteln = "";
$refrem = array();
$refremcnt = array();
$iptab = array();
// not used yet
$domexclude = array("www.example.com","example.com","example.ca");
//
while ($line = fgets($fp)) {
// split() was removed in PHP 7; explode() does the same job for a literal delimiter
$qusplit = explode('"',$line);
$fstsp = strpos($qusplit[0]," ");
$qusplit[0] = substr($qusplit[0],0,$fstsp);
// if same as last and != "-" (also skips lines that have no referrer field at all)
if (isset($qusplit[3]) && $qusplit[3] != "-") {
$lstln = $lncnt - 1;
if (in_array($qusplit[0],$iptab) && $refrem[$lstln] == $qusplit[3]) {
} else {
if (count($iptab) == 10) {
if (!in_array($qusplit[0],$iptab)) {
$temp = $iptab[0];
$iptab[0] = $iptab[9];
$po = array_pop($iptab);
$pu = array_push($iptab,$qusplit[0]);
//if ($pu > 10) die("pushed it over!");
}
} else {
if (!in_array($qusplit[0],$iptab)) {
array_push($iptab,$qusplit[0]);
}
}
//$lastip = $qusplit[0];
$qs = "";
$nextpos = 0;
if (!empty($qusplit[3])) {
$queryarr = parse_url($qusplit[3]);
// switch for which data to grab
switch ($what) {
case "kw":
if (!empty($queryarr['query'])) {
parse_str($queryarr['query'],$breakuparr);
// now need to find the various vars that contain the search terms
// just dump the string to other arr
// echo the resulting arr
// keywords counted
// if inarray keyword ++ else add row
if (isset($breakuparr['encquery'])) {
$qs = $breakuparr['encquery'];
} else if (isset($breakuparr['q'])) {
$qs = $breakuparr['q'];
} else if (isset($breakuparr['p'])) {
$qs = $breakuparr['p'];
} else if (isset($breakuparr['source']) && $breakuparr['source']!= "NSCPTop") {
$qs = $breakuparr['source'];
} else if (isset($breakuparr['query'])) {
$qs = $breakuparr['query'];
} else if (isset($breakuparr['Keywords'])) {
$qs = $breakuparr['Keywords'];
} else if (isset($breakuparr['cid']) && isset($breakuparr['s'])) {
$qs = $breakuparr['s'];
//} else if () {
// add purchase count
// add cheque receive count
} else {
$qs = "";
}
// end query finding stuff
$qs = strtolower(trim($qs));
if (!empty($qs)) {
$qs = str_replace('\"','',$qs);
$qs = str_replace('+','',$qs);
//if ($nextpos === false) {
if (in_array($qs,$refrem)) {
//$where = "found";
$nextpos = array_search($qs,$refrem);
$refremcnt[$nextpos]++;
} else {
//$where = "not found";
$nextpos = count($refrem);
$refrem[$nextpos] = $qs;
$refremcnt[$nextpos] = 1;
}
}
}// if (!empty($queryarr['query']))
break;
case "dom":
//if ($queryarr['host']!= "www.example.com" && $queryarr['host']!= "example.com") {
// echo "<br>",$queryarr['host'];
//}
$qs = str_replace("www.","",$queryarr['host']);
if (in_array($qs,$refrem)) {
$nextpos = array_search($qs,$refrem);
$refremcnt[$nextpos]++;
} else {
$nextpos = count($refrem);
$refrem[$nextpos] = $qs;
$refremcnt[$nextpos] = 1;
}
break;
} // switch ($what)
}// if (!empty($qusplit[3]))
$lncnt++;
}
}
}
//echo "<pre>";
//print_r($refrem);
echo count($refrem);
//echo "</pre>";
//
$lncnt = 0;
$fp2 = fopen("results.csv","w+");
for($i=0;$i<count($refrem);$i++) {
$towriteln = $refrem[$lncnt] . "," . $refremcnt[$lncnt] . "\n";
fwrite($fp2,$towriteln);
$lncnt++;
}
//
// benchmark timing
$end = microtime();
$t2 = (getmicrotime($end) - getmicrotime($start));
// end benchmark timing
//
echo "<p>Total Time: <b>$t2</b>";
//
echo "<p><a href=\"results.csv\">View Results File</a>";
?>
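The script relies on $qusplit[3] being the referrer, which is worth spelling out. The following is only a sketch and assumes the log uses the Apache "combined" format; the sample line and the variable names other than $qusplit are made up for illustration.

```php
<?php
// Sample line in Apache "combined" format (invented for this example).
$line = '192.0.2.10 - - [12/Mar/2007:10:15:32 +0100] "GET /index.html HTTP/1.1" 200 5120 '
      . '"http://www.google.com/search?q=php+logfile" "Mozilla/5.0"';

// Splitting on the double quote yields alternating unquoted and quoted pieces.
$qusplit = explode('"', $line);

// Piece 0 starts with the client IP; keep everything before the first space.
$ip       = substr($qusplit[0], 0, strpos($qusplit[0], ' '));
$request  = $qusplit[1]; // first quoted field: the request line
$referrer = $qusplit[3]; // second quoted field: the referrer
$agent    = $qusplit[5]; // third quoted field: the user agent

echo $ip, "\n", $referrer, "\n";
```

If the log format differs (for example the "common" format, which has no referrer field), index 3 will not exist and the script has nothing to parse.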
parse_url and parse_str are the core components of your snippet as far as my question is concerned, and thus the PHP functions I was missing. Actually, I lent my manual to a friend for a while, and it is really hard to find such solutions in the resources available on the internet. Sometimes books are much better.
I tried to understand the code as far as necessary: are you sure it interprets queries from Ask, for example, correctly? It seems as if these case distinctions accounting for the various search engines are absolutely necessary. I was hoping there was a built-in function to account for them.
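As far as I know there is no built-in function that knows the search engines, so a lookup table still seems necessary. A minimal sketch of combining parse_url and parse_str with such a table follows; extract_search_phrase() and $engines are hypothetical names, and the host/parameter pairs are taken from the list posted earlier in this thread.

```php
<?php
// Hypothetical helper, not part of the scripts posted in this thread.
// $engines maps a host fragment to the query parameter that engine uses.
function extract_search_phrase($referrer, array $engines)
{
    $parts = parse_url($referrer);
    if ($parts === false || empty($parts['host']) || empty($parts['query'])) {
        return false; // not a URL with a host and a query string
    }
    // parse_str() URL-decodes the values, so "+" and "%20" become spaces.
    parse_str($parts['query'], $params);
    foreach ($engines as $hostFragment => $paramName) {
        if (stripos($parts['host'], $hostFragment) !== false
            && isset($params[$paramName]) && is_string($params[$paramName])) {
            return trim($params[$paramName]);
        }
    }
    return false; // unknown engine, or known engine without a search phrase
}

$engines = array(
    'google.'   => 'q',
    'yahoo'     => 'p',
    'aol.com'   => 'query',
    'ask.co.uk' => 'ask',
);

echo extract_search_phrase('http://www.google.com/search?hl=en&q=php+log+parser', $engines), "\n";
```

Because the host is matched separately from the parameter name, Ask.co.uk with its "ask" parameter is just one more row in the table.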
> made the server cry a bit on monster logfiles though, no regex, we had a dedicated server that we did it on
yes, this is what I also experienced. Sometimes I wish someone had taught me the regex syntax 20 years earlier, when my brain was a bit more flexible...
My weekly logfiles are about 5-8 MB compressed on average, and after I added the analysis of the query part, a single such file took several minutes on my notebook. That won't work, so this little project is also some 'learning by doing' on performance aspects.
Anyway, I wouldn't let such a script do any damage to my web server's performance: I am just doing some experiments on my local WAMP installation.
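One likely contributor to the slowness: the posted script calls in_array() and array_search() for every log line, and both scan the whole list of phrases found so far. A possible alternative, sketched here with a hypothetical count_phrases() helper, is to use the phrase itself as the array key so that each lookup is a single hash access. Whether this is the main bottleneck for your files is an assumption that would need measuring.

```php
<?php
// Hypothetical helper: tallies phrases using the phrase itself as the array key.
function count_phrases(array $phrases)
{
    $counts = array();
    foreach ($phrases as $phrase) {
        $key = strtolower(trim($phrase));
        if ($key === '') {
            continue; // ignore empty phrases
        }
        if (isset($counts[$key])) {
            $counts[$key]++;
        } else {
            $counts[$key] = 1;
        }
    }
    arsort($counts); // most frequent phrase first
    return $counts;
}

$counts = count_phrases(array('php logs', 'PHP Logs ', 'apache', ''));
print_r($counts);
```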
you could easily add more cases
>> regex
well, I have gotten this far without using them very often so I don't think they are always the answer either, I've seen regex make many a server cry too ;)
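For comparison, here is roughly what the regex route could look like. This is only a sketch: regex_query_param() is a hypothetical name, and the parameter name still has to be chosen per search engine.

```php
<?php
// Hypothetical helper: pulls one named parameter out of a referrer with a regex.
function regex_query_param($referrer, $param)
{
    // [?&] requires the name to start right after "?" or "&";
    // the capture group stops at the next "&" or at a "#" fragment.
    $pattern = '/[?&]' . preg_quote($param, '/') . '=([^&#]*)/';
    if (preg_match($pattern, $referrer, $m)) {
        return urldecode($m[1]);
    }
    return false;
}

echo regex_query_param('http://www.google.com/search?hl=en&q=log+analysis', 'q'), "\n";
```

The [?&] prefix keeps the pattern from matching a parameter such as aq= when looking for q=, which a plain split on "q=" would not distinguish.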