Forum Moderators: open
I.E. Take this:
www.mysite.com 1xx.xxx.xxx.xx - - [01/Oct/2003:01:38:53 -0700] "GET /landing-page/from-mysite.htm HTTP/1.0" 200 6399 "http://search.yahoo.com/search?p=my+sites+keywords" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; 9L5.5; 9LA; .NET CLR 1.0.3705)"
And give me this for each page:
/landing-page/from-mysite.htm
- keyword
- keyphrase
- other keyword
- etc. etc.
Does this exist?
I suppose you could cook this up with PHP and mySQL...
SN
<added> Aw sorry, I just realized that would only work if your log is in the same format as mine... Well I write it up anyways, perhaps you can use the keyword sniffing part at least.</added>
Regards, SN
SELECT
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
IF(re.data RLIKE CONCAT("^http://([a-z]{0,3}.{0,3}.google¦search.msn¦search.peoplepc¦"
,"search.earthlink¦partners.search.msn¦www.altavista¦www.gogole¦www.ntlworld¦sucheaol.aol¦www.searchalot¦"
,"wwb.ieplugin).*[?&/]q=[^\&]"),
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("q=",re.data)+2),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE CONCAT("^http://([a-z]{0,3}.{0,3}.google).*[?&/]as_epq=[^\&]"),
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("as_epq=",re.data)+8),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE CONCAT("^http://(search.aol¦sucheaol.aol¦aolsearch.aol¦search.netscape¦[^\.]*.?mamma¦"
,"websearch.cnn¦hotbot¦www.business¦www.comcast¦www.hotbot¦www.attbi).*[?&/]query="),
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("query=",re.data)+6),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(www.att¦www.worldnet).*qry=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("qry=",re.data)+4),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(msxml.excite¦dpxml.verizon¦www.dogpile¦www.metacrawler).*qkw=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("qkw=",re.data)+4),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(.{0,3}search.yahoo¦google.yahoo¦66.218.71.[0-9]{1,3}/search).*p=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("p=",re.data)+2),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(drs.yahoo).*K=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("K=",re.data)+2),1,IF((@L1:=LOCATE("/",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(wwWebmasterWorldebsearch¦[^\.]*.?mysearch¦search.iwon).*searchfor=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("searchfor=",re.data)+10),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(find.web.aol¦search.aol¦aolsearch.aol).*userQuery=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("userQuery=",re.data)+10),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(smartbrowse.netster¦searchtheweb.register¦partners[0-9]*\.oingo¦www.overture¦www.zuvio).*keywords=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("keywords=",LCASE(re.data))+9),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(www.zworks).*what=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("what=",LCASE(re.data))+6),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://([a-z]{0,3}.?search.yahoo).*va=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("va=",LCASE(re.data))+4),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(www.dogpile).*web/",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("web/",LCASE(re.data))+5),1,IF((@L1:=LOCATE("/",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(search.msn).*mt=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("mt=",LCASE(re.data))+4),1,IF((@L1:=LOCATE("/",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,""))))))))))))))
,"+"," "),"%20"," "),"%22","\""),"%2b","+"),"%2B","+"),"%2c",","),"%2C",","),"%2d","-"),"%2D","-"),"%3A",":")
AS keywords
,pi.data AS path_info
,COUNT(DISTINCT ip)AS cnt
FROM
log_pageviews AS pv
LEFT JOIN log_sessions AS s ON(s.session_id=pv.session_fk)
LEFT JOIN log_data AS ua ON(s.ua_fk=ua.data_id)
LEFT JOIN log_data as pi ON(pi.data_id=pv.path_fk)
LEFT JOIN log_data as qs ON(qs.data_id=pv.query_fk)
LEFT JOIN log_data as re ON(re.data_id=pv.referer_fk)
WHERE (dt>=SUBDATE(NOW(),INTERVAL 7 DAY))
AND (NOT(ua.data RLIKE CONCAT(
"googlebot¦robot¦Crawler¦ZyBorg¦libwww¦LWP::Simple¦"
,"Fetch API¦CJNetworkQuality¦Scooter¦WinHTTP¦Microsoft URL Control¦"
,"Gigabot¦linkchecker¦ia_archiver¦Jeeves¦Slurp¦sitecheck\.internetseer\.com¦"
,"Marvin v0.3¦NPBot¦QuepasaCreep¦lwp-trivial/1.34¦Girafabot¦"
,"Tagword¦Scrubby/2.1¦SurveyBot¦FreeFind/1.0¦FDSE robot¦SearchVIEW.net Spider¦"
,"Sqworm/2.9.85-BETA¦VoilaBot¦T-H-U-N-D-E-R-S-T-O-N-E¦MS Search 4.0 Robot¦"
,"msnbot/0.11")
)OR ua.data IS NULL)
AND re.data!=""
AND re.data RLIKE CONCAT("^http://([a-z]{0,3}.{0,3}.google¦search.msn.¦search.peoplepc¦"
,"search.earthlink¦.{0,3}search.yahoo¦wwWebmasterWorldebsearch¦aolsearch.aol¦search.netscape¦"
,"google.yahoo¦66.218.71.[0-9]{1,3}/search¦www.att¦msxml.excite¦dpxml.verizon¦"
,"partners.search.msn¦drs.yahoo¦search.aol¦[^\.]*.?mysearch¦search.iwon¦[^\.]*.?mamma¦"
,"smartbrowse.netster¦find.web.aol¦partners[0-9]*\.oingo¦websearch.cnn¦www.altavista¦"
,"searchtheweb.register¦www.gogole¦www.business¦www.dogpile¦www.overture¦www.searchalot¦"
,"www.ntlworld¦www.metacrawler¦www.comcast¦sucheaol.aol¦wwb.ieplugin¦www.hotbot¦www.worldnet.att¦"
,"www.zworks)")
AND re.data IS NOT NULL
GROUP BY keywords,path_info
HAVING keywords!=""
ORDER BY cnt DESC
LIMIT 50
;
Thanks killroy! That's gong to take some mindbending (i.e. time) for me to get my head around it, but I'll give it a good hard look.
It'll be trivial to pair it up with landing pages
What'd you mean by this? That I'll be able to mod the query to do this easily, or when you said 'let me sort it out for ya' that you have written it into the query? (my SQL is weak, though this will make it stronger!)
if you can post the CREATE query I'd be happy to modify it for you
I'm not sure what the CREATE query is that you refer to... Currently my logs are in a raw .log format (apache extended format).
Cheers - mipapage
As I said, if you post the CREATE statement of your log table I'll write yo ua query to do exactly what you wanted.
SN
Alternatively extract he regexes, should be easy to translate them to aperl or php or pcre or whatever language and run them over the apache log ifle.
SN
Extremely detailed. I find the referral keywords for landed page most useful. :)
Thanks a lot for the replies. I'll get on seeing how I can get my logs into mySQL.. Hmmm, seeing as I haven't got them into a dbase yet, would it be easier for you if you tell me the structure of the table (just fields and titles?) and then I'll stick em in that way?
Here's the scoop - my data is as above in msg1. It's taken from a shared apache environment, so the logs are filtered out from the other sites that are present, I'd imagine.
I can get the data into mySQL by parsing up the data and naming it with whatever convention you use. That way this can, as you say, 'be usefull for many'.
Regards - mipapage
[edited by: mipapage at 4:56 pm (utc) on Oct. 21, 2003]
Alternative to what I said above, I could simply create a table with the path_info and the referrer. I imagine I'm being a bit simple here but I think I would approach this by cleaning up the log in excel then bringing it into mySQL, for now, as I have yet to diveinto mySQL (read: I am a mySQL rookie).
So, that said, I could create a very simple table with ID, path_info, and referrer.
Is this sort of what you were after?
- mipapage
CREATE TABLE IF NOT EXISTS log_data
(data_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
,datatype ENUM("path_info","post_data","query_string","referer","user_agent","extra_info")NOT NULL
,datahash INT UNSIGNED NOT NULL
,data TEXT NOT NULL
,UNIQUE KEY string(datatype,datahash)
);
CREATE TABLE IF NOT EXISTS log_sessions
(session_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
,ip INT UNSIGNED NOT NULL
,ua_fk INT UNSIGNED
,KEY ip(ip)
);
CREATE TABLE IF NOT EXISTS log_pageviews
(pv_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
,session_fk INT UNSIGNED NOT NULL
,dt DATETIME NOT NULL
,path_fk INT UNSIGNED NOT NULL
,query_fk INT UNSIGNED
,post_fk INT UNSIGNED
,referer_fk INT UNSIGNED
,extra_fk INT UNSIGNED
,KEY session_fk(session_fk)
,KEY dt(dt)
,KEY path_fk(path_fk)
,KEY query_fk(query_fk)
,KEY post_fk(post_fk)
,KEY referer_fk(referer_fk)
,KEY extra_fk(extra_fk)
);
I use this structure, to avoid duplication of user agents, path info and other string data. My old log reached over 1.4gigs with a 1.7 gigs index file, and queries were SLOW. This new format saves at least 70% of repetitive data and is MUCH faster to query.
For your purposes though, you're right. You really just need:
datetime,ip,path,referer
Excel won't really work, cos it only does 64k lines at the most (unless of courseyour log is smaller then that).
If you get it into CSV format, you can use the MYSQL command to read from file and getit straight into this simple table. Once you've done that I'll show you how to rewrite the queryto extract the data.
Good Luck!
SN
PS: I recommend you don'T use my three table split structure. It can make things complex if you don't write acomprehensive tool kit, and is probably not worth it unless you need live querying on your logs.
I haven't had much time to get into 'getting my log data into' mySQL. Can we just say that I'll create a table out of: datetime,ip,path,referer?
That way, we can close that thread off before it gets closed off by time, and for now I can implement that query on smaller logs importing via excel (new sites) and once I find a better importing solution I'll be set.
Let me know what you think!
mipapage