Forum Moderators: open

Message Too Old, No Replies

Assiging referral keywords to their landing page..

Is there something that can do this?

         

mipapage

10:13 am on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm after something that will analyse my logs and:

  1. Look at the referring search strings...
  2. and the associated landing pages...
  3. and spit out a list of my sites pages along with the keywords and phrases that brought people there.

I.E. Take this:

www.mysite.com 1xx.xxx.xxx.xx - - [01/Oct/2003:01:38:53 -0700] "GET /landing-page/from-mysite.htm HTTP/1.0" 200 6399 "http://search.yahoo.com/search?p=my+sites+keywords" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; 9L5.5; 9LA; .NET CLR 1.0.3705)"

And give me this for each page:

/landing-page/from-mysite.htm
  • keyword
  • keyphrase
  • other keyword
  • etc. etc.

Does this exist?
I suppose you could cook this up with PHP and mySQL...

killroy

10:19 am on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I wrote a MySQL query that will extract keywords from search queries for about 50 or so common search engines. It'll be trivial to pair it up with landing pages... let me sort it out for ya, it's a long query, but it'll be usefull for many...

SN

<added> Aw sorry, I just realized that would only work if your log is in the same format as mine... Well I write it up anyways, perhaps you can use the keyword sniffing part at least.</added>

killroy

10:26 am on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok, here it comes. This MySQL query lists the top 50 (by count(distinct ip) keyword/landing page combos. My log is split up acroyy a sessions, a pageviews and a datastrings table, but the first half, the keyword extraction should work even for a flat-file log. Good luck, and let me know if you need help adjusting it to your log format. if you can post the CREATE query I'd be happy to modify it for you.

Regards, SN

SELECT
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
IF(re.data RLIKE CONCAT("^http://([a-z]{0,3}.{0,3}.google¦search.msn¦search.peoplepc¦"
,"search.earthlink¦partners.search.msn¦www.altavista¦www.gogole¦www.ntlworld¦sucheaol.aol¦www.searchalot¦"
,"wwb.ieplugin).*[?&/]q=[^\&]"),
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("q=",re.data)+2),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE CONCAT("^http://([a-z]{0,3}.{0,3}.google).*[?&/]as_epq=[^\&]"),
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("as_epq=",re.data)+8),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE CONCAT("^http://(search.aol¦sucheaol.aol¦aolsearch.aol¦search.netscape¦[^\.]*.?mamma¦"
,"websearch.cnn¦hotbot¦www.business¦www.comcast¦www.hotbot¦www.attbi).*[?&/]query="),
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("query=",re.data)+6),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(www.att¦www.worldnet).*qry=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("qry=",re.data)+4),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(msxml.excite¦dpxml.verizon¦www.dogpile¦www.metacrawler).*qkw=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("qkw=",re.data)+4),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(.{0,3}search.yahoo¦google.yahoo¦66.218.71.[0-9]{1,3}/search).*p=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("p=",re.data)+2),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(drs.yahoo).*K=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("K=",re.data)+2),1,IF((@L1:=LOCATE("/",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(wwWebmasterWorldebsearch¦[^\.]*.?mysearch¦search.iwon).*searchfor=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("searchfor=",re.data)+10),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(find.web.aol¦search.aol¦aolsearch.aol).*userQuery=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("userQuery=",re.data)+10),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(smartbrowse.netster¦searchtheweb.register¦partners[0-9]*\.oingo¦www.overture¦www.zuvio).*keywords=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("keywords=",LCASE(re.data))+9),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(www.zworks).*what=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("what=",LCASE(re.data))+6),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://([a-z]{0,3}.?search.yahoo).*va=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("va=",LCASE(re.data))+4),1,IF((@L1:=LOCATE("&",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(www.dogpile).*web/",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("web/",LCASE(re.data))+5),1,IF((@L1:=LOCATE("/",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,IF(re.data RLIKE "^http://(search.msn).*mt=",
SUBSTRING(@q1:=SUBSTRING(re.data,LOCATE("mt=",LCASE(re.data))+4),1,IF((@L1:=LOCATE("/",@q1)-1)!=-1,@L1,LENGTH(@q1)))
,""))))))))))))))
,"+"," "),"%20"," "),"%22","\""),"%2b","+"),"%2B","+"),"%2c",","),"%2C",","),"%2d","-"),"%2D","-"),"%3A",":")
AS keywords
,pi.data AS path_info
,COUNT(DISTINCT ip)AS cnt
FROM
log_pageviews AS pv
LEFT JOIN log_sessions AS s ON(s.session_id=pv.session_fk)
LEFT JOIN log_data AS ua ON(s.ua_fk=ua.data_id)
LEFT JOIN log_data as pi ON(pi.data_id=pv.path_fk)
LEFT JOIN log_data as qs ON(qs.data_id=pv.query_fk)
LEFT JOIN log_data as re ON(re.data_id=pv.referer_fk)
WHERE (dt>=SUBDATE(NOW(),INTERVAL 7 DAY))
AND (NOT(ua.data RLIKE CONCAT(
"googlebot¦robot¦Crawler¦ZyBorg¦libwww¦LWP::Simple¦"
,"Fetch API¦CJNetworkQuality¦Scooter¦WinHTTP¦Microsoft URL Control¦"
,"Gigabot¦linkchecker¦ia_archiver¦Jeeves¦Slurp¦sitecheck\.internetseer\.com¦"
,"Marvin v0.3¦NPBot¦QuepasaCreep¦lwp-trivial/1.34¦Girafabot¦"
,"Tagword¦Scrubby/2.1¦SurveyBot¦FreeFind/1.0¦FDSE robot¦SearchVIEW.net Spider¦"
,"Sqworm/2.9.85-BETA¦VoilaBot¦T-H-U-N-D-E-R-S-T-O-N-E¦MS Search 4.0 Robot¦"
,"msnbot/0.11")
)OR ua.data IS NULL)
AND re.data!=""
AND re.data RLIKE CONCAT("^http://([a-z]{0,3}.{0,3}.google¦search.msn.¦search.peoplepc¦"
,"search.earthlink¦.{0,3}search.yahoo¦wwWebmasterWorldebsearch¦aolsearch.aol¦search.netscape¦"
,"google.yahoo¦66.218.71.[0-9]{1,3}/search¦www.att¦msxml.excite¦dpxml.verizon¦"
,"partners.search.msn¦drs.yahoo¦search.aol¦[^\.]*.?mysearch¦search.iwon¦[^\.]*.?mamma¦"
,"smartbrowse.netster¦find.web.aol¦partners[0-9]*\.oingo¦websearch.cnn¦www.altavista¦"
,"searchtheweb.register¦www.gogole¦www.business¦www.dogpile¦www.overture¦www.searchalot¦"
,"www.ntlworld¦www.metacrawler¦www.comcast¦sucheaol.aol¦wwb.ieplugin¦www.hotbot¦www.worldnet.att¦"
,"www.zworks)")
AND re.data IS NOT NULL
GROUP BY keywords,path_info
HAVING keywords!=""
ORDER BY cnt DESC
LIMIT 50
;

killroy

10:27 am on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Damn double post :(

mipapage

2:18 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wow,

Thanks killroy! That's gong to take some mindbending (i.e. time) for me to get my head around it, but I'll give it a good hard look.

It'll be trivial to pair it up with landing pages

What'd you mean by this? That I'll be able to mod the query to do this easily, or when you said 'let me sort it out for ya' that you have written it into the query? (my SQL is weak, though this will make it stronger!)

if you can post the CREATE query I'd be happy to modify it for you

I'm not sure what the CREATE query is that you refer to... Currently my logs are in a raw .log format (apache extended format).

Cheers - mipapage

killroy

2:41 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well the first bit extracts the keywords for various search engines. once you got that you just pull the path_info line from your table, then group by both and count pageviews or IPs, as you prefer, that part is simple.

As I said, if you post the CREATE statement of your log table I'll write yo ua query to do exactly what you wanted.

SN

killroy

2:51 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ah, just reread your last line. For this to work, you'll ahve to first load your logs into a MySQL database. Or, even better, configure Apache to log straight into a MySQL database instead of to file. MUCH more usable.

Alternatively extract he regexes, should be easy to translate them to aperl or php or pcre or whatever language and run them over the apache log ifle.

SN

sun818

3:53 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



hi mipapage, there's a free log analysis program called "relax" that will do exactly what you want:
[webmasterworld.com...]

Extremely detailed. I find the referral keywords for landed page most useful. :)

mipapage

4:25 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



killroy,

Thanks a lot for the replies. I'll get on seeing how I can get my logs into mySQL.. Hmmm, seeing as I haven't got them into a dbase yet, would it be easier for you if you tell me the structure of the table (just fields and titles?) and then I'll stick em in that way?

Here's the scoop - my data is as above in msg1. It's taken from a shared apache environment, so the logs are filtered out from the other sites that are present, I'd imagine.

I can get the data into mySQL by parsing up the data and naming it with whatever convention you use. That way this can, as you say, 'be usefull for many'.

Regards - mipapage

[edited by: mipapage at 4:56 pm (utc) on Oct. 21, 2003]

mipapage

4:52 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Brilliant! That Relax is very good, providing that it's accurate... ;-]

I would however like to check out killroy's idea, as it would be a great fit into a *cough* CMS that I am writing now (learning and writing!)...

mipapage

8:05 pm on Oct 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Killroy,

Alternative to what I said above, I could simply create a table with the path_info and the referrer. I imagine I'm being a bit simple here but I think I would approach this by cleaning up the log in excel then bringing it into mySQL, for now, as I have yet to diveinto mySQL (read: I am a mySQL rookie).

So, that said, I could create a very simple table with ID, path_info, and referrer.

Is this sort of what you were after?

- mipapage

killroy

2:16 pm on Oct 22, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't think you wanna use my structure, cos it'll be hard to produce unless you're not logging already directly into it. here it is anyways:

CREATE TABLE IF NOT EXISTS log_data
(data_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
,datatype ENUM("path_info","post_data","query_string","referer","user_agent","extra_info")NOT NULL
,datahash INT UNSIGNED NOT NULL
,data TEXT NOT NULL
,UNIQUE KEY string(datatype,datahash)
);

CREATE TABLE IF NOT EXISTS log_sessions
(session_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
,ip INT UNSIGNED NOT NULL
,ua_fk INT UNSIGNED
,KEY ip(ip)
);

CREATE TABLE IF NOT EXISTS log_pageviews
(pv_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
,session_fk INT UNSIGNED NOT NULL
,dt DATETIME NOT NULL
,path_fk INT UNSIGNED NOT NULL
,query_fk INT UNSIGNED
,post_fk INT UNSIGNED
,referer_fk INT UNSIGNED
,extra_fk INT UNSIGNED
,KEY session_fk(session_fk)
,KEY dt(dt)
,KEY path_fk(path_fk)
,KEY query_fk(query_fk)
,KEY post_fk(post_fk)
,KEY referer_fk(referer_fk)
,KEY extra_fk(extra_fk)
);

I use this structure, to avoid duplication of user agents, path info and other string data. My old log reached over 1.4gigs with a 1.7 gigs index file, and queries were SLOW. This new format saves at least 70% of repetitive data and is MUCH faster to query.

For your purposes though, you're right. You really just need:
datetime,ip,path,referer

Excel won't really work, cos it only does 64k lines at the most (unless of courseyour log is smaller then that).

If you get it into CSV format, you can use the MYSQL command to read from file and getit straight into this simple table. Once you've done that I'll show you how to rewrite the queryto extract the data.

Good Luck!

SN

PS: I recommend you don'T use my three table split structure. It can make things complex if you don't write acomprehensive tool kit, and is probably not worth it unless you need live querying on your logs.

mipapage

2:24 pm on Oct 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Killroy,

I haven't had much time to get into 'getting my log data into' mySQL. Can we just say that I'll create a table out of: datetime,ip,path,referer?

That way, we can close that thread off before it gets closed off by time, and for now I can implement that query on smaller logs importing via excel (new sites) and once I find a better importing solution I'll be set.

Let me know what you think!

mipapage

mipapage

3:07 pm on Nov 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



bump!
killroy, yer sticky's full!