
Website Analytics - Tracking and Logging Forum

Using Excel to Analyse Bots
A step-by-step guide
DixonJones
11:39 pm on Aug 20, 2014 (gmt 0)

With Google Analytics so prevalent these days, we forget that GA misses huge swathes of data about our sites. I recently needed to ascertain the relative amount of crawling each bot was doing. I wanted to know who was crawling most and to build a pie chart of each bot's activity – but there are other uses for this data. For example, you might want to know:

*What percentage of your bandwidth is being consumed by bots
*Whether the bots are properly crawling all your pages
*Whether there are bots that don’t do what you think they do.

Looking at raw log files can seem a bit daunting – and I found that all four log analyzer programs that came with my web hosting package were less than adequate for the task. I am not the strongest programmer, but I do dabble a bit with Excel, so in just a few minutes I had created my own bot analyzer in Excel. I think that with a little thought I could turn this into a whole log-file analysis suite… but for now I'll just share what I did in the hope that it helps a few people.

How to get your log files
If your site uses cPanel, go into cPanel and select "logfiles". One of the options is to download your raw log files. Bear in mind, please, that the download is usually in compressed form. My Mac unzipped these with two clicks – Windows mileage may vary. When they were unzipped, mine did not have a filename extension, so I changed the filename to .txt so that the file could be read by Excel.

If you don’t have cPanel, then your backend system may have another easy way. If not, then you can do it using FTP. The Log files are usually in a directory off the “root”. This is usually one directory level ABOVE where your HTML files are kept.

[Aside]If they sit in the same directory as your HTML files, then there is a strong possibility that they are "open" for the rest of the world to see… which is likely to result in you getting thousands of malicious bots trying to create crappy log file entries in the vain hope that they will be picked up by search engines. Get them closed![/Aside]
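If you would rather script the unzip-and-rename step than do it by hand, here is a minimal sketch in Python. It assumes the download is a gzip file named access-log.gz – your host will use its own naming – and writes it back out with a .txt extension so Excel can read it:

import gzip
import shutil

# Decompress the downloaded raw log (the filename is an assumption; adjust
# it to whatever your host calls the file) and save it as plain text.
with gzip.open("access-log.gz", "rb") as src, open("access-log.txt", "wb") as dst:
    shutil.copyfileobj(src, dst)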

Create an Excel file with TWO worksheets
Call the first worksheet "BotAnalysis" and the second worksheet "RawLogFile". I am doing it this way so that, using the same spreadsheet in the future, I will be able to make the Excel file much more elaborate, with sheets for anything from visitor sessions to mobile users.

Import the raw logs into the RawLogFile worksheet
By making the raw logs a text file, my import was just one column. That suits me just fine. Every line is a separate "hit" on your server. I have written elsewhere on WebmasterWorld over the years about the difference between hits and page views… you might want to take the time to look that up. One word of warning here… modern Excel versions allow over a million rows per sheet, but you will still be limited by your computer's memory and CPU. If you have a site the size of WebmasterWorld, this isn't going to work! But you could maybe use this method to sample your log-file data – still good for pie charts! I have had 500,000 lines in an Excel spreadsheet before… but it was pushing the limits of my laptop.
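If your log really is too big for Excel, one quick way to sample it is to keep every Nth line before importing. A minimal sketch, assuming the log was saved as access-log.txt:

# Keep every 10th line so the sample fits comfortably in Excel.
SAMPLE_EVERY = 10

with open("access-log.txt") as src, open("access-log-sample.txt", "w") as dst:
    for i, line in enumerate(src):
        if i % SAMPLE_EVERY == 0:
            dst.write(line)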

Identify the bots

You are going to get Excel to count every line that mentions each bot, so you need a unique string which identifies each bot. Good bots like Google tell you who they are. Looking at the log lines, they usually not only identify themselves, but also give a URL where you can read more about them. I found it best to use the URL they provide as my "bot identifier". This is because the bots may arrive with different version numbers on different days, and getting too lax will create grave errors. For example, if you simply entered "Bing" as a unique identifier for Bingbot, you would also include any log entry where bing was listed in the referrer string, which would greatly inflate the numbers.

Here are the identifiers I used, which will save you some time and heartache looking at raw log files:

Bot User Agent - Unique String in Logs
MJ12Bot - ww.majestic12.co.uk/bot.php
Googlebot - ww.google.com/bot.html
Bing - ww.bing.com/bingbot.htm
Yandex - ttp://yandex.com/bots
Baidu - ww.baidu.com/search/spider.html
Ahrefs - ttp://ahrefs.com/robot/
Moz - ttp://www.opensiteexplorer.org/dotbot

In my Bot Analysis worksheet I entered three columns:

Column A: The Bot User Agent I use for my pie chart. Just cut and paste them from the table above.
Column B: I headed this "Crawl rate", put this formula in cell B2, and then copied the formula down the column (a script equivalent is sketched after this list)…

=COUNTIF(RawLogFile!A:A,"*"&C2&"*")

Column C: The unique string in the logs. Again, just cut and paste from my list above. You can always expand the list later.
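For anyone who would rather check the counts outside Excel, here is a minimal script equivalent of that COUNTIF approach, using the same unique strings from the table above and assuming the log was saved as access-log.txt:

# Unique strings from the table above, keyed by the bot name used for the chart.
BOTS = {
    "MJ12Bot": "ww.majestic12.co.uk/bot.php",
    "Googlebot": "ww.google.com/bot.html",
    "Bing": "ww.bing.com/bingbot.htm",
    "Yandex": "ttp://yandex.com/bots",
    "Baidu": "ww.baidu.com/search/spider.html",
    "Ahrefs": "ttp://ahrefs.com/robot/",
    "Moz": "ttp://www.opensiteexplorer.org/dotbot",
}

# Count every log line containing each bot's unique string,
# just as =COUNTIF(RawLogFile!A:A,"*"&C2&"*") does in the spreadsheet.
counts = {name: 0 for name in BOTS}
with open("access-log.txt") as log:
    for line in log:
        for name, marker in BOTS.items():
            if marker in line:
                counts[name] += 1

for name, count in counts.items():
    print(f"{name}\t{count}")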

Now just make it pretty!
The reason for putting the formula in column B is that I can now easily make a pie chart in Excel by simply selecting the WYSIWYG pie chart and dragging over columns A and B. This then gives me pretty pie chart legends. If you are not too hot on Excel, there is loads of help out there on how to make pie charts. Even if you don't make a chart, you can see the relative counts in column B. You can also get a bit smarter: count how many lines there are in the raw logs, then sum the counts in the BotAnalysis sheet to see what percentage of your hits is being taken up by non-human activity. Unfortunately, bad bots don't identify themselves this way, so you may need to get much sneakier trying to find the signatures of other bots or scrapers on your site… but it's a good start.
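As a rough worked example of that percentage, continuing the script sketch above (it assumes the counts dictionary and access-log.txt from the earlier snippet, and it measures hits rather than bytes):

# Total hits in the log versus hits from the bots we identified.
with open("access-log.txt") as log:
    total_hits = sum(1 for _ in log)

bot_hits = sum(counts.values())
print(f"Identified bots: {bot_hits} of {total_hits} hits "
      f"({100 * bot_hits / total_hits:.1f}%)")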

 

not2easy
12:38 am on Aug 21, 2014 (gmt 0)

An excellent way for people to start getting familiar with what's doing what on their sites. That's a good plain explanation of how to access your access logs too. Thanks!

AlexB77
2:29 pm on Aug 22, 2014 (gmt 0)

It is indeed a good way to begin. In many years of web development, the only program I have really liked out of all the log-reading software is WebLogExpert. I am not trying to advertise it in any way and I am not affiliated with it at all, yet it is still by far one of the best programs when it comes to reading information from raw logs.

physics
7:02 pm on Aug 23, 2014 (gmt 0)

Nice :) I've found this little script I wrote to be helpful also, if a little less user-friendly:
https://github.com/physicsdude/FindSpiders
You can grab the output and throw it into a CSV file for import into Excel too :)
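Along those lines, a minimal sketch of dumping per-bot counts into a CSV that Excel opens directly – it reuses the counts dictionary from the spreadsheet-equivalent script earlier in the thread, and FindSpiders' own output format may well differ:

import csv

# Write one row per bot so the file can be dragged straight into Excel.
with open("bot-counts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Bot User Agent", "Crawl rate"])
    for name, count in counts.items():
        writer.writerow([name, count])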

digideth
3:00 am on Aug 24, 2014 (gmt 0)

@DixonJones
Any chance you would be willing to share your Excel file... please!
Of course, with your log data removed.

thanks!

DixonJones
8:50 pm on Sep 25, 2014 (gmt 0)

I don't see why not, digideth (as long as Engine doesn't mind). Just import your raw logs into the second worksheet and (if you are lucky) a pretty pie chart should appear:

https://www.dropbox.com/s/ersy4sx97szm11n/RawLogBotExtractor-nodata.xlsx?dl=0

Dixon.

HitProf
9:56 am on Oct 30, 2014 (gmt 0)

Thanks Dixon, your Excel file works like a charm. You might want to add this one:

Bot User Agent - Unique String in Logs
Yahoo! Slurp - ttp://help.yahoo.com/help/us/ysearch/slurp
