Using Excel to Analyse Bots
A step-by-step guide
With Google Analytics so prevalent these days, we forget that GA misses huge swathes of data about our sites. I recently needed to ascertain the relative amount of crawling each bot was doing. I wanted to know who was crawling most and build a pie chart of each bot’s activity – but there are other uses for this data. For example, you might want to know:
* What percentage of your bandwidth is being consumed by bots
* Whether the bots are properly crawling all your pages
* Whether there are bots that don’t do what you think they do.
Looking at raw log files can seem a bit daunting – and I found that all four log analyzer programs that came with my web hosting package were less than adequate for the task. I am not the strongest programmer, but I do dabble a bit with Excel, so in just a few minutes I had created my own bot analyzer in Excel. I think with a little thought I could turn this into a whole log-file analysis suite… but for now I’ll just share what I did in the hope that it helps a few people.
How to get your log files
If your site uses cPanel, go into cPanel and select “logfiles”. One of the options is to download your raw log files. Bear in mind, please, that the download is usually in compressed form. My Mac unzipped these with two clicks – Windows mileage may vary. When they were unzipped, mine did not have a filename extension, so I changed the filename to end in .txt so that it could be read by Excel.
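If two clicks don’t do it on your system, the unzip step can also be scripted. This is a minimal sketch assuming the download is gzip-compressed (a common format for cPanel raw logs); the file names and the sample log line here are invented for illustration.

```python
# Unpack a gzip-compressed raw access log to plain text so Excel can import it.
import gzip
import os
import shutil
import tempfile

def decompress_log(src_path, dest_path):
    """Copy a .gz access log out to an uncompressed .txt file."""
    with gzip.open(src_path, "rb") as fin, open(dest_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)

# Demonstrate on a throwaway temp directory with one invented log line.
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "access.log.gz")
txt_path = os.path.join(workdir, "access.txt")
with gzip.open(gz_path, "wt") as f:
    f.write('192.0.2.1 - - [01/Jan/2024] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"\n')
decompress_log(gz_path, txt_path)
with open(txt_path) as f:
    recovered = f.read()
```

Rename the resulting file to .txt (as above) and Excel will read it directly.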
If you don’t have cPanel, then your backend system may have another easy way. If not, then you can do it using FTP. The Log files are usually in a directory off the “root”. This is usually one directory level ABOVE where your HTML files are kept.
[Aside]If they are in the same directory as your HTML files, then there is a strong possibility that they are “open” for the rest of the world to see… which is likely to result in you getting thousands of malicious bots trying to create crappy log file entries in the vain hope that they will be picked up by search engines. Get them closed![/Aside]
Create an Excel file with TWO worksheets
Call the first worksheet “BotAnalysis” and the second worksheet “RawLogFile”. I am doing it this way so that, using the same spreadsheet in the future, I will be able to make the Excel file much more elaborate, with sheets for anything from visitor sessions to mobile users.
Import the raw logs into the Raw Logs worksheet
By making the raw logs a text file, my import was just one column. That suits me just fine. Every line is a separate “hit” on your server. I have written elsewhere on WebmasterWorld over the years about the difference between hits and page views… but you might want to take the time to look that up. One word of warning here… modern Excel versions allow just over a million rows per sheet, but well before that you will be limited by your computer’s memory and CPU. If you have a site the size of WebmasterWorld, this isn’t going to work! But you could maybe use this method to sample your log-file data – still good for pie charts! I have had 500,000 lines in an Excel spreadsheet before… but it was pushing the limits of my laptop.
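The sampling idea above can be sketched as follows: take a reproducible random subset of log lines before importing, so the worksheet stays a manageable size. The line format and counts here are invented; any one-line-per-hit log works.

```python
# Reduce an oversized log to a random sample that Excel can handle comfortably.
import random

def sample_lines(lines, k, seed=42):
    """Return up to k randomly chosen log lines (seeded so reruns match)."""
    if len(lines) <= k:
        return list(lines)
    rng = random.Random(seed)
    return rng.sample(lines, k)

# Invented example: 100,000 hits sampled down to 5,000.
hits = ['192.0.2.%d - - "GET /page%d HTTP/1.1" 200 1024' % (i % 250, i)
        for i in range(100_000)]
subset = sample_lines(hits, 5_000)
```

Relative proportions between bots survive sampling well, which is all a pie chart needs.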
Identify the bots
You are going to get Excel to count every line that mentions each bot, so you need a unique string which identifies each bot. Good bots like Googlebot tell you who they are. Looking at the log lines, they usually not only identify themselves but give a URL where you can read more about them. I found it best to use the URL they provide as my “bot identifier”. This is because the bots may arrive with different version numbers on different days, and getting too lax will create grave errors. For example, if you simply entered “Bing” as a unique identifier for Bingbot, you would also include any log entry where Bing was listed in the referrer string, which would greatly inflate the numbers.
Here are the identifiers I used, which will save you some time and heartache looking at raw log files:
Bot user Agent | Unique String in Logs
In my Bot Analysis worksheet I entered three columns:
Column A: The Bot User Agent I use for my pie chart. Just cut and paste them from the table above.
Column B: I headed this “Crawl rate”, put the formula in cell B2, and then copied it down the column…
Column C: The Unique string in the logs. Again, just cut and paste my list above. You can always expand the list later.
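The exact formula for column B isn’t shown above; a wildcard COUNTIF such as `=COUNTIF(RawLogFile!A:A,"*"&C2&"*")` is one common way to do it (an assumption on my part, not necessarily the author’s verbatim formula). The same substring count in Python, with invented sample log lines and the identifiers discussed above:

```python
# Count how many log lines contain each bot's unique identifier string,
# mirroring a wildcard COUNTIF over the RawLogFile column.
BOT_IDS = {
    "Googlebot": "www.google.com/bot.html",
    "Yahoo! Slurp": "help.yahoo.com/help/us/ysearch/slurp",
}

def count_bot_hits(lines, bot_ids):
    """Return {bot name: number of lines containing its identifier}."""
    return {name: sum(ident in line for line in lines)
            for name, ident in bot_ids.items()}

# Invented sample: two Googlebot hits, one Slurp hit, one human visitor.
lines = [
    '66.249.66.1 - - "GET /a HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - "GET /b HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '72.30.0.1 - - "GET /a HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '198.51.100.7 - - "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
counts = count_bot_hits(lines, BOT_IDS)
```

Using the “about this bot” URL as the needle avoids the Bing-in-the-referrer trap described above.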
Now just make it pretty!
The reason for putting the formula in column B is that I can now easily make a pie chart in Excel by simply selecting the WYSIWYG pie chart and dragging over columns A and B. This then gives me pretty pie chart legends. If you are not too hot on Excel, there is loads of help available on how to make pie charts. Even if you don’t make a chart, you can see the relative counts in column B. You can also get a bit smarter: count how many lines there are in the raw logs, then sum the counts in the BotAnalysis sheet to see what percentage of your traffic is being taken up by non-human activity. Unfortunately, bad bots don’t identify themselves this way, so you may need to get much sneakier trying to find the signature of other bots or scrapers on your site… but it’s a good start.
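The percentage calculation just described – bot lines divided by total lines – can be sketched like this. It measures share of hits, which is the proxy the post uses; the identifiers and sample lines are invented.

```python
# What share of all hits comes from self-identifying bots?
BOT_IDS = ["www.google.com/bot.html", "help.yahoo.com/help/us/ysearch/slurp"]

def bot_share(lines, identifiers):
    """Fraction of log lines attributable to any listed bot identifier."""
    bot_hits = sum(any(ident in line for ident in identifiers) for line in lines)
    return bot_hits / len(lines) if lines else 0.0

# Invented sample: two bot hits out of four total hits.
lines = [
    '"GET /a" 200 "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '"GET /b" 200 "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '"GET /a" 200 "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '"GET /c" 200 "Mozilla/5.0 (Macintosh; Intel Mac OS X)"',
]
share = bot_share(lines, BOT_IDS)
```

For true bandwidth share you would sum the bytes-transferred field instead of counting lines, but as the post notes, this is a good start.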
An excellent way for people to start getting familiar with what's doing what on their sites. That's a good plain explanation of how to access your access logs too. Thanks!
It is indeed a good way to begin. In many years of web development, the only program I have really liked out of all the log-reading software is WebLogExpert. I am not trying to advertise it in any way and I am not affiliated with it at all, yet it is still by far one of the best programs when it comes to reading information from the raw logs.
Nice :) I've found this little script I wrote to be helpful also, if a little less user friendly:
You can grab the output and throw it into a CSV file for import into Excel too :)
Any chance you would be willing to share your Excel file... please!
Of course, with your log data removed.
I don't see why not, digideth (as long as Engine doesn't mind). Just import your raw logs into the second worksheet and (if you are lucky) a pretty pie chart should appear:
Thanks Dixon, your Excel file works like a charm. You might want to add this one:
Bot user Agent - Unique String in Logs
Yahoo! Slurp - http://help.yahoo.com/help/us/ysearch/slurp
Googlebot - www.google.com/bot.html
I suggest expanding this one to require either an IP in 66.249 or a 200/30x response. Otherwise your code will include any and all Googlebot spoofers. Don't know about the rest of youse, but I've seen enough of them to warrant a separate lockout ("claims to be googlebot but isn't").
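The spoofer check suggested above can be sketched as follows: trust a "Googlebot" user-agent only if the request IP also falls in 66.249.0.0/16. That range follows the reply's rule of thumb; reverse-DNS verification is Google's stricter published method, and the IPs below are examples, not real log data.

```python
# Heuristic Googlebot verification: UA claim alone is not enough.
import ipaddress

GOOGLE_NET = ipaddress.ip_network("66.249.0.0/16")

def real_googlebot(ip, user_agent):
    """True only when the UA claims Googlebot AND the IP is in 66.249.0.0/16."""
    return ("Googlebot" in user_agent
            and ipaddress.ip_address(ip) in GOOGLE_NET)
```

A line that claims Googlebot from an IP outside the range is a candidate for the "claims to be googlebot but isn't" lockout.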
How do you separate successful from unsuccessful MJ12 crawls? They've got a distressing habit of signing up with any random server farm so most of their requests don't even get through.
The same steps can be used in Access, which handles far more data (years' worth, if you collect it that way), allowing for trend analysis, etc. If one breaks each log row into data fields, even more analysis is possible (date/time/IP/referer/query/etc.). The main advantage of a database (Access is what's on my machine; you could use MySQL) is related tables, where the string identifiers for bots, or other analysis criteria, can be separately maintained. Either Excel or a database can produce crosstab queries/reports, which are ideal for consolidating big data.