Forum Moderators: DixonJones


How do large companies track spider traffic on their websites?


Domomonkey

6:03 pm on Apr 26, 2010 (gmt 0)

10+ Year Member



What tools are used to track spider traffic on enterprise-level sites? Tools like Google Analytics can't be used because they rely on JavaScript, which spiders don't execute. What programs can handle very large log files for spider traffic analysis?

For use with sites that get 100,000s of hits a day.

We use WebTrends now but want other options. I have looked into various tools through basic Google hunting, but they all seem so basic; some can't even process on a daily schedule.

We want to know our spider traffic to make sure folks like Google, Yahoo, Baidu, etc are fulfilling their contracts and properly indexing our sites. We have had issues in the past and this metric is quite important to us.

I know there must be something else out there; not everyone can be using WebTrends, and we can't be the only people who care about spider traffic.

I have been hunting for months! What else is there?

tedster

6:35 pm on Apr 26, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello Domomonkey, and welcome to the forums.

This may be just a misunderstanding, but search engines do not have any contract or obligation - they perform a free service and index what they want, when they want to. Still, it's good to know what is happening with spiders, and your observation about JavaScript-triggered analytics is correct.

I crunch the raw server logs themselves using Mach5 FastStats. It works something like grep, but with useful features added in an interface, and it can handle very large files. The limits seem to be more in your computer than in the application. For a very busy site, the logfiles can get unwieldy unless you take extra measures to split them up or log search spiders to a dedicated file.
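
If you do go the dedicated-file route, even a small script can do the splitting before the analysis step. A rough Python sketch, nothing official - the user-agent patterns are only examples, so adjust them for the bots you care about:

import re
import sys

# Substrings that identify common search spiders (illustrative list only)
BOT_PATTERN = re.compile(r"Googlebot|Slurp|msnbot|Baiduspider", re.IGNORECASE)

def split_spider_hits(access_log, spider_log):
    # Copy every log line that mentions a known spider into a dedicated file
    # (a simple whole-line match, not a strict user-agent-field parse)
    with open(access_log, errors="replace") as src, open(spider_log, "a") as dst:
        for line in src:
            if BOT_PATTERN.search(line):
                dst.write(line)

if __name__ == "__main__":
    # e.g. python split_spiders.py access.log spiders.log
    split_spider_hits(sys.argv[1], sys.argv[2])

Run that against each day's log and feed the much smaller spider-only file to whatever analyzer you choose.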

Domomonkey

6:49 pm on Apr 26, 2010 (gmt 0)

10+ Year Member



Well, normally yes, it is a free service in that you hope they index your site, but we are working directly with the search engines to ensure that our sites are being indexed correctly due to the complex nature of the library we carry online.

System resources are not an issue; I can throw as many dedicated Unix or Windows systems at crunching the files as required. The issue is that I don't know what software to use.

My requirements: I need software that runs on a daily basis to give me yesterday's figures and feeds an ongoing report, by which I mean a traffic report that builds each day into days, months, then years, and so on, with robot and spider traffic broken down. Being able to pull in log files based on macros is a must; I can't go into the program each day and manually specify the log files we want to analyze. Being able to do something like \\servername\directory\date-1.log for automated daily processing is key.
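
Just to illustrate the kind of daily job I mean, here is a rough Python sketch - the UNC path, log format and bot list are only placeholders for whatever our servers actually produce:

import csv
import re
from datetime import date, timedelta

BOT_PATTERN = re.compile(r"Googlebot|Slurp|msnbot|Baiduspider", re.IGNORECASE)

def yesterdays_log(server=r"\\servername\directory"):
    # Build the path for yesterday's log, mirroring the date-1.log idea
    day = date.today() - timedelta(days=1)
    return rf"{server}\{day:%Y-%m-%d}.log"

def count_spider_hits(log_path):
    # Tally hits per spider for one day's log
    counts = {}
    with open(log_path, errors="replace") as log:
        for line in log:
            match = BOT_PATTERN.search(line)
            if match:
                bot = match.group(0)
                counts[bot] = counts.get(bot, 0) + 1
    return counts

def append_daily_report(counts, report_path="spider_report.csv"):
    # Append one row per spider so the report grows day by day
    day = date.today() - timedelta(days=1)
    with open(report_path, "a", newline="") as out:
        writer = csv.writer(out)
        for bot, hits in sorted(counts.items()):
            writer.writerow([day.isoformat(), bot, hits])

if __name__ == "__main__":
    append_daily_report(count_spider_hits(yesterdays_log()))

Scheduled each morning, that would give the day/month/year roll-up I'm after, but I'd much rather find packaged software that already does this properly.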

WebTrends does do all this but spending $100,000 (really, that's the bill we got from them) a year to give me some spider reports seems silly when there must be something else out there.

I will take a look at FastStats. Will it let me do what I want?

Sorry to make this so long. I just found this place, and it might be the answer to my long-standing question of how to provide my customers with basic spider reports without needing WebTrends to do so.

mack

6:58 pm on Apr 26, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



What about having a bespoke solution built? Place the code on every page, and when the UA matches a bot, have it write to a database:
page indexed, time, bot

You can then code an application to display the data exactly as you require.
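
As a very rough sketch of the idea (Python with SQLite as a stand-in - the real version would use whatever platform and database the site already runs on):

import re
import sqlite3
from datetime import datetime

# Illustrative spider list - extend it to whatever bots matter to you
BOTS = {
    "Googlebot": re.compile(r"Googlebot", re.I),
    "Yahoo Slurp": re.compile(r"Slurp", re.I),
    "Baiduspider": re.compile(r"Baiduspider", re.I),
}

def record_if_bot(db_path, page, user_agent):
    # Called from every page: if the UA matches a bot, store page, time and bot name
    for name, pattern in BOTS.items():
        if pattern.search(user_agent or ""):
            conn = sqlite3.connect(db_path)
            conn.execute("CREATE TABLE IF NOT EXISTS bot_hits (page TEXT, hit_time TEXT, bot TEXT)")
            conn.execute("INSERT INTO bot_hits VALUES (?, ?, ?)",
                         (page, datetime.utcnow().isoformat(), name))
            conn.commit()
            conn.close()
            return name
    return None

# e.g. record_if_bot("bots.db", "/products/123", request_user_agent)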

Mack.

blend27

10:03 pm on May 20, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What about having a bespoke solution built?


That is exactly what we did. We looked around at what is offered, scoped out the functionality we would need, and then spent about 2-3 weeks coding it. It logs everything from geo and click path (bots and bad bots excluded) to the response headers, all encrypted in a properly normalized DB, and it has been running for the past couple of months. The rest reads the custom log files and the database for the identification info on data that is not related to the visitor's info.

Pretty much the Geek stuff...!

onlineleben

8:16 am on May 21, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try analog - it's quite an old log analyzer but highly customizable. It's quite fast, too.