Spider tracking, need suggestions

How do I track spiders?

Travoli

8:57 pm on Apr 26, 2001 (gmt 0)

Hi all.

This is my first post. I searched around some, but I cannot find a specific method for tracking spiders anywhere. We currently use a company called WebTrends to analyze server traffic. However, it only gives us a daily analysis and cannot spot spiders specifically. I am betting there are easy ways to see which spider is visiting and be warned. Can someone detail their method please? Thanks!!
-Travoli

2_much

10:43 pm on Apr 26, 2001 (gmt 0)

Hi Travoli, welcome to WmW and thanks for posting your question!

Most people seem to use "homegrown" stuff to track spiders.

One of our programmers wrote a script that uses the raw logs and then checks for certain factors such as UA, spider IP and pages that it spidered.
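For anyone who wants to try the same idea, here is a minimal sketch of that kind of script in Python. It is not the script described above, just an illustration. It assumes the raw log is in Apache's combined log format, and the `SPIDER_TOKENS` watch list and the sample lines are made up for the example.

```python
import re
from collections import defaultdict

# Combined Log Format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical watch list: lowercase substrings to look for in the User Agent.
SPIDER_TOKENS = ("slurp",)


def summarize_spiders(lines, tokens=SPIDER_TOKENS):
    """Group spider hits by User Agent, collecting client hosts and pages."""
    report = defaultdict(lambda: {"hosts": set(), "paths": []})
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent")
        if any(token in agent.lower() for token in tokens):
            report[agent]["hosts"].add(match.group("host"))
            report[agent]["paths"].append(match.group("path"))
    return dict(report)


# Made-up sample lines using documentation-only IP addresses.
SAMPLE = [
    '192.0.2.10 - - [26/Apr/2001:20:57:00 +0000] "GET /index.html HTTP/1.0" 200 1043 "-" "Slurp/3.0-c"',
    '198.51.100.7 - - [26/Apr/2001:20:58:11 +0000] "GET /about.html HTTP/1.0" 200 2210 "-" "Mozilla/4.0 (compatible; MSIE 5.5)"',
    '192.0.2.10 - - [26/Apr/2001:20:59:30 +0000] "GET /robots.txt HTTP/1.0" 404 210 "-" "Slurp/3.0-c"',
]

if __name__ == "__main__":
    for agent, info in summarize_spiders(SAMPLE).items():
        print(agent, sorted(info["hosts"]), info["paths"])
```

In real use you would pass an open log file instead of `SAMPLE`, since the function only needs something it can iterate over line by line.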

What about other people, what are you using?

awoyo

11:01 pm on Apr 26, 2001 (gmt 0)

Welcome Travoli,

Well, I'm a newbie on the forum, too, but I guess I'll take a crack at this. Basically, a spider is, in most cases, identified by its User Agent, IP address, and/or host name. For example:

Inktomi sends spiders out under different host names, such as si3000.inktomi.com and si4001.inktomi.com, among many others. Of course, these can be associated with their IP addresses.

The User Agent for Inktomi is usually called Slurp such as (Slurp/si; slurp@inktomi.com; [inktomi.com...]
or Slurp/2.0-Owl_Weekly_Temp, Slurp/3.0-c, and so on.
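As a rough illustration of checking both clues together, here is a small Python sketch. The function name and the returned labels are made up for this example, and the rules (host name ending in `.inktomi.com`, User Agent containing `slurp`) are only what the examples above suggest, not an official list. The User Agent is sent by the client and can be set to anything, so a hit where the host name agrees is a stronger signal than the User Agent alone.

```python
def classify_inktomi(hostname, user_agent):
    """Return which clues point to Inktomi: 'host+agent', 'host', 'agent', or None.

    Hypothetical helper: the matching rules are assumptions based on the
    host names and User Agent strings quoted in this thread.
    """
    # Normalize case and drop a trailing dot from a fully qualified name.
    host = hostname.lower().rstrip(".")
    host_match = host.endswith(".inktomi.com")
    agent_match = "slurp" in user_agent.lower()

    if host_match and agent_match:
        return "host+agent"
    if host_match:
        return "host"
    if agent_match:
        return "agent"
    return None


if __name__ == "__main__":
    print(classify_inktomi("si3000.inktomi.com", "Slurp/3.0-c"))
    print(classify_inktomi("example.com", "Slurp/2.0-Owl_Weekly_Temp"))
```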

The trick is to find this stuff in your access logs. If you're running a standard Linux/Apache setup, your access log could be found in /www/logs/your-domain-access-log. The access log can be rather large and hard to deal with. Instead, I use a tail command based on how many hits I want to look at, or grep if I know who I'm looking for, and pipe the output to a text file for later viewing.
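If you would rather do the tail-then-grep step from a script, the same idea can be sketched in Python. This is only an illustration: the function name, the default of 1000 lines, and the "Slurp" search string are assumptions, and you would pass in your own log and output paths.

```python
from collections import deque


def tail_grep(log_path, out_path, last_n=1000, needle="Slurp"):
    """Save lines containing `needle` from the last `last_n` log lines.

    Returns the number of matching lines written to `out_path`.
    """
    # A deque with maxlen keeps only the most recent lines while reading,
    # so the whole log never has to sit in memory (like `tail -n`).
    with open(log_path, encoding="utf-8", errors="replace") as log:
        recent = deque(log, maxlen=last_n)

    # Case-insensitive substring match (like `grep -i`).
    hits = [line for line in recent if needle.lower() in line.lower()]

    # Write the matches to a text file for later viewing.
    with open(out_path, "w", encoding="utf-8") as out:
        out.writelines(hits)
    return len(hits)
```

For example, `tail_grep("/www/logs/your-domain-access-log", "slurp-hits.txt", last_n=5000)` would look through the last 5000 hits and save the Slurp ones to `slurp-hits.txt` (a made-up output file name).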

But even easier than that is to use a tracking system that gets the header information from your environment table and writes it to a log. That's the same thing that's happening in your access log, but the result is more manageable. You have much better control over log files created by a tracking system like Axs than you would over your server's log files. It's best not to go pruning those anyway, and to leave them to be tarred up by the system. With a tracking system I can decide which pages I want to track, unlike my server's access log, which tracks every hit.
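To give a feel for what such a tracker does, here is a minimal Python sketch. It is not how Axs itself works, just the general idea under some assumptions: the script runs as a CGI program, so the server fills in `REMOTE_ADDR`, `REMOTE_HOST` and `HTTP_USER_AGENT`; `REQUEST_URI` is an extra variable that Apache sets rather than part of the CGI standard. The `TRACKED_PAGES` set and the log file name are made up.

```python
import os
import time

# Hypothetical list of the only pages we care to track.
TRACKED_PAGES = {"/index.html", "/products.html"}


def build_record(env):
    """Build one tab-separated log line from CGI-style variables.

    Returns None when the requested page is not in TRACKED_PAGES.
    """
    page = env.get("REQUEST_URI", "-")
    if page not in TRACKED_PAGES:
        return None
    fields = [
        time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime()),  # timestamp in GMT
        env.get("REMOTE_ADDR", "-"),      # visitor IP address
        env.get("REMOTE_HOST", "-"),      # host name, if the server resolved it
        page,
        env.get("HTTP_USER_AGENT", "-"),  # User Agent header
    ]
    return "\t".join(fields)


def track(env=os.environ, log_path="spider-track.log"):
    """Append a record for the current request; return True if one was written."""
    record = build_record(env)
    if record is None:
        return False
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(record + "\n")
    return True
```

Because the tracker writes its own small file, you can rotate or trim it freely without touching the server's access log.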

Keep visiting the forum here and you'll learn much. There's a great site search feature at [searchengineworld.com...], some nice tools at [searchengineworld.com...], and a lot of spider information at [searchengineworld.com...]. You'll also want to get something for doing host name lookups, IP blocks, etc.

Hope this helps.

Jim

Travoli

12:44 pm on Apr 27, 2001 (gmt 0)

Thanks fellas! (or ladies, but I doubt that this time)

A little background on me... I am not a techie at all. But I certainly can take this good info to my programmer and database admin.

So I guess I would have to ask for a custom script to be written with the names of the spiders already listed. Sounds like there is a big market out there for a script like that which could track spider activity on a site!! It is hard to believe it does not exist.
If anyone else has an easier way, let me know. And as I learn more, I will understand more :)

thanks!
-Travoli