

Search Engine Spider and User Agent Identification Forum

    
Spider tracking, need suggestions
How do I track spiders?
Travoli




msg:401567
 8:57 pm on Apr 26, 2001 (gmt 0)

Hi all.

This is my first post. I searched around some, but I cannot find a specific method to track spiders anywhere. We currently use a company called WebTrends to analyze server traffic. However, it only gives us a daily analysis and cannot spot spiders specifically. I am betting there are easy ways to see which spider is visiting and to be warned when one does. Can someone detail their method please? Thanks!!
-Travoli

 

2_much




msg:401568
 10:43 pm on Apr 26, 2001 (gmt 0)

Hi Travoli, welcome to WmW and thanks for posting your question!

Most people seem to use "homegrown" stuff to track spiders.

One of our programmers wrote a script that uses the raw logs and then checks for certain factors such as UA, spider IP and pages that it spidered.
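For anyone wanting a starting point, here is a rough Python sketch of that kind of log script. It is not the script described above, just an illustration. It assumes the raw logs are in Apache "combined" format, and the `SPIDER_MARKERS` watch list and the `spider_hits` name are made up for the example.

```python
import re
from collections import defaultdict

# Assumed layout: Apache "combined" format, i.e.
# IP identity user [time] "request" status size "referer" "user agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical watch list: substrings matched case-insensitively
# against the User-Agent. Extend it with the spiders you care about.
SPIDER_MARKERS = ("slurp", "googlebot")


def spider_hits(lines):
    """Group requested paths by (user agent, IP) for spider-looking lines."""
    hits = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent")
        if any(marker in agent.lower() for marker in SPIDER_MARKERS):
            hits[(agent, match.group("ip"))].append(match.group("path"))
    return dict(hits)
```

Feeding it an open log file (for example `spider_hits(open("access_log"))`, where the file name is a placeholder) gives one entry per UA/IP pair with the list of pages that pair requested.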

What about other people, what are you using?

awoyo




msg:401569
 11:01 pm on Apr 26, 2001 (gmt 0)

Welcome Travoli,

Well, I'm a newbie on the forum, too, but I guess I'll take a crack at this. Basically, a spider is, in most cases, identified by its User Agent, IP address, and/or host name. For example:

Inktomi sends spiders out under different hostnames like si3000.inktomi.com and si4001.inktomi.com, and many others. Of course, these can be associated with their IP addresses.

The User Agent for Inktomi is usually called Slurp, for example (Slurp/si; slurp@inktomi.com; [inktomi.com...]
or Slurp/2.0-Owl_Weekly_Temp, Slurp/3.0-c, and so on.
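As a small illustration of checking both clues together, here is a hedged Python sketch. The function names and the result labels are invented for this example. `reverse_lookup` uses the standard `socket.gethostbyaddr` call and needs working DNS, so treat it as optional.

```python
import socket


def reverse_lookup(ip):
    """Return the host name for an IP address, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        return None


def classify_inktomi(user_agent, hostname):
    """Compare the User-Agent and host name against the Inktomi examples.

    Returns one of four made-up labels:
    "ua+host", "ua-only", "host-only", "no-match".
    """
    ua_match = "slurp" in user_agent.lower()
    host = (hostname or "").lower().rstrip(".")
    host_match = host == "inktomi.com" or host.endswith(".inktomi.com")
    if ua_match and host_match:
        return "ua+host"
    if ua_match:
        return "ua-only"  # UA claims Slurp, but the host name does not back it up
    if host_match:
        return "host-only"
    return "no-match"
```

A "ua-only" result is worth a second look, since anyone can send whatever User-Agent string they like.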

The trick is to find this stuff in your access logs. If you're running a standard Linux/Apache setup, your access logs could be found in /www/logs/your-domain-access-log. The access log can be rather large and hard to deal with. Instead, I use a tail command based on how many hits I want to look at, or grep if I know who I'm looking for, and pipe the output to a text file for later viewing.
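If you would rather do the tail-and-grep step from a script, the same idea can be sketched in Python roughly like this. The function name, the default of 1000 lines, and the output file are all placeholders chosen for the example.

```python
from collections import deque


def tail_grep(log_path, pattern, last_n=1000, out_path=None):
    """Return lines among the last `last_n` of a log that contain `pattern`.

    The match is a plain case-insensitive substring test, similar in
    spirit to `tail -n N log | grep -i pattern > out.txt`.
    """
    with open(log_path, encoding="utf-8", errors="replace") as log:
        # A deque with maxlen keeps only the most recent lines in memory.
        recent = deque(log, maxlen=last_n)
    needle = pattern.lower()
    matches = [line for line in recent if needle in line.lower()]
    if out_path is not None:
        with open(out_path, "w", encoding="utf-8") as out:
            out.writelines(matches)
    return matches
```

For example, `tail_grep("/www/logs/your-domain-access-log", "slurp", last_n=5000, out_path="slurp.txt")` would save the recent Slurp hits to a text file for later viewing.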

But even easier than that is to use a tracking system that gets the header information from your environment table and writes it to a log. That's the same thing that's happening with your access log, but the result is more manageable. You have much better control over log files created by a tracking system like Axs than you would over your server's log files. It's best not to go pruning those anyway, and to leave them to be tarred up by the system. With a tracking system I can decide which pages I want to track, unlike my server's access log, which is tracking every hit.
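To give a feel for how such a tracker works, here is a minimal Python sketch. It is not how Axs is implemented, only an illustration. It assumes it is called from a CGI script, where the server exposes request details as environment variables such as `REMOTE_ADDR`, `REMOTE_HOST`, and `HTTP_USER_AGENT`. `REQUEST_URI` is set by Apache rather than required by CGI, and the `TRACKED_PAGES` set and `log_visit` name are hypothetical.

```python
import time

# Hypothetical list of pages worth tracking; everything else is ignored.
TRACKED_PAGES = {"/index.html", "/products.html"}


def log_visit(environ, log_path, now=None):
    """Append one tab-separated line for a hit on a tracked page.

    `environ` is a mapping of CGI-style variables (os.environ in a CGI
    script). Returns True if a line was written, False otherwise.
    """
    page = environ.get("REQUEST_URI", "")
    if page not in TRACKED_PAGES:
        return False
    # now=None means "current time"; a fixed value is handy for testing.
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    fields = [
        stamp,
        environ.get("REMOTE_ADDR", "-"),
        environ.get("REMOTE_HOST", "-"),
        page,
        environ.get("HTTP_USER_AGENT", "-"),
    ]
    with open(log_path, "a", encoding="utf-8") as log:
        log.write("\t".join(fields) + "\n")
    return True
```

In a CGI script this would be called as `log_visit(os.environ, "spider-visits.log")` after `import os`, with the log file name being a placeholder. Because only tracked pages are written, the resulting file stays small enough to read by eye.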

Keep visiting the forum here and you'll learn much. There's a great site search feature at [searchengineworld.com...]
some nice tools at [searchengineworld.com...] and a lot of spider information at [searchengineworld.com...] You'll also want to get something for doing host name lookups, IP blocks, etc.

Hope this helps.

Jim

Travoli




msg:401570
 12:44 pm on Apr 27, 2001 (gmt 0)

Thanks fellas! (or ladies, but I doubt that this time)

A little background on me... I am not a techie at all. But I certainly can take this good info to my programmer and database admin.

So I guess I would have to ask for a custom script to be written with the names of the spiders already listed. Sounds like there is a big market out there for a script like that which could track spider activity on a site!! It is hard to believe it does not exist.
If anyone else has an easier way, let me know. And as I learn more, I will understand more :)

thanks!
-Travoli
