Forum Moderators: open


spider identification and monitoring

how do you know which spiders visit and when


spiky

7:42 pm on Dec 15, 2000 (gmt 0)

10+ Year Member



Back to basics please

Everyone seems to be "seeing" spiders visit and crawl their sites with great excitement.

How is this done?

Is there any free software available to use? (links please) Or tutorials?

Thanks everyone.

littleman

7:53 pm on Dec 15, 2000 (gmt 0)



Well, the simplest (but hardest) way to watch for spiders is to look at your raw logs. There are many log analysis tools; take a look here [cgi.resourceindex.com]. I am in favour of live, crunchable data. As far as I know there are only two free scripts that do that: ASX [xav.com] and Brett's own Traxis.
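
For anyone unsure what "looking at raw logs" means in practice, here is a minimal sketch (not ASX, Traxis, or any tool named above) of doing it by hand: scan an Apache-style Combined Log Format file for requests to /robots.txt and for user-agent strings that look like crawlers. The file name access.log and the list of user-agent hints are illustrative assumptions.

    # Minimal sketch: spot likely spider activity in a raw access log.
    # "access.log" and BOT_HINTS are assumptions, not part of any tool above.
    import re
    from collections import Counter

    # Combined Log Format: host ident user [time] "request" status bytes "referer" "user-agent"
    LINE_RE = re.compile(
        r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
        r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
    )

    BOT_HINTS = ("googlebot", "slurp", "scooter", "lycos", "spider", "crawler")

    robots_hosts = Counter()  # hosts that asked for /robots.txt
    bot_agents = Counter()    # user-agent strings that look like crawlers

    with open("access.log") as log:
        for line in log:
            m = LINE_RE.match(line)
            if not m:
                continue
            parts = m.group("request").split()
            if len(parts) > 1 and parts[1] == "/robots.txt":
                robots_hosts[m.group("host")] += 1
            if any(hint in m.group("agent").lower() for hint in BOT_HINTS):
                bot_agents[m.group("agent")] += 1

    print("Hosts requesting /robots.txt:")
    for host, count in robots_hosts.most_common():
        print(f"  {host}: {count}")
    print("Likely crawler user-agents:")
    for agent, count in bot_agents.most_common():
        print(f"  {agent}: {count}")

Anything that fetches /robots.txt or carries a crawler-looking user agent is a candidate spider; the rest of the thread is about correlating those hosts with the pages they actually pulled.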

msgraph

8:43 pm on Dec 15, 2000 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I agree that the raw logs are the hardest, but they are good if you want instant results. Really helpful if you have a lot of domains under your belt. I find Textpad the best to use for this, since it makes everything look a lot clearer than Notepad or Ultra-Edit, especially if you are getting some heavy spider activity.

If you want an easy-to-use program, there is one called Web Site Traffic Analyzer. It will set you back about $100. You just dump in as many log files as you want and it will sort everything for you.

mivox

10:39 pm on Dec 16, 2000 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I use a program called Accessprobe for tracking individual users/spiders. It displays bar graphs of total number of hits next to each category of data... files requested, machines that visited, etc. With the paid version, if you click on the bar next to a specific file, it will show you which machines requested it, and if you click on the bar next to a specific machine name/IP, it will show you all the files that machine/IP requested.

Since most spiders will request a robots.txt file, I'll first click the bar next to the robots.txt listing, so I get a list of all the machines that looked at it... then I can scroll down and see which pages each of those machines looked at, by clicking on the bars next to their names.

Then, for any oddballs that didn't request a robots.txt, you can click on any odd-looking user agents and see what machine name/IP they were from, and so on. It makes correlating the basic info very easy.
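
The same correlation can be done directly on a raw log if you don't have Accessprobe or a similar GUI tool. A rough sketch, assuming an Apache-style Combined Log Format file named access.log: collect every host that requested /robots.txt, then list every path each of those hosts fetched.

    # Rough sketch of the robots.txt correlation described above, done on a raw
    # log rather than in a GUI. "access.log" is an assumed file name.
    import re
    from collections import defaultdict

    LINE_RE = re.compile(
        r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
        r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
    )

    paths_by_host = defaultdict(list)  # every path each host requested
    robots_hosts = set()               # hosts that fetched /robots.txt

    with open("access.log") as log:
        for line in log:
            m = LINE_RE.match(line)
            if not m:
                continue
            host, path = m.group("host"), m.group("path")
            paths_by_host[host].append(path)
            if path == "/robots.txt":
                robots_hosts.add(host)

    # For each robots.txt requester, show everything else it looked at.
    for host in sorted(robots_hosts):
        print(host)
        for path in paths_by_host[host]:
            print("   ", path)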

chi

6:00 pm on Dec 21, 2000 (gmt 0)



First of all, hi all!
Mivox, where could I have a look at this program? Thank you.

mivox

7:22 pm on Dec 21, 2000 (gmt 0)

skirril

7:31 pm on Dec 21, 2000 (gmt 0)

10+ Year Member



Log file analyzer I use: analog (www.analog.cx) as well as looking at the raw log files.

For all of you who haven't yet, also take a look at a post I made in [webmasterworld.com...] regarding the impossibility of telling how many hits you have, and how many of them are spiders.

mivox

8:31 pm on Dec 21, 2000 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>the impossibility of telling how many hits you have, and how many of them are spiders.

Very true... even log analyzers that distinguish between "hits" and "visits" are far from perfectly accurate.

However, since most major SEs fall under your description of well-behaved spiders, a good log analysis program can be quite useful in determining which major SEs are actually visiting your site, what they're fetching, etc...

And beyond the major SEs, I don't pay too much mind to other spiders unless they're grossly misbehaving, so log analysis programs are fairly useful for my purposes.

littleman

8:52 pm on Dec 21, 2000 (gmt 0)



Being obsessed with tracking spiders I have several extra logs running. Two logs I find invaluable are a 'no referrer log' and a 'human no referrer' log. Both logs are stored collectively for multiple domains.

The 'no referrer' log will log everything that does not have an HTTP_REFERER.

The 'human no referrer' is the same list but screened against my list of known spider IPs.

Often I ftp up to my servers through a browser and view the logs that way. I'll use Netscape's Find feature to search for keywords, such as spider UAs, IP blocks, and REMOTE_HOST strings. I'll also visually go through the logs and look for patterns. This type of attention to detail is necessary if you are going to play the cloaking game.
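
A minimal sketch of the same two-log idea (littleman's actual logs are his own server-side setup, not shown here): pull out every request that arrives with no HTTP_REFERER, then split those into "all" and "human" buckets by screening against a list of known spider IPs. The file names and the spider-IP entries are assumptions for illustration.

    # Sketch of a 'no referrer' log and a 'human no referrer' log built from a
    # raw Combined Log Format file. File names and KNOWN_SPIDER_IPS are assumed.
    import re

    LINE_RE = re.compile(
        r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" \d{3} \S+ '
        r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
    )

    KNOWN_SPIDER_IPS = {"216.239.46.20", "209.73.164.58"}  # hypothetical entries

    with open("access.log") as log, \
         open("no_referrer.log", "w") as no_ref, \
         open("human_no_referrer.log", "w") as human_no_ref:
        for line in log:
            m = LINE_RE.match(line)
            # keep only requests with an empty or "-" referrer field
            if not m or m.group("referer") not in ("", "-"):
                continue
            no_ref.write(line)                       # everything with no referrer
            if m.group("host") not in KNOWN_SPIDER_IPS:
                human_no_ref.write(line)             # no referrer and not a known spider

Whatever remains in the 'human no referrer' file after the screen is either a person typing the URL directly or a spider you haven't identified yet, which is exactly what makes it worth reading.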


mivox

9:03 pm on Dec 21, 2000 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



*whew* I'm glad I haven't gotten into cloaking... I enjoy my free time too much. ;)

Friday

6:53 pm on Feb 20, 2002 (gmt 0)

10+ Year Member



Another REALLY nice shareware product is WebLog™

You can get it at
[awsd.com]

wilderness

10:45 am on Feb 20, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There is a very nice and very small program named LOGALIZER, although it does have restrictions.
I used it for a short while and was quite pleased.

[am-soft.ru...]

I currently just use Notepad/Wordpad daily. At month's end, Analog is used to compile stats. It is quite configurable.

Some of the free Java counters can be used effectively to compile stats.
I have sitemeter on my pages, which offers online views.

wilderness

10:52 am on Feb 20, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<snip>Everyone seems to be "seeing" spiders visit and crawl their sites with great excitement.
How is this done?</snip>

This is a good beginning point for Spider ID.
Apparently bots are supposed to register here under some RFC compliance.
This page doesn't get updated very often and there are better resources.

[robotstxt.org...]