Forum Moderators: open
How do I identify the spiders that land on my site? Is there a list of IPs that spiders use?
And do they always use the same IP?
How can I tell the difference between a regular visitor and a spider? Also, how do I track the
spider? Can I use cookies?
How do I identify the spiders that land on my site?
The easiest non-interactive way is to browse your logs. The really useful part will be the User-Agent (UA) field, as this is where anything accessing your site gets a chance to identify itself clearly. Most of the major SEs identify themselves very clearly.
Spiders generally appear with a User-Agent that is something suitably spider-y, e.g.
Google uses GoogleBot/2.1 (+http://www.googlebot.com/bot.html)
AltaVista uses Scooter/3.3
Fast uses a variety of bots, most starting with FAST-WebCrawler/3.7
etc.
If we compare this with genuine users, you'll find they tend to appear with fairly cryptic User-Agents, like this one, which describes the browser make (IE), version (5.5), and OS (Windows NT 5.0, aka W2k):
Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; T312461)
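If you want to automate that UA check rather than eyeball the logs, a substring match against known spider names goes a long way. A minimal sketch in Python; the signature list here is just the examples from this thread, not a complete list (`is_spider` is my own name for it):

```python
# Illustrative spider signatures taken from this thread -- extend from a
# maintained list like the ones mentioned below.
BOT_SIGNATURES = ("Googlebot", "Scooter", "FAST-WebCrawler")

def is_spider(user_agent):
    """Return True if the User-Agent string matches a known spider signature."""
    ua = user_agent.lower()
    return any(sig.lower() in ua for sig in BOT_SIGNATURES)
```

So `is_spider("GoogleBot/2.1 (+http://www.googlebot.com/bot.html)")` comes back True, while the MSIE browser UA above comes back False.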
For a good list of crawler UAs I'd recommend the psychedelix site, which is linked from the site below; I've talked with the person behind it and he seems to have a passion for collecting and maintaining his list.
You could also try;
But some of the UA information there is a bit dated (e.g. the GoogleBot entry is wrong), though they also have an IP list.
Generally if you get something you haven't a clue about then a site search here or a quick google normally provides a good answer.
Is there a list of IPs that spiders use? And do they always use the same IP?
Spiders generally tend to stick to their IP ranges because, at the upper end of the SE market, they are run as a business like any other, and shuffling IP addresses around would interfere with their crawling schedules.
One of the other members here runs this site, which lists out spider IP addresses and looks bang up to date;
[iplists.com...]
However, my 2c says that unless you really have a serious need to identify spiders with no ambiguity, IP addresses are overkill when the user-agent works perfectly well.
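If you do go the IP route, Python's standard `ipaddress` module makes the membership test trivial. A sketch, with the caveat that the range below is only an example of the format, pulled from published Googlebot ranges at some point in time; verify current ranges against a maintained list like the one linked above before relying on it:

```python
import ipaddress

# Example range only -- check a maintained spider IP list for current data.
SPIDER_RANGES = [ipaddress.ip_network("66.249.64.0/19")]

def ip_is_spider(addr):
    """Return True if addr falls inside any known spider IP range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in SPIDER_RANGES)
```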
Can I use cookies?
The major spiders and crawlers don't accept cookies at all, so you can't really track them using anything like a cookie or a session.
- Tony
Raw access logs - viewed directly, pulled into a spreadsheet, or processed with a script.
Someone posted recently - today or yesterday - about importing access logs into Excel and sorting by user-agent and/or remote host. That might work pretty well if you have lots of memory for Excel to work with.
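If Excel chokes on a big log, a short script does the same sort-and-count. A sketch assuming the Apache Combined Log Format (where the user-agent is the last quoted field); `count_user_agents` is my own helper name:

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [date] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

def count_user_agents(lines):
    """Tally hits per User-Agent from an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[m.group(2)] += 1
    return counts
```

Feed it `open("access_log")` and `counts.most_common(20)` gives you the top visitors, spiders included, in one pass.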
HTH,
Jim