I have been experimenting with logging IPs and referers because I am never entirely confident I understand the figures I get from my two analytics packages, even though both are provided by well-known internet companies.
The issue of how both treat visitors with no referer is a particular sore point.
One package shows that roughly 34% of all visitors each day have no referer; the other seems to handle visitors with no referer very differently.
I saw a third package that had a spider filter, but I stopped using it when the trial period expired because I didn't hear anyone else talking about it.
Anyway, my questions:
Is anyone offering a reliable list of known spiders and robots, either free or cheap? Or is there a well-known, reliable way of building such a list?
what does "UA" mean?
I don't want to ban spiders/robots; I just want to filter them from my figures so I know what's what.
Thanks
Vite_RTS
Is anyone offering a reliable list of known spiders and robots, either free or cheap? Or is there a well-known, reliable way of building such a list?
[joseluis.pellicer.org...]
[psychedelix.com...]
[botspot.com...]
[clearwaterbeachcam.com...]
[jafsoft.com...]
[projecthoneypot.org...]
[iplists.com...]
[browsers.garykeith.com...]
what does "UA" mean?
UA is short for "User-Agent", a string that identifies a visitor to a web site. From Wikipedia:
When Internet users visit a web site, a text string is generally sent to identify the user agent to the server. This forms part of the HTTP request, prefixed with User-agent: or User-Agent: and typically includes information such as the application name, version, host operating system, and language. Bots, such as web crawlers, often also include a URL and/or e-mail address so that the webmaster can contact the operator of the bot.
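For example, raw access-log entries carry UA strings along these lines (illustrative values typical of the era, not taken from this thread):

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Googlebot/2.1 (+http://www.google.com/bot.html)

The first is a normal browser; the last two are crawlers that identify themselves and include a contact URL, as the Wikipedia excerpt describes.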
I don't want to ban spiders/robots; I just want to filter them from my figures so I know what's what.
A never-ending problem, and many bots don't identify themselves at all. For a website that gets modest traffic, bots can easily represent a non-trivial percentage of all HTTP GETs.
One pressure point available is that most of the bots that don't identify themselves also don't spread themselves across either time or IP address space very well. Thus, simply filtering out all traffic from any IP address that fetched more than N unique URLs (where N is configurable; 25 works pretty well for me at my website's current size) during the same day tends to work pretty well for cleaning up the bottom-feeders, in my experience.
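A minimal sketch of that heuristic in Python, assuming logs in the common Apache/NCSA combined format (the threshold and log path are placeholders to retune for your own site):

import re
from collections import defaultdict

N = 25  # unique-URL threshold per IP per day; retune for your traffic level
LOG = "access.log"  # placeholder path to a combined-format access log

# Matches the IP, date, and requested URL of a combined log entry, e.g.
# 1.2.3.4 - - [20/Aug/2006:08:38:00 +0000] "GET /page.html HTTP/1.1" 200 ...
LINE = re.compile(r'^(\S+) \S+ \S+ \[([^:]+):[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

urls_seen = defaultdict(set)  # (ip, day) -> set of unique URLs fetched

with open(LOG) as f:
    for line in f:
        m = LINE.match(line)
        if m:
            ip, day, url = m.groups()
            urls_seen[(ip, day)].add(url)

# Any (IP, day) pair over the threshold is treated as a likely bot,
# to be excluded from the traffic figures rather than banned.
for (ip, day), urls in sorted(urls_seen.items()):
    if len(urls) > N:
        print(f"{day}  {ip}  {len(urls)} unique URLs")

As the next reply points out, this trades off against shared IPs and aggressive human browsing, so the threshold only makes sense for a site whose normal traffic you know.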
Real-time challenges are the only way to truly tell a bot from a human: the human responds to the challenge, while the bot just keeps asking for pages.
Trust me on this, as I got too many false positives from log file analysis alone. Even SPEED isn't an indicator anymore, thanks to Firefox pre-fetch, Google Web Accelerator, and blog readers that do things like "open all in tabs": VOILA! 20 blog pages opened in 10 seconds, and it's not a bot.
[edited by: incrediBILL at 8:38 pm (utc) on Aug. 20, 2006]
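To picture the kind of real-time challenge described above, here is a minimal sketch assuming a Python WSGI application (the cookie name, page markup, and middleware are invented for illustration, not incrediBILL's actual system): first-time visitors get a tiny page whose JavaScript sets a cookie and reloads; a client that keeps requesting pages without ever presenting the cookie is flagged as a likely bot.

CHALLENGE_COOKIE = "js_ok=1"  # hypothetical cookie proving the challenge was met

CHALLENGE_PAGE = b"""<!doctype html>
<script>
  document.cookie = "js_ok=1; path=/";
  location.reload();
</script>
<noscript>Please enable JavaScript and cookies to continue.</noscript>
"""

def challenge_middleware(app):
    """Wrap a WSGI app so that unchallenged clients get the challenge page."""
    def wrapper(environ, start_response):
        if CHALLENGE_COOKIE in environ.get("HTTP_COOKIE", ""):
            # Challenge already passed; serve the real page.
            return app(environ, start_response)
        # Human browsers run the script, set the cookie, and reload;
        # most bots just keep asking for pages and never send the cookie back.
        start_response("200 OK", [("Content-Type", "text/html")])
        return [CHALLENGE_PAGE]
    return wrapper

Note the trade-off: this also challenges JavaScript-less and cookie-less humans, which is part of why neither log analysis nor challenges are free of false positives.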
Ronburk, do you mean any IP that visits 25 web pages within your website in the same day?
Yup. There's no magic in "25"; that's just the current number, which I retune as needed (it used to be smaller when I had even less traffic than today).
incrediBILL's rebuttal is true in the general case, but I don't have to solve the general case -- I only have to make something work for a website whose normal traffic patterns I know intimately. I don't have high enough traffic for shared IP addresses (e.g., AOL users) to be a problem. I don't have blogs. My visitor path length is invariably short (>95% of visitors make fewer than 5 GETs of non-image URLs). Nearly all my traffic is free SE traffic or referrals from technical postings in forums. It's very easy to identify virtually all bots with trivial post-analysis for my particular website.
In general, this is one of the (few) blessings of not having a high-traffic site. Log analysis assumptions that would break down seriously at 100,000 unique visitors per day can be highly accurate on a low-traffic website. At around 1,000 unique visitors per day or fewer, for example, the technically incorrect assumption that IP address == person will be nearly correct most of the time (unless there is something weirdly skewed about your traffic sources).
As always with traffic analysis, you have to know exactly what assumptions your analysis relies on, and be alert for cases where those assumptions are violated. Of course, it's pretty easy to examine the logs after being alerted to a previously unknown IP address that hit 30 URLs in a day and determine (by looking at the visitor behavior) that it really was something like a school assignment where all the students were behind the same NAT server.
I don't want to ban spiders/robots; I just want to filter them from my figures so I know what's what.
For log analysis I use the free program analog. It is highly configurable via its text-based config file.
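For instance, robot traffic can be excluded from analog's reports with directives along these lines (a sketch from memory; check the analog documentation for the exact directive names and wildcard syntax before relying on it):

LOGFILE /var/log/apache/access.log
OUTFILE report.html
# Drop hits whose user-agent matches known crawlers
BROWEXCLUDE *Googlebot*
BROWEXCLUDE *Slurp*
BROWEXCLUDE *msnbot*
# Drop hits from hosts known to be robots
HOSTEXCLUDE *.googlebot.com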
Just curious: how do you do that when many of them now call themselves IE, Firefox, and Opera?