| This 55 message thread spans 2 pages: < < 55 ( 1  ) || |
|Why Your Server Log Files are Complete Fiction|
War on Web Sites Claims Analytics Casualty
If your online business relies on analyzing your web server logs you're already in trouble.
Many online services pretend to be humans by posing as browsers in order to gain access to your server unnoticed, so you can't stop them. As a side effect, they are added to the visitor count in your server log analytics.
Just recently there was a massive outcry about AVG [webmasterworld.com] engaging in this deceptive practice, which has now almost stopped, but many others continue unabated.
Some of the companies that are skewing your analytics include Picscout [webmasterworld.com], Cyveillance [webmasterworld.com], Munax [webmasterworld.com], WebSense [webmasterworld.com], and many more, too numerous to mention here, an increasingly long list growing daily.
It's not only companies doing this: scrapers, spam harvesters, spambots, botnets and all sorts of other illicit operations pose as browsers to avoid being stopped as well.
To put it mildly, your server log analytics are complete and utter fiction.
That means clients who think these server side stats are actually meaningful are probably making life miserable for their web designers, SEOs and marketing staff because it's obvious they're doing a bad job considering the low or decreasing conversion rates.
Is there a solution?
Sadly, there is no accurate analytics solution at this time.
You can get close, but no cigar.
[edited by: incrediBILL at 10:22 pm (utc) on July 15, 2008]
There are IP-Lists of all major bots. Of course they're not including small private-run bots, but it's a start. Then, of course, you could include more sophisticated bot-recognition, for example not only looking at the useragent but also at the HTTP-Version, Accept-Encoding etc pp, which most bot-builders are too lazy (or uninformed?) to adjust when faking useragents of browsers. Then, of course, traffic-spiking and speed is an issue. A hiding (thus slow) bot and a very busy user could both request a hundred pages on a given day, but the user won't usually do them in a row, finishing them in 10 minutes, but the bot might.
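A rough sketch of the two checks described above, header consistency and request bursts. The header list, the HTTP-version test, and the thresholds (100 pages in 10 minutes, taken from the example in the post) are illustrative assumptions, and the function names are hypothetical, not taken from any existing tool:

```python
from datetime import datetime, timedelta

def header_suspicion(headers, http_version):
    """Count browser-like signals missing from a client that claims to be a browser."""
    ua = headers.get("User-Agent", "")
    if "Mozilla" not in ua:
        return 0  # not claiming to be a browser; judge it by other means
    score = 0
    if http_version != "HTTP/1.1":  # assumption: browsers of this era speak HTTP/1.1
        score += 1
    for name in ("Accept", "Accept-Encoding", "Accept-Language"):
        if name not in headers:
            score += 1
    return score

def is_burst(timestamps, max_pages=100, window=timedelta(minutes=10)):
    """True if max_pages or more requests fall inside any single window."""
    ts = sorted(timestamps)
    for i in range(len(ts) - max_pages + 1):
        if ts[i + max_pages - 1] - ts[i] <= window:
            return True
    return False
```

A client sending only a faked `User-Agent` over HTTP/1.0 scores 4 here, while a typical browser request scores 0. Neither check is proof on its own; they are signals to combine with IP lists and the like.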
for bots that are specifically designed to extract your content from a database search, you could check referers and see if they match, which, of course, is quite a bit of work to do in a script (i.e. if he did not do a search that includes result 12345, he may be a bot if he requests show.php?id=12345). For AOL-users who use the internal browser and thus sit behind proxies, that might also be a way to tell them apart if you have a large site with a lot of pages. If you combine session timeouts with analysis of possible clickstreams, you might be able to see that one AOL-user went from this page to that page while the other one is operating in a different area of the site. that, of course, requires your software to know about the link structure within your site.
add cookies, useragent-tracking etc and it'll get you closer to the truth.
Back in the day, we tried a dirty little trick using basic auth and the feature to include it in urls. basically, everyone who arrived got redirected to [uniquesessionid:email@example.com...] thus sending this "username" and password in every request. too bad, shortly after we deployed it, browsers changed their default behavior of silently accepting those urls due to security issues.
another thing we tried was sessionids in subdomains, which were tracked, generated etc by a mod_perl-handler as PerlFixupHandler. that was nice because it allowed us static html, onsite-absolute links like /files/bar.html would still work, thus no dynamic content-overhead, not break the session tracking and it was quite accurate. it was given up when we finally rewrote the whole site, losing the frames. that kind of session-tracking was mainly born because we wanted people to jump in from search engines, giving them a frameset but show the right page in the content-frame, so anyone who had a valid session would also have a frameset, everyone else would get a session and a frameset.
You talking real users or crawlers or both or what?
And, how did you measure this?
I have a deep background in user support including big companies and outside of companies or departments with high-security needs, I've never seen it turned off and I've asked others with similar backgrounds if they've ever seen it turned off and their experience is the same as mine.
In the corporate world, the average intranet would simply not work properly if script was disabled in the user agents.
And nobody is going to get me to believe that the percentage of Windows users using IE with script turned off is more than .00001% or some such negligible number.
This hasn't been a problem for me in the past. But I just reviewed my logs for a couple of my larger sites and it's unbelievable how much crap is scraping my content. One site alone paid out 50 gigs of bandwidth in one month just looking at the top couple dozen visitors who are scraping the site.
Sigh. I guess it's time to consider doing something to stop all this. What are folks doing at the server level to stop this? Is there a method that works reasonably well that's not too crazy to implement?
How about a honeypot approach?
Obviously, it wouldn't work with the AVG issue and anything like that which isn't crawling your links, but what about the other ones?
that's right, ergophobe. it's pretty much what Bot-Trap uses. a hidden link to a directory which is disallowed for any bot in robots.txt. a user won't click, a bot that does not follow robots.txt might. a hit leads to some kind of analysis, possibly a captcha to prove you're human and indeed landed there by accident, and a ban if you fail to prove so.
Sure, I know how a honeypot works, but again, in the case of something like the AVG bot, it will do absolutely no good b/c AVG was not crawling your site, but rather going off Google results (which observe the robots.txt).
So the question is, do the bots Bill is complaining about crawl (and thus can be caught by a honeypot) or are they accessing the site in more sophisticated ways?
I've used AlexK script for bad bots - added it after bandwidth spikes due to rogue bots led my former shared server host to turn my site off for a couple of short spells. Worked, and nifty, tho likely not catching all.
Lately moved to host that's better overall (notably uptime), but the bots script not working - after a while, can ban me, and maybe all users, maybe after backup routine run on server.
Time to add IP lists; tho I'd read of the badder bots' IPs being in flux, and wonder if a few of these IPs might ban some potentially worthwhile visitors.
With the move, I've also "lost" awstats. Had been on previous host, in root - not sure if this is "locked down", so results not readily seen by joe public.
Haven't (yet) tried installing it on new host; I'd been looking at adsense stats - and only a day or two ago signed on with google analytics.
As more and more sites are AJAX-driven, I'm thinking that fewer and fewer people will have JS disabled. I've lately been surfing with the Firefox NoScript plugin enabled and most sites have some major feature that doesn't work without JS.
I think that's a countervailing trend to the security issue. The problem is that of course it's not random. If you have a security-oriented site, for example, probably a very high percentage of your most important visitors will have JS disabled.
Let's say you have a directory that is disallowed in robots.txt. You put a dynamic page in that directory. The link to that page is hidden in some of your real pages that you allow bots to crawl. This link looks like a normal link, but it's really a link to that dynamic page in the disallowed directory. It's hidden behind a 2x1 (everyone seems to use 1x1) transparent gif on your legit pages. Bad bots that ignore robots.txt will try to follow this link. Anyone or anything that follows it is trapped by the dynamic page. Real users with real eyeballs never click on this link because they cannot see it. You have to be really patient to even get your mouse pointer to recognize there's a link there, because it's only a 2x1 transparent gif.
Your dynamic page logs the rogue IP address, and throws it into htaccess, and then at the end it spits out a Forbidden header.
Now this rogue bot keeps getting a 403 Forbidden from htaccess, but it doesn't care, it just keeps trying all the links it already has for your site. But at least you've now logged the bad guy's IP address, and can count its " 403 " entries in your access_log, and can take other measures later if necessary.
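A minimal sketch of what such a trap page might do, assuming an Apache setup where the older `Deny from` directive is in effect (Apache 2.4 uses `Require not ip` instead). The function name and the way the request's IP reaches it are hypothetical; a real handler would also need file locking and write permission on the htaccess file:

```python
import ipaddress

def trap_hit(remote_addr, htaccess_path):
    """Record a trap visitor in the htaccess file and return the response to send."""
    # Validate first so junk input can never be written into the config file.
    ip = str(ipaddress.ip_address(remote_addr))  # raises ValueError if not an IP
    rule = f"Deny from {ip}"
    try:
        with open(htaccess_path) as f:
            existing = f.read()
    except FileNotFoundError:
        existing = ""
    if rule not in existing.splitlines():  # avoid duplicate entries
        with open(htaccess_path, "a") as f:
            if existing and not existing.endswith("\n"):
                f.write("\n")
            f.write(rule + "\n")
    return 403, "Forbidden"
```

Validating the address before writing matters because the value ends up inside a server config file.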
|Omniture stats are a sight to behold. Especially when there are high volume numbers. Try analyzing the logfiles of a site that large. Anyone here do that? I mean, are you micro-managing more than a million pages? |
I resemble the first part of that (Omniture) and the multi-million pages remark, but not the micro-managing bit. We have a small, dedicated team working on analytics who can answer any of the obscure questions not readily accessible on a menu or dashboard. I think having more pages produces a more reliable estimate (web stats are estimates at best) than a smaller site with a few hundred or thousand pages that are more likely to be scraped fully.
A colleague and I have created separate dashboards to measure traffic to the same pages (~2M) and our respective counts are consistently out by 10 or 11 pages every day -- same Omniture data. Hence I treat web stats as estimates.
|it's unbelievable how much crap is scraping my content |
What are folks doing at the server level to stop this? Is there a method that works reasonably well that's not too crazy to implement?
|I've used AlexK script for bad bots ... Worked, and nifty |
moved to host that's better overall ... but the bots script not working - after a while, can ban me, and maybe all users, maybe after backup routine run on server.
I've used the bad-bot-blocking script [webmasterworld.com] for about 5 years now. These are my observations:
- Its forte is catching fast-scrapers.
They haven't been mentioned up to now in this thread. However, since the script regularly blocks attempts from bots on the same gigabit network as my server attempting to scrape the site at up to 50 times/sec, I think it deserves notice.
- The script parameters need reviewing every few months.
The parameter-settings are governed by how busy the site is. More pages/day means turning the dials a bit.
- Server-enforced backup which touches the tracking-files is a weakness of the script.
An effective solution is near the end of this thread [webmasterworld.com].
- I've abandoned slow-scraper blocking on most sub-domains within my site.
Slow-scraper blocks put an upper limit on how many pages are allowed to any IP per day. The problem is that it infallibly catches Google & other 'good' scrapers, which then means using a whitelist. I've become convinced that fast-scrapers are the main issue, and therefore also drop the roll-over (restart) time to 4 hours, which makes the blocking more effective.
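To make the fast/slow distinction concrete, here is a small in-memory sketch. It is not the bad-bot-blocking script itself: the class name, parameter names, and default limits are assumptions chosen for illustration, with only the 4-hour roll-over taken from the post above.

```python
import time
from collections import deque

class ScraperBlocker:
    """Per-IP request counter with a fast (burst) limit and an optional slow (total) limit."""

    def __init__(self, fast_limit=20, fast_window=10.0, slow_limit=None,
                 rollover=4 * 3600, whitelist=()):
        self.fast_limit = fast_limit    # max requests inside fast_window seconds
        self.fast_window = fast_window
        self.slow_limit = slow_limit    # max requests per rollover period, None = off
        self.rollover = rollover        # seconds until an IP's record restarts
        self.whitelist = set(whitelist)
        self.records = {}

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        if ip in self.whitelist:
            return True
        rec = self.records.get(ip)
        if rec is None or now - rec["start"] >= self.rollover:
            rec = self.records[ip] = {"start": now, "total": 0,
                                      "recent": deque(), "blocked": False}
        if rec["blocked"]:
            return False
        rec["total"] += 1
        recent = rec["recent"]
        recent.append(now)
        while recent and now - recent[0] > self.fast_window:
            recent.popleft()  # forget requests older than the burst window
        too_fast = len(recent) > self.fast_limit
        too_many = self.slow_limit is not None and rec["total"] > self.slow_limit
        if too_fast or too_many:
            rec["blocked"] = True
            return False
        return True
```

Leaving `slow_limit` at `None` mirrors the choice described above: only bursts are punished, so a well-behaved crawler that spreads its requests out never needs a whitelist entry.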
In summary, and in direct response to this thread, the distortion caused by all the bots is a real pain. AWStats says that 40-50% of all hits on my site are from bots, and they are just the ones that it can detect. As just one example, those figures do not include all the bots trying to do spam-posting into my site forums...
Even detecting the cloaked bots would take a far-more sophisticated algorithm than the bot-blocker script employs. And, since Google--the head-and-shoulders above everyone else algorithmic superior--is unable to stop sending advertisements promoting car-driving instructors for my site pages discussing modem drivers, I suspect that we may all have a little distance to go yet in that regard.
PS to docbird:
I retain your excellent sticky to me, and am anxious to implement your suggestions. Just a question of the time to do it.
[edited by: AlexK at 11:39 pm (utc) on July 18, 2008]
[edited by: jatar_k at 5:15 pm (utc) on July 31, 2008]
[edit reason] fixed link [/edit]
IP Ranges of known white listed bots
White listed proxies
IP Ranges(country specific stats from previous visitors VS. current visitor(AKA Trends))
JS Enabled (bots that request and maybe parse JS but do not support basic fluid-layout syntax such as self.innerWidth on a div tag)
Random Honey Pots(forget about robots.txt at this point)
Robots.txt access by IP (know who is at your door)
Random redirects(sometimes serverside sometimes with JS based on the browserlevel) based on the score of the above
Check out this section as often as you can: [webmasterworld.com...]
.. and a few more that I am still examining...
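One way to picture "the score of the above" is a simple weighted sum. Everything in this sketch is hypothetical: the signal names loosely follow the list, but the weights, thresholds, and action names are invented for illustration and would need tuning against real traffic.

```python
# Hypothetical weights: negative values are signs of a known-good visitor.
WEIGHTS = {
    "whitelisted_bot_ip": -5,
    "whitelisted_proxy": -2,
    "country_off_trend": 1,
    "js_check_failed": 2,
    "hit_honeypot": 5,
    "fetched_robots_txt": 1,
}

def score_visitor(signals):
    """Sum the weights of every signal that fired for this visitor."""
    return sum(WEIGHTS[name] for name, fired in signals.items() if fired)

def choose_action(score, challenge_at=3, block_at=5):
    """Map a score to an action; 'redirect' stands for a random-redirect challenge."""
    if score >= block_at:
        return "block"
    if score >= challenge_at:
        return "redirect"
    return "serve"
```

For example, a visitor that fails the JS check from an off-trend country scores 3 and gets the redirect challenge, while a single honeypot hit is enough to block.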
A few weeks ago I was pointed to a script that was fed a username and a password for a PayPal account. Keeping the session (all cookies), it went 6 pages deep into the site. Made me feel like a rookie at all this...
My Google analytics says 97.5% of my visitors have java.
Monitoring visitors has never been 100% accurate. I used to prefer logs as client side stuff does add more bandwidth to your pages, but today, logs just don’t have enough information to make an accurate estimate. I wouldn’t recommend that anyone use logs to monitor visits; in fact I've now turned them off completely.
Java does have a lot of information that can be pulled which gives a good indication of a real visitor, allowing us to get reasonably accurate visitor numbers, and I believe Google Analytics does quite a good job of this.
GA is, in the end, a part of Google's data mining. I know some people (the slightly paranoid ones) who block it completely so Google won't know their every move. Guess that counts towards the audience thingy.
blend27: sounds like a solid approach. did you write your analyzer yourself or is there some standard software that can easily be hacked to do that? logging all the headers can be quite a data overhead, but I guess it's worth it. Care to share those you're examining right now?
And the paypal-thing: it's a different thing to have a bot written or at least adapted for a single site. that's actually very easy with the help of some perl modules like WWW::Mechanize, but somebody who knows his business and really wants to extract data from your site is always hard to stop. You could delay him by having him sign up and then controlling how many pages a user may see a day, but that'll only slow him down.
It is something that I started working on right after Florida Update back in the day. Our Host at that time had a Stats program that took our rewritten SES URIs and started reporting to us that the top most visited destinations were subdirectories when in reality there are none(at least visible to the users or bots). The system sits on top of very well normalized RDB, so data overhead is not an issue at all. IIS Logs are also parsed into the DB, just for comparison purposes and as a backup for the numbers.
Standard Firefox headers sent via browser:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0
AVG got caught for missing at least 4 of those, as did lots of rogue bots, as was mentioned before.
As far as the slightly paranoid ones, I am one of them.
Hosts file entries:
No Tool bars
And the values of browser.safebrowsing.provider.0.* in Firefox are BLANK as well.
All the spiders I have are JS enabled, no way to tell them from a real browser.
if they're js-enabled and you catch them by their request speed, just send out some code that runs them into an endless loop ;)
blend27: sounds like you've really put some work into that. Header analysis is usually a sure thing, even works the other way round: people who surf with googlebot's useragent rarely think about adding the From-Header.
I find it fairly easy to distinguish bogus from real users on our retail sites. I rely on raw logs only and use ClickTracks Optimizer. With ClickTracks I create visitor segments. For instance, we create a segment of people who visited our order confirmation page. That means people who actually placed an order with a real credit card. We can then study how that segment browses the entire site. We also filter out countries that we don't ship to, so that eliminates a ton of scrapers and bots. We then compare segments like those who abandoned the shopping cart, or left after certain error pages. We have a couple of years worth of data to compare and after all the filtering we're pretty confident the data is about as good as we can hope for from raw logs.
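The segment idea can be approximated on raw logs without any particular product. This sketch is not how ClickTracks works internally; it assumes Common Log Format lines, treats each IP as one visitor (crude, given proxies), and the `/order/confirm` path in the usage note is made up:

```python
import re
from collections import defaultdict

# Matches the start of a Common Log Format line: IP, request path, status code.
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)[^"]*" (\d{3})'
)

def segment_by_page(lines, marker_path):
    """Return {ip: [paths...]} for visitors who requested marker_path at least once."""
    paths_by_ip = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            ip, path, _status = m.groups()
            paths_by_ip[ip].append(path)
    return {ip: paths for ip, paths in paths_by_ip.items() if marker_path in paths}
```

Calling `segment_by_page(lines, "/order/confirm")` keeps only visitors who reached the confirmation page, along with everything else they requested, which is the raw material for studying how real buyers browse.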
This thread has me thinking, however, of placing some sort of trap for bots and scrapers somewhere on our main page such as a 1-pixel link, or an invisible link, and then I can segment those visitors off since no human visitor would have seen or clicked those links. I'd want to be sure this doesn't affect legit search bots, though.
>>I'd want to be sure this doesn't affect legit search bots, though.
just block the trap in robots.txt -- no legit bot should ever follow it then.
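As a quick sanity check, Python's standard `urllib.robotparser` can show how a rule-following crawler would read such a robots.txt. The `/trap/` directory name and the example.com URLs are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that fences off the trap directory for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a crawler that obeys robots_txt is allowed to request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `may_fetch("Googlebot", "http://example.com/trap/page.html")` is false while the home page stays crawlable, so anything that does request the trap has ignored robots.txt.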
I use a combination of Adsense/Analytics and a raw log analyser that makes web reports/pages.
I did manually analyse data from a log file for one particular website for six years via a database/excel spreadsheet.
The biggest traffic spikes are related to specific events so I can tell where the traffic is coming from and what they want to look at.
Here's another prime example of why I say your log files are fiction:
I shared the list of user agents, many of them browsers, that a WordPress hacking script used to attack sites. Many other scripts use similar lists, many often longer as someone also points out in that thread.
Now just imagine there are many of these scripts or variations being run by hundreds or thousands of little kiddies wanting to leave their graffiti on your site, and you start to get a clue what some of that traffic really is.
Then you take the organized crime aspect from a country I won't mention, who likes to infiltrate residential computers by the 100s of thousands and use them as automated minions to do their bidding on the web, and the sheer scope of this problem becomes truly amazing.
Maybe your particular site only sees a little of this activity but mine sees a lot, WebmasterWorld sees a lot (just ask Brett), so be grateful if you're not in their crosshairs... yet.