Many online services pose as human visitors by disguising themselves as browsers. That lets them reach your server unnoticed so you can't stop them, and as a side effect they get added to the visitor count in your server log analytics.
Just recently there was a massive outcry about AVG [webmasterworld.com] engaging in this deceptive practice, which has now almost stopped, but many others continue unabated.
Some of the companies that are skewing your analytics include Picscout [webmasterworld.com], Cyveillance [webmasterworld.com], Munax [webmasterworld.com], WebSense [webmasterworld.com], and many more, too numerous to mention here; the list grows daily.
It isn't only companies doing this: scrapers, spam harvesters, spambots, botnets and all sorts of other illicit operations disguise themselves the same way to avoid being stopped.
To put it mildly, your server log analytics are complete and utter fiction.
That means clients who think these server-side stats are actually meaningful are probably making life miserable for their web designers, SEOs and marketing staff, because the inflated visitor counts make the low or decreasing conversion rates look like proof of a bad job.
Is there a solution?
Sadly, there is no accurate analytics solution at this time.
You can get close, but no cigar.
[edited by: incrediBILL at 10:22 pm (utc) on July 15, 2008]
For bots that are specifically designed to extract your content from a database search, you could check referers and see if they match, which, of course, is quite a bit of work to do in a script (i.e. if a visitor did not do a search that includes result 12345, he may be a bot if he requests show.php?id=12345). For AOL users who use the internal browser and thus sit behind proxies, that might also be a way to tell them apart if you have a large site with a lot of pages. If you combine session timeouts with analysis of possible clickstreams, you might be able to see that one AOL user went from this page to that page while the other one is operating in a different area of the site. That, of course, requires your software to know about the link structure within your site.
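The referer check described above could be sketched roughly like this in Python. Everything here is an assumption for illustration: the class name, the in-memory dictionary, and the idea that the search page lives at /search.php are all made up, and a real site would keep this state in its session store.

```python
from urllib.parse import urlparse, parse_qs


class SearchSessionTracker:
    """Remembers which result ids each session was shown by a search."""

    def __init__(self):
        self._shown = {}  # session id -> set of result ids (as strings)

    def record_search(self, session_id, result_ids):
        """Call this whenever a search results page is rendered."""
        shown = self._shown.setdefault(session_id, set())
        shown.update(str(r) for r in result_ids)

    def looks_suspicious(self, session_id, request_url, referer):
        """True when a detail page is requested without a matching search."""
        query = parse_qs(urlparse(request_url).query)
        ids = query.get("id")
        if not ids:
            return False  # not a detail-page request, nothing to check
        result_id = ids[0]
        # Hypothetical search page path; adjust to the real site layout.
        came_from_search = bool(referer) and urlparse(referer).path.endswith("/search.php")
        was_shown = result_id in self._shown.get(session_id, set())
        return not (came_from_search and was_shown)
```

A request flagged this way is only a hint, not proof: bookmarks, shared links and browsers that strip the referer will trip it too, so it makes more sense as one signal among several than as a ban trigger.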
Add cookies, user-agent tracking etc. and it'll get you closer to the truth.
Back in the day, we tried a dirty little trick using basic auth and the feature to include credentials in URLs. Basically, everyone who arrived got redirected to [uniquesessionid:email@example.com...], thus sending this "username" and password in every request. Too bad: shortly after we deployed it, browsers changed their default behavior of silently accepting those URLs due to security issues.
Another thing we tried was session IDs in subdomains, which were generated, tracked etc. by a mod_perl handler running as a PerlFixupHandler. That was nice because it allowed us to keep static HTML: site-absolute links like /files/bar.html would still work without breaking the session tracking, so there was no dynamic-content overhead, and it was quite accurate. It was given up when we finally rewrote the whole site, losing the frames. That kind of session tracking was mainly born because we wanted people to jump in from search engines and get a frameset with the right page shown in the content frame, so anyone who had a valid session would also have a frameset, and everyone else would get a session and a frameset.
You talking real users or crawlers or both or what?
And, how did you measure this?
I have a deep background in user support, including at big companies. Outside of companies or departments with high-security needs, I've never seen it turned off, and I've asked others with similar backgrounds if they've ever seen it turned off; their experience is the same as mine.
In the corporate world, the average intranet would simply not work properly if script was disabled in the user agents.
And nobody is going to get me to believe that the percentage of Windows users using IE with script turned off is more than .00001% or some such negligible number.
Sigh. I guess it's time to consider doing something to stop all this. What are folks doing at the server level to stop this? Is there a method that works reasonably well that's not too crazy to implement?
So the question is, do the bots Bill is complaining about crawl (and thus can be caught by a honeypot) or are they accessing the site in more sophisticated ways?
Lately moved to a host that's better overall (notably uptime), but the bots script isn't working there: after a while it can ban me, and maybe all users, maybe after the backup routine runs on the server.
Time to add IP lists; tho I'd read of the badder bots' IPs being in flux, and wonder if a few of these IPs might ban some potentially worthwhile visitors.
With the move, I've also "lost" awstats. Had been on previous host, in root - not sure if this is "locked down", so results not readily seen by joe public.
Haven't (yet) tried installing it on new host; I'd been looking at adsense stats - and only a day or two ago signed on with google analytics.
I think that's a countervailing trend to the security issue. The problem is that of course it's not random. If you have a security-oriented site, for example, probably a very high percentage of your most important visitors will have JS disabled.
Your dynamic page logs the rogue IP address, and throws it into htaccess, and then at the end it spits out a Forbidden header.
Now this rogue bot keeps getting a 403 Forbidden from htaccess, but it doesn't care, it just keeps trying all the links it already has for your site. But at least you've now logged the bad guy's IP address, and can count its " 403 " entries in your access_log, and can take other measures later if necessary.
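A rough Python sketch of that trap page and of the later " 403 " count might look like the following. The function names are invented for this example, the "Deny from" line assumes Apache 2.2-style access control with a suitable Order directive already present in the .htaccess file, and the log parser assumes Common Log Format. There is no file locking here, which a busy site would need.

```python
import ipaddress
import re
from collections import Counter


def deny_line(ip):
    """Return an Apache 2.2-style 'Deny from' line for a validated IP.

    ipaddress.ip_address() raises ValueError on junk input, so nothing
    but a well-formed address can end up in .htaccess.
    """
    return "Deny from {}\n".format(ipaddress.ip_address(ip))


def trap_response(ip, htaccess_path):
    """Append the caller to .htaccess, then answer 403 Forbidden."""
    with open(htaccess_path, "a", encoding="ascii") as fh:
        fh.write(deny_line(ip))
    return 403, "Forbidden"


# Common Log Format: ip ident user [time] "request" status size
_CLF = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" (\d{3}) ')


def count_403s(log_lines):
    """Count 403 responses per client IP in access_log lines."""
    hits = Counter()
    for line in log_lines:
        m = _CLF.match(line)
        if m and m.group(2) == "403":
            hits[m.group(1)] += 1
    return hits
```

Feeding count_403s() the lines of the access_log gives a per-IP tally of how persistently each banned client keeps knocking, which is the number to look at before deciding on "other measures".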
Omniture stats are a sight to behold. Especially when there are high volume numbers. Try analyzing the logfiles of a site that large. Anyone here do that? I mean, are you micro-managing more than a million pages?
A colleague and I have created separate dashboards to measure traffic to the same pages (~2M) and our respective counts are consistently out by 10 or 11 pages every day -- same Omniture data. Hence I treat web stats as estimates.
it's unbelievable how much crap is scraping my content
What are folks doing at the server level to stop this? Is there a method that works reasonably well that's not too crazy to implement?
I've used AlexK script for bad bots ... Worked, and nifty
moved to host that's better overall ... but the bots script not working - after a while, can ban me, and maybe all users, maybe after backup routine run on server.
They haven't been mentioned up to now in this thread. However, since the script regularly blocks attempts from bots on the same gigabit network as my server attempting to scrape the site at up to 50 times/sec, I think it deserves notice.
The parameter-settings are governed by how busy the site is. More pages/day means turning the dials a bit.
An effective solution is near the end of this thread [webmasterworld.com].
Slow-scraper blocks put an upper limit on how many pages are allowed to any IP per day. The problem is that it infallibly catches Google & other 'good' scrapers, which then means using a whitelist. I've become convinced that fast-scrapers are the main issue, and therefore also drop the roll-over (restart) time to 4 hours, which makes the blocking more effective.
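To make the idea concrete, here is a minimal Python sketch of a per-IP counter with a roll-over time and a whitelist. This is not the actual bot-blocker script discussed in this thread; the class name, the parameters and the exact-match whitelist are assumptions, and a real whitelist would more likely hold IP ranges.

```python
class ScraperBlocker:
    """Counts requests per IP inside a window that restarts periodically."""

    def __init__(self, max_requests, rollover_seconds=4 * 3600, whitelist=()):
        self.max_requests = max_requests          # pages allowed per window
        self.rollover_seconds = rollover_seconds  # 4 hours, as in the post
        self.whitelist = set(whitelist)           # 'good' scrapers, exact IPs
        self._window_start = None
        self._counts = {}

    def allow(self, ip, now):
        """Return True if the request should be served, False if blocked.

        `now` is a timestamp in seconds, e.g. from time.time().
        """
        if ip in self.whitelist:
            return True
        if self._window_start is None or now - self._window_start >= self.rollover_seconds:
            self._window_start = now  # roll over: forget all counters
            self._counts = {}
        self._counts[ip] = self._counts.get(ip, 0) + 1
        return self._counts[ip] <= self.max_requests
```

The max_requests value is the "dial" that depends on how busy the site is; a shorter roll-over clears false positives sooner, at the cost of letting a blocked scraper back in sooner.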
Even detecting the cloaked bots would take a far-more sophisticated algorithm than the bot-blocker script employs. And, since Google--the head-and-shoulders above everyone else algorithmic superior--is unable to stop sending advertisements promoting car-driving instructors for my site pages discussing modem drivers, I suspect that we may all have a little distance to go yet in that regard.
PS to docbird:
I retain your excellent sticky to me, and am anxious to implement your suggestions. Just a question of the time to do it.
[edited by: AlexK at 11:39 pm (utc) on July 18, 2008]
[edited by: jatar_k at 5:15 pm (utc) on July 31, 2008]
[edit reason] fixed link [/edit]
IP ranges of known whitelisted bots
Whitelisted proxies
IP ranges (country-specific stats from previous visitors vs. the current visitor, a.k.a. trends)
JS enabled (bots that request and maybe parse JS but do not support basic fluid-layout syntax such as self.innerWidth on a div tag)
Random honeypots (forget about robots.txt at this point)
Robots.txt access by IP (know who is at your door)
Random redirects (sometimes server-side, sometimes with JS based on the browser level) based on the score of the above
Check out this section as often as you can: [webmasterworld.com...]
.. and a few more that I am still examining...
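One way to read the list above is as a set of signals that get added up into a score. A small Python sketch of that idea follows; the weights, the threshold, the function names and the whitelisted range are all invented for illustration (192.0.2.0/24 is a documentation-only network), and the real system surely uses more signals than these.

```python
import ipaddress

# Hypothetical whitelist; a real one would list the crawlers' actual ranges.
WHITELISTED_BOT_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def in_ranges(ip, networks):
    """True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


def bot_score(ip, js_confirmed=False, hit_honeypot=False, fetched_robots_txt=False):
    """Higher score = more bot-like. Weights are illustrative only."""
    if in_ranges(ip, WHITELISTED_BOT_RANGES):
        return 0  # known good crawler: skip the other checks
    score = 0
    if not js_confirmed:
        score += 2  # never ran the JS layout probe
    if hit_honeypot:
        score += 5  # requested a hidden trap URL
    if fetched_robots_txt:
        score += 1  # humans rarely ask for robots.txt
    return score


def action_for(score, threshold=5):
    """Decide whether to serve the page or send the visitor elsewhere."""
    return "redirect" if score >= threshold else "serve"
```

Keeping the score separate from the action makes it easy to log the score for every visit first and only switch on the redirects once the numbers look sane.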
A few weeks ago I was pointed to a script that was fed a username and a password for a PayPal account. Keeping the session (all cookies), it went 6 pages deep into the site. Made me feel like a rookie at all this...
Monitoring visitors has never been 100% accurate. I used to prefer logs, as client-side stuff does add more bandwidth to your pages, but today logs just don't have enough information to make an accurate estimate. I wouldn't recommend that anyone use logs to monitor visits; in fact I've now turned them off completely.
JavaScript does expose a lot of information that gives a good indication of a real visitor, allowing us to get reasonably accurate visitor numbers, which I believe Google Analytics is doing quite a job at.
blend27: sounds like a solid approach. did you write your analyzer yourself or is there some standard software that can easily be hacked to do that? logging all the headers can be quite a data overhead, but I guess it's worth it. Care to share those you're examining right now?
And the PayPal thing: it's a different thing to have a bot written or at least adapted for a single site. That's actually very easy with the help of some Perl modules like WWW::Mechanize, but somebody who knows his business and really wants to extract data from your site is always hard to stop. You could delay him by having him sign up and then controlling how many pages a user may see a day, but that'll only slow him down.
It is something that I started working on right after Florida Update back in the day. Our Host at that time had a Stats program that took our rewritten SES URIs and started reporting to us that the top most visited destinations were subdirectories when in reality there are none(at least visible to the users or bots). The system sits on top of very well normalized RDB, so data overhead is not an issue at all. IIS Logs are also parsed into the DB, just for comparison purposes and as a backup for the numbers.
Standard Firefox headers sent via browser:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0
AVG got caught for missing at least 4 of those, as did lots of rogue bots, as was mentioned before.
As far as the slightly paranoid ones, I am one of them.
Hosts file entries:
No Tool bars
And the values of browser.safebrowsing.provider.0.* in Firefox are BLANK as well.
blend27: sounds like you've really put some work into that. Header analysis is usually a sure thing, and it even works the other way round: people who surf with Googlebot's user agent rarely think about adding the From header.
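The From-header observation can be turned into a very small check. A hedged Python sketch, with function names made up for this example and the request headers assumed to arrive as a plain dictionary:

```python
def claims_googlebot(headers):
    """True if the (lower-cased) headers carry a Googlebot user agent."""
    return "googlebot" in headers.get("user-agent", "").lower()


def suspicious_googlebot(headers):
    """True if the client says it is Googlebot but sends no From header."""
    # Header names are case-insensitive, so normalise them first.
    h = {k.lower(): v for k, v in headers.items()}
    return claims_googlebot(h) and not h.get("from", "").strip()
```

This only catches the lazy fakes, since a From header is as easy to forge as a user agent, so it fits best as one more input to whatever scoring is already in place.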
This thread has me thinking, however, of placing some sort of trap for bots and scrapers somewhere on our main page such as a 1-pixel link, or an invisible link, and then I can segment those visitors off since no human visitor would have seen or clicked those links. I'd want to be sure this doesn't affect legit search bots, though.
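Segmenting on such a trap link could look something like this in Python. The trap path, the function name and the list of search-bot IPs are placeholders; the sketch assumes the requests have already been pulled out of the logs as (ip, path) pairs and that legit search bots are identified by some separate whitelist.

```python
def segment_visitors(requests, trap_path="/trap-link.html", search_bot_ips=()):
    """Split (ip, path) pairs into human-looking IPs and trapped IPs.

    IPs listed in search_bot_ips are left out of both groups so that
    legit crawlers following the hidden link are not counted as scrapers.
    """
    known_bots = set(search_bot_ips)
    seen, trapped = set(), set()
    for ip, path in requests:
        if ip in known_bots:
            continue
        seen.add(ip)
        if path == trap_path:
            trapped.add(ip)
    return seen - trapped, trapped
```

Listing the trap path under Disallow in robots.txt is another way to keep well-behaved crawlers off it, since those honor the file while most scrapers do not.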
I did manually analyse data from a log file for one particular website for six years via a database/excel spreadsheet.
The biggest traffic spikes are related to specific events so I can tell where the traffic is coming from and what they want to look at.
I shared the list of user agents, many of them browsers, that a WordPress hacking script used to attack sites. Many other scripts use similar lists, many often longer as someone also points out in that thread.
Now just imagine there are many of these scripts or variations being run by hundreds or thousands of little kiddies wanting to leave their graffiti on your site, and you start to get a clue what some of that traffic really is.
Then you take the organized crime aspect from a country I won't mention, who likes to infiltrate residential computers by the 100s of thousands and use them as automated minions to do their bidding on the web, and the sheer scope of this problem becomes truly amazing.
Maybe your particular site only sees a little of this activity but mine sees a lot, WebmasterWorld sees a lot (just ask Brett), so be grateful if you're not in their crosshairs... yet.