
Forum Moderators: Ocean10000 & incrediBILL & keyplyr

Featured Home Page Discussion

Most of Your Traffic is Not Human

     
8:40 pm on Jul 6, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8975
votes: 409


It's a disappointing but eye-opening statistic that most of the traffic to our websites is not from actual people. In fact, well over half of our traffic is not human.

Bot traffic is in an uptrend. Most of it is from bad bots, or at least from bots that are not beneficial to our interests (depending on the site model).

Here's the estimated breakdown*:
• 28% Search Engine & other good bots
• 10% Scrapers & Downloaders
• 5% Hacking tools & scripts
• 1-3% Automated link spam
• 12% Other impersonators

Analytics & site reporting software is easily fooled by bots masquerading as humans; detecting bots is simply not what those tools are built to do.

*based on 10k daily page loads (YMMV)
10:25 pm on July 6, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13834
votes: 484


When you figure percentages, do you look at page requests or all requests? A human visitor never makes just one request; it's anything from half a dozen to over a hundred, depending on what's happening on the page. But it's the rare robot that requests more than the page alone--especially all at once. So even if 2/3 of your page requests are from robots, it might still represent only a small fraction of all requests handled by your server.
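
To make the page-request vs. all-requests distinction concrete, here is a minimal Python sketch (an illustration for this thread, not any poster's actual tooling) that splits a combined-format access log into page requests and supporting-file requests. The log path and extension list are assumptions; adjust them for your own site.

import re
from collections import Counter

# Extensions treated as supporting files rather than page views (an assumption;
# adjust the list to match your own site).
ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico",
                    ".svg", ".woff", ".woff2", ".ttf")

# Matches the request line of a common/combined log format entry, e.g.
# 203.0.113.5 - - [06/Jul/2017:20:40:00 +0000] "GET /page.html HTTP/1.1" 200 ...
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

counts = Counter()
with open("access.log") as log:          # placeholder path
    for line in log:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        path = match.group(1).split("?", 1)[0].lower()
        kind = "asset" if path.endswith(ASSET_EXTENSIONS) else "page"
        counts[kind] += 1

total = sum(counts.values())
for kind, n in counts.items():
    print(f"{kind}: {n} ({n / total:.1%} of all requests)")

Run against a day's log, this shows how small a share page requests are of everything the server handles, which is the denominator difference described above.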
10:50 pm on July 6, 2017 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3331
votes: 162


The one thing that stood out for me on a recently analyzed single day is what appears to be a huge increase in bot-net type activity. For each single human page view there were over 300 sequential file requests from a single "human" UA, spread across hundreds of IPs. The majority were GET or POST requests for variations on wp-login or xmlrpc. No real damage was done, but the collateral issue is damage enough: having all that background noise constantly in the picture has to slow things down for actual visitors. It does not matter that every GET or POST request is served an error page. It's got to be a real confusing mess for people who rely on stats programs to tell them about "visitors". :(
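
For anyone who wants to measure that kind of probing in their own logs, here is a rough Python sketch (an illustration, not the poster's method) that tallies wp-login.php and xmlrpc.php requests per client IP from a combined-format access log; the filename is a placeholder.

import re
from collections import Counter

# Targets typical of WordPress credential/pingback probing, as described above.
PROBE_RE = re.compile(r'"(?:GET|POST) /(?:wp-login|xmlrpc)\.php')
IP_RE = re.compile(r"^(\S+) ")               # client IP is the first log field

hits_per_ip = Counter()
with open("access.log") as log:              # placeholder path
    for line in log:
        if PROBE_RE.search(line):
            ip_match = IP_RE.match(line)
            if ip_match:
                hits_per_ip[ip_match.group(1)] += 1

print(f"{sum(hits_per_ip.values())} probe requests from {len(hits_per_ip)} IPs")
for ip, n in hits_per_ip.most_common(10):    # the ten noisiest sources
    print(ip, n)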
12:31 am on July 7, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8975
votes: 409


do you look at page requests or all requests?
My figures are coming from "page loads" based on a cross section of popular sites of small to medium size.
*based on 10k daily page loads



it's the rare robot that requests more than the page alone... So even if 2/3 of your page requests are from robots, it might still represent only a small fraction of all requests handled by your server.
Absolutely true, but that's not the metric I'm writing about :)

However, there are image-scraping bots that would drive the *file request* metric up pretty high. I block a couple hundred of those every day.


The one thing that stood out for me on a recently analyzed single day is what appears to be a huge increase in bot-net type activity.
Botnets can be huge. They come and go. I don't see them as much as I used to. They mostly come from compromised servers at hosting companies, but can include the occasional compromised ISP account. These are scripts that are bought and sold on the dark web.
3:24 am on July 7, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5460
votes: 3


I've always believed the bot traffic (good & bad) is closer to 70%, and nothing has happened in the last decade to make me believe otherwise!

FB, which I've allowed for almost a year now (previously denied because there is no way to determine the actual referring page, i.e. whether the link is used in a good or bad context), is an absolute JOKE!
Visitors do not look at other pages. This result tends to make me believe that FB embeds the entire page (supporting files included) and that FB users who browse appear in raw logs as CLICK-visitors.
3:52 am on July 7, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:7646
votes: 519


I'm with wilderness (as to the absolute noise on the web these days!). Aggressive bot thumping is a chore, but whitelisting is so much easier and bandwidth conserving. As webmasters we pick our poison and go from there.
4:03 am on July 7, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8975
votes: 409


The above stats tally to 58% of visitors being non-human. This is "more than half..."

The larger the site, the larger the number of bots accessing all those pages/files. The more exposure, the more bots will follow the backlinks. The more Social Media interaction, the more bot frenzy.

One thing I've noticed on my own site is that bot activity is fairly consistent whether or not I get a lot of human traffic that day... but if I'm running a promotion, especially on Social Media, bots will hit my server with great vengeance!
8:17 am on July 7, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:7646
votes: 519


Are those numbers from blacklisting or whitelisting?

As a whitelister I don't care how many bots hit to get a 403. And stripping out 58% (your number) from my raw logs is no problem. What I am seeing is 43% automated/undesired, YMMV.

Any way you look at it, the bot traffic is abominable!
9:51 am on July 7, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8975
votes: 409


This has nothing to do with whether the bots are completing their requests. Nothing to do with blocking or not blocking.
3:06 pm on July 7, 2017 (gmt 0)

Full Member

joined:July 29, 2012
posts:238
votes: 11


Do you have any stats on how much bot traffic comes from countries not targeted by the website? For example, US sites being heavily hit from China, Ukraine, etc.
4:33 pm on July 7, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member bwnbwn is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Oct 25, 2005
posts:3544
votes: 19


If you use Chrome to visit your website, or any website, you might as well count it as two hits: one from a Google bot and one from you. If you look at your log files you will see this. It has been years since I looked into it, but working from a fuzzy memory I remember it as the content bot.
4:39 pm on July 7, 2017 (gmt 0)

Full Member

Top Contributors Of The Month

joined:July 23, 2015
posts:240
votes: 72


@keyplyr, agreed with your topic title.

I am seeing over 50% bots. And this is after discounting all of Amazon's "cloud" (AWS) bots, as I block ALL of their IP blocks, all of them.

It's nearly impossible to figure out ROI of web development and SEM nowadays.

@Awarn, Ukraine mobile-phone IPs are one of the top sources, but they are all over the world. In Germany and France it's OVH hosting; there are tons of hosting companies used as proxies all over the world. There are also tons and tons of third-world mobile pools - Vietnam, Philippines, Brazil, etc. If it's mobile, it's likely a bot unless your site specifically targets mobile users.
5:26 pm on July 7, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13834
votes: 484


if I'm running a promotion, especially on Social Media, bots will hit my server with great vengeance!

I've got an entire category of robots that arrive solely in response to an RSS feed (someone else's). Until two years ago, I never even knew they existed. And, heck, I don't suppose they ever knew I existed ;)

If it's mobile, it's likely a bot unless your site specifically targets mobile users.

For a while I was getting iOS robotic visits to the root, but now I'm tentatively deciding that they're not really robots; it's some kind of follow-up triggered by an earlier human visit. (I don't count it as a robot if it's got some relation to voluntary human activity, like Firefox's Favicon Reloader.) The great difficulty with mobile visits is that you can't match the IP because it's different from one hour to the next.
6:40 pm on July 7, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8975
votes: 409


The great difficulty with mobile visits is that you can't match the IP because it's different from one hour to the next
Mobile IPs are no different than Desktop IPs. If you're seeing different IPs "from one hour to the next" it is likely a bot, not a human on a mobile device.

There are however dynamic IPs used with a few mobile apps, which might vary the source IP. This is not normal, but I've seen it from cloud assignments.
8:06 pm on July 7, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13834
votes: 484


If you're using your mobile device on WiFi--which can have advantages over using your provider's data connection--the IP is wherever you happen to be currently located.
9:00 pm on July 7, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8975
votes: 409


If that's what you meant above, then yeah that can happen, but what gives you the impression it is the same user, just travelling around, connecting to different WiFi hubs then accessing your pages?
12:57 am on July 8, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13834
votes: 484


If they come in with a piwik cookie and I've never seen the IP before, that's a clue.
3:37 am on July 8, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Feb 3, 2014
posts: 981
votes: 202


Been saying this for quite some time. Explains Zombies.
8:23 am on July 8, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8975
votes: 409


Many non-techie site owners will look at their site traffic report or GA (that their SEO-for-hire set them up with), see high page loads and think they are having a great day with a lot of visitors, then wonder why their products aren't converting or why no one is clicking on their ads... unaware that their high page loads are likely just the result of some bot scraping their files.
1:10 pm on July 8, 2017 (gmt 0)

Moderator from GB 

WebmasterWorld Administrator mack is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:June 15, 2001
posts:7655
votes: 29


Google Analytics does a not-too-terrible job of showing real users, although there are some systems, specifically those offered by hosting companies, that show just about everything that hits the server as a user.

Mack.
1:59 pm on July 8, 2017 (gmt 0)

Senior Member

joined:July 29, 2007
posts:1780
votes: 100


I just want to dispute the figures in the opening post. Not in their total number, but in their percentage breakdowns. To me, a bot by a news agency or social media company or data reporting company is not a trusted or "good" bot. These have not earned that distinction; they provide no benefit to the site being crawled, etc.

My advice: learn to evaluate server logs and avoid relying on 3rd-party software of any kind, including "trusted" sources like Analytics. Your server logs will not lie; they are what they are, and you NEED to learn to read them. As for fixing the problem, be ruthless in blocking these at the door with htaccess and employ a whitelist approach to known bots. Much of the bot traffic comes to gather information that is used by other sites to best your content, or is sold by data acquisition firms.

robots.txt and popular CDN use are not enough; you need to lock the front door, learn to read your server logs yourself, and modify your htaccess files yourself.
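
As a minimal illustration of that log-reading, whitelist-first approach (a generic sketch, not the poster's htaccess rules), the Python below flags any client that identifies itself as a crawler but is not on a short trusted list. The tokens are assumptions; build your own lists from your logs.

# Substrings of User-Agent headers you have decided to trust (an assumption;
# whitelist only the bots that actually benefit your site).
TRUSTED_BOT_TOKENS = ("googlebot", "bingbot")
# Generic markers that usually indicate automated clients.
BOT_MARKERS = ("bot", "crawl", "spider", "curl", "wget", "python-requests")

def classify_user_agent(ua: str) -> str:
    """Rough triage of a User-Agent string; UAs can be forged, so pair this
    with IP or rDNS verification before trusting the 'good' ones."""
    ua_lower = ua.lower()
    if any(token in ua_lower for token in TRUSTED_BOT_TOKENS):
        return "trusted bot"
    if any(marker in ua_lower for marker in BOT_MARKERS):
        return "untrusted bot (candidate for blocking)"
    return "claims to be human (verify by behaviour)"

print(classify_user_agent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(classify_user_agent("MegaScraper/1.0 (+http://example.com/bot)"))   # hypothetical UA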
5:24 pm on July 8, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13834
votes: 484


This may be the forums-editing equivalent of a reductio ad absurdum, but:
social media ... provide no benefit to the site being crawled
Surely that's a matter of individual judgement?

Google Analytics does a not-too-terrible job of showing real users, although there are some systems, specifically those offered by hosting companies, that show just about everything that hits the server as a user.
Dunno about other third-party analytics, but a major selling point of piwik--which the present site uses--is that it lives on your own server, so requests are subject to your own access-control rules.

Now if you're talking about the generic “analog stats”, yeah, that's pretty close to being worse than useless.
7:03 pm on July 8, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8975
votes: 409


To me, a bot by a news agency or social media company or data reporting company is not a trusted or "good" bot.
Of course each site owner needs to determine what is or isn't beneficial to their interests.

Example: Some image gallery sites may depend on traffic from the various image searches, while other site owners consider image bots as malicious scrapers and block them.

The metrics posted in the OP are conservative based on a handful of popular sites getting 10k page loads per day. It is likely these stats are far higher for the majority of websites.
7:15 pm on July 8, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8975
votes: 409


if you're talking about the generic “analog stats”, yeah, that's pretty close to being worse than useless
What your host does with Analog may well be "generic" and not very useful, but Analog [mirror.reverse.net] is a robust and accurate server traffic analyzer that can be highly customized to return useful data. Best to install on your local machine to process hourly/daily raw access logs. Probably the most accurate logfile analyser I've used in 20 years (but you need to customize it to produce the metrics you use.)
9:58 pm on July 8, 2017 (gmt 0)

Senior Member

joined:July 29, 2007
posts:1780
votes: 100


Surely that's a matter of individual judgement?

I'm not asking you to agree with me, I'm telling you the figures for "trusted" and "bad" are very subjective to begin with. You may have noticed that social media no longer treats all sites and content the same? That's not always good for the site owner, unfortunately.
7:54 am on July 9, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8975
votes: 409


I am seeing over 50% bots. And this is after discounting all of Amazon's "cloud" (AWS) bots, as I block ALL of their IP blocks, all of them
smilie - you may want to reevaluate that tactic.

While AWS certainly hosts a wide variety of unwelcome (bad) bots, there may be a few bots coming from those ranges that could be beneficial to your interests, if not directly then indirectly.

Example: A large number of botrunners hosting at AWS are marketing companies that gather data rolled into products they sell to help ecommerce clients develop ad campaigns.

If you publish Adsense or other ads you would want your site data included in these products to facilitate ad placement and drive up bidding. This translates to greater income from ad clicks.

So you would want to add exceptions to your blocking rules to allow those User Agents access to your server.

That's just one example. There are numerous others.

Blocking can be effective but it needs to be surgical: Blocking Methods [webmasterworld.com]
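
As one illustration of what "surgical" can look like (a generic sketch, not the methods in the linked thread), the Python below denies a cloud IP range by default but lets an explicitly excepted User-Agent through. The CIDR and the UA token are placeholders.

import ipaddress

# Placeholder cloud range and exception list; substitute the ranges and
# User-Agents relevant to your own site.
BLOCKED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]
ALLOWED_UA_TOKENS = ("exampleadsbot",)        # hypothetical beneficial bot

def should_block(ip: str, user_agent: str) -> bool:
    """Block requests from listed ranges unless the UA is explicitly excepted."""
    addr = ipaddress.ip_address(ip)
    in_blocked_range = any(addr in net for net in BLOCKED_RANGES)
    is_excepted = any(token in user_agent.lower() for token in ALLOWED_UA_TOKENS)
    return in_blocked_range and not is_excepted

print(should_block("203.0.113.42", "ExampleAdsBot/1.0"))   # False: excepted UA
print(should_block("203.0.113.42", "RandomScraper/2.0"))   # True: blocked range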
7:14 pm on July 9, 2017 (gmt 0)

Junior Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 102
votes: 7


I agree that over 50% of all traffic to my sites is bots. I find it critical to keep up with monitoring your logs.

Incapsula [incapsula.com] publishes a yearly report on bots in internet traffic. Here's 2016's numbers:
Human: 48.2%
Good Bot: 22.9%
Bad Bot: 28.9%

Bot writers are getting smarter and making it harder to differentiate bots from humans. I see much progress in some of the Chinese bots. As bots become more "human" it will become harder for us to out them. I think many bots are fooling the Incapsula study.

My question is: There must be significant money made by deploying so many bots. How are bot writers making so much money, who is paying them and why?
7:41 pm on July 9, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8975
votes: 409


How are bot writers making so much money, who is paying them and why?
Most bots are marketing driven, which has many applications, but it all comes down to data retrieval. Information has high value, and can be rolled into many products. Who's paying them? We are.

Many of today's bots are discussed in the Search Engine Spider & User Agent ID Forum [webmasterworld.com]
11:16 pm on July 9, 2017 (gmt 0)

Senior Member from CA 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Nov 25, 2003
posts:1047
votes: 220


I totally agree that which bots to let through is a business decision and will, usual suspects aside, vary significantly by niche and by focus (eCom or Info). That said, keyplyr, based on various comments, is among the most lenient of active bot blockers I've encountered; my whitelist is under two dozen and - this is a major difference from keyplyr - must be affirmable via rDNS (reverse DNS lookup) for access, which means that many/most cloud hosted bots are nonstarters.
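
For readers unfamiliar with the rDNS test, here is a minimal Python sketch of forward-confirmed reverse DNS (a generic illustration of the technique, not iamlost's implementation): reverse-resolve the IP, check the hostname against the crawler's published domain, then resolve that hostname forward and confirm it maps back to the same IP. The suffix list is an assumption; extend it for whichever bots you whitelist.

import socket

# Domains a few well-known crawlers publish for verification; treat this list
# as an assumption and extend it for the bots on your own whitelist.
VERIFIED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def fcrdns_ok(ip: str) -> bool:
    """Forward-confirmed reverse DNS: PTR lookup, suffix check, then A lookup back."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)             # reverse lookup
        if not hostname.endswith(VERIFIED_SUFFIXES):
            return False
        _, _, addresses = socket.gethostbyname_ex(hostname)   # forward lookup
        return ip in addresses
    except OSError:                                           # covers herror/gaierror
        return False

# Requires network access; a spoofed "Googlebot" UA from a cloud IP will fail
# this check even though its User-Agent string looks legitimate.
# print(fcrdns_ok("66.249.66.1"))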

I am also in agreement with those who believe the percentage of bots to humans is actually higher than most reports suggest. Late 2011/early 2012 was when I first read reports of bots surpassing human web traffic. Those same reporters had bots peaking above 60% in 2013 and falling back to the high 40s/low 50s since. The common reasoning for the drop was Google and Penguin. However, in that same period I was seeing an increase in much more humanistic bots, especially since 2015. I've been running tests on my sites with Piwik every year since 2014 (and colleagues have run similar tests with Google Analytics for even longer) that consistently show them misidentifying 40-50% of otherwise identified bot traffic as human.

I can charge high direct-sale ad rates in part because of bot exclusion, plus an allowance (for bots presumably missed) that can be statistically justified. So false acceptance rate (FAR) and false rejection rate (FRR) are critical calculations. And that means ever-new behaviour tests, which are unfortunately, at least for me, more difficult with mobile. Regardless, quite the arms race.
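
For anyone unfamiliar with those two rates, the standard definitions are simple to compute (the numbers below are made up for illustration): FAR is the share of bot sessions your filters wrongly accept as human, and FRR is the share of human sessions wrongly rejected as bots.

def far(bots_accepted_as_human: int, total_bots: int) -> float:
    """False acceptance rate: bots that slipped through, over all bot sessions."""
    return bots_accepted_as_human / total_bots

def frr(humans_rejected_as_bots: int, total_humans: int) -> float:
    """False rejection rate: humans wrongly filtered out, over all human sessions."""
    return humans_rejected_as_bots / total_humans

# Example: 400 of 5,000 bot sessions misread as human, 30 of 4,000 human
# sessions misread as bots.
print(f"FAR = {far(400, 5000):.1%}, FRR = {frr(30, 4000):.1%}")   # FAR = 8.0%, FRR = 0.8%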

One last point to make is that all sites are not equal with regard to bots. Some are better some worse than average (by definition) but site traffic size/volume can, apparently, be absolutely critical.
Note: data from Incapsula.
Sites with n-visitors a day:
* 10-1K: ~15% human, ~85% bot.
* 1K-10K: ~30% human, ~70% bot.
* 10K-100K: ~50% human, ~50% bot.
* 100K-1M+: ~60% human, 40% bot.
Note: from 2015 to 2016 the bot percentage went up YoY for sites under 10K and down for sites over 10K.

How many of those smaller sites' webdevs/owners have a freaking clue?
How many performance concerns might this explain?

Reminder: Google (and other) search referred traffic is bot as well as human. Sometimes more so.
12:03 am on July 10, 2017 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8975
votes: 409


this is a major difference from keyplyr - must be affirmable via rDNS (reverse DNS lookup) for access, which means that many/most cloud hosted bots are nonstarters.
iamlost, I think you may have misunderstood the OP.

Again, the stats in the OP have nothing to do with blocking. The bots were counted whether they were able to access files or not. And again, this was not my own website. As mentioned above, the stats are from a dozen or so popular sites with approx 10k page loads per day.

keyplyr, based on various comments, is among the most lenient of active bot blockers I've encountered
Possibly, but I highly doubt it. My comments are usually not about my own sites. However, since you are interested: on my own personal site I block somewhere between 6k and 8k requests per day using, but not limited to, these Blocking Methods [webmasterworld.com]

[fix typo]

[edited by: keyplyr at 1:39 am (utc) on Jul 10, 2017]
