
Forum Moderators: Ocean10000


At Home with the Robots: 2015 edition

8:54 am on Feb 9, 2015 (gmt 0)

lucy24 (Senior Member from US)



It's been two years. Time to see what the robots are up to.
2012 edition [webmasterworld.com]
2013 edition [webmasterworld.com]
I skipped last year because I moved sites in late December and the search engines were still in "what the ### is going on?!" mode.


The Good...

2. We Try Harder.

Why have I put #2 before #1? Because the bingbot was, once again, more active than the googlebot-- almost half again as many requests overall. But unlike past years, this was not because of a morbid appetite for robots.txt. In the entire month of January-- hold on to your hats-- the bingbot only requested robots.txt 60 (sixty) times. And even this figure is misleading. On a couple of days they read robots.txt up to 5 times, much like the bingbot of old. To make up for it, there were spells when they went over 48 hours without a single robots.txt request.

bingbot
IP ranges: 157.55, 207.46
Some formerly popular ranges seem to have disappeared: 65.52 was rare after January 2014 (a year ago); in the past year, 131.253 seems only to be used for WMT (site verification).
UA: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
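For anyone watching these ranges in scripts, a minimal sketch in Python; the /16 blocks are just the ones observed above, not an official Microsoft list:

```python
import ipaddress

# The /16 blocks seen in this month's logs -- an illustrative list,
# not Microsoft's official published ranges.
BINGBOT_NETS = [
    ipaddress.ip_network("157.55.0.0/16"),
    ipaddress.ip_network("207.46.0.0/16"),
]

def looks_like_bingbot_ip(ip: str) -> bool:
    """True if the address falls inside one of the observed bingbot blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BINGBOT_NETS)

print(looks_like_bingbot_ip("157.55.39.1"))   # True
print(looks_like_bingbot_ip("65.52.104.21"))  # False: the retired 65.52 range
```

A range check like this is only a heuristic; a reverse-DNS lookup confirming the host resolves under search.msn.com is the stronger test.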

bingbot mobile
This is a brand-new UA. It first showed up on this site on 15 January, halfway through the very month I was looking at.
IP ranges: 157.55, 207.46 (same as ordinary bingbot)
UA: Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0;  http://www.bing.com/bingbot.htm)
and
Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Can you spot the difference? Bing-- by any name-- has always had issues with double spaces.
Behavior: Unlike the more familiar mobile Googlebot, this new mobile bing gets all kinds of files-- images as well as pages. On the other hand, it is never used for robots.txt requests. I don't know how this would work if you wanted to set separate rules for the mobile bingbot, since there's nothing readily distinctive in its name.
So far, this UA is rare: only about 1/25 of all bing requests. I expect it to become more common, though.
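If you did want to treat the mobile bingbot separately -- in log analysis at least, since robots.txt only ever sees the shared "bingbot" token -- the one usable handle is the combination of the iPhone platform string with that token. A hypothetical classifier:

```python
import re

# Hypothetical classifier. The mobile bingbot has no distinctive name of
# its own; the only handle is the combination of an iPhone platform
# string with the shared bingbot token.
MOBILE_BINGBOT = re.compile(r"iPhone.*\bbingbot/", re.IGNORECASE)
ANY_BINGBOT = re.compile(r"\bbingbot/", re.IGNORECASE)

def classify(ua: str) -> str:
    if MOBILE_BINGBOT.search(ua):
        return "bingbot-mobile"
    if ANY_BINGBOT.search(ua):
        return "bingbot"
    return "other"
```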

Punch line: Yup, the bingbot is using an iPhone UA. Har de har har.

msnbot-media
IP: 65.55 only, but its visits were so rare, I don't know if I can lay down a rule
UA: msnbot-media/1.1 (+http://search.msn.com/ msnbot.htm)
Behavior: Same as two years ago. Robots.txt, one image file, that's it.

msnbot
Enjoying its retirement. I haven't set eyes on it since last August, and even then it primarily asked for the robots.txt-plus-sitemap combo.

Bing Preview
IP: 199.30, 65.55
UA: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b
and
Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 BingPreview/1.0b
Behavior: The odd feature of these UAs is that they never request a page, and only occasionally a stylesheet. In general they request only the supporting files (images, fonts, scripts) belonging to that page-- which they give as referer, just like a human. From this you have to conclude that they're requesting the supporting files invoked by any given page the last time they crawled it, which may or may not be identical to what the page uses today.
The iPhone preview showed up at mid-month, at pretty exactly the same time as the mobile bingbot.

Plainclothes bingbot
IP: 65.55, 131.253, 157.56
UA: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)
Behavior: If I'd thought of it, I would have unblocked them to see if there was any change in behavior. But I didn't, so I can't. They requested assorted pages, each time accompanied by the supporting files that a human would have got from the 403 page. But they never requested the favicon (not explicitly linked from any page) the way a bona fide human would have done. Requests always included the with-script version of piwik (analytics), though I didn't let them have it.
It's been several years and I remain at a complete loss as to what the plainclothes bingbot is for. I doubt it's humans surfing on their own time in Redmond ... especially since one request this month came in at 4:30 AM, and two others on a Saturday.

bing site authorization
IP: 131.253.38.67 (always this exact IP)
UA: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 7.1; Trident/5.0)
This IP-UA combination is only used for requesting the /BingSiteAuth.xml file that goes with wmt.

1. We're Number One!

How does it go again? First in war, first in peace, last in the American L-- Whoops, no, I'm thinking of something else.

IP for all activities: 66.249.64-95
Two formerly well-known Google IPs seem to be on hiatus. I last saw 74.125 in May of 2013; 72.14 was last seen in July 2014, and then only for site verification (wmt).
Behavior: Not exactly new, but I only just noticed it. The Googlebot-- including mobiles and images-- follows all redirects within the second, almost like a human. The only exception is when they've already crawled the new URL within the past hour; then there's no repeat request.
I notice a fair number of requests for some-garbage-string.html, ending in a 404 response. I believe this is programmatically triggered any time a site yields an unexpected number of redirects; they're checking for Soft 404s.

Googlebot
UA: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Filetypes: html, css, js, pdf
Behavior: Unlike some past years, the googlebot never requested an image this month. It may not have stopped entirely, though; I found one request as recent as December 2014. Sometimes it does still send a referer when requesting .css or .js files.
New quirk: Late last year, the googlebot took it into its head that URLs in one directory contained a double slash, like /directory//subdir/pagename.html. I've never pinpointed the reason, but they're contentedly following redirects.

mobile Google
UA (see below): Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
and
SAMSUNG-SGH-E250/1.0 Profile/ MIDP-2.0 Configuration/ CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/ bot.html)
and
DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/ bot.html)

Note the first form. Up until February of 2014, the UA for their most common mobile crawler specifically said "Googlebot-Mobile":
Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/ bot.html)
After February of 2014, with a very brief overlap, they changed to the current UA string with "Googlebot/" instead of "Googlebot-Mobile/". At the same time, they jumped from iOS 4 to iOS 6.
The assorted mobiles together make up about 1/6 of all Google requests. Unlike mobile bing, these UAs only request pages. I find it interesting that they never request scripts or stylesheets. Do they look at my @media rules and conclude that everyone gets the same css?

Googlebot-Image
UA: Googlebot-Image/1.0
Behavior: Most requests get a 304 response. I don't know if this is a reflection of differing request headers or my own server's behavior. The same applies to pdfs, which are requested by the regular Googlebot.

blank.html
I should mention this here, because it seems to play a role in mobile image search. Requests come with the referer http://www.google.tld/blank.html. Current UAs are in no way limited to mobiles, though.

Google Favicon
UA: Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 Google favicon
This new UA featuring FF 6 showed up in November of 2013. I guess it's an improvement over the old one with the blank UA. Concurrently they changed from the original 74.125 IP to the same IP as all other Googlebots.
Behavior: No special treatment; that FF 6 UA gets them redirected straight to an old-browsers page. Fortunately this doesn't prevent them from getting the favicon, which after all was the real purpose of the visit. They did this a total of 15 times during the month. Don't know what they did with them all; do many sites change favicons every other day?

Google Preview
IP: 66.249 (same as crawl) and 64.233
UA: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/27.0.1453 Safari/537.36
The 64.233 IP appears to have been on an extended vacation
:: insert witticism about Matt Cutts here ::
It went away in January 2012 and didn't reappear until April 2014.
Behavior: As far as I can tell, there is no difference between the Preview triggered by a human search-- assuming this still exists?-- and the Preview you get in WMT. Tip: In logs, you can easily tell which ones are WMT previews, because the package will include one redirected request for the front page. That's your domain-name-canonicalization redirect at work.

Google Leftovers
UA: various
Search me. The AppEngine is happily not as much in evidence as it used to be. I see some GoogleImageProxy and a couple of MSIE 8 visits, with Google IP but giving google.pl as referer. They're not Translate; those come with an X-Forwarded-For header. Pending solid information, I've been blocking any request from a Google IP that contains neither "Google" in the UA nor a Forwarded header.

3. Back in the USSR

Oops, er, I guess it's Russia. (Also Turkey. Yandex seems to be big there too.)

IP: 100.43.91.18, rarely 5.255, 37.140, 100.43.something-else, 178.154
UA: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

Other IPs and UAs seem to have retired. They've been crawling from the identical IP (down to the last digit) since February 2013; I haven't seen their US range 199.21 since last June. This month, more than 98% of their visits were from the same IP.

The YandexImages UA hasn't been around since last fall; YandexFavicons last showed its face in November 2014, though it was always rare. The same YandexBot from the same IP now gets all filetypes. As with Google, image files (but not pdfs) tend to receive a 304.

Seznam

SeznamBot
IP: 77.75.73, 77.75.77
UA: Mozilla/5.0 (compatible; SeznamBot/3.2; +http://fulltext.sblog.cz/)

They seem to have changed UAs pretty exactly a year ago; it used to be
SeznamBot/3.0 (+http://fulltext.sblog.cz/)
Version 3.1 must be experimental, like Apache; I've never seen it.
Behavior: When they request images, it tends to be in large batches all at once. Unlike the Big Three, they don't immediately follow redirects.

Seznam Preview
IP: 77.75.77.123 (worth noting because, although I only saw them once this month, they used the identical IP two years ago)
UA: Mozilla/5.0 (compatible; Seznam screenshot-generator 2.1; +http://fulltext.sblog.cz/screenshot/)

exabot

IP: 178.255.215.77 (exactly: Exabot), 178.255.215.89 (exactly: BiggerBetter)
UA: Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
and
Mozilla/5.0 (compatible; Exabot/3.0 (BiggerBetter); +http://www.exabot.com/go/robot)
These two UAs, each with their own dedicated IP, have operated in tandem for at least two years. I think they're popular in France.

Mail.RU
(their casing, not mine)

IP: 217.69.133
UA: Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)

Behavior: For several years I blocked them from all images due to unsavory behavior, very much like the worst kind of Ukrainian robot. This seems to have ended pretty suddenly in February of 2013, to be replaced with ordinary search-engine-like crawling. This month they didn't request any images at all. Like most search engines except Seznam, they follow redirects pretty promptly.

DuckDuckGo favicons

IP: 107.23.45.196

I list them here only because DDG is a respectable search engine. But this isn't quite respectable behavior; in fact I had no idea it was happening until this month, because normally I ignore 403s unless they're part of a botnet I'm tracking.
107.23.45.196 - - [02/Jan/2015:08:23:28 -0800] "GET / HTTP/1.1" 403 3357 "http://example.com/" "Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +http://duckduckgo.com)"

... a total of 22 times in the course of the month. Unfortunately this got them a double-barreled lockout: first because of the IP (part of an AWS /14) and again because of the auto-referer. Perversely, I would have been happy to let them have the favicon-- same as with Google's faviconbot-- but they never asked.
It looks like they just started doing this in December 2014. I think the favicon is displayed with search results.

But what happened to...?

Whatever happened to Korea's big search engine, Yeti? Did I miss a memo? They disappeared abruptly in August 2013. Other once-familiar faces include:

YioopBot: last seen September 2014
TurnitinBot: sporadic
MJ12 and Gimme60: barely visible


To be continued...
11:29 pm on Feb 9, 2015 (gmt 0)

lucy24 (Senior Member from US)


And now we get to:

The tolerable...

Facebook

IP: 66.220.144-159, 69.171.224-255, 173.252.64-127
UA: facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)
and
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
and
visionutils/0.

The two "externalhit" UAs have been used in alternation for as long as I can remember. I think visionutils is used for whichever image the FB user actually ends up selecting.
Somewhere along the line, FB stopped hotlinking. Whew. Now it looks as if they download the image and host it themselves, so the "facebook" name only shows up in logs if some human actually follows a friend's recommendation. Before, I would see the hotlink every time someone looked at the FB page.


Robots that Pass in the Night

Every month brings a few short-term visitors. So this is a pretty arbitrary list. Robots come, robots go. For some reason, people who use the element "Nutch" seem to be unusually likely to honor robots.txt; that's probably why I've never blocked the name upfront.

Sukibot

IP: 128.214.224.188
UA: Mozilla/5.0 (compatible; sukibot_heritrix/3.1.1 +http://suki.ling.helsinki.fi/eng/webmasters.html)

This is my favorite newcomer. First seen in December 2014, they operate out of a Finnish university and appear to be doing a survey involving minority languages. I consider this a worthy cause, so they are welcome to crawl, even if I may not have exactly the languages they're looking for. (Especially not in the pages of Alonzo and Melissa, an early-19th-century novel that is their current particular interest.) A quirk of this robot is that almost all requests get a 206 response. The exceptions are files which, although they have an .html extension, are secretly rewritten from .php. (It took me a while to figure out that this is the variable.)

CB/Nutch

IP: 188.65.117.abc (probably .128-191)
UA: CB/Nutch-1.7

No idea what they're after, but they ask for robots.txt at the beginning of each visit, and otherwise do nothing to offend.

NDNutch

IP: 208.113.176.68
UA: NDNutch/ Nutch-1.9

Much of a muchness with the above ... except that shortly after month's end, they may have changed their behavior. We shall see.

IA Archiver

IP: 174.129.237.157 (same in 2013), 207.241.226.231
UA: ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)
and
Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)

Different IPs and different UAs, but I'm pretty sure it's the same people. Ask for and appear to obey robots.txt.

Memorybot

IP: 37.16.72.213
UA: Mozilla/5.0 (compatible; memorybot/1.21.14 +http://mignify.com/ bot.html)

They showed up towards the end of December; so far I haven't got around to ignoring them. I think they're along the same lines as TIA, which puts them in YMMV territory. For me it's No Skin Off My Nose. They ask for robots.txt and appear to honor it, though they don't understand the *.xtn locution. This leads to a few lockouts for midi and zip files, but I don't hold it against them.
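The *.xtn failure makes sense if memorybot matches Disallow values the old-fashioned way: the original robots.txt convention treats them as literal path prefixes, and the * wildcard is a later extension honored by the big engines. A sketch of the difference (illustrative rule, not my actual robots.txt):

```python
import fnmatch

def prefix_match(rule: str, path: str) -> bool:
    # Original robots.txt convention: the Disallow value is a literal
    # path prefix; '*' has no special meaning.
    return path.startswith(rule)

def wildcard_match(rule: str, path: str) -> bool:
    # The big-engine extension: '*' matches any run of characters.
    # (fnmatch is close enough for a sketch.)
    return fnmatch.fnmatch(path, rule + "*")

rule = "/*.mid"  # illustrative rule
print(prefix_match(rule, "/music/tune.mid"))    # False: literal matcher fetches anyway
print(wildcard_match(rule, "/music/tune.mid"))  # True: wildcard matcher skips it
```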

Gigablast

IP: 77.66.121.244 (exactly)
UA: GigablastOpenSource/1.0

Here, again, I have no idea what they're up to. But they get robots.txt on each visit, and appear to honor it.

DomainAppender

IP: 54.various
UA: Mozilla/5.0 (compatible; DomainAppender /1.0; +http://www.profound.net/domainappender)

The 54 IP means that all requests except robots.txt will be blocked at the gate. There's certainly nothing to merit poking a hole for them.

AdvBot

IP: 136.243.14.164 (Hetzner)
UA: Mozilla/5.0 (compatible; AdvBot/2.0; +http://advbot.net/bot.html)

I posted about them when they first visited. Although there was nothing inherently dreadful about their visit (lots of pages, at intervals of at least 3 seconds), they've got two inherent strikes against them: They live in a Hetzner range, and the URL in their UA string leads to an authentication page. Nice try, but no go. I don't think they've been back since.

DomainTools SurveyBot

IP: 216.145.14.142, 64.246.165.210
UA: Mozilla/5.0 (Windows; U; Windows NT 5.1; en; rv:1.9.0.13) Gecko/2009073022 Firefox/3.5.2 (.NET CLR 3.5.30729) SurveyBot/2.3 (DomainTools)
Referer for front-page request: http://whois.domaintools.com/example.com

The last time I compiled "At Home with the Robots" records, I had a category I called Site Snoopers. This appears to be one of them.

Miscellaneous Names

Here I lumped together anyone that shows pretensions to robotitude by asking for robots.txt and/or having a name. Those who asked for multiple pages:

AhrefsBot
ips-agent (two UAs: one for robots.txt, another for page)
LSSRocketCrawler
ltx71
meanpathbot
SemrushBot
WebIndex (that is its complete UA string)
XoviBot (a distributed crawler: three widely varying IPs in the course of the month)
Yahoo! Slurp (How are the mighty fallen! Two years ago, this name had a category all to itself. Now it may not even be worth blocking them.)
... the interestingly spelled
Mozzila/5.0 (compatible; Sonic/1.0; http://www.yama.info.waseda.ac.jp/~crawler/info.html)

... and one humanoid from OVH, who won a place on this list by requesting robots.txt

Others include but are not limited to:
GarlikCrawler
MojeekBot
PagesInventory (I think this, too, goes in the Site Snoopers category)
Wotbox
beholderbot (This one fascinates me because it asked for only one, quite obscure, interior page: how and why? I've sent off a belated "please identify your robot" letter.)


... and the intolerable

If, as the song says, There's three ways that robots can go / That's good, bad and mediocre, then these are the "bad".

First the good news
The vast majority of unwelcome robots-- including all the genuinely malign ones-- were met with a resounding 403. Most came from blocked ranges; others got categorical lockouts based on behavior. Although it often feels like an unending game of whack-a-mole, I do seem to be winning overall.

Then the bad news
I must have missed a memo, because I had no idea 14 January was International Robot Day. At least, that's what logs seem to suggest.

Behavior-based lockouts are obviously site-specific. Mine currently include:
-- any POST request other than the Contact page
-- any request for .php (I don't use it in URLs)
-- any request giving "www.example.com" as referer (this site is without-www)
-- any request for /directory/pagename.html giving / as referer (my links don't work that way)
-- any request for / with auto-referer (ditto)
-- any request giving /pagename.html as referer (I have no URLs in this form)
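A few of the referer-based rules above can be sketched as log-filter logic. This is an illustrative reimplementation, not the actual htaccess rules; example.com stands in for the real hostname:

```python
from urllib.parse import urlparse

SITE_HOST = "example.com"  # this site is without-www

def is_suspect(request_path: str, referer: str) -> bool:
    """Flag requests that break some of the referer rules above."""
    if referer in ("", "-"):
        return False
    ref = urlparse(referer)
    # "www.example.com" referer on a without-www site
    if ref.netloc == "www." + SITE_HOST:
        return True
    if ref.netloc != SITE_HOST:
        return False
    root_ref = ref.path in ("", "/")
    # front page requested with auto-referer
    if request_path == "/" and root_ref:
        return True
    # /directory/pagename.html claiming "/" as referer
    if root_ref and request_path.endswith(".html") and "/" in request_path.strip("/"):
        return True
    return False
```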


Baidu

They haven't quite shriveled up and blown away, but we're getting there: At most, half the number of visits as last year.

BaiduSpider
IP: 123.125.71, 180.76.5-6, 220.181.108 (Chinanet: BaiduSpider only)
UA: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
and
Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2

The Firefox UA is only used for requesting robots.txt. Oddly enough, it has only now occurred to me to wonder if the Baiduspider would honor robots.txt. If I excluded it by name, would it stop requesting files? I'm going to check and report back. As it stands, all I can say is that they've never requested anything from a roboted-out directory.
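Meanwhile, Python's stock robots.txt parser can at least show what a compliant Baiduspider ought to do with a name-based exclusion (the rules below are illustrative, not my real file):

```python
import urllib.robotparser

# What a *compliant* crawler should conclude from a name-based exclusion.
# The group name is the token from Baidu's UA string; whether the real
# Baiduspider honors it is exactly the open question.
rules = """\
User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: /boilerplate/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Baiduspider", "/anything.html"))  # False
print(rp.can_fetch("SeznamBot", "/anything.html"))    # True
print(rp.can_fetch("SeznamBot", "/boilerplate/x"))    # False
```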

Vanished without a trace

A number of formerly active Chinese robots have simply vanished. They include:

SosoSpider, last seen March 2013
JikeSpider, last seen May 2013
360Spider, now very rare-- monthly or less through 2014

Trendmicro

IP: 150.70 (Japan)
UA: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)

I guess they'll never completely go away. Supposedly they've got something to do with virus checking-- but their visits seem to bear no relation to any human activity.


Infected Browsers

IP: various human IPs, all from Russia

This is a big category currently. They come in with the referer
http://yandex.ru/yandsearch?text=example.com&lr=213
or
http://yandex.ru/yandsearch?text={some-plausible-query-here}&lr=213

Someone hereabouts found a list of yandex regions; 213 is Moscow.
Details of behavior point to infected machines rather than unusual human searches. One item is that they never request the favicon. Those that are not blocked by IP get sent to a redirect page, with the option of continuing to the originally requested URL. None has ever done so.
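The referer pattern is regular enough to match mechanically. A sketch, meant as one signal among several (the missing-favicon detail still matters), not a verdict on its own:

```python
import re

# The shared referer: a yandex.ru search URL ending in a region code,
# 213 (Moscow) in every case seen here.
YANDSEARCH = re.compile(r"^https?://yandex\.ru/yandsearch\?text=[^&]*&lr=(\d+)$")

def infected_browser_referer(referer: str) -> bool:
    m = YANDSEARCH.match(referer)
    return bool(m) and m.group(1) == "213"

print(infected_browser_referer(
    "http://yandex.ru/yandsearch?text=example.com&lr=213"))  # True
print(infected_browser_referer(
    "http://yandex.ru/yandsearch?text=widgets&lr=2"))        # False
```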


For the Profilers

Currently there are five patterns. Three are ongoing botnets that help me identify new ranges-to-block. The last two would theoretically do the same thing-- except that all requests come from IPs that have already been blocked. Hurrah!
All UAs are human or humanoid. In a few cases they may actually be infected browsers sending their real UA, but who cares.

index.php botnet
Current pattern: 11-request visit, ending with a set of six, alternating "/index.php" and "/" (root) alone, all giving /index.php as referer.

Contact botnet
Current pattern: any random page giving / as referer, followed immediately by contact page giving previously requested page as referer.
This botnet showed up almost immediately after I created the "contact.html" page. It's in a subdirectory; they didn't simply guess.

nyet.gif botnet
Current pattern (quoting one at random):
aa.bb.cc.dd - - [14/Jan/2015:16:37:47 -0800] "GET //index.php HTTP/1.1" 403 3301 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.31 (KHTML, like Gecko)" 
aa.bb.cc.dd - - [14/Jan/2015:16:37:47 -0800] "PUT /nyet.gif HTTP/1.1" 405 461 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de-LI; rv:1.9.0.16) Gecko/2009120208 Firefox/3.0.16 (.NET CLR 3.5.30729)"
aa.bb.cc.dd - - [14/Jan/2015:16:37:47 -0800] "GET /nyet.gif HTTP/1.1" 404 2528 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)"

Each part of the request has its own UA and they are always the same. The two "nyet.gif" parts have existed since June and July of 2014; the //index.php part is a more recent arrival.
This may be the clearest example of a robot putting out feelers. Any decent site will return a 403 or 405 to the PUT, followed by 404 for the GET. They're looking for sites that return 200 for both.
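Since a PUT request has no legitimate business on a read-only site, the feeler is easy to flag from logs alone. A minimal sketch over combined-format lines (the regex keeps only the fields this check needs):

```python
import re

# Just the combined-log-format fields this check needs:
# IP, method, path, status.
LOG = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3})')

def put_feelers(lines):
    """IPs that issued a PUT -- on a read-only site, always a probe."""
    flagged = set()
    for line in lines:
        m = LOG.match(line)
        if m and m.group(2) == "PUT":
            flagged.add(m.group(1))
    return flagged

sample = [
    '1.2.3.4 - - [14/Jan/2015:16:37:47 -0800] "PUT /nyet.gif HTTP/1.1" 405 461 "-" "Mozilla/5.0"',
    '5.6.7.8 - - [14/Jan/2015:16:37:47 -0800] "GET /nyet.gif HTTP/1.1" 404 2528 "-" "Mozilla/5.0"',
]
print(put_feelers(sample))  # {'1.2.3.4'}
```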

one-plus-three
Pattern: random page with auto-referer; / (root) with first page as referer; two further / with auto-referer
This behavior has been going on for years, but I only just noticed the pattern. It seems to be the exclusive province of robots from established server farms.

pairs
Pattern: random inner page with / as referer; / (root) with http://example.com without final slash as referer. Both of these behaviors would ensure a lockout even if I didn't know the IP.


The Worst of the Worst

Finally there are the truly malign robots: the ones who ask for files in /wp-admin/ or /fckeditor/, or who try to POST to pages that don't allow posting. That's "try" and "ask for", not to be confused with "succeed" and "get". So long as they stay safely lodged in Hetzner and OVH ranges, they will remain out of sight and out of mind.


And the winner is...

This year's prize goes to the Ukrainian who made repeated requests for assorted subfiles of //wp-admin/ [sic] ... interspersed with two requests for //robots.txt. Were they hoping to learn the names of my roboted-out directories? If so, they must not have liked what they found; there were no requests for /boilerplate/ or the like.
2:36 am on Feb 10, 2015 (gmt 0)

Senior Member


-- any request giving "www.example.com" as referer (this site is without-www)

I have my setup backwards: there is no non-www version, yet it gets sent as a referer.

My personal favorites are the ones that send a referer like MYDOMAIN-IN-UPPERCASE.tld, or that request the root with the root itself as referer -- that's where I catch them, since I don't link to the home page from the home page.

Big thanks for the write-up; this is pretty much what I went through last year (and ongoing).

BTW, did I say Happy New Year? :)
2:38 am on Feb 10, 2015 (gmt 0)

keyplyr (Senior Member from US)


Thanks, good post.

The Good...

I agree


The tolerable... and the intolerable

Except for Facebot, which brings me thousands of visitors/sales, I don't "tolerate" any of the bots listed. I block them all, either by range or UA.

Important to note that everyone has a different mix for this :)
4:31 am on Feb 10, 2015 (gmt 0)

lucy24 (Senior Member from US)


either by range or UA.

Humanoid UAs seem to be the big thing currently. Maybe they've all noticed that calling yourself "Python" or "libcurl" will not get you far. The two that were popular enough to get lists of their own are:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0

Nope, no idea what was so attractive about FF 12 precisely. There's a similar MSIE6 UA that's so egregious, my host blocks them with mod_security.

Main reason I don't block this kind of thing outright is that most already live at blocked IPs-- and most of the rest will never get past the old-browsers page. The same goes for the ones giving EXAMPLE.COM as referer.

Mentioning Firefox reminds me that Mail.RU also has a faviconbot. Well, I assume that's what it is for; I very nearly overlooked it. It shows up in plain clothes, currently FF 18, which gets it an automatic lockout (legacy of past years' misbehavior). But it still asks for-- and receives-- the favicon.
5:14 am on Feb 10, 2015 (gmt 0)

keyplyr (Senior Member from US)


Yup, I agree. For me, many "humanoid" intruders often go unnoticed unless they do something strange and/or trip a filter and get blocked.

I usually keep about 20 individual IPs temporarily blocked. Most often these are compromised ISP accounts from Brazil, Ukraine or Russia.

I keep 'em blocked a couple of weeks, then give 'em another chance. If they repeat their mischief I block them semi-permanently (until I do quarterly cleaning, when all the individual IP addresses get removed).

Lately I've had good luck catching bad agents in the error logs. Bots seem to make more errors than people do.

But yeah, bad bots that use bot names and come from company ranges are the easy ones.
1:06 pm on Feb 10, 2015 (gmt 0)

AlexB77 (Full Member)


Hey Guys,

Thanks lucy24 for your nice and very informative post.

I have a question: is there any resource on the web where I can learn how to create a trap for blocking all these IPs and IP ranges? Or for checking whether they're good or bad by first sending them to an authentication page -- with a button to click to proceed, perhaps -- and then logging them all in my DB for further analysis of their behaviour? Going through my logs recently, I discovered a bunch of robots, or robot-like visitors (humanoids, if you will), that seemed to be behaving unusually. I started blocking them manually via .htaccess, but it is a very time-consuming process and has to be repeated pretty much daily. Most of the IPs are Russian, Ukrainian, Chinese or Malaysian. I even posted about this on WebmasterWorld earlier this month, but to be honest I never managed to figure it all out. My site has been around for over 11 years and gets something like 50 to 60k daily views; it would be a shame to let the robots win the game after all these years.
2:33 pm on Feb 10, 2015 (gmt 0)

wilderness (Senior Member)


I have a question, is there any recourse on the web, were I can learn how to create a trap for blocking all this IPs and IP ranges


Updated PHP Bad Bot Script 2004 [webmasterworld.com]
Ban malicious visitors with this Perl Script 2002 [webmasterworld.com]
6:32 pm on Feb 10, 2015 (gmt 0)

AlexB77 (Full Member)


Thanks @wilderness.

Just went to look at the situation and already found several that are going to go onto my blacklist.
4:19 pm on Feb 11, 2015 (gmt 0)

ergophobe (Senior Member)


AlexB77

A few other approaches if you don't know about them...

- Reverse Proxy services like Cloudflare or Yotaa or others.
- Bad Behavior - [bad-behavior.ioerror.us...]
- Mod Security - [modsecurity.org...]
11:08 pm on Feb 11, 2015 (gmt 0)

lucy24 (Senior Member from US)


Mod Security

Be cautioned, though, that the current release can only be used in the server config (the earliest version could also be used in .htaccess). Some hosts offer it as an optional extra.

:: reeling from the discovery that DotBot, last seen around 2011, has been back since last April on one of my sites ::
8:27 am on Feb 12, 2015 (gmt 0)

tangor (Senior Member from US)


Kudos, lucy24, nice overview. Now if you can just....

Wait. The bots will return. Just ask John Connor... (Terminator franchise), and these bots seem to be part of the problem!