Are We Ready for the Return of Cliqzbot?

         

not2easy

7:05 am on Apr 11, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



After moving to a new host over the weekend I was checking through logs today to make sure things went as smoothly as they appeared on the surface. Found a new (old) visitor I had not seen in years: Cliqzbot. And it's back with a vengeance. (Mozilla/5.0 (compatible; Cliqzbot/1.0; +http://cliqz.com/company/cliqzbot))
robots.txt? Yes yes yes yes yes yes yes (that's all they got).

They were last seen a few years ago: [webmasterworld.com...] as Cliqzbot/0.1, but this is a newer version, Cliqzbot/1.0.
Coming from all over the place now, mostly Amazon servers. One outlier is from a 41.13.nn.nn AFRINIC range that was already blocked on the domain where I found the rest. They are all blocked due to their address, but here's the rest of the list - this is from the 5th to the 8th of April:
AZ 34.192.0.0/10 (34.207.nn.nn)
41.13.nn.nn (AFRINIC)
AZ 52.64.0.0/11 (52.90.nn.nn, 52.91.nn.nn - 2 different 52.91. IPs)
AZ 52.192.0.0/11 (52.202.nn.nn - 3 different IPs)
AZ 54.80.0.0/12 (54.87.nn.nn)
AZ 54.144.0.0/12 (54.144.nn.nn, 54.146.nn.nn - 2 different 54.146. IPs)
AZ 54.160.0.0/11 (54.162.nn.nn, 54.173.nn.nn, 54.175.nn.nn)
AZ 54.224.0.0/12 (54.236.nn.nn - 3 different IPs)
AZ 107.20.0.0/14 (107.21.nn.nn)
AZ 184.72.0.0/15 (184.73.nn.nn)

(I've only checked this one domain so far)
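
For anyone who wants to check their own logs against the ranges listed above, here is a minimal Python sketch using the standard `ipaddress` module. The CIDR blocks are copied from the list; the function name `matching_range` and the full sample addresses in the usage note are made up for illustration (the log entries above are masked as nn.nn).

```python
import ipaddress

# CIDR blocks copied from the list above (the AFRINIC outlier has no
# range given, so it is left out).
CLIQZBOT_SEEN_RANGES = [
    "34.192.0.0/10",
    "52.64.0.0/11",
    "52.192.0.0/11",
    "54.80.0.0/12",
    "54.144.0.0/12",
    "54.160.0.0/11",
    "54.224.0.0/12",
    "107.20.0.0/14",
    "184.72.0.0/15",
]
NETWORKS = [ipaddress.ip_network(cidr) for cidr in CLIQZBOT_SEEN_RANGES]


def matching_range(ip):
    """Return the first listed CIDR block containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for net in NETWORKS:
        if addr in net:
            return str(net)
    return None
```

Calling `matching_range("54.87.0.1")` (a hypothetical address) returns `"54.80.0.0/12"`, while an address outside every listed block returns `None`.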

keyplyr

9:54 am on Apr 11, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I see it every couple of weeks at one or more sites I look after, all from AWS ranges.

Easily blocked. However AFAIK it obeys robots.txt.

But the question is, do you want to block it? It contributes to several indexes in the EU and elsewhere. I allow an exception for Cliqzbot in my AWS block rules.

TorontoBoy

12:07 pm on Apr 11, 2017 (gmt 0)

5+ Year Member Top Contributors Of The Month



Cliqzbot visits me daily and has for years. With few exceptions I ban all of AWS. There are simply too many rogue bots from AWS.

not2easy

3:21 pm on Apr 11, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If it were coming from its own address I might consider letting it in. But with a different address each time, it's not getting in, sorry. I did some basic checking before sorting through them. It looks like maybe a distributed project like the nerdybot?
Registrar DNS: 205.251.nn.nn (x8)
CLIQZ.COM IP Address 35.187.22.222
Host cliqz.com
City Mountain View, CA 94043
Organization Merit Network
ISP Merit Network
AS Number AS15169 Google Inc.

lucy24

8:25 pm on Apr 11, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Return? When was it ever gone?

I checked my ruleset and found that I don't currently poke a hole for it--because I don't need to. It's robots.txt compliant with humanoid headers, so it's good to go. (It's possible that at one time I denied it by name just to verify that it would comply.) If it ever became non-compliant, relief is at hand, since it's got a name.
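
A by-name denial of the kind mentioned above can be tried out offline before trusting a robot with it. This is only a sketch: the robots.txt content below is an assumed example, not anyone's actual file, and `allowed` is a hypothetical helper. It uses Python's standard `urllib.robotparser`, which matches on the robot's name token rather than the full User-Agent string.

```python
from urllib.robotparser import RobotFileParser

# Assumed example robots.txt that denies Cliqzbot by name and says
# nothing about any other robot.
ROBOTS_TXT = """\
User-agent: Cliqzbot
Disallow: /
"""


def allowed(agent_token, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits agent_token to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)
```

With this file, `allowed("Cliqzbot", "http://example.com/page.html")` is `False` and any robot not named in the file is allowed. Whether the real robot then stays away is something only the access logs can confirm.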

If it were coming from its own address
There are lots of distributed crawlers out there, not all of them undesirable. In my specific case, the fallback is always
BrowserMatch YourName !bad_range
meaning that the robot in question comes from one of the very few IP ranges I deny by default.
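
For readers who don't use Apache, the logic behind that line can be sketched in Python: an address in a denied range sets a flag, and a matching robot name clears it again. Everything here is an assumption for illustration: the denied range, the allowed name list, and the function name `is_denied` are hypothetical, and a plain case-sensitive substring test stands in for the regular-expression match that `BrowserMatch` performs.

```python
import ipaddress

# Hypothetical stand-in for "one of the very few IP ranges I deny by default".
DENIED_RANGES = [ipaddress.ip_network("54.144.0.0/12")]
# Hypothetical list of robot names that get a hole poked for them.
ALLOWED_NAMES = ["Cliqzbot"]


def is_denied(ip, user_agent):
    """Return True if the request should be refused."""
    addr = ipaddress.ip_address(ip)
    # Step 1: flag the request if it comes from a denied range.
    bad_range = any(addr in net for net in DENIED_RANGES)
    # Step 2: like "BrowserMatch YourName !bad_range", a matching
    # User-Agent clears the flag again.
    if any(name in user_agent for name in ALLOWED_NAMES):
        bad_range = False
    return bad_range
```

The weak point is the same as in the Apache version: the exception trusts the User-Agent string, so anything that spoofs the name gets through the hole.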

afaik, UA spoofers can be counted on your thumbs: If they're Chinese they claim to be from Baidu; if they're from anywhere else they claim to be the Googlebot. OK, maybe we'll need a finger; I used to be plagued by fake Yahoo! Slurp.

keyplyr

9:35 pm on Apr 11, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If it were coming from its own address I might consider letting it in. But with a different address each time, it's not getting in, sorry
Cliqzbot uses cloud computing with multiple nodes at AWS. The ranges may change; there's nothing abnormal about this. It's the way cloud computing works. There's also nothing abnormal about cliqz.com having a different home than the one Cliqzbot crawls from. Most of them do this.

There are lots of distributed crawlers out there
AFAIK Cliqzbot is not distributed.

TorontoBoy

9:51 pm on Apr 11, 2017 (gmt 0)

5+ Year Member Top Contributors Of The Month



Yes, they are compliant and well behaved, but what do they do for me? I see no compelling case. There are many such secondary search engines that I've not heard much about.

http://cliqz.com/company/cliqzbot

keyplyr

10:19 pm on Apr 11, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



but what do they do for me?
Sometimes we have to look beyond the initial appearance of things. Data gets supplied to many end sources, which may in turn develop products used for company intranet firewalls, web security, marketing info for AdSense advertisers, directory dumps used by smaller search upstarts, etc.

Aside: Years ago I would block any bot from a marketing company. I thought they were profiting from my content without paying me, and I objected to that. Later I learned that many of these companies were actually doing all the heavy lifting for me, getting my pages to the advertisers. I started allowing these bots and saw my ad revenue increase.