DuckDuckBot first visits!

Forum Moderators: bakedjake



Dimitri

10:30 am on Mar 2, 2020 (gmt 0)

Yesterday, for the first time in (my) history, I had visits from DuckDuckBot. I do not mean the favicon bot, but THE DuckDuckBot... 54 visits in one hour! I was starting to wonder if their bot really existed ... or maybe it was a glitch.

Now, I can't tell if this is crawling/indexing, or just some link check or other test. Officially, DuckDuckGo is a mix of data from third-party indexes and, supposedly, their own index too.

PS: DuckDuckBot verified by UA, IP range and reverse DNS.
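For anyone wanting to repeat that three-way check, here is a minimal sketch in Python. The IP range and hostname suffix are placeholders from the documentation-only ranges, not DuckDuckGo's real values, and `verify_crawler` and `reverse_lookup` are hypothetical helper names. The real addresses should come from the crawler's own help page.

```python
import ipaddress
import socket

# Hypothetical placeholders: a documentation-only range and hostname suffix,
# NOT DuckDuckGo's real values. Take the real ones from the crawler's help page.
PUBLISHED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]
UA_TOKEN = "DuckDuckBot"
EXPECTED_HOST_SUFFIX = ".example.com"


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def verify_crawler(user_agent, ip, lookup=reverse_lookup):
    """Report the three checks separately so each can be weighed on its own."""
    addr = ipaddress.ip_address(ip)
    host = lookup(ip)
    return {
        "ua": UA_TOKEN.lower() in user_agent.lower(),
        "ip_range": any(addr in net for net in PUBLISHED_RANGES),
        "reverse": host is not None and host.endswith(EXPECTED_HOST_SUFFIX),
    }
```

The `lookup` argument is only there so the DNS step can be swapped out when testing offline. The UA check is the weakest of the three, since anyone can send any user-agent string.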

lammert

2:27 pm on Mar 2, 2020 (gmt 0)

DuckDuckBot is a regular visitor on my sites. It could be that some webmasters will never see this bot because AFAIK, it uses exclusively Amazon EC2 servers which a number of webmasters block by default. The bot is sometimes a little bit hungry, requesting the same page several times per minute.

Dimitri

3:03 pm on Mar 2, 2020 (gmt 0)

it uses exclusively Amazon EC2 servers

Indeed: [help.duckduckgo.com...]

a number of webmasters block by default.

I am testing "known" acceptable requests, before denying unknown requests.

There is certainly a significant number of requests from Amazon EC2/AWS IP ranges. There is a bit of everything: some are obviously scrapers, but others still puzzle me. They might be apps of some kind doing something, but since I can't tell, the doors remain closed.

And there is at least one person who keeps trying to download images using a Go library (Go-http-client/1.1). If I let it in, it would make hundreds of requests per minute.
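The ordering described here (test known-good requests first, then deny the rest) might look roughly like the sketch below. All ranges are documentation-only placeholders, not real crawler or AWS ranges, and `decide` is a hypothetical name. It is only meant to show why the allow test has to come before the broad deny.

```python
import ipaddress

# Hypothetical values for illustration only. Real data would come from the
# crawlers' published IP lists and the cloud provider's published ranges.
KNOWN_GOOD = [ipaddress.ip_network("192.0.2.0/28")]
CLOUD_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]
DENIED_UA_PREFIXES = ("Go-http-client/",)


def decide(ip, user_agent):
    """Return 'allow' or 'deny'; known-good addresses are tested first."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in KNOWN_GOOD):
        return "allow"  # verified crawler inside an otherwise blocked range
    if user_agent.startswith(DENIED_UA_PREFIXES):
        return "deny"   # unwanted client library, wherever it comes from
    if any(addr in net for net in CLOUD_RANGES):
        return "deny"   # unknown traffic from a cloud range stays out
    return "allow"
```

Because the known-good range sits inside the larger cloud range, swapping the first and third tests would lock the verified crawler out along with everything else.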

lucy24

9:59 pm on Mar 2, 2020 (gmt 0)

It could be that some webmasters will never see this bot because AFAIK, it uses exclusively Amazon EC2 servers which a number of webmasters block by default.

Or could it be because their claims about robots.txt compliance are a barefaced lie?

:: detour to raw logs, cross-checked against listed IPs ::

They have a strikingly bizarre behavior which I'd forgotten about until I re-checked logs: visits start with a request for robots.txt with a referer-spam-type referer--generally some utterly random site, though once I found a Yandex search (not one that would lead to anything on my site, let alone to robots.txt) in the referer slot. This gets them the minimalist Disallow-everyone-everywhere

User-Agent: *
Disallow: /

... which they proceed to ignore. A further quirk is that they then, just like a human, get all the supporting files associated with the 403 page.
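For reference, this is what a compliant parser makes of that two-line file. It is a small sketch using Python's standard `urllib.robotparser`. `may_fetch` is just a hypothetical wrapper, and the user-agent names passed to it are arbitrary.

```python
from urllib.robotparser import RobotFileParser

# The same minimalist rules quoted above.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent, path):
    """True if the parsed rules permit user_agent to request path."""
    return parser.can_fetch(user_agent, path)
```

With these rules, `may_fetch` returns False for any user-agent and any path, so a crawler that honors the file has nothing left to request.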

Frankly I'd always assumed they were all fakers, since they sure don't act like a legitimate search-engine spider.

notriddle

11:41 pm on Mar 2, 2020 (gmt 0)

That's the fundamental problem with running your bot out of EC2. There's basically no way for anyone to tell if it's really your bot or if it's a faker.

lucy24

11:45 pm on Mar 2, 2020 (gmt 0)

There's basically no way for anyone to tell if it's really your bot or if it's a faker.

When they're coming from the down-to-the-last-digit IPs listed on their own page, you kinda have to assume it's the real thing. Unless they've got offspring sneaking in after hours to play with the robot when nobody else is using it?

tangor

1:43 am on Mar 3, 2020 (gmt 0)

DDG has been coming around for quite some time for me ... and because it does respect robots.txt I just keep an eye on it.