
Non-crawling bots

Where have they gone?


dstiles

9:47 am on Sep 27, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have about a dozen bots allowed to crawl my sites: applebot, bing, duckduckgo, exabot, facebook, google, istellabot (tiscali), mojeek, qwantify, seznam, twitterbot, yandex, yeti/linespider.

Most have IP ranges associated with them to weed out pretend-bots. Most are visiting regularly, with the exception of the following, which I haven't seen for quite some time. Does anyone have experience of recent visits from these bots? Any idea why they've stopped visiting?

exabot (France) - exalead is still going but haven't seen exabot for a long time. A crawler info site gives last seen as this month. Permitted IP ranges: 178.255.215.0/24, 193.47.80.0/24. DNS shows crawlers within those ranges.

istellabot (tiscali) (Italy) - a crawler info site gives last seen as May 2020. Permitted IP ranges: 217.73.208.0/24. DNS still shows crawlers within that range but I can find no search engine for tiscali. I'm guessing the SE folded.

qwantify - a crawler info site gives last seen as this month. Permitted IP ranges: I had old IPs for this; now amended to 194.187.171.0/24. Not blocked by iptables, so unsure why my "unwanted bots" log does not show it. A visit to their SE shows only: "Qwantify is not currently accepting new clients. We are exclusively focused on building our existing clients' businesses." Not interested in them any more.
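The per-bot IP-range checks described above can be sketched in a few lines. This is a minimal illustration, not anyone's actual blocklist code: the CIDR ranges are the ones quoted in this post, and the forward-confirmed reverse DNS (FCrDNS) helper is the standard technique for unmasking pretend-bots, with the hostname suffixes as assumptions.

```python
import ipaddress
import socket

# CIDR ranges quoted above (two of the bots, for illustration).
PERMITTED_RANGES = {
    "exabot": ["178.255.215.0/24", "193.47.80.0/24"],
    "istellabot": ["217.73.208.0/24"],
}

def ip_in_permitted_range(bot_name, ip):
    """True if `ip` falls inside one of the bot's permitted CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr)
               for cidr in PERMITTED_RANGES.get(bot_name, []))

def forward_confirmed_rdns(ip, expected_suffix):
    """Reverse-resolve `ip`, check the hostname suffix, then forward-resolve
    the hostname and confirm it maps back to the same IP (FCrDNS)."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not host.endswith(expected_suffix):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

A visitor claiming to be exabot from outside 178.255.215.0/24 or 193.47.80.0/24 would fail `ip_in_permitted_range` and could be treated as a pretender.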

And... Are there any other bots I should be permitting? Country-based, ethical, non-SEO, duckduck-ish?

brotherhood of LAN

12:50 pm on Sep 27, 2021 (gmt 0)

>And... Are there any other bots I should be permitting? Country-based, ethical, non-SEO, duckduck-ish?

You might want to look at including Brave. I think they're crawling from Amazon IPs; not sure how much information about their bot is in the wild, but in the main it looks like they're intending to build their own search index.

dstiles

4:23 pm on Sep 27, 2021 (gmt 0)

I did consider brave but could find no actual crawl information. They were said to be using a database that closed a while back - cliqz, was it?

I thought one of their main "crawl" methods was to check out hits from the Brave browser and use that to seed the crawl.

lucy24

5:05 pm on Sep 27, 2021 (gmt 0)

Heh. A couple of years back, I added a “former robots” page to the online version of At Home with the Robots, because there were that many of them. At last count there were 57 names on the list (all categories, both good and bad). And sometimes a robot is gone for a year or more, only to reappear as if nothing had happened.

lucy24

11:46 pm on Sep 28, 2021 (gmt 0)

Incidentally...

I see Qwantify/1.0 every few days from 194.187.171, always requesting root + favicon, both blocked. (I had to check this. I rarely bother to block the favicon, but bad_agent is one of the exceptions.) They got shifted from authorize-and-ignore a few years back when I happened to notice misbehavior, such as requesting files in roboted-out directories.

I also see Qwantify/2.4 from 91.242.162 (quick check confirms that the range belongs to Qwant), requesting only robots.txt ... in which they are Disallowed.
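The robots.txt behavior described here can be checked with the standard library's `urllib.robotparser`. The robots.txt below is hypothetical, mirroring the situation described (Qwantify Disallowed, everyone else allowed); a compliant Qwantify/2.4 would fetch robots.txt, see the Disallow, and stop there.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt matching the setup described above.
ROBOTS_TXT = """\
User-agent: Qwantify
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Qwantify/2.4", "/"))   # → False (Disallowed)
print(rp.can_fetch("Googlebot", "/"))      # → True
```

`can_fetch` matches the token before the `/` in the user-agent string against the `User-agent` lines, so `Qwantify/2.4` falls under the `Qwantify` record.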

dstiles

9:38 am on Sep 29, 2021 (gmt 0)

Thanks, Lucy. I had the V1 IPs enabled, not the V2 ones. Almost no bots are blocked in my robots.txt (they are either good or take no notice of it), so mainly I block previews and a few directories. But as I said, I don't like Qwantify's policy, so I've disabled them now.

tangor

6:24 am on Oct 1, 2021 (gmt 0)

I allow all bots to view robots.txt ... but in that file I only allow a few. Any that ignore it are eventually given a 403 (after I investigate their value, or more likely NON-value).
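An allowlist-style robots.txt of the kind described here might look like the following. This is a sketch, not anyone's actual file: the named crawlers are illustrative, and an empty `Disallow:` means "nothing disallowed" for those agents while the `*` record shuts out everyone else.

```
# Allow a few named crawlers everywhere.
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
Disallow:

# Everyone else: disallowed everywhere.
User-agent: *
Disallow: /
```

Well-behaved bots honour the `*` record and leave; anything that keeps crawling anyway has self-identified for the 403 treatment.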