

DotBot

         

lucy24

9:24 pm on Aug 5, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A rather fuzzy question, but ...

All robots benefit somebody. Who benefits from DotBot? Asking seriously, not rhetorically. (I generally avoid the position of “If it doesn’t personally benefit me, I don’t want it.” After all, I don’t personally benefit from the street in front of your house, but that’s no reason not to maintain it.)

I keep a very close eye on redirects--currently mostly due to HTTP>>HTTPS from a move made in late 2019. Every few months there is a fresh flurry of DotBot requests, all redirected, on top of their usual sporadic visits. What infuriates me no end is that those redirects include pages that have never existed as HTTP, so if they claim to be following links they are lying through their teeth. They are perfectly capable of using HTTPS (I've met a handful of law-abiding robots that aren't); they just prefer HTTP, even if it means maintaining a ratio of about fifteen 301s to every one 200.

I am currently leaning toward the option of simply blocking all HTTP requests from DotBot. Wonder how they would react?

tangor

8:56 pm on Aug 6, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Been blocking DotBot for years. Doesn't stop them, they keep coming back. However, they only get 228 bytes in response.
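
A blanket UA block of that sort might look something like this in Apache .htaccess (a hypothetical sketch, not tangor's actual config; if so, the ~228 bytes would be the server's default 403 error page):

```apache
# Hypothetical sketch: refuse every request whose User-Agent
# contains "DotBot". [F] returns 403 Forbidden, so the small
# default error body is all the bot ever gets.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} DotBot [NC]
RewriteRule ^ - [F]
```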

blend27

6:21 pm on Aug 9, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Lucy
-- they just prefer HTTP

...don't push it; it's low-hanging fruit when it comes to a 'gotcha' moment...

I guess it is a poorly written part of the bot that does what it does... gets the IP banned for a while.

On the subject: I serve 0 bytes to the DotBot UA string. Just an abort at the server level.

lucy24

7:57 pm on Aug 9, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Last time I ran my logs (yesterday), I found that DotBot accounted for well over half of the past month’s redirects, topping even bing. At that point I said To ### with it and added RewriteRules to three sites' htaccess: If it is a page request from DotBot (UA, no particular IP) and not https, off it goes to 403-land.

Hmph.

For the time being, they can continue getting redirects on image requests, because I ignore those anyway, and on requests missing the directory slash, because there aren’t many. (How very, very subjective all this is...)
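
In .htaccess terms, a rule like that might be sketched as follows (an assumption about the actual implementation: the UA test, the HTTPS check, and the 403 are from the post; the page-request pattern is a guess):

```apache
# Hypothetical sketch: 403 plain-HTTP page requests from DotBot
# instead of redirecting them. "\.html$" stands in for "page
# request"; images and slashless URLs fall through to the
# existing redirects.
RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteCond %{HTTP_USER_AGENT} DotBot [NC]
RewriteRule \.html$ - [F]
```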

martinibuster

3:03 am on Aug 10, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



DotBot accounted for well over half of the past month’s redirects, topping even bing.


That's weird. Is your site popular with forums or blogs and maybe they're linking to your pages with insecure URLs?

If it was just DotBot I'd say maybe it's a dumb crawler. But if Bingbot is hitting insecure URLs then maybe something's out there or in the site itself that's converting URLs to http.

Have you crawled your site with MissingPadlock [missingpadlock.com] to see if it spots references to http URLs?

dstiles

10:36 am on Aug 10, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I get http requests from bing as well, especially for robots.txt. A very antiquated bot, not clever at all. Still using TLSv1.2, as well.

lucy24

5:48 pm on Aug 10, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is your site popular with forums or blogs and maybe they're linking to your pages with insecure URLs?
Not that I know of, and definitely not on this scale. Especially since, if there were incorrect HTTP links out there, other search engines would be following them too. Other than search engines, most links to the pages I'm especially interested in come from a curated directory, whose listings are correct.

I have a few specific pages where I do see humans getting redirected, but those are high-interest pages that would have been bookmarked before the site went HTTPS. (Psst! Browser developers! Wouldn't it be clever to update bookmarks automatically when you see a protocol redirect?)

If it was just DotBot I'd say maybe it's a dumb crawler.
Oh, it’s definitely a dumb crawler. I would like to know how they find out about deep-interior URLs, like /directory/subdir/pagename.html, when they have never seen /directory/subdir/, which is the only way to get there. (I just checked. This happens far too often to be accounted for by a few random incorrect links.) Do they get their shopping list from someone else, like scraping a search engine's full listings?

In the case of bing, being slow on the uptake is an established behavior: more than any other search engine, they sporadically continue asking for URLs that ceased to exist years and years ago. So it's not just http but whole URLs. (I much prefer the behavior of Yandex: as soon as they discover that a site has gone HTTPS, all requests everywhere are strictly HTTPS, except for sporadic checks of / root to make sure it's still getting redirected from HTTP.)

All my internal links are with leading / and as each site moved to https I hand-checked all other sites to ensure there were no leftover http://example.com links.

especially for robots.txt
I don't know how many sites do this, but I exempt robots.txt from all canonicalization redirects. Some robots seem to get confused if a robots.txt request is redirected, and you don't want to give them any excuse for noncompliance.
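
One way to express that carve-out in .htaccess (a sketch under assumptions; only the robots.txt exemption itself is from the post, the redirect details are illustrative):

```apache
# Hypothetical sketch: canonicalize everything to HTTPS except
# robots.txt, which is always served directly with a 200 so
# crawlers never see it behind a redirect.
RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```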

dstiles

9:26 am on Aug 11, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> URLs that ceased to exist years and years ago

I've had some recently that haven't existed since before G was in nappies - or diapers if you're not English. :) And then only existed for less than a year.

Jonesy

6:22 pm on Aug 14, 2021 (gmt 0)

10+ Year Member Top Contributors Of The Month



Just had it visit today. It was attempting to look in a directory that's been gone for 17 years.
It's been blocked for years. They've changed the version number. It's blocked again.