the DotBot returns - Crawler, Spider, and User Agent ID forum at WebmasterWorld - WebmasterWorld

the DotBot returns

lucy24

6:41 pm on May 31, 2016 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Anyone else seen them? Starting abruptly two days ago, they've been coming by every hour or two-- 26 requests to date-- asking for only robots.txt, and always from the correct form of the hostname. So far, no sign of them anywhere but my personal site, which has been around forever.

IP: 208.115.111.72, 208.115.113.88 (those two exactly)
UA: Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)
Page: /robots.txt

Search through raw logs reveals that they did the same thing in January-- 24 robots.txt requests in two days, from the identical two IPs, before disappearing as suddenly as they'd appeared. Earlier still, I find sporadic robots.txt requests, most recently last October, but only one or two a day.
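For anyone who wants to run the same kind of raw-log search, here is a minimal sketch in Python. It assumes the server writes Apache-style Combined Log Format. The function name `count_bot_requests` and the sample log lines are invented for illustration (only the two IPs and the UA string come from this post), so adjust the pattern to your own log layout.

```python
import re
from collections import Counter

# Assumption: Combined Log Format, e.g.
# IP - - [day:time zone] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)

DOTBOT_UA = ("Mozilla/5.0 (compatible; DotBot/1.1; "
             "http://www.opensiteexplorer.org/dotbot, help@moz.com)")

def count_bot_requests(lines, ua_token="DotBot", path="/robots.txt"):
    """Count requests per (day, ip) where the UA contains ua_token
    and the requested path equals `path`."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and ua_token in m.group("ua") and m.group("path") == path:
            counts[(m.group("day"), m.group("ip"))] += 1
    return counts

# Made-up sample lines, for illustration only (not real log data).
SAMPLE = [
    f'208.115.111.72 - - [29/May/2016:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "{DOTBOT_UA}"',
    f'208.115.113.88 - - [29/May/2016:12:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "{DOTBOT_UA}"',
    f'208.115.111.72 - - [30/May/2016:09:30:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "{DOTBOT_UA}"',
    '192.0.2.10 - - [30/May/2016:09:31:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "SomeOtherBot/1.0"',
]

if __name__ == "__main__":
    for (day, ip), n in sorted(count_bot_requests(SAMPLE).items()):
        print(day, ip, n)
```

Passing `open("access.log")` in place of `SAMPLE` should work the same way, since the function only iterates over lines.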

Can't think what they want, unless they're simply testing for server accessibility, in which case a robots.txt request should serve as well as anything. By default, they would be blocked*, but they've never asked for a page, so I haven't even bothered denying them in robots.txt. Hm. Wonder if they'd behave differently if they did find a No Admittance sign?

An earlier thread [webmasterworld.com] postulated a relationship between DotBot and ezooms, but I haven't seen them around either.

* Happily, the latest set of visits came after I started logging headers on robots.txt requests.

keyplyr

7:54 pm on May 31, 2016 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I see it all the time, always have. At least once a week.

lucy24

12:35 am on Jun 1, 2016 (gmt 0)

... and, under the heading of “D’oh!” ...

I had entirely forgotten that they have been denied* since who-knows-when. So what they're doing is walking past the door over and over again to see if it still says "No Admittance" the way it did an hour ago-- but they're not brazen enough to rattle the doorknob. Not that it would do them any good, but they haven't even tried.

Along with checking robots.txt I double-checked logs to confirm that nobody else has been coming around from the same IP. Some entities do use a different user-agent for robots.txt requests than for their "real" requests.

* I'm trying to get in the habit of distinguishing between "deny" (robots.txt) and "block" (htaccess/config). It's the difference between putting up a sign that says Employees Only, and deadbolting the door.
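To make the "deny" half of that distinction concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content below is a hypothetical example, not the actual file from either site, and `is_allowed` is just an illustrative helper name. The "block" half lives in server configuration and is not shown; the point is that a robots.txt rule only tells a well-behaved robot what it may fetch and does nothing to stop one that ignores it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: DotBot is denied everything,
# everyone else is denied only /private/.
ROBOTS_TXT = """\
User-agent: dotbot
Disallow: /

User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url.

    Pass the robot's short name (e.g. "DotBot"), not the full
    "Mozilla/5.0 (compatible; ...)" string: the parser matches on the
    part of the name before the first slash.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "DotBot", "/index.html"))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "/index.html"))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "/private/notes.html"))
```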

keyplyr

12:49 am on Jun 1, 2016 (gmt 0)

I look at every 403... and 404.

lucy24

5:50 am on Jun 1, 2016 (gmt 0)

"I look at every 403... and 404."

I look at human 403s (request for errorstyles.css) and any 404, though most of the latter are the Googlebot testing for soft 404s. Or obvious typos. The rest of the time I just can't be bothered, unless I'm getting ready for an installment of At Home With The Robots.

keyplyr

6:25 am on Jun 1, 2016 (gmt 0)

It helps to have a hobby.

lucy24

7:00 pm on Jun 5, 2016 (gmt 0)

Follow-up:

Over the course of several days, they asked for robots.txt more than fifty times. So I decided to let them in and see what happens. Of course they were wild with excitement and ran around requesting pages that haven't been on that site in 2½ years--sometimes longer. Nothing in roboted-out directories, and definitely not a fresh crawl; they had an old shopping list and were going to follow it. The robots.txt requests turn out to be an inherent behavior, continuing at a rate of 10-15 a day. (Hm. Could they be related to bing?)

In spite of the colossal number of redirects they picked up, to date there have been no requests at all on the new site (the one all the redirects point to).

:: detour to raw logs ::

Looks as if they decided to become robots.txt compliant on or about 8 February 2015. That's the last time they made a (403) page request.

keyplyr

10:25 am on Jun 6, 2016 (gmt 0)

"An earlier thread [webmasterworld.com] postulated a relationship between DotBot and ezooms, but I haven't seen them around either."

There may be various posers, but DotBot belongs to Moz [moz.com...] and is used to gather data for their marketing tools; good if you publish ads.

keyplyr

9:51 pm on Jul 6, 2016 (gmt 0)

DotBot (which I allow) came 'round today, requested robots.txt, then ignored the Disallow directive on one directory/page. I also use the on-page meta robots noindex.

"The data we collect through DotBot is surfaced on this site, in Moz tools, and is also available via the Mozscape API."

I used their Site Explorer tool and found about a third of my pages listed. I did not find the disallowed page from today; however, it is likely too early to see that result.

lucy24

11:53 pm on Jul 6, 2016 (gmt 0)

After letting them in, I sat back and watched.

Access permitted in robots.txt: 2 June
Reaction at original site: instant
First appearance at new site (target of redirects from old site): 9 June
First request on new site for page that didn't exist on old site: 15 June
Last sighting on new site: 24 June (after requesting material from four pages or directories that didn't exist at the old site-- but not a comprehensive crawl of all new material, though they do seem fond of one new directory*).

They're still coming by regularly at the old site, picking up redirects in a quite bing'esque fashion, probably averaging out to 1 request per page per week. Also bing-like is their continuing appetite for robots.txt, about ¼ of total requests. No requests for pages in roboted-out directories on either site.

Interesting quirk: One group of pages used to have URLs in the form /dir/subdir/FileName.html where /subdir/ only contained one file. Long after the site move, I changed all these to /dir/subdir/ alone. But the DotBot has repeatedly asked for /dir/subdir/ at the old site-- a URL that never existed there, though it's obviously deducible from the URL I did use. Happily the redirect target is the same either way so, hey, whatever rocks their boat.

* Edit: I checked more carefully. They asked for the directory index page twice, and each named page within that directory exactly once. So, yeah, at least that bit is a comprehensive crawl. I doubt they're following someone else's link, since it's a brand-new directory.

keyplyr

2:43 am on Jul 7, 2016 (gmt 0)

Those crawl-to-results times are helpful, thanks.

"Interesting quirk..." IMO that only shows the behavior is vertical, and not a linear crawl.

not2easy

10:37 pm on Oct 10, 2016 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

In a way it makes sense - I'm seeing the DotBot from Wowrack:
216.244.64.0 - 216.244.95.255
216.244.64.0/19
Checking some of their IPs, I am seeing that they come from where I used to get ezooms visits.
From old records:
208.115.113.80 - 208.115.113.95 dotnetdotcom.org - Ezooms
208.115.96.0 - 208.115.127.255 (208.115.96.0/19) Wowrack.com - Ezooms Bot
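The range arithmetic above can be double-checked with Python's standard `ipaddress` module. This is only a sketch: the helper names `cidr_bounds` and `in_network` are made up for illustration, and the addresses are the ones quoted in this thread. It confirms which addresses fall inside which CIDR block and says nothing about who actually operates them.

```python
import ipaddress

def cidr_bounds(cidr):
    """Return (first, last) address of a CIDR block as strings."""
    net = ipaddress.ip_network(cidr)
    return str(net.network_address), str(net.broadcast_address)

def in_network(ip, cidr):
    """Return True if ip lies inside the CIDR block."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)

if __name__ == "__main__":
    # The two /19 blocks listed above.
    print(cidr_bounds("216.244.64.0/19"))
    print(cidr_bounds("208.115.96.0/19"))

    # The two IPs from the opening post, tested against the older /19.
    for ip in ("208.115.111.72", "208.115.113.88"):
        print(ip, in_network(ip, "208.115.96.0/19"))

    # Collapse the 208.115.113.80 - 208.115.113.95 range into CIDR form.
    first = ipaddress.ip_address("208.115.113.80")
    last = ipaddress.ip_address("208.115.113.95")
    print(list(ipaddress.summarize_address_range(first, last)))
```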

keyplyr

11:17 pm on Oct 10, 2016 (gmt 0)

"DotBot from Wowrack"

A quick search will show the UA is used (faked) by many IPs at hosting companies (example: wowrack.com)

I only allow DotBot from Moz :)

blend27

4:37 pm on Oct 12, 2016 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Moz, No Moz - no DotBot.

Moz's DotBot respectfully obeys robots.txt