Dotbot Scraping Images

Moz behavior

keyplyr

10:27 pm on Feb 19, 2018 (gmt 0)

For the last few days I've seen Moz's dotbot scrape thousands of image files from several sites I watch.

According to [opensiteexplorer.org...] I see no reason for it to take image files. Is anyone else seeing this?

Related:
[webmasterworld.com...]
[webmasterworld.com...]

lucy24

1:33 am on Feb 20, 2018 (gmt 0)

They've been on my Ignore list for a while, so I'd forgotten about them except when I see the name in redirect logs. At some point when I wasn't looking they moved lock, stock & barrel to 216.244.66 (like 208.whatever-it-was, a Wowrack range).

My goodness, they've been busy. Thousands of requests just within 2018. (On my sites, four digits counts as a lot*.) But image requests? Very, very, very rare--with more than half of the (tiny) total being this past 19 January.

Are yours coming from 216.244 or 208.thingy?

* Cross-checking reveals that the raw total number is more than half the raw total number from Googlebot, so yeah, that's “a lot” by any measure.
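One way to answer the range question from raw logs is to test each client address against the network. A minimal Python sketch, assuming the "216.244.66" prefix mentioned above means the whole 216.244.66.0/24 block (the post gives only the first three octets); the sample addresses are made up for illustration:

```python
import ipaddress

# Assumption: "216.244.66" in the post refers to the full /24.
WOWRACK_RANGE = ipaddress.ip_network("216.244.66.0/24")


def in_wowrack_range(ip: str) -> bool:
    """Return True if the address falls inside the assumed Wowrack /24."""
    return ipaddress.ip_address(ip) in WOWRACK_RANGE


# Hypothetical client addresses, not taken from any real log.
samples = ["216.244.66.10", "216.244.67.10", "192.0.2.55"]
matches = [ip for ip in samples if in_wowrack_range(ip)]
```

If the bot turns out to use a wider or narrower block, only the network constant needs to change.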

keyplyr

1:42 am on Feb 20, 2018 (gmt 0)

Yes, from the Wowrack range.

No doubt they have repurposed dotbot to include other functions. They are not mentioning it on their info page (hence this thread).

wilderness

12:37 pm on Feb 21, 2018 (gmt 0)

very reluctant to reply to this thread!

keyplyr, REALLY? (rhetorical)
1) your most recent post does not include the most recent UA
2) Given your extensive use of headers and scripts, one would be inclined to believe that mass image grabs are avoidable
3) "crawler" or any variation of same has been a deniable criterion for UAs for at least fifteen years.
4) one of the first concerns for new webmasters is access to images by bots.
a) even the major search engines have proven records of 'hiccups', where their bots grabbed robots-denied directories uncontrollably, which resulted in a practice of keeping denials in place to prevent future 'hiccups'.

FWIW, the following are requests (robots.txt ONLY) from the Wowrack IP for Jan and Feb 2018 thus far, with the UA
"Mozilla/5.0 (compatible; DotBot/1.1; http:// www .opensiteexplorer .org/dotbot, help@moz.com)"
(Note; spaces added in UA to break URL)

It's worth noting that DotBot is not named in my robots.txt; as a result, I'm assuming that some other earlier name (unknown to me) accounts for their compliance.

Jan
7; two robots
14; two robots
15; six robots
16; two robots
17; one robots
18; three robots
19; four robots
22; two robots
26; one robots
29; one robots

Feb
19; one robots
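A per-day tally like the one above can be pulled from an access log with a short script. This is only a sketch: it assumes Apache combined log format, and the sample lines (IP, timestamps, sizes) are invented for illustration rather than copied from any real log.

```python
import re
from collections import Counter

# UA string as quoted in the post, with the URL-breaking spaces removed.
DOTBOT_UA = ("Mozilla/5.0 (compatible; DotBot/1.1; "
             "http://www.opensiteexplorer.org/dotbot, help@moz.com)")

# Captures: day, method, path, user agent (combined log format assumed).
LOG_RE = re.compile(
    r'\[(\d{2}/\w{3}/\d{4}):[^\]]*\] "(\w+) (\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "([^"]*)"'
)


def robots_hits_per_day(lines):
    """Count DotBot requests for /robots.txt, grouped by day."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if not m:
            continue  # skip lines that do not look like combined format
        day, _method, path, ua = m.groups()
        if "dotbot" in ua.lower() and path == "/robots.txt":
            counts[day] += 1
    return counts


# Hypothetical log lines for illustration only.
LOG_LINES = [
    f'216.244.66.10 - - [15/Jan/2018:03:12:01 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "{DOTBOT_UA}"',
    f'216.244.66.10 - - [15/Jan/2018:09:40:17 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "{DOTBOT_UA}"',
    f'216.244.66.10 - - [15/Jan/2018:09:41:02 +0000] "GET /img/a.jpg HTTP/1.1" 403 0 "-" "{DOTBOT_UA}"',
    f'216.244.66.10 - - [16/Jan/2018:11:05:44 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "{DOTBOT_UA}"',
    '192.0.2.55 - - [16/Jan/2018:12:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "SomeOtherBot/1.0"',
]

per_day = robots_hits_per_day(LOG_LINES)
```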

[edited by: wilderness at 1:34 pm (utc) on Feb 21, 2018]

wilderness

12:51 pm on Feb 21, 2018 (gmt 0)

It should be noted that the references above could easily be misconstrued as Wide Open West; in fact they refer to Wowrack, a cloud server provider.

keyplyr

4:09 am on Feb 22, 2018 (gmt 0)

@wilderness - thanks for your input.

Please don't assume that UAs documented in these threads should be blocked, or that I as the forum moderator am suggesting any UA be blocked, or even that I block these UAs simply because they get documented here.

UAs and their behavior are documented for information, nothing more. Each webmaster may then determine whether that agent is of benefit to their specific interests or not.

To address your post more explicitly: as one who sells advertising on one or more web sites, I would of course allow ad marketing bots like dotbot... whether they include the UA attribute "crawler" or not :)

wilderness

4:20 am on Feb 22, 2018 (gmt 0)

"To address you post more explicitly; as one who sells advertising on one or more web sites, I would of course allow ad marketing bots like dotbot... whether they include the UA attribute "crawler" or not :) "

Then why bother posting when they abuse your access?
If you chose to allow them then bite the bullet!

keyplyr

5:35 am on Feb 22, 2018 (gmt 0)

I think you're missing the point here. Once again...

UAs and their behavior are documented for information, nothing more. Each webmaster may then determine whether that agent is of benefit to their specific interests or not.

One of the functions of this forum is to provide a service to those interested in this information.

This is not a debate, just an explanation.

keyplyr

10:07 am on Feb 22, 2018 (gmt 0)

Response from Moz:

I'm sorry for the inconvenience here! DotBot will essentially crawl every link which we can reach/follow.

This of course avoided the specific issue I asked them about, i.e. the sudden image file requests.

The reply went on to explain robots.txt and how it can be used to control which files dotbot may access.
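For anyone who wants to check what such a rule would do before deploying it, Python's standard urllib.robotparser can evaluate a robots.txt body offline. A minimal sketch: the /images/ directory and the test paths are hypothetical, and this only shows how a compliant parser reads the rules, not how DotBot itself interprets them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep dotbot out of an assumed /images/ directory
# while leaving everything else open to it.
ROBOTS_TXT = """\
User-agent: dotbot
Disallow: /images/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The parser matches on the product token, so pass "DotBot", not the full UA.
image_allowed = rp.can_fetch("DotBot", "/images/photo.jpg")
page_allowed = rp.can_fetch("DotBot", "/articles/page.html")
other_bot_image_allowed = rp.can_fetch("SomeOtherBot", "/images/photo.jpg")
```

Note that this parser does plain prefix matching, so a directory rule is used here rather than a wildcard pattern such as *.jpg.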

lucy24

7:35 pm on Feb 22, 2018 (gmt 0)

If you chose to allow them then bite the bullet!

Whoa there. If I allow someone to visit my house based on my knowledge of their behavior and personality, that doesn't mean I've forfeited the right to complain if they kick my cat.

lucy24

11:21 pm on Feb 27, 2018 (gmt 0)

I was reminded of this thread when I started processing today's redirect list. It appears the DotBot came across an old shopping list (pages, not images) from the middle of 2013 ....

Whew.

keyplyr

9:07 pm on May 20, 2018 (gmt 0)

On the last few visits DotBot requested only image files (blocked) and robots.txt (allowed), but today, along with 1700 requests for images and 280 requests for robots.txt, it actually asked for 3 pages (allowed).

I'm convinced they are now in the business of image scraping, possibly to build an image index or for some other reason yet to be determined.
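A breakdown like this (images vs. robots.txt vs. pages) can be produced by classifying each requested path. A rough Python sketch, assuming images are identified by file extension; the extension list and sample paths are illustrative guesses, not anything from my logs:

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumption: these extensions are what counts as an "image" on the site.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg"}


def classify(path: str) -> str:
    """Label a request path as 'robots', 'image', or 'page'."""
    path = path.split("?", 1)[0]  # ignore any query string
    if path == "/robots.txt":
        return "robots"
    if PurePosixPath(path).suffix.lower() in IMAGE_EXTENSIONS:
        return "image"
    return "page"


# Hypothetical request paths for illustration only.
paths = ["/robots.txt", "/img/a.jpg", "/img/b.PNG?size=2", "/", "/about.html"]
breakdown = Counter(classify(p) for p in paths)
```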

Moz is an Inbound Marketing company that creates analytics software for online marketers.

Whether they are still in the marketing business is unknown.

TorontoBoy

9:48 pm on May 20, 2018 (gmt 0)

I banned DotBot/OpenSite Explorer in htaccess and robots.txt long ago. They were just taking up inordinate amounts of server resources on a daily basis. They do visit me daily and read my robots.txt sometimes 35 times a day, which is more than once an hour. That is abusive.
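The actual htaccess rules are not shown here. As a language-neutral illustration of the case-insensitive user-agent match that such a ban usually relies on, here is a small Python sketch; the token list and the non-DotBot sample UA are assumptions made for the example:

```python
# Assumption: a ban keyed on the "dotbot" token in the User-Agent header.
BLOCKED_UA_TOKENS = ("dotbot",)


def is_blocked(user_agent: str) -> bool:
    """Return True if the UA contains any blocked token (case-insensitive)."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)


dotbot_ua = ("Mozilla/5.0 (compatible; DotBot/1.1; "
             "http://www.opensiteexplorer.org/dotbot, help@moz.com)")
rogerbot_ua = "rogerbot/1.0 (http://www.moz.com/dp/rogerbot, rogerbot-crawler@moz.com)"

blocked = [is_blocked(dotbot_ua), is_blocked(rogerbot_ua)]
```

A UA match only stops bots that identify themselves honestly, which is why a robots.txt entry is kept alongside it.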

The other moz.com bot, rogerbot, hits me just as often, but only goes after robots.txt and rss feeds. If or when they scrape other resources I will ban them. Moz is one of the companies that I do watch on a regular basis.

rogerbot/1.0 (http://www.moz.com/dp/rogerbot, rogerbot-crawler@moz.com)

keyplyr

11:31 pm on May 20, 2018 (gmt 0)

Since I sell my own ad space as well as publish other ad platforms, I try to allow marketing company bots access. They aggregate that data into products they sell to advertisers, then these advertisers bid for ads on my site. This makes me a happy man.

However, most of these marketing companies have more than one objective. This is the tricky part: allowing them access to the components that benefit my interests while blocking them from taking what doesn't.

lucy24

2:16 am on May 21, 2018 (gmt 0)

rogerbot ... only goes after robots.txt and rss feeds

They get information from RSS feeds. The only time I ever see them is when they're following up a link from (a different site's) RSS feed. That is, they don't send a referer, but I know where they're getting the name of the requested page.