
Forum Moderators: Ocean10000 & incrediBILL & keyplyr

Dotbot Scraping Images

Moz behavior

     
10:27 pm on Feb 19, 2018 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:11739
votes: 734


For the last few days I've seen Moz's dotbot scrape thousands of image files from several sites I watch.

According to their info page at [opensiteexplorer.org...], I see no reason for it to take image files. Is anyone else seeing this?

Related:
[webmasterworld.com...]
[webmasterworld.com...]
1:33 am on Feb 20, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14775
votes: 632


They've been on my Ignore list for a while, so I'd forgotten about them except when I see the name in redirect logs. At some time when I wasn't looking they moved lock stock & barrel to 216.244.66 (like 208.whatever-it-was, a WOWrack range).

My goodness, they've been busy. Thousands of requests just within 2018. (On my sites, four digits counts as a lot*.) But image requests? Very, very, very rare--with more than half of the (tiny) total being this past 19 January.

Are yours coming from 216.244 or 208.thingy?
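
For anyone wanting to answer that from their own logs, here is a minimal sketch of a range check. It assumes the "216.244.66" mentioned above means the /24 network 216.244.66.0/24; that prefix length is my assumption, not something confirmed in this thread, and the function name is only illustrative.

```python
import ipaddress

# Assumption: "216.244.66" in the post is read as the /24 network below.
ASSUMED_DOTBOT_NET = ipaddress.ip_network("216.244.66.0/24")


def in_assumed_dotbot_range(ip: str) -> bool:
    """Return True if the address falls inside the assumed /24 range."""
    return ipaddress.ip_address(ip) in ASSUMED_DOTBOT_NET
```

Feed it the client IP from each log line; anything returning False came from some other range.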


* Cross-checking reveals that the raw total number is more than half the raw total number from Googlebot, so yeah, that's “a lot” by any measure.
1:42 am on Feb 20, 2018 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:11739
votes: 734


Yes, from the Wowrack range.

No doubt they have repurposed dotbot to include other functions. They are not mentioning it on their info page (hence this thread).
12:37 pm on Feb 21, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5485
votes: 3


Very reluctant to reply to this thread!

keyplyr, REALLY? (rhetorical)
1) Your most recent post does not include the most recent UA.
2) Given your extensive use of headers and scripts, one would be inclined to believe that mass image grabs are avoidable.
3) "crawler", or any variation of it, has been a deniable criterion for UAs for at least fifteen years.
4) One of the first concerns for new webmasters is access to images by bots.
a) Even the major search engines have proven records of 'hiccups', where their bots uncontrollably grabbed directories denied in robots.txt, which led to the practice of keeping denials in place to prevent future 'hiccups'.

FWIW, the following are requests (robots.txt ONLY) from the Wowrack IP for Jan and Feb 2018 so far, with the UA
"Mozilla/5.0 (compatible; DotBot/1.1; http:// www .opensiteexplorer .org/dotbot, help@moz.com)"
(Note: spaces added in the UA to break the URL.)

It's worth noting that DotBot is not named in my robots.txt; as a result, I'm assuming that some other, earlier name (unknown to me) accounts for their compliance.

Jan
7; two robots
14; two robots
15; six robots
16; two robots
17; one robots
18; three robots
19; four robots
22; two robots
26; one robots
29; one robots

Feb
19; one robots

[edited by: wilderness at 1:34 pm (utc) on Feb 21, 2018]

12:51 pm on Feb 21, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5485
votes: 3


It should be noted that the references above could easily be 'misconstrued' as Wide Open West; in fact, the range belongs to Wowrack, a cloud server provider.
4:09 am on Feb 22, 2018 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:11739
votes: 734


@wilderness - thanks for your input.

Please don't assume that UAs documented in these threads should be blocked, or that I as the forum moderator am suggesting any UA be blocked, or even that I block these UAs simply because they get documented here.

UAs and their behavior are documented for information, nothing more. Each webmaster may then determine whether that agent is of benefit to their specific interests or not.

To address your post more explicitly: as one who sells advertising on one or more web sites, I would of course allow ad marketing bots like dotbot... whether they include the UA attribute "crawler" or not :)
4:20 am on Feb 22, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5485
votes: 3


"To address your post more explicitly: as one who sells advertising on one or more web sites, I would of course allow ad marketing bots like dotbot... whether they include the UA attribute "crawler" or not :)"

Then why bother posting when they abuse your access?
If you choose to allow them, then bite the bullet!
5:35 am on Feb 22, 2018 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:11739
votes: 734


I think you're missing the point here. Once again...
UAs and their behavior are documented for information, nothing more. Each webmaster may then determine whether that agent is of benefit to their specific interests or not.
One of the functions of this forum is to provide a service to those interested in this information.

This is not a debate, just an explanation.
10:07 am on Feb 22, 2018 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:11739
votes: 734


Response from Moz:
I'm sorry for the inconvenience here! DotBot will essentially crawl every link which we can reach/follow.
This of course avoided the specific issue I asked them about, namely the sudden image file requests.

He went on to explain robots.txt and how it can be used to control what files dotbot may access.
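
As a rough illustration of that robots.txt approach, the sketch below uses Python's standard `urllib.robotparser` to check what a compliant crawler identifying as "DotBot" would be allowed to fetch. The rules and the `/images/` directory are hypothetical examples, not Moz's wording or anyone's actual file, and the check says nothing about whether a given bot really obeys it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep dotbot out of an assumed /images/ directory.
EXAMPLE_ROBOTS_TXT = """\
User-agent: dotbot
Disallow: /images/
"""


def dotbot_may_fetch(path: str, robots_txt: str = EXAMPLE_ROBOTS_TXT) -> bool:
    """Return True if these rules allow a 'DotBot' agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Pass the short product token, not the full "Mozilla/5.0 (...)" string.
    return parser.can_fetch("DotBot", path)
```

With the example rules, image paths under /images/ are refused while ordinary pages stay allowed.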
7:35 pm on Feb 22, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14775
votes: 632


If you chose to allow them then bite the bullet!
Whoa there. If I allow someone to visit my house based on my knowledge of their behavior and personality, that doesn't mean I've forfeited the right to complain if they kick my cat.
11:21 pm on Feb 27, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14775
votes: 632


I was reminded of this thread when I started processing today's redirect list. It appears the DotBot came across an old shopping list (pages, not images) from the middle of 2013 ....

Whew.
9:07 pm on May 20, 2018 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:11739
votes: 734


On its last few visits DotBot requested only image files (blocked) and robots.txt (allowed), but today, along with 1700 requests for images and 280 requests for robots.txt, it actually asked for 3 pages (allowed).
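
A breakdown like that can be tallied with a small script. This is only a sketch: it assumes you have already pulled the requested paths for the DotBot UA out of your access log, and the extension list and category names are my own choices.

```python
import os
from collections import Counter

# Assumed set of extensions to count as image requests; adjust to taste.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg"}


def classify(path: str) -> str:
    """Label a requested path as 'robots', 'image', or 'page'."""
    path = path.split("?", 1)[0]  # ignore any query string
    if path == "/robots.txt":
        return "robots"
    extension = os.path.splitext(path)[1].lower()
    return "image" if extension in IMAGE_EXTENSIONS else "page"


def tally(paths) -> Counter:
    """Count requests per category for an iterable of request paths."""
    return Counter(classify(p) for p in paths)
```

Calling `tally()` on one day's paths gives the per-category counts directly.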

I'm convinced they are now in the business of image scraping, possibly to build an image index or for some other reason yet to be determined.
Moz is an Inbound Marketing company that creates analytics software for online marketers.
Whether they are still in the marketing business is unknown.
9:48 pm on May 20, 2018 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 355
votes: 33


I banned DotBot/OpenSite Explorer long ago in htaccess and robots.txt. They were just taking up inordinate amounts of server resources on a daily basis. They still visit me daily and read my robots.txt sometimes 35 times a day, which is more than once an hour. That is abusive.

The other moz.com bot, rogerbot, hits me just as often, but only goes after robots.txt and rss feeds. If or when they scrape other resources I will ban them. Moz is one of the companies that I do watch on a regular basis.

rogerbot/1.0 (http://www.moz.com/dp/rogerbot, rogerbot-crawler@moz.com)
11:31 pm on May 20, 2018 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:11739
votes: 734


Since I sell my own ad space as well as publish other ad platforms, I try to allow marketing company bots access. They aggregate that data into products they sell to advertisers, then these advertisers bid for ads on my site. This makes me a happy man.

However, most of these marketing companies have more than one objective. This is the tricky part: allowing them access to the components that benefit my interests while blocking them from taking those that don't.
2:16 am on May 21, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14775
votes: 632


rogerbot ... only goes after robots.txt and rss feeds
They get information from RSS feeds. The only time I ever see them is when they're following up a link from (a different site's) RSS feed. That is, they don't send a referer, but I know where they're getting the name of the requested page.
 
