AhrefsBot

new assigned crawl range

         

keyplyr

9:56 pm on Jul 23, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



UA: Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)
Protocol: HTTP/1.1
Robots.txt: Yes but disobeys
Host: ahrefs.com
151.80.32.0 - 151.80.47.255
151.80.32.0/20
Parent: ovh.com
151.80.0.0 - 151.80.255.255
151.80.0.0/16
Parent: ovh.com
164.132.0.0 - 164.132.255.255
164.132.0.0/16

AhrefsBot gathers site data for marketing products sold to advertisers. This bot has changed hosts many times over the years, but it now has an assigned crawl range at OVH. Since OVH is a cloud computing host, the bot may also crawl from other nodes within OVH blocks.
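For anyone who wants to test a visitor's IP against the ranges listed above, here is a minimal sketch using Python's standard ipaddress module. The names REPORTED_RANGES, in_reported_ranges and range_to_cidrs are made up for illustration, and the ranges are only the ones sighted in this post, so treat them as a snapshot rather than an official list.

```python
import ipaddress

# Ranges copied from the sighting above (July 2017); they may be out of date.
REPORTED_RANGES = [
    ipaddress.ip_network("151.80.32.0/20"),  # assigned crawl range
    ipaddress.ip_network("151.80.0.0/16"),   # parent OVH block
    ipaddress.ip_network("164.132.0.0/16"),  # another OVH block
]


def in_reported_ranges(ip: str) -> bool:
    """Return True if the address falls inside any range listed above."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in REPORTED_RANGES)


def range_to_cidrs(first: str, last: str) -> list:
    """Convert a first-last address range into the CIDR block(s) covering it."""
    return [
        str(net)
        for net in ipaddress.summarize_address_range(
            ipaddress.ip_address(first), ipaddress.ip_address(last)
        )
    ]
```

Calling range_to_cidrs("151.80.32.0", "151.80.47.255") is a quick way to confirm that a first-last range and its CIDR notation agree before putting either into a block or allow rule.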

lucy24

10:56 pm on Jul 23, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Robots.txt: Yes but disobeys
Well, ###. Can you dredge up any details on this? I know I used to have them pegged as a Bad Robot but I'm currently authorizing them. (I doubt there are fakers involved; fakers rarely get the headers right.) Looks as if I poked a hole in May 2016 after about a month and a half of good behavior.

Incidentally I didn't even realize they had fixed crawl ranges. I always thought of them as distributed. Oops.

keyplyr

11:11 pm on Jul 23, 2017 (gmt 0)



Can you dredge up any details on this?
It first requested robots.txt, where it is disallowed, and then requested a web page anyway, at which point it got blocked by IP range.
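That kind of check (robots.txt says no, the bot asks anyway) can be roughly sketched with Python's standard urllib.robotparser. The function name disobeyed_paths is hypothetical, and the robots.txt content is whatever your own site serves; nothing here is specific to Ahrefs.

```python
from urllib.robotparser import RobotFileParser


def disobeyed_paths(robots_txt: str, agent_token: str, requested_paths):
    """Return the requested paths that robots.txt disallows for this agent.

    agent_token should be the short robot name (for example "AhrefsBot"),
    not the full User-Agent header, because the parser matches on the
    product token.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in requested_paths if not parser.can_fetch(agent_token, p)]
```

Feed it the paths a robot requested after it fetched robots.txt. Any path that comes back is a request the robot should not have made if it were compliant.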

I no longer judge the validity or usefulness of an agent by whether it respects robots.txt. Most agents that are useful to my interests don't even request robots.txt. Sadly, robots.txt has been kinda archaic since social media came into vogue.

Back to AhrefsBot - considering it now has a verifiable crawl range and has evolved to an explicit purpose, I now allow it from its designated range and other OVH blocks. Before that, it was too flaky to trust IMO.

But if you don't sell stuff or publish ads, it's probably of no use to your interests.

I always thought of them as distributed.
Yeah, they moved around a lot, giving that (flaky) impression, especially if you looked them up here and saw the many earlier sightings.

keyplyr

2:56 am on Jul 24, 2017 (gmt 0)



Another one...
UA: Mozilla/5.0 (compatible; AhrefsBot/5.2; News; +http://ahrefs.com/robot/)
Protocol: HTTP/1.1
Robots.txt: Yes
Host: ovh.com
51.254.0.0 - 51.255.255.255
51.254.0.0/15

Note the "News" in the UA string.
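To tell the two UA variants apart in logs, a regular expression along these lines might do. It is only a sketch based on the two strings quoted in this thread; the pattern name AHREFS_UA and the function parse_ahrefs_ua are invented here, and other variants may exist that it does not handle.

```python
import re

# Built from the two UA strings quoted in this thread:
#   "...AhrefsBot/5.2; +http://ahrefs.com/robot/)"
#   "...AhrefsBot/5.2; News; +http://ahrefs.com/robot/)"
AHREFS_UA = re.compile(
    r"AhrefsBot/(?P<version>[\d.]+);\s*(?:(?P<variant>[^;+]+);\s*)?\+"
)


def parse_ahrefs_ua(user_agent: str):
    """Return (version, variant) for an AhrefsBot UA, or None if no match.

    variant is None when the UA carries no extra token such as "News".
    """
    match = AHREFS_UA.search(user_agent)
    if match is None:
        return None
    variant = match.group("variant")
    return match.group("version"), variant.strip() if variant else None
```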

lucy24

8:47 pm on Jul 24, 2017 (gmt 0)



But if you don't sell stuff or publish ads, it's probably of no use to your interests.

Which brings up the eternal question: If I don't sell stuff or publish ads, what are they even looking for? I've found them in some quite deep interior pages, although never in a disallowed directory. (One of my disallowed directories contains pages that are linked from everywhere, so it's a good way to judge whether a full-spidering robot plans to be compliant. Not so useful if they're just following specific links, since I don't suppose anyone in history has ever linked to, say, my Legal page.)

This is, obviously, not an Ahrefs-specific question.
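The disallowed-directory test described above can be approximated by counting log hits under that directory per user agent. This is a rough sketch that assumes the common Apache "combined" log format; LOG_LINE and hits_in_disallowed are hypothetical names, and the prefix is whatever directory you have disallowed.

```python
import re
from collections import Counter

# Assumes the Apache "combined" format:
# ip - - [date] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def hits_in_disallowed(log_lines, prefix: str) -> Counter:
    """Count requests under a disallowed path prefix, keyed by user agent."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group("path").startswith(prefix):
            counts[match.group("ua")] += 1
    return counts
```

A robot that shows up in the result has entered the disallowed directory at least once. As noted above, this says more about full-spidering robots than about ones that only follow specific links.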

keyplyr

8:53 pm on Jul 24, 2017 (gmt 0)



They are looking to see the content of your pages, themes, topic, platform, software, outbound links... all the metrics they use to bundle your site data with thousands of other sites' data and roll it into a product they sell to those who do advertise and market.

And none of us published ads until we did :)

Even if you never publish ads or sell products, your info is extremely valuable.

lucy24

6:13 pm on Oct 22, 2017 (gmt 0)



Aaand just this morning I found them in a disallowed directory, which prompted me to look up the most recent discussion.

Interestingly, they asked for the stylesheets associated with my error documents, but not for the documents themselves. I think this must be a legacy of years ago when I blocked them. They don't know the URL of the 403 page, but they do know its associated stylesheets.

Still can't figure out what they do with those stylesheets, though.