spbot

keyplyr

10:46 pm on Jun 12, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



UA: Mozilla/5.0 (compatible; spbot/5.0.2; +http://OpenLinkProfiler.org/bot )
Protocol: HTTP/1.1
Robots.txt: No
Host: digitalocean.com
104.131.0.0 - 104.131.255.255
104.131.0.0/16

SEO link analysis. "spbot" has been mentioned a couple of times at WW, but this may be a different agent:
[webmasterworld.com...]
[webmasterworld.com...]

lucy24

1:22 am on Jun 13, 2016 (gmt 0)


spbot is one of the latest additions to my Test And Assess list: Newly denied in robots.txt, and if they comply for a while, I'll let them in*. I generally check once a month, at which point it's either (a) poke a hole, (b) roll-over to next month (for infrequent visitors), or (c) transfer their name to the Continue Blocking list.

fwiw, they've only visited once since I started denying them, and that time they requested nothing after robots.txt. Earlier, they requested only the front page-- three times on each visit, demonstrating some perseverance. (The 403 page has links to other pages, primarily in roboted-out directories, so the failure to ask for interior pages is a choice on their part.)

And, in reference to one of those linked earlier discussions: In my robots.txt, "User-Agent: spbot" is one of a long list of User-Agents grouped together in a comprehensive Disallow. The current version doesn't seem to have trouble understanding the syntax.


* I am generally extremely lenient with robots. So long as they honor a "No Admittance" sign, and don't meddle with roboted-out directories, they're welcome unless they're already known for egregious misbehavior. You don't see a lot of robots going to the trouble of demonstrating robots.txt compliance over a period of many weeks before they yank off the mask and run wild. Not saying they don't exist, just not common.

keyplyr

4:42 am on Jun 13, 2016 (gmt 0)


"long list of User-Agents grouped together in a comprehensive Disallow"

Do you mean...?

User-agent: Example1-Bot
User-agent: Example2-Bot
User-agent: Example3-Bot
User-agent...
Disallow: /

If so, I remember Jim Morgan & me having a long discussion about why it should work even though it isn't supported by robots.org. Last time I tried it, few bots supported that syntax. Is there better support now?
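For what it's worth, one way to check parser-side support is Python's standard-library robots.txt parser, which does treat consecutive User-agent lines as one record sharing the Disallow that follows. (Whether any given crawler's own parser does the same is, of course, up to that crawler; the bot names below are placeholders, as in the example above.)

```python
from urllib.robotparser import RobotFileParser

# A grouped record: several User-agent lines sharing one Disallow.
rules = """\
User-agent: Example1-Bot
User-agent: Example2-Bot
User-agent: Example3-Bot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Every bot named in the group is denied everywhere...
for bot in ("Example1-Bot", "Example2-Bot", "Example3-Bot"):
    print(bot, parser.can_fetch(bot, "/any/page.html"))  # False

# ...while an unlisted agent is still allowed, since there is
# no catch-all (User-agent: *) record in this file.
print(parser.can_fetch("SomeOtherBot", "/any/page.html"))  # True
```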

lucy24

6:12 am on Jun 13, 2016 (gmt 0)


Do you mean...?

Yes, exactly. It seems to work, in the sense that robots who are denied will eventually stop asking. Unless they never intended to comply in the first place, which I think is far more likely than that they genuinely couldn't understand what I meant.

As I understand it, the robots.txt standard has never officially been updated in any way whatsoever. But, really, there are reasonable expectations. F'rinstance, the only reason the Googlebot doesn't recognize "Crawl-Delay" is that it just doesn't feel like it.

I guess I could pick out the names on the Continue Blocking list and see if they behave differently if they get their very own individual
User-Agent: this-means-you
Disallow: /
But like I said, I really don't think it's a comprehension problem ;)
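To a conforming parser the two forms should be interchangeable; a quick sketch with Python's stdlib parser, using the hypothetical "this-means-you" name from above, shows the grouped record and the individual record yield the same verdict:

```python
from urllib.robotparser import RobotFileParser

# The same rule written two ways: a grouped record, and an
# individual record for a single crawler (names are placeholders).
grouped = """\
User-agent: this-means-you
User-agent: some-other-bot
Disallow: /
"""

individual = """\
User-agent: this-means-you
Disallow: /
"""

verdicts = []
for rules in (grouped, individual):
    p = RobotFileParser()
    p.parse(rules.splitlines())
    verdicts.append(p.can_fetch("this-means-you", "/page.html"))

print(verdicts)  # both records deny the crawler: [False, False]
```

So if a bot behaves differently when it gets its very own record, that's a quirk of its parser (or its intentions), not of the file.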

keyplyr

6:36 am on Jun 13, 2016 (gmt 0)


the robots.txt standard has never officially been updated in any way whatsoever

The only update I've seen is the added support for:
Sitemap: http://example.com/sitemap.xml
And that took almost a year to catch on. But you're correct in that it is more a matter of the bot itself than of the standard.
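That directive did eventually land in tooling too: Python's stdlib parser grew a site_maps() accessor (in 3.8, if I recall correctly) that surfaces exactly these lines. A small check with a placeholder URL:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Sitemap: http://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Sitemap lines apply to the whole file, independent of any
# User-agent group, so they can appear anywhere in robots.txt.
print(parser.site_maps())  # ['http://example.com/sitemap.xml']
```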

I speak lousy French but it's still French. My friends are tolerant. Those unfamiliar often walk away shaking their French heads.