Regarding the "bad behaviour", I've finally nailed down a few bugs. The most notable from your point of view (and mine, actually) was that the distributed crawling wasn't actually distributed, which explains why some sites were being visited several times in a row.
Again, sorry for these hiccups.
keyplyr
7:59 am on Oct 11, 2017 (gmt 0)
Hi,
Well, again, from the webmaster point of view, a small, simple UA is best. Example:
User Agent /1.0 (+http://www.example.com/info.html)
...and all pertinent information would be found on the info.html page. You can put more info there and there's the added benefit of showing everyone who/where you are.
This also keeps the wording short in the UA. Less wording, less chance of getting caught in some filter.
Currently, I had to write more than 3 separate rules on my server to allow your bot.
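To illustrate why the short form is easy to handle on the server side, here is a minimal sketch of a parser for a UA shaped like "Name/version (+URL)". The bot name ExampleBot, the function name and the exact pattern are assumptions made up for this example, not anything a particular server or crawler uses.

```python
import re
from typing import Optional, Tuple

# Assumed shape: Name/version (+http://host/path)
# One name, one version, one parenthesised info URL, nothing else.
COMPACT_UA = re.compile(
    r"^(?P<name>[^\s/]+)/(?P<version>[\w.]+) \(\+(?P<url>https?://[^\s)]+)\)$"
)

def parse_compact_ua(ua: str) -> Optional[Tuple[str, str, str]]:
    """Return (name, version, info_url), or None if the UA is not in the short form."""
    m = COMPACT_UA.match(ua)
    if m is None:
        return None
    return m.group("name"), m.group("version"), m.group("url")

# Hypothetical bot name, used only for illustration.
print(parse_compact_ua("ExampleBot/1.0 (+http://www.example.com/info.html)"))
# A longer browser-style string does not fit the short form.
print(parse_compact_ua("Mozilla/5.0 (compatible; ExampleBot/1.0; +http://www.example.com/info.html)"))
```

With a UA this short, a single pattern is enough to recognise the bot and pull out the info URL.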
blend27
2:45 pm on Nov 19, 2017 (gmt 0)
-- I kinda wish -- UA has "Bing" in it, not from MS IP Ranges and RDNS does not have ".search.msn.com" pattern.
blend27
2:57 pm on Nov 19, 2017 (gmt 0)
@exensa, also, take a look at your blog at /exensa/exensa-has-one-more-phd/.
Not sure that matches the language or thesis of the post.
lucy24
8:01 pm on Nov 19, 2017 (gmt 0)
I think I see what you mean. If the UA string contains the element “bubing” (which it no longer does, as of a few days ago) then by definition it contains the substring “bing”, which could lead to being blocked on grounds that have nothing to do with the inherent behavior of the current robot.
But wait!
The former Barkrowler UA string said: based on BUBing (note casing)
The thing being replicated is called: BUbiNG (note casing)
bingbot is all lower-case, and can be further constrained by saying "\bbing" with a word boundary.
You should never make your rules case-insensitive (such as [NC] flag or BrowserMatchNoCase) unless there is a specific and overarching reason to do so. Here we've got three different casings, so a rule made for one will not match the others.
That's assuming we're talking about access controls, not robots.txt. But in the latter case, you'd presumably say "bingbot", so even case-insensitively the others would no longer match.
That was interesting. I hadn't previously noticed the difference between BUbiNG and BUBing. It means I did not, after all, have to worry about the wrong rules coming into play.
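The casing point can be checked with a quick sketch. This uses Python's re module rather than Apache directives, so it only mirrors the idea of a case-sensitive rule with a word boundary; the function names and the sample strings are assumptions for illustration (the bingbot string is a commonly seen form, not one quoted in this thread).

```python
import re

# Case-sensitive on purpose (no re.IGNORECASE), mirroring a rule without [NC].
# \b anchors "bing" to the start of a word, so "bubing" cannot match either.
STRICT_BING = re.compile(r"\bbing")

# The kind of loose rule being warned against: any casing, anywhere in the string.
LOOSE_BING = re.compile(r"bing", re.IGNORECASE)

def strict_match(ua: str) -> bool:
    return STRICT_BING.search(ua) is not None

def loose_match(ua: str) -> bool:
    return LOOSE_BING.search(ua) is not None

samples = [
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "based on BUBing",
    "BUbiNG",
    "bubing",
]

for ua in samples:
    print(f"{ua!r}: strict={strict_match(ua)} loose={loose_match(ua)}")
```

Only the bingbot string satisfies the strict rule, while the loose rule catches all four.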
blend27
2:51 pm on Nov 20, 2017 (gmt 0)
Most of my Access Control Rules are NOT in HTACCESS.
In this case I personally do the following: Check if the IP is from "Allowed Spider IP Range". Then check RDNS and see if it has a known and correct N characters to the right of the RDNS string.
right(rdnsString,15) NEQ '.search.msn.com' = BOOT, and I could investigate it later in the day (all headers are recorded). If $M changes RDNS, well, I will change the rule later in the day.
Then again, as in original UA reported, No Parentheses = no content, even before it gets to RDNS part. Never mind the AWS Range after that.
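The two checks described above (IP range first, then the right-hand end of the RDNS string) could be sketched roughly like this. The range below is a documentation placeholder, not a real Microsoft range, and the function names are made up for the example; the RDNS string is passed in rather than looked up so the sketch runs without network access.

```python
import ipaddress

# Placeholder only: 192.0.2.0/24 is reserved for documentation (TEST-NET-1).
# A real setup would list the search engine's actual published ranges here.
ALLOWED_SPIDER_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

# Same 15-character suffix as in right(rdnsString,15).
EXPECTED_RDNS_SUFFIX = ".search.msn.com"

def ip_in_allowed_range(ip: str) -> bool:
    """True if the address falls inside any allowed spider range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ALLOWED_SPIDER_RANGES)

def rdns_has_expected_suffix(rdns: str) -> bool:
    """True if the reverse-DNS name ends with the expected suffix."""
    return rdns.lower().rstrip(".").endswith(EXPECTED_RDNS_SUFFIX)

def is_verified_spider(ip: str, rdns: str) -> bool:
    """Both checks must pass; anything else gets booted and reviewed later."""
    return ip_in_allowed_range(ip) and rdns_has_expected_suffix(rdns)

print(is_verified_spider("192.0.2.10", "msnbot-192-0-2-10.search.msn.com"))
print(is_verified_spider("198.51.100.7", "msnbot-198-51-100-7.search.msn.com"))
print(is_verified_spider("192.0.2.10", "host.example.com"))
```

In a live check the rdns argument would come from an actual reverse lookup, for instance via socket.gethostbyaddr, with the failure case treated as a boot.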
keyplyr
10:23 pm on Feb 15, 2018 (gmt 0)
UA: Barkrowler/0.7 (+http://www.exensa.com/crawl)
lucy24
11:06 pm on Feb 15, 2018 (gmt 0)
Yes, it's been using that form since mid-November 2017.
They’ve been in my Ignore bin for a few months, so I hadn’t noticed how much they move around.
January: 51.254.blahblah
February: 54.37.blahblah
March and May (on vacation in April?): 217.182.blahblah
May 31: 147.135.blahblah
July (on vacation in June and the first 2/3 of July?): 51.15.blahblah
If it weren't for that lone May 31, I would hypothesize that they pay by the calendar month for hosting ;)
keyplyr
10:07 pm on Jul 26, 2018 (gmt 0)
More like just running crawl campaigns from different hosts & switching for whatever reason.
Makes it tedious to allow them through if the range keeps changing. Much simpler to block and forget about them.
Also, branding & trust level is diminished if a bot does not have a designated crawl range: