Barkrowler

         

keyplyr

3:11 am on Aug 31, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




UA: Barkrowler 0.1.6
Protocol: HTTP/1.1
Robots.txt: No
Host: AWS
52.208.0.0 - 52.215.255.255
52.208.0.0/13
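For anyone who wants to test an address against that range, here is a minimal sketch using Python's standard ipaddress module. The function name is made up for illustration; only the CIDR block comes from the report above.

```python
import ipaddress

# CIDR block from the report above: 52.208.0.0 - 52.215.255.255
REPORTED_RANGE = ipaddress.ip_network("52.208.0.0/13")


def in_reported_range(ip: str) -> bool:
    """Return True if the given IPv4 address falls inside the reported block."""
    return ipaddress.ip_address(ip) in REPORTED_RANGE


# First and last addresses of the block, to confirm the CIDR matches the range
first_address = str(REPORTED_RANGE[0])
last_address = str(REPORTED_RANGE[-1])
```

Indexing the network object with 0 and -1 is a quick way to confirm that the CIDR notation and the dashed range describe the same block.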

exensa

7:41 am on Oct 11, 2017 (gmt 0)

5+ Year Member



Hi, yes I'm listening

OK, I can remove the BUbiNG reference. I felt it was the polite thing to do, but not if it complicates your rules.

Regarding the UA string, which is best: include the version, as in Barkrowler/0.5.1, or not?

If you think it's OK, I'll strip it down to this:

Barkrowler - www.exensa.com/crawl - admin@exensa.com

Regarding the "bad behaviour", I've finally nailed down a few bugs. Most notable from your point of view (and mine also, actually) was that the distributed crawling wasn't actually distributed, which explains why some sites were being visited several times in a row.

Again, sorry for these hiccups.

keyplyr

7:59 am on Oct 11, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi,

Well, again, from the webmaster point of view, a small, simple UA is best. Example:

User Agent /1.0 (+http://www.example.com/info.html)

...and all pertinent information would be found on the info.html page. You can put more info there and there's the added benefit of showing everyone who/where you are.

This also keeps the wording short in the UA. Less wording, less chance of getting caught in some filter.

Currently, I had to write more than 3 separate rules on my server to allow your bot.
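To show why the short form is easy to work with, here is a rough sketch of a parser for a UA shaped like Name/version (+info-url). The pattern and the function name are assumptions for illustration, not anything a server or crawler actually mandates.

```python
import re

# Assumed shape: Name/version (+info-url). Illustrative only, not a standard.
SHORT_UA = re.compile(
    r"(?P<name>[A-Za-z][\w-]*)/(?P<version>[\d.]+) \(\+(?P<url>https?://[^\s)]+)\)"
)


def parse_short_ua(ua: str):
    """Return name, version and info URL as a dict, or None if the UA has another shape."""
    m = SHORT_UA.fullmatch(ua)
    return m.groupdict() if m else None
```

A single pattern like this covers the whole string, whereas a longer free-form UA with dashes and an email address tends to need a separate rule per variation.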

blend27

2:45 pm on Nov 19, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



-- I kinda wish --
The UA has "Bing" in it, but it is not from MS IP ranges and its RDNS does not match the ".search.msn.com" pattern.

blend27

2:57 pm on Nov 19, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@exensa, also, take a look at your blog at /exensa/exensa-has-one-more-phd/.

Not sure that matches the language or thesis of the post.

lucy24

8:01 pm on Nov 19, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think I see what you mean. If the UA string contains the element “bubing” (which it no longer does, as of a few days ago) then by definition it contains the substring “bing”, which could lead to being blocked on grounds that have nothing to do with the inherent behavior of the current robot.

But wait!
The former Barkrowler UA string said: based on BUBing (note casing)
The thing being replicated is called: BUbiNG (note casing)
bingbot is all lower-case and can be further constrained by saying "\bbing" with word boundary

You should never make your rules case-insensitive (such as [NC] flag or BrowserMatchNoCase) unless there is a specific and overarching reason to do so. Here we've got three different casings, so a rule made for one will not match the others.

That's assuming we're talking about access controls, not robots.txt. But in the latter case, you'd presumably say "bingbot", so even case-insensitively the others would no longer match.

That was interesting. I hadn't previously noticed the difference between BUbiNG and BUBing. It means I did not, after all, have to worry about the wrong rules coming into play.
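A quick way to check the casing argument is to run the three strings through a case-sensitive and a case-insensitive pattern. This sketch uses Python's re module as a stand-in for server rules; the constant names are made up, and the bingbot line is only a sample UA for comparison.

```python
import re

# Case-sensitive, with a word boundary before "bing"
STRICT = re.compile(r"\bbing")
# Roughly what a case-insensitive rule without a boundary amounts to
LOOSE = re.compile(r"bing", re.IGNORECASE)

OLD_BARKROWLER_FRAGMENT = "based on BUBing"
PROJECT_NAME = "BUbiNG"
# Sample bingbot UA, included only for comparison
BINGBOT_UA = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"


def matches(pattern: re.Pattern, ua: str) -> bool:
    """True if the pattern is found anywhere in the UA string."""
    return pattern.search(ua) is not None
```

With these definitions, STRICT matches only the bingbot sample, while LOOSE also catches both BUBing and BUbiNG, which is the false positive a case-insensitive rule invites.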

blend27

2:51 pm on Nov 20, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Most of my Access Control Rules are NOT in HTACCESS.

In this case I personally do the following: check if the IP is from an "Allowed Spider IP Range". Then check RDNS and see whether the rightmost N characters of the RDNS string match a known, correct suffix.

right(rdnsString,15) NEQ '.search.msn.com' = BOOT, and I can investigate it later in the day (all headers are recorded). If $M changes its RDNS, well, I will change the rule later in the day.

Then again, as with the original UA reported, no parentheses = no content, even before it gets to the RDNS part. Never mind the AWS range after that.
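The suffix test above could be sketched in Python roughly as follows. The function names and the injectable resolver are assumptions for illustration, not the actual server-side code, and a real setup would still apply the IP range check first.

```python
import socket

EXPECTED_SUFFIX = ".search.msn.com"  # 15 characters, as in right(rdnsString,15)


def rdns_has_suffix(hostname: str, suffix: str = EXPECTED_SUFFIX) -> bool:
    """Compare the rightmost characters of the RDNS name with the expected suffix."""
    return hostname.lower().rstrip(".").endswith(suffix)


def lookup_rdns(ip: str):
    """Reverse-resolve an IP; return None when there is no PTR record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are both OSError subclasses
        return None


def should_boot(ip: str, resolver=lookup_rdns) -> bool:
    """True when the visitor should be refused: no RDNS at all, or the wrong suffix."""
    host = resolver(ip)
    return host is None or not rdns_has_suffix(host)
```

Passing the resolver in as a parameter keeps the suffix logic testable without touching the network; in production the default lookup_rdns would be used.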

keyplyr

10:23 pm on Feb 15, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



UA: Barkrowler/0.7 (+http://www.exensa.com/crawl)

lucy24

11:06 pm on Feb 15, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, it's been using that form since mid-November 2017.

keyplyr

11:33 pm on Feb 15, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And now it's documented :)

keyplyr

8:46 pm on Jul 26, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



now coming from...

Host: online.net
51.15.0.0 - 51.15.255.255
51.15.0.0/16

Wonder if it still uses AWS?

Obeying robots.txt

lucy24

9:47 pm on Jul 26, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Wonder if it still uses AWS?
They’ve been in my Ignore bin for a few months, so I hadn’t noticed how much they move around.

January: 51.254.blahblah
February: 54.37.blahblah
March and May (on vacation in April?): 217.182.blahblah
May 31: 147.135.blahblah
July (on vacation in June and the first 2/3 of July?): 51.15.blahblah

If it weren't for that lone May 31, I would hypothesize that they pay by the calendar month for hosting ;)

keyplyr

10:07 pm on Jul 26, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



More like just running crawl campaigns from different hosts & switching for whatever reason.

Makes it tedious to allow them through if the range keeps changing. Much simpler to block and forget about them.

Also, branding & trust level is diminished if a bot does not have a designated crawl range:
host crawl-**-***-**-*.example-bot.com

Smaller companies will always have this issue.
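When a bot does have a designated host pattern like the one above, a webmaster can check that the RDNS name follows it and that the IP embedded in the name is the IP that connected. The sketch below assumes a naming scheme modelled on the example-bot.com placeholder; the pattern and function name are hypothetical.

```python
import ipaddress
import re

# Hypothetical scheme modelled on the placeholder: crawl-A-B-C-D.example-bot.com
CRAWL_HOST = re.compile(
    r"crawl-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.example-bot\.com"
)


def host_matches_ip(hostname: str, ip: str) -> bool:
    """True if the hostname follows the scheme and embeds the connecting IP."""
    m = CRAWL_HOST.fullmatch(hostname.lower().rstrip("."))
    if not m:
        return False
    embedded = ".".join(m.groups())
    try:
        return ipaddress.ip_address(embedded) == ipaddress.ip_address(ip)
    except ValueError:  # e.g. an octet above 255
        return False
```

This only checks the name itself. A fuller check would also resolve the hostname forward and confirm it returns the same IP, since a PTR record can be set to anything by whoever controls the address block.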