

Barkrowler

     
3:11 am on Aug 31, 2017 (gmt 0) - keyplyr (moderator)



UA: Barkrowler 0.1.6
Protocol: HTTP/1.1
Robots.txt: No
Host: AWS
IP range: 52.208.0.0 - 52.215.255.255 (52.208.0.0/13)
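
For anyone who simply wants it gone, here is a minimal Apache 2.4 sketch (illustrative only, not my actual ruleset; adjust the UA token and the range to taste):

# deny by UA token and, optionally, by the AWS range listed above
SetEnvIfNoCase User-Agent "Barkrowler" deny_bot
<RequireAll>
Require all granted
Require not env deny_bot
Require not ip 52.208.0.0/13
</RequireAll>
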
6:38 pm on Sept 7, 2017 (gmt 0) - lucy24 (Senior Member)


IP: distributed across AWS. In addition to the tried-and-true 52 and 54 ranges, it has used 34.240 and 34.248-252. (I'm not very familiar with this neighborhood, but it looks like it started out belonging to Halliburton and they've been selling it off to AWS.) Each individual visit uses a single IP from beginning to end.

UA: Barkrowler 0.1.6
subsequently replaced (today!) by
Barkrowler experimental crawler - 0.4.7
(still with no contact information in the string)

Robots.txt: Yes

I usually update my access controls at the beginning of the month. Since this one first showed up near the end of a month, the whole thing had an unusually short evaluation time.

Behavior:
began by asking for robots.txt followed by one or two pages that met a 403 due to deficient headers.

after being denied in robots.txt, it stopped asking for pages

later, finding itself authorized (i.e. no longer denied in robots.txt) and having a hole poked for it, asked for root several times, receiving a 200. The next day, emboldened, it spidered the entire site, omitting roboted-out directories ... six times over the span of three hours, with four more the next day. It is probably crawling as we speak.

YMMV, FWIW and so on. Watch This Space.
7:08 pm on Sept 7, 2017 (gmt 0) - keyplyr (moderator)


This thing remains a 403d pest. Without supplying credentials, there can be no access.

An acceptable UA string would look something like this:
Example User Agent /1.0 (+http://www.example.com/info.html)

A bot info page should describe:
1.) Who are they?
2.) What are they after at our sites?
3.) What will they do with the data they retrieve?
4.) Why should we allow them to take our property? How does it benefit the site owner?
10:30 pm on Sept 7, 2017 (gmt 0) - lucy24 (Senior Member)


Is it possible they don't like your face for some obscure reason? For my part, as soon as I denied them in robots.txt they stopped requesting pages. No further 403 needed.

For those who are just joining us: Even when you have no intention of ever admitting a given user-agent, it's worth denying them in robots.txt. On rare occasions they turn out to be compliant--and when it comes to unwelcome visitors, the only thing better than a blocked request is no request at all.
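
For those who want the concrete form, a full denial by name in robots.txt is just two lines (the token comes from the UA string documented above):

User-agent: Barkrowler
Disallow: /
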
10:53 pm on Sept 7, 2017 (gmt 0) - keyplyr (moderator)


Requesting robots.txt has no weight in whether I allow a UA or not. Many (most?) malicious agents request robots.txt.

I include it in the UA documentation for information purposes only.
4:26 am on Sept 8, 2017 (gmt 0) - lucy24 (Senior Member)


It isn't about whether they ask for it. It's about whether they obey it. With me, the vast majority of robots--including brand-new ones I've never met before--are blocked by default. If I find repeated requests for robots.txt from the same entity, I'll try denying them by name and see if that makes them stop requesting pages.
4:38 am on Sept 8, 2017 (gmt 0) - keyplyr (moderator)


So you poke a hole in the IP range block rule to allow the UA which you then block by name in robots.txt to see if it obeys?
3:55 pm on Sept 8, 2017 (gmt 0) - lucy24 (Senior Member)


I stopped blocking IP ranges a couple years ago when I went over to header-based access controls. At any given time I'll have a few ranges blocked to address some temporary infestation, but I'm never going back to the endless whack-a-mole of having to add one range after another after another.
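
For illustration only, a rough .htaccess sketch of that kind of setup (not my actual rules; the range is a placeholder): block requests that arrive without an Accept header, plus one temporary range.

RewriteEngine On
# temporary range block - placeholder range, removed once the infestation passes
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\. [OR]
# requests that send no Accept header at all
RewriteCond %{HTTP_ACCEPT} ^$
RewriteRule ^ - [F]
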
7:16 pm on Sept 8, 2017 (gmt 0) - keyplyr (moderator)


Oh yeah, I remember you doing that now. That approach isn't surgical enough for the complex rules I need with ad marketing platforms.

If you were playing the whack-a-mole game of having to add one range after another after another... then I don't blame you for wanting to try something different.

I use a succinct cascade filter starting with headers, then ranges, then behavior. I've been perfecting its efficacy for years, especially the last part of the filter.
1:08 pm on Sept 9, 2017 (gmt 0) - Senior Member


I also see it from 34.240.0.0/13
Host: AWS
8:52 pm on Sept 9, 2017 (gmt 0) - New User


It's technically operating from the AWS eu-west-1 (Ireland) region.

The company behind it is eXenSa (exensa.com); you can get that information from the "From" HTTP header.

[edited by: keyplyr at 8:22 pm (utc) on Sep 10, 2017]
[edit reason] delinked URL, removed promo [/edit]

9:55 pm on Sept 9, 2017 (gmt 0) - lucy24 (Senior Member)


Follow-up, because mid-Thursday left me with avid curiosity about the ongoing crawl:

From midday on the 6th to end-of-day on the 7th (Pacific time) they made a total of 23 crawls of exactly 160 pages, representing about half the site's authorized content. (I know this thanks to certain other robots which do periodic full spiderings of the whole site, saving me the trouble of counting.) Most but not all of the 160 were the same each time. This crawl included one blocked .midi file, revealing that they don't understand the *.xtn locution in robots.txt.
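
(For the record, the locution in question looks like the example below. The * wildcard and the $ end-anchor are extensions honored by the major engines rather than part of the original robots.txt protocol, which may be why this crawler misses them.)

User-agent: *
Disallow: /*.midi$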

Then, on the morning of the 8th (still Pacific time, so afternoon/evening in Paris), they did 6 separate shorter crawls involving just 16 pages each time. The 16 consisted of:
-- all top-level authorized /dir/
-- all /dir/subdir/ within two of those directories, but not in two others which also contain well-populated subdirectories
-- within a fifth directory, the exact list
/dir/page.html
/dir/subdir1/
/dir/subdir2/
/dir/subdir3/
where each of those subdirectories contains an index page and nothing else*, again disregarding 50 or so other subdirectories, some of which contain multiple pages.

I confess to some measure of curiosity.


* In my ebooks directory, each subdirectory is a title, so most contain only a single page plus /images/ unless it's a huge book or series.
10:16 am on Sept 11, 2017 (gmt 0) - New User from DE


They have slightly changed the User agent to "Barkrowler experimental crawler - 0.4.7". They are all over my logs this morning.
10:23 am on Sept 11, 2017 (gmt 0) - keyplyr (moderator)


Yes Jonas, Lucy noted that above. However, I'm still only seeing the first one.

Both UAs may be operating concurrently, separately tasked.
8:28 pm on Sept 12, 2017 (gmt 0) - lucy24 (Senior Member)


The adventure continues...

On the 10th they continued the package of 16 requests--some always the same, some varying.

The 11th they took the day off.

Today (the 12th) they've been focusing on the same four /subdir/ files, all in one directory, over and over again. They are not the newest files in their directory--I've got three that are more recent--but I'm pretty sure* they are the four most recently added to a public directory that generates both human and robotic traffic. (I wish there were a different word than “directory”, as I'm starting to confuse myself too.)

They've also been crawling another site, but that one has so few pages that I'm pretty sure they are simply spidering the whole thing every time.


* “Pretty sure” = I’m certain about three, but don’t know for sure on the fourth, and it’s not worth looking up.
7:53 am on Sept 14, 2017 (gmt 0) - New User (exensa)


Hi,
I'm the one behind this crawl. First, I'm here to apologize: I'm sorry for the disturbance I created. I didn't realize it would cause such trouble. We're a very small company trying to create a 'similar sites' web search engine, and these are my first steps into crawling.

I'm working on a distributed web crawler based on BUbiNG. <snip>

It's supposed to respect robots.txt, and I had a politeness setting of 10 sec per host and 10 sec per IP.
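
For reference, in a BUbiNG-style properties file those politeness settings look roughly like this (values in milliseconds; treat the exact property names as approximate, since they may differ between versions):

# delay between two requests to the same host (scheme+authority), in ms
schemeAuthorityDelay=10000
# delay between two requests to the same IP address, in ms
ipDelay=10000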

However, it seems that it doesn't work the way it's expected to; I suspect the distribution is failing to work correctly.

The fact that it keeps on trying to access 403'd pages is a waste of time and resources from my point of view, so I need to find a way to limit that too.

Anyway, I'm going to try to understand why it fails to correctly enforce the politeness settings before I start a new crawl.

Sorry again.

Guillaume Pitel

[edited by: keyplyr at 8:15 am (utc) on Sep 14, 2017]
[edit reason] Removed link per ToS [/edit]

8:14 am on Sept 14, 2017 (gmt 0) - New User (exensa)


Addendum: I've changed the userAgent string: userAgent=Barkrowler/0.4.7 (experimental crawler based on BUbiNG
<snip> <snip>
And I've tried to answer keyplyr's questions here : <snip>
Tell me if I can do something else to make amends :)

Guillaume Pitel

[edited by: keyplyr at 8:18 am (utc) on Sep 14, 2017]
[edit reason] Removed link per ToS [/edit]

8:19 am on Sept 14, 2017 (gmt 0) - keyplyr (moderator)


Hello exensa and welcome to WebmasterWorld [webmasterworld.com]

Thanks for the information. Please no links.
8:26 am on Sept 14, 2017 (gmt 0) - New User (exensa)


Right, so the answers are at www dot exensa dot com / crawl

Barkrowler

Barkrowler is our experimental and very fresh version of the BUbiNG crawler (it's basically BUbiNG with our pull requests applied and the right configuration for distribution on EC2)

It's supposed to respect robots.txt, and have a politeness setting per HOST and per IP.

However I've received several reports that in several cases, the politeness setting is not enforced.

We are currently investigating this, hoping to stop this problematic behaviour very soon.

1.) Who are we?

Exensa is a very small French company specializing in large-scale text data analysis. We have worked on social networks, legal documentation, and e-commerce.

To give you an idea, we have a small demo of a Wikipedia page-similarity service:

Wikinsights (wikinsights dot org)

2.) What are we after at your sites?

We crawl the web at large, so there is no particular target - except, maybe, certain languages for experimental purposes. We want to identify the semantic / thematic orientation of websites and pages.

3.) What will we do with the data we retrieve?

For now, our goal is to provide a "same site" search engine which is better than the alternatives, especially for the long tail (current alternatives only let you find the first 10/20 similar sites).

There is no beta online yet (that's why we need to perform a crawl), but we hope to have one very soon.

4.) Why should you allow us to take your property? How does it benefit you, the site owner?

People looking for information sources, customers, or providers, or trying to identify competition or possible cooperation, may find our tool useful.

So even though we won't bring you as much traffic as Google, Bing or similar web search engines, the traffic we will provide should be of very high value (and otherwise we won't bother you for long...)
8:31 am on Sept 14, 2017 (gmt 0) - keyplyr (moderator)


Thanks again. This information is what webmasters need in order to decide whether to give access.

Without an info page, you can see how site owners react :)
6:03 pm on Sept 14, 2017 (gmt 0) - lucy24 (Senior Member)


a politeness setting per HOST and per IP

I think it currently doesn't understand the Crawl-Delay directive. (Neither, as we all know, does the world's leading search engine.) So that would be a good start. I think mine's currently 3 seconds; Barkrowler has definitely been crawling faster. But it has never done those blizzard crawls of hundreds of requests in a few seconds.
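
(For reference, the directive looks like this in robots.txt; the value is in seconds, and support varies from crawler to crawler.)

User-agent: Barkrowler
Crawl-delay: 3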

The fact that it keeps on trying to access 403'd pages is a waste of time and resources

It's interesting you say this, because in my case the crawler didn't start requesting multiple pages until after I authorized it; earlier, if the first request met a 403 it didn't ask for anything else. It didn't even request pages linked from the 403 page. I realize this may depend on how the crawler's “shopping list” is constructed; obviously you don't just start with the root and work outward.

:: detour to check ::

It looks like you are similar to BUbiNG--which I currently permit-and-ignore--in the ways that matter on my site, so that won't cause any problems.
5:56 am on Sept 15, 2017 (gmt 0) - New User (exensa)


The Crawl-Delay directive, being non-standard, is probably ignored by the library used by BUbiNG to parse and analyze robots.txt. I'll look into it; it shouldn't be very hard to change.

I have already noticed several small issues with BUbiNG: 301 redirects are followed immediately (no delay), as is robots.txt.

Question: from your point of view, is it OK to perform the next request right after receiving a 301? I think this behaviour will be hard to change because it's handled at the connection level.
6:01 am on Sept 15, 2017 (gmt 0) - keyplyr (moderator)


is it OK to perform the next request right after receiving a 301?
That's normal.

I gave your UA an exception to the server condition that was previously blocking access. No other conditions apply. Good luck with your index.
5:34 pm on Sept 15, 2017 (gmt 0) - lucy24 (Senior Member)


301 redirects are followed immediately (no delay), as is robots.txt
Once again: Funny you should say that. Yesterday I noticed Barkrowler on one site it hasn't yet crawled, prompting me to check earlier logs for possible blocked visits. Now, this happens to be a site that went HTTPS earlier in the year, so all HTTP requests including robots.txt* get a 301 response.

30 August (before Barkrowler was mentioned in robots.txt, and hence before it was authorized at all), in chronological order:
HTTP robots.txt 301
HTTPS robots.txt 200
HTTP one page 403

12 September (Pacific time, would be 13 September in Paris)
HTTP no less than six requests for robots.txt, all 301
HTTPS ... nothing
Odd. It made me wonder if the robot got confused by redirected robots.txt requests, and didn't know what to do next. (Did its code get tweaked in the intervening two weeks? It sounds as if it did.)

On the main question I agree with keyplyr: If you get a redirect, it makes most sense to follow it up immediately. After all, the redirect target might be something different if you come back the next day. (Sure, it isn't supposed to change that fast with a 301, but you never know.) If you're meeting a 302, there's no point in following the redirect at all unless you do it immediately. This seems to be the trend with major search engines: follow redirects within a few minutes, unless you happen to have crawled the target page within the past hour or so.

A more worrying observation is that, as of 16:00 on 12 September (Pacific time, so around 04:00 on 13 September, Paris time), on my primary site the robot seems to have forgotten all about robots.txt. Up to that time, it carefully requested robots.txt at the beginning of each visit.


* I have been thinking about excluding robots.txt from the redirect, for various reasons, but haven't yet got around to it.
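
If I do, a minimal .htaccess sketch of that exclusion (illustrative only) would be something like:

RewriteEngine On
RewriteCond %{HTTPS} !=on
# leave robots.txt reachable over plain HTTP rather than redirecting it
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
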
7:20 am on Sept 17, 2017 (gmt 0) - keyplyr (moderator)


Funny, since I removed the block it hasn't returned.
7:04 am on Sept 18, 2017 (gmt 0) - New User (exensa)


Yup, for now we're experimenting :) So I do not crawl on a regular basis. You shouldn't see Barkrowler again for one or two weeks.