Forum Moderators: Ocean10000

CRIM Crawler

Centre de Recherche Informatique de Montreal

7:55 pm on Aug 6, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2067
votes: 2


I'm all for all Nutch spawn reading+heeding robots.txt. But they don't need to hit the file every 20 minutes for hours and hours, a la:

CRIM Crawler/Nutch-2.3 (Crawler du Centre de Recherche Informatique de Montr\xc3\xa9al (CRIM))

Note to CRIM: That acute accent in Montréal does NOT compute in your UA.
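
For anyone wanting to confirm what those escapes stand for, here is a minimal sketch. It assumes the server log writes each non-ASCII byte as a literal \xNN sequence, as in the UA above; the helper name decode_escaped_ua is made up for illustration.

```python
import re

def decode_escaped_ua(logged: str) -> str:
    """Turn literal \\xNN escapes from a log line back into readable text."""
    # Replace each literal "\xNN" with the single byte it stands for,
    # then interpret the resulting byte string as UTF-8.
    raw = re.sub(
        rb"\\x([0-9a-fA-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        logged.encode("utf-8"),
    )
    return raw.decode("utf-8", errors="replace")

logged_ua = (
    r"CRIM Crawler/Nutch-2.3 (Crawler du Centre de Recherche "
    r"Informatique de Montr\xc3\xa9al (CRIM))"
)
print(decode_escaped_ua(logged_ua))
```

The two bytes C3 A9 are the UTF-8 encoding of é, so the decoded string ends in "Montréal (CRIM))".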

Thus far, the CRIM crawler's hailed from:

132.217.254.57
132.217.254.68

Mothership details:

Canada Montreal Centre De Recherche Informatique De Montreal
NetRange: 132.217.0.0 - 132.217.255.255 [132.217.0.0/16]

(That's a boatload of IPs!)
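
To put a number on "boatload", a quick sketch using Python's standard ipaddress module. The network and the two addresses are the ones quoted above; the variable names are just for illustration.

```python
import ipaddress

# The /16 from the NetRange line above.
crim_net = ipaddress.ip_network("132.217.0.0/16")

# The two addresses the crawler has been seen from so far.
seen = ["132.217.254.57", "132.217.254.68"]

print(crim_net.num_addresses)      # size of the whole block
print(crim_net[0], crim_net[-1])   # first and last address in the range
for ip in seen:
    # Membership test: is this address inside the /16?
    print(ip, ipaddress.ip_address(ip) in crim_net)
```

A /16 leaves 16 host bits, so the block holds 2^16 = 65536 addresses, and both observed IPs fall inside it.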

No clue for whom or why they're crawling. Site claims: "One of the foremost IT applied research centres in Canada..." [crim.ca...]
11:55 pm on Aug 7, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


One of my filters blocks any UA matching (crawl|nutch|seo|spider|walker) unless it is whitelisted by some other attribute, so this actor does not get through. However, looking at the company and their /16, unless I see something beneficial from them, it's all blocked.
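
A rough sketch of that kind of filter, for illustration only. The pattern is the one quoted above; the case-insensitive matching, the whitelist mechanism, and the name "FriendlySpider" are assumptions, not a description of the actual setup.

```python
import re

# Pattern quoted in the post; matching case-insensitively is an assumption.
BOT_PATTERN = re.compile(r"(crawl|nutch|seo|spider|walker)", re.IGNORECASE)

# Hypothetical whitelist: substrings that exempt a UA from the block.
WHITELIST = ("FriendlySpider",)

def is_blocked(user_agent: str) -> bool:
    """Return True if the UA matches the bot pattern and is not whitelisted."""
    ua = user_agent.lower()
    if any(name.lower() in ua for name in WHITELIST):
        return False
    return BOT_PATTERN.search(user_agent) is not None

crim_ua = "CRIM Crawler/Nutch-2.3 (Crawler du Centre de Recherche Informatique de Montr\\xc3\\xa9al (CRIM))"
print(is_blocked(crim_ua))
print(is_blocked("FriendlySpider/1.0"))
print(is_blocked("Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0"))
```

The CRIM UA trips the pattern twice over ("Crawl" and "Nutch"), so it would be blocked unless something whitelists it.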
12:28 am on Aug 8, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15932
votes: 885


That acute accent in Montréal does NOT compute in your UA.

Yes, I had to go look it up and verify that their computer uses UTF-8.

Query: If you know in advance that your UA string will be x-encoded, wouldn't it be a smarter move not to use non-ASCII characters in the first place?

:: wandering off to satisfy idle curiosity about how often I get an encoded User Agent, and whether any of them has ever been anything but an unwanted robot ::

Yup. The vast majority are Russian referer spam with a leading \xblahblah that decodes to the Zero Width No-Break Space, aka Byte Order Mark; in other words, not just malign but stupid. But along with the scattering of German research institutes I find
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C; .NET4.0E; Centre de Sant\xc3\xa9 Inuulitsivik; Centre de Sant\xc3\xa9 Inuulitsivik)
sic repetition, and thank you Health Centre, I heard you the first time.
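
On the leading escape sequence in the spam entries: assuming it is the UTF-8 form of the Byte Order Mark (bytes EF BB BF, logged as literal \xNN text), a check could be sketched like this. The function name and the example referer are hypothetical.

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF, which decode to U+FEFF.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert codecs.BOM_UTF8.decode("utf-8") == "\ufeff"

def starts_with_escaped_bom(logged: str) -> bool:
    """True if a logged field begins with the literal text \\xef\\xbb\\xbf."""
    return logged.lower().startswith(r"\xef\xbb\xbf")

# Hypothetical log field, with the escapes as literal backslash sequences.
example_referer = r"\xef\xbb\xbfhttp://example.com/"
print(starts_with_escaped_bom(example_referer))
print(starts_with_escaped_bom("http://example.com/"))
```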

:: wandering off again to confirm hunch that Inuulitsivik is somewhere in Nunavik ::
7:21 pm on Sept 10, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2067
votes: 2


CRIM switched to a branded Heritrix today. Got robots.txt, ignored it; ignored 403s, too. Jerks.

132.217.254.68
Mozilla/5.0 (compatible; heritrix/3.2.0 +http://www.crim.ca)

11:23:46 /robots.txt 200
11:23:58 / 403
11:24:04 /dir/fileA.html 403
11:24:18 /dir/fileB.html 403
11:24:24 /dir/FileC.html 403
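
For tallying this sort of thing from a log, a minimal sketch that counts how many requests kept coming after the first 403. It assumes the simplified "time path status" layout shown above rather than a real access-log format, and the function name is made up.

```python
LOG = """\
11:23:46 /robots.txt 200
11:23:58 / 403
11:24:04 /dir/fileA.html 403
11:24:18 /dir/fileB.html 403
11:24:24 /dir/FileC.html 403
"""

def requests_after_first_403(log_text: str) -> int:
    """Count requests made after the client first received a 403."""
    seen_403 = False
    count = 0
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _time, _path, status = line.split()
        if seen_403:
            count += 1           # client kept going despite the refusal
        elif status == "403":
            seen_403 = True      # first refusal; start counting from here
    return count

print(requests_after_first_403(LOG))
```

For the excerpt above this gives 3: the crawler was refused at / and still went on to request three more files.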

CRIM CIDR, meet iptables DROP
8:24 pm on Sept 10, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


I think heritrix was one of the very first UAs I ever blocked.
9:49 pm on Sept 10, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15932
votes: 885


ignored 403s, too

I tend to find it more scary when a robot does pay immediate attention to a 403, and modify its behavior accordingly, because it means some intelligence went into its programming. Most of the time they've got a shopping list and nothing's going to stop them from making every last request. Criminal masterminds are interesting on other people's sites, but on my own turf, I'd rather stick with dumb crooks.
9:10 am on Dec 3, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


Decided to let CRIM index pages across sites. May help to open up the Canadian marketing interests.

Mozilla/5.0 (compatible; heritrix/3.2.0 +http://www.crim.ca)
132.217.0.0 - 132.217.255.255
132.217.0.0/16

BTW - heritrix (branded or not) does not support robots.txt, never has. IMO CRIM needs a standards-compliant, custom-built web crawler if they are going to be in the spidering/indexing business.