oBot

         

keyplyr

10:44 pm on Mar 14, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



UA: Mozilla/5.0 (compatible; oBot/2.3.1; +http://filterdb.iss.net/crawler/)
Protocol: HTTP/1.1
Robots.txt: Yes
Host: IBM Deutschland GmbH
194.153.113.0 - 194.153.113.255
194.153.113.0/24
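
For anyone who wants to test a logged address against the range above, here is a minimal Python sketch using the standard `ipaddress` module. The function name `in_obot_range` is made up for illustration, and the range is only the one quoted in this post; the crawler may well use other addresses too.

```python
import ipaddress

# The /24 quoted above for IBM Deutschland GmbH. This is an assumption about
# what to match; other ranges may also be in use.
OBOT_RANGE = ipaddress.ip_network("194.153.113.0/24")


def in_obot_range(ip: str) -> bool:
    """Return True if the address falls inside the listed /24 (hypothetical helper)."""
    return ipaddress.ip_address(ip) in OBOT_RANGE


if __name__ == "__main__":
    for addr in ("194.153.113.35", "194.153.114.1"):
        print(addr, in_obot_range(addr))
```

A check like this only confirms the address range. It says nothing about whether the UA string on the request is genuine.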

There's been a lot of speculation about this agent over the years. They now include an info page:
oBot is the web crawling bot of the Content Security Division of IBM Germany Research & Development GmbH. We use several computers to crawl webpages and a large computer cluster to categorize the content of these pages... [to create] webfilter database that is made available to our customers in several content filtering products including an SDK for OEM partners.
Previous discussions: [webmasterworld.com...] [webmasterworld.com...]

lucy24

11:45 pm on Mar 14, 2016 (gmt 0)


Ooh, ooh, I know them :) They apparently crawl every few years and then remember their results forever, so from now on into perpetuity they will keep asking for any files-- including images-- that were on your site in 2012. Just pages and images; nothing behind-the-scenes like scripts or CSS. In some way they must be interested in visible content.

:: detour to Multi-File search (with "Case sensitive" button* most emphatically clicked!) to refresh memory on details ::

:: further pause to wonder irritably why TextWrangler thinks "0" (zero) comes before "1" (one) in the alphabet ::

They do appear to obey robots.txt, at least in its fundamentals: their sole appearance in my test site's logs was a request for robots.txt -- just a few weeks ago, at that.
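
If you want to see how a robots.txt rule aimed at this agent would be read, here is a rough sketch using Python's standard `urllib.robotparser`. The rules shown are hypothetical (they are not my actual robots.txt), and `obot_allowed` is a made-up helper. It only shows how a compliant parser interprets the rule, not what oBot itself does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: shut out the "oBot" token, leave everyone else alone.
RULES = """\
User-agent: oBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def obot_allowed(path: str) -> bool:
    """Ask the parser whether the 'oBot' token may fetch this path."""
    return parser.can_fetch("oBot", path)


if __name__ == "__main__":
    print(obot_allowed("/images/example.png"))
    print(parser.can_fetch("ExampleCrawler", "/images/example.png"))
```

Note that this parser matches agent names case-insensitively and by substring, so it is looser than a case-sensitive log search.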

In detail: (each visit begins in robots.txt)
#1 example.com (formerly my all-purpose site, since end of 2013 reduced to two directories plus storage):

206.253.224.22 7 January 2012: HEAD for all images (only) belonging to front page. This is infuriating, because it tells me their first visit was before I started saving raw logs.

Later the same day: repeated HEAD for all images that got a 404 the first time around (I rarely bother to redirect images).

206.253.224.18 14 January 2012: GET for all the same stuff, minus the previous 404s, plus one midi file belonging to an interior page, followed by that same interior page, plus 404 for image associated with another, wholly unrelated interior page. (Darn! If only I'd started logging sooner!)

206.253.226.22 29 May 2013: HEAD for everything including the 404s, again with a repeat visit some hours later to re-request the 404s. (Gosh, they don't give up, do they?)

194.153.113.35 3 November 2013: All requests except robots.txt were blocked, presumably due to changed IP. GET front page and seven of the eight crawlable directories linked therefrom, all without trailing / (don't know how they would have responded to the resulting 301).

206.253.226.22 16 December 2014: GET front page and all images currently associated with it, and also non-script version of piwik.

206.253.226.18 5 January 2015: HEAD front page and all images associated with it before December 2013, with usual follow-up a few hours later for those images that got a 301 or 410. (I must never have emptied out the /images/ directory-- in fact I still haven't-- or there would have been a lot more. Oops.)

206.253.226.12 8 September 2015, by which time they were blocked all around. And no wonder, as they seem to have forgotten about robots.txt. GET one 410'd image, and piwik for the two other sites which they crawled at exactly the same time.

206.253.224.14 2 December 2015: GET front page and only the two pages currently linked from it, still without trailing slash.

#2 example.net (my current primary site, established late 2013):
206.253.226.22 12 October 2014: GET front page and all images belonging to it.

206.253.226.18 and .12 5 January 2015: two HEAD requests for same image (redirected from other site), netting a 404.

206.253.226.12 8 September 2015: Yowzuh. This was a long one. As usual, all trailing / missing, but this time they followed 301s immediately. Front page; /fonts/ and all images belonging to its index page; /ebooks/ and ditto; and so on through all top-level directories on this site. Still no stylesheets.

195.212.29.180 30 September 2015: Front page and three separate GET for one specific image.

206.253.226.12 22 February 2016: HEAD for all images currently associated with front page, plus one that's shared by all top-level directory index pages.

#3 example.org (art studio's site):
206.253.226.23 8 September 2015: Evidently a busy day for them; here too they picked up the front page and its directly associated images (but not the ones referenced in the on-page CSS).


* This did not prevent a few "SolomonoBot" from sneaking in. It's much faster with GREP turned off, though.
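
For the record, here is a rough Python sketch of a stricter match that would keep names like "SolomonoBot" out: look for "oBot/" case-sensitively, and only when it is not preceded by a letter. The helper name `is_obot` and the sample strings in the usage lines are made up for illustration (only the oBot UA itself comes from this thread).

```python
import re

# "oBot/" not immediately preceded by a letter. Python regexes are
# case-sensitive unless told otherwise, so "obot/" and "OBOT/" do not match.
OBOT_UA = re.compile(r"(?<![A-Za-z])oBot/")


def is_obot(user_agent: str) -> bool:
    """Return True if the UA string carries a standalone 'oBot/' token."""
    return OBOT_UA.search(user_agent) is not None


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; oBot/2.3.1; +http://filterdb.iss.net/crawler/)",
        "SolomonoBot/1.0",  # invented UA format, just to exercise the pattern
    ]
    for ua in samples:
        print(is_obot(ua), ua)
```

The same pattern should work in any editor with GREP-style search that supports lookbehind, though as noted it will be slower than a plain case-sensitive search.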