the oBot returns

         

lucy24

12:29 am on Sep 9, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



... Well, returns to me, anyhow.

Thread from winter 2011/2012 [webmasterworld.com]
Thread from 2013 [webmasterworld.com]

Some interesting changes, though obviously still the same critter.
IP: 206.253.226.12
UA: Mozilla/5.0 (compatible; oBot/2.3.1; +http://filterdb.iss.net/crawler/)

Behavior: visit began with robots.txt, immediately followed by front page, brief intermission, and then 60 further requests in 30 seconds for a steady average of 2 per second. (Like the Googlebot, they clearly do not know how to read the Crawl-Delay directive, but really, isn't no-more-than-one-per-second a pretty good rule of thumb?)

Now the fun begins. Nothing was requested from either of the roboted-out directories whose content is linked from the front page. So that's good. But they must be going for some kind of Robotic Stupidity prize, because for each of the six permitted directories, requests went:
/directory
(without trailing slash, leading to mod_dir redirect to)
/directory/
(i.e. the form the front page linked to in the first place)
and then
/images/something
/images/somethingelse
et cetera, each time referring to images whose actual URL is
/directory/images/etcetera
linked from directory-index pages in the form
images/etcetera

With all those misread subdirectory-images, it seems to have escaped their notice that there are also images linked directly from the front page, so the only /images/ they didn't ask for were the ones that do exist.*

I don't know if this is a domino effect whereby they think they're in the root (page at "/directory") when in fact they're in a subdirectory, or whether they just don't understand how relative links work. Net result: 62 requests, of which just 9 were successful. There was even a bonus redirect to an entirely different site, thanks to one subdirectory having the same name as a top-level directory in that other site.**
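The first guess above (relative links being resolved against the slashless "/directory" URL) is easy to reproduce with Python's standard urllib.parse.urljoin. This is only a sketch: example.com is a stand-in host, and nothing here says how oBot actually resolves links. It just shows that the slashless base yields exactly the /images/... requests seen in the logs.

```python
from urllib.parse import urljoin

# Hypothetical host; the paths mirror the ones described in the post.
link = "images/etcetera"

# Base WITHOUT the trailing slash: "directory" is treated like a file name,
# so the relative link resolves against the root.
wrong = urljoin("http://example.com/directory", link)

# Base WITH the trailing slash (the URL after the mod_dir redirect):
# the relative link resolves inside the directory, as intended.
right = urljoin("http://example.com/directory/", link)

print(wrong)  # http://example.com/images/etcetera
print(right)  # http://example.com/directory/images/etcetera
```

So a crawler that keeps the pre-redirect URL as its base, instead of the URL it was redirected to, would produce this request pattern without misunderstanding relative links as such.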

No, wait, I take it back: there was one robots.txt violation. Apparently you only have to read robots.txt for the site that contains the html; they also requested two piwik files which live in a roboted-out directory on a different site. (This, in turn, tells me that they paid a brief visit to yet a third site-- but were apparently so baffled by its link structure, they just grabbed a couple of images and quickly left.)
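To illustrate the per-site point: robots.txt rules belong to the host that serves the file, so a crawler has to look up the rules for the host of each URL it requests, not the host of the page that linked to it. A minimal sketch using Python's standard urllib.robotparser, where the host names, the rules, and the allowed() helper are all made up for illustration:

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def make_rules(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical hosts and rules: site B disallows its piwik directory,
# site A says nothing about /piwik/ at all.
rules_by_host = {
    "site-a.example": make_rules("User-agent: *\nDisallow: /private/\n"),
    "site-b.example": make_rules("User-agent: *\nDisallow: /piwik/\n"),
}


def allowed(url: str, agent: str = "oBot") -> bool:
    """Check a URL against the rules of the host that serves it."""
    host = urlsplit(url).netloc
    return rules_by_host[host].can_fetch(agent, url)


piwik_url = "http://site-b.example/piwik/piwik.js"

# Correct lookup: site B's own rules forbid the request.
print(allowed(piwik_url))

# The mistake described above: judging the same path by site A's rules,
# which do not mention /piwik/, so the request looks permitted.
print(rules_by_host["site-a.example"].can_fetch("oBot", "/piwik/piwik.js"))
```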


* "We carried away all that we did not catch, and all that we caught, we left behind."
** In fact I had to pore over my htaccess to find the redirect, since there's no earthly reason any human would ever request a file from Site B that has never existed anywhere but Site A. I just put the redirect there for insurance.

aristotle

9:23 pm on Sep 9, 2015 (gmt 0)




(Like the Googlebot, they clearly do not know how to read the Crawl-Delay directive, but really, isn't no-more-than-one-per-second a pretty good rule of thumb?)

I've never heard of that directive, but if there's no way to enforce it, then is it really worth having?

lucy24

10:57 pm on Sep 9, 2015 (gmt 0)




It's one line in robots.txt, so it can't possibly do any harm.
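For anyone who hasn't seen the directive, here is roughly what that one line looks like and how a cooperating crawler could read it. A sketch only: the one-second value is just an example, and the parsing uses Python's standard urllib.robotparser, which says nothing about what oBot or any other real crawler does with the line.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; the delay value is an assumption for illustration.
robots_txt = """\
User-agent: *
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Seconds a polite crawler should wait between requests (None if not set).
delay = parser.crawl_delay("oBot")
print(delay)

# The visit described at the top of the thread: 60 requests in 30 seconds.
observed_interval = 30 / 60  # seconds between requests
print(observed_interval < delay)  # True means the crawler went faster than asked
```

Nothing enforces the value; it only helps with robots that choose to read it.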

keyplyr

12:18 am on Sep 10, 2015 (gmt 0)




I noticed this a couple weeks ago. For years I had oBot blocked, silly me. I just poked a hole for it. IBM supplies data to a large client base, some for marketing stratagems. I'm hoping they bid for my properties & increase the RPM.

they clearly do not know how to read the Crawl-Delay directive, but really, isn't no-more-than-one-per-second a pretty good rule of thumb?

Crawl-Delay is not a required standard. Some crawlers support it, others do not. IMO Bing is the one that seems to make its own rules as far as supporting robots directives.

aristotle

1:05 am on Sep 10, 2015 (gmt 0)




Lucy -- personally I don't think it's a good idea to try to tell legitimate bots like googlebot how often they can crawl. I'm willing to let them crawl as often as they want, trusting them not to cause any problems. Seems like I read somewhere that googlebot can sense when a server is overloaded and adjust its crawl frequency accordingly.

lucy24

6:50 pm on Sep 10, 2015 (gmt 0)




Googlebot simply ignores Crawl-Delay-- they say so outright-- but there is a section in WMT where you can tell them how often to crawl.

The w3 link checker has a minimum of one second between requests, though you can tell it to go slower. Excessive requests-- the kind that result in 500-class errors-- are one of the signals of an unwanted robot.

To my surprise, I got a reply from oBot (that is, IBM Germany, forget the exact name) when I emailed to ask about their request pattern. I will see if there's a follow-up.