W3CRobot/5.4.0 libwww/5.4.0

who is this?

keyplyr

12:41 am on Feb 16, 2005 (gmt 0)

221.148.44.82 - - [15/Feb/2005:00:41:47 -0800] "HEAD /page.html HTTP/1.1" 403 0 "//www.referrer.com/" "W3CRobot/5.4.0 libwww/5.4.0"

I block all libwww requests until I know who is behind them. This one has been very persistent for over a year: it hits several times a week, with several attempts each time, following links from other themed sites. I can't find any definitive info on it. The IP is Korean. Does anyone have anything else? Thanks.
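For anyone who wants to spot these hits in their own logs, here is a rough Python sketch. It assumes the server writes the Apache combined log format, as the line quoted above appears to. The names `LOG_PATTERN`, `parse_line` and `is_libwww` are made up for illustration, and the substring test is only one simple way to flag libwww agents.

```python
import re

# Combined Log Format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def is_libwww(agent):
    """True when the User-Agent string mentions libwww (case-insensitive)."""
    return "libwww" in agent.lower()

# The log line quoted in the post above.
line = ('221.148.44.82 - - [15/Feb/2005:00:41:47 -0800] '
        '"HEAD /page.html HTTP/1.1" 403 0 "//www.referrer.com/" '
        '"W3CRobot/5.4.0 libwww/5.4.0"')

entry = parse_line(line)
print(entry["host"], entry["status"], is_libwww(entry["agent"]))
```

Matching on the bare substring also catches other libwww-based clients, which fits the "block until known" approach but will produce false positives for legitimate tools.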

(related threads [google.com])

keyplyr

8:24 pm on Feb 19, 2005 (gmt 0)

Anyone?

dcrombie

9:20 pm on Feb 20, 2005 (gmt 0)

I'm only guessing, but from its behaviour it looks like some kind of link checking utility. It's evolved a bit while I've been watching, but they've messed up the referer field now: there's no 'http:' before the address.
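The malformed referer is easy to test for. A small Python sketch, where `referer_missing_scheme` is a hypothetical helper name and "names a host but has no scheme" is my own reading of what "no 'http:' before the address" means:

```python
from urllib.parse import urlsplit

def referer_missing_scheme(referer):
    """True when the referer names a host but carries no scheme such as 'http'."""
    parts = urlsplit(referer)
    return parts.scheme == "" and parts.netloc != ""

# Referer as logged by this bot: host present, scheme missing.
print(referer_missing_scheme("//www.referrer.com/"))
# A normal referer with its scheme intact.
print(referer_missing_scheme("http://www.referrer.com/"))
# The "-" placeholder servers log when no referer was sent.
print(referer_missing_scheme("-"))
```

On its own this is a weak signal, but combined with the user agent it narrows things down considerably.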

We're seeing it on a fair number of very different sites - all with a valid referer address that does actually link to those pages.

;)

bull

9:58 pm on Feb 20, 2005 (gmt 0)

It's a fake.
Has nothing to do with W3C.
There has been another thread, but I am unable to find it.

keyplyr

5:27 am on Feb 21, 2005 (gmt 0)

bull - yes, of course it's a fake.

dcrombie - it's following links, but I'm not sure it's a legit link checking tool. In my experience it has always been the same Korean IP, but the referrers are all different sites that link to mine. It could be anything from an indexing agent building a directory to an email harvester. I can't find anything beyond conjecture.

bull

9:12 am on Feb 21, 2005 (gmt 0)

No market in Korea?
Block the entire range. Too many spambots and site scrapers from there.
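As a sketch of what a range check looks like in Python: the /24 below is purely an illustrative assumption built around the logged address, not the real allocation for that network, which you would need to look up in whois first. `BLOCKED_RANGES` and `is_blocked` are hypothetical names.

```python
import ipaddress

# Illustrative range only (assumed); look up the real allocation before blocking.
BLOCKED_RANGES = [ipaddress.ip_network("221.148.44.0/24")]

def is_blocked(address):
    """True when the address falls inside any of the blocked ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_RANGES)

print(is_blocked("221.148.44.82"))  # the address from the log line
print(is_blocked("192.0.2.1"))      # a documentation-range address outside the block
```

In practice the same ranges would go into the server or firewall configuration; the check above is only handy for testing a list against your logs before you commit to it.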

dcrombie

9:58 am on Feb 21, 2005 (gmt 0)

keyplyr, if it were building a directory/index then it probably wouldn't be passing a referer (none of the other spiders do), and if it were looking for email addresses then it would be targeting guestbooks and similar pages (and also not passing a referer).

The behaviour of following _actual_ inbound links to actual pages, sometimes requesting just the HEAD, mirrors the behaviour of other link checking utilities. Or it could be some kind of research project; there are a few university projects of a similar nature in that region.

You could block it on the basis that it doesn't fetch robots.txt, but otherwise I'd classify it as harmless.
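To find out which clients never ask for robots.txt, something along these lines would do. It is a minimal Python sketch that assumes you have already pulled (host, path) pairs out of the access log; `hosts_skipping_robots` is a hypothetical name and the sample requests are invented for illustration.

```python
from collections import defaultdict

def hosts_skipping_robots(requests):
    """Given (host, path) pairs, return the sorted hosts that never requested /robots.txt."""
    paths_by_host = defaultdict(set)
    for host, path in requests:
        paths_by_host[host].add(path)
    return sorted(
        host for host, paths in paths_by_host.items()
        if "/robots.txt" not in paths
    )

# Invented sample data: one host that skips robots.txt, one that fetches it.
sample = [
    ("221.148.44.82", "/page.html"),
    ("221.148.44.82", "/other.html"),
    ("192.0.2.10", "/robots.txt"),
    ("192.0.2.10", "/index.html"),
]
print(hosts_skipping_robots(sample))
```

Ordinary browsers never fetch robots.txt either, so the result only means something once it is filtered down to user agents that claim to be robots.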