LTI/LemurProject

another Nutch pest

         

keyplyr

7:32 am on Feb 5, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There are just way too many Nutch spiders causing havoc today. Most give up after a couple of 403s, but not this one...

68.180.139.*** - - [05/Feb/2008:00:59:37 -0500] "GET /index.html HTTP/1.0" 403 462 "-" "LTI/LemurProject Nutch Spider/Nutch-1.0-dev (Research spider using Nutch; http://www.lemurproject.org; mhoy@cs.cmu.edu)"

FYI: IP is a Yahoo business account hosting server.

It requests robots.txt (where it is correctly disallowed), then eats a couple hundred 403s (since it's banned by the generic "Nutch" match), then comes back a few hours later from a different address (the third and/or fourth octet of the IP changes) and does the same... day after day.

I've emailed them with log snippets. No reply.
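
For anyone wanting to measure this kind of behaviour from their own logs, here is a minimal Python sketch. It assumes the Apache "combined" log format shown in the snippet above and a simple case-insensitive substring match on the user agent. The function names and the "nutch" token list are illustrative only, not the actual ban rules used on the site.

```python
import re

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def is_blocked_agent(agent, tokens=("nutch",)):
    """True if any token appears in the user agent (case-insensitive)."""
    lowered = agent.lower()
    return any(token in lowered for token in tokens)


def count_denied_by_agent(lines, tokens=("nutch",)):
    """Count 403 responses per matching user agent."""
    counts = {}
    for line in lines:
        entry = parse_line(line)
        if entry and entry["status"] == "403" and is_blocked_agent(entry["agent"], tokens):
            counts[entry["agent"]] = counts.get(entry["agent"], 0) + 1
    return counts
```

Feeding a day's worth of log lines to count_denied_by_agent gives a rough per-agent tally of denied requests, which is handy when sending a complaint with evidence.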

wilderness

5:21 pm on Feb 12, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



keyplyr,
They've been persistent and daily.

Don

Umbra

8:43 pm on Feb 12, 2008 (gmt 0)

10+ Year Member



FYI: IP is a Yahoo business account hosting server.

For Yahoo IPs, how can one distinguish between Yahoo activity vs 3rd party businesses that are hosted with Yahoo?

wilderness

9:09 pm on Feb 12, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For Yahoo IPs, how can one distinguish between Yahoo activity vs 3rd party businesses that are hosted with Yahoo?

Umbra,
I'm going to need to buy a new keyboard, as I'm unable to get the drool off mine ;)

Yahoo (and most other SE providers) have such a vast quantity of tools coming from so many different ranges that it's impossible to stay abreast.

I did the following and thus far haven't seen anything detrimental.
[webmasterworld.com...]

A newer thread
[webmasterworld.com...]

keyplyr

5:17 am on Mar 1, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



how can one distinguish between Yahoo activity vs 3rd party businesses that are hosted with Yahoo? - Umbra

One way is to WhoIs the IP address and see who it's assigned to, whether it's a direct allocation, etc. Some WhoIs tools and DNS look-ups show more info than others.

Umbra

1:21 pm on Mar 2, 2008 (gmt 0)

10+ Year Member



keyplyr,

Well to use an example, I just spotted this (denied) request:
68.180.176.114
libwww-perl/5.803

Is this Yahoo being stupid, or a 3rd party hosted on a Yahoo server? The whois record indicates Yahoo, the reverse ip gives mproc15.data.corp.sk1.yahoo.com, but I don't know where to go from there.
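
One step that can be taken from there is a forward-confirmed reverse DNS check: look up the PTR name for the IP, then resolve that name forward and see whether it maps back to the same IP. The Python sketch below is only an illustration. The function names are made up, the resolver arguments exist only so the logic can be exercised without a network, and a passing check shows that the hostname is not spoofed, not whether the request came from Yahoo itself or from a hosted customer.

```python
import socket


def forward_confirmed_host(ip, reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return the PTR hostname for ip if it resolves back to ip, else None."""
    try:
        hostname = reverse(ip)[0]          # (hostname, aliases, addresses)
        addresses = forward(hostname)[2]   # (hostname, aliases, addresses)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError
        return None
    return hostname if ip in addresses else None


def in_domain(hostname, domain):
    """True if hostname equals domain or is a subdomain of it."""
    hostname = hostname.lower().rstrip(".")
    domain = domain.lower().strip(".")
    return hostname == domain or hostname.endswith("." + domain)
```

A typical use would be to call forward_confirmed_host("68.180.176.114") and, if a name comes back, pass it to in_domain(name, "yahoo.com"). The suffix check requires a leading dot so that a name like "notyahoo.com" does not pass.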

wilderness

1:53 pm on Mar 2, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Personally, I wouldn't allow Yahoo or any other major SE access when it uses any UA that deviates from the compliant UA it uses for crawling websites.

IMO, in no way, shape or form does that include libwww-perl (or many others).

All SEs are sending such a MASS of IP ranges and so many various tools at our websites that we must begin to wonder whether all this excess brings our websites any benefit.

Don

keyplyr

6:10 am on Mar 3, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For Yahoo IPs, how can one distinguish between Yahoo activity vs 3rd party businesses that are hosted with Yahoo? - Umbra

Indeed, it is difficult to tell unless there are additional "hints" in the UA string, as there were in the Nutch example in my first post.

Yahoo does use libwww-perl for some purpose (never did determine exactly what) and I see it occasionally. But because of the potential for abuse, I deny all libwww-perl requests and only allow select IP addresses to use it via a white list with mod_rewrite.
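
The decision logic behind that kind of white list can be sketched in a few lines of Python. This is not the mod_rewrite rule set itself, just an illustration of the same idea: the function name is hypothetical, and the allowed networks are placeholder documentation ranges rather than real Yahoo addresses.

```python
import ipaddress

# Placeholder allow list (RFC 5737 documentation ranges, not real crawler IPs)
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("203.0.113.7/32"),
]


def should_deny(user_agent, remote_ip,
                restricted_token="libwww-perl", allowed=ALLOWED_NETWORKS):
    """Deny the restricted user agent unless the client IP is on the allow list."""
    if restricted_token not in user_agent.lower():
        return False  # this rule only concerns the restricted agent
    address = ipaddress.ip_address(remote_ip)
    return not any(address in network for network in allowed)
```

With the placeholder list above, should_deny("libwww-perl/5.803", "68.180.176.114") returns True, while the same agent from 192.0.2.10 is let through.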