|Unable to recall when a University has crawled my pages. |
Could mean anything. The W3C, for example -- including the Link Checker -- is in MIT's territory. Could even be the not-wholly-apocryphal Design A Robot computer-science class exercise.
Far as I can make out, Kevin Schmidt (the named contact person) is a bona fide human who has been associated with UCSB for ages, so it would look pretty bad if he didn't answer e-mail. Although possibly not on a Sunday ;)
Poking around 192.35 suggests that this is a sliver which UC only recently got hold of and they're now doling it out in /24 sub-slivers to assorted campuses. ("Office of the President" doesn't mean anything; it's like when an ARIN site claims to belong to the EU. Looks like Santa Cruz has snabbled .223 but most of it is still up for grabs.)
Had a valid visitor from one of their other IP ranges in 2007.
I've some extensive widget articles and pics related to the San Francisco area, and that could possibly be the "logic" for this crawl.
Three hours after the 40-page crawl they came back for more and ate 403s.
I'm going to add their other IP ranges to my denials.
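For anyone wanting to do the same at the server level, here's roughly what that looks like in .htaccess, assuming Apache 2.4 with mod_authz_core (the ranges shown are just the ones mentioned in this thread, standing in for whatever you've actually identified):

```apache
# Deny the suspect university ranges while leaving everyone else alone.
# Ranges are illustrative placeholders -- substitute your own blocklist.
<RequireAll>
    Require all granted
    Require not ip 192.35.223.0/24
    Require not ip 132.239.0.0/16
</RequireAll>
```

On Apache 2.2 you'd use the old Order/Deny directives instead, so check which version your host runs before pasting anything.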
I on the other hand get about 15% of valid human traffic from .edu. Computer Science (CS) departments mostly use Nutch or other known out-of-the-box crawlers. Teachers sometimes use copyright-check or plagiarism bots to verify the originality of their students' work.
I have the complete range 220.127.116.11/24 blocked as being a pain. Odd for a US uni to have only a /24 anyway.
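If you're doing the filtering in application code rather than in the server config, the containment test is a one-liner with Python's stdlib ipaddress module. A minimal sketch -- the networks below are just the ranges named upthread, not a recommended blocklist:

```python
import ipaddress

# Illustrative blocklist built from ranges mentioned in this thread.
blocked = [
    ipaddress.ip_network("192.35.223.0/24"),   # the /24 sub-sliver
    ipaddress.ip_network("132.239.0.0/16"),    # UCSD
]

def is_blocked(ip: str) -> bool:
    """Return True if the visitor IP falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in blocked)

print(is_blocked("192.35.223.17"))  # True
print(is_blocked("128.32.0.1"))     # False (Berkeley, not on this list)
```

The same module handles /16s, /24s, and IPv6 identically, so one list covers whatever mix of slivers you've collected.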
I have no problem with letting unis in general access sites so long as they seem genuine visitors. Persistent bot sub-ranges get blocked.
|Odd for a US uni to have only a /24 anyway. |
This is an add-on; the primary UC ranges are elsewhere.
:: shuffling papers ::
Most of them seem to be in 169. I've also met:
128.32 Berkeley (next door to MIT at .30-.31 in Early Registration territory)
132.239 and 137.110 UCSD
In fact I'm glad this came up because I re-checked and find I've got some things written down wrong (notably, UCSC does not have a /14 all to itself) and the records themselves are garbled.
169.236 UC Merced [pause for derisive laughter]
But these too are obvious add-ons. If UCSD alone has two other /16s that I've personally met, there have got to be piles more, especially for Berkeley.
For those who want to split hairs: "Cal" (full name) is the University of California at Berkeley. All others are UC-cityname.
|Teachers sometimes use copyright check bots or plagiarism bots to validate original content of their students. |
I was going to leave this alone until these other IP issues popped up.
40 pages in 33 minutes or in 30 seconds from an unidentified crawler (nothing in the UA) is unacceptable, regardless of their criteria and/or use.
Seems to me we've had this difference previously on 3rd party use.
I don't give edu a free ride simply because they're edu.
If they behave, I'll allow them. If they're abusive, then I have no qualms about denying access.
Since I've had one UCSB range denied since 2007, this is strike two and all UCSB goes.
RE: free ride
And I don't either. I block all CS dept bots. I was just sayin' that some sites (like mine) have a measurable human user base coming from univ. I'm a univ professor and my main site used to be on the Univ of Cal servers for a few years before I changed the model to for-profit and moved it to a hosting company. I'm still indexed by the combined univ library system and many classes use my material in their curriculum w/ incoming links for citation.
As an added comment - I disallow all those copyright check bots & plagiarism bots in robots.txt and AFAIK they've always obeyed.
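For anyone who hasn't set that up: it's just a per-agent block in robots.txt. The token below is Turnitin's crawler as I recall it, but UA tokens vary, so check each service's own documentation for the exact string it honors:

```
# Block a plagiarism-check crawler by its UA token (verify the token
# against the service's docs before relying on it)
User-agent: TurnitinBot
Disallow: /

# Everyone else unrestricted
User-agent: *
Disallow:
```

Of course this only works for the bots that obey robots.txt -- in my experience these particular ones do, but the server-level blocks discussed above are the backstop for the ones that don't.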
A bad agent can come from anywhere, and there are lots of script kiddies in campus dorms (don't forget how facebook was born.)