Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
U of Cal
Santa Barbara
wilderness




msg:4541982
 9:24 am on Feb 3, 2013 (gmt 0)

Unable to recall when a University has crawled my pages.
Especially one that is not identifying itself as a specific bot.

Approximately 40 pages.

192.35.222.182 - - [Sun Feb 03 06:45:13 2013] "GET /Myfolder/MySub/MyPage.html HTTP/1.1" 200 19643 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"

They requested robots.txt and two pages some three hours earlier, and I missed it in the logs.

No supporting files were requested.
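
For anyone who wants to spot this pattern without eyeballing the logs, here is a minimal Python sketch. It assumes the log format shown in the line above; the names `LOG_PATTERN`, `SUPPORT_EXTENSIONS`, and `pages_without_support` are made up for illustration, and the extension list is only a guess at what counts as a "supporting file" on a given site.

```python
import re
from collections import defaultdict

# Assumed layout, matching the sample line quoted in this post:
# ip - - [timestamp] "METHOD path PROTOCOL" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical list of "supporting file" extensions; adjust for your own site.
SUPPORT_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico")

SAMPLE = (
    '192.35.222.182 - - [Sun Feb 03 06:45:13 2013] '
    '"GET /Myfolder/MySub/MyPage.html HTTP/1.1" 200 19643 "-" '
    '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"'
)


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def pages_without_support(lines):
    """Map each IP that fetched pages but no supporting files to its page count."""
    pages = defaultdict(int)
    support = defaultdict(int)
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        path = entry["path"].lower().split("?", 1)[0]
        if path.endswith(SUPPORT_EXTENSIONS):
            support[entry["ip"]] += 1
        elif path != "/robots.txt":
            pages[entry["ip"]] += 1
    return {ip: count for ip, count in pages.items() if support[ip] == 0}
```

Calling `pages_without_support(open_log_lines)` on a day's log would list the IPs that behave like the visitor above. A text-only browser would also show up, so treat the result as a list of things to look at rather than a verdict.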

 

lucy24




msg:4541993
 10:49 am on Feb 3, 2013 (gmt 0)

Unable to recall when a University has crawled my pages.

Could mean anything. w3 for example-- including the Link Checker-- is in MIT's territory. Could even be the not-wholly-apocryphal Design A Robot computer-science class exercise.

Far as I can make out, Kevin Schmidt (the named contact person) is a bona fide human who has been associated with UCSB for ages, so it would look pretty bad if he didn't answer e-mail. Although possibly not on a Sunday ;)

Poking around 192.35 suggests that this is a sliver which UC only recently got hold of and they're now doling it out in /24 sub-slivers to assorted campuses. ("Office of the President" doesn't mean anything; it's like when an ARIN site claims to belong to the EU. Looks like Santa Cruz has snabbled .223 but most of it is still up for grabs.)

wilderness




msg:4542053
 5:15 pm on Feb 3, 2013 (gmt 0)

Had a valid visitor from one of their other IP ranges in 2007.

I've some extensive widget articles and pics that are related to the San Francisco area, and that could possibly be the "logic" for this crawl.

They returned three hours after the 40 pages for more and ate 403s.

I'm going to add their other IP ranges to my denials.

keyplyr




msg:4542111
 9:22 pm on Feb 3, 2013 (gmt 0)


I on the other hand get about 15% of valid human traffic from .edu. Computer Science (CS) departments mostly use Nutch or other known out-of-the-box clones. Teachers sometimes use copyright check bots or plagiarism bots to validate original content of their students.

dstiles




msg:4542134
 10:48 pm on Feb 3, 2013 (gmt 0)

I have the complete range 192.35.222.0/24 blocked as being a pain. Odd for a US uni to have only a /24 anyway.

I have no problem with letting unis in general access sites so long as they seem genuine visitors. Persistent bot sub-ranges get blocked.
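
As a rough illustration of what blocking a /24 means in practice, here is a small Python sketch using the standard `ipaddress` module. The list name `BLOCKED_NETWORKS` and the helper `is_blocked` are hypothetical; a real setup would normally do this in the server or firewall configuration rather than in a script.

```python
import ipaddress

# The /24 mentioned in this post; add further ranges as needed.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.35.222.0/24")]


def is_blocked(ip_text):
    """Return True if the address falls inside any blocked network."""
    ip = ipaddress.ip_address(ip_text)
    return any(ip in network for network in BLOCKED_NETWORKS)
```

With this list, `is_blocked("192.35.222.182")` is true for the crawler in the first post, while a neighbour in 192.35.223.x is not affected.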

lucy24




msg:4542154
 1:34 am on Feb 4, 2013 (gmt 0)

Odd for a US uni to have only a /24 anyway.

This is an add-on; the primary UC ranges are elsewhere.

:: shuffling papers ::

Most of them seem to be in 169. I've also met:
128.32 Berkeley (next door to MIT at .30-.31 in Early Registration territory)
132.239 and 137.110 UCSD

In fact I'm glad this came up because I re-checked and find I've got some things written down wrong (notably, UCSC does not have a /14 all to itself) and the records themselves are garbled.

Looks like:

169.228 UCSD
.229 Berkeley
.230 UCSF
.231 UCSB
.232 UCLA
.233 UCSC
.234 Irvine
.235 Riverside
.236 UC Merced [pause for derisive laughter]
.237 UCD

But these too are obvious add-ons. If UCSD alone has two other /16s that I've personally met, there have got to be piles more, especially for Berkeley.
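
For quick log lookups, the 169.x list above could be turned into a table along these lines. This is only a sketch: it assumes each entry is a full /16, which the list implies but does not state, and it inherits any mistakes in the notes it is copied from. The names `CAMPUS_BY_SECOND_OCTET`, `UC_NETWORKS`, and `campus_for` are made up for the example.

```python
import ipaddress

# Copied from the list in this post; assumed (not verified) to be /16 blocks.
CAMPUS_BY_SECOND_OCTET = {
    228: "UCSD",
    229: "Berkeley",
    230: "UCSF",
    231: "UCSB",
    232: "UCLA",
    233: "UCSC",
    234: "Irvine",
    235: "Riverside",
    236: "UC Merced",
    237: "UCD",
}

UC_NETWORKS = {
    ipaddress.ip_network(f"169.{octet}.0.0/16"): campus
    for octet, campus in CAMPUS_BY_SECOND_OCTET.items()
}


def campus_for(ip_text):
    """Return the campus name for an address in the table, or None."""
    ip = ipaddress.ip_address(ip_text)
    for network, campus in UC_NETWORKS.items():
        if ip in network:
            return campus
    return None
```

Anything outside the table, including the 192.35 add-on ranges and the older 128.32, 132.239 and 137.110 blocks, returns `None` and would need its own entry.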

For those who want to split hairs: "Cal" (full name) is the University of California at Berkeley. All others are UC-cityname.

wilderness




msg:4542163
 2:08 am on Feb 4, 2013 (gmt 0)

Teachers sometimes use copyright check bots or plagiarism bots to validate original content of their students.


keyplyr,
I was going to leave this alone until these other IP issues popped in.

40-pages in 33-minutes or 30-seconds from an unidentified crawler (not specified in the UA) is unacceptable, regardless of their criteria and/or use.

Seems to me we've had this difference previously on 3rd party use.
I don't give edu a free ride simply because they're edu.
If they behave, I'll allow them. If they are abusive, then I have no qualms about denying access.

Since I've had one UCSB range denied since 2007, this is strike two and all UCSB goes.
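
The crawl-rate complaint in this post can be checked mechanically. The sketch below assumes timestamps in the format used by the log line in the first post, an English locale for day and month names, and a 60-second threshold that is purely illustrative; `average_interval_seconds` and `is_too_fast` are hypothetical names.

```python
from datetime import datetime

# Matches timestamps such as "Sun Feb 03 06:45:13 2013" (English locale assumed).
LOG_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def average_interval_seconds(timestamps):
    """Average gap in seconds between consecutive requests, or None if fewer than two."""
    times = sorted(datetime.strptime(t, LOG_TIME_FORMAT) for t in timestamps)
    if len(times) < 2:
        return None
    span = (times[-1] - times[0]).total_seconds()
    return span / (len(times) - 1)


def is_too_fast(timestamps, min_interval=60.0):
    """True if requests arrive, on average, closer together than min_interval seconds."""
    average = average_interval_seconds(timestamps)
    return average is not None and average < min_interval
```

Forty requests spread evenly over 33 minutes work out to roughly 51 seconds apart, so a 60-second threshold would flag that crawl. Where to set the threshold is a judgment call for each site.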

keyplyr




msg:4542175
 3:49 am on Feb 4, 2013 (gmt 0)

RE: free ride

And I don't either. I block all CS dept bots. I was just sayin' that some sites (like mine) have a measurable human user base coming from univ. I'm a univ professor and my main site used to be on the Univ of Cal servers for a few years before I changed the model to for-profit and moved it to a hosting company. I'm still indexed by the combined univ library system and many classes use my material in their curriculum w/ incoming links for citation.

keyplyr




msg:4542229
 11:19 am on Feb 4, 2013 (gmt 0)



As an added comment - I disallow all those copyright check bots & plagiarism bots in robots.txt and AFAIK they've always obeyed.
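
A disallow rule like that can be sanity-checked offline with Python's standard `urllib.robotparser`. In this sketch the agent name `ExamplePlagiarismBot` is a placeholder, not a real bot, and `allowed` is a hypothetical helper; substitute the tokens the actual bots document for themselves.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt: one named bot is shut out, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: ExamplePlagiarismBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, path="/"):
    """Return True if robots.txt permits this agent to fetch the path."""
    return parser.can_fetch(user_agent, path)
```

This only shows what a compliant bot would conclude from the file. Whether a given bot actually obeys it is a separate question that only the logs can answer.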

A bad agent can come from anywhere, and there are lots of script kiddies in campus dorms (don't forget how Facebook was born.)


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved