Vision Research Lab

wilderness

11:51 pm on Sep 7, 2005 (gmt 0)

128.111.60.61 - - [07/Sep/2005:14:32:12 -0700] "GET / HTTP/1.1" 403 - "-" "Vision Research Lab image spider at vision.ece.ucsb.edu"

Three visits, main page only, no robots.txt request.
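
For anyone who wants to pull the same kind of summary out of their own logs, here is a minimal sketch. It assumes the Apache "combined" log format, which is what the line above looks like; the names `LOG_RE`, `parse_line` and `summarize` are made up for illustration and are not from any tool mentioned in this thread.

```python
import re
from collections import Counter

# Assumed layout (Apache "combined" format):
# host ident user [time] "method path proto" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


def summarize(lines):
    """Count hits per (user-agent, path) and collect agents that fetched /robots.txt."""
    hits = Counter()
    asked_robots = set()
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip lines in some other format
        hits[(rec["agent"], rec["path"])] += 1
        if rec["path"] == "/robots.txt":
            asked_robots.add(rec["agent"])
    return hits, asked_robots


sample = (
    '128.111.60.61 - - [07/Sep/2005:14:32:12 -0700] "GET / HTTP/1.1" 403 - "-" '
    '"Vision Research Lab image spider at vision.ece.ucsb.edu"'
)
rec = parse_line(sample)
hits, asked_robots = summarize([sample])
```

An agent that shows up in `hits` but never in `asked_robots` is one that crawled without requesting robots.txt first.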

GaryK

1:27 pm on Sep 8, 2005 (gmt 0)

Don, none of these darned bots from unis are ever well-behaved. :)

It seems like a snobbish attitude on their part. It's as if because they are doing "research" they don't need to observe the normal rules about crawling a site.

The minute I see a .edu TLD no further research is needed; the bot is banned, LOL.

Thanks for sharing!

wilderness

2:33 pm on Sep 8, 2005 (gmt 0)

none of these darned bots from unis are ever well-behaved

That's because their funded research falls under the umbrella of education :(

[webmasterworld.com...]

This all started under keebler, then went to Northwestern University, and later I found through another offline source that it was all related to the 100-year anniversary of FoMoCo, in which they sent a staff of a dozen or more to research data at a Detroit Historical Society.

And yet they expect to utilize private websites for their research?

Perhaps (as others have stated), this is the intent of the internet; however, when funded agencies grab the work of non-funded websites (without providing credit or cash), the methods stink of third party use.

I have somebody from the University of Kentucky that keeps visiting my site for horse materials. Imagine that, a major university from the biggest, most prestigious and historical horse state in the country is gathering data from my websites, rather than their own extensive resources :(

These are only the tip of the iceberg.

The state of Ohio is the worst offender. Their libraries gather data and then close the online Public Library (digital archive) access to non-local residents.

The above are not issues for webmasters with very small personal type websites.
When the webmaster presents extensive data and images, it's an entirely different ballgame.

Don

Lord Majestic

2:42 pm on Sep 8, 2005 (gmt 0)

And yet they expect to utilize private websites for their research?

A newspaper could be privately owned, however if the information it prints is available publicly (like your websites I presume) then it's perfectly legitimate to use it for purposes of research -- some uses do not even require credit and in some cases it's impossible to give credit -- more sources than text in analysis. And best of all -- permission of owner is not required, just a purchase of the newspaper or even free use of it in the library.

AFAIK publishers of newspapers, books, anything printed are required by law to provide a few copies to a specially designated library (Library of Congress?), where researchers could use them. If it was done for the web then there would have been no need for researchers to crawl the web; however, just imagine how painful it would have been for you to ensure that you update not just your site but also the repository in the Lib of Congress?

Ignoring robots.txt is of course not excusable.

wilderness

4:00 pm on Sep 8, 2005 (gmt 0)

A newspaper could be privately owned, however if the information it prints is available publicly (like your websites I presume) then it's perfectly legitimate to use it for purposes of research -- some uses do not even require credit and in some cases it's impossible to give credit -- more sources than text in analysis. And best of all -- permission of owner is not required, just a purchase of the newspaper or even free use of it in the library.

Majestic, you're relentless ;)

Somehow I knew a reply would be forthcoming from you and provided an exception in my reply, and yet you still felt a reply necessary ;)

Perhaps (as others have stated), this is the intent of the internet

"permission of owner is not required"

This is where you're wrong, Majestic.
Your own bot keeps eating 403's at my website and unless you come in from another IP or UA?
You don't have access to my materials or sites.

Many websites have TOS or UAGs which bots don't have the capability to read or comply with and/or users ignore. The result for infractions of TOS or UAG is denial of service, a webmaster's ONLY effective control.

Don

Lord Majestic

4:59 pm on Sep 8, 2005 (gmt 0)

Your own bot keeps eating 403's at my website and unless you come in from another IP or UA?

I have no idea if it does, wilderness. We have crawled 1 bln URLs so far and I have no clue what your sites are. To be honest, I am not bothered if you or anyone else serves 403s -- fair enough. It would have been better (for you and us, bandwidth- and request-wise) if you blocked it via robots.txt, but 403 is fine. The UA stays the same, but new IPs appear, so if you merely block based on IP then it won't be effective: a robots.txt and/or UA block should be sufficient for my bot, if that's what you want.
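
To illustrate the point about IP blocks versus UA blocks, here is a rough sketch of the decision a server-side filter makes. Everything in it is hypothetical: the network `192.0.2.0/24`, the addresses, the token `examplebot`, and the function names are placeholders for illustration, not anybody's real configuration.

```python
import ipaddress

# Hypothetical block lists, for illustration only.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
BLOCKED_AGENT_TOKENS = ["examplebot"]


def blocked_by_ip(ip):
    """True if the client address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


def blocked_by_agent(user_agent):
    """True if the User-Agent header contains any blocked token (case-insensitive)."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


def status_for(ip, user_agent, check_ip=True, check_agent=True):
    """Return the HTTP status the filter would serve: 403 if blocked, else 200."""
    if check_ip and blocked_by_ip(ip):
        return 403
    if check_agent and blocked_by_agent(user_agent):
        return 403
    return 200


# Same bot, two addresses: one inside the blocked range, one outside it.
ip_only_known = status_for("192.0.2.10", "ExampleBot/1.0", check_agent=False)
ip_only_new = status_for("198.51.100.7", "ExampleBot/1.0", check_agent=False)
with_agent_rule = status_for("198.51.100.7", "ExampleBot/1.0")
```

With only the IP rule active, the request from the new address gets through; once the UA rule is on, the same request is refused no matter where it comes from. That only holds for a bot that keeps its UA honest, of course.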

Now, I was not talking about myself or even those guys -- I was merely supplying an analogy of what happens in the real world, where the behavior of researchers is not only legal but is also considered perfectly reasonable and acceptable by (say) newspaper owners. I think research falls under Fair Use laws (where they are present): say, I can photocopy a newspaper article in a library for research at home, and I am pretty sure newspaper owners can't disallow this. I wonder if same applies to public non-subscription websites?

Anyhow, was just checking if you changed your views after a year, will try again next year :o

wilderness

5:31 pm on Sep 8, 2005 (gmt 0)

I wonder if same applies to public non-subscription websites?

Majestic,
Fair Use in the US, and many other places, applies to portions and not the entirety.
Fair Use also calls for credit for source and/or publication.

The majority of the articles contained on my sites although previously published (ages ago) were NOT digitized, before my efforts.
That in itself creates a confusing copyright issue, the original author has a copyright, the original publication holds a copyright and now a webmaster with publications that are not available to the masses creates a copyright by the act of digitization and appearance on a web page.

Who holds what, and how it may be used, may depend on possession.
Copying materials for personal use off my websites is no infringement.
Using software to accumulate that data violates TOS. Applying that material in its entirety to another web page is also infringement.

Why would I change my position after a year or five years?
I have close to 30,000 hours into digitization and archival of materials that are not existent in other digitized forms.

I should allow a bot or any other (under the guise of research or any other theme) to grab those thousands of hours' worth of work in seconds? Especially when the majority are no less than grabbers or reapers who intend from the start to sell the harvested data to a third party!
Hardly!

Try again in TEN Years ;)

BTW as I recall your bot traverses from a RIPE range and you wouldn't get in under those IP's regardless of number.

Don

Lord Majestic

5:46 pm on Sep 8, 2005 (gmt 0)

That in itself creates a confusing copyright issue, the original author has a copyright, the original publication holds a copyright and now a webmaster with publications that are not available to the masses creates a copyright by the act of digitization and appearance on a web page.

You must have researched this question better than me, but it seems to me that the original author retains copyright and generally mere digitisation is not allowed without permission; it's like ripping a CD and then distributing it as MP3s. But anyway, you should know your position better than me.

I think you are confusing blatant theft of your materials (a certain no-no) with research based on your data. If it's so unique and valuable then make it subscription only and all problems go away. If you make it public and let a few search engines get it, then it's rather strange to disallow the same freedom to other well-behaved spiders.

Anyhow, we've been through this; I merely posted to provide an analogy about newspapers and research that can be done based on their content.

BTW as I recall your bot traverses from a RIPE range and you wouldn't get in under those IP's regardless of number.

This is offtopic, but the bot (my bot - MJ12bot, not the one discussed here) uses a distributed model and has very different IPs, some of which may or may not belong to RIPE (or whoever), but I doubt it - the bot is run on normal broadband connections by people who believe in the spirit of the Internet :)

If you want to block the bot then use robots.txt, support for which seems to be very solid right now. If you are not clear on the UA then sticky me and I will provide you this information. Don't bother banning IPs -- it simply won't work and you may get the incorrect feeling that I am somehow trying to circumvent your site's filters, which is not the case as I have no clue what your site is in the first place, and even if I did I would have been out of my mind to get into a fight with a webmaster who has teeth and claws to bite my head off :)
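
For reference, a robots.txt block of that kind might look roughly like the sketch below, checked here with Python's standard `urllib.robotparser`. The token `MJ12bot` is the bot name given above; the exact full UA string is not stated in this thread, so only the bare token is used. The `example.com` URL, `SomeOtherBot` and the helper name `allowed` are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: shut out MJ12bot entirely, leave everyone else alone.
ROBOTS_TXT = """\
User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow:
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL; any path on the site is covered by "Disallow: /".
page = "http://example.com/archive/page.html"
mj12_ok = allowed("MJ12bot", page)
other_ok = allowed("SomeOtherBot", page)
```

This only models what a well-behaved crawler would conclude after reading the file. robots.txt is a request, not an enforcement mechanism, so a bot that never fetches it is unaffected.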