Forum Moderators: open


LCC Bot


WebOpz

12:37 pm on May 20, 2021 (gmt 0)

5+ Year Member Top Contributors Of The Month



Anyone seen this one before?

The Leipzig Corpora Collection (LCC) is a project of the Natural Language Processing Group of the University of Leipzig. The LCC offers access to monolingual dictionaries in more than 200 languages.

The crawler that visited your website is collecting data for this project. The crawled data are used for language documentation and language statistics which are freely available on our website.

The crawling is restricted to text. Audio and video material is excluded from the crawling. If such items are crawled due to technical limitations, they are never stored.

The crawler Heritrix (Vers. 3.3.0) is used, see this link for details. Heritrix was developed by the Internet Archive and is used by several institutions.

LCC (+http://corpora.uni-leipzig(.)de/crawler_faq.html)

It aggressively downloaded my entire site.

Where do these university projects get off assuming they can use others' content or resources without permission? We need to expand the robots.txt standard to include whether we are willing to be a part of things like this. If the content creator doesn't give your project permission or license their content -- you can't just USE it OR abuse their resources.

not2easy

12:55 pm on May 20, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



We need to expand the robots.txt standard to include whether we are willing to be a part of things like this.

The robots.txt standard has zero effect on non-compliant robots; there is no requirement (or means of enforcement) that a robot read or comply with robots.txt. Calling for some reform of robots.txt assumes the existence of some entity that regulates robots. No such entity exists.

Further, calls to action are specifically not discussed on this site - Please see ToS [webmasterworld.com]. If you believe that not giving permission to use your content can protect it from use by any passing robot (or human), bless your innocence. ;)
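For what it's worth, bots that do honor the standard can be turned away with a plain robots.txt entry. A minimal sketch -- note that "LCC" is an assumed User-agent token here; the actual token would need to be confirmed against the project's crawler FAQ:

```
# Hypothetical entry -- "LCC" is a placeholder; check the crawler's
# FAQ or your access logs for the real User-agent token it sends.
User-agent: LCC
Disallow: /
```

Of course, as noted above, this only works against compliant crawlers; non-compliant ones ignore the file entirely.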

WebOpz

1:12 pm on May 20, 2021 (gmt 0)

5+ Year Member Top Contributors Of The Month



@not2easy, Yup I *AM* aware that robots.txt is merely a suggestion. I'm not naive, and I do understand that it will not reduce my security or IP-theft risks. Legitimate organizations seem to have little to no problem following this de facto (and soon official) standard, but illegitimate ones certainly make themselves known with their malicious behavior.

The IETF IS the standards body responsible for this, and it is working on it now. [datatracker.ietf.org...] Oh, how I wish I still had my innocence. ;)


[edited by: not2easy at 2:04 pm (utc) on May 20, 2021]
[edit reason] fixed broken link [/edit]

wilderness

3:44 pm on May 20, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



UAs with 'crawler' or 'spider' as part of the string are an early-learned lesson: those words are commonly used by generic bots, so they make a convenient catch-all block.
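That catch-all can be sketched at the server level. A minimal example for Apache with mod_rewrite enabled (the word list is illustrative; a production rule would be tuned against your own logs):

```
# Return 403 Forbidden to any request whose User-Agent
# contains "crawler" or "spider", case-insensitive.
# Sketch only -- test against your own traffic first.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (crawler|spider) [NC]
RewriteRule ^ - [F]
```

Unlike robots.txt, this is enforced server-side, so it works whether or not the bot chooses to be compliant -- though anything spoofing a browser UA will still slip through.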