Anyone seen this one before?
The Leipzig Corpora Collection (LCC) is a project of the Natural Language Processing Group of the University of Leipzig. The LCC offers access to monolingual dictionaries in more than 200 languages.
The crawler that visited your website is collecting data for this project. The crawled data are used for language documentation and language statistics which are freely available on our website.
The crawling is restricted to text. Audio and video material is excluded from the crawling. If such items are crawled due to technical limitations, they are never stored.
The crawler Heritrix (Vers. 3.3.0) is used, see this link for details. Heritrix was developed by the Internet Archive and is used by several institutions.
LCC (+http://corpora.uni-leipzig(.)de/crawler_faq.html)
It aggressively downloaded my entire site.
Where do these university projects get off assuming they can use others' content or resources without permission? We need to expand the robots.txt standard so site owners can state whether they are willing to be part of things like this. If the content creator hasn't given your project permission or licensed the content to you, you can't just USE it OR abuse their resources.
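
For anyone who wants to test a per-crawler opt-out rule in the meantime, here is a minimal sketch using Python's standard urllib.robotparser. It rests on two assumptions the quoted FAQ does not confirm: that the crawler honours robots.txt at all, and that it matches on the token "LCC" taken from the user-agent string above. The robots.txt content, the is_allowed helper, and the example.com URL are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The "LCC" token is an assumption based on the
# user-agent string quoted above, not a documented value.
ROBOTS_TXT = """\
User-agent: LCC
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/page.html"  # placeholder URL
    print("LCC allowed:", is_allowed(ROBOTS_TXT, "LCC", page))
    print("Other bot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

This only tells you what a well-behaved crawler should do with the file. It does nothing against a bot that ignores robots.txt, and it cannot express "you may index but not reuse my content", which is the gap I'm complaining about.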