|10,000 Bot-Generated Articles Per Day: Is That Really Good or Really Bad|
I read this with some dismay. I know it goes on, but really, 10,000 per day is quite scary.
Clearly, computer article generation has improved considerably, and nobody can compete with that volume of output. I do, however, wonder about the quality, and about the ethics of scraping content and republishing it, especially in those numbers.
|Sverker Johansson could be the most prolific author you've never heard of. Volunteering his time over the past seven years publishing to Wikipedia, the 53-year-old Swede can take credit for 2.7 million articles, or 8.5% of the entire collection, according to Wikimedia analytics, which measures the site's traffic. His stats far outpace any other user, the group says. |
He has been particularly prolific cataloging obscure animal species, including butterflies and beetles, and is proud of his work highlighting towns in the Philippines. About one-third of his entries are uploaded to the Swedish language version of Wikipedia, and the rest are composed in two versions of Filipino, one of which is his wife's native tongue.
An administrator holding degrees in linguistics, civil engineering, economics and particle physics, he says he has long been interested in "the origin of things, oh, everything."
It isn't uncommon, however, for Wikipedia purists to complain about his method. That is because the bulk of his entries have been created by a computer software program—known as a bot. Critics say bots crowd out the creativity only humans can generate. Mr. Johansson's program scrubs databases and other digital sources for information, and then packages it into an article. On a good day, he says his "Lsjbot" creates up to 10,000 new entries. 10,000 Bot-Generated Articles Per Day: Is That Really Good or Really Bad [online.wsj.com]
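The article's description of the method — scrub a database for facts, then package them into a templated stub — can be sketched in a few lines. This is a hypothetical illustration only: the record fields, template text, and function names are my assumptions, not Lsjbot's actual code or data sources.

```python
# Hypothetical sketch of a stub-generating bot in the spirit of Lsjbot.
# The record format and wikitext template below are illustrative
# assumptions; the real bot's sources and output differ.

def make_species_stub(record):
    """Render a one-paragraph wiki stub from a taxonomic record."""
    return (
        f"'''{record['species']}''' is a species of {record['group']} "
        f"in the family {record['family']}. It was first described by "
        f"{record['author']} in {record['year']}."
    )

record = {
    "species": "Papilio demoleus",
    "group": "butterfly",
    "family": "Papilionidae",
    "author": "Linnaeus",
    "year": 1758,
}
print(make_species_stub(record))
```

Run over a database of millions of such records, a loop like this is how one person's bot can plausibly emit thousands of entries a day.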
In this context I see it as not ideal, but useful in one way. Is it better to look up something and find no entry or a stub entry that can be expanded, edited and become useful?
I imagine the "Bot Approvals Group" might wish to disconnect this type of submission. I would question the accuracy of the taxonomic identification in the entries. The article says that when a project "needed bird photos, the bot turned to the Russian version of Wikimedia Commons, which provides millions of free-to-use images," so he's relying on others to do the research and legwork.
I would also wonder how many entries need to be removed due to copyright issues if the content is all scraped. His contributions have to add significantly to the workload of the editors.
Wow. Where do I buy a copy of this bot?
Maybe I'm missing something here, but at 10,000 entries a day this guy is just clogging up the internet with computer-generated crap. He is doing a disservice to us all, in my opinion.
Young white male nerds. LOL
|In this context I see it as not ideal, but useful in one way. Is it better to look up something and find no entry or a stub entry that can be expanded, edited and become useful? |
I agree. If the entry doesn't exist, what does it hurt to have useful content generated? People can improve it later. The same basic thing happens with many wiki articles created entirely by humans; the authors just use search engines to find information to copy or paraphrase.
Whether we like it or not, automated information gathering is going to continue to improve. I think it's a good thing.
Not so much when it comes to for-profit scrapers, of course, but it's going to be a long time before they can compete with human authors, and search-engine algorithms will continue to get better at telling the difference.
A fair few "respectable" companies are moving towards automated article/news-creation systems, especially for repetitive, fact-based data. Take a look at what a company called Automated Insights does.
|holding degrees in linguistics, civil engineering, economics and particle physics |
I had to look this up. According to his CV, it's:
* M.Sc. Engineering Physics, 1982 [Lund Institute of Technology]
* B.Sc. Economics, 1984 [University of Gothenburg]
* Ph.D. Particle Physics, 1990 [CERN/University of Lund]
* B.A. Linguistics, 2002 [University of Lund]
* M.A. Linguistics, 2012 [University of Lund]
I couldn't figure out his original BA/BS; the M.Sc. is the first thing listed after the gymnasium, or whatever they call it in Sweden. He says, truthfully, "Chronic student". Honestly, it makes me think of that character in Moving Pictures whose grandfather's will stipulated ... Oh, never mind.
In view of the quoted line "I have nothing against Tolkien and I am also more familiar with the battle against Sauron than the Tet Offensive" I find it mildly hilarious that his youngest child is named Faramir. (No, Virginia, this is not a traditional Swedish name.)
At some point (as we've seen on the web itself) the repetition will take down the reputation... and the near-obscene number of articles will be so diluted in content as to be of little or no value. Personally, I see these types of systems as self-defeating, ultimately becoming the Worm Ouroboros, consuming themselves by competing with themselves.
This isn't really new news. [blog.wikimedia.org...]
It may not be new, but it's still the problem of competition. There's no way a human can compete with the volumes produced; the quality, however, is a separate issue.
In addition, what about the scraping, especially if it's your own writing being scraped?
|10,000 per day is quite scary. |
I know, what underachievers.
Seriously though, when you're dealing with the 8.7 million species on Earth and a total of 6,500 spoken languages, the job of translation needs some automation or it'll never get done.
Using these sheer numbers I'd opt for something to find on a page that can be corrected as people go along vs. nothing.
The Wikipedia subset of 287 languages (roughly 2.5B pages for species alone) is better, but I'd suggest an alternative: simply require everyone in the world to learn at least one of the top 10 languages — Chinese (Mandarin), English, Hindi, Spanish, Russian, Arabic, Bengali, Portuguese, Malay-Indonesian and French — and call it a day.
That at least reduces the translation workload into something a bit more manageable thus hopefully increasing the quality per article.
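The arithmetic behind that reasoning is straightforward; a quick check of the figures quoted above (using the 8.7 million species and 287 language editions mentioned in the thread):

```python
# Quick check of the numbers quoted above.
species = 8_700_000          # estimated species on Earth
wikipedia_languages = 287    # Wikipedia language editions
top_languages = 10           # the proposed reduced target

full_pages = species * wikipedia_languages
reduced_pages = species * top_languages
print(f"{full_pages:,}")     # ~2.5 billion species pages
print(f"{reduced_pages:,}")  # 87 million under the top-10 restriction
```

Even the reduced target of 87 million pages is far beyond what human editors could write by hand, which is the poster's point about automation.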
I originally thought this was going to be another site-scraper stealing content for the purpose of posting advertisements. In this specific instance it's subject to the policies that deal with such submissions. If this method creates articles and documentation where there would otherwise have been none, it increases the chances that people will find something rather than nothing.
If I were running a search engine, I would tolerate it only if two conditions were met: first, that these automatically generated pages were explicitly marked as such; second, that these pages were fully reviewed and edited/approved by human beings and then marked as such. As a search engine I would note the resources referenced (I presume the usual cited sources appear on a wiki page) and give those precedence, ranking the wiki page lower unless it genuinely consolidates lots of fragmented data, rather than leaning on a few sources that already give readers a reasonable amount of information on their own.
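The ranking policy described in that post could be sketched as a simple score adjustment. Everything here — the field names, the 0.5 floor, the ten-source consolidation threshold — is an illustrative assumption of mine, not the logic of any real search engine:

```python
# Hypothetical sketch of the ranking policy described above: exclude
# unreviewed bot pages, and demote reviewed ones unless they
# consolidate many fragmented sources. Thresholds are assumptions.

def rank_adjustment(page):
    """Return a multiplier applied to a page's base ranking score."""
    if page["auto_generated"] and not page["human_reviewed"]:
        return 0.0  # unreviewed bot pages are excluded outright
    if page["auto_generated"]:
        # A stub that restates one or two sources adds little; one
        # that consolidates many fragmented sources adds real value.
        consolidation = min(page["num_sources"] / 10, 1.0)
        return 0.5 + 0.5 * consolidation
    return 1.0  # human-written pages rank normally

# A reviewed bot page that consolidates ten sources ranks at full weight.
print(rank_adjustment({"auto_generated": True,
                       "human_reviewed": True,
                       "num_sources": 10}))
```

The design choice worth noting is that the multiplier never rewards a bot page above a human one; at best, heavy consolidation earns it parity.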
Long time lurker. First (or second) time poster...
I just can't see how this can be a good thing. In fact, I have to agree with nomis5. There's already enough "repetition" and garbage on the web. Just look at your news feed: every day you're guaranteed to get the exact same news from all those multiple sources you've subscribed to. In the end it's just information overload.
From an engineering and computer science perspective, however, I gotta respect this kind of automation.
I don't know man...
Surprised that the people at Google didn't hire him so that together they could replace the web in the Google search engine. :)