
10,000 Bot-Generated Articles Per Day: Is That Really Good or Really Bad

     
1:56 pm on Jul 15, 2014 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month

joined:May 9, 2000
posts:23241
votes: 357


I read this with some dismay. I know it goes on, but really, 10,000 per day is quite scary.

Clearly, computer article generation has improved considerably, and nobody can compete with the volume of output. I do, however, wonder about the quality, and about the ethics of scraping content and republishing it, especially in those numbers.

Sverker Johansson could be the most prolific author you've never heard of. Volunteering his time over the past seven years publishing to Wikipedia, the 53-year-old Swede can take credit for 2.7 million articles, or 8.5% of the entire collection, according to Wikimedia analytics, which measures the site's traffic. His stats far outpace any other user, the group says.

He has been particularly prolific cataloging obscure animal species, including butterflies and beetles, and is proud of his work highlighting towns in the Philippines. About one-third of his entries are uploaded to the Swedish language version of Wikipedia, and the rest are composed in two versions of Filipino, one of which is his wife's native tongue.

An administrator holding degrees in linguistics, civil engineering, economics and particle physics, he says he has long been interested in "the origin of things, oh, everything."

It isn't uncommon, however, for Wikipedia purists to complain about his method. That is because the bulk of his entries have been created by a computer software program—known as a bot. Critics say bots crowd out the creativity only humans can generate. Mr. Johansson's program scrubs databases and other digital sources for information, and then packages it into an article. On a good day, he says his "Lsjbot" creates up to 10,000 new entries.

10,000 Bot-Generated Articles Per Day: Is That Really Good or Really Bad [online.wsj.com]
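The article doesn't explain the mechanics, but the basic pattern it describes (pull a record from a structured database, pour it into a sentence template, write out a stub) is simple enough to sketch. Everything below is hypothetical: the field names, the template and the "species" records are invented for illustration, not taken from Lsjbot.

```python
# Rough sketch of database-to-stub generation, in the spirit of what the
# article describes. All field names, templates and records are invented.

ARTICLE_TEMPLATE = (
    "{name} is a species of {group} in the family {family}. "
    "It was described by {author} in {year}."
)

def make_stub(record: dict) -> str:
    """Fill the sentence template from one structured database record."""
    return ARTICLE_TEMPLATE.format(**record)

# Example records as they might come out of a taxonomic database.
records = [
    {"name": "Examplia fictiva", "group": "beetle", "family": "Examplidae",
     "author": "Smith", "year": 1903},
    {"name": "Examplia minor", "group": "beetle", "family": "Examplidae",
     "author": "Jones", "year": 1921},
]

for rec in records:
    print(make_stub(rec))
```

Run something like that over a database holding a few million species records and 10,000 entries a day is not hard to imagine.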
3:44 pm on July 15, 2014 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:2808
votes: 71


In this context I see it as not ideal, but useful in one way. Is it better to look up something and find no entry or a stub entry that can be expanded, edited and become useful?

I imagine the "Bot Approvals Group" might wish to disconnect this type of submission. I would question the accuracy of taxonomic identification in the entries. The article says that when a project "needed bird photos, the bot turned to the Russian version of Wikimedia Commons, which provides millions of free-to-use images," so he's relying on others to do the research and legwork.

I would also wonder how many entries need to be removed due to copyright issues if the content is all scraped. His contributions have to add significantly to the workload of the editors.
8:49 pm on July 15, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:June 6, 2006
posts:1165
votes: 33


Wow. Where do I buy a copy of this bot?
9:48 pm on July 15, 2014 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 29, 2005
posts:1937
votes: 62


Maybe I'm missing something here, but at 10,000 entries a day this guy is just clogging up the internet with computer-generated crap. He is doing a disservice to us all, in my opinion.
10:14 pm on July 15, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 31, 2006
posts:1255
votes: 13


Young white male nerds. LOL
10:47 pm on July 15, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 4, 2001
posts: 1265
votes: 12


In this context I see it as not ideal, but useful in one way. Is it better to look up something and find no entry or a stub entry that can be expanded, edited and become useful?


I agree. If the entry doesn't exist, what does it hurt to have useful content generated? People can improve it later. The same basic thing happens on many wiki articles created entirely by humans; their authors just use search engines to find information to copy or paraphrase.

Whether we like it or not, automated information gathering is going to continue to improve. I think it's a good thing.

Not so much when it comes to for-profit scrapers, of course, but it's going to be a long time before they can compete with human authors, and SE algos will continue to get better at telling the difference.
12:06 am on July 16, 2014 (gmt 0)

Senior Member from HK 

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 14, 2002
posts:2290
votes: 16


A fair few "respectable" companies are moving towards automated article / news creation systems, especially for repetitive, fact-based data. Take a look at what a company called Automated Insights does.
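For anyone who hasn't seen that space: those systems typically turn a row of structured numbers into a short prose recap, choosing the wording from the data. A toy illustration only; the company, figures and phrasing below are invented, and this is not how Automated Insights actually does it.

```python
# Toy illustration of data-to-text generation for repetitive, fact-based
# stories. Company names and figures below are invented.

def earnings_blurb(company: str, revenue: float, expected: float) -> str:
    """Turn one row of quarterly figures into a one-sentence recap."""
    diff_pct = (revenue - expected) / expected * 100
    if diff_pct > 1:
        verdict = f"beat expectations by {diff_pct:.1f}%"
    elif diff_pct < -1:
        verdict = f"missed expectations by {abs(diff_pct):.1f}%"
    else:
        verdict = "came in roughly in line with expectations"
    return (f"{company} reported quarterly revenue of "
            f"${revenue:,.0f} million, which {verdict}.")

print(earnings_blurb("Acme Widgets", revenue=412.0, expected=390.0))
# -> Acme Widgets reported quarterly revenue of $412 million, which beat
#    expectations by 5.6%.
```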
1:10 am on July 16, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13218
votes: 348


holding degrees in linguistics, civil engineering, economics and particle physics

I had to look this up. According to his CV, it's:
* M.Sc. Engineering Physics, 1982. [Lund Institute of Technology]
* B.Sc. Economics, 1984. [University of Gothenburg]
* Ph.D. Particle Physics, 1990. [CERN/University of Lund]
* B.A. Linguistics, 2002. [University of Lund]
* M.A. Linguistics, 2012. [University of Lund]

I couldn't figure out his original BA/BS; the M.Sc. is the first thing listed after the Gymnas or whatever they call it in Sweden. He says, truthfully, "Chronic student". Honestly, it makes me think of that character in Moving Pictures whose grandfather's will stipulated ... Oh, never mind.

In view of the quoted line "I have nothing against Tolkien and I am also more familiar with the battle against Sauron than the Tet Offensive" I find it mildly hilarious that his youngest child is named Faramir. (No, Virginia, this is not a traditional Swedish name.)
4:44 am on July 16, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:6967
votes: 388


At some point (as we've seen on the web itself) the repetition will take down the reputation... and the near-obscene number of articles will be so diluted in content as to be of little or no value. Personally, I see these types of systems as self-defeating, ultimately becoming the Worm Ouroboros, consuming themselves by competing with themselves.
7:04 am on July 16, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 31, 2006
posts:1255
votes: 13


This isn't really new news. [blog.wikimedia.org...]
8:23 am on July 16, 2014 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month

joined:May 9, 2000
posts:23241
votes: 357


It may not be new, but it's still the problem of competition. There's no way a human can compete with the volumes produced; the quality, however, is a separate issue.
In addition, what about the scraping, especially if it's your own content being scraped?
12:41 am on July 17, 2014 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14650
votes: 94


10,000 per day is quite scary.


I know, what underachievers.

Slackers. :)

Seriously though, when you're dealing with the 8.7 million species on Earth and a total of 6,500 spoken languages, the job of translation needs some automation or it'll never get done.

Given these sheer numbers, I'd opt for finding something on a page that can be corrected as people go along vs. finding nothing.

The Wikipedia subset of 287 languages (2.4B pages for species alone) is better, but I'd suggest an alternative: simply require everyone in the world to learn at least one of the top 10 languages (Mandarin Chinese, English, Hindi, Spanish, Russian, Arabic, Bengali, Portuguese, Malay-Indonesian and French) and call it a day.

That at least reduces the translation workload to something a bit more manageable, hopefully increasing the quality per article.
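For what it's worth, here's a quick back-of-the-envelope check of the scale, using only the figures quoted above (8.7 million species, 287 Wikipedia language editions, 6,500 spoken languages, and the suggested top-10 shortlist):

```python
# Back-of-the-envelope scale check using the figures quoted in this post.
species = 8_700_000        # estimated species on Earth
wiki_languages = 287       # Wikipedia language editions
spoken_languages = 6_500   # spoken languages worldwide
top_languages = 10         # the suggested "top 10" shortlist

for label, langs in [("Wikipedia editions", wiki_languages),
                     ("all spoken languages", spoken_languages),
                     ("top-10 shortlist", top_languages)]:
    pages = species * langs
    print(f"{label:22s}: {pages / 1e9:6.2f} billion species pages")

# Prints roughly:
#   Wikipedia editions    :   2.50 billion species pages
#   all spoken languages  :  56.55 billion species pages
#   top-10 shortlist      :   0.09 billion species pages
```

Even the shortlist is far more than any group of human editors will ever write by hand.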
5:06 pm on July 18, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member jab_creations is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Aug 26, 2004
posts:3159
votes: 15


I originally thought this was going to be another site scraper stealing content for the purpose of posting advertisements and such. For this specific instance, it comes down to the policies for dealing with the issue. If this method creates articles and documentation where there would otherwise have been none, it increases the chances that people will find something versus nothing.

If I were running a search engine I would only tolerate it if two conditions were met: first, that these automatically generated pages were explicitly marked as such; second, that these pages were fully reviewed and edited/approved manually by human beings and then marked as such.

As a search engine I would note the resources referenced (I am presuming there are the usual cited sources on a wiki page) and give them precedence, while ranking the wiki page lower or higher depending on whether it really consolidates lots of fragmented data or simply relies on a few sources that, on their own, already give potential readers a reasonable amount of information.

- John
6:44 am on July 22, 2014 (gmt 0)

New User

joined:Feb 4, 2013
posts:1
votes: 0


Long time lurker. First (or second) time poster...

I just can't see how this can be a good thing. In fact, I have to agree with nomis5. There's already enough "repetition" and garbage on the web. Just look at your news feed: guaranteed, every day you will get the exact same news from all those multiple sources you've subscribed to. In the end it's just information overload.

From an engineering and computer science perspective, however, I gotta respect this kind of automation.

I don't know man...
9:13 am on July 22, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 30, 2002
posts: 2510
votes: 44


Surprised that the people at Google didn't hire him so that together they could replace the web in the Google search engine. :)

Regards...jmcc