Forum Moderators: phranque

Wikipedia content - Scrape it or use api?

NickMNS

11:57 pm on Feb 21, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Let me begin by explaining the context. I have a site that provides statistics about a large set of like entities, something like baseball players, for example. The main content is very much statistics-based, and everything is calculated with our proprietary algos and data. When searchers look for this type of information, they often would like some subjective context about these entities: who or what they are. Many sites provide this information, and there is nothing of value that I (or anybody else, really) can add. A prime source for such context is Wikipedia.

As a convenience to my users, I am considering adding this contextual content to my pages, either by scraping the data from Wikipedia, storing it, and then displaying it, or by simply calling their REST API and using AJAX to show it. In either case, the content will be clearly attributed to Wikipedia and used under the Creative Commons license.

Which method is preferable? Scraping takes time and requires storage, but the data are sure to be displayed to the user quickly. The API requires no scraping or storage overhead, but will likely delay page load.

Any thoughts?

In general, is using Wikipedia data a bad idea?
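
For what it's worth, the two options can be blended: call the REST API from the server and keep each response for a while, so most page loads are served from a local copy. Below is a minimal sketch of that idea, not a finished implementation. It assumes the page-summary route at `en.wikipedia.org/api/rest_v1/page/summary/{title}`; the `SummaryCache` class, the User-Agent string, and the one-day `max_age` are made-up names and values for illustration.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint: Wikipedia's REST page-summary route.
SUMMARY_URL = "https://en.wikipedia.org/api/rest_v1/page/summary/{title}"


def fetch_summary(title):
    """Fetch one page summary as a dict (performs a network call)."""
    slug = urllib.parse.quote(title.replace(" ", "_"), safe="")
    # Hypothetical User-Agent; replace it with one that identifies your site.
    req = urllib.request.Request(
        SUMMARY_URL.format(title=slug),
        headers={"User-Agent": "example-stats-site/0.1"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


class SummaryCache:
    """Keep summaries in memory and refetch only after max_age seconds."""

    def __init__(self, fetcher=fetch_summary, max_age=86400, clock=time.time):
        self.fetcher = fetcher
        self.max_age = max_age
        self.clock = clock
        self._store = {}  # title -> (fetched_at, summary dict)

    def get(self, title):
        now = self.clock()
        hit = self._store.get(title)
        if hit is not None and now - hit[0] < self.max_age:
            return hit[1]  # still fresh: no network call
        try:
            summary = self.fetcher(title)
        except OSError:
            # Network failure: fall back to the stale copy if one exists.
            if hit is not None:
                return hit[1]
            raise
        self._store[title] = (now, summary)
        return summary
```

A page handler would then call something like `cache.get("Babe Ruth")` and render the returned text with attribution. A real site would probably swap the in-memory dict for a database table or similar, which is where this starts to look like the scrape-and-store option again.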

lucy24

12:26 am on Feb 22, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In addition to the factors you've already considered: Don't forget to weigh the risk that three seconds before someone accesses your own page, someone has performed a malicious Wikipedia edit. You probably don't want your readers learning that such-and-such blameless person has been found guilty of practicing black magic, all because that's what the API happened to send along.

NickMNS

1:00 am on Feb 22, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Good point. I would need to filter the response before showing it.
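
One possible shape for such a filter, as a rough sketch only: hold back any response whose last edit is very recent, and keep showing the last copy that passed the check. This assumes the API response carries a `timestamp` field with the ISO 8601 time of the last edit. The names `is_settled` and `choose_summary` and the 24-hour window are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_settled(summary, now=None, min_age=timedelta(hours=24)):
    """Return True if the page's last edit is at least min_age old."""
    now = now or datetime.now(timezone.utc)
    stamp = summary.get("timestamp")  # assumed field: time of last edit
    if not stamp:
        return False  # no timestamp: treat the response as unverified
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    edited = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    if edited.tzinfo is None:
        edited = edited.replace(tzinfo=timezone.utc)
    return now - edited >= min_age


def choose_summary(fresh, last_good, now=None, min_age=timedelta(hours=24)):
    """Prefer the fresh response only once its last edit has aged enough."""
    if is_settled(fresh, now=now, min_age=min_age):
        return fresh
    return last_good  # may be None if nothing has passed the check yet
```

This only narrows the window. An edit that survives longer than `min_age` would still get through, so a word blocklist or a manual review step might be needed on top of it.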