The world's largest collection of words is now available to the public, for free, through a new Google online database that opens a door to the evolving landscape of language.
The potential of this searchable library is unveiled in today's issue of the journal Science, where a Harvard University-Google team shows how word usage has waxed and waned over the past two centuries. Their study yields cultural insights as diverse as the spread of innovation, the effects of youth and profession on fame, and even trends in censorship.
The Google Books "Ngram Viewer" and the downloadable raw data set (ngrams.googlelabs.com) achieve what mere mortals can't: analysis of 500 billion words from 5 million books published over the past four centuries, part of Google's ambitious book-scanning project....
Scholars interested in topics such as philosophy, religion, politics, art and language have employed qualitative approaches such as literary and critical analysis with great success. As more of the world's literature becomes available online, it's increasingly possible to apply quantitative methods to complement that research. So today Will Brockman and I are happy to announce a new visualization tool called the Google Books Ngram Viewer, available on Google Labs. We’re also making the datasets backing the Ngram Viewer, produced by Matthew Gray and intern Yuan K. Shen, freely downloadable so that scholars will be able to create replicable experiments in the style of traditional scientific discovery....
The Cultural Observatory at Harvard is working to enable the quantitative study of human culture across societies and across centuries. We do this in three ways:
- Creating massive datasets relevant to human culture
- Using these datasets to power wholly new types of analysis
- Developing tools that enable researchers and the general public to query the data
I'm off to upset my broadband provider...
...and with older fonts, the OCR tends to be poor. As you go even further back and spellings get less consistent, the OCR tends to mangle words more often.
The long s fell out of use in roman and italic typefaces well before the middle of the 19th century; in France the change occurred from about 1780 onwards, in Britain in the decades around 1800, and some twenty years later in the United States. This may have been spurred by the fact that long s looks somewhat like an f (in both its roman and italic forms), whereas short s does not have this disadvantage, making it easier to identify, especially for people with problems of vision.
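For anyone who wants to check how much the long s affects results for older books, here is a rough Python sketch that guesses the spelling OCR might produce when each long s is misread as an f. It assumes a simplified rule (every lowercase s except the last letter of a word was printed as a long s), which does not cover all historical typesetting conventions. The function name `long_s_misreading` is made up for this illustration and is not part of any Google tool.

```python
def long_s_misreading(word: str) -> str:
    """Guess the OCR output if every long s in `word` were read as 'f'.

    Simplifying assumption: every lowercase 's' except a word-final one
    was printed as a long s. Real typesetting rules had more exceptions.
    """
    if len(word) < 2:
        return word
    body, last = word[:-1], word[-1]
    # Replace only non-final lowercase 's'; capital 'S' had no long form.
    return body.replace("s", "f") + last


if __name__ == "__main__":
    for w in ["best", "success", "Session", "is"]:
        print(w, "->", long_s_misreading(w))
```

Searching for both the modern spelling and the guessed misreading (for example "best" and "beft") and comparing the two curves for years before about 1800 gives a rough sense of how many occurrences the OCR is misfiling.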
Even within Google Books, I think it has gotten noticeably better over the course of the project.