|New Google tool, free to public, reveals evolution of language|
New Google tool, free to public, reveals evolution of language [mercurynews.com]
Silicon Valley Mercury News
|The world's largest collection of words is now available to the public, for free, through a new Google online database that opens a door to the evolving landscape of language. |
The potential of this searchable library is unveiled in today's issue of the journal Science, where a Harvard University-Google team shows how word usage has waxed and waned over the past two centuries. Their study yields cultural insights as diverse as the spread of innovation, the effects of youth and profession on fame, and even trends in censorship.
The Google Books "Ngram Viewer" and the downloadable raw data set (ngrams.googlelabs.com) achieve what mere mortals can't: analysis of 500 billion words from 5 million books published over the past four centuries, part of Google's ambitious book-scanning project....
Google's announcement, posted by Jon Orwant, Engineering Manager, Google Books...
Find out what’s in a word, or five, with the Google Books Ngram Viewer [googleblog.blogspot.com]
The Official Google Blog
|Scholars interested in topics such as philosophy, religion, politics, art and language have employed qualitative approaches such as literary and critical analysis with great success. As more of the world's literature becomes available online, it's increasingly possible to apply quantitative methods to complement that research. So today Will Brockman and I are happy to announce a new visualization tool called the Google Books Ngram Viewer, available on Google Labs. We’re also making the datasets backing the Ngram Viewer, produced by Matthew Gray and intern Yuan K. Shen, freely downloadable so that scholars will be able to create replicable experiments in the style of traditional scientific discovery.... |
The Harvard site....
|The Cultural Observatory at Harvard is working to enable the quantitative study of human culture across societies and across centuries. We do this in three ways: |
- Creating massive datasets relevant to human culture
- Using these datasets to power wholly new types of analysis
- Developing tools that enable researchers and the general public to query the data
The Google tool...
Google Labs - Books Ngram Viewer
Nice, there's a lot of data there - although (understandably) an ngram needs to be found 40 times before it's valid for inclusion.
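To make that cutoff concrete, here is a rough sketch of n-gram counting with a minimum-occurrence threshold. The whitespace tokenizer, the function names, and the tiny demo corpus are my own assumptions for illustration; this is not Google's actual pipeline, only the general idea of dropping anything seen fewer than `min_count` times.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(texts, n, min_count=40):
    """Count n-grams across all texts and keep those seen at least min_count times.

    Splitting on whitespace is a simplifying assumption; a real
    tokenizer would also handle punctuation and hyphenation.
    """
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.split(), n))
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy corpus with a toy threshold of 2 so the filter has a visible effect.
corpus = ["the cat sat on the mat", "the cat ran", "a dog sat on the mat"]
kept = count_ngrams(corpus, 2, min_count=2)
```

With the default `min_count=40`, nothing in a corpus this small would survive, which is the point: rare phrases simply do not appear in the data.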
Like the Google (LDC-distributed) ngram dataset from a few years back (which was for English web pages), this data is really handy for research purposes.
I'm off to upset my broadband provider...
Interesting tool. Some random thoughts.
I've used the vaguely similar ARTFL project tools for 15 years or so to get the equivalent in French.
The ARTFL text database is tiny in comparison - just a few thousand texts. However, they are verified and based on good editions. So searches tend to be more high culture biased than the ngram project.
For cultural studies, I generally find the ARTFL tool more interesting for many reasons:
- regular expression searches
- small, medium and broad context (line, paragraph, page; work for out of copyright works)
- full attribution (author, work, date)
- filtering (by author, date)
I guess the ngram tool ultimately has "context" by taking an ngram and putting it into Google Books.
I spend hours per day in Google Books and doing text-based searches is a real art. As books get older, especially yellowing books from the 1850-1950 period (the worst for that problem) and with older fonts, the OCR tends to be poor. As you go even further back and spellings get less consistent, the OCR tends to mangle words more often.
So it makes me think the data gets worse as you go back in time. Words that commonly had a lot of kerning, or letters that were easily confounded, would, I think, show a certain evolution even without any change in usage, just because of changes in printing techniques.
Nevertheless, I do find it interesting to be able to do searches like this
[ngrams.googlelabs.com...] (be nicer, lose weight)
[ngrams.googlelabs.com...] (liberty, safety)
[ngrams.googlelabs.com...] (f**k - very interesting and something I've seen a lot as a historian - people were more prudish between 1850 and 1950 than they were before or since)
For those of us who are colorblind, you can't really do searches with more than three terms.
The good thing about Google is that they still come up with these kinds of scientific, free-for-all things that are not completely business oriented.
|I'm off to upset my broadband provider... |
|...and with older fonts, the OCR tends to be poor. As you go even further back and spellings get less consistent, the OCR tends to mangle words more often. |
I'm surprised the OCR is as good as it is, particularly on the "long" or "medial s" character that resembles an "f". The Ngram info page [ngrams.googlelabs.com] cites this character as a common cause of misspellings and refers us to this Wikipedia article....
The article suggested to me at least a rough way of quantifying the error...
|The long s fell out of use in roman and italic typefaces well before the middle of the 19th century; in France the change occurred from about 1780 onwards, in Britain in the decades around 1800, and some twenty years later in the United States. This may have been spurred by the fact that long s looks somewhat like an f (in both its roman and italic forms), whereas short s did not have this disadvantage, making it easier to identify, especially for people with problems of vision. |
I ran comparisons for the following pairs to get an indication of whether common f/s misspellings would fade out as suggested... and generally they do, with a sharp crossover between 1780-1820....
The change was about .0300% or less, which is actually not trivial compared to some other changes you can observe over time. I assume that research done with the tool might eventually be used to provide feedback for fixing some of the OCR, assuming, that is, that there are precise fixes.
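For anyone who wants to repeat this kind of comparison on the downloadable data, here is a minimal sketch of the arithmetic: for each year, what share of a word's occurrences show up in the f-for-s form, and in which year the correct spelling first overtakes it. The function names are hypothetical, and the frequencies in the demo are invented placeholders, not numbers read off the Ngram Viewer.

```python
def misread_share(correct, misread):
    """Per-year fraction of occurrences that appear in the misread (f-for-s) form."""
    shares = {}
    for year in sorted(correct.keys() & misread.keys()):
        total = correct[year] + misread[year]
        shares[year] = misread[year] / total if total else 0.0
    return shares

def crossover_year(correct, misread):
    """First year in which the correct spelling is more frequent than the misread one."""
    for year in sorted(correct.keys() & misread.keys()):
        if correct[year] > misread[year]:
            return year
    return None

# Invented placeholder frequencies (percent of all words), NOT real Ngram data.
correct = {1780: 0.005, 1800: 0.015, 1820: 0.030}
misread = {1780: 0.025, 1800: 0.012, 1820: 0.001}

shares = misread_share(correct, misread)
year = crossover_year(correct, misread)
```

Only years present in both series are compared, so a gap in one series cannot produce a false crossover.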
The tool is a natural for exploring divergences and parallels, and for finding correlations among language, history, and consciousness. Here are some I tried...
[ngrams.googlelabs.com...] (love,hate,fear) - start 1600
[ngrams.googlelabs.com...] (money,happiness,freedom) - start 1700
[ngrams.googlelabs.com...] (Bach,Mozart,Beethoven) - start 1700
[ngrams.googlelabs.com...] (dragons,reptiles) - start 1600
[ngrams.googlelabs.com...] (faith,science) - start 1600
[ngrams.googlelabs.com...] (hot,cold) - start 1700
»(Bach,Mozart,Beethoven) - start 1700
Good one - the peak due to the spread of radio?
>>"long" or "medial s" character
that's one I was thinking of, but lots of serif fonts are also very thin at the tops of the circles, so "o" can become "ii"
I have to say that 15 years ago I needed to publish an edition of some older articles and, to save typing, did OCR on them. In some cases I got entire pages of iiiiis.
Recently, I had to do something similar and ended up with something around 95% accuracy or higher. OCR has improved dramatically. Even within Google Books, I think it has gotten noticeably better over the course of the project.
So I'm not knocking it. I am able to find things in one hour that as recently as five years ago I would have only been able to find over months of sending query letters to leading experts, hoping to find someone who could answer (recent mystery solved thanks to Google Books: the meaning of the "mal de saint Eloi"). It is phenomenal and it's great to finally have something like this in English.
As I say, though, despite the much smaller set of texts, the French project still has some great advantages. For example, I can search for, to take something from your last post, all occurrences where "language" and "consciousness" occur within four words of each other.
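To show what I mean by a proximity search, here is a bare-bones sketch. It is not how ARTFL implements it; the function name, the simple regex tokenizer, and the reading of "within four words" as "positions differ by at most four" are all my own assumptions.

```python
import re

def within_n_words(text, a, b, n=4):
    """Return (pos_a, pos_b) pairs where words a and b are at most n positions apart.

    Matching is case-insensitive, and tokens are runs of word characters.
    Both are simplifying assumptions.
    """
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return [(i, j) for i in pos_a for j in pos_b if abs(i - j) <= n]

sample = "Language shapes consciousness, and consciousness in turn shapes the language."
hits = within_n_words(sample, "language", "consciousness", n=4)
```

The returned positions are word offsets, so each hit can be mapped back to its surrounding line or paragraph for context.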
I expect to see this sort of thing get better and better, combining the best features of existing systems and adding in intelligence you and I have perhaps not even thought of. In 15-20 years, people will look at this discussion and marvel at how primitive and brutish our methods are.
System: The following 3 messages were spliced on to this thread from: http://www.webmasterworld.com/google_toolbar_tools/4244263.htm [webmasterworld.com] by tedster - 6:06 pm on Dec 19, 2010 (EST -5)
Google Labs has launched the Google Books NGram Viewer - an interesting search over all the books that they've scanned, converted into trend graphs. [ngrams.googlelabs.com...]
Watch out - it is case sensitive.
I've been having a ball with it. For example, compare Plato and Aristotle [ngrams.googlelabs.com].
Aristotle references definitively passed Plato references in the 1950s when Bertrand Russell became the big philosophy "star" - and he hated Plato.
Another fun play - Joomla versus Drupal [ngrams.googlelabs.com]
And finally, Danny Sullivan warns [searchengineland.com] about the challenges for OCR posed by the old style "medial S" which looks like an "F"
We've also been discussing this in some detail over in the Google Labs forum...
New Google tool, free to public, reveals evolution of language
I took a look at how we can roughly quantify the "long" or "medial s" errors for certain words, where there's no ambiguity. One of Danny's observations on the medial s though, does in fact directly comment on usage of a word euphemistically noted in our discussion, where there is ambiguity.
Plato and Aristotle is a great one, but it needs to go back to around 1100 to get real interesting.
Still, starts to get interesting if you push back before 1800
Interesting. This will let a ton of researchers tap into all of the book scans Google is doing and pull out the data they need for anthropological research, as well as for studies in other areas, even the holy grail of how innovation spreads in the written word. If each of the areas Google mentions were turned into a scientific study for its own discipline, imagine the results we would find.
|Even within Google Books, I think it has gotten noticeably better over the course of the project. |
Google is getting a big helping hand from the public to digitize books. They have put the ReCAPTCHA spam blocking tool [google.com...] to work deciphering unreadable text characters.