Here’s another interesting bauble from the nerds at Google. The Books Ngram Viewer allows you to plot the frequency of words and phrases that appeared in books published in the past few hundred years. Google estimates they’ve scanned and OCR’d more than 10 percent of all the books ever published, and this plotter is based on a sample of that data.
This “most excellent time-wasting tool” was blogged about by Alexis Madrigal, a senior editor for TheAtlantic.com, in his post, The Decline of Man (as a Word), in which he shows how the word “man” has fared against “woman”. (Not well.) As Madrigal observes, this may not serve a legitimate research purpose, but it sure is fun.
Here’s a sample. I’ve searched for the term “database”, and set the years to search as 1950 to 2008. The y-axis shows, for each year, the percentage of all the terms in Google’s sample of English-language books that are “database”. As you can see, the word didn’t emerge in published sources before the early 1970s.
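If you're curious what that y-axis means in practice, here is a minimal Python sketch of the same idea: for each year, count how often a term appears and divide by the total number of words published that year. The function name `yearly_share` and the three-book "corpus" are made up for illustration; this is not Google's actual method or data.

```python
from collections import Counter

def yearly_share(books, term):
    """Return {year: percentage of all words in that year's books equal to term}."""
    totals = Counter()  # total words seen per year
    hits = Counter()    # occurrences of the term per year
    term = term.lower()
    for year, text in books:
        words = text.lower().split()
        totals[year] += len(words)
        hits[year] += words.count(term)
    return {year: 100.0 * hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}

# Toy, made-up corpus of (publication year, text) pairs.
sample_books = [
    (1965, "the card file held every record"),
    (1975, "the database held every record"),
    (1975, "a database is a shared file"),
]

shares = yearly_share(sample_books, "database")
for year, pct in shares.items():
    print(year, round(pct, 2))
```

Dividing by the yearly total is what makes years comparable: far more books were published in 2000 than in 1800, so raw counts alone would mislead.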
The tool also allows you to plot the progress of one term against another. If you plot “database” against “data base”, you’ll see that the two-word term enjoyed a short life before the single word took over. I’ve been interested in the use of the word “gift” instead of “donation,” but the plot of those two words isn’t very informative due, I guess, to the many connotations of the word “gift.” Instead I plotted “charitable gift” and “charitable donation” to put the words in context, and came up with this chart. The concept of giving seems to have had quite a heyday up until around 1835, and “donation” was firmly in the lead. By 1880, though, it was all about the gift.
That got me thinking about how well “philanthropy” has done through the years. Mentions before 1750 are rare, so I plotted from then to the present, and once again the first half of the 19th century seems to have been relatively more preoccupied with the idea than later on. (Although, of course, who knows what data this is really based on. As I said, it’s fun, but I wouldn’t want to base a thesis on it without knowing more about the underlying data.)
Hmm – this IS fun. What if we plot poverty vs. religion vs. education? This doesn’t tell us what people were giving to, but it does give a glimpse into what they were writing about. “Poverty” has stayed relatively constant since 1750, but look at how “religion” has declined as “education” has risen. One line crosses the other right at 1909. Also interesting is that the trend started reversing direction about 10 years ago.
And finally, this chart plots “data mining” and two variations of “fundraising”. Data mining takes off as a published term in the early 1990s, and the term “fund raising” has merged into the single word, “fundraising.”
All sorts of fun. Try some for yourself! I’d be interested in hearing about any cool combos you come up with that relate to analytics and/or fundraising.
Early this year I posted a tutorial on how to do some very basic text-mining of free-text comments. The method I outlined in my post focuses on single words, rather than word combinations or phrases. Here’s a quick way to extract the most common multi-word terms from a batch of text. (Followed by a postscript which offers another free and quick way to do the same thing.)
The National Centre for Text Mining (NaCTeM) bills itself as the “first publicly-funded text mining centre in the world.” The organization provides text mining services to the academic community in the United Kingdom. (NaCTeM is operated by the University of Manchester in close collaboration with the University of Tokyo.)
On their site you’ll find a free service called TerMine, which will extract terms and phrases for you. You can submit text for analysis (up to 2 MB) by pasting it into a window, uploading a file (.txt or .pdf) from your computer, or entering a URL (.html or .pdf). Then select the type of tagging system to use. ‘Tree Tagger version 3.1’ is most suited to the generic text you’re most likely to be analyzing; the other option, ‘GENIA Tagger version 2.1’ is more appropriate for texts from the bio-medical sciences.
Click “Analyze” to get the results. Below is a sample: the first 15 common terms from the entire CoolData blog.
I don’t know what an “alumnus donor” is, but all of the rest stand up as valid terms. (Funny, I didn’t know that I say “tough job” a lot, but apparently I do.)
All this gives you is the most common terms from your entire set of comments. It doesn’t tell you which terms are linked to which individuals in your data set, or how they are correlated with your dependent variable (DV). That involves a few extra steps, which you can find in my original post on text mining. Have fun!
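To give a rough idea of what those extra steps look like, here is a small Python sketch. It flags each record that mentions a given term, then compares the average of the outcome variable between flagged and unflagged records. This is a simplified stand-in, not the procedure from my original post; the record IDs, the comments, and the 0/1 donor flag used as the outcome are all invented for illustration.

```python
def term_flag(comment, term):
    """Return 1 if the term appears in the comment (case-insensitive), else 0."""
    return 1 if term.lower() in comment.lower() else 0

def mean_dv_by_flag(records, term):
    """Average the outcome separately for records with and without the term."""
    groups = {0: [], 1: []}
    for _record_id, comment, dv in records:
        groups[term_flag(comment, term)].append(dv)
    return {flag: (sum(vals) / len(vals) if vals else None)
            for flag, vals in groups.items()}

# Hypothetical records: (ID, free-text comment, donor flag as the outcome).
records = [
    ("A001", "Loved the alumni magazine and the reunion", 1),
    ("A002", "Tough job market when I graduated", 0),
    ("A003", "The alumni magazine keeps me connected", 1),
    ("A004", "No comment", 0),
]

result = mean_dv_by_flag(records, "alumni magazine")
print(result)
```

A large gap between the two averages suggests the term is worth keeping as an indicator variable, though with real data you would want far more than four records before trusting it.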
POSTSCRIPT
P.S. Another tool you can use to do the same thing as TerMine is called Primitive Word Counter. It’s a free download, and does not require you to upload your sensitive text files to the Net. I’ve just given it a try, and it does a great job of identifying frequently-used words AND whole phrases.
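If you'd rather not install anything, a few lines of Python will also count phrases without your text ever leaving your machine. The sketch below is a plain n-gram count (every run of n consecutive words), which is my own simplification and not a description of how TerMine or Primitive Word Counter work internally. The function name `top_phrases` and the sample text are made up.

```python
import re
from collections import Counter

def top_phrases(text, n=2, k=5):
    """Return the k most frequent n-word phrases as (phrase, count) pairs."""
    # Lowercase and keep only runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n words across the list. Note that this naive
    # window also spans sentence boundaries.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams).most_common(k)

comments = "Data mining is fun. Data mining is a tough job. A tough job, but fun."
for phrase, count in top_phrases(comments, n=2, k=4):
    print(count, phrase)
```

Set `n=3` for three-word phrases. Unlike a proper term extractor, this will happily report junk such as "is a", so expect to do some weeding by hand.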
Word clouds aren’t new, but there’s a new online app for creating them that is worth checking out. Tagxedo allows you to create your clouds using some versatile tools for shaping the appearance of the cloud, which you can then easily save as a .jpg or .png.
This comes to me via a post on the LoveStats blog, where Annie Pettit has posted a couple of her own creations – one based on the text of her resume, and one on all the words in her blog.
I wrote about word clouds back in December (Quick and easy visuals of large text files), covering the well-known and very cool tool Wordle, the creation of Jonathan Feinberg. Tagxedo does the same thing but works a little differently. Powered by Microsoft’s Silverlight browser plug-in, Tagxedo offers a nifty interface for importing your text (or URL), finely controlling your word choice, and playing with the font, colour, theme and layout of your cloud, including being able to choose a shape. The choice of shapes is rather limited – hearts, stars, rectangles and ovals, mostly. Here’s a star-shaped word cloud based on the 150 most common words on this blog:
My interest in word clouds is related to visualization of data – in this context, conveying the gist of a mass of text by giving prominence to the most common significant words. For example, last year I used Wordles to visualize tens of thousands of words entered as free-text comments in a survey of alumni. It’s no substitute for real analysis, but it does make a cool presentation slide!
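For the curious, the idea behind any word cloud can be sketched in a few lines of Python: drop the filler words, count what is left, and scale each word's font size by its count. This is only an illustration of the principle, not how Tagxedo or Wordle are actually built; the stopword list, the point sizes, and the function name `cloud_sizes` are my own assumptions.

```python
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "and", "of", "to", "in", "is", "it", "for", "on"}

def cloud_sizes(text, top=150, min_pt=10, max_pt=60):
    """Map each of the `top` most common significant words to a font size."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    common = counts.most_common(top)
    if not common:
        return {}
    highest = common[0][1]
    lowest = common[-1][1]
    span = (highest - lowest) or 1  # avoid dividing by zero when all counts tie
    # Linear scaling: rarest kept word gets min_pt, most common gets max_pt.
    return {word: min_pt + (count - lowest) * (max_pt - min_pt) / span
            for word, count in common}

sizes = cloud_sizes("Data and more data: the donors, the data, and the gifts.")
for word, size in sizes.items():
    print(word, size)
```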
NOTE: Check in tomorrow for Jason Boley’s amazing work with NodeXL for visualizing prospect connections in your data.
Late last year I posted a tutorial on creating Google motion charts with your data. These very cool charts work with your time-series data, stored in Google Docs, to create an animation with the power to convey a lot of information in an easily understandable form.
But what about private data? You may not want to rely on Google’s ability to password-protect your data, or the privacy provisions you work with may prohibit posting data to an outside server.
Here’s another way to take advantage of motion charts. I was put onto this by Trevor Skillen, President and CEO of Metasoft, in Vancouver BC, whose company is working on incorporating motion charts into their well-known FoundationSearch product.
This version uses stored code to manipulate your data locally, rather than pulling it from Google Docs.
The advantages are clear:
Trevor directed me to Google’s ‘playground’ where one can get a quick feel for the technology without much tech effort.
There is a downside … there is a good deal of manual coding you’ll have to do if you want to put a chart together using your own data. This limits you to fairly simple charts – unless you’re capable of writing the additional code that will allow the chart to get data from a file or table.
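One way to take some of the pain out of the manual coding is to let a short script write the data rows for you. The Python sketch below reads a CSV and prints an `addRows` call you could paste into the chart code. It assumes the layout the motion chart expects: entity name in the first column, a time value (here a year) in the second, and numeric measures after that. The column names and figures are invented, and you would still need to declare matching columns in the chart code yourself.

```python
import csv
import io
import json

def rows_for_motion_chart(csv_text):
    """Convert CSV text (entity, year, measures...) into a header and row lists."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        entity, year, *measures = record
        rows.append([entity, int(year)] + [float(m) for m in measures])
    return header, rows

# Made-up sample data; in practice you would read this from your own file.
sample = """Faculty,Year,Donors,Dollars
Arts,2008,120,45000
Arts,2009,135,52000.5
Science,2008,98,61000
"""

header, rows = rows_for_motion_chart(sample)
# Emit a JavaScript snippet to paste into the chart's code.
print("data.addRows(" + json.dumps(rows, indent=2) + ");")
```

Because the script runs on your own computer and only produces text, your data stays local, which is the whole point of this approach.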
Here’s the fifth and final part of my tutorial on creating a cool motion chart from your complex data set. (Click here to go back to Part 1.) This part is important, and a little tricky.
See, every time the Flash-based chart has to be re-drawn, you lose all your preferred settings. Unfortunately the method for preserving your preferred ‘default state’ is not straightforward. I expect Google will make it easier in future, but in the meantime, here are the steps:
Now your chart will always display properly whenever you load it up, and whenever you share with others.