Earlier this year I posted a tutorial on how to do some very basic text-mining of free-text comments. The method I outlined in that post focuses on single words, rather than word combinations or phrases. Here’s a quick way to extract the most common multi-word terms from a batch of text. (It's followed by a postscript that offers another free and quick way to do the same thing.)
The National Centre for Text Mining (NaCTeM) bills itself as the “first publicly-funded text mining centre in the world.” The organization provides text mining services to the academic community in the United Kingdom. (NaCTeM is operated by the University of Manchester in close collaboration with the University of Tokyo.)
On their site you’ll find a free service called TerMine, which will extract terms and phrases for you. You can submit text for analysis (up to 2 MB) by pasting it into a window, uploading a file (.txt or .pdf) from your computer, or entering a URL (.html or .pdf). Then select the tagging system to use. ‘Tree Tagger version 3.1’ is best suited to the generic text you’re most likely to be analyzing; the other option, ‘GENIA Tagger version 2.1’, is more appropriate for texts from the biomedical sciences.
Click “Analyze” to get the results. Below is a sample: the first 15 common terms from the entire CoolData blog.
I don’t know what an “alumnus donor” is, but all of the rest stand up as valid terms. (Funny, I didn’t know that I say “tough job” a lot, but apparently I do.)
All this gives you is the most common terms from your entire set of comments. It doesn’t tell you which terms are linked to which individuals in your data set, or how they are correlated with your dependent variable (DV). That involves a few extra steps, which you can find in my original post on text mining. Have fun!
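If you want a feel for the "linking" step, here is a minimal sketch in Python of one way to flag which records mention which terms. It is not the method from my original post, just an illustration: the record IDs, the comments, and the function name `flag_terms` are all made up, and the matching is a crude case-insensitive substring check.

```python
def flag_terms(comments, terms):
    """Build a 0/1 indicator for each term, per record.

    comments: dict mapping a record ID to its free-text comment (hypothetical data).
    terms: list of lowercase terms or phrases to look for.
    """
    flags = {}
    for record_id, text in comments.items():
        lowered = text.lower()
        # Simple substring test; this will also match inside longer words.
        flags[record_id] = {term: int(term in lowered) for term in terms}
    return flags


# Hypothetical sample data, for illustration only.
comments = {
    "A001": "Fundraising is a tough job, but data mining helps.",
    "A002": "I enjoyed the annual fund phonathon.",
}
flags = flag_terms(comments, ["tough job", "annual fund"])
print(flags)
```

Each record ends up with a set of 0/1 indicators that you could then line up against your DV in whatever stats software you use.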
POSTSCRIPT
P.S. Another tool you can use to do the same thing as TerMine is called Primitive Word Counter. It’s a free download, and it does not require you to upload your sensitive text files to the Net. I’ve just given it a try, and it does a great job of identifying frequently used words AND whole phrases.
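If you'd rather keep everything on your own machine and don't mind a little scripting, a rough phrase count is also easy to do yourself. The Python sketch below simply counts every run of n consecutive words. To be clear, this is my own simplification and not how TerMine or Primitive Word Counter work internally: there is no part-of-speech tagging, so it will return plenty of non-terms (such as "a tough") alongside the real ones. The function name `top_phrases` and the sample text are invented for the example.

```python
import re
from collections import Counter


def top_phrases(text, n=2, k=15):
    """Return the k most frequent n-word phrases in text as (phrase, count) pairs."""
    counts = Counter()
    # Split on sentence-ending punctuation so phrases do not span sentences.
    for sentence in re.split(r"[.!?;:\n]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        # Slide a window of n words across the sentence.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(k)


# Made-up sample text, for illustration only.
sample = (
    "Fundraising is a tough job. "
    "Data mining makes a tough job easier. "
    "Data mining helps."
)
result = top_phrases(sample, n=2, k=3)
print(result)
```

Change `n` to 3 to look at three-word phrases, and raise `k` to see more of the list.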