CoolData blog

18 May 2010

Multi-word terms in text mining

Filed under: Coolness, Free stuff, Text, Text mining — kevinmacdonell @ 11:14 am

Early this year I posted a tutorial on how to do some very basic text-mining of free-text comments. The method I outlined in my post focuses on single words, rather than word combinations or phrases. Here’s a quick way to extract the most common multi-word terms from a batch of text. (Followed by a postscript which offers another free and quick way to do the same thing.)
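If you’d rather keep everything on your own machine, here’s a rough do-it-yourself sketch in Python. It just counts raw two- and three-word phrase frequencies, with no part-of-speech filtering, so it’s much cruder than the tools described below, and the file name comments.txt is only a placeholder for wherever your free-text comments live.

```python
import re
from collections import Counter
from itertools import islice

def ngrams(tokens, n):
    """Yield consecutive n-word tuples from a list of tokens."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

# Read the free-text comments (file name is a placeholder).
with open("comments.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Crude tokenization: keep runs of letters and apostrophes.
tokens = re.findall(r"[a-z']+", text)

# Count every two- and three-word phrase.
counts = Counter()
for n in (2, 3):
    counts.update(" ".join(gram) for gram in ngrams(tokens, n))

# Print the 15 most frequent phrases.
for phrase, freq in counts.most_common(15):
    print(f"{freq:4d}  {phrase}")
```

Expect a lot of filler phrases (“of the”, “to be”) near the top of the list; the dedicated tools below do a much better job of screening those out.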

The National Centre for Text Mining (NaCTeM) bills itself as the “first publicly-funded text mining centre in the world.” The organization provides text mining services to the academic community in the United Kingdom. (NaCTeM is operated by the University of Manchester in close collaboration with the University of Tokyo.)

On their site you’ll find a free service called TerMine, which will extract terms and phrases for you. You can submit text for analysis (up to 2 MB) by pasting it into a window, uploading a file (.txt or .pdf) from your computer, or entering a URL (.html or .pdf). Then select the tagging system to use: ‘Tree Tagger version 3.1’ is better suited to the generic text you’re most likely to be analyzing, while the other option, ‘GENIA Tagger version 2.1’, is more appropriate for texts from the bio-medical sciences.

Click “Analyze” to get the results. Below is a sample: the 15 most common terms from the entire CoolData blog.

I don’t know what an “alumnus donor” is, but all of the rest stand up as valid terms. (Funny, I didn’t know that I say “tough job” a lot, but apparently I do.)

All this gives you is the most common terms from your entire set of comments. It doesn’t tell you which terms are linked to which individuals in your data set, or how they are correlated with your dependent variable (DV). That involves a few extra steps, which you can find in my original post on text mining. Have fun!
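For what it’s worth, here is a rough sketch of that linking step in Python with pandas. This is illustration only, not the method from the original post: the IDs, comments, and terms below are invented, and you would substitute your own data set and whatever terms TerMine (or another tool) surfaced.

```python
import pandas as pd

# Hypothetical data set: one free-text comment per constituent ID.
df = pd.DataFrame({
    "id": [1001, 1002, 1003],
    "comment": [
        "Great alumni event, would donate again",
        "That phone campaign was a tough job",
        "No comment",
    ],
})

# Multi-word terms identified earlier (by TerMine or otherwise); examples only.
terms = ["alumni event", "tough job"]

# One 0/1 indicator column per term: does the comment contain the phrase?
for term in terms:
    col = "has_" + term.replace(" ", "_")
    df[col] = df["comment"].str.contains(term, case=False, regex=False).astype(int)

print(df)
```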

POSTSCRIPT

P.S. — Another tool you can use to do the same thing as TerMine is called Primitive Word Counter. It’s a free download, and it does not require you to upload your sensitive text files to the Net. I’ve just given it a try, and it does a great job of identifying frequently used words AND whole phrases.


2 Comments

  1. This looks very cool. I don’t know how you could use it on a practical data set, though, given the privacy concerns of uploading information from something like contact reports to a third party.

    Comment by Jason — 3 June 2010 @ 11:00 am

    • Good point. Here are some thoughts on that. First, no personally identifying information need be transmitted – no names, IDs, etc. This exercise simply isolates multiple-word terms and counts their frequency. That’s the easy part – after that you’d have to search for those terms in your data set (which WILL include the ID) and create indicator variables for whoever said, wrote, or is associated with those terms. Excel and its “contains” filter are good for this, although it’s pretty manual.

      Comment by kevinmacdonell — 3 June 2010 @ 11:28 am

