CoolData blog

18 May 2010

Multi-word terms in text mining

Filed under: Coolness, Free stuff, Text, Text mining — kevinmacdonell @ 11:14 am

Early this year I posted a tutorial on how to do some very basic text-mining of free-text comments. The method I outlined in my post focuses on single words, rather than word combinations or phrases. Here’s a quick way to extract the most common multi-word terms from a batch of text. (Followed by a postscript which offers another free and quick way to do the same thing.)

The National Centre for Text Mining (NaCTeM) bills itself as the “first publicly-funded text mining centre in the world.” The organization provides text mining services to the academic community in the United Kingdom. (NaCTeM is operated by the University of Manchester in close collaboration with the University of Tokyo.)

On their site you’ll find a free service called TerMine, which will extract terms and phrases for you. You can submit text for analysis (up to 2 MB) by pasting it into a window, uploading a file (.txt or .pdf) from your computer, or entering a URL (.html or .pdf). Then select the type of tagging system to use. ‘Tree Tagger version 3.1’ is best suited to the generic text you’re most likely to be analyzing; the other option, ‘GENIA Tagger version 2.1’, is more appropriate for texts from the bio-medical sciences.

Click “Analyze” to get the results. As a sample, here are the first 15 most common terms from the entire CoolData blog.

I don’t know what an “alumnus donor” is, but all of the rest stand up as valid terms. (Funny, I didn’t know that I say “tough job” a lot, but apparently I do.)

All this gives you is the most common terms from your entire set of comments. It doesn’t tell you which terms are linked to which individuals in your data set, or how they are correlated with your DV. That involves a few extra steps, which you can find in my original post on text mining. Have fun!

POSTSCRIPT

P.S. — Another tool you can use to do the same thing as TerMine is called Primitive Word Counter. It’s a free download, and it does not require you to upload your sensitive text files to the Net. I’ve just given it a try, and it does a great job of identifying frequently used words AND whole phrases.
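
If you’d rather not install anything at all, a few lines of code can pull out frequent two- and three-word phrases in roughly the same spirit. The sketch below is my own rough Python equivalent, not TerMine or Primitive Word Counter, and the comments.txt file name is just an assumption about where your text lives.

    # Sketch: count the most common two- and three-word phrases in a text file.
    # This is plain n-gram counting, not TerMine's term extraction, but it is
    # often close enough for survey or phonathon comments.
    import re
    from collections import Counter

    with open("comments.txt", encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())

    phrase_counts = Counter()
    for n in (2, 3):
        for i in range(len(words) - n + 1):
            phrase_counts[" ".join(words[i:i + n])] += 1

    for phrase, count in phrase_counts.most_common(15):
        print(count, phrase)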


12 April 2010

New way to look at words

Filed under: Coolness, Data visualization, Free stuff, Text — kevinmacdonell @ 8:13 am

Word clouds aren’t new, but there’s a new online app for creating them that is worth checking out. Tagxedo gives you some versatile tools for shaping the appearance of your cloud, which you can then easily save as a .jpg or .png.

This comes to me via a post on the LoveStats blog, where Annie Pettit has posted a couple of her own creations – one based on the text of her resume, and one on all the words in her blog.

I wrote about word clouds back in December (Quick and easy visuals of large text files), and about Wordle, the well-known and very cool tool created by Jonathan Feinberg. Tagxedo does the same thing but works a little differently. Powered by Microsoft’s Silverlight browser plug-in, Tagxedo offers a nifty interface for importing your text (or URL), finely controlling your word choice, and playing with the font, colour, theme and layout of your cloud, including being able to choose a shape. The choice of shapes is rather limited – hearts, stars, rectangles and ovals, mostly. Here’s a star-shaped word cloud based on the 150 most common words on this blog:


My interest in word clouds is related to visualization of data – in this context, conveying the gist of a mass of text by giving prominence to the most common significant words. For example, last year I used Wordles to visualize tens of thousands of words entered as free-text comments in a survey of alumni. It’s no substitute for real analysis, but it does make a cool presentation slide!

NOTE: Check in tomorrow for Jason Boley’s amazing work with NodeXL for visualizing prospect connections in your data.

8 February 2010

How to do basic text-mining

Filed under: Annual Giving, Derived variables, Text, Text mining — kevinmacdonell @ 8:49 am

Turn prose into data for insights into your constituents' behaviour. (Photo used via Creative Commons licence.)

Database users at universities make frequent use of free-text comment fields to store information. Too frequent use, perhaps. Normally, free text is resorted to only when there’s a need to store information of a type that cannot be conveniently coded (preferably from a pre-established “validation table” of allowed values). Unstructured information such as comments requires some work to turn it into data that can reveal patterns and correlations. This work is called text mining.

Here are steps I took to do some rather crude text-mining on a general-comments field in our database. My method was first to determine which words were used most frequently, then select a few common ‘suggestive’ words that might show interesting correlations, and finally to test the variables I made from them for correlations with giving to our institution.

The comments I was trying to get at were generated from our Annual Giving phonathon. Often these comments flag alumni behaviours such as hanging up on the caller, being verbally abusive, and other negative things. As certain behaviours often prompt the same comments over and over (e.g. “hung up on the caller”), I thought that certain frequently-occurring keywords might be negatively correlated with giving.

The method outlined below is rather manual. As well, it focuses on single words, rather than word combinations or phrases. There are some fantastic software packages out there for going much deeper, more quickly. But giving this a try is not difficult and will at least give you a taste for the idea behind text mining.

My method was first to discover the most common words that sounded like they might convey some sense of “attitude” (a rough scripted equivalent of these steps follows the list):

  • Using a query in Access, I extracted the text of all comments, plus comment type, from the database – including the ID of the individual. (We use Banner so this data came from the APACOMT screen.)
  • I dumped the data into Excel, and eliminated certain unwanted comments by type code (such as event attendance, bios, media stories, etc.), leaving about 6,600 comments. (I saved this Excel file, to return to later on.)
  • I copied only the column of remaining comments, and pasted this text into a basic text editor. (I like to use EditPad Lite, but anything you have that works with big .txt files is fine.)
  • I used Find-and-replace to change all spaces into carriage returns, so that each word was on one line.
  • I used Find-and-replace again to remove common punctuation (quote marks, periods, commas, etc.).
  • I changed all uppercase characters to lowercase characters, so “The” wouldn’t be counted separately from “the”.
  • The result was a very long column of single words. I copied the whole thing, and pasted it into Data Desk, as a single variable.
  • This allowed me to create a frequency table, sorted by count so the most common words would appear at the top. More than 100,000 cases fell into a little less than 5,000 categories (i.e. words).
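
For anyone who would rather script these steps than do the find-and-replace work by hand, here is a minimal Python sketch of the same idea. It is not what I actually used (my workflow was a text editor plus Data Desk), and the comments.txt file name is an assumption: save the exported comment column to a plain-text file first.

    # Sketch: lowercase the text, strip punctuation, split into words,
    # and build a frequency table sorted by count -- the same end result
    # as the Data Desk frequency table described above.
    import re
    from collections import Counter

    with open("comments.txt", encoding="utf-8") as f:
        text = f.read()

    # Lowercase so "The" and "the" are counted together, then keep only
    # runs of letters/apostrophes (this drops quote marks, periods, commas, etc.)
    words = re.findall(r"[a-z']+", text.lower())

    counts = Counter(words)
    for word, count in counts.most_common(35):
        print(count, word)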

The most common words were, in order: to, the, a, made, and, be, mail, by, only, from, not, list, removed, nn, of, in, solicitation, he, no, phonathon, she, pledge, is, wishes, said, unhonoured, on, does, was, giving, phone, will, caller, her, donate.

I recognized some of our most common comments, including “made by-mail-only”, “made phonathon no”, “unhonoured pledge”, etc. These states are already covered by specific coding elsewhere in the database, so I skipped over these and looked farther down to some of the more subjective “mood” words, such as “hang” and “hung” (which almost always meant “hung up the phone”), “rude”, “upset”, “never”, “told”, etc.

I went back to my original Excel file of comments and created a few new columns to hold a 0/1 variable for some of these categories. This took some work in Excel, using the “Contains” text filter. So, for example, every comment that contained some variation on the theme of ‘hanging up the phone’ received a 1 in the column called “Hung up”, and all the others got a zero.
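
If you prefer scripting to Excel’s “Contains” filter, the same 0/1 flags can be built with pandas. This is only a sketch: the file name comments.xlsx, the column names ID and Comment, and the keyword lists are illustrative assumptions, not the actual layout of my file.

    # Sketch: build 0/1 indicator variables from keyword matches in comments.
    import pandas as pd

    df = pd.read_excel("comments.xlsx")  # assumed columns: ID, Comment

    # Each entry maps a new variable name to the keywords that should trigger it.
    keyword_flags = {
        "hung_up": ["hang", "hung"],
        "rude": ["rude"],
        "upset": ["upset"],
    }

    for flag, keywords in keyword_flags.items():
        pattern = "|".join(keywords)
        # 1 if the comment contains any keyword (case-insensitive), else 0
        df[flag] = df["Comment"].str.contains(pattern, case=False, na=False).astype(int)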

From there, it was easy to copy the IDs, with the new variable(s), into Data Desk, where I matched the data up with Lifetime Giving. The idea of course was to discover a new predictor variable or two. For example, it seemed likely that alumni with a 1 for the variable ‘Hung Up’ might have given less than other alumni. As it turned out, though, the individual variables I created on this occasion were not particularly predictive of giving (or of failing to give).
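
Matching the flags up with giving is just a merge and a comparison. Again, a hedged sketch rather than my actual Data Desk steps: the giving.xlsx file and its Lifetime_Giving column are stand-ins for whatever your gift export looks like.

    # Sketch: join a keyword flag to lifetime giving by ID and compare groups.
    import pandas as pd

    comments = pd.read_excel("comments.xlsx")  # assumed columns: ID, Comment
    giving = pd.read_excel("giving.xlsx")      # assumed columns: ID, Lifetime_Giving

    comments["hung_up"] = comments["Comment"].str.contains(
        "hang|hung", case=False, na=False).astype(int)

    # One row per ID: did this person ever generate a "hung up" comment?
    flags = comments.groupby("ID")["hung_up"].max().reset_index()

    merged = flags.merge(giving, on="ID", how="left")
    merged["Lifetime_Giving"] = merged["Lifetime_Giving"].fillna(0)

    # Mean lifetime giving for each group, plus a simple correlation
    print(merged.groupby("hung_up")["Lifetime_Giving"].mean())
    print(merged["hung_up"].corr(merged["Lifetime_Giving"]))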

I certainly haven’t given up on the idea, though, because there is much room for improvement in the analysis. For one thing, I was looking for correlations with Lifetime Giving, when I should have specified Phonathon Giving. People who hang up on student callers aren’t non-donors, necessarily; they just don’t care for being contacted by phone. (Why they don’t just ask to be taken off the calling list, I’m not sure.)

In the meantime, this very basic text-mining technique DID prove very useful when I needed to compare two models I had created for our Annual Giving program. I had designed an improved model which specifically targeted phone-receptive alumni, in the hopes of reducing the number of hang-ups and other unpleasant phone encounters. I showed the effectiveness of this approach through the use of text mining, conducted exactly as outlined above. (I’ll detail the results in a future post.)

Do you have a lot of text-based comments in your database? Do you have a lot of text-based response data from (non-anonymous) surveys? Play around with mining that text and see what insights you come up with.

9 December 2009

Quick and easy visuals of large text files

Filed under: Data visualization, Free stuff, Text — Tags: , , , , — kevinmacdonell @ 7:30 pm

Earlier this year we conducted an extensive survey of alumni, made up mostly of scale statements but including a few free-text comment fields as well. Respondents typed in nearly 80,000 words in comments – that’s slightly longer than the first Harry Potter book!

Somebody has to read all this stuff (not me!). But what can we do with it in the meantime?

Why not play with it in Wordle?

According to the web site (www.wordle.net), Wordle is “a toy for generating ‘word clouds’ from text that you provide. The clouds give greater prominence to words that appear more frequently in the source text. You can tweak your clouds with different fonts, layouts, and color schemes. The images you create with Wordle are yours to use however you like.”
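
If you ever want to generate a similar cloud locally instead of pasting text into a web page, the third-party Python package wordcloud does much the same thing. This is a generic sketch with assumed file names, not how Wordle itself works.

    # Sketch: build a word cloud image from a file of survey comments.
    # Requires "pip install wordcloud"; comments.txt and cloud.png are
    # assumed file names.
    from wordcloud import WordCloud

    with open("comments.txt", encoding="utf-8") as f:
        text = f.read()

    # More frequent words are drawn larger, much like Wordle
    cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
    cloud.to_file("cloud.png")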

Here is a word cloud for the free-text comments made in response to the question, “Do you have any other comments about your academic experience?”

You can also enter the URL of any blog or feed into Wordle, and it will generate a word cloud from that. Here’s what a Wordle of this blog looks like (so far).

Word cloud for CoolData blog (up to 9 Dec 2009).

Useful or just a toy? I did use some of these word clouds in a presentation of the alumni survey results, and the response told me it was worth it. It’s a cool thing, and people like cool things. I also see Wordle creations in newspapers – I think the first example I ever saw was a comparison of campaign speeches made by Barack Obama and John McCain.

What do you use word clouds for?
