CoolData blog

18 May 2010

Multi-word terms in text mining

Filed under: Coolness, Free stuff, Text, Text mining — kevinmacdonell @ 11:14 am

Early this year I posted a tutorial on how to do some very basic text-mining of free-text comments. The method I outlined in my post focuses on single words, rather than word combinations or phrases. Here’s a quick way to extract the most common multi-word terms from a batch of text. (Followed by a postscript which offers another free and quick way to do the same thing.)

The National Centre for Text Mining (NaCTeM) bills itself as the “first publicly-funded text mining centre in the world.” The organization provides text mining services to the academic community in the United Kingdom. (NaCTeM is operated by the University of Manchester in close collaboration with the University of Tokyo.)

On their site you’ll find a free service called TerMine, which will extract terms and phrases for you. You can submit text for analysis (up to 2 MB) by pasting it into a window, uploading a file (.txt or .pdf) from your computer, or entering a URL (.html or .pdf). Then select the type of tagging system to use. ‘Tree Tagger version 3.1’ is most suited to the generic text you’re most likely to be analyzing; the other option, ‘GENIA Tagger version 2.1’, is more appropriate for texts from the bio-medical sciences.

Click “Analyze” to get the results. Below is a sample: the first 15 most common terms from the entire CoolData blog.

I don’t know what an “alumnus donor” is, but all of the rest stand up as valid terms. (Funny, I didn’t know that I say “tough job” a lot, but apparently I do.)

All this gives you is the most common terms from your entire set of comments. It doesn’t tell you which terms are linked to which individuals in your data set, or how they are correlated with your DV. That involves a few extra steps, which you can find in my original post on text mining. Have fun!


P.S. — Another tool you can use to do the same thing as TerMine is called Primitive Word Counter. It’s a free download, and does not require you to upload your sensitive text files to the Net. I’ve just given it a try, and it does a great job of identifying frequently-used words AND whole phrases.



9 February 2010

Preventing hangups and rudeness in your Phonathon program

Filed under: Annual Giving, Model building, Text mining — kevinmacdonell @ 12:54 pm


In 2007 and 2008 we used predictive models in Annual Giving to segment the entire alumni population into deciles according to propensity to give. Both years, our annual giving coordinator noticed that alumni in the highest deciles (9 and 10) seemed to hang up on callers with unexpected frequency.

An analysis of the cases bore her observation out. Hang-ups, rudeness and other “red flags” are recorded in our database as text comments, rather than validated codes. Therefore, a little text mining was required to identify the IDs of alumni who exhibited these behaviours.

(In a previous post, I described a very manual yet simple method of extracting potential predictor variables from the kind of free-form text found in database comment fields or survey responses. Today, I’m using text-mined variables not for predicting giving, but for comparing two whole models with each other. More on that in a bit.)

When I mined the comments, I discovered that fully half of the people who hung up had a score of 6 or higher. The model was failing to weed out people who were not receptive to phone solicitation. Of course, our higher scorers were giving more than lower scorers overall – but could we do better?

The answer was yes.

The models created in 2007 and 2008 were aimed at predicting giving at any level (from annual giving to major giving), via any channel (phone, mail, etc.), and based on past giving made at any time (i.e., lifetime giving rather than recent giving).

In short, these were very general models, not Annual Giving models. Our high-scoring hanger-uppers were donors: Many of them gave quite generously, in fact. They just didn’t give via the calling program. Most gave on their own, or in response to a mail solicitation. (For whatever reason, they had not been added to our do-not-call list, so they continued to receive unwanted calls.)

They did deserve to be high scorers – but not for the calling program.

In 2009 I took a different approach to defining the predicted value (a.k.a. dependent variable):

  • Instead of predicting for any type of giving, I narrowed our focus to gifts made to Annual Giving.
  • Instead of gifts via any type of solicitation in Annual Giving, I counted only donations made in response to a phone call.
  • Instead of using Lifetime Giving as our predicted value, I limited it to the past six fiscal years of giving.
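The narrowed dependent variable amounts to a filter-and-sum over the gift history. Here’s a minimal sketch of that idea; the record layout, field names (`fund`, `channel`, `fiscal_year`), and sample figures are all invented for illustration, not taken from our database:

```python
# Hypothetical gift records; field names and values are illustrative only.
gifts = [
    {"id": 1, "fund": "annual", "channel": "phone", "fiscal_year": 2008, "amount": 50},
    {"id": 1, "fund": "annual", "channel": "mail",  "fiscal_year": 2001, "amount": 25},
    {"id": 2, "fund": "major",  "channel": "visit", "fiscal_year": 2007, "amount": 5000},
]

def phonathon_dv(gifts, current_fy=2009, window=6):
    """Sum only Annual Giving gifts made by phone in the last `window` fiscal years."""
    totals = {}
    for g in gifts:
        if (g["fund"] == "annual"
                and g["channel"] == "phone"
                and g["fiscal_year"] > current_fy - window):
            totals[g["id"]] = totals.get(g["id"], 0) + g["amount"]
    return totals

print(phonathon_dv(gifts))  # only ID 1's recent phone gift qualifies
```

Each of the three restrictions in the list above corresponds to one condition in the filter.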

How did our hanger-uppers score now, with the new model? The results were dramatically different. For testing the improvement, I had two text-mined indicator variables to work with, one for all IDs that had ever hung up on a caller, and another for anyone who had ever been rude to a caller. Neither variable had been used as a predictor in my models, so they were perfect for conducting an independent test of the new model’s ability to target the right people.

To compare the two old models with the new one, I simply looked at how the alumni responsible for unpleasant encounters were distributed by score decile.

Have a look at how these two charts compare. The one labeled ‘Old decile’ shows how ‘hanger-uppers’ scored in the older model (2008). As I said earlier, a lot of them were high scorers. (I’m not saying how many – notice I’ve removed the Y axis scale – I want to show you the distribution, not the actual numbers. The vertical scale differs from one chart to the other.)

The chart at right shows the same people, as they were scored in the new, phonathon-specific model (2009). In the new model, only 34% of hanger-uppers score 6 or higher – compared with 50% in the old model. As well, almost a third of them are clustered in the very lowest decile. Not perfect, but a big improvement.
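This kind of comparison is just a frequency distribution of the flagged IDs over score deciles. A sketch, using made-up decile scores rather than our actual data:

```python
from collections import Counter

# Hypothetical data: model score decile (1-10) for each ID flagged as a hang-up.
hangup_deciles = [1, 1, 1, 3, 4, 5, 6, 7, 9, 10]

dist = Counter(hangup_deciles)  # how many hang-ups fall in each decile
n = len(hangup_deciles)
share_high = sum(count for decile, count in dist.items() if decile >= 6) / n
print(f"{share_high:.0%} of hang-ups score in deciles 6-10")
```

A better model pushes that high-decile share down and piles the flagged IDs into the bottom deciles.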

Now, how about “rudeness”? Here are two more charts, same idea: the breakdown for the old model is on the left, and the one for the new model is on the right. Again, the (hidden) vertical scale is different: if they were shown on the same scale, the bar for the first decile in the chart on the left would actually be half the height of the bar for the first decile in the chart on the right.

In the old model, people who were difficult on the phone were as likely to score high as score low. In the new model, however, they tend to be very low scorers. Again, a lot of them are lumped together in the lowest decile.

Remember: Neither of these variables was used as a predictor in the new model!

I don’t see myself ever going back to creating a model that isn’t specific to the task at hand, whether it’s Phonathon, event-attendance likelihood, Planned Giving potential, or what have you. For Phonathon, getting smarter and more targeted means that fewer donors who are averse to being contacted by phone will be called, with the result that student callers will experience fewer unpleasant encounters and have a better experience on the job. It just makes sense.

8 February 2010

How to do basic text-mining

Filed under: Annual Giving, Derived variables, Text, Text mining — kevinmacdonell @ 8:49 am

Turn prose into data for insights into your constituents' behaviour.

Database users at universities make frequent use of free-text comment fields to store information. Too frequent use, perhaps. Normally, free text is resorted to only when there’s a need to store information of a type that cannot be conveniently coded (preferably from a pre-established “validation table” of allowed values). Unstructured information such as comments requires some work to turn it into data that can reveal patterns and correlations. This work is called text mining.

Here are steps I took to do some rather crude text-mining on a general-comments field in our database. My method was first to determine which words were used most frequently, then select a few common ‘suggestive’ words that might show interesting correlations, and finally to test the variables I made from them for correlations with giving to our institution.

The comments I was trying to get at were generated from our Annual Giving phonathon. Often these comments flag alumni behaviours such as hanging up on the caller, being verbally abusive, or other negative things. As certain behaviours often prompt the same comments over and over (e.g. “hung up on the caller”), I thought that certain frequently-occurring keywords might be negatively correlated with giving.

The method outlined below is rather manual. As well, it focuses on single words, rather than word combinations or phrases. There are some fantastic software packages out there for going much deeper, more quickly. But giving this a try is not difficult and will at least give you a taste for the idea behind text mining.

My method was first to discover the most common words that sounded like they might convey some sense of “attitude”:

  • Using a query in Access, I extracted the text of all comments, plus comment type, from the database – including the ID of the individual. (We use Banner so this data came from the APACOMT screen.)
  • I dumped the data into Excel, and eliminated certain unwanted comments by type code (such as event attendance, bios, media stories, etc.), leaving about 6,600 comments. (I saved this Excel file, to return to later on.)
  • I copied only the column of remaining comments, and pasted this text into a basic text editor. (I like to use EditPad Lite, but anything you have that works with big .txt files is fine.)
  • I used Find-and-replace to change all spaces into carriage returns, so that each word was on one line.
  • I used Find-and-replace again to remove common punctuation (quote marks, periods, commas, etc.)
  • I changed all uppercase characters to lowercase characters, so “The” wouldn’t be counted separately from “the”.
  • The result was a very long column of single words. I copied the whole thing, and pasted it into Data Desk, as a single variable.
  • This allowed me to create a frequency table, sorted by count so the most common words would appear at the top. More than 100,000 cases fell into a little less than 5,000 categories (i.e. words).
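The manual Excel-and-text-editor steps above boil down to a tokenize-and-count pipeline, which a few lines of code can reproduce. A sketch in Python, with invented comment text standing in for real database comments:

```python
import re
from collections import Counter

# Invented stand-ins for the free-text comments pulled from the database.
comments = [
    "Hung up on the caller.",
    "Made by-mail-only; do not phone.",
    "Was rude to the caller, hung up.",
]

words = []
for comment in comments:
    # Lowercase, strip common punctuation (keeping hyphens), split on
    # whitespace -- the same steps as the find-and-replace passes above.
    cleaned = re.sub(r"[^\w\s-]", " ", comment.lower())
    words.extend(cleaned.split())

# The frequency table, most common words first.
for word, count in Counter(words).most_common(5):
    print(word, count)
```

The sorted `Counter` plays the role of the Data Desk frequency table: the most common words float to the top, ready to be scanned for “mood” words.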

The most common words were, in order: to, the, a, made, and, be, mail, by, only, from, not, list, removed, nn, of, in, solicitation, he, no, phonathon, she, pledge, is, wishes, said, unhonoured, on, does, was, giving, phone, will, caller, her, donate.

I recognized some of our most common comments, including “made by-mail-only”, “made phonathon no”, “unhonoured pledge”, etc. These states are already covered by specific coding elsewhere in the database, so I skipped over these and looked farther down to some of the more subjective “mood” words, such as “hang” and “hung” (which almost always meant “hung up the phone”), “rude”, “upset”, “never”, “told”, etc.

I went back to my original Excel file of comments and created a few new columns to hold a 0/1 variable for some of these categories. This took some work in Excel, using the “Contains” text filter. So, for example, every comment that contained some variation on the theme of ‘hanging up the phone’ received a 1 in the column called “Hung up”, and all the others got a zero.

From there, it was easy to copy the IDs, with the new variable(s), into Data Desk, where I matched the data up with Lifetime Giving. The idea of course was to discover a new predictor variable or two. For example, it seemed likely that alumni with a 1 for the variable ‘Hung Up’ might have given less than other alumni. As it turned out, though, the individual variables I created on this occasion were not particularly predictive of giving (or of failing to give).
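The flag-then-compare step can be sketched the same way. This is only an illustration of the idea, with hypothetical records and giving figures; the keyword list and field names are mine, not our database’s:

```python
import statistics

# Hypothetical: one record per alum, with comment text and lifetime giving.
alumni = [
    {"id": 101, "comment": "hung up on student caller", "lifetime_giving": 0},
    {"id": 102, "comment": "prefers mail solicitation", "lifetime_giving": 250},
    {"id": 103, "comment": "very rude, hung up",        "lifetime_giving": 40},
    {"id": 104, "comment": "",                          "lifetime_giving": 500},
]

# 0/1 indicator, like the Excel "Contains" filter: flag any variation
# on the theme of hanging up the phone.
for a in alumni:
    a["hung_up"] = int(any(w in a["comment"].lower() for w in ("hung", "hang")))

flagged = [a["lifetime_giving"] for a in alumni if a["hung_up"]]
others = [a["lifetime_giving"] for a in alumni if not a["hung_up"]]
print(statistics.mean(flagged), statistics.mean(others))
```

Comparing mean giving between the flagged group and everyone else is the crudest possible test of the variable; in practice you would carry the 0/1 indicator into your stats package alongside the rest of your predictors.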

I certainly haven’t given up on the idea, though, because there is much room for improvement in the analysis. For one thing, I was looking for correlations with Lifetime Giving, when I should have specified Phonathon Giving. People who hang up on student callers aren’t non-donors, necessarily; they just don’t care for being contacted by phone. (Why they don’t just ask to be taken off the calling list, I’m not sure.)

In the meantime, this very basic text-mining technique DID prove very useful when I needed to compare two models I had created for our Annual Giving program. I had designed an improved model which specifically targeted phone-receptive alumni, in the hopes of reducing the number of hang-ups and other unpleasant phone encounters. I showed the effectiveness of this approach through the use of text mining, conducted exactly as outlined above. (I’ll detail the results in a future post.)

Do you have a lot of text-based comments in your database? Do you have a lot of text-based response data from (non-anonymous) surveys? Play around with mining that text and see what insights you come up with.
