CoolData blog

8 February 2010

How to do basic text-mining

Filed under: Annual Giving, Derived variables, Text, Text mining — kevinmacdonell @ 8:49 am

Turn prose into data for insights into your constituents' behaviour. (Photo used via a Creative Commons licence.)

Database users at universities make frequent use of free-text comment fields to store information. Too frequent use, perhaps. Normally, free text is resorted to only when there’s a need to store information of a type that cannot be conveniently coded (preferably from a pre-established “validation table” of allowed values). Unstructured information such as comments requires some work to turn it into data that can reveal patterns and correlations. This work is called text mining.

Here are steps I took to do some rather crude text-mining on a general-comments field in our database. My method was first to determine which words were used most frequently, then to select a few common ‘suggestive’ words that might show interesting correlations, and finally to test the variables I made from them for correlations with giving to our institution.

The comments I was trying to get at were generated from our Annual Giving phonathon. Often these comments flag alumni behaviours such as hanging up on the caller, being verbally abusive, or other negative things. As certain behaviours often prompt the same comments over and over (e.g. “hung up on the caller”), I thought that certain frequently-occurring keywords might be negatively correlated with giving.

The method outlined below is rather manual. As well, it focuses on single words, rather than word combinations or phrases. There are some fantastic software packages out there for going much deeper, more quickly. But giving this a try is not difficult and will at least give you a taste for the idea behind text mining.

My method was first to discover the most common words that sounded like they might convey some sense of “attitude”:

  • Using a query in Access, I extracted the text of all comments, plus comment type, from the database – including the ID of the individual. (We use Banner, so this data came from the APACOMT screen.)
  • I dumped the data into Excel, and eliminated certain unwanted comments by type code (such as event attendance, bios, media stories, etc.), leaving about 6,600 comments. (I saved this Excel file, to return to later on.)
  • I copied only the column of remaining comments, and pasted this text into a basic text editor. (I like to use EditPad Lite, but anything you have that works with big .txt files is fine.)
  • I used Find-and-replace to change all spaces into carriage returns, so that each word was on one line.
  • I used Find-and-replace again to remove common punctuation (quote marks, periods, commas, etc.).
  • I changed all uppercase characters to lowercase characters, so “The” wouldn’t be counted separately from “the”.
  • The result was a very long column of single words. I copied the whole thing, and pasted it into Data Desk, as a single variable.
  • This allowed me to create a frequency table, sorted by count so the most common words would appear at the top. More than 100,000 cases fell into a little less than 5,000 categories (i.e. words). (A short script that reproduces these steps appears just below.)
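
For anyone who would rather not do all that find-and-replace work by hand, here is a minimal Python sketch of the same steps. This is the manual process translated into code for illustration, not the exact procedure I ran; the file name is hypothetical, and it assumes the comments have been exported one per line to a plain-text file.

    import re
    from collections import Counter

    # Hypothetical export: one free-text comment per line.
    with open("comments.txt", encoding="utf-8") as f:
        text = f.read()

    # Lowercase everything, strip common punctuation, and split on whitespace:
    # the same transformations as the find-and-replace steps above.
    words = re.sub(r"[\"'.,;:!?()]", " ", text.lower()).split()

    # Frequency table, most common words first.
    for word, count in Counter(words).most_common(40):
        print(word, count)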

The most common words were, in order: to, the, a, made, and, be, mail, by, only, from, not, list, removed, nn, of, in, solicitation, he, no, phonathon, she, pledge, is, wishes, said, unhonoured, on, does, was, giving, phone, will, caller, her, donate.

I recognized some of our most common comments, including “made by-mail-only”, “made phonathon no”, “unhonoured pledge”, etc. These states are already covered by specific coding elsewhere in the database, so I skipped over these and looked farther down to some of the more subjective “mood” words, such as “hang” and “hung” (which almost always meant “hung up the phone”), “rude”, “upset”, “never”, “told”, etc.

I went back to my original Excel file of comments and created a few new columns to hold a 0/1 variable for some of these categories. This took some work in Excel, using the “Contains” text filter. So, for example, every comment that contained some variation on the theme of ‘hanging up the phone’ received a 1 in the column called “Hung up”, and all the others got a zero.
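
If you prefer code to Excel’s filters, the same flagging step can be sketched in Python with the pandas library. Again, a sketch only: the file name and column names are hypothetical, and the pattern would need tuning to your own data.

    import pandas as pd

    # Hypothetical export of comments: one row per comment.
    df = pd.read_csv("comments.csv")  # columns: ID, Comment

    # Flag variations on the theme of 'hanging up the phone'. The regex plays
    # the same role as Excel's "Contains" text filter.
    pattern = r"hung up|hang(s|ing)? up"
    df["hung_up"] = df["Comment"].str.contains(pattern, case=False, na=False).astype(int)

    # Collapse to one row per constituent: 1 if any of their comments matched.
    flags = df.groupby("ID", as_index=False)["hung_up"].max()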

From there, it was easy to copy the IDs, with the new variable(s), into Data Desk, where I matched the data up with Lifetime Giving. The idea, of course, was to discover a new predictor variable or two. For example, it seemed likely that alumni with a 1 for the variable ‘Hung Up’ might have given less than other alumni. As it turned out, though, the individual variables I created on this occasion were not particularly predictive of giving (or of failing to give).
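
In code, that matching and testing step might look like the sketch below. The file and column names are again hypothetical, and the point-biserial correlation shown is one reasonable way to test a 0/1 flag against a dollar amount, not necessarily the test I used in Data Desk.

    import pandas as pd
    from scipy import stats

    # Hypothetical inputs: the flags from the previous sketch, plus a giving
    # table with one row per constituent.
    flags = pd.read_csv("hung_up_flags.csv")      # columns: ID, hung_up (0/1)
    giving = pd.read_csv("lifetime_giving.csv")   # columns: ID, LifetimeGiving

    merged = flags.merge(giving, on="ID", how="inner")

    # Compare giving for flagged vs. unflagged alumni.
    print(merged.groupby("hung_up")["LifetimeGiving"].agg(["count", "mean", "median"]))

    # Correlation between the 0/1 flag and lifetime giving.
    r, p = stats.pointbiserialr(merged["hung_up"], merged["LifetimeGiving"])
    print(r, p)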

I certainly haven’t given up on the idea, though, because there is much room for improvement in the analysis. For one thing, I was looking for correlations with Lifetime Giving, when I should have specified Phonathon Giving. People who hang up on student callers aren’t non-donors, necessarily; they just don’t care for being contacted by phone. (Why they don’t just ask to be taken off the calling list, I’m not sure.)

In the meantime, this very basic text-mining technique DID prove very useful when I needed to compare two models I had created for our Annual Giving program. I had designed an improved model which specifically targeted phone-receptive alumni, in the hopes of reducing the number of hang-ups and other unpleasant phone encounters. I showed the effectiveness of this approach through the use of text mining, conducted exactly as outlined above. (I’ll detail the results in a future post.)

Do you have a lot of text-based comments in your database? Do you have a lot of text-based response data from (non-anonymous) surveys? Play around with mining that text and see what insights you come up with.

18 Comments

  1. Thanks for experimenting with this and posting your results. From what I can tell, the actual technology of text mining is still in its infancy, but I’m a firm believer in collecting the unstructured data now anyway, and just waiting until the mining tools catch up.

    If you’re interested, I’d love to get more data samples to toy around with (I’m doing a Master’s project in text analytics this semester), and I could run your records through the software I have access to and see what it finds. I’m running comparisons on various text mining packages, including a few freebie open-source ones that, if useful, you could adopt to make things a bit easier on yourself.

    Comment by Jen Olomon — 8 February 2010 @ 10:56 am

    • Hi Jen, and thanks for the offer. I’d have to think about that. I’m currently sitting on a huge pile of survey comments (with a word count equal to that of a short novel), and there might be an opportunity to work with that data. In the meantime, though, perhaps another reader might have data he or she could provide to you immediately?

      Comment by kevinmacdonell — 8 February 2010 @ 11:07 am

  2. Hi Kevin,
    If you are doing predictive modeling, try using the presence of a key word in contact reports, like “art”, “horse”, or “boat”. I’ve found some of these to be significant in models and to correlate with major giving.

    Comment by Jason Boley — 15 February 2010 @ 6:01 pm

    • I’d never thought of that idea. Thanks! This would be of especial value at large institutions with numerous development officers filing call reports directly into the database. (I work at a small institution and I’m the one doing the data entry.)

      Comment by kevinmacdonell — 16 February 2010 @ 11:25 am

  3. […] link: How to do basic text-mining « CoolData blog Share this […]

    Pingback by How to do basic text-mining « CoolData blog « Information Mining – Zenorg R&D — 24 March 2010 @ 9:00 am

  4. I am designing my project in PHP (an analysis tool for a discussion forum), and I have been told to use text mining to modify my project.

    Any idea on what I should do? Do the steps indicated work only for the applications you mentioned?

    I will be grateful for a positive response.

    Kalumera Christine

    Comment by Kalumera Christine — 13 May 2010 @ 5:53 am

    • I’m not sure what is meant by “modify” in your question. The steps I’ve outlined will allow you to isolate certain key words that might make useful variables in a predictive model. For your project, you may have a different goal in mind. My method is manual and time-consuming, so it may not be suitable for a programmatic approach to analysis of the comments in a discussion forum (you mention using PHP to create an analysis tool). Your first step, it seems to me, is to ask yourself what you need to do text mining FOR. Are you trying to identify trends, new hot topics, the most frequently-used terms? Are you trying to predict something? I’m not asking you these questions – just suggesting questions you must ask of yourself.

      Comment by kevinmacdonell — 13 May 2010 @ 6:04 am

  5. […] NaCTeM, Text mining — kevinmacdonell @ 11:14 am Early this year I posted a tutorial on how to do some very basic text-mining of free-text comments. The method I outlined in my post focuses on single words, rather than word […]

    Pingback by Multi-word terms in text mining « CoolData blog — 18 May 2010 @ 11:14 am

  6. Hi
    We are currently writing our master’s thesis, and are in great need of a specific program. We have no idea whether or not it actually exists, but what we are looking for is a program capable of comparing about 500 texts simultaneously in order to find differences/similarities between the texts – and able to report on these cross-references. A dream scenario would be if it could also give a percentage for the differences between the texts.
    We are aware that some diff programs are able to compare two texts – but we really need it to be able to handle more than that.

    What we want to do is compare annual reports to find out whether or not companies “borrow” text from each other when writing the annual report.
    Therefore, we need the program to take one text at a time and compare it to the others (approx. 500) to find overlap between them.
    Quite like when universities compare exam theses with old exam theses, to see whether or not there has been “copy/paste”.

    It’s preferable if the program can be installed on a Windows computer; furthermore, it would help if the program can load *.pdf files.

    Hope that you – or someone else – are able to help us.

    Comment by Kamilla — 16 February 2011 @ 1:17 pm

    • Wow, you got me there. I did some searching around but didn’t find anything that could compare more than three documents simultaneously. Without having given it a great deal of thought, here’s one thing that occurs to me: it is possible to analyze each document for key phrases and calculate frequencies for those phrases. So each document would have perhaps a dozen variables that quantify the frequencies of those phrases (or short groups of key words), which together would form a sort of fingerprint for that document. Then you could perhaps use a clustering algorithm to group documents with similar fingerprints, based on how closely related they were. Just my two cents – perhaps someone else will be able to help.
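
      To make the fingerprint idea a bit more concrete, here is a rough Python sketch, assuming the scikit-learn library is available. The file paths are hypothetical, and word triples stand in for the key phrases; this illustrates the approach, it is not a tested tool.

          from sklearn.feature_extraction.text import TfidfVectorizer
          from sklearn.metrics.pairwise import cosine_similarity

          # Hypothetical inputs: report texts already extracted from PDF to plain text.
          paths = ["report1.txt", "report2.txt", "report3.txt"]
          docs = [open(p, encoding="utf-8").read() for p in paths]

          # 'Fingerprint' each document by the weighted frequencies of its word triples.
          fingerprints = TfidfVectorizer(ngram_range=(3, 3)).fit_transform(docs)

          # Pairwise similarity matrix; values near 1 suggest heavily shared text.
          print(cosine_similarity(fingerprints))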

      Comment by kevinmacdonell — 16 February 2011 @ 2:17 pm

  7. SAS does have a product that does exactly what you are talking about. It is able to find matching sentences and parts of documents.

    Full disclosure: I am the Product Manager of the SAS Text Analytics suite.

    Feel free to contact me if you need more information.

    Comment by richard foley — 16 February 2011 @ 3:04 pm

    • Vendor comments are welcome. Thanks, Richard.

      I posted this question to the Prospect-DMM listserv and one person commented that the answer might have been included in the question: “Quite like when universities compare exam thesis with old exam thesis, to see whether or not there has been copy/paste.” The listserv member wrote, “A quick Google search turns up all kinds of software suited to this very task. I’d run those PDFs through OCR to create the comparison database, and then you’re off.”

      Comment by kevinmacdonell — 16 February 2011 @ 3:15 pm

  8. I have heard of a program that is being piloted at MIT that does this. It is not commercially available but it does exist.

    I remember reading a whole article about it but can’t find the link.

    Comment by David Broussard — 16 February 2011 @ 3:19 pm

  9. I’m in the process of starting my PhD in the area of text mining; however, I don’t have a specific topic in mind. Which area should I focus on? Text mining is a big topic. Therefore, do you have any ideas regarding the text mining area for my PhD studies?

    Comment by jee — 3 October 2011 @ 2:45 am

  10. High-frequency words, and unusually strong connections between words: those will be useful.

    I really like your simple techniques, such as this “text” mining and inferring age from name. Simplicity is beauty!

    Comment by cnukus — 26 February 2012 @ 11:51 pm

  11. Hi,
    I am doing my final-year project in the field of data mining in a cloud computing environment. As a matter of fact, I have worked on mining data sets including numerical data (student records, finding a balanced diet, hospital records) using the k-means clustering algorithm, but now I would like to add text mining as a part of my project. By text mining I mean that I have to group (cluster) web (HTML) pages, PDF files, and Word docs depending upon their similarity. For example, I can use the meta tags to find out the meaning of the text present in the HTML pages. The problem I am facing is how to extract these words from the docs, and which algorithm is best for clustering them.

    Comment by aditya singh — 12 March 2012 @ 1:37 pm

  12. I was searching for a way to get started, and voila! Here is this blog post. Thanks, Kevin! –Marianne

    Comment by Marianne Pelletier — 30 October 2012 @ 9:29 am

  13. Reblogged this on Okuduklarım.

    Comment by patolog — 25 November 2012 @ 8:20 am

