CoolData blog

7 January 2015

New finds in old models

 

When you build a predictive model, you can never be sure it’s any good until it’s too late. Deploying a mediocre model isn’t the worst mistake you can make, though. The worst mistake would be to build a second mediocre model because you haven’t learned anything from the failure of the first.

 

Performance against a holdout data set for validation is not a reliable indicator of actual performance after deployment. Validation may help you decide which of two or more competing models to use, or it may provide reassurance that your one model isn’t total junk. It’s not proof of anything, though. Those lovely predictors, highly correlated with the outcome, could be fooling you. There are no guarantees they’re predictive of results over the year to come.

 

In the end, the only real evidence of a model’s worth is how it performs on real results. The problem is, those results happen in the future. So what is one to do?

 

I’ve long been fascinated with Planned Giving likelihood. Making a bequest seems like the ultimate gesture of institutional affinity (ultimate in every sense). On the plus side, that kind of affinity ought to be clearly evidenced in behaviours such as event attendance, giving, volunteering and so on. On the negative side, Planned Giving interest is uncommon enough that comparing expectancies with non-expectancies will sometimes lead to false predictors based on sparse data. For this reason, my goal of building a reliable model for predicting Planned Giving likelihood has been elusive.

 

Given that a validation data set taken from the same time period as the training data can produce misleading correlations, I wondered whether I could do one better: That is, be able to draw my holdout sample not from data of the same time period as that used to build the model, but from the future.

 

As it turned out, yes, I could.

 

Every year I save my regression analyses as Data Desk files. Although I assess the performance of the output scores, I don’t often go back to the model files themselves. However, they’re there as a document of how I approached modelling problems in the past. As a side benefit, each file is also a snapshot of the alumni population at that point in time. These data sets may consist of a hundred or more candidate predictor variables — a well-rounded picture.

 

My thinking went like this: Every old model file represents data from the past. If I pretend that this snapshot is really the present, then in order to have access to knowledge of the future, all I have to do is look at today’s data stored in the database.

 

For example, for this blog post, I reached back two years to a model I created in Data Desk for predicting likelihood to upgrade to the Leadership level in Annual Giving. I wasn’t interested in the model itself. Rather, I wanted to examine the underlying variables I had to work with at the time. This model had been an ambitious undertaking, with some 170 variables prepared for analysis. Many of course were transformations of variables or combinations of interacting variables. Among all those variables was one indicating whether a case was a current Planned Giving expectancy or not, at that point in time.

 

In this snapshot of the database from two years ago, some of the cases that were not expectancies would have become so since then. In other words, I now had the best of both worlds. I had a comprehensive set of potential predictors as they existed two years ago, AND access to the hitherto unknowable future: The identities of the people who had become expectancies after the predictors had been frozen in time.

 

As I said, my old model was not intended to predict Planned Giving inclination. So I built a new model, using “Is an Expectancy” (0/1) as the target variable. I trained the regression model on the two-year-old expectancy data — I didn’t even look at the new expectancies while building the model. No: I used those new expectancies as my validation data set.

 

“Validation” might be too strong a word, given that there were only 80 or so new cases. That’s a lot of bequest intentions, for sure, but in terms of data it’s a drop in the bucket compared with the number of cases being scored. Let’s call it a test data set. I used this test set to help me analyze the model, in a couple of ways.

 

First I looked at how new expectancies were scored by the model I had just built. The chart below shows their distribution by score decile. Slightly more than 50% of new expectancies were in the top decile. This looks pretty good — keeping in mind that this is what actual performance would have looked like had I really built this model two years ago (which I could have):

 

 

new_expec

(Even better, looking at percentiles, most of the expectancies in that top 10% are concentrated nicely in the top few percentiles.)

 

But I didn’t stop there. It is also evident that almost half of new expectancies fell outside the top 10 percent of scores, so clearly there was room for improvement. My next step was to examine the individual predictors I had used in the model. These were of course the predictors most highly correlated with being an expectancy. They were roughly the following:
  • Year person’s personal information in the database was last updated
  • Number of events attended
  • Age
  • Year of first gift
  • Number of alumni activities
  • Indicated “likely to donate” on 2009 alumni survey
  • Total giving in last five years (log transformed)
  • Combined length of name Prefix + Suffix

 

I ranked the correlation of each of these with the 0/1 indicator meaning “new expectancy,” and found that most of the predictors were still fine, although they changed their order in the rank correlation. Donor likelihood (from survey) and recent giving were more important, and alumni activities and how recently a person’s record was updated were less important.

 

This was interesting and useful, but what was even more useful was looking at the correlations between ALL potential predictors and the state of being a new expectancy. A number of predictors that would have been too far down the ranked list to consider using two years ago were suddenly looking much better. In particular, many variables related to participation in alumni surveys bubbled closer to the top as potentially significant.

 

This exercise suggests a way to proceed with iterative, yearly improvements to some of your standard models:
  • Dig up an old model from a year or more ago.
  • Query the database for new cases that represent the target variable, and merge them with the old datafile.
  • Assess how your model performed or, if you created more than one model, see which model would have performed best. (You should be doing this anyway.)
  • Go a layer deeper, by studying the variables that went into those models — the data “as it was” — to see which variables had correlations that tricked you into believing they were predictive, and which variables truly held predictive power but may have been overlooked.
  • Apply what you learn to the next iteration of the model. Leave out the variables with spurious correlations, and give special consideration to variables that may have been underestimated before.
Advertisement

6 October 2014

Don’t worry, just do it

2014-10-03 09.45.37People trying to learn how to do predictive modelling on the job often need only one thing to get them to the next stage: Some reassurance that what they are doing is valid.

Peter Wylie and I are each just back home, having presented at the fall conference of the Illinois chapter of the Association of Professional Researchers for Advancement (APRA-IL), hosted at Loyola University Chicago. (See photos, below!) Following an entertaining and fascinating look at the current and future state of predictive analytics presented by Josh Birkholz of Bentz Whaley Flessner, Peter and I gave a live demo of working with real data in Data Desk, with the assistance of Rush University Medical Center. We also drew names to give away a few copies of our book, Score! Data-Driven Success for Your Advancement Team.

We were impressed by the variety and quality of questions from attendees, in particular those having to do with stumbling blocks and barriers to progress. It was nice to be able to reassure people that when it comes to predictive modelling, some things aren’t worth worrying about.

Messy data, for example. Some databases, particularly those maintained by non higher ed nonprofits, have data integrity issues such as duplicate records. It would be a shame, we said, if data analysis were pushed to the back burner just because of a lack of purity in the data. Yes, work on improving data integrity — but don’t assume that you cannot derive valuable insights right now from your messy data.

And then the practice of predictive modelling itself … Oh, there is so much advice out there on the net, some of it highly technical and involving a hundred different advanced techniques. Anyone trying to learn on their own can get stymied, endlessly questioning whether what they’re doing is okay.

For them, our advice was this: In our field, you create value by ranking constituents according to their likelihood to engage in a behaviour of interest (giving, usually), which guides the spending of scarce resources where they will do the most good. You can accomplish this without the use of complex algorithms or arcane math. In fact, simpler models are often better models.

The workhorse tool for this task is multiple linear regression. A very good stand-in for regression is building a simple score using the techniques outlined in Peter’s book, Data Mining for Fundraisers. Sticking to the basics will work very well. Fussing with technical issues or striving for a high degree of accuracy are distractions that the beginner need not be overly concerned with.

If your shop’s current practice is to pick prospects or other targets by throwing darts, then even the crudest model will be an improvement. In many situations, simply performing better than random will be enough to create value. The bottom line: Just do it. Worry about perfection some other day.

If the decisions are high-stakes, if the model will be relied on to guide the deployment of scarce resources, then insert another step in the process. Go ahead and build the model, but don’t use it. Allow enough time of “business as usual” to elapse. Then, gather fresh examples of people who converted to donors, agreed to a bequest, or made a large gift — whatever the behaviour is you’ve tried to predict — and check their scores:

  • If the chart shows these new stars clustered toward the high end of scores, wonderful. You can go ahead and start using the model.
  • If the result is mixed and sort of random-looking, then examine where it failed. Reexamine each predictor you used in the model. Is the historical data in the predictor correlated with the new behaviour? If it isn’t, then the correlation you observed while building the model may have been spurious and led you astray, and should be excluded. As well, think hard about whether the outcome variable in your model is properly defined: That is, are you targeting for the right behaviour? If you are trying to find good prospects for Planned Giving, for example, your outcome variable should focus on that, and not lifetime giving.

“Don’t worry, just do it” sounds like motivational advice, but it’s more than that. The fact is, there is only so much model validation you can do at the time you create the model. Sure, you can hold out a generous number of cases as a validation sample to test your scores with. But experience will show you that your scores will always pass the validation test just fine — and yet the model may still be worthless.

A holdout sample of data that is contemporaneous with that used to train the model is not the same as real results in the future. A better way to go might be to just use all your data to train the model (no holdout sample), which will result in a better model anyway, especially if you’re trying to predict something relatively uncommon like Planned Giving potential. Then, sit tight and observe how it does in production, or how it would have done in production if it had been deployed.

  1. Observe, learn, tweak, and repeat. Errors are hard to avoid, but they can be discovered.
  2. Trust the process, but verify the results. What you’re doing is probably fine. If it isn’t, you’ll get a chance to find out.
  3. Don’t sweat the small stuff. Make a difference now by sticking to basics and thinking of the big picture. You can continue to delve and explore technical refinements and new methods, if that’s where your interest and aptitude take you. Data analysis and predictive modelling are huge subjects — start where you are, where you can make a difference.

* A heartfelt thank you to APRA-IL and all who made our visit such a pleasure, especially Sabine Schuller (The Rotary Foundation), Katie Ingrao and Viviana Ramirez (Rush University Medical Center), Leigh Peterson Visaya (Loyola University Chicago), Beth Witherspoon (Elmhurst College), and Rodney P. Young, Jr. (DePaul University), who took the photos you see below. (See also: APRA IL Fall Conference Datapalooza.)

Click on any of these for a full-size image.

DSC_0017 DSC_0018 DSC_0026 DSC_0051 DSC_0054 DSC_0060 DSC_0066 DSC_0075 DSC_0076 DSC_0091

22 September 2014

What predictor variables should you avoid? Depends on who you ask

People who build predictive models will tell you that there are certain variables you should avoid using as predictors. I am one of those people. However, we disagree on WHICH variables one should avoid, and increasingly this conflicting advice is confusing those trying to learn predictive modeling.

The differences involve two points in particular. Assuming charitable giving is the behaviour we’re modelling for, those two things are:

  1. Whether we should use past giving to predict future giving, and
  2. Whether attributes such as marital status are really predictors of giving.

I will offer my opinions on both points. Note that they are opinions, not definitive answers.

1. Past giving as a predictor

I have always stressed that if you are trying to predict “giving” using a multiple linear regression model, you must avoid using “giving” as a predictor among your independent variables. That includes anything that is a proxy for “giving,” such as attendance at a donor-thanking event. This is how I’ve been taught and that is what I’ve adhered to in practice.

Examples that violate this practice keep popping up, however. I have an email from Atsuko Umeki, IT Coordinator in the Development Office of the University of Victoria in Victoria, British Columbia*. She poses this question about a post I wrote in July 2013:

“In this post you said, ‘In predictive models, giving and variables related to the activity of giving are usually excluded as variables (if ‘giving’ is what we are trying to predict). Using any aspect of the target variable as an input is bad practice in predictive modelling and is carefully avoided.’  However, in many articles and classes I read and took I was advised or instructed to include past giving history such as RFA*, Average gift, Past 3 or 5 year total giving, last gift etc. Theoretically I understand what you say because past giving is related to the target variable (giving likelihood); therefore, it will be biased. But in practice most practitioners include past giving as variables and especially RFA seems to be a good variable to include.”

(* RFA is a variation of the more familiar RFM score, based on giving history — Recency, Frequency, and Monetary value.)

So modellers-in-training are being told to go ahead and use ‘giving’ to predict ‘giving’, but that’s not all: Certain analytics vendors also routinely include variables based on past giving as predictors of future giving. Not long ago I sat in on a webinar hosted by a consultant, which referenced the work of one well-known analytics vendor (no need to name the vendor here) in which it seemed that giving behaviour was present on both sides of the regression equation. Not surprisingly, this vendor “achieved” a fantastic R-squared value of 86%. (Fantastic as in “like a fantasy,” perhaps?)

This is not as arcane or technical as it sounds. When you use giving to predict giving, you are essentially saying, “The people who will make big gifts in the future are the ones who have made big gifts in the past.” This is actually true! The thing is, you don’t need a predictive model to produce such a prospect list; all you need is a list of your top donors.

Now, this might be reassuring to whomever is paying a vendor big bucks to create the model. That person sees names they recognize, and they think, ah, good — we are not too far off the mark. And if you’re trying to convince your boss of the value of predictive modelling, he or she might like to see the upper ranks filled with familiar names.

I don’t find any of that “reassuring.” I find it a waste of time and effort — a fancy and expensive way to produce a list of the usual suspects.

If you want to know who has given you a lot of money, you make a list of everyone in your database and sort it in descending order by total amount given. If you want to predict who in your database is most likely to give you a lot of money in the future, build a predictive model using predictors that are associated with having given large amounts of money. Here is the key point … if you include “predictors” that mean the same thing as “has given a lot of money,” then the result of your model is not going to look like a list of future givers — it’s going to look more like your historical list of past givers.

Does that mean you should ignore giving history? No! Ideally you’d like to identify the donors who have made four-figure gifts who really have the capacity and affinity to make six-figure gifts. You won’t find them using past giving as a predictor, because your model will be blinded by the stars. The variables that represent giving history will cause all other affinity-related variables to pale in comparison. Many will be rejected from the model for being not significant or for adding nothing additional to the model’s ability to explain the variance in the outcome variable.

To sum up, here are the two big problems with using past giving to predict future giving:

  1. The resulting insights are sensible but not very interesting: People who gave before tend to give again. Or, stated another way: “Donors will be donors.” Fundraisers don’t need data scientists to tell them that.
  2. Giving-related independent variables will be so highly correlated with giving-related dependent variables that they will eclipse more subtle affinity-related variables. Weaker predictors will end up getting kicked out of our regression analysis because they can’t move the needle on R-squared, or because they don’t register as significant. Yet, it’s these weaker variables that we need in order to identify new prospects.

Let’s try a thought experiment. What if I told you that I had a secret predictor that, once introduced into a regression analysis, could explain 100% of the variance in the dependent variable ‘Lifetime Giving’? That’s right — the highest value for R-squared possible, all with a single predictor. Would you pay me a lot of money for that? What is this magic variable that perfectly models the variance in ‘Lifetime Giving’? Why, it is none other than ‘Lifetime Giving’ itself! Any variable is perfectly correlated with itself, so why look any farther?

This is an extreme example. In a real predictive model, a predictor based on giving history would be restricted to giving from the past, while the outcome variable would be calculated from a more recent period — the last year or whatever. There should be no overlap. R-squared would not be 100%, but it would be very high.

The R-squared statistic is useful for guiding you as you add variables to a regression analysis, or for comparing similar models in terms of fit with the data. It is not terribly useful for deciding whether any one model is good or bad. A model with an R-squared of 15% may be highly valuable, while one with R-squared of 75% may be garbage. If a vendor is trying to sell you on a model they built based on a high R-squared alone, they are misleading you.

The goal of predictive modeling for major gifts is not to maximize R-squared. It’s to identify new prospects.

2. Using “attributes” as predictors

Another thing about that webinar bugged me. The same vendor advised us to “select variables with caution, avoiding ‘descriptors’ and focusing on potential predictors.” Specifically, we were warned that a marital status of ‘married’ will emerge as correlated with giving. Don’t be fooled! That’s not a predictor, they said.

So let me get this straight. We carry out an analysis that reveals that married people are more likely to give large gifts, that donors with more than one degree are more likely to give large gifts, that donors who have email addresses and business phone numbers in the database are more likely to give large gifts … but we are supposed to ignore all that?

The problem might not be the use of “descriptors,” the problem might be with the terminology. Maybe we need to stop using the word “predictor”. One experienced practitioner, Alexander Oftelie, briefly touched on this nuance in a recent blog post. I quote, (emphasis added by me):

“Data that on its own may seem unimportant — the channel someone donates, declining to receive the mug or calendar, preferring email to direct mail, or making ‘white mail’ or unsolicited gifts beyond their sustaining-gift donation — can be very powerful when they are brought together to paint a picture of engagement and interaction. Knowing who someone is isn’t by itself predictive (at best it may be correlated). Knowing how constituents choose to engage or not engage with your organization are the most powerful ingredients we have, and its already in our own garden.”

I don’t intend to critique Alexander’s post, which isn’t even on this particular topic. (It’s a good one — please read it.) But since he’s written this, permit me scratch my head about it a bit.

In fact, I think I agree with him that there is a distinction between a behaviour and a descriptor/attribute. A behaviour, an action taken at a specific point in time (eg., attending an event), can be classified as a predictor. An attribute (“who someone is,” eg., whether they are married or single) is better described as a correlate. I would also be willing to bet that if we carefully compared behavioural variables to attribute variables, the behaviours would outperform, as Alexander says.

In practice, however, we don’t need to make that distinction. If we are using regression to build our models, we are concerned solely and completely with correlation. To say “at best it may be correlated” suggests that predictive modellers have something better at their disposal that they should be using instead of correlation. What is it? I don’t know, and Alexander doesn’t say.

If in a given data set, we can demonstrate that being married is associated with likelihood to make a donation, then it only makes sense to use that variable in our model. Choosing to exclude it based on our assumption that it’s an attribute and not a behaviour doesn’t make business sense. We are looking for practical results, after all, not chasing some notion of purity. And let’s not fool ourselves, or clients, that we are getting down to causation. We aren’t.

Consider that at least some “attributes” can be stated in terms of a behaviour. People get married — that’s a behaviour, although not related to our institution. People get married and also tell us about it (or allow it to be public knowledge so that we can record it) — that’s also a behaviour, and potentially an interaction with us. And on the other side of the coin, behaviours or interactions can be stated as attributes — a person can be an event attendee, a donor, a taker of surveys.

If my analysis informs me that widowed female alumni over the age of 60 are extremely good candidates for a conversation about Planned Giving, then are you really going to tell me I’m wrong to act on that information, just because sex, age and being widowed are not “behaviours” that a person voluntarily carries out? Mmmm — sorry!

Call it quibbling over semantics if you like, but don’t assume it’s so easy to draw a circle around true predictors. There is only one way to surface predictors, which is to take a snapshot of all potentially relevant variables at a point in time, then gather data on the outcome you wish to predict (eg., giving) after that point in time, and then assess each variable in terms of the strength of association with that outcome. The tools we use to make that assessment are nothing other than correlation and significance. Again, if there are other tools in common usage, then I don’t know about them.

Caveats and concessions

I don’t maintain that this or that practice is “wrong” in all cases, nor do I insist on rules that apply universally. There’s a lot of art in this science, after all.

Using giving history as a predictor:

  • One may use some aspects of giving to predict outcomes that are not precisely the same as ‘Giving’, for example, likelihood to enter into a Planned Giving arrangement. The required degree of difference between predictors and outcome is a matter of judgement. I usually err on the side of scrupulously avoiding ANY leakage of the outcome side of the equation into the predictor side — but sure, rules can be bent.
  • I’ve explored the use of very early giving (the existence and size of gifts made by donors before age 30) to predict significant giving late in life. (See Mine your donor data with this baseball-inspired analysis.) But even then, I don’t use that as a variable in a model; it’s more of a flag used to help select prospects, in addition to modeling.

Using descriptors/attributes as predictors:

  • Some variables of this sort will appear to have subtly predictive effects in-model, effects that disappear when the model is deployed and new data starts coming in. That’s regrettable, but it’s something you can learn from — not a reason to toss all such variables into the trash, untested. The association between marital status and giving might be just a spurious correlation — or it might not be.
  • Business knowledge mixed with common sense will help keep you out of trouble. A bit of reflection should lead you to consider using ‘Married’ or ‘Number of Degrees’, while ignoring ‘Birth Month’ or ‘Eye Colour’. (Or astrological sign!)

There are many approaches one can take with predictive modeling, and naturally one may feel that one’s chosen method is “best”. The only sure way to proceed is to take the time to define exactly what you want to predict, try more than one approach, and then evaluate the performance of the scores when you have actual results available — which could be a year after deployment. We can listen to what experts are telling us, but it’s more important to listen to what the data is telling us.

//////////

Note: When I originally posted this, I referred to Atsuko Umeki as “he”. I apologize for this careless error and for whatever erroneous assumption that must have prompted it.

30 July 2013

Getting bitten by Python

When I was first learning to build predictive models, preparing the data was part of the adventure. In time, though, many operations on the data became standard instead of exploratory. Eventually they became simply repetitive and tedious. When any task becomes repetitive, I think of ways to automate it. Given that data prep makes up 80 percent of the work of building a model (according to some authors), the benefits of automation are obvious.

I can think of only two ways to replicate the manual operations you need to perform on a large data set to make it ready for modelling: Use software specially designed for the task, or code your own data-handling scripts. I am lazy and drawn to software solutions that make hard things easy, and I’m not a programmer. Yet I have veered away from a ready-made software solution to pursue an interest in the scripting language called Python, and in particular the Python code library called pandas, written specifically for working with data.

Maybe it’s because Python is open-source and free, or because it is powerful, or because it is flexible and widely adaptable to multiple uses on the job. I don’t know. But for the past few months I’ve been obsessed with learning to use it, and that’s what I’d like to talk about today.

I’m guessing very few CoolData readers have experience writing scripts for handling data. I know some people who do most of their stats work in the R language or manipulate data in Excel using VBA. But the majority of readers probably consider themselves severely allergic to coding of any kind. I concede that it isn’t for everyone, but look: Just as we don’t need to be professional statisticians to use statistical tools to create value for the business, we don’t need to be computer scientists to write useful scripts that can free up large chunks of time we now spend on routine tasks that bore us.

(If you work with someone in IT or Advancement Services who pulls and reshapes your data for you, they might be especially interested in the idea of learning how to automate your requests. They might also be familiar with Python already.)

I should say here that my aim is not to automate predictive modelling itself. There are Python modules for modelling, too, from the venerable classics such as regression to the latest advanced techniques. But I’m not so much interested in them, not yet at least. Building predictive models is best done hands-on, guided by a human modeler’s expertise and domain knowledge. My main interest is in eliminating a big chunk of the standard rote work so that I can apply the freshest version of myself to the more interesting and creative elements of data exploration and model creation.

So what is Python (and more specifically, pandas) good for?

  • A script or program can execute a series of database queries and join the results in exactly the way you want, allowing you to build very complex structures and incorporate custom aggregations that might be harder to do using your existing querying/reporting tools. For example, let’s say you want to build a file of donors and include columns for date of first and last gift, amount of highest gift, total cash gifts for the past five fiscal years, and percentage of total giving devoted to student financial assistance. Unless IT has built some advanced views for you from the base tables in your database, many of these variables will require applying some calculations to the raw transactional data. I could certainly build a query to get the results for this modest example, but it would involve a few sub-queries and calculated fields. Multiply that by a hundred and you’ve got an idea of how complex a query you’d have to build to deliver a modelling-ready data set. In fact it may be technically impossible, or at least difficult, to build such a single massive query. In Python, however, you can build your data file up in an orderly series of steps. Adding, removing or editing those steps is not a big deal.
  • Python also makes it simple to read data from .csv and Excel files, and merge it painlessly with the data you’ve extracted from your database. This is important to me because not all of my modelling data comes from our database. I’ve got eight years of call centre data results by alumni ID, wealth-related census data by Canadian postal code, capacity data by American ZIP code, and other standalone data sets. Adding these variables to my file used to be a tedious, manual process. In Python, left-joining 20 columns of census data to a file of 100,000 alumni records using Postal Code as the join key takes a single line of code and executes faster than a knight can say “Ni!” (Inside Python joke.)
  • Many other common operations also take only one or two lines of code, including conversion of categorical variables to 0/1 dummy variables, performing transformations and mathematical operations on variables, filling in or imputing missing data with constants or calculated values, pivoting data, and creating new variables from existing ones via concatenation (for strings) or math (for numbers).
  • With a script, you can also iterate over the rows of a data file and perform different operations based on conditional statements.

I’m not going to provide a Python tutorial today (although I’m tempted to do so in the future), but here is a sample line of code from a script, with a description of what it does. This doesn’t give you enough information to do anything useful, but you’ll at least see how compact and powerful the language is.

Skipping some necessary preliminaries, let’s say you’ve just used Python to query your Oracle database to read into memory a data set containing the variables ID, Constituent Category, Sex, and Age for all living constituent persons. (An operation that itself takes little more than two or three lines of code.) Obviously it depends on your database and code structure, but let’s say “Constituent Category” includes codes for such categories as Alumnus/na (ALUM), Non-degreed alumni (ALND), Parent (PRNT), Friend (FRND), Faculty (FCTY), Staff (STAF), and so on. And let’s further assume that a constituent can belong to multiple categories. Most people will have only one code, but it’s possible that a person can simultaneously be an alum, a parent, and a faculty member.

In our script, the data is read into a structure called a DataFrame (a tool provided by the pandas code library). This should sound familiar to users of R in particular. For the rest of us, a DataFrame is very much like a database table, with named columns and numbered (“indexed”) rows. Had we pasted the data into Excel instead, it might look like this:

pivot1

Right away we see that William and Janet are represented by multiple rows because they have multiple constituent codes. This won’t work for predictive modelling, which requires that we have just one row per individual – otherwise certain individuals would carry more weight in the model than they should. You could say that multiple records for Janet means that 60-year-old females are over-represented in the data. We could delete the extra rows, but we don’t want to do that because we’d be throwing away important information that is almost certainly informative of our modelling target, eg. likelihood to make a donation.

In order to keep this information while avoiding duplicate IDs, we need to pivot the data so that each category of Constituent Code (ALUM, PRNT, etc.) becomes its own column. The result we want would look like this in Excel:

pivot2

The Con_Code column is gone, and replaced with a series of columns, each a former category of Con_Code. In each column is either a 0 or 1, a “dummy variable” indicating whether an individual belongs to that constituency or not.

Getting the data from the first state to the final state requires just three lines of code in Python/pandas:

df = pd.merge(df, pd.crosstab(df.ID, df.Con_Code), how='left', left_on='ID', right_index=True)

df = df.drop(['Con_Code'], axis=1)

df = df.drop_duplicates()

This snippet of code may look invitingly simple or simply terrifying – it depends on your background. Whatever – it doesn’t matter, because my point is only that these three lines are very short, requiring very little typing, yet they elegantly handle a common data prep task that I have spent many hours performing manually.

Here’s a brief summary of what each line does:

Line 1: There’s a lot going on here … First, “df” is just the name of the DataFrame object. I could have called it anything. On the right-hand side, you see “pd” (which is shorthand for pandas, the module of code that is doing the work), then “crosstab,” (a function that performs the actual pivot). In the parentheses after pd.crosstab, we have specified the two columns to use in the pivot: df.ID is the data we want for the rows, and df.Con_Code is the column of categories that we want to expand into as many columns as there are categories. You don’t have to know in advance how many categories exist in your data, or what they are – Python just does it.

Pd.crosstab creates a new table containing only ID and all the new columns. That entity (or “object”) is just sitting out there, invisible, in your computer’s memory. We need to join it back to our original data set so that it is reunited with Age, Sex and whatever other stuff you’ve got. That’s what “pd.merge” does. Again, “pd” is just referencing the pandas module that is providing the “merge” function. The operation is called “merge,” but it’s much the same thing as an SQL-type join, familiar to anyone who queries a database. The merge takes two inputs, our original DataFrame (“df”), and the result from the crosstab operation that I described above. The argument called “how” specifies that we want to perform the equivalent of a left-join. A couple of other optional arguments explicitly tell Python which column to use as a join key (‘ID’).

The crosstab operation is enclosed within the merge operation. I could have separated these into multiple lines, which would have been less confusing, but my point is not to teach Python but to demonstrate how much you can accomplish with a trivial amount of typing. (Or copying-and-pasting, which works too!)

We’re not quite done, though. Our merged data is still full of duplicate IDs, because the Con_Code column is still present in our original data.

Line 2 deletes (“drops”) the entire column named Con_Code, and reassigns the altered DataFrame to “df” – essentially, replacing the original df with the new df created by the drop operation.

Now that Con_Code is gone, the “extra” rows are not just duplicates by ID, they are perfect duplicates across the entire row – there is nothing left to make two rows with the same ID unique. We are ready for the final step …

Line 3 deletes (or “drops”) every row that is a duplicate of a previous row.

Having accomplished this, another couple of lines near the end of the script (not shown) will write the data row by row into a new .csv file, which you can then import into your stats package of choice. If you had two dozen different constituent codes in your data, your new file will be wider by two dozen columns … all in the blink of an eye, without any need for Excel or any manual manipulation of the data.

Excel is perfectly capable of pivoting data like we see in the example, but for working with very large data sets and seamlessly merging the pivoted data back into the larger data file, I can’t think of a better tool than Python/pandas. As the data set gets bigger and bigger, the more need there is to stop working with it in tools that go to the extra work of DISPLAYING it. I suppose one of the beauties of Excel is that you can see the data as you are working on it. In fact, as I slowly built up my script, I repeatedly opened the .csv file in Excel just to have that visual inspection of the data to see that I was doing the right thing. But I inevitably reached the point at which the file was just too large for Excel to function smoothly. At 120,000 rows and 185 columns in a 90MB file, it was hardly Big Data – Excel could open the file no problem – but it was large enough that I wouldn’t want to do much filtering or messing with formulas.

On a quick first read, the code in the example above may seem impenetrable to a non-programmer (like me), but you don’t need to memorize a lot of functions and methods to write scripts in Python. Combing the Web for examples of what you want to do, using a lot of cut-and-paste, perhaps referring to a good book now and again – that’s all it takes, really.

That said, it does require time and patience. It took me many hours to cobble together my first script. I re-ran it a hundred times before I tracked down all the errors I made. I think it was worth it, though – every working piece of code is a step in the direction of saving untold hours. A script that works for one task often does not require much modification to work for another. (This cartoon says it all: Geeks and repetitive tasks.)

Beyond data preparation for predictive modelling, there are a number of directions I would like to go with Python, some of which I’ve made progress on already:

  • Merging data from multiple sources into data extract files for use in Tableau … With version 8.0 of the software comes the new Tableau API for building .tde files in Python. This was actually my first experiment with Python scripting. Using the TDE module and a combination of database queries and pandas DataFrames, you can achieve a high degree of automation for refreshing the most complex data sets behind your views and dashboards.
  • Exploring other modelling techniques besides my regular mainstay (regression) … I’ve long been intrigued by stuff such as neural networks, Random Forest, and so on, but I’ve been held back by a lack of time as well as some doubt that these techniques offer a significant improvement over what I’m doing now. Python gives ready access to many of these methods, allowing me to indulge my casual interest without investing a great deal of time. I am not a fan of the idea of automated modelling – the analyst should grasp what is going on in that black box. But I don’t see any harm in some quick-and-dirty experimentation, which could lead to solutions for problems I’m not even thinking of yet.
  • Taking advantage of APIs …. I’d like to try tapping into whatever social networking sites offer in the way of interfaces, and also programmatically access web services such as geocoding via Google.
  • Working with data sets that are too large for high-level applications such as Excel … I recently tried playing with two days’ worth of downloaded geocoded Twitter data. That’s MILLIONS of rows. You aren’t going to be using Excel for that.

I hope I’ve been able to transfer to you some of my enthusiasm for the power and potential of Python. I guess now you’ll be wondering how to get started. That’s not an easy question to answer. I could tell you how to download and install Python and an IDE (an integrated development environment, a user interface in which you may choose write, run, and debug your scripts), but beyond that, so much depends on what you want to do. Python has been extended in a great many directions – pandas for data analysis being just one of them.

However, it wouldn’t hurt to get a feel for how “core Python” works – that is, the central code base of the language along with its data types, object types, and basic operations such as “for” loops. Even before you bother installing anything, go to Codecademy.com and try a couple of the simple tutorials there.

For specific questions Google is your friend, but if you want a reference that covers all the basics in more or less plain English, I like “Learning Python” (4th Edition, but I see there’s a 5th Edition now) by Mark Lutz, published by O’Reilly. Another O’Reilly book, “Python for Data Analysis,” by Wes McKinney, describes how to crunch data with pandas and other related code libraries. (McKinney is the main author of the pandas library.)

I think readers new to programming (like me) will feel some frustration while learning to write their first scripts using any one book or resource. The Lutz book might seem too fine-grained in its survey of the basics for some readers, and McKinney is somewhat terse when offering examples of how various methods work. The problem is not with the books themselves – they’re wonderful. Consider that Python is used in web interfaces, robotics, database programming, gaming, financial markets, GIS, scientific programming, and probably every academic discipline that uses data – you must understand that core texts are perforce very general and abstract. (Think of grammar books for spoken languages.) It’s up to coders themselves to combine the basic building blocks in creative and powerful ways.

That said, after many, many hours spent hopping back and forth between these books, plus online tutorials and Python discussion forums – and just messing around on my own – I have figured out a few useful ways to accomplish some of the more common data preparation tasks that are specific to predictive modelling. Someday I would be happy to share – and, as always, to learn from the experience of others.

10 October 2012

Logistic vs. multiple regression: Our response to comments

Guest post by John Sammis and Peter B. Wylie

Thanks to all of you who read and commented on our recent paper comparing logistic regression with multiple regression. We were not sure how popular this topic would be, but Kevin told us that interest was high, and there were a number of comments and questions. There were several general themes in the comments; Kevin has done an excellent job responding, but we thought we should throw in our two cents.

Why not just use logistic?

The point of our paper was not to suggest that logistic regression should not be used — our point was that multiple regression can achieve prediction results quite similar to logistic regression. Based on our experience working with and training fundraising professionals getting introduced to analytics, logistic regression can be intimidating. Our goal is always to get these folks to use analytics to help with their fundraising initiatives. We find many of them catch on with multiple regression, and much less so with logistic regression.

Predicted values vs. probabilities

We understand that the predicted values generated by multiple regression are different from the probabilities generated by logistic regression. Regardless of the statistic modeling technique we use, we always bin the raw prediction or probability values into equal-sized score levels. We have found that score level bins are easier to use than raw values. And using equal-sized score levels allows for easier evaluation of the scoring model.

“I cannot agree”

Some commenters, knowledgeable about statistics, said they would not use multiple regression when the inputs called for logistic. According to the rules, if the target variable is binary, then linear modelling doesn’t make sense — and the rules must be obeyed. In our view, this rigid approach to method selection is inappropriate for predictive modelling. The use of multiple linear regression in place of logistic regression may not always make theoretical sense, but predictive modellers are concerned with whether or not a model produces an output that is useful in practical terms. The worth of a model is testable against new, real-world data, therefore a model has only one criterion for determining “appropriate” use: Whether it really predicts what the modeler claims it will predict. The truth is revealed during evaluation.

A modest proposal

No one reading this should simply take our word that these two dissimilar methods yield similar results. Neither should anyone dismiss it out of hand without providing a critique based on real data. We would encourage anyone to try doing something on your own with data using both techniques and show us what you find. In particular, graduate students looking for a thesis or dissertation topic might consider producing something under this title: “Comparing Logistic Regression and Multiple Regression as Techniques for Predicting Major Giving.”

Heck! Peter says that if anyone were interested in doing a study like this for a thesis or dissertation, he would be willing to offer advice on how to:

  1. Do a thorough literature review
  2. Formulate specific research questions
  3. Come up with a study design
  4. Prepare a proposal that would satisfy a thesis or dissertation committee.

That’s quite an offer. How about it?

20 September 2012

When less data is more, in predictive modelling

When I started doing predictive modelling, I was keenly interested in picking the best and coolest predictor variables. As my understanding deepened, I turned my attention to how to define the dependent variable in order to really get at what I was trying to predict. More recently, however, I’ve been thinking about refining or limiting the population of constituents to be scored, and how that can help the model.

What difference does it make who gets a propensity score? Up until maybe a year ago, I wasn’t too concerned. Sure, probably no 22-year-old graduate had ever entered a planned giving agreement, but I didn’t see any harm in applying a score to all our alumni, even our youngest.

Lately, I’m not so sure. Using the example of a planned gift propensity model, the problem is this: Young alumni don’t just get a score; they also influence how the model is trained. If all your current expectancies were at least 50 before they decided to make a bequest, and half your alumni are under 30 years old, then one of the major distinctions your model will make is based on age. ANY alum over 50 is going to score well, regardless of whether he or she has any affinity to the institution, simply because 100% of your target is in that age group.

The model is doing the right thing by giving higher scores to older alumni. If ages in the sample range from 21 to 100+, then age as a variable will undoubtedly contribute to a large chunk of the model’s ability to “explain” the target. But this hardly tells us anything we didn’t already know. We KNOW that alumni don’t make bequest arrangements at age 22, so why include them in the model?

It’s not just the fact that their having a score is irrelevant. I’m concerned about allowing good predictor variables to interact with ‘Age’ in a way that compromises their effectiveness. Variables are being moderated by ‘Age’, without the benefit of improving the model in a way that we get what we want out of it.

Note that we don’t have to explicitly enter ‘Age’ as a variable in the model for young alumni to influence the outcome in undesirable ways. Here’s an example, using event attendance as a predictor:

Let’s say a lot of very young alumni and some very elderly constituents attend their class reunions. The older alumni who attend reunions are probably more likely than their non-attending classmates to enter into planned giving agreements — for my institution, that is definitely the case. On the other hand, the young alumni who attend reunions are probably no more or less likely than their non-attending peers to consider planned giving — no one that age is a serious prospect. What happens to ‘event attendance’ as a predictor in which the dependent variable is ‘Current planned giving expectancy’? … Because a lot of young alumni who are not members of the target variable attended events, the attribute of being an event attendee will be associated with NOT being a planned giving expectancy. Or at the very least, it will considerably dilute the positive association between predictor and target found among older alumni.

I confirmed this recently using some partly made-up data. The data file started out as real alumni data and included age, a flag for who is a current expectancy, and a flag for ‘event attendee’. I massaged it a bit by artificially bumping up the number of alumni under the age of 50 who were coded as having attended an event, to create a scenario in which an institution’s events are equally popular with young and old alike. In a simple regression model with the entire alumni file included in the sample, ‘event attendance’ was weakly associated with being a planned giving expectancy. When I limited the sample to alumni 50 years of age and older, however, the R squared statistic doubled. (That is, event attendance was about twice as effective at explaining the target.) Conversely, when I limited the sample to under-50s, R squared was nearly zero.

True, I had to tamper with the data in order to get this result. But even had I not, there would still have been many under-50 event attendees, and their presence in the file would still have reduced the observed correlation between event attendance and planned giving propensity, to no useful end.

You probably already know that it’s best not to lump deceased constituents in with living ones, or non-alumni along with alumni, or corporations and foundations along with persons. They are completely distinct entities. But depending on what you’re trying to predict, your population can fruitfully be split along other, more subtle distinctions. Here are a few:

  • For donor acquisition models, in which the target value is “newly-acquired donor”, exclude all renewed donors. You strictly want to have only newly-acquired donors and never-donors in your model. Your good prospects for conversion are the never-donors who most resemble the newly-acquired donors. Renewed donors don’t serve any purpose in such a model and will muddy the waters considerably.
  • Conversely, remove never-donors from models that predict major giving and leadership-level annual giving. Those higher-level donors tend not to emerge out of thin air: They have giving histories.
  • Looking at ‘Age’ again … making distinctions based on age applies to major-gift propensity models just as it does to planned giving propensity: Very young people do not make large gifts. Look at your data to find out at what age donors were when they first gave $1,000, say. This will help inform what your cutoff should be.
  • When building models specifically for Phonathon, whether donor-acquisition or contact likelihood, remove constituents who are coded Do Not Call or who do not have a valid phone number in the database, or who are unlikely to be called (international alumni, perhaps).
  • Exclude international alumni from event attendance or volunteering likelihood models, if you never offer involvement opportunities outside your own country or continent.

Those are just examples. As for general principles, I think both of the following conditions must be met in order for you to gain from excluding a group of constituents from your model. By a “group” I mean any collection of individuals who share a certain trait. Choose to exclude IF:

  1. Nearly 100% of constituents with the trait fall outside the target behaviour (that is, the behaviour you are trying to predict); AND,
  2. Having a score for people with that trait is irrelevant (that is, their scores will not result in any action being taken with them, even if a score is very low or very high).

You would apply the “rules” like this … You’re building a model to predict who is most likely to answer the phone, for use by Phonathon, and you’re wondering what to do with a bunch of alumni who are coded Do Not Call. Well, it stands to reason that 1) people with this trait will have little or no phone contact history in the database (the target behaviour), and 2) people with this trait won’t be called, even if they have a very high contact-likelihood score. The verdict is “exclude.”

It’s not often you’ll hear me say that less (data) is more. Fewer cases in your data file will in fact tend to depress your model’s R squared. But your ultimate goal is not to maximize R squared — it’s to produce a model that does what you want. Fitting the data is a good thing, but only when you have the right data.

Older Posts »

Blog at WordPress.com.