CoolData blog

26 January 2012

More mistakes I’ve made

Filed under: Best practices, Peter Wylie, Pitfalls, Validation - kevinmacdonell @ 1:38 pm

A while back I wrote a couple of posts about mistakes I’ve made in data mining and predictive modelling. (See “Four mistakes I have made” and “When your predictive model sucks.”) Today I’m pleased to point out a brand new one.

The last days of work leading up to Christmas had me evaluating my new-donor acquisition models to see how well they’ve been working. Unfortunately, they were not working well. I had hoped — I had expected — to see newly-acquired donors clustered in the upper ranges of the decile scores I had created. Instead they were scattered all along the whole range. A solicitation conducted at random would have performed nearly as well.

Our mailing was restricted by score (roughly the top two deciles only), but our phone solicitation was more broad, so donors came from the whole range of deciles:

Very disappointing. To tell the truth, I had seen this before: A model that does well predicting overall participation, but which fails to identify which non-donors are most likely to convert. I am well past the point of being impressed by a model that tells me what everyone already knows, i.e. that loyal donors are most likely to give again. I want to have confidence that acquisition mail dollars are spent wisely.

So it was back to the drawing board. I considered whether my model was suffering from overfitting, whether perhaps I had too many variables, too much random noise, or multicollinearity. I studied and rejected one possibility after another. After so much effort, I came rather close to concluding that new-donor acquisition is not just difficult but darn near impossible.

Dire possibility indeed. If you can’t predict conversion, then why bother with any of this?

It was during a phone conversation with Peter Wylie that things suddenly became clear. He asked me one question: How did I define my dependent variable? I checked, and found that my DV was named “Recent Donors.” That’s all it took to find where I had gone wrong.

As the name of the DV suggested, it turned out that the model was trained on a binary variable that flagged anyone who had made a gift in the past two years. The problem was that this included everybody: long-time donors and newly-acquired donors alike. The model was highly influenced by the regular donors, and the new donors were lost in the shuffle.

It was a classic case of failing to properly define the question. If my goal was to identify the patterns and characteristics of newly-acquired donors, then I should have limited my DV strictly to non-donors who had recently converted to donors!

So I rebuilt the model, using the same data file and variables I had used to build the original model. This time, however, I pared the sample down to alumni who had never given a cent before fiscal 2009. They were the only alumni I needed to have scores for. Then I redefined my dependent variable so that non-donors who converted, i.e., who made a gift in either fiscal 2009 or 2010, were coded ‘1’, and all others were coded ‘0’. (I used two years of giving data instead of just one in order to have a little more data available for defining the DV.) Finally, I output a new set of decile scores from a binary logistic regression.
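To make the redefinition concrete, here is a minimal sketch in plain Python. It is not the code behind the model described above: the alum IDs, amounts, and the names `giving_by_year`, `never_gave_before` and `converted` are all made up for illustration. It shows only the two steps that matter, restricting the sample to alumni with no giving before fiscal 2009 and then coding the DV on that restricted sample.

```python
# Hypothetical records: alum ID -> {fiscal year: total giving that year}.
# Every ID, year and amount below is invented for illustration.
giving_by_year = {
    "A001": {2005: 50.0, 2009: 25.0},  # long-time donor: gave before FY2009
    "A002": {2009: 100.0},             # non-donor who converted in FY2009
    "A003": {},                        # never gave
    "A004": {2010: 20.0},              # non-donor who converted in FY2010
}

FIRST_DV_YEAR = 2009
DV_YEARS = (2009, 2010)


def never_gave_before(gifts, first_year):
    """True if the alum has no positive giving in any year before first_year."""
    return not any(
        year < first_year and amount > 0 for year, amount in gifts.items()
    )


def converted(gifts, dv_years):
    """Return 1 if the alum made a gift in any of the DV years, else 0."""
    return int(any(gifts.get(year, 0) > 0 for year in dv_years))


# Step 1: pare the sample down to alumni with no giving before FY2009.
sample = {
    alum_id: gifts
    for alum_id, gifts in giving_by_year.items()
    if never_gave_before(gifts, FIRST_DV_YEAR)
}

# Step 2: code the dependent variable on the restricted sample only.
dependent_variable = {
    alum_id: converted(gifts, DV_YEARS) for alum_id, gifts in sample.items()
}

print(dependent_variable)
```

The long-time donor drops out of the sample entirely, so it can no longer dominate whatever model is trained on this DV.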

A test of the new scores showed that the new model was a vast improvement over the original. How did I test this? Recall that I reused the same data file from the original model. Therefore, it contained no giving data from the current fiscal year; the model was innocent of any knowledge of the future. Compare this breakdown of new donors with the one above:

Much better. Not fan-flippin-tastic, but better.
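The kind of check described above can be sketched in a few lines of plain Python. The data here is invented, and the convention that decile 10 is the highest score is an assumption for the example. The idea is simply to count newly acquired donors by decile and see what share landed in the top two. As noted earlier, a mailing restricted to the top deciles will inflate that share, so the comparison is fairest on a channel that solicited across the whole range.

```python
from collections import Counter

# Hypothetical scored non-donors: (decile score, became a donor this year?).
# Decile 10 is assumed to be the highest score; all values are made up.
scored = [
    (10, True), (10, True), (9, True), (9, False), (8, False),
    (7, True), (5, False), (3, True), (2, False), (1, False),
]


def new_donors_by_decile(rows):
    """Count new donors in each decile, listed from decile 10 down to 1."""
    counts = Counter(decile for decile, is_new_donor in rows if is_new_donor)
    return {decile: counts.get(decile, 0) for decile in range(10, 0, -1)}


def share_in_top(rows, top_deciles=(9, 10)):
    """Fraction of all new donors whose score fell in the given deciles."""
    new_donor_deciles = [decile for decile, is_new in rows if is_new]
    if not new_donor_deciles:
        return 0.0
    in_top = sum(decile in top_deciles for decile in new_donor_deciles)
    return in_top / len(new_donor_deciles)


breakdown = new_donors_by_decile(scored)
top_share = share_in_top(scored)

for decile, count in breakdown.items():
    print(f"decile {decile:2d}: {count} new donor(s)")
print(f"share of new donors in top two deciles: {top_share:.0%}")
```

If the scores were no better than random, roughly 20% of new donors would be expected in the top two deciles, which gives a baseline to compare against.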

My error was a basic one — I’ve even cautioned about it in previous posts. Maybe I’m stupid, or maybe I’m just human. But like anyone who works with data, I can figure out when I’m wrong. That’s a huge advantage.

  • Be skeptical about the quality of your work.
  • Evaluate the results of your decisions.
  • Admit your mistakes.
  • Document your mistakes and learn from them.
  • Stay humble.

16 January 2012

Address updates and affinity: Consider the source

Filed under: Correlation, Predictor variables, skeptics - kevinmacdonell @ 1:03 pm

Some of the best predictors in my models are related to the presence or absence of phone numbers and addresses. For example, the presence of a business phone is usually a highly significant predictor of giving. As well, a count of either phone or address updates present in the database is also highly correlated with giving.

Some people have difficulty accepting this as useful information. The most common objection I hear is that such updates can easily come from research and data appends, and are therefore not signals of affinity at all. And that would be true: Any data that exists solely because you bought it or looked it up doesn’t tell you how someone feels about your institution. (Aside from the fact that you had to go looking for them in the first place — which I’ve observed is negatively correlated with giving.)

Sometimes this objection comes from someone who is just learning data mining. Then I know I’m dealing with someone who’s perceptive. They obviously get it, to some degree — they understand there’s potentially a problem.

I’m less impressed when I hear it from knowledgeable people, who say they avoid contact information in their variable selection altogether. I think that’s a shame, and a signal that they aren’t willing to put in the work to a) understand the data they’re working with, or b) take steps to counteract the perceived taint in the data.

If you took the trouble to understand your data (and why wouldn’t you?), you’d find out soon enough whether the variables are usable:

  • If the majority of phone numbers or business addresses or what-have-you are present in the database only because they came off donors’ cheques, then you’re right in not using them to predict giving. They’re not independent of giving and will harm your model. The telltale sign might be a correlation with the target variable that exceeds correlations for all your other variables.
  • If the information could have come to you any number of ways (with gift transactions being only one of them), then use with caution. That is, be alert if the correlation looks too good to be true. This is the most likely scenario, which I will discuss in detail shortly.
  • If the information could only have come from data appends or research, then you’ve got nothing much to worry about: The correlation with giving will be so weak that the variable probably won’t make it into your model at all. Or it may be a negative predictor, highlighting the people who allowed themselves to become lost in the first place. An exception to the “don’t worry” policy would be if research is conducted mainly to find past donors who have become lost — then there might be a strong correlation that will lead you astray.
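One way to look for the telltale sign mentioned in the first scenario is to rank candidate predictors by their correlation with the target and flag any that look too good to be true. The sketch below is a rough illustration only: the variable names, the 0/1 data and the 0.8 threshold are all assumptions made for the example, and a flagged variable is a prompt to go and ask how the data got there, not a verdict.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable carries no information
    return cov / sqrt(var_x * var_y)


# Hypothetical 0/1 flags, one entry per alum. All values are made up.
is_donor = [1, 1, 1, 0, 0, 0, 1, 0]
predictors = {
    "has_business_phone": [1, 1, 1, 0, 0, 0, 1, 0],  # mirrors giving exactly
    "has_email":          [1, 0, 1, 0, 1, 0, 1, 0],
    "attended_reunion":   [1, 0, 0, 0, 0, 1, 1, 0],
}

correlations = {
    name: pearson(values, is_donor) for name, values in predictors.items()
}

# Rank by strength of association, strongest first.
ranked = sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Arbitrary cut-off chosen for this example, not a recommended value.
SUSPICION_THRESHOLD = 0.8
suspects = [name for name, r in ranked if abs(r) >= SUSPICION_THRESHOLD]

for name, r in ranked:
    print(f"{name:20s} r = {r:+.2f}")
print("check the source of:", suspects)
```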

An in-house predictive modeler will simply know what the case is, or will take the trouble to find out. A vendor hired to do the work may or may not bother — I don’t know. As far as my own models are concerned, I know that addresses and phone numbers come to us via a mix of voluntary and involuntary means: Via Phonathon, forms on the website, records research, and so on.

I’ve found that a simple count of all historical address updates for each alum is positively correlated with giving. But a line plot of the relationship between number of address updates and average lifetime giving suggests there’s more going on under the surface. Average lifetime giving goes up sharply for the first half-dozen or so updates, and then falls away just as sharply. This might indicate a couple of opposing forces: Alumni who keep us informed of their locations are more likely to be donors, but alumni who are perpetually lost and need to be found via research are less likely to be donors.
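The numbers behind a line plot like that are just average lifetime giving grouped by update count. Here is a minimal sketch with invented data, shaped to rise and then fall in the way described; the name `mean_giving_by_update_count` is hypothetical.

```python
from collections import defaultdict

# Hypothetical alumni: (number of address updates, lifetime giving).
# Values are invented to mimic a rise-then-fall pattern.
alumni = [
    (0, 0.0), (0, 20.0),
    (1, 50.0), (1, 150.0),
    (3, 400.0), (3, 600.0),
    (6, 900.0), (6, 1100.0),
    (9, 100.0), (9, 300.0),
    (12, 0.0), (12, 50.0),
]


def mean_giving_by_update_count(rows):
    """Average lifetime giving for each distinct number of updates."""
    totals = defaultdict(lambda: [0.0, 0])  # count -> [sum of giving, alumni]
    for n_updates, giving in rows:
        totals[n_updates][0] += giving
        totals[n_updates][1] += 1
    return {n: total / count for n, (total, count) in sorted(totals.items())}


means = mean_giving_by_update_count(alumni)
peak = max(means, key=means.get)

for n_updates, mean in means.items():
    print(f"{n_updates:2d} updates: average lifetime giving {mean:8.2f}")
print("giving peaks at", peak, "updates")
```

A single correlation coefficient would summarize this curve as mildly positive and hide the turn, which is why the grouped averages are worth looking at.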

If you’re lucky, your database not only has a field in which to record the source of updates, but your records office is making good use of it. Our database happens to have almost 40 different codes for the source, applied to some 300,000 changes of address and/or phone number. Not surprisingly, some of these are not in regular use — some account for fewer than one-tenth of one percent of updates, and will have no significance in a model on their own.

For the most common source types, though, an analysis of their association with giving is very interesting. Some codes are positively correlated with giving, some negatively. In most cases, a variable is positive or negative depending on whether the update was triggered by the alum (positive), or by the institution (negative). On the other hand, address updates that come to us via Phonathon are negatively correlated with giving, possibly because by-mail donors tend not to need a phone call — if ‘giving’ were restricted to phone solicitation only, perhaps the association might flip toward the positive. Other variables that I thought should be positive were actually flat. But it’s all interesting stuff.

For every source code, a line plot of average LT giving and number of updates is useful, because the relationship is rarely linear. The relationship might be positive up to a point, then drop off sharply, or maybe the reverse will be true. Knowing this will suggest ways to re-express the variable. I’ve found that alumni who have a single update based on the National Change of Address database have given more than alumni who have no NCOA updates. However, average giving plummets for every additional NCOA update. If we have to keep going out there to find you, it probably means you don’t want to be found!
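As a rough sketch of what re-expressing such a variable might look like, the example below counts updates per alum for one source code and then replaces the raw count with two indicators: exactly one update, and two or more. The source codes, IDs and function names are hypothetical, and the cut points are chosen only to match the NCOA pattern described above; a different source might call for different cuts.

```python
from collections import Counter

# Hypothetical update log: (alum ID, source code). Codes are made up.
update_log = [
    ("A001", "NCOA"), ("A001", "WEB"),
    ("A002", "NCOA"), ("A002", "NCOA"), ("A002", "NCOA"),
    ("A003", "PHONATHON"),
    ("A004", "WEB"),
]
all_alumni = ["A001", "A002", "A003", "A004", "A005"]


def count_updates(log, source):
    """Number of updates per alum that came from the given source."""
    return Counter(alum_id for alum_id, src in log if src == source)


def reexpress(count):
    """Turn a raw count into two 0/1 indicators instead of a linear term."""
    return {
        "one_update": int(count == 1),
        "two_plus_updates": int(count >= 2),
    }


ncoa_counts = count_updates(update_log, "NCOA")
ncoa_features = {
    alum_id: reexpress(ncoa_counts.get(alum_id, 0)) for alum_id in all_alumni
}

for alum_id, features in ncoa_features.items():
    print(alum_id, features)
```

Splitting the count this way lets a model give the two groups opposite signs, which a single count variable cannot do.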

Classifying contact updates by source is more work, of course, and it won’t always pay off. But it’s worth exploring if your goal is to produce better, more accurate models.

11 January 2012

The data-driven organization: Know any?

Filed under: Book — kevinmacdonell @ 11:59 am

I was chatting with Peter Wylie the other day, which we do from time to time, since we are, after all, collaborating on a book which we hope to finish writing in the coming months. The book is about how to bring our institutions, nonprofits, development and advancement offices into the data-driven decision making age.

We got to thinking, are there any institutions (universities, or university advancement departments, or nonprofit organizations) that are shining examples of data-driven decision making? Is there anyone we can profile in the book as an exemplar?

We can name plenty of data-oriented people who are doing great work as individuals. But what about institutions or departments as a whole? Are there any that employ analytics from top to bottom? Are there any that pass all decision-making processes through a layer of data analysis (if appropriate) before the final stage is reached?

We struggled to come up with non-profit examples. Can you help? Tell us about the organization you’d nominate as data-driven — perhaps it’s your own. Be prepared to explain why. You can remain anonymous, although we would prefer to be able to identify persons and institutions by name. Email me at kevin.macdonell@gmail.com.

4 January 2012

Look inside first

Filed under: External data - kevinmacdonell @ 8:41 am

During a panel discussion on the second day of last October’s DRIVE conference, one of the panel members mentioned that it’s possible to learn which of your constituents have “liked” your fan page on Facebook. The mechanics of it went over my head; I don’t recall if it involved developing an application or scraping data some other way. Anyway, the discussion veered in the direction that doing this was crossing some sort of line.

I’m not sure about that. On one hand, you need to remember your responsibility to donors to operate in the least wasteful way possible. Getting smarter about identifying people who feel an affinity with your institution or cause is part of that.

On the other hand, it’s hard to justify making it a priority to scrape data from external sources if you’re doing a lousy job of using your much more valuable internal data.

Let’s get the internal part right first, and explore the ethics of casting a wider net later.

22 December 2011

Why I love work

Filed under: Training / Professional Development — kevinmacdonell @ 2:12 pm

It’s so hard to get anything finished this time of year. As the holidays get closer, I get more distracted and feel unwell from eating too much sugar. I’ve got more than a dozen draft posts queued up (some are rather meaty and nearly ready to publish), but I just haven’t gotten up the steam.

So let me finish off the year with a thought that came into my head as I was going to work this morning.

Today is my last day before the holidays, and although I’m happy to have some time off, I’m in no hurry to have this day over with. I love my job and I enjoy being at work.

I didn’t always feel that way. Up until about eight years ago I would rather have avoided work. So what’s changed?

It has a lot to do with learning how to work with data to help people accomplish real things. How to describe what that feels like … Well, you probably don’t remember the experience of learning to read and write, but you can imagine the feeling of being enabled that it must have given you at the time — acquiring this amazing technology for making sense of things through reading, and creating one’s own sense through writing. It gave you handles with which to grasp the world, to experience it and even to change it.

Data analysis is a bit like that.

13 December 2011

Finding connections to your major gift prospects in your data

Guest post by Erich Preisendorfer, Associate Director, Business Intelligence, Advancement Services, University of New Hampshire

(Thanks to Erich for this guest post, which touches on something a lot of prospect researchers are interested in: mapping relationships to prospects in their database. Actually, this work is more exciting than that, because it actually helps people find connections they may not have known about, via database queries and a simple scoring system. Is your Advancement Services department working on something like this? Why not ask them? — Kevin.)

Data miners often explore sets of data to find meaningful patterns, which can then be modeled to make predictions that help meet their organization’s end goal(s). However, there may be a time when the end behavior is not inherent in your database. Such a situation recently came up for my Advancement organization.

Our prospecting team recently started a program wrapped around peer recommendations: A prospect recommends new suspects to us based on the prospect’s interactions with the suspects. The question then became, what can we provide to the prospect to help get them thinking about potential suspects?

We currently do not have any type of data which would allow us to say, “Yes, this is what a relationship looks like,” outside of family relationships. We had to find a different way to identify potential acquaintances. I looked back at my own relationships to determine how I know the people I know. My friends and acquaintances largely come from some basic areas: school, work, places I’ve gone, etc.

Transforming my experience with relationships into what we have for useable data, I saw three key areas where relationships may exist: work history, education history, and extracurricular activities including one-time events. Fortunately, I was able to pinpoint our constituents’ time in each of the above areas to help isolate meaningful, shared experiences amongst constituents. Our work and extracurricular history includes to/from dates, and we have loads of educational history data that includes specific dates. Using this data, I am able to come up with potential relationships from a single prospect.

Prospect Profile (generated by entering a single prospect’s ID):

  • John Adams
  • Widget Factory, Employee 01/05/1971 – 06/16/1996
  • Student Activities: Football, Student Senate 09/1965-05/1966
  • Bachelor of Arts, Botany 1966

Potential Relationships (each item below is a separate query, using the Prospect Profile results):

  • Those employed by the Widget Factory who started before John ended, and ended after John began.
  • Those students who participated in Football and had a class year within +/-3 years of John.
  • Those students in Student Senate at the same time as John, similar to the Widget Factory example.
  • Those students who were in the same class year as John.
  • Those students who share John’s major.
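The first three queries above come down to two simple tests: do two date ranges overlap, and are two class years within a window of each other? Here is a minimal Python sketch of both, using John’s dates from the profile. The function names and the other constituents’ dates are made up for illustration, and the actual report presumably runs as database queries rather than Python.

```python
from datetime import date


def periods_overlap(start_a, end_a, start_b, end_b):
    """True if A started before B ended and A ended after B began."""
    return start_a < end_b and end_a > start_b


def within_class_years(year_a, year_b, window=3):
    """True if two class years are within +/- window years of each other."""
    return abs(year_a - year_b) <= window


# John's employment at the Widget Factory, from the profile above.
john_start, john_end = date(1971, 1, 5), date(1996, 6, 16)

# Hypothetical colleagues (invented dates).
colleague_overlapping = (date(1990, 3, 1), date(2001, 8, 31))
colleague_earlier = (date(1962, 9, 1), date(1970, 12, 31))

print(periods_overlap(*colleague_overlapping, john_start, john_end))  # True
print(periods_overlap(*colleague_earlier, john_start, john_end))      # False
print(within_class_years(1968, 1966))                                 # True
print(within_class_years(1970, 1966))                                 # False
```

The overlap test needs only two comparisons because any two ranges that fail to overlap must have one ending before the other begins.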

Currently, since I have no way of proving the value of one point of contact over another, each row returned by the potential-relationship queries earns the constituent one point. Since my database stores historical records, I may get more than one row per constituent in any one category if they match more than one of John’s associated records, for example if they participated in Student Senate and also played Football. This is great, because I want to give those particular constituents two points since they have more than one touch point in common with John.
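The one-point-per-row scoring can be sketched in a few lines of Python. The query names and constituent IDs below are hypothetical stand-ins for whatever the five queries return; the only logic is that every returned row adds one point and the totals are then sorted.

```python
from collections import Counter

# Hypothetical results of the five queries: each list holds one constituent
# ID per returned row. IDs and query names are invented for illustration.
query_results = {
    "employer_overlap":        ["C101", "C102"],
    "football_within_3_years": ["C102", "C103"],
    "student_senate_overlap":  ["C102"],
    "same_class_year":         ["C103", "C104"],
    "same_major":              ["C105"],
}

points = Counter()
for rows in query_results.values():
    points.update(rows)  # every returned row is worth exactly one point

# Highest score first; ties broken by ID so the order is stable.
ranked = sorted(points.items(), key=lambda item: (-item[1], item[0]))

for constituent_id, score in ranked:
    print(constituent_id, score)
```

If supporting data ever shows that some touch points matter more than others, the flat one-point rule could be swapped for a weight per query without changing the rest.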

I end up with a ranked list of constituents who share potential relationship contacts with my main prospect. The relationship lists give our prospect researchers a starting point for putting together a solid list of high-capacity constituents with whom a single person may have some sort of relationship, and thus greater insight into potential giving.

As of now, the report is in its infancy but looks to have high potential. As we grow the concept, there are multiple data points where further exploration could result in a higher level of functioning. As prospects use the lists to identify people they know, we can then deconstruct those choices to determine what is more likely a relationship. Should shared employment be ranked higher than shared class year? Should Football rank higher than Student Senate? I would guess yes, but I currently do not have supporting data to make that decision.

Another interesting concept, raised at the recent DRIVE 2011 conference, would be: “How are these two prospects potentially related by a third constituent?” The result could mean the difference between two separate, forced conversations and a single conversation among three prospects, shared over nostalgic stories, drinks and, hopefully, money in the door!

Erich Preisendorfer is Associate Director, Business Intelligence, working in Advancement Services at the University of New Hampshire.
