CoolData blog

6 July 2010

Pragmatism and validity: Don’t get your knickers in a knot

Filed under: skeptics. Posted by kevinmacdonell @ 12:05 pm

Image caption: Worried about failing to reject the null hypothesis? Don't come crying to me. (Creative Commons license.)

Data mining is not a lab experiment; it is pragmatic. Do data miners play fast and loose with proper statistical technique? I don’t think so. We work in the real world, and we use what works. Use the tools of your trade properly, by all means, but let’s not get bogged down in nit-picking about the validity of our methods.

A number of things set the practice of data mining apart from the statistics you read about in textbooks.

Data mining is exploratory, not experimental. Elements of statistics used in experiment design (including the null hypothesis) are not a big concern to me.

Data mining is concerned with correlation and prediction, not with correlation and causation. Everyone working with stats is concerned with correlation, but I’m not interested in the direction of dependence — whether Egg X brings about Chicken Y, or vice-versa. I know that having a single initial in the database for first or middle name is sometimes predictive of giving (even when other variables such as age are taken into account). I don’t need to know why; knowing the correlation exists is enough for me.

My bottom line is: observe the correlations, give them proper influence in the prediction via regression, and don’t worry about causation.
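To make "give them proper influence in the prediction via regression" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method behind any model described in this post: the data is simulated, the variable names `has_initial` and `age_scaled` are hypothetical, the effect sizes are built into the simulation, and a small gradient-descent logistic regression stands in for whatever regression tool you actually use.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent.

    Returns [intercept, w_1, w_2, ...], one weight per predictor.
    """
    n = len(rows)
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            grad[0] += p - y
            for j, v in enumerate(x):
                grad[j + 1] += (p - y) * v
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights

def predict(weights, x):
    """Predicted probability of being a donor for one record."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

# Simulated records (an assumption, not real alumni data).
rng = random.Random(0)
rows, labels = [], []
for _ in range(600):
    has_initial = 1 if rng.random() < 0.3 else 0       # hypothetical flag
    age_scaled = (rng.uniform(25, 75) - 50) / 10       # age centred and scaled
    true_log_odds = -1.0 + 1.5 * has_initial + 0.5 * age_scaled
    labels.append(1 if rng.random() < sigmoid(true_log_odds) else 0)
    rows.append([has_initial, age_scaled])

weights = fit_logistic(rows, labels)
intercept, w_initial, w_age = weights
print("weight for has_initial (age held constant):", round(w_initial, 2))
print("weight for age_scaled:", round(w_age, 2))
```

Because age is in the model alongside the initial flag, the weight on `has_initial` reflects its association with giving after age is taken into account. The regression assigns that weight without any claim about why the association exists.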

I also do not wring my hands over the sequence of events in time; for example, whether a prospect’s business phone number was acquired AFTER a gift was received. The worry would be understandable: how can that datum be “predictive” when it occurs after the fact? The concern is a direct result of a conventional understanding of the term “prediction,” which implies a certain order in time. The ‘predictor’ must precede the ‘predicted’, n’est-ce pas?

Not in my world, necessarily.

The conventional view has it that non-donors and donors are on opposite sides of a division in time. One day, a non-donor approaches the divide, and passes through it, magically transforming into a donor. The caterpillar changes into a butterfly. One might think therefore that only caterpillar-attributes are appropriate predictors for us to use. Butterfly-attributes, like our business phone that came to us after the gift (or because of it) would be inadmissible.

That’s not the way I see it. There is a divide, but it is not in time. The divide is between non-donors and donors, but no one crosses it. Why? Because there are already a lot of non-donors on the donor side: They are donors who haven’t given yet! To the unaided eye, they look like caterpillars, but their nature is pure butterfly.

Their butterfly-nature was created while they were a student, and nurtured during their time as an alum. It’s everything they think and feel about alma mater, their level of engagement, everything measurable and unmeasurable. Being a donor is only an outward expression of it. And as important as that expression is to us, it is not essential to the butterfly; we have to ask for it.

Donors share all sorts of characteristics, some of which we know about: reunion attendance, a tendency to provide contact information, and a dozen other things, some quite non-intuitive. When we find these same tendencies to a high degree in certain non-donors, we recognize them for what they are: Butterflies, and donors-to-be.
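The idea of spotting donor-like tendencies in non-donors can be sketched as a simple scoring pass. This is a hedged illustration only: the records are invented, the field names (`attended_reunion`, `has_email`, `has_business_phone`) are hypothetical, and the weights are made-up placeholders standing in for coefficients a regression fitted on all records might produce.

```python
# Hypothetical weights, as if taken from a regression fitted on all records.
WEIGHTS = {
    "attended_reunion": 1.2,
    "has_email": 0.7,
    "has_business_phone": 0.9,
}

# Invented example records; "donor" marks whether a gift has been received.
records = [
    {"id": "A1", "donor": True,  "attended_reunion": 1, "has_email": 1, "has_business_phone": 1},
    {"id": "B2", "donor": False, "attended_reunion": 1, "has_email": 1, "has_business_phone": 0},
    {"id": "C3", "donor": False, "attended_reunion": 0, "has_email": 0, "has_business_phone": 0},
    {"id": "D4", "donor": False, "attended_reunion": 1, "has_email": 1, "has_business_phone": 1},
    {"id": "E5", "donor": True,  "attended_reunion": 0, "has_email": 1, "has_business_phone": 0},
]

def score(record):
    """Weighted sum of the donor-like attributes present on a record."""
    return sum(weight * record[name] for name, weight in WEIGHTS.items())

def rank_non_donors(all_records):
    """Return non-donors ordered from most to least donor-like."""
    non_donors = [r for r in all_records if not r["donor"]]
    return sorted(non_donors, key=score, reverse=True)

ranked = rank_non_donors(records)
for r in ranked:
    print(r["id"], round(score(r), 2))
```

Note that the score ignores when each attribute was recorded, which mirrors the argument here: the attributes are treated as signs of a persistent state, not as events that must precede a gift.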

You won’t find me looking for a date-stamp on when we put that email address or that address update in the database, to see whether it preceded or resulted from giving. It just doesn’t matter to me.

Sure, there are exceptions. Major Giving and Planned Giving gifts are big events, probably unique in a donor’s lifetime, and they require some attention to the sequence of events in time when we attempt to predict from whom the next gift is going to come.

Giving to the annual fund, however, is not so much an event as it is a state of being. I’m not saying the state doesn’t change, but it does persist.

Not everyone will see it that way. Some smart people will look askance at my equal reliance on data that is 10 years old or one day old (it’s all the same to me!), and at the fact that the null hypothesis makes me yawn. My variable distributions are non-normal. I violate assumptions left and right.

But … it works! And for me that’s where the discussion about validity ends.


5 Comments »

  1. I’ve been thinking a lot about this debate, and while I take your point about being pragmatic, I am concerned. If we totally ignore the question of causation, then we will never learn anything about our alums and donors. If something changes, like the rise of the Millennials (much as I hate that term), then our models may not work anymore. The pragmatic data miner will just redo the model and find the new correlations. But without understanding why things changed, we will always be throwing darts at the target. We won’t be able to actively seek out data that might be explanatory, but that we don’t typically collect.

    Now, all I’m saying is that causal research needs to be done. I don’t know that fundraising professionals are the right people for that job. I would love to see the academic community step into this field and do some rigorous, experimental research. The people who are actually out there raising money have better things to do.

    Comment by Annie Davis Weber — 6 July 2010 @ 1:19 pm

    • You take my point about pragmatism, and I take your point about being prepared to learn about what motivates donors! I would never say I’d be happy to remain ignorant about causal elements. My point is more directed at the doubters on the sidelines who question the validity of the techniques that data miners use (to what end, I don’t know).

      I do agree with you — there is some fascinating work to be done (and, in fact, BEING done), in the area of donor psychology. University campuses would seem to provide the ideal setting for fruitful collaboration on research in the area of philanthropic motivation!

      Comment by kevinmacdonell — 6 July 2010 @ 2:24 pm

  2. Glad to hear your response. Your post reminded me of this article from a while back:

    http://www.wired.com/science/discoveries/magazine/16-07/pb_intro

    It annoyed me at the time, but I never got around to writing out exactly why. I haven’t had a chance to do a data project in a while, so I’m in withdrawal. Makes me irritable 🙂

    Comment by Annie Davis Weber — 6 July 2010 @ 5:40 pm

    • I hadn’t seen this – looks interesting. I know what you mean about data projects … it can feel so much like play that sometimes we put it off because we would feel guilty “playing” while there is “real work” to do.

      Comment by kevinmacdonell — 7 July 2010 @ 7:12 am

  3. Very nice post. The way you present data mining is very interesting.

    I definitely agree about learning only the algorithms that are really used in industry (of course this is not valid for researchers).

    The causation issue is very important also. I particularly like the example saying:

    A lot of men have a car.
    A lot of men like soccer.
    This does NOT mean that having a car is the cause of liking soccer (or vice-versa).

    Comment by Sandro — 12 July 2010 @ 11:42 am

