Data mining is not a lab experiment; it is pragmatic. Do data miners play fast and loose with proper statistical technique? I don’t think so. We work in the real world, and we use what works. Use the tools of your trade properly, by all means, but let’s not get bogged down in nit-picking about the validity of our methods.
A number of things set the practice of data mining apart from the statistics you read about in textbooks.
Data mining is exploratory, not experimental. Elements of statistics used in experiment design (including the null hypothesis) are not a big concern to me.
Data mining is concerned with correlation and prediction, not with correlation and causation. Everyone working with stats is concerned with correlation, but I’m not interested in the direction of dependence — whether Egg X brings about Chicken Y, or vice-versa. I know that having a single initial in the database for first or middle name is sometimes predictive of giving (even when other variables such as age are taken into account). I don’t need to know why; knowing the correlation exists is enough for me.
My bottom line is: observe the correlations, give them proper influence in the prediction via regression, and don’t worry about causation.
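To make that bottom line concrete, here is a minimal sketch of the "observe the correlation, then let regression weigh it" step. Everything in it is an assumption for illustration: the data is synthetic, the variable names (`age`, `single_initial`, `is_donor`) and the helper `fit_giving_model` are hypothetical, and ordinary least squares stands in for whatever regression you actually use.

```python
import numpy as np

def fit_giving_model(X, y):
    """Ordinary least squares with an intercept (hypothetical helper).

    Returns [intercept, coefficient_1, coefficient_2, ...].
    """
    X1 = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coefs

# Synthetic, made-up records: NOT real donor data.
rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(25, 80, n)
single_initial = rng.integers(0, 2, n).astype(float)  # 1 = only an initial on file

# Assumed giving propensity: rises with age and with the single-initial flag.
propensity = 0.05 + 0.004 * (age - 25) + 0.15 * single_initial
is_donor = (rng.uniform(0, 1, n) < propensity).astype(float)

# Step 1: observe the raw correlations with giving.
corr_initial = np.corrcoef(single_initial, is_donor)[0, 1]
corr_age = np.corrcoef(age, is_donor)[0, 1]

# Step 2: let regression assign each variable its influence,
# so the flag is weighed with age already taken into account.
coefs = fit_giving_model(np.column_stack([age, single_initial]), is_donor)
intercept, age_coef, initial_coef = coefs
```

If `initial_coef` stays positive with `age` in the model, the flag carries predictive weight of its own. The sketch never asks why, which is the point.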
I also do not wring my hands over the sequence of events in time; for example, whether a prospect’s business phone number was acquired AFTER a gift was received. The worry would be understandable: how can that datum be “predictive” when it occurs after the fact? The concern is a direct result of a conventional understanding of the term “prediction,” which implies a certain order in time. The ‘predictor’ must precede the ‘predicted’, n’est-ce pas?
Not in my world, necessarily.
The conventional view has it that non-donors and donors are on opposite sides of a division in time. One day, a non-donor approaches the divide, and passes through it, magically transforming into a donor. The caterpillar changes into a butterfly. One might think therefore that only caterpillar-attributes are appropriate predictors for us to use. Butterfly-attributes, like our business phone that came to us after the gift (or because of it), would be inadmissible.
That’s not the way I see it. There is a divide, but it is not in time. The divide is between non-donors and donors, but no one crosses it. Why? Because there are already a lot of non-donors on the donor side: They are donors who haven’t given yet! To the unaided eye, they look like caterpillars, but their nature is pure butterfly.
Their butterfly-nature was created while they were students, and nurtured during their time as alumni. It’s everything they think and feel about alma mater, their level of engagement, everything measurable and unmeasurable. Being a donor is only an outward expression of it. And as important as that expression is to us, it is not essential to the butterfly; we have to ask for it.
Donors share all sorts of characteristics, some of which we know about: reunion attendance, a tendency to provide contact information, and a dozen other things, some quite non-intuitive. When we find these same tendencies to a high degree in certain non-donors, we recognize them for what they are: Butterflies, and donors-to-be.
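Spotting those butterflies among the non-donors amounts to scoring each one on the traits donors share and sorting by the result. The sketch below assumes the weights have already come out of a regression; the weights, field names, and records are all invented for illustration.

```python
def score(record, weights, intercept=0.0):
    """Weighted sum of a record's traits; a missing trait counts as 0."""
    return intercept + sum(w * record.get(name, 0) for name, w in weights.items())

# Hypothetical regression weights: not taken from any real model.
weights = {
    "attended_reunion": 0.30,
    "has_email": 0.20,
    "has_business_phone": 0.15,
}

# Made-up non-donor records.
non_donors = [
    {"id": "B", "attended_reunion": 0, "has_email": 1, "has_business_phone": 0},
    {"id": "C", "attended_reunion": 0, "has_email": 0, "has_business_phone": 0},
    {"id": "A", "attended_reunion": 1, "has_email": 1, "has_business_phone": 1},
]

# Highest score first: the non-donors who look most like donors.
ranked = sorted(non_donors, key=lambda r: score(r, weights), reverse=True)
top_ids = [r["id"] for r in ranked]
```

Note that nothing in the scoring looks at when a trait was recorded, only at whether it is present.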
You won’t find me looking for a date-stamp on when we put that email address or that address update in the database, to see whether it preceded or resulted from giving. It just doesn’t matter to me.
Sure, there are exceptions. Major Giving and Planned Giving, those big events that are probably unique in a donor’s lifetime, require some attention to the sequence of events in time when we attempt to predict who the next gift will come from.
Giving to the annual fund, however, is not so much an event as it is a state of being. I’m not saying the state doesn’t change, but it does persist.
Not everyone will see it that way. Some smart people will look askance at my equal reliance on data that is 10 years old or one day old (it’s all the same to me!), and at the fact that the null hypothesis makes me yawn. My variable distributions are non-normal. I violate assumptions left and right.
But … it works! And for me that’s where the discussion about validity ends.