CoolData blog

28 June 2015

Data mining in the archives

Filed under: Data, Predictor variables — kevinmacdonell @ 6:24 pm


When I was a student, I worked in a university archives to earn a little money. I spent many hours penciling consecutive index numbers onto acid-free paper folders, on the ultra-quiet top floor of the library. It was as dull a job as one can imagine.


Today’s post is not about that kind of archive. I’m talking about database archive views, also called snapshots. They’re useful for reporting and business intelligence, but they can also play a role in predictive modelling.


What is an archive view?


Think of a basic stat such as “number of living alumni”. This number changes constantly as new alumni join the fold and others are identified as deceased. A straightforward query will tell you how many living alumni there are, but that number will be out of date tomorrow. What if someone asks you how many living alumni you had a year ago? Then it’s necessary to take grad dates and death dates into account in order to generate an estimate. Or, you look the number up in previously-reported statistics.
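
As a rough illustration of the estimate described above, here is a minimal Python sketch. The record layout (ID, graduation date, death date) and the sample rows are hypothetical, not taken from any real database. The result is only an estimate because it assumes death dates are complete and accurate, which may not hold.

```python
from datetime import date

# Hypothetical records: (alumni_id, graduation_date, death_date or None).
ALUMNI = [
    ("A00001", date(1998, 5, 20), None),
    ("A00002", date(2005, 5, 18), date(2014, 11, 2)),
    ("A00003", date(2015, 5, 22), None),
    ("A00004", date(1975, 5, 15), date(2015, 3, 9)),
]


def living_alumni_as_of(records, as_of):
    """Count people who had graduated, and were not yet deceased, on `as_of`."""
    count = 0
    for _alumni_id, grad_date, death_date in records:
        graduated = grad_date <= as_of
        alive = death_date is None or death_date > as_of
        if graduated and alive:
            count += 1
    return count


today_count = living_alumni_as_of(ALUMNI, date(2015, 6, 28))
year_ago_count = living_alumni_as_of(ALUMNI, date(2014, 6, 28))
```

Every historical question needs its own date logic like this, which is exactly the work an archive view saves.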


A database archive view makes such reporting relatively easy by preserving the exact status of a record at regular points in time. The ideal archive is a materialized view in a data warehouse. On a given schedule (yearly, quarterly, or even monthly), an automated process adds fresh rows to an archive table that keeps getting longer and longer. You’re likely reliant on central IT services to set it up.


“Number of living alumni” is an important denominator for such key ratios as the percentage of alumni for whom you have contact information (mail, phone, email) and participation rates (the proportion of alumni who give). Every gift is entered as an individual transaction record with a specific date, which enables reporting on historical giving activity. This tends not to be true of contact information. Even though mailing addresses may be added one after another, without overwriting older addresses, the key piece of information is whether the address is coded ‘valid’ or ‘invalid’. This changes all the time, and your database may not preserve a history of those changes. Contact information records may have “To” and “From” dates associated with them, but your query will need to do a lot of relative-date calculations to determine if someone was both alive and had a valid address for any given point in time in the past.


An archive table obviates the need for this complex logic, and ours looks like the example below. There’s the unique ID of each individual, the archive date, and a series of binary indicator variables — ‘1’ for “yes, this data is present” or ‘0’ for “this data is absent”.




Here we see three individuals and how their data has changed over three months in 2015. This is sorted by ID rather than by the order in which the records were added to the archive, so that you can see the journey each person has taken in that time:


  • A00001 had no valid email in the database in February and March, but we obtained it in time for the April 1 snapshot.
  • A00002 had no contact information at all until just before March, when a phone append supplied us with a new number. The number proved to be invalid, however, and when we coded it as such in the database, the indicator reverted to zero.
  • A00003 appeared in our data in February and March, but that person was coded deceased in the database before April 1, and was excluded from the April snapshot.


That last bullet point is important. Once someone has died, continuing to include them as a row in the archive every month would be a waste of resources. In your reporting software, a simple count of records by archive date will give you the number of living alumni. A simple count of ‘Address Indicator’ will give you the number of alumni with valid addresses. Dividing the number of valid addresses by the number of living alumni (and multiplying by 100) will give you the percentage of living alumni that are addressable for that month. (Reporting software such as Tableau will make very quick work of all this.)
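
For readers without reporting software at hand, the same arithmetic can be sketched in a few lines of Python. The rows below are hypothetical stand-ins for the three individuals described above; the column order and the indicator values not mentioned in the bullets are my assumptions.

```python
from collections import defaultdict

# Hypothetical archive rows:
# (id, archive_date, address_indicator, phone_indicator, email_indicator)
ARCHIVE = [
    ("A00001", "2015-02-01", 1, 1, 0),
    ("A00001", "2015-03-01", 1, 1, 0),
    ("A00001", "2015-04-01", 1, 1, 1),
    ("A00002", "2015-02-01", 0, 0, 0),
    ("A00002", "2015-03-01", 0, 1, 0),
    ("A00002", "2015-04-01", 0, 0, 0),
    ("A00003", "2015-02-01", 1, 0, 1),
    ("A00003", "2015-03-01", 1, 0, 1),  # no April row: coded deceased
]


def addressable_by_month(rows):
    """Return {archive_date: (living, addressable, percent_addressable)}."""
    living = defaultdict(int)
    addressable = defaultdict(int)
    for _id, archive_date, address_ind, _phone_ind, _email_ind in rows:
        living[archive_date] += 1               # one row per living alum
        addressable[archive_date] += address_ind
    return {
        d: (living[d], addressable[d], 100.0 * addressable[d] / living[d])
        for d in sorted(living)
    }


summary = addressable_by_month(ARCHIVE)
```

Because deceased alumni simply stop appearing, the row count per archive date doubles as the denominator.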


Because an archive view preserves changing statuses over time at the level of the individual constituent, it can be used for reporting trends along any slice you choose (age bracket, geography, school, etc.), and can play a role in staff activity/performance reporting and alumni engagement scoring.


But enough about archive views themselves. Let’s talk about using them for predictive modelling.


In the archive example above, you see a bunch of 0/1 indicator variables. Indicator variables are common in predictive modelling. For example, “Mailing address present” can have one of two states: Present or not present. It’s binary. A frequency breakdown of my data at this point in time looks like this (in Data Desk):




About 78% of living alumni have a valid address in the database today — the records with an address indicator of ‘1’. As you might expect, alumni with a good address are more likely to have given than alumni without, and they have much higher lifetime giving on average. In the models I build to predict likelihood to give (and give at higher levels), I almost always make use of this association between contact information and giving.


But what about using the archive view data instead? The ‘Address Indicator’ variable breakdown above shows me the current situation, but the archive view adds depth by going back in time. Our own archive has been taking monthly snapshots since December of last year — seven distinct points in time. Summing on “Address Indicator” for each ID shows that large numbers of alumni have either never had a valid address during that time (0 out of 7 months), or always did (7 out of 7). The rest had a change of status during the period, and therefore fall between 0 and 7:




A few hundred alumni (387) had a valid address in one out of seven months, 143 had a valid address in two out of seven — and so on. Our archive is still very young; only about 1% of alumni have a count that is not 0 or 7. A year from now, we can expect to see far more constituents populating the middle ground.
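
The summing step is simple enough to sketch in Python. The rows and IDs below are hypothetical, and only three snapshots are shown rather than seven.

```python
from collections import Counter, defaultdict

# Hypothetical archive rows: (id, archive_date, address_indicator).
ROWS = [
    ("A00001", "2015-02-01", 1), ("A00001", "2015-03-01", 1), ("A00001", "2015-04-01", 1),
    ("A00002", "2015-02-01", 0), ("A00002", "2015-03-01", 0), ("A00002", "2015-04-01", 0),
    ("A00004", "2015-02-01", 0), ("A00004", "2015-03-01", 1), ("A00004", "2015-04-01", 1),
]


def months_with_valid_address(rows):
    """Sum the 0/1 address indicator across all snapshots for each ID."""
    totals = defaultdict(int)
    for alumni_id, _archive_date, address_indicator in rows:
        totals[alumni_id] += address_indicator
    return dict(totals)


address_count = months_with_valid_address(ROWS)

# How many people fall at each count (0, 1, 2, ... snapshots with an address).
frequency = Counter(address_count.values())
```

The resulting count per ID is the "Address Count" variable; the frequency table is what the breakdown above summarizes.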


What is most interesting to me is an apparent relationship between “number of months with valid address” (x-axis) and average lifetime giving (y-axis), even with the relative scarcity of data:




My real question, of course, is whether these summed, continuous indicators really make much of a difference in a model over simply using the more familiar binary variables. The answer is “not yet — but someday.” As I noted earlier, only about 1% of living alumni have changed status in the past seven months, so even though this relationship seems linear, the numbers aren’t there to influence the strength of correlation. The Pearson correlation for “Address Indicator” (0/1) and “Lifetime Giving” is 0.186, which is identical to the Pearson correlation for “Address Count” (0 to 7) and “Lifetime Giving.” For all other variables except one, the archive counts have only very slightly higher correlations with Lifetime Giving than the straight indicator variables. (Email is slightly lower.)
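
One way to see why the two correlations match so closely: when nearly everyone sits at 0 or 7, the count is almost exactly the binary indicator multiplied by 7, and Pearson correlation is unchanged by that kind of rescaling. The sketch below uses made-up numbers (not my data) and a hand-rolled Pearson function to show the extreme case where nobody has changed status.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for eight alumni.
indicator = [0, 0, 1, 1, 1, 0, 1, 1]            # address present today
count_0_or_7 = [7 * v for v in indicator]       # nobody changed status
giving = [0.0, 25.0, 100.0, 40.0, 500.0, 0.0, 60.0, 1200.0]

r_indicator = pearson(indicator, giving)
r_count = pearson(count_0_or_7, giving)
```

Here `r_indicator` and `r_count` agree up to floating-point error. Only as more alumni land between 0 and 7 can the count start to carry information the indicator lacks.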


It’s early days yet. All I can say is that there is potential. Have a look at this pair of regression analyses, both using Lifetime Giving (log-transformed) as the dependent variable. In the window on the left, all the independent variables are the regular binary indicator variables. On the right, the independent variables are counts from our archive view. The difference in R-squared from one model to the other is very slight, but headed in the right direction: from 12.7% to 13.0%.




Looking back on my student days, I cannot deny that I enjoyed even the quiet, dull hours spent in the university archives. Fortunately, though, and due in no small part to cool data like this, my work since then has been a lot more interesting. Stay tuned for more from our archives.


20 April 2011

“Data” isn’t plural (anymore)

Filed under: Data — kevinmacdonell @ 11:42 am

In an endnote to his 2008 book The Numerati, author Stephen Baker acknowledges that data is the plural of the singular noun datum, but says he’s decided to use data as singular in his book. “[I]n many fields, data is treated as a singular noun, just as the singular word sand stands for lots of individual bits of silica,” he writes.

On this blog, I also consistently use the word as a singular noun. When I say “this data is interesting,” instead of “these data are interesting,” it probably grates on the nerves of a few language-savvy readers. As a guy who likes both data and words, I make no apologies. To my ear, the old singular form “datum” sounds more quaint with every passing year, and not for arbitrary reasons or due to linguistic laziness. It’s a natural result, I believe, of our changing view of what data are. (Is.)

In his book The Stuff of Thought, Steven Pinker observes that humans from early childhood display superb mental agility in drawing a conceptual distinction between an object (e.g., pebble) and a substance (e.g., gravel). We capture this distinction in our language as the difference between a “count noun” (a pebble, two pebbles) and a “mass noun” (gravel, some gravel, more gravel). The English language, more than other languages, draws sharp borders around the two types of nouns.

We seem to distinguish between things that are bounded (“delineated by a fixed shape,” made up of countable individuals) such as horses, and things that are unbounded (a multitude of individuals that are inseparable and uncountable, or a continuous mass like dust or goop) such as gravel or hair or glue. Pinker says, however, that our noun usage is reflective of “cognitive attitudes,” rather than physical properties. Therefore, we also see the distinction applied to “things” that aren’t made of matter at all: “opinions” is a count noun, while “advice” is a mass noun. As well: “stories” (count) vs. “fiction” (mass), “songs” (count) vs. “music” (mass) — and so on.

The ability to construe these differences begins as early as age three, according to experiments Pinker describes in his book. And these studies seem to suggest that our conceptual choices result from the way we have heard others describe things — the way others have used nouns. What’s also remarkable is that so many English speakers tend to agree on these usages, despite the fact they have had to be learned on a case-by-case basis, and may differ over time and even from dialect to dialect.

None of the examples Pinker chooses, drawn from everyday speech, is as contested as the “data is / data are” question, however. So here’s what I think about “data”. Are you ready?

If data is a count noun (that is, the plural of datum), then we have no choice but to say, “These data are interesting.” And if it’s a mass noun (that is, more of a substance than a collection of delineated individuals), then we are correct in saying, “This data is interesting.” The distinction is clear for nearly everything we refer to in the run of a day. Yet there remains some dispute or uncertainty about this word “data”, because the word is in a state of transition.

Half a century or more ago, William Strunk Jr. and E.B. White said in their pithy little tome, The Elements of Style, that “data” was most certainly plural and “best used with a plural verb.” They also noted, however, that the word “is slowly gaining acceptance as a singular.”

And so it is, now more than ever. Have a look at the chart below, produced by Google’s Books Ngram Viewer. The Viewer allows you to plot the frequency of words and phrases that appeared in books published in the past few hundred years. (I wrote about it in the post Chart frequency of words and terms from books with Google, 17 Dec 2010.) This chart compares published instances of “data is” with “data are”, from 1950 to 2005:

Back in the 1950s, when Strunk & White’s famous guide was published in the edition we’re familiar with today, “data are” was firmly in the lead, as a percentage of published usages. Since 1985, however, the traditional usage has lost some ground, and “data is” may one day take the lead. (Try comparing the terms “datum” and “data point”.)

In E.B. White’s day, data were difficult and expensive to collect — every datum was recorded by hand, and calculations were done by hand, too. That’s all changed. Today we say we are deluged by a “flood” of data, or that we are sitting on a “mountain” of data. Both images suggest mass — a liquid that bathes us, an undifferentiated heap of ore that we sit on or mine into. And all this data is digested by computers like whales snarfing up plankton (another mass noun).

The more we hear data spoken of this way, the further our brains are rewired (Pinker-like) to think of “data” as a singular noun. Data these days is less akin to facts (a count noun) and more akin to knowledge (a mass noun). I agree with those who say that common usage doesn’t make something right. But language evolves, and any usage that goes against the grain of our conceptual grasp of reality is not going to survive.

22 February 2011

Data disasters, courtesy of Mordac

Filed under: Best practices, Data, Pitfalls — kevinmacdonell @ 6:28 am

Have you ever heard of Mordac, the Preventer of Information Services? He’s a character in the comic strip Dilbert. Mordac cares about technology and security, but he doesn’t give a rip about users and their need to do actual work. I’ve never worked with a Mordac, but judging from some of the stories I’ve been collecting over the past week, I’m sad to say that a lot of YOU have.

Last week I wrote about the ways data is viewed differently by data miners and the good people who work in Advancement Services. (Warning: This data is different.) These differences can lead to misunderstandings, and much worse. When data is treated as essentially a technology issue, instead of a core institutional asset — that is, when Mordac gets to decide what happens to data — disaster can follow.

Data disasters come in three forms:

  1. Data that would be useful to have is never captured or entered.
  2. Historical data is overwritten with newer data.
  3. Data is deliberately deleted, or left out of a database conversion.

Some of these disasters unfurl slowly over years, others happen with the click of a mouse. The result is the same: Key insights into engagement are lost forever. Every annual fund donor who will never be proactively identified as a Major Gift or Planned Giving prospect is a huge loss. Institutional memory is flushed down the toilet, harming not just data mining efforts but prospect research and other data-related work. The word disaster is not too strong to describe the financial impact of the accumulation of such losses.

It’s a hidden disaster, too. No one will ever be able to add up the cost of what’s been lost collectively by the schools and nonprofits who sent me their tales of horror.

Let’s start with the issue of data that never gets entered in the first place:

  • One university established in the 1970s did nothing to capture athletic team membership, distinguished alumni, awards, campus club membership, and other key information.  This institution also never had yearbooks. The same university had a penchant for deleting old addresses.
  • I heard a similar tale from another school, but at least they have yearbooks going back to the 1940s. No one captured athletics, awards, or club memberships in any of the databases over the years. “We are trying to catch up,” this contributor writes, “but I probably won’t live long enough to see it.”
  • At various places where one contributor worked, development contact reports were not entered for long periods because management didn’t enforce policy.  “Future staff (including the new president) were constantly embarrassed when meeting with the prospects because they didn’t know the contact history — including asks, campus tours, first contacts, meetings with the college president, etc.”
  • Non-higher-ed organizations are especially prone to neglecting data gathering. One person writes: “Understanding that historical information in the database can be used for analytics is a concept that isn’t usually introduced until the organization has significant fundraising capacity. Also, non-higher ed organizations are sometimes so starved for staff that no one has sufficient database experience to consistently maintain data, even when raising millions.”
  • Sometimes, the attitude is that old data is useless data and therefore not worth the bother. An analyst at a large, non-higher ed nonprofit is dealing with a significant number of records for which the first gift date has been incorrectly entered. For example, 1980 is entered accidentally as 1908, 2009 is entered accidentally as 9200, 2001 is accidentally entered as 1901, which causes havoc for any kind of analysis. Correcting the errors is liable to affect the general ledger in Finance, so the bias is to do nothing. The contributor writes: “I was talking with one of the old managers in gift processing (who is no longer here, actually), and his response was, ‘Who cares, it was a long time ago, and it’s over now’.”
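
The mis-keyed years in that last example are the kind of error a simple range check can surface for review without touching the ledger. Here is a minimal Python sketch; the IDs, the earliest plausible year (1950) and the latest (2011) are assumptions chosen for illustration.

```python
def flag_suspect_years(records, earliest_year, latest_year):
    """Return (id, year) pairs whose first-gift year is outside the plausible range."""
    return [
        (record_id, year)
        for record_id, year in records
        if not (earliest_year <= year <= latest_year)
    ]


# Hypothetical first-gift years, including the three typos mentioned above.
FIRST_GIFTS = [
    ("D001", 1908),  # probably 1980
    ("D002", 9200),  # probably 2009
    ("D003", 1901),  # probably 2001
    ("D004", 1980),
    ("D005", 2009),
]

suspects = flag_suspect_years(FIRST_GIFTS, earliest_year=1950, latest_year=2011)
```

A check like this only flags records; it cannot catch a year mis-keyed as another plausible year, and deciding on the correct value still takes a human.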

Then there is historical data that is lost by being overwritten with current data:

  • A database manager for one university deleted all the old addresses and did a global replacement of these addresses with a four-letter code standing for “Moved, left no address.”
  • “I’ve been in shops where that’s done,” writes a contributor, regarding the overwriting of constituent records.  “What’s worse is when the record gets deleted and the ID gets re-used. I’ve seen some really weird mailings go out in those shops.”
  • One school failed to protect historical fundraiser assignments to a prospect, says a contributor, despite the fact such tracking “is really critical if you think about piecing together a person’s institutional history.” The contributor traced the overwriting to the need for the information to come out correctly on a report. In other words, the technology tail was wagging the strategy dog, a common problem. Even after a way to work around such reporting issues is put in place, this person writes, data overwriting still happens “out of habit.”
  • “I have also experienced this,” writes another person. “No records being deleted (that I know of) but prospect assignments, addresses, employers, etc. all that stuff would get overwritten with the current stuff, and the historical info just goes away. Makes data mining significantly more difficult!”
  • The worst one, I think, is the university where this happened: When any alum returned for post-grad work or a higher degree, their original degree and data were, inexplicably, overwritten.

And finally, the horror of the deliberate destruction of data:

  • Here’s a breathtaking example: “The database had serious size limitations, so in order to free up space, all records marked deceased were deleted. Some old gift data also was deleted, I think (all older than seven years at the time of deletion, since apparently someone thought we only needed seven years of info). The result was that we lost some irreplaceable information, particularly in trying to track down relatives of individuals who made estate gifts or endowments, since the original records (with attached names, contact reports, etc.) were purged.”
  • A terrible tale from another school: “Due to limited database space, this organization did indeed perform a huge purge of information sometime in the early ‘90s.  Luckily, there was someone cognizant enough regarding historical values to not allow any purging of gifts. However, many never-givers’ records and parents of alumni records were all purged from the system with all of the notes and information in those records.  In addition, many historical addresses were purged. Someone decided that only ONE former address for each record was necessary!”
  • One university established in the 1930s, and which is on its third database for registration and alumni records, has been committing a variety of data crimes that include overwriting and deleting. During database conversions the following things have occurred: Those who attended but did not receive a degree were never transferred to the new database; the records of alumni who held certain types of degrees were never transferred from paper to digital format and are not in the database; people who died or had no valid address were not uploaded to the newer databases; and, finally, someone overwrote all the female constituents’ middle names with maiden names when they married. The contributor who sent me this list adds, “One of the older databases that still had some of the old missing data crashed and IT was going to just forget about it. I begged for the tables from this database and built an Access database so we could at least query and lookup this old data, otherwise it would have disappeared.”
  • Sometimes a data crime directly impacts Major Giving: “I was the moves manager and we also had a prospect researcher.  Between the two of us, we added a huge amount of information to the database, including prospect interests, historical contacts, affiliations and financial data.  After she and I left, the operations staff managed the conversion to a different database and, in their infinite wisdom, deleted all the data we had added. The prospect researcher who was hired after us called me to cry on my shoulder and I heard that even the development officers wept over the loss!”
  • Again, even more so than schools with alumni, it is non-higher ed nonprofits that might be most at risk. A contributor writes: “With the advent of databases housed online by the vendor and charged per record, organizations are deleting all non-donor records or donors who have not given in years.”

The reasons behind data disasters range across a spectrum from good intentions, through lack of awareness, to criminal wrongdoing: data integrity concerns, privacy concerns, space and cost concerns, miscommunication, lack of consultation with data users, short-sightedness, ignorance, laziness, expediency, and finally, deliberate sabotage.

What unites almost all of these cases is the identity of the person doing the deed: Mordac. His character represents the attitude that data is just part of the technology. What he doesn’t recognize is that servers and hard drives are replaceable, but data is not. The cost of acquiring and maintaining the technology may indeed be high, but because data has uses we can’t foresee, and because it cannot be recovered once deleted, its value cannot even be calculated. Mordac is simply not qualified to make decisions about data on his own.

I would like to think that most data disasters are caused by the more benign impulses on the spectrum. In fact, a number of contributors noted that sometimes deleting data is necessary in order to maintain data integrity. Often, updates from NCOA (the National Change of Address databases in both Canada and the U.S.) introduce incorrect addresses into our databases. In cases such as these, it would seem prudent to delete the errors completely.

There are two problems with that, both mentioned by contributors. First, data entry staff might not always know the difference between a legitimate historical address and an address introduced in error. And second, if you delete an error, you are doomed to repeat it. As one contributor writes: “We keep old addresses that are wrong and code them ‘incorrect address match’ so that if that address comes up again from NCOA or another update service, we have a record of the first oops.”
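
That contributor’s practice can be sketched in Python as a lookup against a constituent’s address history before accepting an incoming update. The status code ‘incorrect address match’ comes from the quote above; the other status values, the function names and the crude normalization are my own assumptions.

```python
def normalize(address):
    """Crude normalization (assumption): uppercase and collapse whitespace."""
    return " ".join(address.upper().split())


def review_incoming_address(history, incoming):
    """Compare an incoming address with one constituent's history.

    `history` is a list of (address, status) pairs that are never deleted.
    """
    key = normalize(incoming)
    for address, status in history:
        if normalize(address) == key:
            if status == "incorrect address match":
                return "reject: previously coded as incorrect"
            return "already on file: " + status
    return "new: add for review"


# Hypothetical history for one constituent.
HISTORY = [
    ("12 Elm St, Halifax NS", "valid"),
    ("99 Oak Ave, Calgary AB", "incorrect address match"),
]

repeat_error = review_incoming_address(HISTORY, "99 oak ave,  calgary ab")
brand_new = review_incoming_address(HISTORY, "5 Pine Rd, Truro NS")
```

Had the Calgary address been deleted rather than coded, the second arrival of the same bad match would look like a brand-new address.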

Advancement Services and IT staff are the professionals who keep our data ship afloat, and I do not mean to suggest that these horror stories represent typical operating procedure. Moreover, many of the disasters I was told about happened ten years ago or more. Consciousness about the value of historical data is on the rise, and we can hope that such cases are becoming increasingly rare. “We have since come to our senses,” one person writes. “(We) have filled out those (missing data) gaps to the best of our ability and we no longer delete information.  We are just discovering how important data analysis can be.”

So now I’m interested in hearing GOOD things about your IT and IS professionals. What is done at your institution to both protect the integrity of the data for today’s use, and ensure that it remains intact for analysis for many years to come? Do you have any advice for working with others in your organization to foster a culture of respect for historical data? Comment below, or send your ideas to me at

Oh, and I’m still collecting stories about data disasters! Send your horrific tales to me in confidence at


P.S.: Data disaster stories, and other related tales, sent to me since this post went live:

“We just got a significant donation with the promise of more from someone who attended our school for a short period more than 50 years ago. We found him by picking up his name from an old yearbook, adding him to our database, profiling him for a wealth snapshot, and inviting him to an event. Yes, there is a reason to keep old data!” (22 Feb 2011)

“We have all the scenarios in our institution.  We are also using a campus-wide database, and different departments keep creating duplicates of the same records because of different data standards. We’ve talked with many other people who use this database and they all agree that this is one of the major problems of sharing our database. It’s a dream to have a database integrated with other units of the institution, but quite a different situation when it comes to the reality of it, which is everyone is working on completely different priorities and paradigms (Finance, Development, Student Services in particular).” (22 Feb 2011)

17 February 2011

Warning: This data is different

Filed under: Best practices, Pitfalls — kevinmacdonell @ 2:09 pm

This post is named for a conference keynote I will give this spring for senior managers working in advancement services. These people are no strangers to data; in fact, their working lives revolve around it. But they don’t necessarily see data through the same lens as we do, and they don’t value the same things we do.

We’d better learn to understand their perspective, because “our” data is at their mercy. I’m talking about gift processing, alumni records, IT and computing services, database admins — people who can be our best friends, or bring data disasters down on our heads. Oh, and there are disasters!

To illustrate, I will draw a distinction between “everyday” data — processing gifts, updating constituent records, maintaining databases, and pulling reports — and predictive modeling data. The differences might seem a bit philosophical, but they’re real and have real consequences.

Everyday data is used for sense-making and explaining in the present, via reporting and descriptive statistics. (“What were December’s pledge totals, and how do they compare with this time last year?”) Modeling data is not reporting or explaining anything — so it’s hard for some people to put a value on it. Everyday data might be doing important things such as hunting for causes (“Did pushing the income tax deadline email on Dec 31 boost giving?”). But not modeling data, which only seeks to uncover associations between things without trying to determine causation. (“Is there a connection between giving in December and being a significant donor?”) In short, everyday data work pays off in the short term; modeling data work pays off over a much longer period of time.

When everyday data is messy, it will probably be dismissed as invalid. When modeling data is messy, that’s considered normal, and there are techniques to address it. For everyday data, missing values are an issue; for modeling data, missing values can be useful (i.e., predictive). When missing data is troublesome rather than predictive, we are free to make up data to fill the gaps, using imputation. This is a foreign concept to people who deal exclusively with everyday data.

In the everyday, we are picky: “Give me these records, but not those, and include this field, and this field, but not those fields.” For modeling, we say, “Give me everything — I want it all!” Everyday data seeks an answer, a single-point destination reached by one route. Modeling data has a destination too, but it gets there via a myriad of routes. Every potential predictor is a new route to explore. And we don’t know in advance what routes will get us there fastest; we have to drive them all.

And finally, one key difference in philosophy which can spell disaster for your institution: In the everyday, the most current data supersedes and replaces old data. Think of address information: Of what use is a mailing list to the Alumni Office if it’s full of addresses from the 1970s? Well, in modeling, that old data is just as valuable as fresh data. For example, I’ve found that the count of address updates an individual has is highly predictive of giving. The only way I can get that count is if I total up the number of deactivated records, and then add the current, active record. No historical records, no predictor.
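
The address-update count described above is easy to sketch in Python, provided the old rows still exist. The status labels ‘active’ and ‘inactive’ and the sample rows are hypothetical; your database will have its own codes.

```python
from collections import Counter

# Hypothetical address rows: (id, status).
ADDRESS_ROWS = [
    ("A00001", "inactive"),
    ("A00001", "inactive"),
    ("A00001", "active"),
    ("A00002", "active"),
    ("A00003", "inactive"),
]


def address_update_counts(rows):
    """Deactivated address records per ID, plus one for a current active record."""
    deactivated = Counter()
    has_active = set()
    ids = set()
    for alumni_id, status in rows:
        ids.add(alumni_id)
        if status == "inactive":
            deactivated[alumni_id] += 1
        elif status == "active":
            has_active.add(alumni_id)
    return {
        alumni_id: deactivated[alumni_id] + (1 if alumni_id in has_active else 0)
        for alumni_id in sorted(ids)
    }


update_counts = address_update_counts(ADDRESS_ROWS)
```

If the inactive rows had been overwritten or purged, every ID would collapse to a count of 0 or 1 and the variable would carry no more information than a simple address indicator.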

Yes, some institutions routinely overwrite or just flat-out delete this stuff. But that is not the half of it. Because I wasn’t sure this sort of thing really happened, I started asking around. I received a raft of data disaster stories from all sorts of organizations, from non-profits to universities. I’ve collected so many tales of horror that I’m going to share them with you in a separate post next week.

(By the way, plug plug: the conference I’ll be speaking at is CASE’s Institute for Senior Advancement Services Professionals in Baltimore, April 27-29.)

23 November 2010

Mine your hidden call centre data

Filed under: Annual Giving, Best practices, Phonathon, Predictor variables — kevinmacdonell @ 1:33 pm

One day earlier this fall on my walk to the office, I passed a young woman bundled up in toque and sweater and sitting in a foldup chair at an intersection. She was holding a clipboard, and as I passed by, I heard a click from somewhere in that bundle. She was counting. Whether it was cars going through the intersection, or whether I myself had just been counted, I don’t know. I could have asked her, but I knew what she was doing: She was collecting data.

All those clicks might be used by a local business or charity looking for the best location and time to solicit passersby, or they might find their way into GIS and statistical analysis and be used by city planners working on traffic control issues. Locating business franchises, planning for urban disasters, optimizing emergency services — all sorts of activities are based on the mundane activity of counting.

This week I’m thinking about a different type of click: the reams of data that flow from Phonathon. If your institution is fortunate enough to have a call centre that is automated, you may be sitting on a wealth of data that never makes it into the institutional database. (Thus, “hidden”.) In our program, only a few things are loaded into Banner from CampusCall: Address updates, employment updates, any requested contact restrictions, and the pledges themselves. The rest stays behind in the Oracle database that runs the calling software, and I am only now pulling out some interesting bits which I intend to analyze over the coming days.

Call centre data is not just about the Phonathon program. Gathered from many thousands of interactions across a broad swath of your constituency, this data contains clues that will potentially inform any model you create, including giving by mail, Planned Giving, even major gifts.

What data am I looking for? So far, here’s what I have, plus some early intuition about what it might tell me.

  • ID: Naturally, I’ll need prospect IDs in order to match my data up, both across calling projects and in my predictive models themselves.
  • Last result code: The last call result coded by the student caller (No Pledge, Answering Machine, etc.). There are many codes, and I will discuss those in more detail in a future post.
  • Day call: People who tell us they’d rather be called back during the day (at the office, in many cases) are probably statistically different from the rest.
  • Number of attempts: This is the number of times a prospect was called before we finally reached them or gave up. I suspect high call attempt numbers are associated with lower affinity, although that remains to be seen. It’s probably more specific than that — high attempt numbers make a person a relatively poor phone prospect, but may cause them to score better in a mail-solicitation model.
  • Refusal reason: The reason given by the prospect for not making a pledge, usually chosen by the Phonathon employee from a drop-down menu of the most common responses. Refusal reasons are not always well-tracked, but they’re potentially useful for designing strategies aimed at overcoming objections. I’ve observed in the past that certain refusal reasons are actually predictive of giving (by mail).
  • Talk time: The length of the call, in seconds. People who pledge are on the phone longer, of course, but not every long call results in a pledge. I think of longer calls as a sign of successful rapport-building.

There are other important types of information: Address and employment updates, method of payment and so on — but these are all coded in our database and I do not need to extract them from the Phonathon software. My focus today is on hidden data — the data that gets left behind.

In CampusCall, prospects are loaded into giant batches called “projects”. Usually there is only one project per term, but multiple projects can be run at once. Each one is like its own separate database. I have data for ten projects conducted from 2007 to the present. I had to extract data for each project separately, and then match all the records up by ID in order to create one huge file of historical calling results. The total number of records in all the extracts was 189,927; when matched up they represent 56,216 unique IDs. Yum!

Where I go from here will be discussed in future posts. I need to put some thought into the variables I will create. For example, will I simply add up all call attempts into a single variable called “Attempts”, or should I calculate an average number of attempts, keeping in mind that some prospects were called in some projects and not others?
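To make the choice between a total and an average concrete, here is a minimal sketch in Python. It assumes each project extract has already been reduced to rows of (project, prospect ID, attempts); the project labels, IDs, and numbers are invented for illustration and are not taken from CampusCall.

```python
from collections import defaultdict

# Hypothetical rows pulled from separate project extracts:
# (project label, prospect ID, number of call attempts in that project).
extracts = [
    ("2007-fall", "A001", 3),
    ("2008-fall", "A001", 5),
    ("2008-fall", "B002", 1),
    ("2009-fall", "C003", 7),
    ("2009-fall", "A001", 4),
]


def summarize_attempts(rows):
    """Collapse project-level rows into one summary record per prospect ID."""
    per_id = defaultdict(list)
    for _project, prospect_id, attempts in rows:
        per_id[prospect_id].append(attempts)

    summary = {}
    for prospect_id, counts in per_id.items():
        summary[prospect_id] = {
            "projects_called": len(counts),
            "total_attempts": sum(counts),
            # Averaged only over projects in which the prospect was called,
            # so prospects called in few projects are not penalized.
            "avg_attempts": sum(counts) / len(counts),
        }
    return summary


attempt_summary = summarize_attempts(extracts)
```

Keeping the number of projects alongside the total makes it possible to test both candidate variables later, rather than committing to one of them at extraction time.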

Until I figure these things out, here’s a final thought for today. If your job is handling data, then it’s also your job to understand where that data comes from and how it is gathered. Stick your nose into other people’s business from time to time, and get involved in the establishment of new processes that will pay off in good data down the road. Go to the person who runs your Phonathon and ask him or her if refusal reasons are being tracked. (In an automated system, it’s not that hard.) If you ARE the person running the Phonathon, make sure your callers are trained to select the right code for the right result.

In other words, it all starts with that young person bundled against the cold: The point at which data is collected. What happens here determines whether the data is good, usable, reliable. Without this person and her clicker, not much else is possible.

P.S. If you’re interested in analyzing your call centre data, have a read of this white paper by Peter Wylie: What Makes a Call Successful.

30 August 2010

New tricks for old data

Filed under: Alumni, Event attendance, Predictor variables — Tags: , , — kevinmacdonell @ 7:38 am

Your data might be old, but that doesn't mean it isn't predictive. (Image used via Creative Commons licence.)

Do you sometimes exclude variables from your model because you feel the data is just too old to be useful? I wouldn’t be too quick. For some data at least, there’s no expiry date when it comes to predictive modeling.

I’ve heard of some modelers using old wealth-screening data and getting good results. It may be too out of date for the Major Gift people to use, but if there’s still a correlation with giving, it does your model no harm to make use of it. Just be sure the data you’re using is capacity-related and not itself a predictive model, or existing donors will score high and high-likelihood non-donors will be submerged.

Old and out-of-date contact information works just fine. I always remind people that a phone number doesn’t have to be valid to be predictive. At some point, that alum provided that number (or email, or cell phone number), and the fact you have it at all is more important than whether you can still reach someone with it. For email in particular, probably a significant portion of your information is useless from a communication point of view, but will still be helpful in prediction. I am not talking about lost alumni — I never include those in my models. I mean alumni that we assume to be contactable.

I test whether the presence of, say, a business phone number is correlated with giving, but I also test the COUNT of business phone numbers. In order to do this, your database must retain the history of previous numbers. When an update is made, ideally a new record is created instead of overwriting the old one. This allows one to query for the NUMBER of update records — which I’ve often found correlates with giving. We know all those previous numbers were disconnected years ago, but their presence indicates a history of ongoing engagement.
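The presence-versus-count comparison can be sketched as follows. This is an illustration only: the history table, the donor flags, and every ID and phone number are hypothetical, and a plain Pearson correlation stands in for whatever test you normally use.

```python
from math import sqrt

# Hypothetical history table: one row per business phone record ever stored,
# including numbers that have since been disconnected.
phone_history = [
    ("A001", "555-0101"), ("A001", "555-0102"), ("A001", "555-0103"),
    ("B002", "555-0104"),
    ("D004", "555-0105"), ("D004", "555-0106"),
]

# Hypothetical outcome: 1 if the alum has ever given, 0 otherwise.
is_donor = {"A001": 1, "B002": 0, "C003": 0, "D004": 1, "E005": 0}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


ids = sorted(is_donor)

# Variable 1: COUNT of business phone records per alum (0 if none on file).
phone_count = {alum_id: 0 for alum_id in ids}
for alum_id, _number in phone_history:
    phone_count[alum_id] += 1

# Variable 2: simple presence flag derived from the count.
has_phone = {alum_id: int(phone_count[alum_id] > 0) for alum_id in ids}

donor_flags = [is_donor[i] for i in ids]
r_count = pearson([phone_count[i] for i in ids], donor_flags)
r_flag = pearson([has_phone[i] for i in ids], donor_flags)
```

On real data, comparing `r_count` with `r_flag` shows whether the count carries more signal than the flag alone. The toy numbers here say nothing about what you will find in your own database.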

What about event attendance? We might reasonably assume that an alum who attended a campus event a decade ago and never returned is far less likely to give than someone who visited just last year. Some schools have attendance data going back many years. Is any of that still relevant? My answer is “probably.” I once got to study Homecoming attendance data for a university that had done a good job of recording it in the database going back 10 years. I already knew that Homecoming attendance was predictive of giving for this university, but I was surprised to discover that one-time, long-ago attendance was just as predictive as recent attendance.

This may not hold true across institutions. You may want to break historical event attendance data into separate year categories to see whether they vary in their correlation with your predicted value. If they don’t differ significantly, then your most powerfully predictive variable will probably be a simple count of number of events attended: Repeat attendance, in my experience, is predictive in the extreme.
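One way to run that check is sketched below, assuming attendance is stored as one row per alum per event year. The bucket boundaries, the reference year, and all records are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical attendance records: (alum ID, year of event attended).
attendance = [
    ("A001", 2001), ("A001", 2009),
    ("B002", 2002),
    ("C003", 2008), ("C003", 2009), ("C003", 2010),
]
donors = {"A001": 1, "B002": 1, "C003": 0}  # hypothetical giving flags
CURRENT_YEAR = 2010  # assumed reference year for the recency buckets


def recency_bucket(year, current_year=CURRENT_YEAR):
    """Assign an attendance year to a coarse recency category."""
    years_ago = current_year - year
    if years_ago <= 2:
        return "0-2 years ago"
    if years_ago <= 5:
        return "3-5 years ago"
    return "6+ years ago"


# Simple count of events attended per alum (the repeat-attendance variable).
events_attended = Counter(alum_id for alum_id, _year in attendance)

# Recency bucket of each alum's most recent attendance.
last_year = {}
for alum_id, year in attendance:
    last_year[alum_id] = max(year, last_year.get(alum_id, year))
last_bucket = {alum_id: recency_bucket(y) for alum_id, y in last_year.items()}


def giving_rate_by_bucket(buckets, donor_flags):
    """Share of donors within each recency bucket."""
    totals, givers = Counter(), Counter()
    for alum_id, bucket in buckets.items():
        totals[bucket] += 1
        givers[bucket] += donor_flags.get(alum_id, 0)
    return {bucket: givers[bucket] / totals[bucket] for bucket in totals}


rates = giving_rate_by_bucket(last_bucket, donors)
```

If the giving rates are similar across buckets on real data, the buckets can be dropped in favour of `events_attended` as a single variable.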

The enemy of relevance when it comes to data isn’t how old it is, but how incomplete or biased it is. For example, if you have good data on involvement on athletic teams up until 1985, and then nothing after that — that’s a problem. In that case, your variables for athletic involvement will be more informative about how old your alumni are than how engaged they might be. If you build a model that is restricted to older alumni, you’ll be fine, but if you include the entire database, ‘athletics’ will be highly correlated with ‘age’ and may add little or no predictive value.
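A quick way to spot this kind of gap is to look at how coverage varies by class year before using the variable. The sketch below groups hypothetical alumni by graduation decade and reports the share with any athletics record. The data and the decade grouping are assumptions.

```python
from collections import defaultdict

# Hypothetical alumni: (class year, has any athletics record in the database).
alumni = [
    (1972, True), (1975, False),
    (1981, True), (1984, True), (1988, False),
    (1995, False), (1999, False),
    (2004, False),
]


def coverage_by_decade(records):
    """Share of alumni in each graduation decade with an athletics record."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for class_year, has_record in records:
        decade = class_year // 10 * 10
        totals[decade] += 1
        flagged[decade] += int(has_record)
    return {decade: flagged[decade] / totals[decade] for decade in sorted(totals)}


coverage = coverage_by_decade(alumni)
```

A sharp drop to zero after a particular decade suggests that data entry stopped, not that alumni stopped playing sports. In that case the variable is mostly measuring age.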

What can you do? I see three strategies for addressing issues with older data, each one being appropriate in different situations.

  1. Leave it alone.
  2. Input the data.
  3. Impute the data.

We should leave the data alone when we know that the alum is the one who is primarily responsible for the presence or absence of data. All alumni who are not lost have more or less the same opportunity to provide us their contact and employment information. When all alumni have equal ability to influence some specific data point, the absence of data at that point is not a problem, but rather indicative of an attitude. No intervention is required. (A complicating factor is contact information that is purchased and appended — you should consult the ‘source’ code, if you have it, to distinguish between alum-provided data and data from other sources. The same might go for contact information that has been researched — but probably the number of records that have been researched is small in comparison with the entire database, and not a significant confounding factor.)

We should input the data when we know it to exist outside the database, when it is based on simple historical fact, and when it is practical to do so. An example would be student involvement in athletics. Unless capturing this information is someone’s explicit responsibility, the data will often be spotty; some class years will be covered and others won’t. Someone has this information — it’s probably in a file cabinet in the Athletics Office or, as a last resort, there’s always the yearbook — it just hasn’t been entered into the database. It’s a project, maybe a big project, but it might be quite doable with the help of a student or two. Would you go to this trouble just for the sake of predictive modeling? No. The risk is that the variable would still not be predictive. However, it isn’t hard to see that having the data will prove useful someday, perhaps for a special appeal directed at former student athletes. (An Alumni Records office with this sort of forward-looking, project-oriented mindset is a joy to work with!) If no data has ever been entered at all, and entering it retroactively isn’t a realistic goal, then why not just start tracking it from this point forward? It may be a long time before it becomes useful for data mining — you’ll be long-gone — but remember that our work rests on the shoulders of employees who have gone before, people who never heard of data mining but intuited that this or that category of data would someday prove useful.

And finally, we should impute the data when a variable is useful for prediction but excludes some sector of the alumni population through no fault of their own. Old wealth-screening data is a good example. If the data is ten years old, none of your recent graduates will have a wealth score. This might not be a problem if you’re building a Major Gift model or a Planned Giving model and excluding your younger alumni anyway, but for Annual Giving likelihood you should employ some of the techniques I discussed in previous posts on dealing with missing data. In those posts I was talking about survey data, but the idea is exactly the same. (See Surveys and missing data and More on surveys and missing data.) Essentially, the simpler techniques for imputing missing data involve substituting average values when we don’t know how an alum would have scored (or answered) had he or she had the opportunity to be included.
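As an illustration of the simplest technique, the sketch below substitutes the average of the known wealth scores for missing ones. The scores and IDs are invented. The extra "was missing" flag is an addition for illustration, not something described in the posts referred to above.

```python
# Hypothetical wealth scores; None marks alumni who graduated after the
# screening was done and so never had a chance to be scored.
wealth_score = {"A001": 4.0, "B002": 2.0, "C003": None, "D004": 3.0, "E005": None}


def impute_mean(scores):
    """Replace missing scores with the mean of the known scores.

    Returns the imputed scores plus a 0/1 flag per alum recording which
    values were filled in (the flag is an illustrative addition).
    """
    known = [value for value in scores.values() if value is not None]
    mean = sum(known) / len(known)
    imputed = {k: (mean if v is None else v) for k, v in scores.items()}
    was_missing = {k: int(v is None) for k, v in scores.items()}
    return imputed, was_missing


imputed_score, score_was_missing = impute_mean(wealth_score)
```

Keeping the flag lets you check later whether the imputed group behaves differently from the scored group, which mean substitution on its own would hide.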

Search far and wide across the database for your predictors, but go deep as well — backwards in time!
