CoolData blog

5 May 2015

Predictive modelling for the nonprofit organization

Filed under: Non-university settings, Why predictive modeling? — kevinmacdonell @ 6:15 pm


Predictive modelling uses data to help an organization focus its limited resources of time and money where they will earn the best return. People who work at nonprofits can probably relate to the “limited resources” part of that statement. But is it a given that predictive analytics is possible or necessary for any organization?


This week, I’m in Kingston, Ontario to speak at the conference of the Association of Fundraising Professionals, Southeastern Ontario Chapter (AFP SEO). As usual I will be talking about how fundraisers can use data. Given the range of organizations represented at this conference, I’m considering questions that a small nonprofit might need to answer before jumping in. They boil down to two concerns, “when” and “what”:


When is the tipping point at which it makes sense to employ predictive modelling? And how is that tipping point defined — dollars raised, number of donors, size of database, or what?


What kind of data do we need to collect in order to do predictive modelling? How much should we be willing to spend to gather that data? What type of model should we build?


These sound like fundamental questions, yet I’ve rarely had to consider them. In higher education advancement, the questions are answered already.


In the first case, most universities are already over the tipping point. Even relatively small institutions have more non-donor alumni than they can solicit all at once via mail and phone — it’s just too expensive and it takes too much time. Prioritization is always necessary. Not all universities are using predictive modelling, but all could certainly benefit from doing so.


Regarding the second question — what data to collect — alumni databases are typically rich in the types of data useful for gauging affinity and propensity to give. Knowing everyone’s age is a huge advantage, for example. Even if the Advancement office doesn’t have ages for everyone, at least it has class year, which is usually a good proxy for age. Universities don’t always do a great job of tracking key engagement factors (event attendance, volunteering, and so on), but I’ve been fortunate: enough of this data already existed for me to build robust models.


The situation is different for nonprofits, including small organizations that may not have real databases. (That situation was the topic I wrote about in my previous post: When does a small nonprofit need a database?) One can’t simply assume that predictive modelling is worth the trouble, nor can one assume that the data is available or worth investing in.


Fortunately the first question isn’t hard to answer, and I’ve already hinted at it. The tipping point occurs when the size of your constituency is so large that you cannot afford to reach out to all of them simultaneously. Your constituency may consist of any combination of past donors, volunteers, clients of your services, ticket buyers and subscribers, event attendees — anyone who has a reason to be in your database due to some connection with your organization.


Here’s an extreme example from the non-alumni charity world. Last year’s ALS Ice-Bucket Challenge already seems like a long time ago (which is the way of any social media-driven frenzy), but the real challenge is now squarely on the shoulders of ALS charities. Their constituency has grown by millions of new donors, but there is no guarantee that this windfall will translate into an elevated level of donor support in the long run. It’s a massive donor-retention problem: Most new donors will not give again, but retaining even a fraction could lead to a sizeable echo of giving. It always makes sense to ask recent donors to give again, but I think it would be incredibly wasteful to attempt reaching out to 2.5 million one-time donors. The organization needs to reach out to the right donors. I have no special insight into what ALS charities are doing, but this scenario screams “predictive modelling” to me. (I’ve written about it here: Your nonprofit’s real ice bucket challenge.)


Few of us will ever face anything on the scale of the ice-bucket phenomenon, but smaller versions of this dilemma abound. Let’s say your theatre company has a database with 20,000 records in it — people who have purchased subscriptions over the years, plus single-ticket buyers, plus all your donors (current and long-lapsed). You plan to run a two-week phone campaign for donations, but there’s no way you can reach everyone with a phone number in that limited time. You need a way to rank your constituents by likelihood to give, in order to maximize your return.


(About five years ago, I built a model using data from a symphony orchestra’s database. Among other things, I found that certain combinations of concert series subscriptions were associated with higher levels of giving. So: you don’t need a university alumni database to do this work!)


It works with smaller numbers, too. Let’s say your college has 1,000 alumni living in Toronto, and you want to invite them all to an event. Your budget allows a mail piece to be sent to just 250, however. If you have a predictive model for likelihood to attend an event, you can send mail to only the best prospective attendees, and perhaps email the rest.
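
Here’s a minimal sketch, in Python, of what that prioritization step might look like once you have model scores in hand. The file and column names are hypothetical:

```python
import pandas as pd

# Hypothetical scored file: one row per constituent, with a model score
# for likelihood to attend. File and column names are invented.
constituents = pd.read_csv("toronto_alumni_scored.csv")  # columns: id, name, attend_score

# Rank by score: the top 250 get the mail piece, the rest get email.
ranked = constituents.sort_values("attend_score", ascending=False)
mail_list = ranked.head(250)
email_list = ranked.iloc[250:]

print(f"Mailing {len(mail_list)} invitations; emailing {len(email_list)}.")
```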


In a reverse scenario, if your charity has 500 donors and you’re fully capable of contacting and visiting them all as often as you like, then there’s no business need for predictive modelling. I would also note that modelling is harder to do with small data sets, entailing problems such as overfitting. But that’s a technical issue; it’s enough to know that modelling is something to consider only at the point when resources won’t cover the need to engage with your whole constituency.


Now for the second question: What data do you need?


My first suggestion is that you look to the data you already have. Going back to the example of the symphony orchestra: The data I used actually came from two different systems — one for donor management, the other for ticketing and concert series subscriptions. The key was that donors and concert attendees were each identified with a unique ID that spanned both databases. This allowed me to discover that people who favoured the great Classical composers were better donors than those who liked the “pops” concerts — but that people who attended both were the best donors of all! If the orchestra intended to identify a pool of prospects for leadership gifts, this would be one piece of the ranking score that would help them do it.
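
For the curious, here is roughly how that kind of cross-system merge might look in Python — a sketch only, with hypothetical file, column, and series names:

```python
import pandas as pd

# Hypothetical extracts from the two systems, linked by a shared patron ID.
donors = pd.read_csv("donor_system.csv")    # columns: patron_id, lifetime_giving
subs = pd.read_csv("ticketing_system.csv")  # columns: patron_id, series

# One row per patron, with a 0/1 flag per concert series.
flags = (
    subs.assign(flag=1)
        .pivot_table(index="patron_id", columns="series",
                     values="flag", aggfunc="max", fill_value=0)
        .reset_index()
)

merged = donors.merge(flags, on="patron_id", how="left").fillna(0)

# The finding described above: subscribers to BOTH series gave the most.
both = merged[(merged["Classical"] == 1) & (merged["Pops"] == 1)]
print(both["lifetime_giving"].median())
```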


So: Explore your existing data. And while you’re doing so, don’t assume that messy, old, or incomplete data is not useable. It’s usually worth a look.


What about collecting new data? This can be an expensive proposition, and I think it would be risky to gather data just so you can build predictive models. There is no guarantee that what you’re spending time and money to gather is actually correlated with giving or other behaviours. My suggestion would be to gather data that serves operational purposes as well as analytical ones. A good example might be event attendance. If your organization holds a lot of events, you’ll want to keep statistics on attendance and how effective each event was. If you can find ways to record which individuals were at the event (donors, volunteers, community members), you will get this information, plus you will get a valuable input for your models.


Surveying is another way organizations can collect useful data for analysis while also serving other purposes. It’s one way to find out how old donors are — a key piece of information. Just be sure that your surveys are not anonymous! In my experience, people are not turned off by non-anonymous surveys so long as you’re not asking deeply personal questions. Offering a chance to win a prize for completing the survey can help.


Data you might gather on individuals falls into two general categories: Behaviours and attributes.


Behaviours are any type of action people take that might indicate affinity with your organization. Giving is obviously the big one, but other good examples would be event attendance or volunteering, or any type of interaction with your organization.


Attributes are just characteristics that prospects happen to have. This includes gender, where a person lives, age, wealth information, and so on.


Of the two types, behavioural factors are always the more powerful. You can never go wrong by looking at what people actually do. As the saying has it, people give of their time, talent, and treasure. Focus on those interactions first.


People also give of something else that is increasingly valuable: Their attention. If your organization makes use of a broadcast email platform, find out if it tracks opens and click-throughs — not just at the aggregate level, but at the individual level. Some platforms even assign a score to each email address that indicates the level of engagement with your emails. If you run phone campaigns, keep track of who answers the call. The world is so full of distractions, these periods of time when you have someone’s full attention are themselves gifts — and they are directly associated with likelihood to give financially.
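
If your platform can export an event log, a per-address engagement score is easy to roll yourself. A sketch, assuming a hypothetical export format:

```python
import pandas as pd

# Hypothetical event log exported from an email platform: one row per
# open or click, at the individual level.
events = pd.read_csv("email_events.csv")  # columns: email, event, date

# A crude per-address engagement score: clicks weighted above opens.
weights = {"open": 1, "click": 3}
events["points"] = events["event"].map(weights).fillna(0)
engagement = events.groupby("email")["points"].sum().rename("engagement_score")

print(engagement.sort_values(ascending=False).head())
```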


Attributes are trickier. They can lead you astray with correlations that look real, but aren’t. Age is always a good thing to have, but gender is only sometimes useful. And I would never purchase external data (census and demographic data, for example) for predictive modelling alone. Aggregate data at the ZIP or postal code level is useful for a lot of things, but is not the strongest candidate for a model input. The correlations with giving to your organization will be weak, especially in comparison with the behavioural data you have on individuals.


What type of model does it make sense for a nonprofit to try to build first? Any modelling project starts with a clear statement of the business need. Perhaps you want to identify which ticket buyers will convert to donors, or which long-lapsed donors are most likely to respond positively to a phone call, or who among your past clients is most likely to be interested in becoming a volunteer.


Whatever it is, the key thing is that you have plenty of historical examples of the behaviour you want to predict. You want to have a big, fat target to aim for. If you want to predict likelihood to attend an event and your database contains 30,000 addressable records, you can be quite successful if 1,000 of those records have some history of attending events — but your model will be a flop if you’ve only got 50. The reason is that you’re trying to identify the behaviours and characteristics that typify the “event attendee,” and then go looking in your “non-attendee” group for those people who share those behaviours and characteristics. The better they fit the profile, the more likely they are to respond to an event invitation. Fifty people is probably not enough to define what is “typical.”
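
A quick check of target size, before any modelling begins, might look like this in Python. The file and column names are hypothetical, and the 500-case threshold is only a rough rule of thumb, not a statistical cutoff:

```python
import pandas as pd

# Hypothetical extract with a 0/1 flag for the behaviour to be predicted.
records = pd.read_csv("constituents.csv")  # includes has_attended_event (0/1)

n_total = len(records)
n_target = records["has_attended_event"].sum()
print(f"{n_target} of {n_total} records ({n_target / n_total:.1%}) show the target behaviour.")

# Rule of thumb only: with too few positive cases, the "typical attendee"
# profile will be too noisy to trust.
if n_target < 500:
    print("Warning: target may be too small for a reliable first model.")
```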


So for your first foray into modelling, I would avoid trying to hit very small targets. Major giving and planned giving propensity tend to fall into that category. I know why people choose to start there — because it implies high return on investment — but you would be wise to resist.


At this point, someone who’s done some reading may start to obsess about which highly advanced technique to use. But if you’re new to hands-on work, I strongly suggest using a simple method that requires you to study each variable individually, in relation to the outcome you’re trying to model. The best starting point is to get familiar with comparing groups (attendees vs. non-attendees, donors vs. non-donors, etc.) using means and medians, preferably with the aid of a stats software package. (Peter Wylie’s book, Data Mining for Fundraisers, has this covered.) From there, learn a bit more about exploring associations and correlations between variables by looking at scatterplots and using Pearson Product-Moment Correlation. That will set you up well for learning to do multiple linear regression, if you choose to take it that far.
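
To make those first two steps concrete, here is a minimal sketch in Python, with hypothetical column names; any stats package will do the same job:

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical file: is_donor (0/1), plus candidate variables.
df = pd.read_csv("constituents.csv")

# Step one: compare the two groups on a variable using means and medians.
print(df.groupby("is_donor")["events_attended"].agg(["mean", "median", "count"]))

# Step two: Pearson product-moment correlation with the outcome.
subset = df[["events_attended", "is_donor"]].dropna()
r, p = pearsonr(subset["events_attended"], subset["is_donor"])
print(f"r = {r:.3f} (p = {p:.4f})")
```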


In sum: Predictive modelling isn’t for everyone, but you don’t need Big Data or a degree in statistics to get some benefit from it. Start small, and build from there.


3 May 2015

When does a small nonprofit need a database?

Filed under: Non-university settings — kevinmacdonell @ 9:25 am


I had a dream a few nights ago in which I was telling my wife about a job interview I’d just had. A small rural Anglican church serving British expats was hiring a head of fund development. (I have very specific dreams.) I lamented that I had forgotten to ask some key questions: “I don’t even know if they have a database!”


Not all of my dreams are that nerdy. The fact is, nonprofit organizations (as opposed to higher education institutions — my usual concern) are on my mind lately, as I am preparing a conference presentation for a group that includes the full range of organizations, many of them small. I’m presenting on predictive modelling, but like that rural church, some organizations may not yet have a proper database.


When should an organization acquire some kind of database system or CRM?


Any organization, no matter how small, has to track activity and record information for operational purposes. This may be especially true for nonprofits that need to report on the impact they’re having in the community. I usually think in terms of tracking donors, but nonprofits may have an additional need to track clients and services.


Alas, the go-to solution is often the everyday Excel spreadsheet. It’s clear why: Excel is flexible, adaptable, comprehensible, and ubiquitous. Plus, if you’re a whiz, there are advanced features to explore. But while an Excel file can store data, it is NOT a true database. For a growing nonprofit, managing everything in spreadsheets will become an expensive liability. You may have already come to a painful awareness of that fact. For others who aren’t there yet, here are a few warning signs that spreadsheets have outstayed their welcome in your office.


One: Even on a wide screen at 80% zoom, you have to do a lot of horizontal scrolling.


At the start, a spreadsheet seems so straightforward … A column each for First Name, Last Name, and some more columns for address information, phone and email. Then one day, you have a client or donor who has a second address — a business or seasonal address — and she wants to get your newsletter at one or the other, depending on the time of year. Both addresses are valid, so you need to add more columns. Hmm, and of course you want to track who attended your last event. If someone attends an event in July and another in December, you’ll need a column to record each event. As each volunteer has a new activity, as each client has a new interaction with your services, you are adding more and more columns until the sideways scrolling gets ridiculous.


Two: Your spreadsheet has so many rows that it is unwieldy to find or update individual records.


It’s technically true that an Excel file can store a million rows, but you probably wouldn’t want to open such a file on your computer. Files with just a few thousand rows can cause trouble after they’ve been worked over long enough. You can always tell a spreadsheet that’s been used to store data in the place of a true database, especially if more than one person has been mucking around in it. It’s in rough shape. In particular, errors made while sorting rows can lead to lost data and headaches all round.


Three: Several spreadsheets are being maintained separately, tracking different types of data on the same people.


Given the issues with large files, you’ll soon be tempted to have a separate sheet for each type of data. If you have a number of people on staff, each might be independently tracking the information that is relevant to their own work: One person tracking donors, another volunteers, another event attendees. John Doe might exist as a row in one or more of these separate files. If each file contains contact information, every change of address becomes a big deal, as it has to be applied in multiple places. Inevitably, the files get out of sync. As bad or worse, insights are not being shared across data files. Reporting is cumbersome, and anything like predictive modelling is impossible.


If this sounds like your situation, know that you’re not alone. I would be lying if I said rampant Excel use doesn’t occur in the (often) better-resourced world of higher education. Of course it does. Sometimes people don’t have the kind of access to the data they need, sometimes the database doesn’t have a module tailored to their business requirements, and sometimes people can’t be bothered to care about institution-wide data integrity. Shadow databases are a real problem on large campuses, and some of those orphan data stores are in Excel.


There’s nothing magic about a true database. It’s all about structure. A database stores data in tables, behind the scenes, and each table is very similar to a spreadsheet: it’s rectangular, and made up of rows and columns. The difference is that a single table usually holds only one type of data: Addresses, for example, or gift transactions. A table may be very long, with millions of rows, but it is typically not very wide, because each table serves only one purpose. As a consequence, a database has to have many tables, one for each thing needing to be stored. A complex enterprise database could have thousands of tables.


This sounds like chaos, but every record in a table contains a reference to data in another table. Tables are joined together by these identifiers, or keys. This allows a query of the database to retrieve John Smith from the ‘names’ table, the proper address for John Smith from the ‘addresses’ table, a sum of gifts made by John Smith from the ‘gifts’ table, and a volunteer status code for John Smith from the ‘volunteers’ table. When John Smith moves and provides his new address, that information is added as a new record in the ‘addresses’ table, attached to his unique identifier (i.e., his ID number). The old address is not deleted, but is marked ‘invalid’, so that the information is retained but never appears on a list of valid addresses. One place, one change — and it’s done.
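
In code, that one-query retrieval might look something like this — a sketch against a hypothetical SQLite database, with invented table and column names:

```python
import sqlite3

conn = sqlite3.connect("donors.db")  # hypothetical database

# One query stitches the person back together through the shared ID key.
sql = """
SELECT n.name,
       a.street, a.city,
       COALESCE(SUM(g.amount), 0) AS total_giving,
       v.status AS volunteer_status
FROM names n
LEFT JOIN addresses a  ON a.person_id = n.person_id AND a.is_valid = 1
LEFT JOIN gifts g      ON g.person_id = n.person_id
LEFT JOIN volunteers v ON v.person_id = n.person_id
WHERE n.name = 'John Smith'
GROUP BY n.person_id, n.name, a.street, a.city, v.status
"""
for row in conn.execute(sql):
    print(row)
```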


That’s a quick and rather inadequate description of what a database is and does. There’s more to a donor management system than just a table structure, and I could say plenty more about user interfaces, reporting, and data integrity and security. But there is no shortage of information and guidance online, so I will leave you with a few places to go for good advice. There are many software solutions out there for organizations big and small.


Robert L. Weiner is a nonprofit technology consultant, helping fundraisers choose software tools. Check out his Ten Common Mistakes in Selecting Donor Databases (And How to Avoid Them). As you proceed toward acquiring a system, here is a piece published by AFP that has good, basic advice about how to manage it: Overcoming Database Demons.


Andrew Urban is author of a great book that helps guide nonprofits large and small in making wise choices in software and systems investments: The Nonprofit Buyer: Strategies for Success from a Nonprofit Technology Sales Veteran.


That’s all from me on this … CoolData’s domain is not systems or databases, but the data itself. A good system is simply a basic requirement for analysis. In my next post, I will address another question a small nonprofit might have: At what point is a nonprofit “big” enough to benefit from predictive modelling?


19 April 2015

Planned Giving prospect identification, driven by data

Filed under: Planned Giving, Prospect identification — kevinmacdonell @ 6:28 pm

I’m looking forward to giving two presentations in my home city in connection with this week’s national conference of the Canadian Association of Gift Planners (CAGP). In theory I’ll be talking about data-driven prospect identification for Planned Giving … “in theory” because my primary aim isn’t to provide a how-to for analyzing data.


Rather, I will urge fundraisers to seek “data partners” in their organizations — finding the person who is closest to the data — and to pose some good questions. There’s a lot of value hidden in your data, and you can’t realize this value alone: You’ve got to work closely with your colleagues in Advancement Services or with any researcher, analyst, or IT person who can get you what you need. And you have to be able to tell that person what you’re looking for.


For a shop that’s done little or no analysis of their data, I would start with these two basic questions:


  1. What is the average age of new expectancies, at the time they became known to your organization?
  2. What is the size of your general prospect pool?

The answer to the first question might suggest that more active prospect identification is required, of the type more often associated with major-gift fundraising. If the average age is 75 or older, I have to think that earlier identification of bequest intentions would benefit donor and cause alike, by allowing for a longer period for the conversation to mature and for the relationship to develop.
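
If your data partner can pull birth dates and the dates intentions became known, answering the first question is a few lines of work. A sketch in Python, with hypothetical file and column names:

```python
import pandas as pd

# Hypothetical extract: one row per expectancy, with birth date and the
# date the intention became known to the organization.
exp = pd.read_csv("expectancies.csv", parse_dates=["birth_date", "date_identified"])

exp["age_at_identification"] = (
    (exp["date_identified"] - exp["birth_date"]).dt.days / 365.25
)
print(f"Average age at identification: {exp['age_at_identification'].mean():.1f}")
```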


The answer to the second question gives an indication of the potential that exists in the database — but also the challenge of zeroing in on the few people (the top 100, say) in that universe of prospects who are most likely to accept a personal visit. Again, I’m talking about high-touch fundraising — more like Major Gifts, less like Annual Fund.


As Planned Giving professionals get comfortable asking questions of the data, the quality of the questions should improve. Ideally, the analyses will move from one-off projects to an ongoing process of gathering insights and then applying them. Along those lines, I will be giving attendees for both presentations a taste of how some simple targeting via data mining might work. As Peter Wylie and I wrote in our book, “Score!”, data mining for Planned Giving is primarily about improving the odds of success.


I said that I’m giving two presentations. Actually, it’s the same presentation, for two audiences. The first talk will be for a higher ed audience in advance of the conference, and the second will be for a more general nonprofit audience attending the conference proper. I expect the questions and conversations to differ significantly, and I also expect some of my assertions about Planned Giving fundraising to be challenged. Should be interesting!


Since you’ve read this far, you might be interested in downloading the handout I’ve prepared for these talks: Data-driven prospect ID for Planned Giving. There’s nothing like being there in person for the conversation we’re going to have, but this discussion paper does cover most of what I’ll be talking about.


If you’re visiting Halifax for the conference, welcome! I look forward to meeting with you.


1 April 2015

Mind the data science gap

Filed under: Training / Professional Development — kevinmacdonell @ 8:10 pm


Being a forward-thinking lot, the data-obsessed among us are always pondering the best next step to take in professional development. There are more options every day, from a Data Science track on Coursera to new masters degree programs in predictive analytics. I hear a lot of talk about acquiring skills in R, machine learning, and advanced modelling techniques.


All to the good, in general. What university or large non-profit wouldn’t benefit from having a highly-trained, triple-threat chameleon with statistics, programming, and data analytics skills? I think it’s great that people are investing serious time and brain cells pursuing their passion for data analysis.


And yet, one has to wonder, are these advanced courses and tools helping drive bottom-line results across the sector? Are they helping people at nonprofits and university advancement offices do a better job of analyzing their data toward some useful end?


I have a few doubts. The institutions and causes that employ these enterprising learners may be fortunate to have them, but I would worry about retention. Wouldn’t these rock stars eventually feel constrained in the nonprofit or higher ed world? It’s a great place to apply one’s creativity, but aren’t the problems and applications one can address with data in our field relatively straightforward in comparison with other fields? (Tailoring medical treatment to an individual’s DNA, preventing terrorism or bank fraud, getting an American president elected?) And then there’s the pay.


Maybe I’m wrong to think so. Clearly there are talented people working in our sector who are here because they have found the perfect combination of passions. They want to be here.


Anyway — rock star retention is not my biggest concern.


I’m more concerned about the rest of us: people who want to make better use of data, but aren’t planning to learn way more than we need or are capable of. I’m concerned for a couple of reasons.


First, many of the professional development options available are pitched at a level too advanced to be practical for organizations that haven’t hired a full-time predictive analytics specialist. The majority of professionals working in the non-profit and higher-ed sectors are mainly interested in getting better at their jobs, whether that’s increasing dollars raised or boosting engagement among their communities. They don’t need to learn to code. They do need some basic, solid training options. I’m not sure these are easy to spot among all the competing offerings and (let’s be honest) the Big Data hype.


These people need support and appropriate training. There’s a place for scripting and machine learning, but let’s ensure we are already up to speed on means/medians, bar charts, basic scoring, correlation, and regression. Sexy? No. But useful, powerful, necessary. Relatively simple and manual techniques that are accessible to a range of advancement professionals — not just the highly technical — offer a high return on investment. It would be a shame if the majority were cowed into thinking that data analysis isn’t for them just because they don’t see what neural networks have to do with their day to day work.


My second concern is that some of the advanced tools of data science are deceptively easy to use. I read an article recently that stated that when it’s done really well, data science looks easy. That’s a problem. A machine-learning algorithm will spit out answers, but are they worth anything? (Maybe.) Does an analyst learn anything about their data by tweaking the knobs on a black box? (Probably not.) Is skipping over the inconvenience of manual data exploration detrimental to gaining valuable insights? (Yes!)


Don’t get me wrong — I think R, Python, and other tools are extremely useful for predictive modelling, although not for doing the modelling itself (not in my hands, at least). I use SQL and Python to automate the assembly of large data files to feed into Data Desk — it’s so nice to push a button and have the script merge together data from the database, from our phonathon database, from our broadcast email platform and other sources, as well as automatically create certain indicator variables, pivoting all kinds of categorical variables and handling missing data elegantly. Preparing this file using more manual methods would take days.
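
For what it’s worth, the pandas equivalent of that assembly step might look something like this — a sketch only, with invented file and column names standing in for my real sources:

```python
import pandas as pd

# Invented extracts standing in for the main database, phonathon system,
# and email platform, all keyed on a shared constituent ID.
base = pd.read_csv("advancement_extract.csv")  # id, degree, class_year, ...
phone = pd.read_csv("phonathon_results.csv")   # id, answered_call (0/1)
email = pd.read_csv("email_engagement.csv")    # id, engagement_score

merged = base.merge(phone, on="id", how="left").merge(email, on="id", how="left")

# Pivot a categorical into 0/1 indicator columns, and fill gaps so the
# modelling file has no holes.
merged = pd.get_dummies(merged, columns=["degree"], dummy_na=True)
merged["answered_call"] = merged["answered_call"].fillna(0)
merged["engagement_score"] = merged["engagement_score"].fillna(0)

merged.to_csv("modelling_file.csv", index=False)  # ready for Data Desk
```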


But this doesn’t automate exploration of the data, it doesn’t remove the need to be careful about preparing data to answer the business question, and it does absolutely nothing to help define that business question. Rather than let a script grind unsupervised through the data to spit out a result seconds later without any subject-matter expertise being applied, the real work of building a model is still done manually, in Data Desk, and right now I doubt there is a better way.


When it comes to professional development, then, all I can say is, “to each their own.” There is no one best route. The important thing is to ensure that motivated professionals are matched to training that is a good fit with their aptitudes and with the real needs of the organization.


18 January 2015

Why blog? Six reasons and six cautions

Filed under: CoolData, Off on a tangent, Training / Professional Development — kevinmacdonell @ 4:12 pm

The two work-related but extracurricular activities I have found the most rewarding, personally and professionally, are giving conference presentations and writing for CoolData. I’ve already written about the benefits of presenting at conferences, explaining why the pain is totally worth it. Today: six reasons why you might want to try blogging, followed by six (optional) pieces of advice.

I’ve been blogging for just over five years, and I can say that the best way to start, and stay started, is to seek out motives that are selfish. The type of motivation I’m thinking of is intrinsic, such as personal satisfaction, as opposed to extrinsic, such as aiming to have a ton of followers and making money. It’s a good selfish.

Three early reasons for getting started with a blog are:

1. Documenting your work: One of my initial reasons for starting was to have a searchable place to keep snippets of knowledge — specific techniques for manipulating data in Excel, for example. I have found myself referring to older published pieces to remind me how I carried out an analysis or when I need a block of SQL. A blog has the added benefit of being shareable, but if your purpose is personal documentation, it doesn’t matter if you have any audience at all.

2. Developing your thoughts: Few activities bring focus and clarity to your thoughts like writing about them. Some of my ideas on more abstract issues have been shaped and developed this way. Sometimes the office is not the best environment for this sort of reflective work. A blog can be a space for clarity. Again — no need for an audience.

3. Solidifying your learning: One of the best ways to learn something new is by teaching it to someone else. I may have had an uncertain grasp of multiple linear regression, for example, when I launched CoolData, but the exercise of trying to explain data mining concepts and techniques was a great way to get it all straight in my head. If I were to go back today and re-read some of my early posts on the subject, which I rarely do, I would find things I probably would disagree with. But the likelihood of being wrong is not a good enough reason to avoid putting your thoughts out there. Being naive and wrong about things is a stage of learning.

Let’s say that, motivated by these or other reasons, you’ve published a few posts. Suddenly you’ve got something to share with the world. Data analysis lends itself perfectly to discussion via blogs. Not only analysts and data miners, but programmers, prospect researchers, business analysts, and just about anyone engaged in knowledge work can benefit personally while enriching their profession by sharing their thoughts with their peers online.

As you slowly begin to pick up readers, new reasons for blogging will emerge. Three more reasons for blogging are:

4. Making professional connections: As a direct result of writing the blog I have met all kinds of interesting people in the university advancement, non-profit, and data analysis worlds. Many I’ve met only virtually, others I’ve been fortunate to meet in person. It wasn’t very long after I started blogging that people would approach me at conferences to say they had seen one of my posts. Some of them learned a bit from me, or more likely I learned from them. A few have even found time to contribute a guest post.

5. Sharing knowledge: This is the obvious one, so no need to say much more. Many advancement professionals share online already, via various listservs and discussion forums. The fact this sharing goes on all the time makes me wonder why more people don’t try to make their contributions go even farther by taking the extra step of developing them into blog posts that can be referred to anytime.

6. Building toward larger projects: If you keep at it, slowly but surely you will build up a considerable body of work. Blogging can feed into conference presentations, discussion papers, published articles, even a book.

Let me return to the distinction I made earlier between intrinsic and extrinsic motivators — the internal, more personal rewards of blogging versus the external, often monetary, goals some people have. As it happens, the personal reasons for blogging are realistic, with a high probability of success, while the loftier goals are likely to lead to premature disillusionment. A new blog with no audience is a fragile thing; best not burden it with goals you cannot hope to realize in the first few years.

I consider CoolData a success, but not by any external measure. I simply don’t know how many followers a blog about data analysis for higher education advancement ought to have, and I don’t worry about it. I don’t have goals for number of visitors or subscribers, or even number of books sold. (Get your copy of “Score!” here. … OK — couldn’t resist.)

The blog does what I want it to do.

That’s mostly what I have to say, really. I have a few bits of advice, but my strongest advice is to ignore what everybody else thinks you should do, including me. Most expert opinion on posting frequency, optimum length for posts, ideal days and times for publishing, click-bait headlines, search engine optimization and the like is a lot of hot air.

If you’re still with me, here are a few cautions and pieces of advice, take it or leave it:

1. On covering your butt: Some employers take a dim view of their employees publishing blogs and discussing work-related issues on social media. You might want to clear your activity with your supervisor first. When I changed jobs, I disclosed that I intended to keep up my blog. I explained that connecting with counterparts at other universities was a big part of my professional development. There’s never been an issue. Be clear that you’re writing for a small readership of professionals who share your interests, an activity not unlike giving a conference presentation. Any enlightened organization should embrace someone who takes the initiative. (You could blog secretly and anonymously, but what’s the point?)

2. On “permission”: Beyond ensuring that you are not jeopardizing your day job, you do not require anyone’s permission. You don’t have to be an expert; you simply have to be interested in your subject and enthusiastic about sharing your new knowledge with others. Beginners have an advantage over experts when it comes to blogging; an expert will often struggle to relate to beginners, and assume too much about what they know or don’t know. So what if that post from two years ago embarrasses you now? You can always just delete it. If you’re reticent about speaking up, remember that blogging is not about claiming to be an authority on anything. It’s about exploring and sharing. It’s about promoting helpful ideas and approaches. You can’t prevent small minds from interpreting your activity as self-promotion, so just keep writing. In the long run, it’s the people who never take the risk of putting themselves out there who pay the higher price.

3. On writing: The interwebs ooze with advice for writers so I won’t add to the noise. I’ll just say that, although writing well can help, you don’t need to be an exceptional stylist. I read a lot of informative yet sub-par prose every day. The misspellings, mangled English, and infelicities that would be show-stoppers if I were reading a novel just aren’t that important when I’m reading for information that will help me do my job.

4. On email: In the early days of email I thought it rude not to respond. Today things are different: It’s just too easy to bombard people. Don’t get me wrong: I have received many interesting questions from readers (some of which have led to new posts, which I love), as well as great opportunities to connect, participate in projects, and so on. But just because you make yourself available for interaction doesn’t mean you need to answer every email. You can lay out the ground rules on an “About” page. If someone can’t be bothered to consider your guidelines for contact, then an exchange with that person is not going to be worth the trouble. On my “About this Blog” page I make it clear that I don’t review books or software, yet the emails offering me free stuff for review keep coming. I have no problem deleting those emails unanswered. … Then there are emails that I fully intend to respond to, but don’t get the chance. Before long they are buried in my inbox and forgotten. I do regret that a little, but I don’t beat myself up over it. (However — I do hereby apologize.)

5. On protecting your time: Regardless of how large or small your audience, eventually people will ask you to do things. Sometimes this can lead to interesting partnerships that advance the interests of both parties, but choose wisely and say no often. Be especially wary of quid pro quo arrangements that involve free stuff. I rarely read newspaper travel writing because I know so much of it is bought and paid for by tour companies, hotels, restaurants and so on, without disclosure. However, I’m less concerned about high-minded integrity than I am about taking on extra burdens. I’m a busy guy, and also a lazy guy who jealously guards his free time, so I’m careful about being obliged to anyone, either contractually or morally. Make sure your agenda is set exclusively by whatever has your full enthusiasm. You want your blogging to be a free activity, where no one but you calls the shots.

6. On the peanut gallery: Keeping up a positive conversation with people who are receptive to your message is productive. Trying to convince skeptics and critics who are never going to agree with you is not. When you’re pushing back, you’re not pushing forward. Keep writing for yourself and the people who want to hear what you’ve got to say, and ignore the rest. This has nothing to do with being nice or avoiding conflict. I don’t care if you’re nice. It’s about applying your energies in a direction where they are likely to produce results. Focus on being positive and enabling others with solutions and knowledge, not on indulging in opinions, fruitless debates, and pointless persiflage among the trolls in the comments section. I haven’t always followed my own advice, but I try.

Some say “know your audience.” Actually, it would be better to know yourself. Readers respond to your personality, and they can only get to know you if you are consistent. You can only be consistent if you are genuine. There are 7.125 billion people in the world and almost half of them have an internet connection (and access to Google Translate). Some of those will become your readers — be true to them by being true to yourself. There is no need to waste your time chasing the crowd.

Your overarching goals are not to convince or convert or market, but to 1) fuel your own growth, and 2) connect with like-minded people. Growth and connection: That’s more than enough payoff for me.

7 January 2015

New finds in old models


When you build a predictive model, you can never be sure it’s any good until it’s too late. Deploying a mediocre model isn’t the worst mistake you can make, though. The worst mistake would be to build a second mediocre model because you haven’t learned anything from the failure of the first.


Performance against a holdout data set for validation is not a reliable indicator of actual performance after deployment. Validation may help you decide which of two or more competing models to use, or it may provide reassurance that your one model isn’t total junk. It’s not proof of anything, though. Those lovely predictors, highly correlated with the outcome, could be fooling you. There are no guarantees they’re predictive of results over the year to come.


In the end, the only real evidence of a model’s worth is how it performs on real results. The problem is, those results happen in the future. So what is one to do?


I’ve long been fascinated with Planned Giving likelihood. Making a bequest seems like the ultimate gesture of institutional affinity (ultimate in every sense). On the plus side, that kind of affinity ought to be clearly evidenced in behaviours such as event attendance, giving, volunteering and so on. On the negative side, Planned Giving interest is uncommon enough that comparing expectancies with non-expectancies will sometimes lead to false predictors based on sparse data. For this reason, my goal of building a reliable model for predicting Planned Giving likelihood has been elusive.


Given that a validation data set taken from the same time period as the training data can produce misleading correlations, I wondered whether I could do one better: That is, be able to draw my holdout sample not from data of the same time period as that used to build the model, but from the future.


As it turned out, yes, I could.


Every year I save my regression analyses as Data Desk files. Although I assess the performance of the output scores, I don’t often go back to the model files themselves. However, they’re there as a document of how I approached modelling problems in the past. As a side benefit, each file is also a snapshot of the alumni population at that point in time. These data sets may consist of a hundred or more candidate predictor variables — a well-rounded picture.


My thinking went like this: Every old model file represents data from the past. If I pretend that this snapshot is really the present, then in order to have access to knowledge of the future, all I have to do is look at today’s data stored in the database.


For example, for this blog post, I reached back two years to a model I created in Data Desk for predicting likelihood to upgrade to the Leadership level in Annual Giving. I wasn’t interested in the model itself. Rather, I wanted to examine the underlying variables I had to work with at the time. This model had been an ambitious undertaking, with some 170 variables prepared for analysis. Many of course were transformations of variables or combinations of interacting variables. Among all those variables was one indicating whether a case was a current Planned Giving expectancy or not, at that point in time.


In this snapshot of the database from two years ago, some of the cases that were not expectancies would have become so since then. In other words, I now had the best of both worlds. I had a comprehensive set of potential predictors as they existed two years ago, AND access to the hitherto unknowable future: The identities of the people who had become expectancies after the predictors had been frozen in time.


As I said, my old model was not intended to predict Planned Giving inclination. So I built a new model, using “Is an Expectancy” (0/1) as the target variable. I trained the regression model on the two-year-old expectancy data — I didn’t even look at the new expectancies while building the model. No: I used those new expectancies as my validation data set.
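
In today’s terms, the mechanics might look something like the following — a sketch only, using logistic regression as a stand-in for my Data Desk regression, with invented file and column names:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# The two-year-old snapshot plus a fresh query of today's expectancy flags,
# merged on constituent ID. All names here are hypothetical.
old = pd.read_csv("snapshot_two_years_ago.csv")  # id, is_expectancy (0/1), predictors...
now = pd.read_csv("expectancies_today.csv")      # id, is_expectancy_now (0/1)
df = old.merge(now, on="id", how="left").fillna({"is_expectancy_now": 0})

predictors = ["events_attended", "age", "years_since_first_gift"]  # stand-ins
df[predictors] = df[predictors].fillna(0)  # crude; real prep is more careful

# Train on the OLD target only; today's new expectancies stay held out as a test set.
model = LogisticRegression(max_iter=1000).fit(df[predictors], df["is_expectancy"])
df["score"] = model.predict_proba(df[predictors])[:, 1]

# Decile 10 = highest scores. Test cases: not expectancies then, expectancies now.
df["decile"] = pd.qcut(df["score"].rank(method="first"), 10, labels=range(1, 11))
new_cases = df[(df["is_expectancy"] == 0) & (df["is_expectancy_now"] == 1)]
print(new_cases["decile"].value_counts().sort_index())
```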


“Validation” might be too strong a word, given that there were only 80 or so new cases. That’s a lot of bequest intentions, for sure, but in terms of data it’s a drop in the bucket compared with the number of cases being scored. Let’s call it a test data set. I used this test set to help me analyze the model, in a couple of ways.


First I looked at how new expectancies were scored by the model I had just built. The chart below shows their distribution by score decile. Slightly more than 50% of new expectancies were in the top decile. This looks pretty good — keeping in mind that this is what actual performance would have looked like had I really built this model two years ago (which I could have):

[Chart: distribution of new expectancies by model score decile, with just over half falling in the top decile.]




(Even better, looking at percentiles, most of the expectancies in that top 10% are concentrated nicely in the top few percentiles.)


But I didn’t stop there. It is also evident that almost half of new expectancies fell outside the top 10 percent of scores, so clearly there was room for improvement. My next step was to examine the individual predictors I had used in the model. These were of course the predictors most highly correlated with being an expectancy. They were roughly the following:
  • Year person’s personal information in the database was last updated
  • Number of events attended
  • Age
  • Year of first gift
  • Number of alumni activities
  • Indicated “likely to donate” on 2009 alumni survey
  • Total giving in last five years (log transformed)
  • Combined length of name Prefix + Suffix


I ranked the correlation of each of these with the 0/1 indicator meaning “new expectancy,” and found that most of the predictors were still fine, although they changed their order in the rank correlation. Donor likelihood (from survey) and recent giving were more important, and alumni activities and how recently a person’s record was updated were less important.


This was interesting and useful, but what was even more useful was looking at the correlations between ALL potential predictors and the state of being a new expectancy. A number of predictors that would have been too far down the ranked list to consider using two years ago were suddenly looking much better. In particular, many variables related to participation in alumni surveys bubbled closer to the top as potentially significant.


This exercise suggests a way to proceed with iterative, yearly improvements to some of your standard models:
  • Dig up an old model from a year or more ago.
  • Query the database for new cases that represent the target variable, and merge them with the old datafile.
  • Assess how your model performed or, if you created more than one model, see which model would have performed best. (You should be doing this anyway.)
  • Go a layer deeper, by studying the variables that went into those models — the data “as it was” — to see which variables had correlations that tricked you into believing they were predictive, and which variables truly held predictive power but may have been overlooked. (A sketch of this step follows below.)
  • Apply what you learn to the next iteration of the model. Leave out the variables with spurious correlations, and give special consideration to variables that may have been underestimated before.
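
Here is roughly what that correlation-ranking step might look like in Python, continuing the hypothetical snapshot file from the sketch above:

```python
import pandas as pd

# The snapshot file with the NEW target merged in (hypothetical names).
df = pd.read_csv("snapshot_with_new_target.csv")

exclude = ["id", "is_expectancy", "is_expectancy_now"]
candidates = df.drop(columns=exclude).select_dtypes("number")

# Rank every candidate predictor by correlation with the new target.
corrs = candidates.corrwith(df["is_expectancy_now"]).abs().sort_values(ascending=False)
print(corrs.head(20))  # look for overlooked keepers — and for spurious winners
```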