CoolData blog

3 October 2017

Our “data-informed decision making” journey

Filed under: Analytics, Business Intelligence, Dalhousie University — kevinmacdonell @ 6:34 pm

 

Building a business intelligence and analytics program can take years, and the move toward data-informed decision making is a cultural evolution that might never be complete. In my previous blog post, I talked about what advancement BI looks like in its ideal state (Analytics as an organizing principle). Today I want to talk about the messy reality.

 

Looking back at our own journey at Dalhousie University, I realize that we didn’t pursue the most direct and well-lit path, but we did learn a lot along the way. Eight years ago or so, we had very limited capability for supporting decisions with data. We still haven’t “arrived” — there is plenty more to do — but our progress is worth looking back on. It’s this progress I’ve been recounting for audiences across the country lately; it seems everyone is attempting to plan their own journey, or at least compare notes.

 

Here I’ll recount a few of the steps that got us to where we are today, starting with some of the obvious ingredients for a successful BI program — quality data, good software tools, and so on — and then talk about some of the perhaps less obvious influences that were essential for driving us forward.

 

First of all, DATA: Years ago, the general perception in our office was that our data was in bad shape. Our coverage rate for contact and employment information was believed to be low, and the accuracy of the data was frequently called into question, based largely on errors spotted in lists. But aside from anecdotes we really had no objective idea.

 

We developed reports to get a handle on coverage rates, as well as the ability to automatically archive these data points to be able to track progress and gaps over time. With our alumni constituency alone growing by several thousand individuals a year, we stepped away from imagining that we should aim to find every lost alum, and instead used a score to prioritize who to trace first. More importantly, the purely clerical “Alumni Records” team was reinvented as the “Constituent Data Integrity” team, tasked with going beyond data entry to developing and acting on data integrity audits, leading a large, cross-functional Data Integrity group to discuss integrity issues, and working much more closely with Prospect Research and Alumni Engagement to provide better support. We have also worked with frontline staff to encourage them to think of Advancement data as something they “own” and will benefit from directly, with a responsibility to feed information and intelligence to records and research staff.
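
To make the coverage-rate and prioritization ideas concrete, here is a minimal Python sketch. It is an illustration only and not Dalhousie's actual reports: the field names, the sample records, the definition of "lost" (no email and no phone), and the scoring rule are all assumptions made for the example.

```python
from datetime import date

# Hypothetical fields to track; a real database would have its own list.
CONTACT_FIELDS = ("email", "phone", "employer")

# Made-up sample records, for illustration only.
records = [
    {"id": 1, "email": "a@example.com", "phone": "", "employer": "Acme",
     "lifetime_giving": 500.0, "events_attended": 2},
    {"id": 2, "email": "", "phone": "", "employer": "",
     "lifetime_giving": 0.0, "events_attended": 0},
    {"id": 3, "email": "", "phone": "", "employer": "",
     "lifetime_giving": 2500.0, "events_attended": 5},
    {"id": 4, "email": "d@example.com", "phone": "555-0101", "employer": "Initech",
     "lifetime_giving": 50.0, "events_attended": 1},
    {"id": 5, "email": "", "phone": "", "employer": "",
     "lifetime_giving": 250.0, "events_attended": 1},
]


def coverage_rates(rows):
    """Share of records with a non-empty value in each tracked field."""
    total = len(rows)
    if total == 0:
        return {field: 0.0 for field in CONTACT_FIELDS}
    return {field: sum(1 for r in rows if r[field]) / total
            for field in CONTACT_FIELDS}


def snapshot(rows, as_of):
    """One dated row per run, so rates can be archived and compared over time."""
    return {"as_of": as_of.isoformat(), **coverage_rates(rows)}


def trace_priority(row):
    """Assumed scoring rule: past giving (capped) plus event attendance."""
    giving_points = min(row["lifetime_giving"] / 500.0, 5.0)
    return giving_points + row["events_attended"]


def tracing_queue(rows):
    """Records with no email and no phone, highest priority first."""
    lost = [r for r in rows if not r["email"] and not r["phone"]]
    return sorted(lost, key=trace_priority, reverse=True)


if __name__ == "__main__":
    print(snapshot(records, date(2017, 10, 3)))
    print([r["id"] for r in tracing_queue(records)])
```

Appending each `snapshot` to a table or file on a schedule is one simple way to get the "track progress over time" piece; the scoring rule would normally be replaced by whatever your own data says predicts value.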

 

We also made a concerted effort to establish written definitions for fundraising terminology and drafted a standardized set of counting rules, agreed to and approved by our leadership. Embedded in a single, core reporting view from which all reports and analyses are derived, these rules enforce a single version of the truth across all reports and dashboards. This work is not complete, but well advanced, and starting with Development data was a good idea.

 

A second enabler was the development of a three-year strategic plan for Advancement Services, something we’d never had. The plan charted a way forward for the team and became the foundation for much-needed investments in personnel. Without a doubt, the most important element in a successful BI program is people — hiring well has been the biggest driver of momentum for us — but resources won’t be made available in the absence of a plan and a roadmap for the future. Our plan did not necessarily lay out everything we intended to do with reporting, BI, and analytics — we didn’t know what the ideal team and technology would look like — but we were able to clearly articulate our gaps and what we needed to help us bridge those gaps.

 

Developing the plan required a commitment to change. This was a big step, because our team had gotten adept at concealing issues. For example, whenever leadership and deans needed an update on fundraising progress against campaign priorities, someone would take the raw data home each night and crunch the data manually in Excel. People were getting the information they wanted, so why change? But in fact we had zero agility, and reporting was never going to be able to grow beyond the basics. The fact that our AVP of Development felt forced to author his own reports should have been a wakeup call.

 

With qualified people making smart decisions, we have invested considerable time in adopting Tableau as a reporting tool to bridge a years-long period of uncertainty about a centrally supported BI tool. Over the years we evolved from senior staff being served pots of raw data and having to fend for themselves in Excel, to having our standard Development reporting automated in Tableau, with progress being made on reporting for other units. At the same time, we hired a BI Analyst to perform more ad hoc analytical and predictive modelling work. We are now hiring two additional BI analysts, each with a more specialized role.

 

Greater demand for more sophisticated reports, dashboards, and analyses meant a greater need for complex transformations of our raw transactional data. We therefore put some emphasis on hiring people who knew SQL or could learn it. My colleague Darrell Rhodenizer puts it this way: Being able to use reporting tools such as Tableau Desktop or Cognos Reporting is one thing, but being able to directly speak the language of our database enables us to use all sorts of tricks to better shape our data for the reporting environment. Other departments that have not invested in the ability to look under the hood seem to be at a disadvantage.

 

As a result, our team has taken over from central IT the primary responsibility for modelling our data — that is, assembling our database tables into complex data structures to serve reporting and analysis. This works well for Advancement, which at most universities is far down the list of departments in terms of central IT support, and often has frequently-changing needs as priorities shift and campaigns roll through.

 

It’s gone beyond just learning SQL. Darrell, as our Associate Director, Advancement Systems & Reporting, has developed a new ETL tool which has accelerated our progress and promises to change the game for years to come. Our unit’s data is extracted nightly from the university’s centrally-managed data warehouse and multiple transformations are applied to it before it is re-stored to the same data warehouse. Under the full control of Advancement, the transformed data is available to all the same users using whatever tool they have. Data model changes are made with agility and with minimal disruption to business.

 

One final enabler: Outside Advancement, a new attitude to working cooperatively across departments and a new appreciation of data as an institutional asset have led to the development of a data governance model and policies for opening up access to data. Before, if data was shared at all, it was done haphazardly and insecurely via Excel files. Today, we have a process for responsible use of data across the institution.

 

These elements of progress — technology, tools, people, skills — had a combined effect that was more than additive. We achieved an increasing momentum over the years, such that newer staff members struggle to imagine how bad things used to be in “them days.”

 

These and other factors were important enablers of change. Without some of them, we could not have made the improvements we did. However, they were not sufficient in themselves to drive change. I suspect we are too often prone to falsely equate analytics competence with a piece of software, or an employee with a certain title, or a team, when really it’s none of those things. We would not have hired key people, and we would not have sought out and effectively deployed new tools, had there not been forces driving us in that direction.

 

Internally, we faced increased demand from Advancement leadership for information and insight. The closing of a comprehensive campaign was very revealing of our gaps in reporting and analysis — and the eventual ramp-up to another campaign spurs us to ensure that we are ready.

 

As well, for some years now a new culture of strategic planning has taken hold, with the development and adoption of an Advancement Balanced Scorecard. This plan for the whole department has had a focusing and integrative effect — everyone sees how functions fit together, and how their own job supports the mission. As great as that is for Development or Marketing or Alumni Engagement, it’s been essential for Operations. We now have a vision for what priorities we will need to support into the future, and a chunk of that support consists of data, information, reporting, dashboards, analyses, and other analytical products — not to mention the development of KPIs directly tied to measuring Advancement’s progress against the goals and objectives of the Balanced Scorecard itself. To date, high-level strategic planning has been the most significant “focusing” factor for our BI work.

 

You may have noticed that these and other internal drivers of change all come from the top, whereas the “enablers” tended to rely on initiative from lower down in the organization. Again, without both, not much would have happened.

 

But some drivers of a culture of analytics aren’t coming from the organization itself at all. We’re growing increasingly aware of external drivers. There are some new realities out there, and the organizations that position their data teams to address these new realities will have a better chance of succeeding.

 

First, alumni and donors have a different relationship with institutions than they once did, and their expectations are different. Alumni populations are growing, the number of donors is decreasing, and traditional engagement methods are less effective. Friend-raising and “one size fits all” approaches to engagement are increasingly seen as unsustainable wastes of resources. University leaders are questioning the very purpose and value of typical alumni relations activities.

 

According to current wisdom, engaged alumni are seeking meaningful interactions that make a difference, especially interactions with students in the form of advice, mentorship, or career development. If they have anything to do with the institution itself, it’s less about nostalgia for student life than it is being a part of the university’s role in society and community. Barbecues and pub nights hold little appeal for truly engaged alumni who believe in your brand of higher education (or your cause), and believe in the power of your students to change the world for the better. They want to be part of the mission.

 

Donors, too, are looking for meaningful engagement. Through their giving they want to accomplish things in the world. If they’re giving to your institution, it is because they feel your institution is uniquely qualified to carry out the change they’re seeking. Society’s needs, not the institution’s needs, are of greatest importance to this donor. They are not interested in “giving back.” Instead of giving TO institutions, they give THROUGH institutions.

 

This is partly borne out by what many of our organizations are seeing in the Annual Fund: for years now, donor numbers have been trending down, while average gift size has been going up. Donors are being more strategic with their giving, pooling resources and being more deliberate with their dollars.

 

These global shifts are not new, but I don’t think their real impact on the sector has yet been fully realized. Certainly for many of us, our strategies are not keeping pace. Analytics is going to be increasingly important for responding to these global shifts. A few examples follow …

 

In order to move from one-size-fits-all messages and programs, and evolve toward more targeted, relevant opportunities to engage, we need to understand how engaged each individual is right now. So we, along with many other institutions, have developed a means to measure alumni engagement. Every alumnus and alumna has a score that reflects where on the engagement spectrum they are, just as we know where on the donor spectrum they are. With those two pieces of information we can invest more time and money developing opportunities aimed at the upper niche of engaged individuals where it will have the most impact. (See: Why we measure engagement.) We need to engage with them on their own level, not ours, via relevant events and volunteerism. What information, programs, and services do they need, and which connect with their interests and talents?

 

In place of “one size fits all,” engaged alumni need more fulfilling experiences such as guest lecturing, student recruiting, mentorship, and career development and networking for students and new grads. Engagement measurement, then, is really a tool that enables alumni relations to better align itself with the mission of Advancement and the university.

 

Second, we aspire to understand our constituents not just based on their degree or by how much they’ve given, but through their interests and values — data we are just starting to bring together from a variety of sources in order to inform more intelligent segmentation of alumni and donors.

 

Third, we are doing what we can to measure impact of programming and events. We might report that we had 100 events that attracted 10,000 attendees, but why stop there? We should also be able to say we moved 2,000 people, say, to the next level of engagement, or that this or that event inspired 50 people to give. According to research conducted by the Education Advisory Board, a consulting firm, alumni relations does the poorest job of any office on campus in providing hard data on its real contribution to the university’s mission. Too many offices are stuck on tracking activities instead of results and outcomes.
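
As a rough illustration of reporting outcomes rather than activities, the Python sketch below counts how many event attendees moved up an engagement tier between two snapshots, and how many became donors. The tier names, the snapshot format, and the sample data are assumptions invented for the example, not a description of any institution's actual engagement model.

```python
# Hypothetical engagement tiers, ordered from lowest to highest.
TIERS = ["unengaged", "aware", "involved", "committed"]
TIER_RANK = {name: rank for rank, name in enumerate(TIERS)}


def moved_up(before, after, attendees):
    """IDs of attendees whose tier is higher in `after` than in `before`."""
    return sorted(
        pid for pid in attendees
        if pid in before and pid in after
        and TIER_RANK[after[pid]] > TIER_RANK[before[pid]]
    )


def new_donors(attendees, donors_before, donors_after):
    """IDs of attendees who were not donors before but are afterward."""
    return sorted((attendees & donors_after) - donors_before)


# Made-up snapshots taken before and after an event (id -> tier).
before = {1: "aware", 2: "unengaged", 3: "involved", 4: "aware"}
after = {1: "involved", 2: "unengaged", 3: "committed", 4: "aware"}
attendees = {1, 2, 3}

if __name__ == "__main__":
    print("Moved up a tier:", moved_up(before, after, attendees))
    print("New donors:", new_donors(attendees, {3}, {1, 3, 4}))
```

A count like this shows what happened after an event, not that the event caused it; comparing against similar people who did not attend would be a sensible next check.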

 

Wonderful as these examples sound, and as far as we’ve come, we haven’t done everything right. There are areas where I wish we had made more progress, and things I discovered along the way that I wish I’d thought of earlier.

 

We’ve never had a long-range plan for the BI/analytics team. Yes, BI was a component of our three-year strategic plan and we have yearly operational plans, but there was no overarching vision of what the team would finally look like, along the lines of the three-tiered structure I outlined in my previous post. Our growth has been organic, addressing the gaps as we saw them from year to year. Perhaps that’s the right way to grow, especially as employees themselves grow and discover new strengths, but I think in a perfect world we might have had an idea of what the ideal future state would look like.

 

More fundamentally, a major all-at-once investment in rapid growth absolutely requires a plan. The way we did it, each new person who came on had to be somewhat self-sufficient in provisioning themselves with data to analyze, being responsible for transforming it and so on. That’s not the way it is now; as we evolve, positions are becoming more specialized.

 

Second, in hindsight I would have given more thought to how data-informed decisions are made. I mentioned earlier that the Balanced Scorecard exercise for Advancement has provided a main focus for BI, but I can see that it’s not enough. There has to be a framework for prioritizing and directing data-informed decision making below the level addressed by the Scorecard. (I wrote about this in my previous post.) I could have spent some time earlier on thinking about the structure and processes to make that happen.

 

A third thing I wish we had devoted more brainpower to was tackling self-serve list generation. Automation of the generation of lists of contact information for event invitations, solicitations, and so on is surprisingly challenging for a whole host of reasons, and this has prevented us from putting that ability into the hands of users. Had we cracked that nut early on in the journey, we would have freed resources for more interesting work. And more generally, “self-serve” is a cultural shift which takes a lot of time, training, and reinforcement. Even if we had developed a good tool for users to pull their own ready-to-use lists, it would have taken a long time to get people to use it (regardless of what they might say about the idea of it). If you’re considering a big push for self-serve, I would warn you that the payoff will come years, not months, from now.

 

Data-informed decision making in general is a cultural shift; it’s not just a series of technical problems to be solved. Nothing will happen without the technology, to be sure, but the technology enables — it does not drive. You can invest heavily in a BI team and software and still not achieve a state of making decisions informed by data.

 

When I talk about how poorly we did some years ago, that’s not intended as a critique of the people doing the work at that time. Everyone always did the best they could with what they had to work with. In the same way, when I speak with folks from other universities who are struggling with how to make progress in this area, it’s not a lack of will or even skill that I detect: It’s more a lack of clarity about the way forward. It’s rarely obvious how to pull the pieces and people together, but with progress comes momentum. I wish you luck on your own journey!

 


10 July 2017

Analytics as an organizing principle

Filed under: Analytics, Business Intelligence — kevinmacdonell @ 7:51 am

 

I’ve been thinking a lot lately about how an organization gets good at making decisions informed by data. Or, in other words, how to build business intelligence and analytics teams. This preoccupation started with a talk I gave a couple of months ago to a gathering of Advancement leaders from across Canada. I was asked to talk about analytics in general and how our department in particular got to where we are today. Since then, I’ve also spoken to folks from other universities on the same topic.

 

All this talking has been helpful for me in organizing my thoughts, and I’ve come to realize a number of things in retrospect, ways in which we might have evolved more quickly. One of these is a realization about what it means to make data and analytics an “organizing principle.”

 

For my talk in May I was asked to begin with an overview of analytics, so I’ll devote this post to that topic. In a future post, I will share what we learned on our journey.

 

Because analytics is an ever-evolving field, I avoid dictionary-like definitions for analytics. I find it more helpful to talk about what analytics “looks like” in terms of the types of work it consists of, the skill sets of the people doing the work, and the organizational structure of the team (if it’s a team).

 

In my mind, these concepts have resolved into a “triad of threes” … The work itself fits into three tiers, the ideal analytics practitioner is a “triple threat”, and the team is made up of three distinct teams or functions. (If what I’m presenting here is an oversimplification, at least it’s a structurally satisfying one.) What I’m talking about is fairly conventional — I’m not inventing anything — but it’s supported by my own experience.

 

First, the work itself. Analytics practice today works at three distinct levels: Descriptive, predictive, and prescriptive.

 

Descriptive analytics serves the business with information, specifically information about the past, which helps us understand current performance in relation to the past. It attempts to answer the questions, “How have we done?” and “How are we doing now?” This is the realm of reporting and a lot of what is referred to as Business Intelligence. Although this is a starting point for any analytics program, that doesn’t mean it’s easy or that it doesn’t have aspects that are advanced. KPI development, support for performance management, and ad hoc data analyses to answer specific business questions might be included in this tier.

 

Predictive analytics is about predicting the future. Not “the future” in general, but the behaviour of individuals. Predictive modelling is a set of techniques for ranking individuals by their likelihood to engage in some behaviour of interest (making a bequest, becoming a donor, attending an event, etc.). The business goal might be prospect identification, or focusing limited resources to save time or money.

 

And finally, prescriptive analytics provides advice on what action to take to influence a behaviour of interest. While predictive analytics gives us an idea who’s more likely to, say, sign up for a high-end credit card from a financial institution, prescriptive analytics suggests what types of interventions (targeted advertisements, for example) would inspire a customer to actually do it.

 

Prescriptive analytics is the newest type of analytics and the most advanced (I don’t think it’s the same as the A/B testing found in direct marketing), and it is still rare in the nonprofit and advancement sector. I’m using an example from the financial services industry for a reason: my team is just beginning to explore this type of work, and I’m not aware of anyone else doing it. (If you’re reading this a year or two from now, the situation might be different.)

 

If your organization is doing a good job on reporting, business intelligence, predictive modelling, and maybe some forecasting as well — then you’re most likely doing very well in comparison with your peer institutions in terms of function.

 

So much for the work. What about the people?

 

There is a popular notion of what the ideal analytics practitioner looks like in terms of education, work experience, and skills. That person, who might be styled a Data Scientist, is what I have called a “triple threat”: he or she has extensive domain expertise (fundraising, engagement, and/or marketing), a background in computer science (adept at writing scripts in SQL, R, Python, or another language to extract and transform data for analysis and advanced modelling), and mathematics (with an array of advanced statistical methods in his or her toolbox).

 

The problem is, such professionals are both rare and in high demand. You won’t find many of these folks working in our sector — at least not for very long. Their natural habitat is more likely to feature Big Data, not the “little data” we’ve got, and machine learning, rather than our old standbys such as multiple linear regression. I have already elaborated on these points in the blog post I link to above, Mind the data science gap. Suffice to say, we do not currently aspire to hire data scientists.

 

That doesn’t mean the ideal isn’t a useful model, however. When we hire, it makes sense to single out candidates with skills in one of the three areas, and who seem to have some aptitude for picking up skills in complementary areas. The strategy here is not to hire a data scientist, but to grow a reasonable facsimile of one. If you’ve got an employee who has some subject-matter knowledge, has a penchant for self-learning technical skills (on her own time perhaps), is curious about things and diving into the data, and who is a good communicator — such a person will add a lot of value in a BI role.

 

You can have the right people doing the right work, but they need to work in an organizational structure that promotes data-informed decision making. So, the third and final aspect: The organizational structure. There is no one perfect structure, but keeping with the theme of “three,” I think that a three-tier setup makes sense. In a large organization, each tier might be a team. In a smaller organization, each tier might be one person. (If one person is responsible for everything, this “structure” can be thought of as a way to organize or compartmentalize one’s own work.)

 

The first and foundational tier is the Technical Team, consisting of Advancement staff who might be responsible for building and/or maintaining a data warehouse dedicated to Advancement needs, building and maintaining materialized views and data models for use in BI software, developing complex reports and dashboards, integrating internal and external systems and platforms so that data from disparate systems can be merged or federated, and liaising with central IT.

 

This tier sounds very “IT”, but it’s important to recognize that it is distinct from the institution’s centralized IT department, which is responsible for maintaining hardware, servers, and the core database software itself, as well as managing the network and security.

 

So you’re not trying to replicate an IT shop, but you are building a team with specific technical skills. For any higher ed institution in which departments are not supported equally by central IT, having in-house expertise to integrate systems and develop data models tailored to business needs is definitely a key to success. Someone has to supply and support the data infrastructure if central IT is too overtaxed to provide it.

 

The next team is the Analysis Team, the people who build predictive models, define KPIs, do ad hoc analyses, and so on. This team (or person) benefits directly from the work of the Technical Team, freed from having to always extract and transform their own data. While analysis often implies exploration of the raw, unaggregated data, there’s a huge payoff in having a lot of the standard transformations (tedious and repetitive) pushed to the data warehouse level. Analysts add the most value when they’re interacting with clients to define business questions and present results, not struggling yet again with raw, transactional data that could be processed more efficiently and accurately with an ETL tool.

 

In my own workplace, the distinction between these two teams is something of an oversimplification, but it’s roughly analogous.

 

The third team is harder to define, as it may take various forms, depending on the organization. I’ve seen it referred to as the Executive Team, but a better name might be the Analytic Strategy Team or the BI Decision Team. We don’t have a name for it in my workplace, because our department doesn’t have such a group — yet. In fact, this is less a “team” than a solid business process. In any case, I’ve come to think it’s essential for data-informed decision making, and at the heart of analytics as an organizing principle.

 

The Analytic Strategy Team would be a cross-functional team made up of business sponsors (directors and managers of programs and units) and analysts from both the Technical and Analysis teams. In a data-driven organization, this team meets regularly to rank and prioritize analysis projects that have been submitted to the team as requests, called for by department leadership, or generated by the team members themselves. Projects rank higher for being supportive of current strategy, having a high perceived impact, having executive sponsorship, and so on.

 

Prioritizing is not the team’s most important role, however. As the hub of a framework for Advancement decision-making, the Analytic Strategy Team is there to ensure that when a business question is answered through analysis, there will be follow-through. The Team nails down the “why” and “how” of every analysis project: Zeroing in on the real business question that needs to be answered, drafting the general approach to answering the question, and (most critically) determining what actions will be taken if the answer is x, y, or z. Results and recommendations are channeled to a decision maker, who has agreed in advance to the definition of the business question.

 

Ideally, the department’s leadership team approves the ongoing analytics agenda. Having leadership sign off on the list of priorities fosters an integrated approach to making decisions as a whole department.

 

This team is important for focus — analysts do their best work if they can focus — but it’s even more important for driving decisions. Your team can be kept endlessly busy generating analyses, but it’s when it comes to the consequences of analyses that BI programs risk falling flat. Without the accountability implied by an agreed-on process of question, answer, and follow-through, analysts end up floating from one fishing expedition to another, generating “findings” that never get acted on, or fulfilling requests to support program managers’ foregone conclusions with “evidence.”

 

Of course we want to do some purely exploratory analyses without a defined outcome — but that’s not how data-informed decisions get made. As Thomas Davenport has written, “In the traditional analytics world, analysts may have lacked the ability to work closely with decision-makers to frame decisions appropriately, engage stakeholders, and structure decision processes and actions. Decision analysts in a business analytics environment need to move from back-office decision support to front-office decision consultants.”

 

Again I say, these observations about the “third team” are not drawn from my first-hand experience. These are things I’ve come to understand only recently. My naiveté is evident in “Score!” the book I co-authored with Peter Wylie and which was published just two years ago. What we wrote seemed to imply that all it takes is a supportive leader driving change from the top and engaged staff people with an aptitude for data work driving change from the bottom. They would somehow meet in the middle, and magic would happen. Well, we do need both of those forces, but nowadays I don’t see organizational change happening in the absence of a well-functioning business process that guides decision-making.

 

I’ve talked about the people, the types of work they do, and the structure of the team — all from a general perspective. In my next post, I will talk about the journey our own shop has taken towards building a BI/analytics program. Not surprisingly, the real-world program doesn’t arrive as neatly packaged as this general overview would suggest.

 

1 February 2016

Regular-season passing yardage and the NFL playoffs

Filed under: Analytics, Fun, John Sammis, Off on a tangent, Peter Wylie — kevinmacdonell @ 7:37 pm

Guest post by Peter B. Wylie, with John Sammis

 

How much is regular-season passing yardage related to success in the NFL playoffs? (Click link to download .PDF: Passing yardage in the NFL.)

 

Peter was really interested in finding out how strong the relationship might be between an NFL team’s passing during the regular season and its performance in the playoffs. There’s been plenty of talk about this relationship, but he wanted to see for himself.

 

A bit of a departure for CoolData, but still all about data and analysis … hope you enjoy!

 

6 October 2014

Don’t worry, just do it

People trying to learn how to do predictive modelling on the job often need only one thing to get them to the next stage: Some reassurance that what they are doing is valid.

Peter Wylie and I are each just back home, having presented at the fall conference of the Illinois chapter of the Association of Professional Researchers for Advancement (APRA-IL), hosted at Loyola University Chicago. (See photos, below!) Following an entertaining and fascinating look at the current and future state of predictive analytics presented by Josh Birkholz of Bentz Whaley Flessner, Peter and I gave a live demo of working with real data in Data Desk, with the assistance of Rush University Medical Center. We also drew names to give away a few copies of our book, Score! Data-Driven Success for Your Advancement Team.

We were impressed by the variety and quality of questions from attendees, in particular those having to do with stumbling blocks and barriers to progress. It was nice to be able to reassure people that when it comes to predictive modelling, some things aren’t worth worrying about.

Messy data, for example. Some databases, particularly those maintained by nonprofits outside higher ed, have data integrity issues such as duplicate records. It would be a shame, we said, if data analysis were pushed to the back burner just because of a lack of purity in the data. Yes, work on improving data integrity — but don’t assume that you cannot derive valuable insights right now from your messy data.

And then the practice of predictive modelling itself … Oh, there is so much advice out there on the net, some of it highly technical and involving a hundred different advanced techniques. Anyone trying to learn on their own can get stymied, endlessly questioning whether what they’re doing is okay.

For them, our advice was this: In our field, you create value by ranking constituents according to their likelihood to engage in a behaviour of interest (giving, usually), which guides the spending of scarce resources where they will do the most good. You can accomplish this without the use of complex algorithms or arcane math. In fact, simpler models are often better models.

The workhorse tool for this task is multiple linear regression. A very good stand-in for regression is building a simple score using the techniques outlined in Peter’s book, Data Mining for Fundraisers. Sticking to the basics will work very well. Fussing with technical issues or striving for a high degree of accuracy are distractions that the beginner need not be overly concerned with.

If your shop’s current practice is to pick prospects or other targets by throwing darts, then even the crudest model will be an improvement. In many situations, simply performing better than random will be enough to create value. The bottom line: Just do it. Worry about perfection some other day.

If the decisions are high-stakes, if the model will be relied on to guide the deployment of scarce resources, then insert another step in the process. Go ahead and build the model, but don’t use it. Allow enough time of “business as usual” to elapse. Then, gather fresh examples of people who converted to donors, agreed to a bequest, or made a large gift — whatever the behaviour is you’ve tried to predict — and check their scores:

  • If the chart shows these new stars clustered toward the high end of scores, wonderful. You can go ahead and start using the model.
  • If the result is mixed and sort of random-looking, then examine where it failed. Reexamine each predictor you used in the model. Is the historical data in the predictor correlated with the new behaviour? If it isn’t, then the correlation you observed while building the model may have been spurious and led you astray, and that predictor should be excluded. As well, think hard about whether the outcome variable in your model is properly defined: That is, are you targeting the right behaviour? If you are trying to find good prospects for Planned Giving, for example, your outcome variable should focus on that, and not lifetime giving.
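The check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are invented, scores are plain numbers, and deciles are cut by rank within the whole scored population. It counts how many of the new converters land in each score decile, which is the information the chart would show.

```python
from bisect import bisect_right

def decile_cutpoints(all_scores):
    """Nine cutpoints that split the scored population into ten roughly equal groups."""
    ordered = sorted(all_scores)
    n = len(ordered)
    return [ordered[(n * k) // 10] for k in range(1, 10)]

def decile_of(score, cutpoints):
    """Decile 1 (lowest scores) through 10 (highest scores)."""
    return bisect_right(cutpoints, score) + 1

def new_converters_by_decile(all_scores, converter_scores):
    """Count fresh converters in each decile of the original score distribution."""
    cuts = decile_cutpoints(all_scores)
    counts = [0] * 10
    for score in converter_scores:
        counts[decile_of(score, cuts) - 1] += 1
    return counts

# Made-up example: 100 scored constituents, five of whom later converted.
everyone = list(range(100))
converters = [95, 91, 88, 72, 15]
counts = new_converters_by_decile(everyone, converters)
```

If most of the counts pile up in the last few positions of the list, the new stars are clustering at the high end of the scores. A flat or scattered list is the "mixed and sort of random-looking" result.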

“Don’t worry, just do it” sounds like motivational advice, but it’s more than that. The fact is, there is only so much model validation you can do at the time you create the model. Sure, you can hold out a generous number of cases as a validation sample to test your scores with. But experience will show you that your scores will always pass the validation test just fine — and yet the model may still be worthless.

A holdout sample of data that is contemporaneous with that used to train the model is not the same as real results in the future. A better way to go might be to just use all your data to train the model (no holdout sample), which will result in a better model anyway, especially if you’re trying to predict something relatively uncommon like Planned Giving potential. Then, sit tight and observe how it does in production, or how it would have done in production if it had been deployed.

  1. Observe, learn, tweak, and repeat. Errors are hard to avoid, but they can be discovered.
  2. Trust the process, but verify the results. What you’re doing is probably fine. If it isn’t, you’ll get a chance to find out.
  3. Don’t sweat the small stuff. Make a difference now by sticking to basics and thinking of the big picture. You can continue to delve and explore technical refinements and new methods, if that’s where your interest and aptitude take you. Data analysis and predictive modelling are huge subjects — start where you are, where you can make a difference.

* A heartfelt thank you to APRA-IL and all who made our visit such a pleasure, especially Sabine Schuller (The Rotary Foundation), Katie Ingrao and Viviana Ramirez (Rush University Medical Center), Leigh Peterson Visaya (Loyola University Chicago), Beth Witherspoon (Elmhurst College), and Rodney P. Young, Jr. (DePaul University), who took the photos you see below. (See also: APRA IL Fall Conference Datapalooza.)

[Photos from the APRA-IL fall conference, taken by Rodney P. Young, Jr.]

25 June 2014

How our sector is getting its butt kicked by just about everyone

Filed under: Analytics, Data, Off on a tangent, skeptics — kevinmacdonell @ 8:24 pm

There isn’t a lot to do at my wife’s family summer cottage when it rains, especially if I’ve forgotten to bring a book. I find myself scanning the shelves for something — anything — to read. On one such recent rainy weekend, I picked up a book my niece had left on a table. It was a heavy hardcover textbook, and it contained a mild surprise.

What I found was an introduction to such topics as linear and non-linear relationships, probability, scatterplots, best-fit lines, and correlation — concepts that I’ve come to have a deep interest in, mainly because I have profitably put them to work in the service of fundraising and alumni engagement.

Was this a college textbook? A manual for budding data scientists?

No, not at all. My niece is in Grade 9, and this was her mathematics textbook.

I don’t know if the Nova Scotia math curriculum is typical, nor am I qualified to judge the quality of a textbook. And my niece may not be thrilled about learning statistics. But some group of experts in math education apparently believe these concepts are well within the grasp of young Nova Scotian minds. Power to them.

What does this have to do with you? Yes, plastic young minds may grasp with relative ease what we oldsters struggle with (new languages, for example), but we have one distinct advantage. Where adolescents view these concepts as abstractions without a purpose, we may immediately see how we can use them to advance our causes, and our careers.

Yet, we all know otherwise intelligent people in our field whose eyes glaze over when they see a chart or hear anything that sounds like math — even Grade 9 math. Somehow, we must be failing to demonstrate the connection between analytics and success in fundraising and alumni engagement.

So in what fields is analytics really taking root? Well, every field. Including farmers’ fields.

Food production has been a focus of science and statistics for many decades. But today it’s not confined to experimental farms or the labs of agribusiness companies. Real, honest-to-goodness farmers are enthusiastic quants compared to most of us working in the nonprofit sector.

Jennifer Cunningham (@jenlynham) is Senior Director, Metrics and Marketing in the Office of Alumni Affairs at Cornell University. In a recent email to me and my “Score!” co-writer Peter Wylie she writes: “Just gave a talk today at the National Agricultural Alumni and Development Association (NAADA) conference. Went on a hayride this afternoon with the group here at Penn State. The farmers here are using data like you wouldn’t believe. Guys have been farming for 30+ years and they’re going on and on about the importance of measuring input vs output … it’s so interesting to hear these old-school guys go on about the importance of it in their worlds. And yet, some people in our industry, raising billions, still don’t get it?!?!”

It’s a fair observation.

I have a lot of time for people who are not enamoured with analytics due to an unfamiliarity with working with numbers. They require explanations and justifications for using analytical methods. That’s fine. I myself didn’t see math as having much to do with my working life until I entered my mid-thirties, and sometimes I still think the right story beats numbers.

But like our friend Jennifer, I feel less sympathy for ignorance when it’s a deliberate choice. There’s a line where lack of interest in data equates to wilful illiteracy. Someday soon, being on the wrong side of that line is going to disqualify a person from working for important causes.

2 May 2013

New twists on inferring age from first name

Filed under: Analytics, Coolness, Data Desk, Fun — kevinmacdonell @ 6:14 am

Not quite three years ago I blogged about a technique for estimating the age of your database constituents when you don’t have any relevant data such as birth date or class year. It was based on the idea that many first names are typically “young” or “old.” I expanded on the topic in a followup post: Putting an age-guessing trick to the test. Until now, I’ve never had a reason to guess someone’s age — alumni data is pretty well supplied in that department. This very month, though, I have not one but two major modeling projects to work on that involve constituents with very little age data present. I’ve worked out a few improvements to the technique which I will share today.

First, here’s the gist of the basic idea. Picture two women, named Freda and Katelyn. Do you imagine one of them as older than the other? I’m guessing you do. From your own experience, you know that a lot of young women and girls are named Katelyn, and that few if any older women are. Even if you aren’t sure about Freda, you would probably guess she’s older. If you plug these names into babynamewizard.com, you’ll see that Freda was a very popular baby name in the early 1900s, but fell out of the Top 1000 list sometime in the 1980s. On the other hand, Katelyn didn’t enter the Top 1000 until the 1970s and is still popular.

To make use of this information you need to turn it into data. You need to acquire a lot of data on the frequency of first names and how young or old they tend to be. If you work for a university or other school, you’re probably in luck: You might have a lot of birth dates for your alumni or, failing that, you have class years which in most cases will be a good proxy for age. This will be the source you’ll use for guessing the age of everyone else in your database — friends, parents and other person constituents — who don’t have ages. If you have a donor database that contains no age data, you might be able to source age-by-first name data somewhere else.

Back to Freda and Katelyn … when I query our database I find that the average age of constituents named Freda is 69, while the average age for Katelyn is 25. For the purpose of building a model, for anyone named Freda without an age, I will just assume she is 69, and for anyone named Katelyn, 25. It’s as simple as creating a table with two columns (First name and Average age), and matching this to your data file via First Name. My table has more than 13,500 unique first names. Some of these are single initials, and not every person goes by their first name, but that doesn’t necessarily invalidate the average age associated with them.

I’ve tested this method, and it’s an improvement over plugging missing values with an all-database average or median age. For a data set that has no age data at all, it should provide new information that wasn’t there before — information that is probably correlated with behaviours such as giving.
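For anyone who would rather script the lookup table than build it in a query tool, here is a minimal Python sketch of the basic idea. The record layout (dictionaries with "first_name" and "age" keys), the function names and the sample ages are all assumptions made for illustration, not a description of any particular database.

```python
from collections import defaultdict

def average_age_by_first_name(records):
    """Build the two-column lookup: first name -> average known age."""
    totals = defaultdict(lambda: [0.0, 0])  # name -> [sum of ages, count]
    for rec in records:
        if rec.get("age") is not None:
            key = rec["first_name"].strip().lower()
            totals[key][0] += rec["age"]
            totals[key][1] += 1
    return {name: total / count for name, (total, count) in totals.items()}

def fill_missing_ages(records, lookup):
    """Keep real ages; otherwise plug in the average age for that first name."""
    filled = []
    for rec in records:
        age = rec.get("age")
        if age is None:
            age = lookup.get(rec["first_name"].strip().lower())
        filled.append({**rec, "age_estimated": age})
    return filled

# Made-up records for illustration only.
people = [
    {"first_name": "Freda", "age": 70},
    {"first_name": "Freda", "age": 68},
    {"first_name": "Katelyn", "age": 25},
    {"first_name": "Freda", "age": None},
    {"first_name": "Zed", "age": None},
]
lookup = average_age_by_first_name(people)
result = fill_missing_ages(people, lookup)
```

Names with no match in the lookup (the hypothetical "Zed" here) come back with no estimate, so they still need some fallback such as an overall average.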

Now here’s a new wrinkle.

In my first post on this subject, I noted that some of the youngest names in our database are “gender flips.” Some of the more recent popular names used to be associated with the opposite gender decades ago. This seems to be most prevalent with young female names: Ainslie, Isadore, Sydney, Shelly, Brooke. It’s harder to find examples going in the other direction, but there are a few, some of them perhaps having to do with differences in ethnic origin: Kori, Dian, Karen, Shaune, Mina, Marian. In my data I have close to 600 first names that belong to members of both sexes. When I calculate average age by First Name separately for each sex, some names end up with the exact same age for male and female. These names have an androgynous quality to them: Lyndsay, Riley, Jayme, Jesse, Jody. At the other extreme are the names that have definitely flipped gender, which I’ve already given examples of … one of the largest differences being for Ainslie. The average male named Ainslie is 54 years older than the average female of the same name. (In my data, that is.)

These differences suggest an improvement to our age-inferring method: Matching on not just First Name, but Sex as well. Although only 600 of my names are double-gendered, they include many popular names, so that they actually represent almost one-quarter of all constituents.
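Matching on both fields only changes the key of the lookup table. The sketch below assumes records are dictionaries with "first_name", "sex" and "age" keys and that a missing sex is treated as 'N'. The names and ages are invented, chosen so the gap happens to come out at 54 years.

```python
from collections import defaultdict

def average_age_by_name_and_sex(records):
    """Lookup keyed on (first name, sex) instead of first name alone."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        if rec.get("age") is None:
            continue
        key = (rec["first_name"].strip().lower(), rec.get("sex") or "N")
        sums[key] += rec["age"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Made-up records for illustration only.
sample = [
    {"first_name": "Ainslie", "sex": "M", "age": 80},
    {"first_name": "Ainslie", "sex": "M", "age": 76},
    {"first_name": "Ainslie", "sex": "F", "age": 24},
]
by_name_sex = average_age_by_name_and_sex(sample)
gap = by_name_sex[("ainslie", "M")] - by_name_sex[("ainslie", "F")]
```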

Now here’s another wrinkle.

When we’re dealing with constituents who aren’t alumni, we may be missing certain personal information such as Sex. If we plan to match on Sex as well as First Name, we’ve got a problem. If Name Prefix is present, we can infer Sex from whether it’s Mr., Ms., etc., but unless the person doing the data entry was having an off day, this shouldn’t be an avenue available to us — it should already be filled in. (If you know it’s “Mrs.,” then why not put in F for Sex?) For those records that have no Sex recorded (or that have a Sex of ‘N’), we need to make a guess. To do so, we return to our First Names query and the Sex data we do have.

In my list of 600 first names that are double-gendered, not many are actually androgynous. We have females named John and Peter, and we have males named Mary and Laura, but we all know that given any one person named John, chances are we’re talking about a male person. Mary is probably female. These may be coding errors or they may be genuine, but in any case we can use majority usage to help us decide. We’ll sometimes get it wrong — there are indeed boys named Sue — but if you have 7,000 Johns in your database and only five of them are female, then let’s assume (just for the convenience of data mining*) that all Johns are male.

So: Query your database to retrieve every first name that has a Sex code, and count up the instances of each. The default sex for each first name is decided by the highest count, male or female. To get a single variable for this, I subtract the number of females from the number of males for each first name. Since the result is positive for males and negative for females, I call it a “Maleness Score” — but you can do the reverse and call it a Femaleness Score if you wish! Results of zero are considered ties, or ‘N’.
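The counting step is simple enough to sketch in Python. The record layout, the function names and the 'M' / 'F' / 'N' codes are assumptions for illustration, and the sample counts are made up.

```python
from collections import Counter

def maleness_scores(records):
    """Males minus females for each first name; only M and F codes are counted."""
    scores = Counter()
    for rec in records:
        name = rec["first_name"].strip().lower()
        if rec.get("sex") == "M":
            scores[name] += 1
        elif rec.get("sex") == "F":
            scores[name] -= 1
    return dict(scores)

def default_sex(score):
    """Positive means mostly male, negative mostly female, zero is a tie ('N')."""
    if score > 0:
        return "M"
    if score < 0:
        return "F"
    return "N"

# Made-up records for illustration only.
sample = (
    [{"first_name": "John", "sex": "M"}] * 3
    + [{"first_name": "John", "sex": "F"}]
    + [{"first_name": "Mary", "sex": "F"}] * 2
    + [{"first_name": "Riley", "sex": "M"}, {"first_name": "Riley", "sex": "F"}]
)
scores = maleness_scores(sample)
```

One design point to be aware of: a name that appears only on records coded 'N' never enters the table in this sketch, so it has no score at all rather than a score of zero.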

At this point we’ve introduced a bit of circularity. For any person missing Age and Sex, first we have to guess their sex based on the majority code assigned to that person’s first name, and then go back to the same data to grab the Age that matches up with Name and Sex. Clearly we are going to get it very wrong for a lot of records. You can’t expect these guesses to hold up as well as true age data. Overall, though, there should be some signal in all that noise … if your model believes that “Edgar” is male and 72 years of age, and that “Brittany” is female and 26, well, that’s not unreasonable and it’s probably not far from the truth.

How do we put this all together? I build my models in Data Desk, so I need to get all these elements into my data file as individual variables. You can do this any way that works for you, but I use our database querying software (Hyperion Brio). I import the data into Brio as locally-saved tab-delimited files and join them up as you see below. The left table is my modeling data (or at least the part of it that holds First Name), and the two tables on the right hold the name-specific ages and sexes from all the database records that have this information available. I left-join each of these tables on the First Name field.

[Screenshot: the modeling data table left-joined to the two first-name tables in Brio]

When I process the query, I get one row per ID with the fields from the left-hand table, plus the fields I need from the two tables on the right: the so-called Maleness Score, Female Avg Age by FName, Male Avg Age by Fname, and N Avg Age by Fname. I can now paste these as new variables into Data Desk. I still have work to do, though: I do have a small amount of “real” age data that I don’t want to overwrite, and not every First Name has a match in the alumni database. I have to figure out what I have, what I don’t have, and what I’m going to do to get a real or estimated age plugged in for every single record. I write an expression called Age Estimated to choose an age based on a hierarchical set of IF statements. The text of my expression is below — I will explain it in plain English following the expression.

if len('AGE')>0 then 'AGE'

else if textof('SEX')="M" and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="M" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="M" and len('F avg age by Fname')>0 then 'F avg age by Fname'

else if textof('SEX')="F" and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="F" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="F" and len('M avg age by Fname')>0 then 'M avg age by Fname'

else if textof('SEX')="N" and 'Maleness score'>0 and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'<0 and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'=0 and len('N avg age by Fname')>0 then 'N avg age by Fname'

else if len('N avg age by Fname')>0 then 'N avg age by Fname'
else if len('F avg age by Fname')>0 then 'F avg age by Fname'
else if len('M avg age by Fname')>0 then 'M avg age by Fname'

else 49

Okay … here’s what the expression actually does, going block by block through the statements:

  1. If Age is already present, then use that — done.
  2. Otherwise, if Sex is male, and the average male age is available, then use that. If there’s no average male age, then use the ‘N’ age, and if that’s not available, use the female average age … we can hope it’s better than no age at all.
  3. Otherwise if Sex is female, and the average female age is available, then use that. Again, go with any other age that’s available.
  4. Otherwise if Sex is ‘N’, and the Fname is likely male (according to the so-called Maleness Score), then use the male average age, if it’s available. Or if the first name is probably female, use the female average age. Or if the name is tied male-female, use the ‘N’ average age.
  5. Otherwise, as it appears we don’t have anything much to go on, just use any available average age associated with that first name: ‘N’, female, or male.
  6. And finally, if all else fails (which it does for about 6% of my file, or 7,000 records), just plug in the average age of every constituent in the database who has an age, which in our case is 49. This number will vary depending on the composition of your actual data file — if it’s all Parents, for example, then calculate the average of Parents’ known ages, excluding other constituent types.
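For readers who do not use Data Desk, the same hierarchy can be rendered roughly in Python. This is a sketch, not a drop-in replacement for the expression: the function and parameter names are invented, missing values are represented as None, and a missing Maleness Score is treated as "no preference" (an assumption, since the expression does not say what happens in that case).

```python
def estimate_age(age, sex, maleness, m_avg, f_avg, n_avg, fallback=49):
    """Pick a real or estimated age using the same order of preference."""
    if age is not None:                      # 1. a real age always wins
        return age

    if sex == "M":                           # 2. male: male, then N, then female
        order = [m_avg, n_avg, f_avg]
    elif sex == "F":                         # 3. female: female, then N, then male
        order = [f_avg, n_avg, m_avg]
    elif sex == "N" and maleness is not None:
        if maleness > 0:                     # 4. 'N': follow the Maleness Score
            order = [m_avg]
        elif maleness < 0:
            order = [f_avg]
        else:
            order = [n_avg]
    else:
        order = []

    order += [n_avg, f_avg, m_avg]           # 5. otherwise any available average
    for candidate in order:
        if candidate is not None:
            return candidate
    return fallback                          # 6. overall average (49 in this file)

# Made-up values: Sex is 'N', the name leans female, so the female average is used.
example = estimate_age(None, "N", -3, m_avg=60, f_avg=30, n_avg=45)
```

The fallback of 49 is specific to the file described here and would need to be recalculated for any other data set.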

When I bin the cases into 20 roughly equal groups by Estimated Age, I see that the percentage of cases that have some giving history starts very low (about 3 percent for the youngest group), rises rapidly to more than 10 percent, and then gradually rises to almost 18 percent for the oldest group. That’s heading in the right direction at least. As well, being in the oldest 5% is also very highly correlated with Lifetime Giving, which is what we would expect from a donor data set containing true ages.
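The binning check can be reproduced outside Data Desk with something like the following Python sketch. The function name and the input layout (pairs of estimated age and a has-given flag) are assumptions, and the example uses four bins on invented data just to keep it readable.

```python
def giving_rate_by_bin(cases, n_bins=20):
    """Share of cases with any giving, for rank-based bins of estimated age.

    cases is a list of (estimated_age, has_given) pairs.
    """
    ordered = sorted(cases, key=lambda case: case[0])
    n = len(ordered)
    rates = []
    for b in range(n_bins):
        chunk = ordered[(n * b) // n_bins:(n * (b + 1)) // n_bins]
        if chunk:
            rates.append(sum(1 for _, gave in chunk if gave) / len(chunk))
        else:
            rates.append(None)  # fewer cases than bins
    return rates

# Made-up data: 40 cases aged 20 to 59, where only those 50 and over have given.
cases = [(age, age >= 50) for age in range(20, 60)]
rates = giving_rate_by_bin(cases, n_bins=4)
```

Because the bins are cut by rank, cases with the same estimated age can land on either side of a boundary. With thousands of records per bin that should matter little, but it is worth keeping in mind for small files.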

[Chart: percentage of cases with any giving history, by Estimated Age in 20 roughly equal groups]

This is a bit of work, and probably the gain will be marginal a lot of the time. Data on real interactions that showed evidence of engagement would be superior to age-guessing, but when data is scarce a bit of added lift can’t hurt. If you’re concerned about introducing too much noise, then build models with and without Estimated Age, and evaluate them against each other. If your software offers multiple imputation for missing data as a feature, try checking that out … what I’m doing here is a very manual form of imputation (strictly speaking single rather than multiple imputation, since each missing value gets one plugged-in estimate): calculating plausible values for missing data based on the values of other variables. Be careful, though: A good predictor of Age happens to be Lifetime Giving, and if your aim is to predict Giving, I should think there’s a risk your model will suffer from feedback.

* One final note …

Earlier on I mentioned assuming someone is male or female “just for the convenience of data mining.”  In our databases (and in a conventional, everyday sense too), we group people in various ways — sex, race, creed. But these categories are truly imperfect summaries of reality. (Some more imperfect than others!) A lot of human diversity is not captured in data, including things we formerly thought of as clear-cut. Sex seems conveniently binary, but in reality it is multi-category, or maybe it’s a continuous variable. (Or maybe it’s too complex for a single variable.) In real life I don’t assume that when someone in the Registrar’s Office enters ‘N’ for Sex that the student’s data is merely missing. Because the N category is still such a small slice of the population I might treat it as missing, or reapportion it to either Male or Female as I do here. But that’s strictly for predictive modeling. It’s not a statement about transgendered or differently gendered people nor an opinion about where they “belong.”
