CoolData blog

10 July 2017

Analytics as an organizing principle

Filed under: Analytics, Business Intelligence — kevinmacdonell @ 7:51 am

I’ve been thinking a lot lately about how an organization gets good at making decisions informed by data. Or, in other words, how to build business intelligence and analytics teams. This preoccupation started with a talk I gave a couple of months ago to a gathering of Advancement leaders from across Canada. I was asked to talk about analytics in general and how our department in particular got to where we are today. Since then, I’ve also spoken to folks from other universities on the same topic.

All this talking has been helpful for me in organizing my thoughts, and I’ve come to realize a number of things in retrospect, ways in which we might have evolved more quickly. One of these is a realization about what it means to make data and analytics an “organizing principle.”

For my talk in May I was asked to begin with an overview of analytics, so I’ll devote this post to that topic. In a future post, I will share what we learned on our journey.

Because analytics is an ever-evolving field, I avoid dictionary-like definitions for analytics. I find it more helpful to talk about what analytics “looks like” in terms of the types of work it consists of, the skill sets of the people doing the work, and the organizational structure of the team (if it’s a team).

In my mind, these concepts have resolved into a “triad of threes” … The work itself fits into three tiers, the ideal analytics practitioner is a “triple threat”, and the team is made up of three distinct teams or functions. (If what I’m presenting here is an oversimplification, at least it’s a structurally satisfying one.) What I’m talking about is fairly conventional — I’m not inventing anything — but it’s supported by my own experience.

First, the work itself. Analytics practice today works at three distinct levels: Descriptive, predictive, and prescriptive.

Descriptive analytics serves the business with information, specifically information about the past, which helps us understand current performance in relation to the past. It attempts to answer the questions, “How have we done?” and “How are we doing now?” This is the realm of reporting and a lot of what is referred to as Business Intelligence. Although this is a starting point for any analytics program, that doesn’t mean it’s easy or that it doesn’t have aspects that are advanced. KPI development, support for performance management, and ad hoc data analyses to answer specific business questions might be included in this tier.

Predictive analytics is about predicting the future. Not “the future” in general, but the behaviour of individuals. Predictive modelling is a set of techniques for ranking individuals by their likelihood to engage in some behaviour of interest (making a bequest, becoming a donor, attending an event, etc.). The business goal might be prospect identification, or focusing limited resources to save time or money.
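To make the ranking idea concrete, here is a minimal sketch in Python (pandas and scikit-learn). Everything in it, from the file name to the predictors to the is_donor flag, is a hypothetical stand-in for whatever your own database holds:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical extract: one row per constituent, a 0/1 flag for the
# behaviour of interest, and a few candidate predictors.
df = pd.read_csv("constituents.csv")

X = df[["years_since_grad", "event_attendance", "email_present"]]
y = df["is_donor"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# Rank everyone by predicted likelihood, highest first.
df["score"] = model.predict_proba(X)[:, 1]
print(df.sort_values("score", ascending=False).head(20))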

And finally, prescriptive analytics provides advice on what action to take to influence a behaviour of interest. While predictive analytics gives us an idea of who’s more likely to, say, sign up for a high-end credit card from a financial institution, prescriptive analytics suggests which interventions (targeted advertisements, for example) would inspire a customer to actually do it.
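Prescriptive work can take several forms; one common technique, and my choice for this sketch rather than anything this post prescribes, is uplift modelling. It assumes you have historical data on who received an intervention and who responded; all names are hypothetical:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("campaign_history.csv")  # hypothetical extract
features = ["age", "prior_gifts", "engagement_score"]

# Fit one response model for people who saw the ad, another for those who didn't.
treated = df[df["got_ad"] == 1]
control = df[df["got_ad"] == 0]
m1 = LogisticRegression(max_iter=1000).fit(treated[features], treated["signed_up"])
m0 = LogisticRegression(max_iter=1000).fit(control[features], control["signed_up"])

# The difference in predicted response estimates each person's lift from the
# intervention; target the ad where that lift is largest.
df["uplift"] = m1.predict_proba(df[features])[:, 1] - m0.predict_proba(df[features])[:, 1]
print(df.sort_values("uplift", ascending=False).head(10))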

Prescriptive analytics is the newest type of analytics and the most advanced — I don’t think it’s the same as A/B testing found in direct marketing — and still rare in the nonprofit and advancement sector. I’m using an example from the financial services industry for a reason: my team is just beginning to explore this type of work, and I’m not aware of anyone else doing it. (If you’re reading this in a year or two from now, the situation might be different.)

If your organization is doing a good job on reporting, business intelligence, predictive modelling, and maybe some forecasting as well, then you’re most likely doing very well in comparison with your peer institutions in terms of function.

So much for the work. What about the people?

There is a popular notion of what the ideal analytics practitioner looks like in terms of education, work experience, and skills. That person, who might be styled a Data Scientist, is what I have called a “triple threat” — he or she has extensive domain expertise (fundraising, engagement, and/or marketing), a background in computer science (adept at writing scripts in SQL, R, Python, or another language to extract and transform data for analysis and advanced modelling), and mathematics (with an array of advanced statistical methods in his or her toolbox).

The problem is, such professionals are both rare and in high demand. You won’t find many of these folks working in our sector — at least not for very long. Their natural habitat is more likely to feature Big Data, not the “little data” we’ve got, and machine learning, rather than our old standbys such as multiple linear regression. I have already elaborated on these points in the blog post I link to above, Mind the data science gap. Suffice to say, we do not currently aspire to hire data scientists.

That doesn’t mean the ideal isn’t a useful model, however. When we hire, it makes sense to single out candidates with skills in one of the three areas, and who seem to have some aptitude for picking up skills in complementary areas. The strategy here is not to hire a data scientist, but to grow a reasonable facsimile of one. If you’ve got an employee who has some subject-matter knowledge, has a penchant for teaching herself technical skills (on her own time, perhaps), is curious and eager to dive into the data, and is a good communicator — such a person will add a lot of value in a BI role.

You can have the right people doing the right work, but they need to work in an organizational structure that promotes data-informed decision making. So, the third and final aspect: The organizational structure. There is no one perfect structure, but keeping with the theme of “three,” I think that a three-tier setup makes sense. In a large organization, each tier might be a team. In a smaller organization, each tier might be one person. (If one person is responsible for everything, this “structure” can be thought of as a way to organize or compartmentalize one’s own work.)

The first and foundational tier is the Technical Team, consisting of Advancement staff who might be responsible for building and/or maintaining a data warehouse dedicated to Advancement needs, building and maintaining materialized views and data models for use in BI software, developing complex reports and dashboards, integrating internal and external systems and platforms so that data from disparate systems can be merged or federated, and liaising with central IT.

This tier sounds very “IT”, but it’s important to recognize that it is distinct from the institution’s centralized IT department, which is responsible for maintaining hardware, servers, and the core database software itself, as well as managing the network and security.

So you’re not trying to replicate an IT shop, but you are building a team with specific technical skills. For any higher ed institution in which departments are not supported equally by central IT, having in-house expertise to integrate systems and develop data models tailored to business needs is definitely a key to success. Someone has to supply and support the data infrastructure if central IT is too overtaxed to provide it.

The next team is the Analysis Team, the people who build predictive models, define KPIs, do ad hoc analyses, and so on. This team (or person) benefits directly from the work of the Technical Team, freed from having to always extract and transform their own data. While analysis often implies exploration of the raw, unaggregated data, there’s a huge payoff in having a lot of the standard transformations (tedious and repetitive) pushed to the data warehouse level. Analysts add the most value when they’re interacting with clients to define business questions and present results, not struggling yet again with raw, transactional data that could be processed more efficiently and accurately with an ETL tool.

In my own workplace, the distinction between these two teams is something of an oversimplification, but it roughly describes how we divide the work.

The third team is harder to define, as it may take various forms, depending on the organization. I’ve seen it referred to as the Executive Team, but a better name might be the Analytic Strategy Team or the BI Decision Team. We don’t have a name for it in my workplace, because our department doesn’t have such a group — yet. In fact, this is less a “team” than a solid business process. In any case, I’ve come to think it’s essential for data-informed decision making, and at the heart of analytics as an organizing principle.

The Analytic Strategy Team would be a cross-functional team made up of business sponsors (directors and managers of programs and units) and analysts from both the Technical and Analysis teams. In a data-driven organization, this team meets regularly to rank and prioritize analysis projects that have been submitted to the team as requests, called for by department leadership, or generated by the team members themselves. Projects rank higher for being supportive of current strategy, having a high perceived impact, having executive sponsorship, and so on.
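As a toy illustration only (the criteria and weights below are hypothetical, not a prescribed rubric), such a ranking could be as simple as a weighted score:

# Hypothetical weighted scoring of analysis requests (Python).
WEIGHTS = {"supports_strategy": 3, "perceived_impact": 2, "exec_sponsor": 2}

requests = [
    {"name": "Bequest likelihood model", "supports_strategy": 1, "perceived_impact": 1, "exec_sponsor": 1},
    {"name": "Event dashboard refresh", "supports_strategy": 1, "perceived_impact": 0, "exec_sponsor": 0},
]

# Each criterion is a 0/1 judgment; the weights express relative importance.
for r in requests:
    r["priority"] = sum(weight * r[criterion] for criterion, weight in WEIGHTS.items())

for r in sorted(requests, key=lambda r: r["priority"], reverse=True):
    print(r["priority"], r["name"])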

Prioritizing is not the team’s most important role, however. As the hub of a framework for Advancement decision-making, the Analytic Strategy Team is there to ensure that when a business question is answered through analysis, there will be follow-through. The Team nails down the “why” and “how” of every analysis project: Zeroing in on the real business question that needs to be answered, drafting the general approach to answering the question, and (most critically) determining what actions will be taken if the answer is x, y, or z. Results and recommendations are channeled to a decision maker, who has agreed in advance to the definition of the business question.

Ideally, the department’s leadership team approves the ongoing analytics agenda. Having leadership sign off on the list of priorities fosters an integrated approach to making decisions as a whole department.

This team is important for focus — analysts do their best work if they can focus — but it’s even more important for driving decisions. Your team can be kept endlessly busy generating analyses, but it’s when it comes to the consequences of analyses that BI programs risk falling flat. Without the accountability implied by an agreed-on process of question, answer, and follow-through, analysts end up floating from one fishing expedition to another, generating “findings” that never get acted on, or fulfilling requests to support program managers’ foregone conclusions with “evidence.”

Of course we want to do some purely exploratory analyses without a defined outcome — but that’s not how data-informed decisions get made. As Thomas Davenport has written, “In the traditional analytics world, analysts may have lacked the ability to work closely with decision-makers to frame decisions appropriately, engage stakeholders, and structure decision processes and actions. Decision analysts in a business analytics environment need to move from back-office decision support to front-office decision consultants.”

Again I say, these observations about the “third team” are not drawn from my first-hand experience. These are things I’ve come to understand only recently. My naiveté is evident in “Score!” the book I co-authored with Peter Wylie and which was published just two years ago. What we wrote seemed to imply that all it takes is a supportive leader driving change from the top and engaged staff people with an aptitude for data work driving change from the bottom. They would somehow meet in the middle, and magic would happen. Well, we do need both of those forces, but nowadays I don’t see organizational change happening in the absence of a well-functioning business process that guides decision-making.

I’ve talked about the people, the types of work they do, and the structure of the team — all from a general perspective. In my next post, I will talk about the journey our own shop has taken towards building a BI/analytics program. Not surprisingly, the real-world program doesn’t arrive as neatly packaged as this general overview would suggest.

1 February 2016

Regular-season passing yardage and the NFL playoffs

Filed under: Analytics, Fun, John Sammis, Off on a tangent, Peter Wylie — kevinmacdonell @ 7:37 pm

Guest post by Peter B. Wylie, with John Sammis

How much is regular-season passing yardage related to success in the NFL playoffs? (Click link to download .PDF: Passing yardage in the NFL.)

Peter was really interested in finding out how strong the relationship might be between an NFL team’s passing during the regular season and its performance in the playoffs. There’s been plenty of talk about this relationship, but he wanted to see for himself.

A bit of a departure for CoolData, but still all about data and analysis … hope you enjoy!

6 October 2014

Don’t worry, just do it

People trying to learn how to do predictive modelling on the job often need only one thing to get them to the next stage: Some reassurance that what they are doing is valid.

Peter Wylie and I are each just back home, having presented at the fall conference of the Illinois chapter of the Association of Professional Researchers for Advancement (APRA-IL), hosted at Loyola University Chicago. Following an entertaining and fascinating look at the current and future state of predictive analytics presented by Josh Birkholz of Bentz Whaley Flessner, Peter and I gave a live demo of working with real data in Data Desk, with the assistance of Rush University Medical Center. We also drew names to give away a few copies of our book, Score! Data-Driven Success for Your Advancement Team.

We were impressed by the variety and quality of questions from attendees, in particular those having to do with stumbling blocks and barriers to progress. It was nice to be able to reassure people that when it comes to predictive modelling, some things aren’t worth worrying about.

Messy data, for example. Some databases, particularly those maintained by nonprofits outside higher ed, have data integrity issues such as duplicate records. It would be a shame, we said, if data analysis were pushed to the back burner just because of a lack of purity in the data. Yes, work on improving data integrity — but don’t assume that you cannot derive valuable insights right now from your messy data.

And then the practice of predictive modelling itself … Oh, there is so much advice out there on the net, some of it highly technical and involving a hundred different advanced techniques. Anyone trying to learn on their own can get stymied, endlessly questioning whether what they’re doing is okay.

For them, our advice was this: In our field, you create value by ranking constituents according to their likelihood to engage in a behaviour of interest (giving, usually), which guides the spending of scarce resources where they will do the most good. You can accomplish this without the use of complex algorithms or arcane math. In fact, simpler models are often better models.

The workhorse tool for this task is multiple linear regression. A very good stand-in for regression is building a simple score using the techniques outlined in Peter’s book, Data Mining for Fundraisers. Sticking to the basics will work very well. Fussing with technical issues or striving for a high degree of accuracy are distractions that the beginner need not be overly concerned with.
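For instance, a bare-bones additive score in the spirit of Peter’s approach, one point per positive indicator, takes only a few lines of Python. The indicators below are hypothetical examples, not a recipe:

import pandas as pd

df = pd.read_csv("constituents.csv")  # hypothetical extract

# One point for each indicator associated with the behaviour of interest.
df["simple_score"] = (
    (df["home_phone_present"] == 1).astype(int)
    + (df["email_present"] == 1).astype(int)
    + (df["event_attended"] == 1).astype(int)
    + (df["years_of_giving"] > 0).astype(int)
)

# Sanity check: the donor rate should climb as the score climbs.
print(df.groupby("simple_score")["is_donor"].mean())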

If your shop’s current practice is to pick prospects or other targets by throwing darts, then even the crudest model will be an improvement. In many situations, simply performing better than random will be enough to create value. The bottom line: Just do it. Worry about perfection some other day.

If the decisions are high-stakes, if the model will be relied on to guide the deployment of scarce resources, then insert another step in the process. Go ahead and build the model, but don’t use it. Allow enough time of “business as usual” to elapse. Then, gather fresh examples of people who converted to donors, agreed to a bequest, or made a large gift — whatever behaviour you’ve tried to predict — and check their scores (a code sketch of this backtest follows the list):

  • If the chart shows these new stars clustered toward the high end of scores, wonderful. You can go ahead and start using the model.
  • If the result is mixed and sort of random-looking, then examine where it failed. Reexamine each predictor you used in the model. Is the historical data in the predictor correlated with the new behaviour? If it isn’t, then the correlation you observed while building the model may have been spurious and led you astray, and should be excluded. As well, think hard about whether the outcome variable in your model is properly defined: That is, are you targeting for the right behaviour? If you are trying to find good prospects for Planned Giving, for example, your outcome variable should focus on that, and not lifetime giving.
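Here is what that backtest might look like in Python. It’s a sketch only, assuming a scored file and a list of IDs for people who exhibited the behaviour after the model was built (both hypothetical):

import pandas as pd

scored = pd.read_csv("scored_constituents.csv")   # hypothetical: id, score
converts = pd.read_csv("new_converts.csv")["id"]  # ids converted since scoring

# Cut the scored file into deciles (1 = lowest) and see where converts landed.
scored["decile"] = pd.qcut(scored["score"], 10, labels=False, duplicates="drop") + 1
landed = scored[scored["id"].isin(converts)]

# A healthy model puts most of the converts in the top deciles.
print(landed["decile"].value_counts().sort_index())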

“Don’t worry, just do it” sounds like motivational advice, but it’s more than that. The fact is, there is only so much model validation you can do at the time you create the model. Sure, you can hold out a generous number of cases as a validation sample to test your scores with. But experience will show you that your scores will always pass the validation test just fine — and yet the model may still be worthless.

A holdout sample of data that is contemporaneous with that used to train the model is not the same as real results in the future. A better way to go might be to just use all your data to train the model (no holdout sample), which will result in a better model anyway, especially if you’re trying to predict something relatively uncommon like Planned Giving potential. Then, sit tight and observe how it does in production, or how it would have done in production if it had been deployed.

  1. Observe, learn, tweak, and repeat. Errors are hard to avoid, but they can be discovered.
  2. Trust the process, but verify the results. What you’re doing is probably fine. If it isn’t, you’ll get a chance to find out.
  3. Don’t sweat the small stuff. Make a difference now by sticking to basics and thinking of the big picture. You can continue to delve and explore technical refinements and new methods, if that’s where your interest and aptitude take you. Data analysis and predictive modelling are huge subjects — start where you are, where you can make a difference.

A heartfelt thank you to APRA-IL and all who made our visit such a pleasure, especially Sabine Schuller (The Rotary Foundation), Katie Ingrao and Viviana Ramirez (Rush University Medical Center), Leigh Peterson Visaya (Loyola University Chicago), Beth Witherspoon (Elmhurst College), and Rodney P. Young, Jr. (DePaul University), who took photos at the event. (See also: APRA IL Fall Conference Datapalooza.)


25 June 2014

How our sector is getting its butt kicked by just about everyone

Filed under: Analytics, Data, Off on a tangent, skeptics — kevinmacdonell @ 8:24 pm

There isn’t a lot to do at my wife’s family summer cottage when it rains, especially if I’ve forgotten to bring a book. I find myself scanning the shelves for something — anything — to read. On one such recent rainy weekend, I picked up a book my niece had left on a table. It was a heavy hardcover textbook, and it contained a mild surprise.

What I found was an introduction to such topics as linear and non-linear relationships, probability, scatterplots, best-fit lines, and correlation — concepts that I’ve come to have a deep interest in, mainly because I have profitably put them to work in the service of fundraising and alumni engagement.

Was this a college textbook? A manual for budding data scientists?

No, not at all. My niece is in Grade 9, and this was her mathematics textbook.

I don’t know if the Nova Scotia math curriculum is typical, nor am I qualified to judge the quality of a textbook. And my niece may not be thrilled about learning statistics. But some group of experts in math education apparently believe these concepts are well within the grasp of young Nova Scotian minds. Power to them.

What does this have to do with you? Yes, plastic young minds may grasp with relative ease what we oldsters struggle with (new languages, for example), but we have one distinct advantage. Where adolescents view these concepts as abstractions without a purpose, we may immediately see how we can use them to advance our causes, and our careers.

Yet, we all know otherwise intelligent people in our field whose eyes glaze over when they see a chart or hear anything that sounds like math — even Grade 9 math. Somehow, we must be failing to demonstrate the connection between analytics and success in fundraising and alumni engagement.

So in what fields is analytics really taking root? Well, every field. Including farmers’ fields.

Food production has been a focus of science and statistics for many decades. But today it’s not confined to experimental farms or the labs of agribusiness companies. Real, honest-to-goodness farmers are enthusiastic quants compared to most of us working in the nonprofit sector.

Jennifer Cunningham (@jenlynham) is Senior Director, Metrics and Marketing in the Office of Alumni Affairs at Cornell University. In a recent email to me and my “Score!” co-author Peter Wylie, she writes: “Just gave a talk today at the National Agricultural Alumni and Development Association (NAADA) conference. Went on a hayride this afternoon with the group here at Penn State. The farmers here are using data like you wouldn’t believe. Guys have been farming for 30+ years and they’re going on and on about the importance of measuring input vs output … it’s so interesting to hear these old-school guys go on about the importance of it in their worlds. And yet, some people in our industry, raising billions, still don’t get it?!?!”

It’s a fair observation.

I have a lot of time for people who are not enamoured with analytics due to an unfamiliarity with working with numbers. They require explanations and justifications for using analytical methods. That’s fine. I myself didn’t see math as having much to do with my working life until I entered my mid-thirties, and sometimes I still think the right story beats numbers.

But like our friend Jennifer, I feel less sympathy for ignorance when it’s a deliberate choice. There’s a line where lack of interest in data equates to wilful illiteracy. Someday soon, being on the wrong side of that line is going to disqualify a person from working for important causes.

2 May 2013

New twists on inferring age from first name

Filed under: Analytics, Coolness, Data Desk, Fun — kevinmacdonell @ 6:14 am

Not quite three years ago I blogged about a technique for estimating the age of your database constituents when you don’t have any relevant data such as birth date or class year. It was based on the idea that many first names are typically “young” or “old.” I expanded on the topic in a followup post: Putting an age-guessing trick to the test. Until now, I’ve never had a reason to guess someone’s age — alumni data is pretty well supplied in that department. This very month, though, I have not one but two major modeling projects to work on that involve constituents with very little age data present. I’ve worked out a few improvements to the technique which I will share today.

First, here’s the gist of the basic idea. Picture two women, named Freda and Katelyn. Do you imagine one of them as older than the other? I’m guessing you do. From your own experience, you know that a lot of young women and girls are named Katelyn, and that few if any older women are. Even if you aren’t sure about Freda, you would probably guess she’s older. If you plug these names into babynamewizard.com, you’ll see that Freda was a very popular baby name in the early 1900s, but fell out of the Top 1000 list sometime in the 1980s. On the other hand, Katelyn didn’t enter the Top 1000 until the 1970s and is still popular.

To make use of this information you need to turn it into data. You need to acquire a lot of data on the frequency of first names and how young or old they tend to be. If you work for a university or other school, you’re probably in luck: You might have a lot of birth dates for your alumni or, failing that, you have class years which in most cases will be a good proxy for age. This will be the source you’ll use for guessing the age of everyone else in your database — friends, parents and other person constituents — who don’t have ages. If you have a donor database that contains no age data, you might be able to source age-by-first name data somewhere else.

Back to Freda and Katelyn … when I query our database I find that the average age of constituents named Freda is 69, while the average age for Katelyn is 25. For the purpose of building a model, for anyone named Freda without an age, I will just assume she is 69, and for anyone named Katelyn, 25. It’s as simple as creating a table with two columns (First name and Average age), and matching this to your data file via First Name. My table has more than 13,500 unique first names. Some of these are single initials, and not every person goes by their first name, but that doesn’t necessarily invalidate the average age associated with them.
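In pandas, building and applying that lookup table takes only a few lines. A sketch, with hypothetical file and column names:

import pandas as pd

known = pd.read_csv("constituents_with_age.csv")  # first_name, age
unknown = pd.read_csv("constituents_no_age.csv")  # id, first_name

# Average age by first name, computed from the records where age is known ...
avg_age = (known.groupby("first_name")["age"]
           .mean().round().rename("est_age").reset_index())

# ... and matched to the records where it isn't.
unknown = unknown.merge(avg_age, on="first_name", how="left")
print(unknown.head())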

I’ve tested this method, and it’s an improvement over plugging missing values with an all-database average or median age. For a data set that has no age data at all, it should provide new information that wasn’t there before — information that is probably correlated with behaviours such as giving.

Now here’s a new wrinkle.

In my first post on this subject, I noted that some of the youngest names in our database are “gender flips.” Some of the more recent popular names used to be associated with the opposite gender decades ago. This seems to be most prevalent with young female names: Ainslie, Isadore, Sydney, Shelly, Brooke. It’s harder to find examples going in the other direction, but there are a few, some of them perhaps having to do with differences in ethnic origin: Kori, Dian, Karen, Shaune, Mina, Marian. In my data I have close to 600 first names that belong to members of both sexes. When I calculate average age by First Name separately for each sex, some names end up with the exact same age for male and female. These names have an androgynous quality to them: Lyndsay, Riley, Jayme, Jesse, Jody. At the other extreme are the names that have definitely flipped gender, which I’ve already given examples of … one of the largest differences being for Ainslie. The average male named Ainslie is 54 years older than the average female of the same name. (In my data, that is.)

These differences suggest an improvement to our age-inferring method: Matching on not just First Name, but Sex as well. Although only 600 of my names are double-gendered, they include many popular names, so that they actually represent almost one-quarter of all constituents.
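The refinement is just a second grouping key. Continuing the sketch from above, with hypothetical names again:

import pandas as pd

known = pd.read_csv("constituents_with_age.csv")  # first_name, sex, age
unknown = pd.read_csv("constituents_no_age.csv")  # id, first_name, sex

# Average age keyed on first name AND sex, so a name like Ainslie gets
# separate male and female estimates.
by_name_sex = (known.groupby(["first_name", "sex"])["age"]
               .mean().round().rename("est_age").reset_index())

unknown = unknown.merge(by_name_sex, on=["first_name", "sex"], how="left")
print(unknown.head())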

Now here’s another wrinkle.

When we’re dealing with constituents who aren’t alumni, we may be missing certain personal information such as Sex. If we plan to match on Sex as well as First Name, we’ve got a problem. If Name Prefix is present, we can infer Sex from whether it’s Mr., Ms., etc., but unless the person doing the data entry was having an off day, this shouldn’t be an avenue available to us — if the prefix is known, Sex should already be filled in. (If you know it’s “Mrs.,” then why not put in F for Sex?) For those records without a Sex recorded (or with a Sex of ‘N’), we need to make a guess. To do so, we return to our First Names query and the Sex data we do have.

In my list of 600 first names that are double-gendered, not many are actually androgynous. We have females named John and Peter, and we have males named Mary and Laura, but we all know that given any one person named John, chances are we’re talking about a male person. Mary is probably female. These may be coding errors or they may be genuine, but in any case we can use majority usage to help us decide. We’ll sometimes get it wrong — there are indeed boys named Sue — but if you have 7,000 Johns in your database and only five of them are female, then let’s assume (just for the convenience of data mining*) that all Johns are male.

So: Query your database to retrieve every first name that has a Sex code, and count up the instances of each. The default sex for each first name is decided by the higher count, male or female. To get a single variable for this, I subtract the number of females from the number of males for each first name. Since the result is positive for males and negative for females, I call it a “Maleness Score” — but you can do the reverse and call it a Femaleness Score if you wish! Results of zero are considered ties, or ‘N’.
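A sketch of that calculation, again with hypothetical names:

import pandas as pd

known = pd.read_csv("constituents_with_sex.csv")  # first_name, sex ('M'/'F')

# Count males and females per first name, then subtract.
counts = known.pivot_table(index="first_name", columns="sex",
                           aggfunc="size", fill_value=0)
counts["maleness"] = counts.get("M", 0) - counts.get("F", 0)

# Positive defaults to male, negative to female, zero is a tie ('N').
counts["default_sex"] = counts["maleness"].map(
    lambda s: "M" if s > 0 else ("F" if s < 0 else "N"))
print(counts[["maleness", "default_sex"]].head())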

At this point we’ve introduced a bit of circularity. For any person missing Age and Sex, first we have to guess their sex based on the majority code assigned to that person’s first name, and then go back to the same data to grab the Age that matches up with Name and Sex. Clearly we are going to get it very wrong for a lot of records. You can’t expect these guesses to hold up as well as true age data. Overall, though, there should be some signal in all that noise … if your model believes that “Edgar” is male and 72 years of age, and that “Brittany” is female and 26, well, that’s not unreasonable and it’s probably not far from the truth.

How do we put this all together? I build my models in Data Desk, so I need to get all these elements into my data file as individual variables. You can do this any way that works for you, but I use our database querying software (Hyperion Brio). I import the data into Brio as locally-saved tab-delimited files and join them up as you see below. The left table is my modeling data (or at least the part of it that holds First Name), and the two tables on the right hold the name-specific ages and sexes from all the database records that have this information available. I left-join each of these tables on the First Name field.

When I process the query, I get one row per ID with the fields from the left-hand table, plus the fields I need from the two tables on the right: the so-called Maleness Score, Female Avg Age by FName, Male Avg Age by Fname, and N Avg Age by Fname. I can now paste these as new variables into Data Desk. I still have work to do, though: I do have a small amount of “real” age data that I don’t want to overwrite, and not every First Name has a match in the alumni database. I have to figure out what I have, what I don’t have, and what I’m going to do to get a real or estimated age plugged in for every single record. I write an expression called Age Estimated to choose an age based on a hierarchical set of IF statements. The text of my expression is below — I will explain it in plain English following the expression.

if len('AGE')>0 then 'AGE'

else if textof('SEX')="M" and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="M" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="M" and len('F avg age by Fname')>0 then 'F avg age by Fname'

else if textof('SEX')="F" and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="F" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="F" and len('M avg age by Fname')>0 then 'M avg age by Fname'

else if textof('SEX')="N" and 'Maleness score'>0 and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'<0 and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'=0 and len('N avg age by Fname')>0 then 'N avg age by Fname'

else if len('N avg age by Fname')>0 then 'N avg age by Fname'
else if len('F avg age by Fname')>0 then 'F avg age by Fname'
else if len('M avg age by Fname')>0 then 'M avg age by Fname'

else 49

Okay … here’s what the expression actually does, going block by block through the statements:

  1. If Age is already present, then use that — done.
  2. Otherwise, if Sex is male, and the average male age is available, then use that. If there’s no average male age, then use the ‘N’ age, and if that’s not available, use the female average age … we can hope it’s better than no age at all.
  3. Otherwise if Sex is female, and the average female age is available, then use that. Again, go with any other age that’s available.
  4. Otherwise if Sex is ‘N’, and the Fname is likely male (according to the so-called Maleness Score), then use the male average age, if it’s available. Or if the first name is probably female, use the female average age. Or if the name is tied male-female, use the ‘N’ average age.
  5. Otherwise, as it appears we don’t have anything much to go on, just use any available average age associated with that first name: ‘N’, female, or male.
  6. And finally, if all else fails (which it does for about 6% of my file, or 7,000 records), just plug in the average age of every constituent in the database who has an age, which in our case is 49. This number will vary depending on the composition of your actual data file — if it’s all Parents, for example, then calculate the average of Parents’ known ages, excluding other constituent types.

When I bin the cases into 20 roughly equal groups by Estimated Age, I see that the percentage of cases that have some giving history starts very low (about 3 percent for the youngest group), rises rapidly to more than 10 percent, and then gradually rises to almost 18 percent for the oldest group. That’s heading in the right direction at least. As well, being in the oldest 5% is also very highly correlated with Lifetime Giving, which is what we would expect from a donor data set containing true ages.
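That binning check is easy to reproduce. A sketch, with hypothetical column names:

import pandas as pd

df = pd.read_csv("modeling_file.csv")  # est_age, has_giving (0/1)

# Twenty roughly equal bins by estimated age; share with giving history in each.
df["vingtile"] = pd.qcut(df["est_age"], 20, labels=False, duplicates="drop") + 1
print(df.groupby("vingtile")["has_giving"].mean())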


This is a bit of work, and probably the gain will be marginal a lot of the time. Data on real interactions that showed evidence of engagement would be superior to age-guessing, but when data is scarce a bit of added lift can’t hurt. If you’re concerned about introducing too much noise, then build models with and without Estimated Age, and evaluate them against each other. If your software offers multiple imputation for missing data as a feature, try checking that out … what I’m doing here is just a very manual form of imputation — calculating plausible values for missing data based on the values of other variables. Be careful, though: A good predictor of Age happens to be Lifetime Giving, and if your aim is to predict Giving, I should think there’s a risk your model will suffer from feedback.

* One final note …

Earlier on I mentioned assuming someone is male or female “just for the convenience of data mining.”  In our databases (and in a conventional, everyday sense too), we group people in various ways — sex, race, creed. But these categories are truly imperfect summaries of reality. (Some more imperfect than others!) A lot of human diversity is not captured in data, including things we formerly thought of as clear-cut. Sex seems conveniently binary, but in reality it is multi-category, or maybe it’s a continuous variable. (Or maybe it’s too complex for a single variable.) In real life I don’t assume that when someone in the Registrar’s Office enters ‘N’ for Sex that the student’s data is merely missing. Because the N category is still such a small slice of the population I might treat it as missing, or reapportion it to either Male or Female as I do here. But that’s strictly for predictive modeling. It’s not a statement about transgendered or differently gendered people nor an opinion about where they “belong.”

20 February 2013

The ‘analytic’ investment

Filed under: Analytics, Data — kevinmacdonell @ 10:49 am

Everyone’s talking about predictive analytics, Big Data, yadda yadda. The good news is, many institutions and organizations in our sector are indeed making investments in analytics and inching towards becoming data-driven. I have to wonder, though, how much of current investment is based on hype, and how much is going to fall away when data is no longer a hot thing.

Becoming a data-driven organization is a journey, not a destination. Forward progress is not inevitable, and it is possible for an office, a department or an institution to slip backward on the path, even when it seems they’ve “arrived”. In order for analytics to mature from a cutting-edge “nice-to-have” into a regular part of operations, the enterprise needs to be aware of its returns to the bottom line.

In my view, current investments in analytics are often done for reasons that are well-intentioned but vague: It seems to be the right thing to do these days … we see others doing it, so we feel we need to as well … we have an agenda for innovation and this fits the bill … and so on. I’m glad to see the investment, but not every promising innovation gets to stick around. Demonstrating ability to generate revenue — either through savings or through identifying new sources of revenue — will carry the day in the long run.

As I write this, I hear the jangle of railway bells at the level crossing in the early-morning dark outside my hotel room on the city’s downtown waterfront. I’m in Seattle today to attend the DRIVE 2013 conference, hosted by the University of Washington. I’ll be speaking on this topic — the “analytic” investment — later today. I have to admit to having struggled with making the session relevant for this group. For one, they don’t need convincing that making the investment is worth it. And second, if they think that I and my employer have figured out how to calculate the return on investment for analytics programs, they may be in for a disappointment. We have not.

In fact, when it comes right down to it, I like to spend my day working on cool things, interesting problems that face our department, and not so much on stuff that sounds like accounting (“ROI”). I’m betting many of the attendees of my session feel the same way. So I’ll be asking them to stop thinking about how they can get their managers, directors and vice presidents to understand the language of data and analytics. They’ll be far more successful if they try to speak the language their bosses respond to: Return on investment.

I may be a little short on answers for you, but I do have some pretty good questions.
