CoolData blog

20 January 2018

Download my free handbook on predictive modeling

[image: book cover]

I like to keep things simple, so here’s the gist: I wrote another book. It’s free. Download it here.

 

The title says it all: “Cool Data: A how-to guide for predictive modeling for higher education advancement and nonprofits using multiple linear regression in Data Desk.” It’s a 190-page “cookbook,” a guide for folks who aren’t looking for deep understanding of stats, regression, or even predictive modelling, but just enough knowledge — a recipe, really — to mine the value in their organizations’ databases. It’s the kind of book I would have loved to have when I was starting out.

 

Take a look, dive in if it’s your thing, share it with someone who might be interested.

 

I remember talking about the idea as long ago as 2010. I wanted to write something not too technical, yet valid, practical, and actionable. On getting into it I quickly realized that I couldn’t talk about multiple linear regression without talking about how to clean, transform, and prepare data for modelling. And I couldn’t talk about data prep without talking about querying a database. As a result, a large portion of the book is an introduction to SQL; again, not a deep dive into writing queries, but just enough for a motivated person to learn how to build an analysis-ready file.

 

I don’t have to sell you on it, though, because it’s free — download it and do whatever you want with it. If it looks interesting to you, buy the Data Desk software and work through the book using the sample data and your own data. (Be sure to check back for updates to the book, which may be needed as the Data Desk software continues to evolve.) And, of course, consider getting training, preferably one-on-one.

 

Unlike this handbook, Data Desk and training are not free, but they’re investments that will pay for themselves countless times over — if you stick with it.

 

 


23 February 2017

Proceeds from sales of “Score!” to be donated to ACLU

Filed under: Book, Peter Wylie, Score! — kevinmacdonell @ 9:09 pm

 

Peter Wylie and I are pleased to tell you that all our current and future royalties from sales of the book “Score!: Data-Driven Success for Your Advancement Team” will be donated to the American Civil Liberties Union.

 

A good seller since it came out a couple of years ago, “Score!” is available for order online, in both print and e-book versions. (Click here to enter the CASE book store.)

 

Each year around late August, I am delighted to see that cheque in my mail from the Council for Advancement and Support of Education. (Peter of course gets his cheque at the same time, only he spells it “check”.) The next cheque (or check) we receive will be our third. We never know how sales have gone for the year until we get paid; since “Score!” continues to be featured prominently in the CASE catalogue, and people continue to click through this blog to the CASE bookstore every day, we have reason to think sales are still healthy.

 

A good opportunity, then, to extend our little book’s modest influence in a positive direction in these strange times. The ACLU works to defend and preserve the individual rights and liberties guaranteed by the Constitution and laws of the United States. As you may know, I live in Canada, but I recognize that holding the current administration to account is in everyone’s interest.

 

If you’ve been meaning to get a copy and just needed that extra reason to act, click here to order online. Or, even better, consider making a contribution directly to the ACLU or whatever organization you feel is best positioned to undo the poison of xenophobia in your community, region, or country.

 

3 January 2016

CoolData (the book) beta testers needed

 

UPDATE (Jan 5): 16 people have responded to my call for volunteers, so I am going to close this off now. I have been in touch with each person who has emailed me, and I will be making a final selection within a few days. Thank you to everyone who considered taking a crack at it.

 

Interested in being a guinea pig for my new handbook on predictive modelling? I’m looking for someone (two or three people, max) to read and work through the draft of “CoolData” (the book), to help me make it better.

 

What’s it about? This long subtitle says it all: “A how-to guide for predictive modelling for higher education advancement and nonprofits using multiple linear regression in Data Desk.”

 

The ideal beta tester is someone who:

 

  • has read or heard about predictive modelling and understands what it’s for, but has never done it and is keen to learn (statistical concepts are introduced only when and if they are needed – no prior stats knowledge is required; I’m looking for beginners, but beginners who aren’t afraid of a challenge);
  • tends to learn independently, particularly using books and manuals to work through examples, either in addition to training or completely on one’s own;
  • does not have an IT background, but has some IT support at his or her organization, and would not be afraid to learn a little SQL in order to query a database him- or herself; and
  • has a copy of Data Desk, or intends to purchase Data Desk (available for PC or Mac).

 

It’s not terribly important that you work in the higher ed or nonprofit world — any type of data will do — but the book is strictly about multiple linear regression and the stats software Data Desk. The methods outlined in the book can be extended to any software package (multiple linear regression is the same everywhere), but because the prescribed steps refer specifically to Data Desk, I need someone to actually go through the motions in that specific package.

 

Think of a cookbook full of recipes, and how each must be tested in real kitchens before the book can go to press. Are all the needed ingredients listed? Has the method been clearly described? Are there steps that don’t make sense? I want to know where a reader is likely to get lost so that I can fix those sections. In other words, this is about more than just zapping typos.

 

I might be asking a lot. You or your organization will be expected to invest some money (for the software, sales of which I do not benefit from, by the way) and your time (in working through some 200 pages).

 

As a return on your investment, however, you should expect to learn how to build a predictive model. You will receive a printed copy of the current draft (electronic versions are not available yet), along with a sample data file to work through the exercises. You will also receive a free copy of the final published version, with an acknowledgement of your work.

 

One unusual aspect of the book is that a large chunk of it is devoted to learning how to extract data from a database (using SQL), as well as cleaning and preparing the data for analysis. This is in recognition of the fact that data preparation accounts for the majority of time spent on any analysis project. It is not mandatory that you learn to write queries in SQL yourself, but simply knowing which aspects of data preparation can be dealt with at the database query level can speed your work considerably. I’ve tried to keep the sections about data extraction as non-technical as possible, and to augment them with clear examples.
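(To make that concrete, here is a minimal sketch, not an excerpt from the book, of a query-to-analysis-file step, written in Python using its built-in sqlite3 module. Every file, table, and column name here is hypothetical; your own schema will differ.)

import csv
import sqlite3

# Hypothetical database file and schema, for illustration only.
conn = sqlite3.connect("advancement.db")

query = """
SELECT
    a.id,
    a.class_year,
    CASE WHEN a.business_phone IS NOT NULL THEN 1 ELSE 0 END AS busphone_present,
    CASE WHEN a.job_title IS NOT NULL THEN 1 ELSE 0 END AS jobtitle_present,
    COALESCE(SUM(g.amount), 0) AS lifetime_giving
FROM alumni a
LEFT JOIN gifts g ON g.alum_id = a.id
GROUP BY a.id, a.class_year, a.business_phone, a.job_title
"""

with open("analysis_file.csv", "w", newline="") as f:
    writer = csv.writer(f)
    cursor = conn.execute(query)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)  # one analysis-ready row per alum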

 

For a sense of the flavour of the book, I suggest you read these excerpts carefully: Exploring associations between variables and Testing associations between two categorical variables.

 

Contact me at kevin.macdonell@gmail.com and tell me why you’re interested in taking part.

 

 

 

3 September 2015

Testing associations between two categorical variables

 

Over the coming months I will be adapting various bits from a work in progress, a new book on building predictive models using regression, aimed at people working in nonprofit and higher education advancement.

 

(If you’re interested, I suggest subscribing via email — see the box to the right — to have the inside track on this project.)

 

In my previous post, I wrote about “Exploring associations between variables.” Today I go deeper to describe exactly how to test for an association or relationship between two variables that are categorical.

 

These directions are specific to the statistics software Data Desk, but the underlying concepts apply equally to any stats package. I make frequent reference to contingency tables, which you may know better as cross-tabs or R x C tables (row-by-column tables).

 

As well, I mention a data file from “Data University.” This is a file of anonymized sample data provided to me by Peter Wylie and John Sammis, which readers of my book will be able to download and use to learn the concepts.

 

In your data file for Data University, you’ve got variables for “Business phone present” and “Job title present”. You’d like to know if these two factors are related to each other. If they are completely unrelated, they are said to be independent of each other. We can analyze this in Data Desk using contingency tables. (If you don’t know why we care about dependence or independence among variables, read my previous post.)

 

Click once on the icon for ‘BusPhone Present’ to select it as the first variable, then Shift-click on ‘JobTitle Present’ to select it as the second variable. Under the Calc menu at the top of the screen, select Contingency Tables. The displayed result will depend on what default settings are in effect, but it might look like the table below. (Just as with frequency tables, you can play with display options from a menu available by clicking on the little triangle in the top left corner of the window.)

 

[image1: contingency table of ‘BusPhone Present’ by ‘JobTitle Present’ showing cell counts]

 

A contingency table gives us counts of how many cases fall into the categories that result from the intersection of two categorical variables. In this example, we have two binary variables, therefore there are four possibilities, arranged in a two-by-two matrix. The two “levels” of ‘BusPhone Present’ are labeled vertically (down the rows), and the two ‘levels’ of ‘Job Title Present’ are labeled horizontally (across the columns).

 

The labeling of rows and columns in contingency tables can be a bit confusing when the category names are just ones and zeroes, so here’s the same table with more descriptive labels.

 

[image2: the same contingency table, with descriptive row and column labels]

 

If these variables are properly coded, with no missing values, each and every case in our data falls into one of the four cells of the matrix. We see for example that 2,240 cases are ‘0’ for ‘BusPhone Present’ and ‘0’ for ‘JobTitle Present’ – a large number of cases have neither piece of information coded in the database.

 

Looking at counts of cases in a contingency table can sometimes reveal cells with unusually many or unusually few cases. In this example, when Data University has a business phone number on file, it tends to have a job title as well, and it’s also common for both to be missing. It’s much less common for one to be present and the other not. This suggests there’s a relationship.
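(An aside, not part of the book: if you happen to work in Python rather than Data Desk, you can reproduce this table with a cross-tabulation. The file name and column names below are assumptions based on the sample file described above.)

import pandas as pd

# Load the Data University sample data (file name assumed for illustration).
df = pd.read_csv("data_university.csv")

# Count the cases in each cell of the two-by-two matrix.
table = pd.crosstab(df["BusPhone Present"], df["JobTitle Present"])
print(table)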

 

We’ve made a judgment call, just looking at the numbers. Fortunately the field of statistics provides some tests for making decisions more consistently and easily. If you have some training in stats you will readily grasp how we can apply these tests. For our purposes, we will rely on rules of thumb and simplified explanations rather than delve into a deep understanding of the tests themselves.

 

Click on the options menu for the contingency table, add “Expected value” to the table, then click OK. This adds a new set of numbers to the table. The section at the bottom of the table, “table contents,” lists these elements in the order they appear in each cell. The first element is the count of cases that we’ve already seen; the second is the expected value.

 

[image3: contingency table showing both counts and expected values]

 

“Expected values” are the counts of cases that would be expected if ‘BusPhone Present’ and ‘JobTitle Present’ were statistically independent of each other. If the two variables were completely unrelated, the probability that any one case ends up in a particular cell would depend only on the probability that the case falls in that cell’s row and the probability that it falls in that cell’s column. In other words, we would expect the number of people with both a business phone number and a job title in the database to depend solely on how common each piece of information is in the sample – one doesn’t depend on the other. By this logic, if one really is unrelated to the other, the probability of a person having a business phone in the database is about the same whether or not they have a job title in the database.
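(Continuing the Python aside: each expected value is simply the cell’s row total multiplied by its column total, divided by the grand total. An illustrative sketch, building on the crosstab above:)

import numpy as np

# Expected count for each cell under independence:
#   expected = (row total * column total) / grand total
row_totals = table.sum(axis=1).to_numpy()
col_totals = table.sum(axis=0).to_numpy()
grand_total = table.to_numpy().sum()

expected = np.outer(row_totals, col_totals) / grand_total
print(expected)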

 

Have a quick look to compare each pair of numbers. You’ll notice that two actual counts are higher than the expected value, and two counts are lower than the expected value. If the variables were not associated with each other (independent), the actual counts would be a lot closer to the expected counts. It appears that if one variable is a zero or one, the other variable is more likely than not to have the same value – that is, we are likely to have either both or neither. Therefore, the variables do seem to be related.

 

Being able to compare the expected values to the observed values is helpful in identifying a pattern or relationship. But again, we are still just eyeballing the numbers and making a guess. Let’s take another step forward.

 

An assertion that two variables are not related is based on a concept in statistics called the null hypothesis. The null hypothesis in this case would state, “There is no relationship between Business Phone Present and Job Title Present.” In a test for independence between Business Phone and Job Title, if there is no relationship, then the null hypothesis is true. The alternative hypothesis is that the variables are in fact related – they are dependent, rather than independent.

 

The test for determining whether the null hypothesis is true in this case is called the chi-square test for independence. For our purposes, it is not necessary to formally state any kind of hypothesis, nor is it necessary to understand how chi-square is calculated. (Chi is pronounced “kai”.) Data Desk takes care of the calculations, and a rule of thumb for interpreting the result will suffice. (1)

 

Go back to the options for the contingency table. Deselect “Expected value”, and choose “Chi-square value” instead. Data Desk will then calculate the chi-square statistic. A large value is evidence against the null hypothesis, but the number is not interpreted on its own. Rather, the important statistic to watch for appears directly underneath it: p. The p-value in the table below is less than or equal to 0.0001 – a very small value.

 

[image4: contingency table showing the chi-square value and p-value]

 

The lowercase p stands for “probability.” A p-value of 0.0001 is the same as a 0.01% chance of something happening. We ask this question: “If there were no relationship between ‘BusPhone Present’ and “JobTitle Present’, what is the probability that you would get a result for chi-square at least as high as 1,141?” That probability is your p-value.

 

By convention, you need a p-value of 0.05 (i.e. 5 percent) or less to consider the probability low enough to conclude that there must, in fact, be a relationship between ‘BusPhone Present’ and ‘JobTitle Present’. (In other words, you can reject the null hypothesis.) The probability of obtaining a value of chi-square this large if the variables were independent is extremely low: less than 0.01%. The p-value in this case is far below the threshold, so we conclude that the variables are related.
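(Continuing the Python aside: SciPy performs the same test. I pass correction=False to get the plain Pearson chi-square; Data Desk’s defaults may differ, so treat this as a sketch rather than an exact replication.)

from scipy.stats import chi2_contingency

# Chi-square test for independence on the observed counts.
chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(f"chi-square = {chi2:.1f}, p = {p:.4g}, df = {dof}")
# A p-value below 0.05 rejects the null hypothesis of independence.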

 

We will encounter p-values again when we do regression. (2)

 

Given the foregoing discussion, you might be thinking that exploring relationships among variables is a complicated and subtle business. Not really. You don’t have to study expected values or formulate a null hypothesis – I introduced those things to help you understand where chi-square comes from. You only need these steps to test for the presence of a relationship:

 

  1. Select the icons of the two variables.
  2. Create a contingency table from the Calc menu.
  3. In the table options menu, select “Chi-square value.”
  4. Regardless of the chi-square value, if the p-value is less than 0.05, the variables are not independent – they are related somehow.

 

You can create a new contingency table whenever you want to test two new variables, or you can simply drag a new categorical variable into an existing table. Just drag the new variable’s icon on top of the name of the variable you want to replace. These names are in boldface text near the top of the table window, and will highlight when touched by the cursor during the drag. You can replace both factors at once by selecting two new variables and dragging them both into the centre of the table.

 

Before we move on, try dragging other binary variables into the contingency table to quickly see which combinations yield some interesting associations.

 

A more relevant example

 

Dragging variables one by one into a contingency table is a useful exploration step while evaluating potential predictors for use in a model. This implies that one of the variables must be the target (outcome) variable.

 

Let’s say you plan to build a predictive model to identify which alumni are more likely to make a donation at a higher level than usual, and you want to make a preliminary assessment of potential predictors so you can toss out any that don’t look promising, while keeping the others for use in the regression analysis. From a lifetime giving variable, you can create a target variable that will allow you to do this analysis using contingency tables.

 

Working with the Data University file, create a new derived variable called ‘Big donor’ with the expression:

 

‘Lifetime HC’ > 999

 

This will create a binary variable that evaluates to ‘1’ for alumni with lifetime hard-credit giving greater than $999, and ‘0’ for everyone else.
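(In the Python aside, the equivalent derived variable is a one-liner; the column name ‘Lifetime HC’ is assumed to match the sample file.)

# 1 for lifetime hard-credit giving over $999, 0 for everyone else.
df["Big donor"] = (df["Lifetime HC"] > 999).astype(int)
print(df["Big donor"].value_counts())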

 

You can use whatever dollar value you like, but $1,000 and up will work here. According to a frequency table of ‘Big donor’, 602 people have lifetime giving of $1,000 or more.

 

Click on the icon for ‘Big donor’ to select it, create a contingency table from the Calc menu, and change the table options to include chi-square. Now the window is ready for testing variables one by one, by dragging variables into the space that is initially labeled “Drag Variable Here.”

 

Chi-square is great for indicating that two variables are related, but a little more information will help you understand the nature of the relationship. When you drag in “BusPhone Present,” for example, it seems apparent that the relationship is significant, to judge from the chi-square value. One tweak will tell you something more concrete: Go back to the table options, deselect “Count” and select “Percent of column total.”

 

[image5: contingency table showing percent of column total, with the ‘Big donor’ percentages circled]

 

This replaces cell counts with percent values – the percent breakdown for each column. The circled values in the table above are the ones to pay attention to – they are the cases that have a ‘1’ for ‘Big donor’. Here is what we can read from this:

 

  • The first column of figures includes all cases for which there is no business phone in the database (0). Only 8.39% of people with no business phone are also top donors.
  • The second column includes all cases with a business phone (1) – 18.4% of people with a business phone are top donors.

 

People with a business phone are more than twice as likely to be top donors, and the chi-square value indicates that the relationship is significant. It might be hasty to conclude that ‘BusPhone Present’ is truly a “predictor” of high-level giving, but the association is clearly there in the data.

 

(These percentages total down the columns. If ‘Big donor’ were your second variable instead of your first, the matrix would be oriented the other way, and you would choose “Percent of row total” from the table options instead.)
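(The Python equivalent, for those following along outside Data Desk; again a sketch with assumed column names:)

# Percent breakdown for each column, like "Percent of column total".
# Use normalize="index" instead if your target variable sits on the rows.
pct = pd.crosstab(df["Big donor"], df["BusPhone Present"], normalize="columns") * 100
print(pct.round(2))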

 

Try dragging other binary variables from the Data University set into the table, replacing ‘BusPhone Present’, and observe how the percent breakdowns differ from group to group, and whether that difference is significant.

 

In my book-in-progress, I go on to talk about exploring categorical variables with multiple categories – not just binary variables – but that’s all for today. Again, if you’re interested in knowing more as this project progresses, please subscribe to this blog via email from the box in the right sidebar.

 

End notes

 

1) If you must know, the value of chi-square is the sum of the squared standardized residuals across all cells in the table. The standardized residual describes the extent to which the observed count differs from the expected count. You can display this statistic in Data Desk along with the others, but there is no need. Another abbreviation in the table is “df”, which stands for degrees of freedom. Probability values of chi-square depend upon the degrees of freedom, which in turn depend on the number of rows and columns in the table.
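(Continuing the Python aside, you can verify this directly, using the observed table and the expected counts computed earlier:)

# Chi-square equals the sum of squared standardized residuals, where
#   residual = (observed - expected) / sqrt(expected)
std_resid = (table.to_numpy() - expected) / np.sqrt(expected)
print((std_resid ** 2).sum())  # matches the chi-square statistic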

 

2) Use and misuse of p-values is a hot topic in statistics, especially in connection with academic publishing, where putting too much stock in p-values and the rather arbitrary 0.05 threshold causes a great deal of angst. As predictive modelers we need to keep these controversies in perspective: We are not testing the effects of new medical treatments nor are we seeking to publish in a scientific journal. We are making use of “good-enough” tools that will produce a useful result. That said, applying the 0.05 threshold without judgment and common sense means that we will detect “relationships” that are not real – potentially about 5% of the time!

 

 

26 August 2015

Exploring associations between variables

Filed under: Book, CoolData, Predictor variables — kevinmacdonell @ 6:57 pm

 

CoolData has been quiet over the summer, mainly because I’ve been busy writing another book. (Fine weather has a bit to do with it, too.) The book will be for nonprofit and higher education advancement professionals interested in learning how to use multiple regression to build predictive models. Over the next few months, I will adapt various bits from the work-in-progress as individual posts here on CoolData.

 

I’ll have more to say about the book later, so if you’re interested, I suggest subscribing via email (see the box to the right) to have the inside track on this project. (And if you aren’t familiar with the previous book, co-written with Peter Wylie, then have a look here.)

 

A generous chunk of the book is about the specifics of getting your hands dirty with cleaning up your messy data, transforming it to make it suitable for regression analysis, and exploring it for interesting patterns that can strengthen a predictive model.

 

When you import a data set into Data Desk or another statistics package, you are looking at more than just a jumble of variables. All these variables stand in relation to one another; they are linked by patterns. Some variables are strongly associated with each other, others have weaker associations, and some are hardly related to each other at all.

 

What is meant by “association”? A classic example is a data set of children’s weights, heights, and ages. Older children tend to weigh more and be taller than younger children. Heavier children tend to be older and taller than lighter children. We say that there is an association between age and weight, between height and weight, and between age and height.

 

Another example: Alumni who are bigger donors tend to attend more reunion events than alumni who give more modestly or don’t give at all. Or put the other way, alumni who attend more events tend to give more than alumni who attend fewer or no events. There is an association between giving and attending events.

 

This sounds simple enough — even obvious. The powerful consequence of these truths is that if we know the value of one variable, we can make a guess at the value of another, as long as the association is valid. So if we know a child’s weight and height, we can make a good guess of his or her age. If we know a child’s height, we can guess weight. If we know how many reunions an alumna has attended, we can make a guess about her level of giving. If we know how much she has given, we can guess whether she’s attended more or fewer reunions than other alumni.

 

We are guessing an unknown value (say, giving) based on a known value (number of events attended). But note that “giving” is not really an unknown. We’ve got everyone’s giving recorded in the database. What is really unknown is an alum’s or a donor’s potential for future giving. With predictive modeling, we are making a guess at what the value of a variable will be in the (near) future, based on the current value of other variables, and the type and degree of association they have had historically.

 

These guesses will be far from perfect. We aren’t going to be bang-on in our guesses of children’s ages based on weight and height, and we certainly aren’t going to be very accurate with our estimates of giving based on event attendance. Even trickier, projecting into the future — estimating potential — is going to be very approximate.

 

Still, our guesses will be informed guesses, as long as the associations we detect are real and not due to random variation in our data. Can we predict exactly how much each donor is going to give over this coming year? No, that would be putting too much confidence in our powers. But we can expect to have plenty of success in ranking our constituents in order by how likely they are to engage in whatever behaviour we are interested in, and that knowledge will be of great value to the business.

 

Looking for potentially useful associations is part of data exploration, which is best done in full hands-on mode! In a future post I will talk about specific techniques for exploring different types of variables.

 

19 August 2014

Score! … As pictured by you

Filed under: Book, Peter Wylie, Score! — kevinmacdonell @ 7:25 pm

Left to right: Elisa Shoenberger, Leigh Petersen Visaya, Rebekah O’Brien, and Alison Rane in Chicago. (Click for full size.)

During the long stretch of time that Peter Wylie and I were writing our book, Score! Data-Driven Success for Your Advancement Team, there were days when I thought that even if we managed to get the thing done, it might not be that great. There were just so many pieces that needed to fit together somehow … I guess we each didn’t want to let the other down, so we plugged on despite doubts and delays, and then, somehow, it got finished.

Whew, I thought. Washed my hands of that! I expected I would walk away from it, move on to other projects, and be glad that I had my early mornings and weekends back.

That’s not what happened.

These few months later, my eye will still be caught now and then by the striking, colourful cover of the book sitting on my desk. It draws me to pick it up and flip through it — even re-read bits. I find myself thinking, “Hey, I like this.”

Of course, who cares, right? I am not the reader. However, whatever I might think about Score!, it has been even more gratifying for Peter and me to hear from folks who seem to like it as much as we do. How fun it has been to see that bright cover popping up in photos and on social media every once in a while.

I’ve collected a few of those photos and tweets here, along with some other images related to the book. Feel free to post your own “Score selfies” on Twitter using the hashtag #scorethebook. Or if you’re not into Twitter, send me a photo at kevin.macdonell@gmail.com.

Click here to order your copy of Score! from the CASE Bookstore.

[Reader-submitted photos]


Jennifer Cunningham, Senior Director, Metrics+Marketing for the Office of Alumni Affairs, Cornell University. @jenlynham


While we would like for you to buy it, we would LOVE for you to read it and put it to work in your shop. Your buying it earns us each enough money to buy a cup of coffee. Your READING it furthers the reach and impact of ideas and concepts that fascinate us and which we love to share.

