CoolData blog

20 January 2018

Download my free handbook on predictive modeling

I like to keep things simple, so here’s the gist: I wrote another book. It’s free. Download it here.

 

The title says it all: “Cool Data: A how-to guide for predictive modeling for higher education advancement and nonprofits using multiple linear regression in Data Desk.” It’s a 190-page “cookbook,” a guide for folks who aren’t looking for a deep understanding of stats, regression, or even predictive modelling, but just enough knowledge (a recipe, really) to mine the value in their organizations’ databases. It’s the kind of book I would have loved to have when I was starting out.

 

Take a look, dive in if it’s your thing, share it with someone who might be interested.

 

I remember talking about the idea as long ago as 2010. I wanted to write something not too technical, yet valid, practical, and actionable. On getting into it I quickly realized that I couldn’t talk about multiple linear regression without talking about how to clean, transform, and prepare data for modelling. And I couldn’t talk about data prep without talking about querying a database. As a result, a large portion of the book is an introduction to SQL; again, not a deep dive into writing queries, but just enough for a motivated person to learn how to build an analysis-ready file.

 

I don’t have to sell you on it, though, because it’s free — download it and do whatever you want with it. If it looks interesting to you, buy the Data Desk software and work through the book using the sample data and your own data. (Be sure to check back for updates to the book which may be necessary as the Data Desk software continues to evolve.) And, of course, consider getting training, preferably one-on-one.

 

Unlike this handbook, Data Desk and training are not free, but they’re investments that will pay for themselves many times over, if you stick with it.

 

 

3 January 2016

CoolData (the book) beta testers needed

 

UPDATE (Jan 5): 16 people have responded to my call for volunteers, so I am going to close this off now. I have been in touch with each person who has emailed me, and I will be making a final selection within a few days. Thank you to everyone who considered taking a crack at it.

 

Interested in being a guinea pig for my new handbook on predictive modelling? I’m looking for someone (two or three people, max) to read and work through the draft of “CoolData” (the book), to help me make it better.

 

What’s it about? This long subtitle says it all: “A how-to guide for predictive modelling for higher education advancement and nonprofits using multiple linear regression in Data Desk.”

 

The ideal beta tester is someone who:

 

  • has read or heard about predictive modelling and understands what it’s for, but has never done it and is keen to learn (statistical concepts are introduced only when and if they are needed, so no prior stats knowledge is required; I’m looking for beginners, but beginners who aren’t afraid of a challenge);
  • tends to learn independently, particularly using books and manuals to work through examples, either in addition to training or completely on one’s own;
  • does not have an IT background but has some IT support at his or her organization, and would not be afraid to learn a little SQL in order to query a database him- or herself; and
  • has a copy of Data Desk, or intends to purchase Data Desk (available for PC or Mac).

 

It’s not terribly important that you work in the higher ed or nonprofit world — any type of data will do — but the book is strictly about multiple linear regression and the stats software Data Desk. The methods outlined in the book can be extended to any software package (multiple linear regression is the same everywhere), but because the prescribed steps refer specifically to Data Desk, I need someone to actually go through the motions in that specific package.

 

Think of a cookbook full of recipes, and how each must be tested in real kitchens before the book can go to press. Are all the needed ingredients listed? Has the method been clearly described? Are there steps that don’t make sense? I want to know where a reader is likely to get lost so that I can fix those sections. In other words, this is about more than just zapping typos.

 

I might be asking a lot. You or your organization will be expected to invest some money (for the software; I do not benefit from its sales, by the way) and your time (in working through some 200 pages).

 

As a return on your investment, however, you should expect to learn how to build a predictive model. You will receive a printed copy of the current draft (electronic versions are not available yet), along with a sample data file to work through the exercises. You will also receive a free copy of the final published version, with an acknowledgement of your work.

 

One unusual aspect of the book is that a large chunk of it is devoted to learning how to extract data from a database (using SQL), as well as cleaning it and preparing it for analysis. This recognizes that data preparation accounts for most of the time spent on any analysis project. It is not mandatory that you learn to write queries in SQL yourself, but simply knowing which aspects of data preparation can be dealt with at the database query level can speed your work considerably. I’ve tried to keep the sections about data extraction as non-technical as possible, and I have augmented them with clear examples.

 

For a sense of the flavour of the book, I suggest you read these excerpts carefully: Exploring associations between variables and Testing associations between two categorical variables.

 

Contact me at kevin.macdonell@gmail.com and tell me why you’re interested in taking part.

 

 

 

26 August 2015

Exploring associations between variables

Filed under: Book, CoolData, Predictor variables. Posted by kevinmacdonell @ 6:57 pm

 

CoolData has been quiet over the summer, mainly because I’ve been busy writing another book. (Fine weather has a bit to do with it, too.) The book will be for nonprofit and higher education advancement professionals interested in learning how to use multiple regression to build predictive models. Over the next few months, I will adapt various bits from the work-in-progress as individual posts here on CoolData.

 

I’ll have more to say about the book later, so if you’re interested, I suggest subscribing via email (see the box to the right) to have the inside track on this project. (And if you aren’t familiar with the previous book, co-written with Peter Wylie, then have a look here.)

 

A generous chunk of the book is about the specifics of getting your hands dirty with cleaning up your messy data, transforming it to make it suitable for regression analysis, and exploring it for interesting patterns that can strengthen a predictive model.

 

When you import a data set into Data Desk or another statistics package, you are looking at more than just a jumble of variables. These variables are related to one another; they are linked by patterns. Some variables are strongly associated with each other, others have weaker associations, and some are hardly related to each other at all.

 

What is meant by “association”? A classic example is a data set of children’s weights, heights, and ages. Older children tend to weigh more and be taller than younger children. Heavier children tend to be older and taller than lighter children. We say that there is an association between age and weight, between height and weight, and between age and height.

 

Another example: Alumni who are bigger donors tend to attend more reunion events than alumni who give more modestly or don’t give at all. Or put the other way, alumni who attend more events tend to give more than alumni who attend fewer or no events. There is an association between giving and attending events.

 

This sounds simple enough — even obvious. The powerful consequence of these truths is that if we know the value of one variable, we can make a guess at the value of another, as long as the association is valid. So if we know a child’s weight and height, we can make a good guess of his or her age. If we know a child’s height, we can guess weight. If we know how many reunions an alumna has attended, we can make a guess about her level of giving. If we know how much she has given, we can guess whether she’s attended more or fewer reunions than other alumni.
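
To make the idea concrete, here is a minimal Python sketch (not taken from the book, which works in Data Desk). It measures the strength of a linear association with Pearson's correlation coefficient, then fits a least-squares line so that a known value of one variable yields a guess for the other. The data values and the names `reunions_attended`, `lifetime_giving`, `pearson_r`, and `fit_line` are all invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: strength of the linear association, from -1 to 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for guessing y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data for six alumni (assumption: illustration only, not real records).
reunions_attended = [0, 0, 1, 2, 3, 5]
lifetime_giving = [0.0, 50.0, 100.0, 400.0, 350.0, 900.0]

r = pearson_r(reunions_attended, lifetime_giving)
slope, intercept = fit_line(reunions_attended, lifetime_giving)

# A rough guess at giving for an alumna who has attended four reunions.
guess = intercept + slope * 4
print(f"r = {r:.2f}, guessed giving for 4 reunions = {guess:.0f}")
```

A value of `r` close to 1 or -1 signals a strong linear association, and a value near 0 a weak one. The guess is only as good as the association behind it.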

 

We are guessing an unknown value (say, giving) based on a known value (number of events attended). But note that “giving” is not really an unknown. We’ve got everyone’s giving recorded in the database. What is really unknown is an alum’s or a donor’s potential for future giving. With predictive modeling, we are making a guess at what the value of a variable will be in the (near) future, based on the current value of other variables, and the type and degree of association they have had historically.

 

These guesses will be far from perfect. We aren’t going to be bang-on in our guesses of children’s ages based on weight and height, and we certainly aren’t going to be very accurate with our estimates of giving based on event attendance. Even trickier, projecting into the future — estimating potential — is going to be very approximate.

 

Still, our guesses will be informed guesses, as long as the associations we detect are real and not due to random variation in our data. Can we predict exactly how much each donor is going to give over this coming year? No, that would be putting too much confidence in our powers. But we can expect to have plenty of success in ranking our constituents in order by how likely they are to engage in whatever behaviour we are interested in, and that knowledge will be of great value to the business.
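
As a rough illustration of ranking rather than predicting exact amounts, the Python sketch below sorts constituents by a model score and groups them into ten equal-sized bands. The `scores` dictionary, the ID format, and the use of deciles are my assumptions for the example; nothing here is prescribed by Data Desk or by the book.

```python
def rank_into_deciles(scores):
    """Map each constituent ID to a decile, where 10 holds the highest scores.

    Ties are broken by the order in which IDs appear in `scores`.
    """
    ordered = sorted(scores, key=scores.get)  # IDs, ascending by score
    n = len(ordered)
    deciles = {}
    for position, constituent_id in enumerate(ordered):
        deciles[constituent_id] = position * 10 // n + 1
    return deciles


# Hypothetical model output: 20 constituent IDs with made-up scores.
scores = {f"A{i:03d}": round(0.05 * i, 2) for i in range(20)}

deciles = rank_into_deciles(scores)

# The top decile is where you would look first for likely prospects.
top = [cid for cid, d in deciles.items() if d == 10]
print(top)
```

The absolute score matters less than the ordering it produces: the point is to know who is more likely than whom.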

 

Looking for potentially useful associations is part of data exploration, which is best done in full hands-on mode! In a future post I will talk about specific techniques for exploring different types of variables.

 

18 January 2015

Why blog? Six reasons and six cautions

Filed under: CoolData, Off on a tangent, Training / Professional Development. Posted by kevinmacdonell @ 4:12 pm

The two work-related but extracurricular activities I have found the most rewarding, personally and professionally, are giving conference presentations and writing for CoolData. I’ve already written about the benefits of presenting at conferences, explaining why the pain is totally worth it. Today: six reasons why you might want to try blogging, followed by six (optional) pieces of advice.

I’ve been blogging for just over five years, and I can say that the best way to start, and stay started, is to seek out motives that are selfish. The type of motivation I’m thinking of is intrinsic, such as personal satisfaction, as opposed to extrinsic, such as aiming to have a ton of followers and making money. It’s a good selfish.

Three early reasons for getting started with a blog are:

1. Documenting your work: One of my initial reasons for starting was to have a place to keep snippets of knowledge in some searchable place. Specific techniques for manipulating data in Excel, for example. I have found myself referring to older published pieces to remind me how I carried out an analysis or when I need a block of SQL. A blog has the added benefit of being shareable, but if your purpose is personal documentation, it doesn’t matter if you have any audience at all.

2. Developing your thoughts: Few activities bring focus and clarity to your thoughts like writing about them. Some of my ideas on more abstract issues have been shaped and developed this way. Sometimes the office is not the best environment for this sort of reflective work. A blog can be a space for clarity. Again — no need for an audience.

3. Solidifying your learning: One of the best ways to learn something new is by teaching it to someone else. I may have had an uncertain grasp of multiple linear regression, for example, when I launched CoolData, but the exercise of trying to explain data mining concepts and techniques was a great way to get it all straight in my head. If I were to go back today and re-read some of my early posts on the subject, which I rarely do, I would find things I probably would disagree with. But the likelihood of being wrong is not a good enough reason to avoid putting your thoughts out there. Being naive and wrong about things is a stage of learning.

Let’s say that, motivated by these or other reasons, you’ve published a few posts. Suddenly you’ve got something to share with the world. Data analysis lends itself perfectly to discussion via blogs. Not only analysts and data miners, but programmers, prospect researchers, business analysts, and just about anyone engaged in knowledge work can benefit personally while enriching their profession by sharing their thoughts with their peers online.

As you slowly begin to pick up readers, new reasons for blogging will emerge. Three more reasons for blogging are:

4. Making professional connections: As a direct result of writing the blog I have met all kinds of interesting people in the university advancement, non-profit, and data analysis worlds. Many I’ve met only virtually, others I’ve been fortunate to meet in person. It wasn’t very long after I started blogging that people would approach me at conferences to say they had seen one of my posts. Some of them learned a bit from me, or more likely I learned from them. A few have even found time to contribute a guest post.

5. Sharing knowledge: This is the obvious one, so no need to say much more. Many advancement professionals share online already, via various listservs and discussion forums. The fact this sharing goes on all the time makes me wonder why more people don’t try to make their contributions go even farther by taking the extra step of developing them into blog posts that can be referred to anytime.

6. Building toward larger projects: If you keep at it, slowly but surely you will build up a considerable body of work. Blogging can feed into conference presentations, discussion papers, published articles, even a book.

Let me return to the distinction I made earlier between intrinsic and extrinsic motivators — the internal, more personal rewards of blogging versus the external, often monetary, goals some people have. As it happens, the personal reasons for blogging are realistic, with a high probability of success, while the loftier goals are likely to lead to premature disillusionment. A new blog with no audience is a fragile thing; best not burden it with goals you cannot hope to realize in the first few years.

I consider CoolData a success, but not by any external measure. I simply don’t know how many followers a blog about data analysis for higher education advancement ought to have, and I don’t worry about it. I don’t have goals for number of visitors or subscribers, or even number of books sold. (Get your copy of “Score!” here. … OK — couldn’t resist.)

The blog does what I want it to do.

That’s mostly what I have to say, really. I have a few bits of advice, but my strongest advice is to ignore what everybody else thinks you should do, including me. Most expert opinion on posting frequency, optimum length for posts, ideal days and times for publishing, click-bait headlines, search engine optimization and the like is a lot of hot air.

If you’re still with me, here are a few cautions and pieces of advice, take it or leave it:

1. On covering your butt: Some employers take a dim view of their employees publishing blogs and discussing work-related issues on social media. You might want to clear your activity with your supervisor first. When I changed jobs, I disclosed that I intended to keep up my blog. I explained that connecting with counterparts at other universities was a big part of my professional development. There’s never been an issue. Be clear that you’re writing for a small readership of professionals who share your interests, an activity not unlike giving a conference presentation. Any enlightened organization should embrace someone who takes the initiative. (You could blog secretly and anonymously, but what’s the point?)

2. On “permission”: Beyond ensuring that you are not jeopardizing your day job, you do not require anyone’s permission. You don’t have to be an expert; you simply have to be interested in your subject and enthusiastic about sharing your new knowledge with others. Beginners have an advantage over experts when it comes to blogging; an expert will often struggle to relate to beginners, and assume too much about what they know or don’t know. So what if that post from two years ago embarrasses you now? You can always just delete it. If you’re reticent about speaking up, remember that blogging is not about claiming to be an authority on anything. It’s about exploring and sharing. It’s about promoting helpful ideas and approaches. You can’t prevent small minds from interpreting your activity as self-promotion, so just keep writing. In the long run, it’s the people who never take the risk of putting themselves out there who pay the higher price.

3. On writing: The interwebs ooze with advice for writers so I won’t add to the noise. I’ll just say that, although writing well can help, you don’t need to be an exceptional stylist. I read a lot of informative yet sub-par prose every day. The misspellings, mangled English, and infelicities that would be show-stoppers if I were reading a novel just aren’t that important when I’m reading for information that will help me do my job.

4. On email: In the early days of email I thought it rude not to respond. Today things are different: It’s just too easy to bombard people. Don’t get me wrong: I have received many interesting questions from readers (some of which have led to new posts, which I love), as well as great opportunities to connect, participate in projects, and so on. But just because you make yourself available for interaction doesn’t mean you need to answer every email. You can lay out the ground rules on an “About” page. If someone can’t be bothered to consider your guidelines for contact, then an exchange with that person is not going to be worth the trouble. On my “About this Blog” page I make it clear that I don’t review books or software, yet the emails offering me free stuff for review keep coming. I have no problem deleting those emails unanswered. … Then there are emails that I fully intend to respond to, but don’t get the chance. Before long they are buried in my inbox and forgotten. I do regret that a little, but I don’t beat myself up over it. (However — I do hereby apologize.)

5. On protecting your time: Regardless of how large or small your audience, eventually people will ask you to do things. Sometimes this can lead to interesting partnerships that advance the interests of both parties, but choose wisely and say no often. Be especially wary of quid pro quo arrangements that involve free stuff. I rarely read newspaper travel writing because I know so much of it is bought and paid for by tour companies, hotels, restaurants and so on, without disclosure. However, I’m less concerned about high-minded integrity than I am about taking on extra burdens. I’m a busy guy, and also a lazy guy who jealously guards his free time, so I’m careful about being obliged to anyone, either contractually or morally. Make sure your agenda is set exclusively by whatever has your full enthusiasm. You want your blogging to be a free activity, where no one but you calls the shots.

6. On the peanut gallery: Keeping up a positive conversation with people who are receptive to your message is productive. Trying to convince skeptics and critics who are never going to agree with you is not. When you’re pushing back, you’re not pushing forward. Keep writing for yourself and the people who want to hear what you’ve got to say, and ignore the rest. This has nothing to do with being nice or avoiding conflict. I don’t care if you’re nice. It’s about applying your energies in a direction where they are likely to produce results. Focus on being positive and enabling others with solutions and knowledge, not on indulging in opinions, fruitless debates, and pointless persiflage among the trolls in the comments section. I haven’t always followed my own advice, but I try.

Some say “know your audience.” Actually, it would be better if you know yourself. Readers respond to your personality and they can only get to know you if you are consistent. You can only be consistent if you are genuine. There are 7.125 billion people in the world and almost half of them have an internet connection (and access to Google Translate). Some of those will become your readers — be true to them by being true to yourself. There is no need to waste your time chasing the crowd.

Your overarching goals are not to convince or convert or market, but to 1) fuel your own growth, and 2) connect with like-minded people. Growth and connection: That’s more than enough payoff for me.

29 May 2014

Nate Silver on age-guessing from first names

Filed under: CoolData. Posted by kevinmacdonell @ 3:22 pm

Friend and colleague Greg Pemberton (@GregPemberton) pointed me to this interesting post on the FiveThirtyEight blog: How to Tell Someone’s Age When All You Know Is Her Name. Wow, I thought … that rings a bell! I wrote a blog post on exactly that topic: How to infer age, when all you have is a name. That was nearly four years ago, and I’ve written a couple more posts on the subject since then.

I’m not suggesting that there’s any borrowing going on. The idea is hardly rocket science and has undoubtedly occurred to many people independently long before I got my noggin around it. So why am I posting this?

Ah.

I am a fan of Nate Silver and his blog. I devoured his book, “The Signal and the Noise,” shortly after it came out. And last year I dragged my butt out of bed in the early morning after an awesome conference reception with multiple open bars to hear him deliver a keynote address. So I was very interested to read his post, co-authored with Allison McCann.

And yes, I may also have been interested in posting a comment in response, with links to CoolData. I am a blogger, after all. So I carefully prepared my comment, and hit ‘Go’. What happened then? A Facebook fail!

Really, Nate? I need a Facebook account to post a comment? I shut down my Facebook account years ago, for all sorts of reasons, and I don’t plan to go back. (Maybe I shouldn’t criticize. People can’t leave comments on CoolData at all. But Facebook??)

My comment needs a home. Why not right here? Thank you for reading.

I use these age/name/sex patterns to infer likely age in our university database work. We already know the name, gender and age for most people, so we can calculate mean and median ages for all combinations of name and sex, and apply those to any new records that are lacking this data (such as prospective donors). This is helpful, as ‘age’ is strongly correlated with likelihood to make a donation and the size of the gift. Gender can be an important factor … A number of first names have “flipped gender”, so they belong to either a relatively old man or a relatively young woman. Examples I know of include Ainslie, Isadore, Sydney, Shelly, and Brooke.

I have written about this a few times:

How to infer age, when all you have is a name

New twists on inferring age from first name

Putting an age-guessing trick to the test
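
The approach described in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`first_name`, `sex`, `age`), the sample ages, and the `min_count` threshold are invented here, and the code is not what the linked posts use.

```python
from collections import defaultdict
from statistics import median


def build_age_lookup(known_records, min_count=2):
    """Median age for each (first name, sex) pair seen at least min_count times."""
    ages = defaultdict(list)
    for rec in known_records:
        key = (rec["first_name"].strip().lower(), rec["sex"])
        ages[key].append(rec["age"])
    return {key: median(vals) for key, vals in ages.items() if len(vals) >= min_count}


def infer_age(record, lookup):
    """Return the median age for the record's name and sex, or None if unknown."""
    key = (record["first_name"].strip().lower(), record["sex"])
    return lookup.get(key)


# Made-up records with known ages (assumption: illustration only).
known = [
    {"first_name": "Shelly", "sex": "M", "age": 78},
    {"first_name": "Shelly", "sex": "M", "age": 82},
    {"first_name": "Shelly", "sex": "F", "age": 41},
    {"first_name": "Shelly", "sex": "F", "age": 45},
    {"first_name": "Shelly", "sex": "F", "age": 52},
    {"first_name": "Brooke", "sex": "F", "age": 24},
]

lookup = build_age_lookup(known)

# A new prospect record that lacks an age.
prospect = {"first_name": "Shelly", "sex": "F"}
print(infer_age(prospect, lookup))
```

Keying on both name and sex is what lets a “flipped” name return different ages for men and women. The `min_count` threshold is a guard I have added so that a single record does not stand in for a whole name.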

 

POSTSCRIPT

On re-reading the FiveThirtyEight post, I was struck by this passage, which I didn’t notice earlier: “There are quite a lot of websites devoted to tracking the popularity of American baby names over time. … But we haven’t seen anyone ask the age of living Americans with a given name.”

Oh really. … Let me Google that for you.

 

POST-POSTSCRIPT

When I first posted the “Let me Google that for you” link, CoolData was at the top of the results. It has since been crowded out by FiveThirtyEight and others. Such are the benefits of a large web presence and the resources to optimize search results.

 

POST-POST-POSTSCRIPT

One thing was puzzling me … In my stats, I have seen a lot of people clicking on the link to FiveThirtyEight from my blog, but I also noticed that almost 200 people (to date) have come to CoolData from FiveThirtyEight. I couldn’t figure out how — there was no link to CoolData that I could find. Well, I’ve found it. CoolData is referenced in the first footnote below the FiveThirtyEight post on age-guessing from names. One has to click the plus sign in the circle (before the comments) to see the footnotes. So — thanks, Nate!

23 December 2013

New from CASE Books: Score!

Filed under: Book, CoolData, Peter Wylie. Posted by kevinmacdonell @ 9:39 am

As the year draws to a close, I’m pleased to announce that the book I’ve co-written with Peter Wylie will be available in January. ‘Score!’ joins a host of fine publications in CASE’s new catalog. I’m looking forward to having a look through this catalog for new books for the office. (‘Score!’ is featured on page 12.)

So what is this new book about? The full title is Score!: Data-Driven Success for Your Advancement Team, and as a recent issue of BriefCASE notes: “Kevin MacDonell and Peter Wylie walk readers through compelling arguments for why an organization should adopt data-driven decision-making as well as explanations of basic issues such as identifying and mining the pertinent data and what operations to perform once that data is in hand.”

You can read the rest of that article here: Ready to Score!?
