CoolData blog

3 January 2016

CoolData (the book) beta testers needed

 

UPDATE (Jan 5): 16 people have responded to my call for volunteers, so I am going to close this off now. I have been in touch with each person who has emailed me, and I will be making a final selection within a few days. Thank you to everyone who considered taking a crack at it.

 

Interested in being a guinea pig for my new handbook on predictive modelling? I’m looking for someone (two or three people, max) to read and work through the draft of “CoolData” (the book), to help me make it better.

 

What’s it about? This long subtitle says it all: “A how-to guide for predictive modelling for higher education advancement and nonprofits using multiple linear regression in Data Desk.”

 

The ideal beta tester is someone who:

 

  • has read or heard about predictive modelling and understands what it’s for, but has never done it and is keen to learn. (Statistical concepts are introduced only when and if they are needed – no prior stats knowledge is required. I’m looking for beginners, but beginners who aren’t afraid of a challenge.);
  • tends to learn independently, particularly using books and manuals to work through examples, either in addition to training or completely on one’s own;
  • does not have an IT background but has some IT support at his or her organization, and would not be afraid to learn a little SQL in order to query a database him- or herself; and
  • has a copy of Data Desk, or intends to purchase Data Desk (available for PC or Mac).

 

It’s not terribly important that you work in the higher ed or nonprofit world — any type of data will do — but the book is strictly about multiple linear regression and the stats software Data Desk. The methods outlined in the book can be extended to any software package (multiple linear regression is the same everywhere), but because the prescribed steps refer specifically to Data Desk, I need someone to actually go through the motions in that specific package.

 

Think of a cookbook full of recipes, and how each must be tested in real kitchens before the book can go to press. Are all the needed ingredients listed? Has the method been clearly described? Are there steps that don’t make sense? I want to know where a reader is likely to get lost so that I can fix those sections. In other words, this is about more than just zapping typos.

 

I might be asking a lot. You or your organization will be expected to invest some money (for the software, from whose sales I do not benefit, by the way) and your time (in working through some 200 pages).

 

As a return on your investment, however, you should expect to learn how to build a predictive model. You will receive a printed copy of the current draft (electronic versions are not available yet), along with a sample data file to work through the exercises. You will also receive a free copy of the final published version, with an acknowledgement of your work.

 

One unusual aspect of the book is that a large chunk of it is devoted to learning how to extract data from a database (using SQL), as well as cleaning it and preparing it for analysis. This is in recognition of the fact that data preparation accounts for the majority of time spent on any analysis project. It is not mandatory that you learn to write queries in SQL yourself, but simply knowing which aspects of data preparation can be dealt with at the database query level can speed your work considerably. I’ve tried to keep the sections about data extraction as non-technical as possible, and have augmented them with clear examples.
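
As a rough illustration of what "dealing with it at the query level" can mean, here is a minimal sketch in Python using the built-in sqlite3 module. It is not taken from the book: the `gifts` table, its columns, and the rows are all hypothetical, invented for this example. The idea is simply that the database can roll up one row per donor and handle missing amounts before the data ever reaches the stats package.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gifts (donor_id INTEGER, gift_date TEXT, amount REAL);
    INSERT INTO gifts VALUES
        (1, '2014-03-01', 100.0),
        (1, '2015-06-15', 250.0),
        (2, '2015-01-20', 50.0),
        (3, '2013-11-05', NULL);
""")

# Aggregate to one row per donor inside the query itself.
# COUNT(amount) skips NULLs; COALESCE turns an all-NULL sum into zero.
rows = conn.execute("""
    SELECT donor_id,
           COUNT(amount)              AS gift_count,
           COALESCE(SUM(amount), 0.0) AS total_giving
    FROM gifts
    GROUP BY donor_id
    ORDER BY donor_id
""").fetchall()

for donor_id, gift_count, total_giving in rows:
    print(donor_id, gift_count, total_giving)

conn.close()
```

The same roll-up could be done by hand after export, but doing it in the query means the file you import is already one row per person.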

 

For a sense of the flavour of the book, I suggest you read these excerpts carefully: “Exploring associations between variables” and “Testing associations between two categorical variables.”

 

Contact me at kevin.macdonell@gmail.com and tell me why you’re interested in taking part.

 

 

 

26 August 2015

Exploring associations between variables

Filed under: Book, CoolData, Predictor variables — kevinmacdonell @ 6:57 pm

 

CoolData has been quiet over the summer, mainly because I’ve been busy writing another book. (Fine weather has a bit to do with it, too.) The book will be for nonprofit and higher education advancement professionals interested in learning how to use multiple regression to build predictive models. Over the next few months, I will adapt various bits from the work-in-progress as individual posts here on CoolData.

 

I’ll have more to say about the book later, so if you’re interested, I suggest subscribing via email (see the box to the right) to have the inside track on this project. (And if you aren’t familiar with the previous book, co-written with Peter Wylie, then have a look here.)

 

A generous chunk of the book is about the specifics of getting your hands dirty with cleaning up your messy data, transforming it to make it suitable for regression analysis, and exploring it for interesting patterns that can strengthen a predictive model.

 

When you import a data set into Data Desk or another statistics package, you are looking at more than just a jumble of variables. All these variables are in a relation; they are linked by patterns. Some variables are strongly associated with each other, others have weaker associations, and some are hardly related to each other at all.

 

What is meant by “association”? A classic example is a data set of children’s weights, heights, and ages. Older children tend to weigh more and be taller than younger children. Heavier children tend to be older and taller than lighter children. We say that there is an association between age and weight, between height and weight, and between age and height.

 

Another example: Alumni who are bigger donors tend to attend more reunion events than alumni who give more modestly or don’t give at all. Or put the other way, alumni who attend more events tend to give more than alumni who attend fewer or no events. There is an association between giving and attending events.

 

This sounds simple enough — even obvious. The powerful consequence of these truths is that if we know the value of one variable, we can make a guess at the value of another, as long as the association is valid. So if we know a child’s weight and height, we can make a good guess of his or her age. If we know a child’s height, we can guess weight. If we know how many reunions an alumna has attended, we can make a guess about her level of giving. If we know how much she has given, we can guess whether she’s attended more or fewer reunions than other alumni.
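
To make the idea of "guessing one variable from another" concrete, here is a small sketch in Python. It is only an illustration, not the procedure from the book (which works in Data Desk), and the six height and weight values are made up. It measures the strength of the association with a Pearson correlation, then fits a straight line by least squares and uses it to guess a weight from a height.

```python
from math import sqrt

# Hypothetical measurements for six children (invented for illustration).
heights_cm = [95, 104, 110, 121, 128, 140]
weights_kg = [14, 17, 19, 23, 26, 33]


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation: +1 is a perfect positive linear association."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for guessing y from x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept


r = pearson_r(heights_cm, weights_kg)
slope, intercept = fit_line(heights_cm, weights_kg)

# Guess the weight of a child who is 115 cm tall.
guess_kg = slope * 115 + intercept

print(f"correlation between height and weight: {r:.2f}")
print(f"guessed weight at 115 cm: {guess_kg:.1f} kg")
```

With real data the correlation would be weaker and the guess rougher, which is the point of the caution that follows: these are informed guesses, not exact answers.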

 

We are guessing an unknown value (say, giving) based on a known value (number of events attended). But note that “giving” is not really an unknown. We’ve got everyone’s giving recorded in the database. What is really unknown is an alum’s or a donor’s potential for future giving. With predictive modelling, we are making a guess at what the value of a variable will be in the (near) future, based on the current value of other variables, and the type and degree of association they have had historically.

 

These guesses will be far from perfect. We aren’t going to be bang-on in our guesses of children’s ages based on weight and height, and we certainly aren’t going to be very accurate with our estimates of giving based on event attendance. Even trickier, projecting into the future — estimating potential — is going to be very approximate.

 

Still, our guesses will be informed guesses, as long as the associations we detect are real and not due to random variation in our data. Can we predict exactly how much each donor is going to give over this coming year? No, that would be putting too much confidence in our powers. But we can expect to have plenty of success in ranking our constituents in order by how likely they are to engage in whatever behaviour we are interested in, and that knowledge will be of great value to the business.
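
Ranking, as opposed to predicting exact amounts, can be sketched in a few lines of Python. The IDs and scores below are hypothetical, standing in for whatever output a model produces; the only assumption is that a higher score means "more likely to engage in the behaviour of interest."

```python
# Hypothetical model scores keyed by constituent ID (invented for illustration).
scores = {
    "A101": 0.82,
    "A102": 0.15,
    "A103": 0.47,
    "A104": 0.91,
    "A105": 0.33,
}

# Order constituents from most to least likely; exact scores matter less
# than the order they imply.
ranked = sorted(scores, key=scores.get, reverse=True)

# A team with capacity for only two contacts would start at the top.
top_two = ranked[:2]

for position, constituent_id in enumerate(ranked, start=1):
    print(position, constituent_id, scores[constituent_id])
```

Even if every individual score is off, the ordering can still be useful, because it tells you where to spend limited time first.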

 

Looking for potentially useful associations is part of data exploration, which is best done in full hands-on mode! In a future post I will talk about specific techniques for exploring different types of variables.

 

18 January 2015

Why blog? Six reasons and six cautions

Filed under: CoolData, Off on a tangent, Training / Professional Development — kevinmacdonell @ 4:12 pm

The two work-related but extracurricular activities I have found the most rewarding, personally and professionally, are giving conference presentations and writing for CoolData. I’ve already written about the benefits of presenting at conferences, explaining why the pain is totally worth it. Today: six reasons why you might want to try blogging, followed by six (optional) pieces of advice.

I’ve been blogging for just over five years, and I can say that the best way to start, and stay started, is to seek out motives that are selfish. The type of motivation I’m thinking of is intrinsic, such as personal satisfaction, as opposed to extrinsic, such as aiming to have a ton of followers and making money. It’s a good selfish.

Three early reasons for getting started with a blog are:

1. Documenting your work: One of my initial reasons for starting was to have a searchable place to keep snippets of knowledge, such as specific techniques for manipulating data in Excel. I have found myself referring to older published pieces to remind me how I carried out an analysis or when I need a block of SQL. A blog has the added benefit of being shareable, but if your purpose is personal documentation, it doesn’t matter if you have any audience at all.

2. Developing your thoughts: Few activities bring focus and clarity to your thoughts like writing about them. Some of my ideas on more abstract issues have been shaped and developed this way. Sometimes the office is not the best environment for this sort of reflective work. A blog can be a space for clarity. Again — no need for an audience.

3. Solidifying your learning: One of the best ways to learn something new is by teaching it to someone else. I may have had an uncertain grasp of multiple linear regression, for example, when I launched CoolData, but the exercise of trying to explain data mining concepts and techniques was a great way to get it all straight in my head. If I were to go back today and re-read some of my early posts on the subject, which I rarely do, I would find things I probably would disagree with. But the likelihood of being wrong is not a good enough reason to avoid putting your thoughts out there. Being naive and wrong about things is a stage of learning.

Let’s say that, motivated by these or other reasons, you’ve published a few posts. Suddenly you’ve got something to share with the world. Data analysis lends itself perfectly to discussion via blogs. Not only analysts and data miners, but programmers, prospect researchers, business analysts, and just about anyone engaged in knowledge work can benefit personally while enriching their profession by sharing their thoughts with their peers online.

As you slowly begin to pick up readers, new reasons for blogging will emerge. Three more reasons for blogging are:

4. Making professional connections: As a direct result of writing the blog I have met all kinds of interesting people in the university advancement, non-profit, and data analysis worlds. Many I’ve met only virtually, others I’ve been fortunate to meet in person. It wasn’t very long after I started blogging that people would approach me at conferences to say they had seen one of my posts. Some of them learned a bit from me, or more likely I learned from them. A few have even found time to contribute a guest post.

5. Sharing knowledge: This is the obvious one, so no need to say much more. Many advancement professionals share online already, via various listservs and discussion forums. The fact that this sharing goes on all the time makes me wonder why more people don’t try to make their contributions go even farther by taking the extra step of developing them into blog posts that can be referred to anytime.

6. Building toward larger projects: If you keep at it, slowly but surely you will build up a considerable body of work. Blogging can feed into conference presentations, discussion papers, published articles, even a book.

Let me return to the distinction I made earlier between intrinsic and extrinsic motivators — the internal, more personal rewards of blogging versus the external, often monetary, goals some people have. As it happens, the personal reasons for blogging are realistic, with a high probability of success, while the loftier goals are likely to lead to premature disillusionment. A new blog with no audience is a fragile thing; best not burden it with goals you cannot hope to realize in the first few years.

I consider CoolData a success, but not by any external measure. I simply don’t know how many followers a blog about data analysis for higher education advancement ought to have, and I don’t worry about it. I don’t have goals for number of visitors or subscribers, or even number of books sold. (Get your copy of “Score!” here. … OK — couldn’t resist.)

The blog does what I want it to do.

That’s mostly what I have to say, really. I have a few bits of advice, but my strongest advice is to ignore what everybody else thinks you should do, including me. Most expert opinion on posting frequency, optimum length for posts, ideal days and times for publishing, click-bait headlines, search engine optimization and the like is a lot of hot air.

If you’re still with me, here are a few cautions and pieces of advice; take them or leave them:

1. On covering your butt: Some employers take a dim view of their employees publishing blogs and discussing work-related issues on social media. You might want to clear your activity with your supervisor first. When I changed jobs, I disclosed that I intended to keep up my blog. I explained that connecting with counterparts at other universities was a big part of my professional development. There’s never been an issue. Be clear that you’re writing for a small readership of professionals who share your interests, an activity not unlike giving a conference presentation. Any enlightened organization should embrace someone who takes the initiative. (You could blog secretly and anonymously, but what’s the point?)

2. On “permission”: Beyond ensuring that you are not jeopardizing your day job, you do not require anyone’s permission. You don’t have to be an expert; you simply have to be interested in your subject and enthusiastic about sharing your new knowledge with others. Beginners have an advantage over experts when it comes to blogging; an expert will often struggle to relate to beginners, and assume too much about what they know or don’t know. So what if that post from two years ago embarrasses you now? You can always just delete it. If you’re reticent about speaking up, remember that blogging is not about claiming to be an authority on anything. It’s about exploring and sharing. It’s about promoting helpful ideas and approaches. You can’t prevent small minds from interpreting your activity as self-promotion, so just keep writing. In the long run, it’s the people who never take the risk of putting themselves out there who pay the higher price.

3. On writing: The interwebs ooze with advice for writers so I won’t add to the noise. I’ll just say that, although writing well can help, you don’t need to be an exceptional stylist. I read a lot of informative yet sub-par prose every day. The misspellings, mangled English, and infelicities that would be show-stoppers if I were reading a novel just aren’t that important when I’m reading for information that will help me do my job.

4. On email: In the early days of email I thought it rude not to respond. Today things are different: It’s just too easy to bombard people. Don’t get me wrong: I have received many interesting questions from readers (some of which have led to new posts, which I love), as well as great opportunities to connect, participate in projects, and so on. But just because you make yourself available for interaction doesn’t mean you need to answer every email. You can lay out the ground rules on an “About” page. If someone can’t be bothered to consider your guidelines for contact, then an exchange with that person is not going to be worth the trouble. On my “About this Blog” page I make it clear that I don’t review books or software, yet the emails offering me free stuff for review keep coming. I have no problem deleting those emails unanswered. … Then there are emails that I fully intend to respond to, but don’t get the chance. Before long they are buried in my inbox and forgotten. I do regret that a little, but I don’t beat myself up over it. (However — I do hereby apologize.)

5. On protecting your time: Regardless of how large or small your audience, eventually people will ask you to do things. Sometimes this can lead to interesting partnerships that advance the interests of both parties, but choose wisely and say no often. Be especially wary of quid pro quo arrangements that involve free stuff. I rarely read newspaper travel writing because I know so much of it is bought and paid for by tour companies, hotels, restaurants and so on, without disclosure. However, I’m less concerned about high-minded integrity than I am about taking on extra burdens. I’m a busy guy, and also a lazy guy who jealously guards his free time, so I’m careful about being obliged to anyone, either contractually or morally. Make sure your agenda is set exclusively by whatever has your full enthusiasm. You want your blogging to be a free activity, where no one but you calls the shots.

6. On the peanut gallery: Keeping up a positive conversation with people who are receptive to your message is productive. Trying to convince skeptics and critics who are never going to agree with you is not. When you’re pushing back, you’re not pushing forward. Keep writing for yourself and the people who want to hear what you’ve got to say, and ignore the rest. This has nothing to do with being nice or avoiding conflict. I don’t care if you’re nice. It’s about applying your energies in a direction where they are likely to produce results. Focus on being positive and enabling others with solutions and knowledge, not on indulging in opinions, fruitless debates, and pointless persiflage among the trolls in the comments section. I haven’t always followed my own advice, but I try.

Some say “know your audience.” Actually, it would be better if you know yourself. Readers respond to your personality and they can only get to know you if you are consistent. You can only be consistent if you are genuine. There are 7.125 billion people in the world and almost half of them have an internet connection (and access to Google Translate). Some of those will become your readers — be true to them by being true to yourself. There is no need to waste your time chasing the crowd.

Your overarching goals are not to convince or convert or market, but to 1) fuel your own growth, and 2) connect with like-minded people. Growth and connection: That’s more than enough payoff for me.

27 December 2012

Holiday indulgence

Filed under: Book, CoolData, Off on a tangent — kevinmacdonell @ 4:35 pm

I’ve always tried to stay on-topic with CoolData content: If you subscribe, you know what you’re getting, and if you lose interest and unsubscribe, you know what you’re missing. But I’m on holiday, so I’m inclined to let content rules slip a bit. My wife and I are spending time with family on Cape Breton Island and in the Annapolis Valley in Nova Scotia. I’m less vigilant than usual about what I eat (more turkey, more sweets, more wine) and what I do (nothing, essentially). It is in this state of desuetude that I write this last blog post of the year.

Allow me to indulge by writing not about predictive analytics, but about CoolData itself, which has just turned three years old. That’s middle age for a blog, I figure. First I’ll go through some numbers, and then I’ll tell you about some things coming in the new year.

CoolData by the numbers

As of yesterday, CoolData has had 177,915 page views since it was launched. The number of visitors continues to grow gradually; 6,000 page views a month is the current average. These are page views, not unique visitors: WordPress has been informing me about unique visits only since early December. So far, each unique visitor averages 1.4 page views.

Visits have come from almost every country in the world, but of course most are from the United States. It is not unusual for my own country, Canada, to be edged out of second place by the UK, India or Australia on any given day. The top 20 or so countries since February 2012 are included in the WordPress-created graphic below.

[Figure: WordPress-created graphic of the top 20 or so countries by visits since February 2012]

These visitor numbers are not small, but I’m not pretending they’re impressive, either. My subject is rather niche. As well, many visitors aren’t really looking for CoolData. Half of my traffic comes from people stumbling in from Google and other search engines, and they’re looking for simple (or simplistic) explanations of statistical concepts. The most popular post by far is How high, R squared? — published in April 2010, it is still heavily visited every day by confused and desperate grad students from all corners of the globe. I don’t consider these people part of the CoolData “tribe”, if I can call it that.

The tribe — the readers I care most about — are typically the ones who have subscribed to receive updates. (There are also a lot of RSS subscribers — I don’t have as good a handle on those numbers.*) As of today, there are 680 subscribers — 48 subscribers via WordPress accounts, and 632 via email. This number has been growing very gradually over the past three years. I realize many people sign up for things they never return to (I do it all the time), but when an update goes out, I estimate that about half of my subscribers click through to the new post, which I find encouraging. They are far more likely to click through than my followers on Twitter (@kevinmacdonell).

Most readers visit during the work week (readership drops off dramatically on weekends), so not surprisingly most subscribers use their real work address rather than a free Gmail, Hotmail, or Yahoo account. From my own research, I know providing a work email is associated with higher levels of engagement, and “.edu” addresses alone (US-affiliated higher ed institutions) account for 293 subscribers. Another 101 addresses have the less restrictive top-level domain of “.org”. Among country-specific top-level domains, the top ones are Canada (.ca) with 46 and the United Kingdom (.uk) with 29. There are 142 “.com” addresses, and roughly half of them are Gmail, Yahoo or Hotmail. There are 443 unique domains in all, the top ones being uw.edu (University of Washington) and ubc.ca (University of British Columbia).
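
For anyone curious how a breakdown like this might be produced from a subscriber list, here is a minimal sketch in Python. It is not how I actually tallied the numbers above, and the addresses are placeholders on reserved example domains rather than real subscribers. It counts top-level domains and full domains with `collections.Counter`.

```python
from collections import Counter

# Placeholder addresses on reserved example domains (not real subscribers).
addresses = [
    "reader1@example.edu",
    "reader2@example.edu",
    "reader3@example.org",
    "reader4@example.com",
    "reader5@example.ca",
    "reader6@example.co.uk",
]


def domain(address):
    """Everything after the last '@', lower-cased."""
    return address.rsplit("@", 1)[1].lower()


def top_level_domain(address):
    """The final dot-separated label of the domain, e.g. 'edu' or 'uk'."""
    return domain(address).rsplit(".", 1)[1]


tld_counts = Counter(top_level_domain(a) for a in addresses)
domain_counts = Counter(domain(a) for a in addresses)

print("subscribers by top-level domain:", dict(tld_counts))
print("unique domains:", len(domain_counts))
print("most common domain:", domain_counts.most_common(1)[0])
```

Flagging free webmail domains (Gmail, Yahoo, Hotmail) would be one more lookup against `domain_counts`, which is how a work-address indicator could be derived.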

Start writing!

Up to now I’ve been coy about answering questions about my stats, for no real reason. I figure I might as well come clean. I have long felt that there is more room for writing on this topic, so if knowing more about my readership encourages you to start your own blog, then I encourage you to make 2013 your year to step up. All it takes is a few minutes to sign up on WordPress or a similar free service, and start writing.

If you’re not up for creating your own blog, then consider writing a guest post for CoolData. Up to this point, guest posting has been by invitation only, but starting today I am open to receiving post ideas from anyone interested in writing on the topic of predictive analytics for nonprofit fundraising or higher education advancement (including alumni engagement). I plan to limit submitted guest posts to one per month. Multiple submissions are welcome, but submissions that are completely off-topic will not get a response. Email me at kevin.macdonell@gmail.com to suggest/discuss your idea before you start writing.

No more comments

As I begin a new year, naturally I think of changes I’d like to make. For one, I will be taking a new approach to comments on posts. Only 514 comments have been contributed since December 2009, and 140 of those are mine. This is not a disappointment — I had no designs one way or the other — but the time has come to recognize the fact that CoolData has never been effective as a discussion forum. There have been a few good questions and observations made by commenters, but unfortunately too many comments are of the “drive-by” variety: Brief one-off criticisms that require rebuttal but never lead to any forward advance in the discussion or added enlightenment for beginning predictive modelers. The best questions, the most honest comments, and the most well-reasoned objections tend to come to me via private email.

For that reason, I am shutting off the ability to respond with public comments. There have been no nasty personal attacks, nor abusive language, nor anything I’ve felt forced to delete (aside from spam). Simply put, after three years of writing and editing this blog, I no longer feel the need to provide a platform for people whose main interest is something other than being part of a shared endeavour to learn, to grow, and to bring our institutions and organizations into the age of data. Responses, questions, and critiques are always welcome via private email, and I may choose to gather the best responses for use in follow-up blog posts. Keep in mind, too, that the best forums for discussion are still the listservs (Prospect-dmm is the best example), and new conversations crop up every week in the many groups of interest you can find on social networking sites such as LinkedIn.

SCORE!

On a more positive note, 2013 will be the year that a new book, Score!, which I have co-written with Peter Wylie, will be published. I’ve said very little about it to date, in part because I won’t actually believe it until it’s in my hands. It’s a project with a long gestation … writing a book has nearly nothing in common with knocking off a blog post. However, I’m confident we’ll see it out sometime during the first half of the year.

That’s all for 2012. Best of luck in your data-related work in 2013!

{}{}{}{}{}{}{}{}

* A regular reader who subscribes via RSS reminded me that I have given short shrift to the RSS crowd — I just don’t know how many subscribe via RSS. It is quite possible, then, that I am overestimating the number of email subscribers who click through to the post.
