CoolData blog

8 May 2012

Emerson’s big data

Filed under: Model building, Off on a tangent — Tags: — kevinmacdonell @ 11:35 am

One day in late March I got on a plane from Toronto (where I attended Annual Fund benchmarking meetings hosted by Target Analytics) to Las Vegas (for the Sungard Higher Education Summit), and picked up the Toronto Globe & Mail. I scanned a section that offered some ephemera, including the startling news that my fellow countryman William Shatner had turned 81. Once I got over that shock, I read the Globe’s “Thought du jour,” a quote from Ralph Waldo Emerson.

Because I’m an admirer of Emerson, and because I figured I could appropriate his quote for my own selfish purposes, I scribbled it down:

“The world can never be learned by learning all its details.”

Emerson did not live in the age of big data. But in a way, the world he experienced — the world we all experience through our senses — IS big data. We don’t perceive our surroundings directly, but only through our brain’s interpretations of sense impressions. We navigate the world via mental models of our own creation. These models leave out nearly everything. They are not reality, no more than a map of a city is faithful to the reality of the city, or than our memory of an event is faithful to the details of the event (which would overwhelm us every time it came to mind).

In our work with data, we measure things (or their proxies) in order to get a handle on them and in order to gain insight. We lose most of the detail in the process, but we need to in order to learn something. We build models based on general patterns. So as George E.P. Box said: All models are wrong, but some are useful.

26 April 2012

For agile data mining, start with the basics

Filed under: Analytics, Pitfalls, Training / Professional Development — Tags: , , , — kevinmacdonell @ 8:56 am

Lately I’ve been telling people that one of the big hurdles to implementing predictive analytics in higher education advancement is the “project mentality.” We too often think of each data mining initiative as a project, something with a beginning and end. We’d be far better off to think in terms of “process” — something iterative, always improving, and never-ending. We also need to think of it as a process with a fairly tight cycle: Deploy it, let it work for a bit, then quickly evaluate, and tweak, or scrap it completely and start over. The whole cycle works over the course of weeks, not months or years.

Here’s how it sometimes goes wrong, in five steps:

  1. Someone has the bright idea to launch a “major donor predictive modelling project.” Fantastic! A committee is struck. They put their heads together and agree on a list of variables that they believe are most likely to be predictive of major giving.
  2. They submit a request to their information management people, or whoever toils in extracting stuff from the database. Emails and phone calls fly back and forth over what EXACTLY THE HECK the data mining team is looking for.
  3. Finally, a massive Excel file is delivered, a thing the likes of which would never exist in nature — like the unstable, man-made elements on the nether fringes of the Periodic Table. More meetings are held to come to agreement about what to do about multiple duplicate rows in the data, and what to do about empty cells. The committee thinks maybe the IT people need to fix the file. Ummm — no!
  4. Half of the data mining team then spends considerable time in pursuit of a data file that gleams in its cleanliness and perfection. The other half is no longer sure what the goal of the project was.
  5. Somehow, a model is created and the records are scored by the one team member left standing. Unfortunately, a year has passed and the person for whom the model was built has left for a new job in California. Her replacement refers to the model as “astrology.”

Allow me a few observations that follow from these five stages:

  1. Successful models are rarely produced by committee, and variables cannot be pre-selected by popular agreement and intuition — although certainly experience is a valuable source of clues.
  2. Submitting requests to someone else for data, having to define exactly what it is you want, and then waiting for the request to be fulfilled — all of that is DEATH to creative data exploration.
  3. A massive, one-time, all-or-nothing data suction job is probably not the ideal starting point. Neither is handling an Excel file with 200,000 rows and a hundred columns.
  4. Perfect data is not a realistic goal, and is not a prerequisite for fruitful data mining.
  5. A year is too long. The cycle has to be much, much tighter than that.

And finally, here are some concrete steps, based on the observations, again point-for-point:

  1. If you’re interested in data mining, try going it alone. Ask for help when you need it, but you’ll make faster progress if you explore on your own or in a team of no more than two or three like-minded people. Don’t tell anyone you’re launching a “project,” and don’t promise deliverables unless you know what you’re doing.
  2. Learn how to build simple queries to pull data from your database. Get IT to set you up. Figure out how to pull a file of IDs along with the sum of all their hard-credit giving. Then, pull that AND something else, anything else. Email address, class year, marital status, whatever. Practice, and get comfortable with how your data is stored and how to limit it to what you want.
  3. Look into stats software, and learn some of the most common stats terms. Read up on correlation in particular. Build larger files for analysis in the stats software rather than in Excel. Read, read, read. Play, play, play.
  4. Think in terms of pattern detection, and don’t get hung up on the validity of individual data points.
  5. If you’ve done steps 1 to 4, you have the foundations in place for being an agile data miner.
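
To make step 2 concrete, here is a minimal sketch of the "IDs plus sum of hard-credit giving, plus something else" pull. Everything in it is an assumption for illustration: the table names (constituent, gift), the column names, and the sample rows are invented, and an in-memory SQLite database stands in for whatever system your institution actually uses.

```python
import sqlite3

# Hypothetical schema and sample rows; a real advancement database will differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE constituent (id INTEGER PRIMARY KEY, class_year INTEGER, email TEXT);
    CREATE TABLE gift (id INTEGER, amount REAL, credit_type TEXT);
    INSERT INTO constituent VALUES
        (1, 1985, 'a@example.edu'), (2, 1999, NULL), (3, 2008, 'c@example.edu');
    INSERT INTO gift VALUES
        (1, 100.0, 'hard'), (1, 250.0, 'hard'), (1, 50.0, 'soft'), (3, 25.0, 'hard');
""")


def giving_by_id(connection):
    """Return one row per ID: lifetime hard-credit giving, class year, email flag."""
    query = """
        SELECT c.id,
               COALESCE(SUM(g.amount), 0) AS lifetime_hard_credit,
               c.class_year,
               c.email IS NOT NULL AS has_email
        FROM constituent AS c
        LEFT JOIN gift AS g
               ON g.id = c.id AND g.credit_type = 'hard'
        GROUP BY c.id, c.class_year, c.email
        ORDER BY c.id
    """
    return connection.execute(query).fetchall()


rows = giving_by_id(conn)
for row in rows:
    print(row)
```

The LEFT JOIN keeps non-donors in the file with a total of zero, which matters later: a model needs the people who never gave as much as the people who did.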

Mind you, it could take considerable time (months, maybe even years) to get really comfortable with the basics, especially if data mining is a sideline to your “real” job. But success and agility do depend on being able to work independently, being able to snag data on a whim, being able to understand a bit of what is going on in your software, having the freedom to play and explore, and losing notions about data that come from the business analysis and reporting side. In other words, the basics.

24 April 2012

Data I want to play with

Filed under: Data, Fun — Tags: — kevinmacdonell @ 5:23 am

Guest post by Marianne M. Pelletier, Director of Advancement Research and Data Support, Cornell University

In my present job, I deal with a whole lot of data – over 2,000 fields of data on gifts, names, addresses, relationships, segmenting codes, dates, attributes, interests, contacts, you name it. Yet getting to play in this playground as a donor modeler only leaves me lusting for other kinds of data to play with, so much that my hobbies often lead me to places where data lives so I can fool with it. This short article is my wish list, whether or not I’ll ever get to mine any of it.

Horse Races are tracked to the umpteenth degree by handicappers. Buy a copy of the Daily Racing Form and you’ll see more statistics presented than you can read in a week. DRF also has a web page where you can download even more statistics – tracking the horses’ pedigree generations back in time and the jockey’s entire career, ride by ride. So what do I do? I spend some Sundays diligently typing key statistics into a homemade database, along with the race results, to see if I can find the regression formula that would make me more money than just following the program picks. The answer? So far, on maiden sprints on dirt, the horse that had the fastest workout is most likely to win. For every other kind of race, I’m still wishing to buy the data in a format I can manipulate instead of having to type it.

Speaking of gambling, I’d give my remaining eye tooth to play in Harrah’s data. Harrah’s is an incredibly good marketing firm, from offering me a free weekend to their new casino in some remote place to being the only game in town that offers $10 craps all weekend long. Imagine if you will getting to download affinity player card data and tracking where a person wanders in the casino – how many mix slots with table play? How many are single game players? What if the casino moved the buffet closer to Keno? What’s the best game to put right inside the valet parking entrance? Do the longer, red craps tables make one bet more or lose more? Or play longer? What is the average time for a player at a blackjack table? What if she’s drinking alcohol? What if she’s an awards card member? What if the player is male? What if the dealer is the same gender as the player? I’d be a kid in a candy store to get a contract to work data like that.

On the other side of the coin, what is the effect of parking availability on local business? Wouldn’t it be fun to figure out the dependent variable on that? Ithaca recently changed its parking rates from the first hour free to charging for every hour. Was it that or the longstanding recession that caused local businesses to disappear? Or is the turnover normal? Would I have to study when the students are in town vs. when they are gone? Would local businesses share their profit numbers with me?

And then there’s the whole thing about the best time of year to go to Disney World. I’d want to offer Disney a study of some kind (like, which ride should go next to the Small World ride?) in order to get data on when I’m most likely to enjoy good weather, a maximum number of rides open, and the fewest screaming children and strollers under my feet.

And speaking of flying somewhere, I’d love for Delta to hire me to study when people want to fly somewhere. All that Expedia/Travelocity search data – does anyone use it? After all, what if airlines could arrange that people in Boston can fly midmorning but people in New York can fly at night? What if there were one extra flight at 11:00 am from somewhere that would double an airline’s traffic because of the ripple effect? I’d love to be the one who discovers that.

Lastly, who can resist wishing to forecast forex? The currency exchange market is very likely very well tested by experts, but not by me. What if I could predict the day of week and time of day that the Euro drifts off against the dollar? I’d place my bet once a week and then go off to the casino. Or Disney. Or shopping. Oh, bother! It all looks like there’s data teeming everywhere, everywhere, and I’m only going to live so long.

18 April 2012

Stepwise, model-foolish?

Filed under: Model building, Pitfalls, regression, Software, Statistics — Tags: , — kevinmacdonell @ 8:00 am

My approach to building predictive models using multiple linear regression might seem plodding to some. I add predictor variables to the regression one by one, instead of using stepwise methods. Even though the number of predictor variables I use has greatly increased, and the time needed to build a model has lengthened, I am even less likely to use stepwise regression today than I was a few years ago.

Stepwise regression, available in most stats software, tosses all the predictor variables into the analysis at once and picks the best for you. It’s a semi-automated process that can work forwards or backwards, adding or deleting variables until it’s satisfied a statistical rule of thumb. The software should give you some control over the process, but mostly your computer is making all the big decisions.

I understand the allure. We’re all looking for ways to save time, and generally anything that automates a repetitive process is a good thing. Given a hundred variables to choose from, I wouldn’t be surprised if my software was able to get a better-fitting model than I could produce on my own.

But in this case, it’s not for me.

Building a decent model isn’t just about getting a good fit in terms of high R square. That statistic tells you how well the model fits the data that the model was built on — not data the model hasn’t yet seen, which is where the model does its work (or doesn’t). The true worth of the model is revealed only over time, but you’re more likely to succeed if you’ve applied your knowledge and judgement to variable selection. I tend to add variables one by one in order of their Pearson correlation with the target variable, but I am also aware of groups of variables that are highly correlated with each other and likely to cause issues. The process is not so repetitive that it can always be automated. Stepwise regression is more apt to select a lot of trivial variables with overlapping effects and ignore a significant predictor that I know will do the job better.
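
The ordering step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the variable names and values are invented, the 0.8 cutoff for "highly correlated with each other" is an arbitrary choice rather than a rule, and the code only ranks and flags candidates. Deciding what actually goes into the regression remains a judgement call.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def rank_predictors(predictors, target):
    """Sort predictors by absolute correlation with the target, strongest first."""
    scored = [(name, pearson(values, target)) for name, values in predictors.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


def overlapping_pairs(predictors, threshold=0.8):
    """List predictor pairs whose mutual correlation meets the (assumed) cutoff."""
    names = list(predictors)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(predictors[first], predictors[second])
            if abs(r) >= threshold:
                flagged.append((first, second, r))
    return flagged


# Hypothetical data: six alumni, lifetime giving as the target variable.
target = [0, 10, 50, 100, 200, 500]
predictors = {
    "events_attended": [0, 1, 1, 2, 3, 5],
    "has_email": [0, 0, 1, 1, 1, 1],
    "years_since_grad": [5, 10, 20, 30, 35, 40],
    "age": [27, 32, 42, 52, 57, 62],
}

ranked = rank_predictors(predictors, target)
flagged = overlapping_pairs(predictors)
for name, r in ranked:
    print(f"{name:18s} r = {r:+.3f}")
for first, second, r in flagged:
    print(f"overlap: {first} / {second} (r = {r:+.3f})")
```

In this made-up data, age is just years since graduation plus a constant, so the two carry the same information. That is the kind of overlap worth knowing about before adding both to a model.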

Or so I suspect. My avoidance of stepwise regression has always been due to a vague antipathy rather than anything based on sound technical concerns. This collection of thoughts I came across recently lent some justification to this undefined feeling: Problems with stepwise regression. Some of the authors’ concerns are indeed technical, but the ones that resonated the most for me boiled down to this: Automated variable selection divorces the modeler from the process so that he or she is less likely to learn things about their data. It’s just not as much fun when you’re not making the selections yourself, and you’re not getting a feel for the relationships in your data.

Stepwise regression may hold appeal for beginning modellers, especially those looking for push-button results. I can’t deny that software for predictive analysis is getting better and better at automating some of the most tedious aspects of model-building, particularly in preparing and cleaning the data. But for any modeller, especially one working with unfamiliar data, nothing beats adding and removing variables one at a time, by hand.

28 March 2012

Are we missing too many alumni with web surveys?

Filed under: Alumni, John Sammis, Peter Wylie, Surveying, Vendors — Tags: , — kevinmacdonell @ 8:04 am

Guest post by Peter B. Wylie and John Sammis

(Download a printer-friendly PDF version here: Web Surveys Wylie-Sammis)

With the advent of the internet and its exponential growth over the last decade and a half, web surveys have gained a strong foothold in society in general, and in higher education advancement in particular. We’re not experts on surveys, and certainly not on web surveys. However, let’s assume you (or the vendor you use to do the survey) e-mail a random sample of your alumni, or your entire universe of alumni, and invite them to go to a website and fill out a survey. If you do this, you will encounter the problem of poor response rate. If you’re lucky, maybe 30% of the people you e-mailed will respond, even if you vigorously follow up with non-responders, encouraging them to please fill the thing out.

This is a problem. There will always be the lingering question of whether or not the non-responders are fundamentally different from the responders with respect to what you’re surveying them about. For example, will responders:

  • Give you a far more positive view of their alma mater than the non-responders would have?
  • Tell you they really like new programs the school is offering, programs the non-responders may really dislike, or like a lot less than the responders?
  • Offer suggestions for changes in how alumni should be approached, changes that non-responders would not offer or would actively discourage?

To test whether these kinds of questions are worth answering, you (or your vendor) could do some checking to see if your responders:

  • Are older or younger than your non-responders. (Looking at year of graduation for both groups would be a good way to do this.)
  • Have a higher or lower median lifetime giving than your non-responders.
  • Attend more or fewer events after they graduate than your non-responders.
  • Are more or less likely than your non-responders to be members of a dues-paying alumni association.
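
The four checks above amount to profiling two groups side by side. Here is a minimal sketch of that comparison; the records and field names are invented for illustration, and a real file would of course hold thousands of rows rather than six.

```python
from statistics import median

# Hypothetical survey file: one record per alum who was e-mailed.
alumni = [
    {"responded": True, "class_year": 1998, "lifetime_giving": 500, "events": 3, "assoc_member": True},
    {"responded": True, "class_year": 2004, "lifetime_giving": 120, "events": 1, "assoc_member": True},
    {"responded": True, "class_year": 2010, "lifetime_giving": 0, "events": 0, "assoc_member": False},
    {"responded": False, "class_year": 1965, "lifetime_giving": 50, "events": 0, "assoc_member": False},
    {"responded": False, "class_year": 1972, "lifetime_giving": 0, "events": 1, "assoc_member": False},
    {"responded": False, "class_year": 1990, "lifetime_giving": 25, "events": 0, "assoc_member": True},
]


def profile(group):
    """Summarize one group on the four checks: age, giving, events, membership."""
    n = len(group)
    return {
        "count": n,
        "median_class_year": median([a["class_year"] for a in group]),
        "median_lifetime_giving": median([a["lifetime_giving"] for a in group]),
        "mean_events": sum(a["events"] for a in group) / n,
        "pct_assoc_member": 100 * sum(a["assoc_member"] for a in group) / n,
    }


responders = profile([a for a in alumni if a["responded"]])
non_responders = profile([a for a in alumni if not a["responded"]])

for key in responders:
    print(f"{key:24s} responders: {responders[key]:>8.1f}   non-responders: {non_responders[key]:>8.1f}")
```

Large gaps between the two columns would suggest the responders are not a fair stand-in for everyone who was invited.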

It is our impression that most schools that conduct alumni web surveys don’t do this sort of checking. In their reports they may discuss what their response rates are, but few offer an analysis of how the responders are different from the non-responders.

Again, we’re talking about impressions here, not carefully researched facts. But that’s not our concern in this paper. Our concern here is that web surveys (done in schools where potential responders are contacted only by e-mail) are highly unlikely to be representative of the entire universe of alums — even if the response rate for these surveys is always one hundred percent. Why? Because our evidence shows that alumni who have an e-mail address listed with their schools are markedly different (in terms of two important variables) from alumni who do not have an e-mail address listed: Age and giving.

To make our case, we’ll offer some data from four higher education institutions spread out across North America; two are private, and two are public. Let’s start with the distribution of e-mail addresses listed in each school by class year decile. You can see these data in Tables 1-4 and Figures 1-4. We’ll go through Table 1 and Figure 1 (School A) in some detail to make sure we’re being clear.

Take a look at Table 1. You’ll see that the alumni in School A have been divided up into ten roughly equal size groups where Decile 1 represents the oldest group and Decile 10 represents the youngest. The table shows a very large age range. The youngest alums in Decile 1 graduated in 1958. (Most of you reading this paper were not yet born by that year.) The alums in Decile 10 (unless some of them went back to school late in life) are all twenty-somethings.

Table 1: Count, Median Class Year, and Minimum and Maximum Class Years for All Alums Divided into Deciles for School A

Now look at Figure 1. It shows the percentage of alums by class year decile who have an e-mail address listed in the school’s database. Later on in the paper we’ll discuss what we think are some of the implications of a chart like this. Here we just want to be sure you understand what the chart is conveying. For example, 43.0% of alums who graduated between 1926 and 1958 (Decile 1) have an e-mail listed in the school’s database. How about Decile 9, alums who graduated between 2001 and 2005? If you came up with 86.5%, we’ve been clear.
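
For readers who want to build a chart like this from their own data, here is a rough sketch of the two steps involved: splitting alumni into ten roughly equal groups by class year, then computing the share of each group with an e-mail address on file. The records below are generated purely for illustration and do not reproduce any of the four schools' data. Note also that this simple rank-based split can place alums from the same class year in adjacent deciles; how to handle such ties is left as an assumption.

```python
def class_year_deciles(alumni):
    """Split alumni into ten roughly equal groups, Decile 1 oldest, Decile 10 youngest."""
    ordered = sorted(alumni, key=lambda a: a["class_year"])
    n = len(ordered)
    deciles = {d: [] for d in range(1, 11)}
    for rank, alum in enumerate(ordered):
        deciles[rank * 10 // n + 1].append(alum)
    return deciles


def email_coverage(deciles):
    """For each non-empty decile: (min class year, max class year, % with e-mail)."""
    summary = {}
    for decile, group in deciles.items():
        if not group:
            continue
        years = [a["class_year"] for a in group]
        with_email = sum(1 for a in group if a["has_email"])
        summary[decile] = (min(years), max(years), 100 * with_email / len(group))
    return summary


# Hypothetical file of 20 alumni: the older half has patchy e-mail coverage.
alumni = [
    {"class_year": 1950 + 3 * i, "has_email": (i % 2 == 1) if i < 10 else True}
    for i in range(20)
]

coverage = email_coverage(class_year_deciles(alumni))
for decile, (first, last, pct) in coverage.items():
    print(f"Decile {decile:2d} ({first}-{last}): {pct:5.1f}% with e-mail")
```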

Go ahead and browse through Tables 2-4 and Figures 2-4. After you’ve done that, we’ll tell you what we think is one of the implications of what you’ve seen so far.

Table 2: Count, Median Class Year, and Minimum and Maximum Class Years for All Alums Divided into Deciles for School B

Table 3: Count, Median Class Year, and Minimum and Maximum Class Years for All Alums Divided into Deciles for School C

Table 4: Count, Median Class Year, and Minimum and Maximum Class Years for All Alums Divided into Deciles for School D

The most significant implication we can draw from what we’ve shown you so far is this: If any of these four schools were to conduct a web survey by only contacting alums with an e-mail address, they would simply not reach large numbers of alums whose opinions they are probably interested in gathering. Some specifics:

  • School A: They would miss huge numbers of older alums who graduated in 1974 and earlier. By rough count over 40% of these folks would not be reached. That’s a lot of senior folks who are still alive and kicking and probably have pronounced views about a number of issues contained in the survey.
  • School B: A look at Figure 2 tells us that even considering doing a web survey for School B is probably not a great idea. Fewer than 20% of their alums who graduated in 1998 or earlier have an e-mail address listed in their database.

Another way of expressing this implication is that each school (regardless of what their response rates were) would largely be tapping the opinions of younger alums, not older or even middle-aged alums. If that’s what a school really wants to do, okay. But we strongly suspect that’s not what it wants to do.

Now let’s look at something else that concerns us about doing web surveys if potential respondents are only contacted by e-mail: Giving. Figures 5-8 show the percentage of alums who have given $100 or more lifetime by e-mail address/no-email address across class year deciles.

As we did with Figure 1, let’s go over Figure 5 to make sure it’s clear. For example, in decile 1 (oldest alums) 87% of alumni with an e-mail address have given $100 or more lifetime to the school. Alums in the same decile who do not have an e-mail address? 71% of these alums have given $100 or more lifetime to the school. How about decile 10, the youngest group? What are the corresponding percentages of giving for those alums with and without an e-mail address? If you came up with 14% versus 6%, we’ve been clear.
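
The calculation behind a chart like Figure 5 is a simple two-way breakdown. The sketch below assumes each record already carries a class year decile, an e-mail flag, and a lifetime giving total; the field names and the handful of sample records are invented, and the percentages they produce are not the schools' figures.

```python
from collections import defaultdict


def pct_donors_by_email(alumni, threshold=100):
    """Percent of alumni at or above the giving threshold, keyed by (decile, has_email)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [donors at threshold, total alumni]
    for alum in alumni:
        key = (alum["decile"], alum["has_email"])
        counts[key][1] += 1
        if alum["lifetime_giving"] >= threshold:
            counts[key][0] += 1
    return {key: 100 * donors / total for key, (donors, total) in counts.items()}


# Hypothetical records for the oldest and youngest deciles only.
alumni = [
    {"decile": 1, "has_email": True, "lifetime_giving": 2500},
    {"decile": 1, "has_email": True, "lifetime_giving": 150},
    {"decile": 1, "has_email": False, "lifetime_giving": 400},
    {"decile": 1, "has_email": False, "lifetime_giving": 20},
    {"decile": 10, "has_email": True, "lifetime_giving": 100},
    {"decile": 10, "has_email": True, "lifetime_giving": 0},
    {"decile": 10, "has_email": False, "lifetime_giving": 10},
    {"decile": 10, "has_email": False, "lifetime_giving": 0},
]

shares = pct_donors_by_email(alumni)
for (decile, has_email), pct in sorted(shares.items()):
    label = "e-mail" if has_email else "no e-mail"
    print(f"Decile {decile:2d}, {label:9s}: {pct:5.1f}% have given $100 or more")
```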

Take a look at Figures 6-8, for schools B, C and D. Then we’ll tell you the second implication we see in all these data.

The overall impression we get from these four figures is clear: Alumni who do not have an e-mail address listed give considerably less money to their schools than do alumni with an e-mail address listed. This difference can be particularly pronounced among older alums.

Some Conclusions

The title of this piece is: “Are We Missing Too Many Alumni with Web Surveys?” Based on the data we’ve looked at, we think the answer to this question has to be a “yes.” It can’t be a good thing that many web surveys don’t go out to so many older alums who don’t have an e-mail address, and to alums without an e-mail address who haven’t given as much (on average) as those with an e-mail address.

On the other hand, we want to stress that web surveys can provide a huge amount of valuable information from the alums who are reached and do respond. Even if the coverage of the whole alumni universe is incomplete, the thousands of alums who take the time to fill out these surveys can’t be ignored.

Here’s an example. We got to reading through the hundreds and hundreds of written comments from a recent alumni survey. We haven’t included any of the comments here, but my (Peter’s) reaction to the comments was visceral. Wading through all the typos, and misspellings, and fractured syntax, I found myself cheering these folks on:

  •  “Good for you.”
  • “Damn right.”
  • “Couldn’t have said it better myself.”
  • “I wish the advancement and alumni people at my college could read these.”

In total, these comments added up to almost 50,000 words of text, the length of a short novel. And they were a lot more interesting than the words in too many of the novels I read.

As always, we welcome your comments.

19 March 2012

Symposium on Data Analytics is a must-attend

If you’re interested in working with data for the benefit of a non-profit organization or for education institutional advancement, then you must make room in your calendar for the APRA Symposium on Data Analytics.

Kate Chamberlin of Memorial Sloan-Kettering Cancer Center recently posted the listserv message below, which I am quoting in its entirety, with her blessing. Kate is Chair of this year’s Symposium, being held this summer in Minneapolis. I’ve attended a few of these symposiums (and presented at one), and I can tell you that they’re great. This is a conference where you can really learn, and meet the people who are doing cool stuff with data for their institutions and organizations.

Of particular interest are the Case Study sessions, which are brief (20-minute) presentations of analytics projects that your colleagues at other institutions have carried out. If you’ve worked on such a project, consider sharing! Contact information is included below.

Here’s Kate’s message:

Hello everyone!

Many of you may have noticed the fifth annual APRA Symposium on Data Analytics is definitely happening again this summer in conjunction with APRA’s International Conference in Minneapolis!  The dates are Wednesday and Thursday, August 1st and 2nd — some additional information is available here: http://www.aprahome.org/p/cm/ld/fid=72.

We don’t have the full schedule yet, but hopefully will within a week or so.  In the meantime, let me give you some preliminary details:

Wednesday morning the conference will open with a keynote from Rob Scott at MIT, who was instrumental in founding the Symposium, and has a bird’s-eye view of the history of analytics in fundraising, from the perspective of research, IT, front-line fundraising, and fundraising management.  Thursday morning, we will have the opportunity to join the larger conference to hear Penelope Burke, President of Cygnus Applied Research Inc., on Donor-Centered Fundraising.  http://www.aprahome.org/p/cm/ld/&fid=73

The fundamental track is intended as a two day introduction to analytics in fundraising, with the goal of giving participants a solid road map to approach their first project.  Topics will include: Various Variables: Data Preparation and Management for Successful Analytics, Walkthrough: Understanding the Problem and the Resources, Key Questions in Project Management, and Implementation.  Presenters will include Chuck McClenon at the University of Texas, James Cheng at Dana Farber Cancer Institute, Audrey Geoffroy at the University of Florida, and myself.  In addition, six short case studies from a variety of nonprofits will be presented in the fundamental track.

In the intermediate/advanced track, we will continue the focus on case study with nine short project presentations.  We will also have a presentation from Jeff Shuck of Event 360, who applies predictive modeling and segmentation to fundraising events and peer-to-peer fundraising programs.  Marianne Pelletier of Cornell University and Josh Birkholz of Bentz Whaley Flessner will present on constituent engagement.  Chuck McClenon of the University of Texas will lead a panel of practitioners to discuss the intricacies of collaborating with development IT.

Finally, we will have our usual faculty/committee panel to close the Symposium.  We will be asking our faculty, committee members, and a few guests to tell us about the one best idea they’ve heard recently in the area of development analytics, and follow up with a free-wheeling conversation including these ideas and any and all questions from the floor.

Last year we experimented with a case study format that gave us the opportunity to hear many of our colleagues present on projects they are working on at their institutions.  As you see above, with a few tweaks, we are continuing to set aside some time for case study this year.  If you’re planning to attend, I’m hoping some of you might have a project you’d be interested in presenting?  You will have 20 minutes to present a project of your choice and take a few questions.  Emma Hinke at Johns Hopkins has kindly agreed to handle the logistics of case studies for me, so if you have an idea, or would just like more information on the case studies, please be in touch with Emma at ehinke2@jhu.edu.  If we have a great flood of ideas, we may not be able to pack them all in, but wouldn’t that be a great problem to have?  Please send us your thoughts, and if we can’t manage them all this year, we’ll start a list for next year.

I do hope you will consider joining us — it’s the variety of attendees that makes the Symposium great.  I’ll let you know when we have the full schedule up on the Symposium web site.

Many thanks,

Kate Chamberlin
Chair, APRA Symposium on Data Analytics
Campaign Strategic Research Director, Memorial Sloan-Kettering Cancer Center
