CoolData blog

1 February 2016

Regular-season passing yardage and the NFL playoffs

Filed under: Analytics, Fun, John Sammis, Off on a tangent, Peter Wylie — Tags: , , , , — kevinmacdonell @ 7:37 pm

Guest post by Peter B. Wylie, with John Sammis

 

How much is regular-season passing yardage related to success in the NFL playoffs? (Click link to download .PDF: Passing yardage in the NFL.)

 

Peter was really interested in finding out how strong the relationship might be between an NFL team’s passing during the regular season and its performance in the playoffs. There’s been plenty of talk about this relationship, but he wanted to see for himself.

 

A bit of a departure for CoolData, but still all about data and analysis … hope you enjoy!

 

Advertisements

2 December 2013

How to learn data analysis: Focus on the business

Filed under: Training / Professional Development — Tags: , , , — kevinmacdonell @ 6:17 am

A few months ago I received an email from a prospect researcher working for a prominent theatre company. He wanted to learn how to do data mining and some basic predictive modeling, and asked me to suggest resources, courses, or people he could contact. 

I didn’t respond to his email for several days. I didn’t really have that much to tell him — he had covered so many of the bases already. He’d read the  book “Data Mining for Fund Raisers,”  by Peter Wylie, as well as “Fundraising Analytics: Using Data to Guide Strategy,” by Joshua Birkholz. He follows this blog, and he keeps up with postings on the Prospect-DMM list. He had dug up and read articles on the topic in the newsletter published by his professional association (APRA). And he’d even taken two statistics course — those were a long time ago, but he had retained a basic understanding of the terms and concepts used in modeling.

He was already better prepared than I was when I started learning predictive modeling in earnest. But as it happened, I had a blog post in draft form (one of many — most never see the light of day) which was loosely about what elements a person needs to become a data analyst. I quoted a version of this paragraph in my response to him:

There are three required elements for pursuing data analysis. The first and most important is curiosity, and finding joy in discovery. The second is being shown how to do things, or having the initiative to find out how to do things. The third is a business need for the work.

My correspondent had the first element covered. As for the second element, I suggested to him that he was more than ready to obtain one-on-one training. All that was missing was defining the business need … that urgent question or problem that data analysis is suited for.

Any analysis project begins with formulating the right question. But that’s also an effective way to begin learning how to do data analysis in the first place. Knowing what your goal is brings relevance, urgency and focus to the activity of learning.

Reflect on your own learning experiences over the years: Your schooling, courses you’ve taken, books and manuals you’ve worked your way through. More than likely, this third element was mostly absent. When we were young, perhaps relevance was not the most important thing: We just had to absorb some foundational concepts, and that was that. Education can be tough, because there is no satisfying answer to the question, “What is the point of learning this?” The point might be real enough, but its reality belongs to a seemingly distant future.

Now that we’re older, learning is a completely different game, in good ways and bad. On the bad side, daily demands and mundane tasks squeeze out most opportunities for learning. Getting something done seems so much more concrete than developing our potential. 

On the good side, now we have all kinds of purposes! We know what the point is. The problems we need to solve are not the contrived and abstract examples we encountered in textbooks. They are real and up close: We need to engage alumni, we need to raise more money, we need, we need, we need.

The key, then, is to harness your learning to one or more of these business needs. Formulate an urgent question, and engage in the struggle to answer it using data. Observe what happens then … Suddenly professional development isn’t such an open-ended activity that is easily put off by other things. When you ask for help, your questions are now specific and concrete, which is the best way to generate response on forums such as Prospect-DMM. When you turn to a book or an internet search, you’re looking for just one thing, not a general understanding.

You aren’t trying to learn it all. You’re just taking the next step toward answering your question. Acquiring skills and knowledge will be a natural byproduct of what should be a stimulating challenge. It’s the only way to learn.

 

26 January 2012

More mistakes I’ve made

Filed under: Best practices, Peter Wylie, Pitfalls, Validation — Tags: , , , — kevinmacdonell @ 1:38 pm

A while back I wrote a couple of posts about mistakes I’ve made in data mining and predictive modelling. (See Four mistakes I have made and When your predictive model sucks.) Today I’m pleased to point out a brand new one.

The last days of work leading up to Christmas had me evaluating my new-donor acquisition models to see how well they’ve been working. Unfortunately, they were not working well. I had hoped — I had expected — to see newly-acquired donors clustered in the upper ranges of the decile scores I had created. Instead they were scattered all along the whole range. A solicitation conducted at random would have performed nearly as well.

Our mailing was restricted by score (roughly the top two deciles only), but our phone solicitation was more broad, so donors came from the whole range of deciles:

Very disappointing. To tell the truth, I had seen this before: A model that does well predicting overall participation, but which fails to identify which non-donors are most likely to convert. I am well past the point of being impressed by a model that tells me what everyone already knows, i.e. that loyal donors are most likely to give again. I want to have confidence that acquisition mail dollars are spent wisely.

So it was back to the drawing board. I considered whether my model was suffering from overfit, whether perhaps I had too many variables, too much random noise, multicolinearity. I studied and rejected one possibility after another. After so much effort, I came rather close to concluding that new-donor acquisition is not just difficult — it might be darn near impossible.

Dire possibility indeed. If you can’t predict conversion, then why bother with any of this?

It was during a phone conversation with Peter Wylie that things suddenly became clear. He asked me one question: How did I define my dependent variable? I checked, and found that my DV was named “Recent Donors.” That’s all it took to find where I had gone wrong.

As the name of the DV suggested, it turned out that the model was trained on a binary variable that flagged anyone who had made a gift in the past two years. The problem was that included everybody: long-time donors and newly-acquired donors alike. The model was highly influenced by the regular donors, and the new donors were lost in the shuffle.

It was a classic case of failing to properly define the question. If my goal was to identify the patterns and characteristics of newly-acquired donors, then I should have limited my DV strictly to non-donors who had recently converted to donors!

So I rebuilt the model, using the same data file and variables I had used to build the original model. This time, however, I pared the sample down to alumni who had never given a cent before fiscal 2009. They were the only alumni I needed to have scores for. Then I redefined my dependent variable so that non-donors who converted, i.e., who made a gift in either fiscal 2009 or 2010, were coded ‘1’, and all others were coded ‘0’. (I used two years of giving data instead of just one in order to have a little more data available for defining the DV.) Finally, I output a new set of decile scores from a binary logistic regression.

A test of the new scores showed that the new model was a vast improvement over the original. How did I test this? Recall that I reused the same data file from the original model. Therefore, it contained no giving data from the current fiscal year; the model was innocent of any knowledge of the future. Compare this breakdown of new donors with the one above:

Much better. Not fan-flippin-tastic, but better.

My error was a basic one — I’ve even cautioned about it in previous posts. Maybe I’m stupid, or maybe I’m just human. But like anyone who works with data, I can figure out when I’m wrong. That’s a huge advantage.

  • Be skeptical about the quality of your work.
  • Evaluate the results of your decisions.
  • Admit your mistakes.
  • Document your mistakes and learn from them.
  • Stay humble.

Create a free website or blog at WordPress.com.