CoolData blog

31 May 2014

Presenting at a conference: Why the pain is totally worth it

One morning some years ago, when I was a prospect researcher, I was sitting at my desk when I felt a stab of pain in my back. I’d never had serious back pain before, but this felt like a very strong muscle spasm, low down and to one side. I stood up and stretched a bit, hoping it would go away. It got worse — a lot worse.

I stepped out into the hallway, rigid with pain. Down the hall, standing by the photocopier waiting for her job to finish, was Bernardine. She had a perceptive eye for stuff, especially medical stuff. She glanced in my direction and said, “Kidney stone.”

An hour later I was lying on a hospital gurney getting a Toradol injection and waiting for an X-ray. It was indeed a kidney stone, and not a small one.

This post is not about my kidney stone. But it is a little bit about Bernardine. Like I said, she knew stuff. She diagnosed my condition from 40 feet away, and she was also the first person to suggest that I should present at a conference.

At that time, there were few notions that struck terror in my heart like the idea of talking in front of a roomful of people. I thought she was nuts. ME? No! I’d rather have another kidney stone.

But Bernardine had also given me my first copy of Peter Wylie’s little blue book, “Data Mining for Fund Raisers.” With that, and the subsequent training I had in data mining, I was hooked, and she knew it. Eventually, my absorption with the topic and my enthusiasm to talk about it triumphed over my doubts. I had something I really wanted to tell people about, and the fear was something I needed to manage. Which I did.

To date I’ve done maybe nine or ten conference presentations. I am not a seasoned presenter, nor has public speaking become one of my strengths. But I do know this: Presenting stuff to my counterparts at other institutions has proven one of the best ways to understand what it is I’m doing. These were the few times I got to step back and grasp not only the “how” of my work, but the “why”.

This is why I recommend it to you. The effort of explaining a project you’ve worked on to a roomful of people you’re meeting for the first time HAS to force some deeper reflection than you’re used to. Never moving beyond the company of your co-workers means you’re always swimming in the same waters of unspoken assumptions. Creating a presentation forces you to step outside the fishbowl, to see things from the perspective of someone you don’t know. That’s powerful.

Yes, preparing a presentation is a lot of work, if you care about it enough. But presenting can change your relationship with your job and career, and through that it can change your life. It changed mine. Blogging also changed my life, and I think a lot more people should be blogging too. (A post for another day.) Speaking and writing have rewarded me with an interesting career and professional friendships with people far and wide. These opportunities are not for the exceptional few; they are open to everyone.

I mentioned earlier that Bernardine introduced me to Peter Wylie’s book. Back then I could never have predicted that one day he and I would co-author another book. But there it is. It gave me great pleasure to give credit to Bernardine in the acknowledgements; I put a copy in the mail to her just this week. (I also give credit to my former boss, Iain. He was the one who drove me to the hospital on the day of the kidney stone. That’s not why he’s in the acknowledgements, FYI.)

Back to presenting … Peter and I co-presented a workshop on data mining for prospect researchers at the APRA-Canada conference in Toronto in 2010. I’m very much looking forward to co-presenting with him again this coming October in Chicago. (APRA-Illinois Data Analytics Fall Conference … Josh Birkholz will also present, so I encourage you to consider attending.)

Today, playing the role of a Bernardine, I am thinking of who I ought to encourage to present at a conference. I have at least one person in mind, who has worked long and hard on a project that I know people will want to hear about. I also know that the very idea would make her vomit on her keyboard.

But I’ve been there, and I know she will be just fine.

11 February 2014

Teach them to fish: A view from IT

Filed under: IT, Training / Professional Development - kevinmacdonell @ 5:52 am

Guest post by Dwight Fischer, Assistant Vice President – CIO, Information Technology Services (ITS), Dalhousie University

(When I read this post by our university’s CIO on his internal blog, I thought “right on.” It’s not about predictive modelling, and CoolData is not about IT. But this message about taking responsibility for acquiring new skills hit the right note for me. Follow Dwight on Twitter at @cioDalhousieU – Kevin)

I recently recommended OneNote to a colleague. OneNote is a venerable note-taking and organizational tool that is part of the Microsoft Office suite. I spoke to the merits of the application and how useful and versatile a tool it is, particularly now that it is fully integrated with mobile devices through the cloud. I suggested that she look online and find some resources on how to use it.

Busy as she is, she asked her administrative assistant to look up how to use OneNote, and the assistant in turn called the Help Desk looking for support. The Help Desk staff need to know a lot, but deeper-level support for a particular piece of software is not something they are able to provide. Unless they were to use the software on a day-in, day-out basis, how could they? As it was, the caller did not get the support she expected.

If that individual had instead gone to Google (or Bing, Yahoo, YouTube, whatever) and asked the question, she would have received a torrent of information. All she needs to understand is how to phrase the question.

  • “Tips on using OneNote”
  • “OneNote quick Tutorial”
  • “Help with OneNote”

It occurs to me that we have provided support to our clients for so long, they have developed an unhealthy dependence on IT staff to resolve all their issues. Meanwhile, the internet has developed a hoard of information and with it, many talented individuals who simply like to share their knowledge. Is it all good information? Not always, but if you just do a little searching and modify your search terms, you’ll certainly find relevant information. Oftentimes you’ll find some serendipitous learning as well.

We need to help our clients make this shift. Instead of answering their questions, coach them on how to ask questions in search engines. Give them a fish and they’ll eat for a day. Teach them to fish and they’ll eat heartily. And save the more unique technology questions for us.

P.S. I used to go to the bike store for repairs. I could do a lot of work on my bikes, but there were some things I just couldn’t do. But with a small fleet, that was getting expensive. I started looking up bike repair issues on YouTube and lo and behold, it’s all right there. I might have bought a tool or two, but I can fix darn near anything on the bikes. It just takes some patience and learning. There are some very talented bike mechanics who put out some excellent videos.

2 December 2013

How to learn data analysis: Focus on the business

Filed under: Training / Professional Development - kevinmacdonell @ 6:17 am

A few months ago I received an email from a prospect researcher working for a prominent theatre company. He wanted to learn how to do data mining and some basic predictive modeling, and asked me to suggest resources, courses, or people he could contact. 

I didn’t respond to his email for several days. I didn’t really have that much to tell him: he had covered so many of the bases already. He’d read the book “Data Mining for Fund Raisers,” by Peter Wylie, as well as “Fundraising Analytics: Using Data to Guide Strategy,” by Joshua Birkholz. He follows this blog, and he keeps up with postings on the Prospect-DMM list. He had dug up and read articles on the topic in the newsletter published by his professional association (APRA). And he’d even taken two statistics courses; those were a long time ago, but he had retained a basic understanding of the terms and concepts used in modeling.

He was already better prepared than I was when I started learning predictive modeling in earnest. But as it happened, I had a blog post in draft form (one of many — most never see the light of day) which was loosely about what elements a person needs to become a data analyst. I quoted a version of this paragraph in my response to him:

There are three required elements for pursuing data analysis. The first and most important is curiosity, and finding joy in discovery. The second is being shown how to do things, or having the initiative to find out how to do things. The third is a business need for the work.

My correspondent had the first element covered. As for the second element, I suggested to him that he was more than ready to obtain one-on-one training. All that was missing was defining the business need … that urgent question or problem that data analysis is suited for.

Any analysis project begins with formulating the right question. But that’s also an effective way to begin learning how to do data analysis in the first place. Knowing what your goal is brings relevance, urgency and focus to the activity of learning.

Reflect on your own learning experiences over the years: Your schooling, courses you’ve taken, books and manuals you’ve worked your way through. More than likely, this third element was mostly absent. When we were young, perhaps relevance was not the most important thing: We just had to absorb some foundational concepts, and that was that. Education can be tough, because there is no satisfying answer to the question, “What is the point of learning this?” The point might be real enough, but its reality belongs to a seemingly distant future.

Now that we’re older, learning is a completely different game, in good ways and bad. On the bad side, daily demands and mundane tasks squeeze out most opportunities for learning. Getting something done seems so much more concrete than developing our potential. 

On the good side, now we have all kinds of purposes! We know what the point is. The problems we need to solve are not the contrived and abstract examples we encountered in textbooks. They are real and up close: We need to engage alumni, we need to raise more money, we need, we need, we need.

The key, then, is to harness your learning to one or more of these business needs. Formulate an urgent question, and engage in the struggle to answer it using data. Observe what happens then … Suddenly professional development isn’t such an open-ended activity that is easily put off by other things. When you ask for help, your questions are now specific and concrete, which is the best way to generate response on forums such as Prospect-DMM. When you turn to a book or an internet search, you’re looking for just one thing, not a general understanding.

You aren’t trying to learn it all. You’re just taking the next step toward answering your question. Acquiring skills and knowledge will be a natural byproduct of what should be a stimulating challenge. It’s the only way to learn.


30 July 2013

Getting bitten by Python

When I was first learning to build predictive models, preparing the data was part of the adventure. In time, though, many operations on the data became standard instead of exploratory. Eventually they became simply repetitive and tedious. When any task becomes repetitive, I think of ways to automate it. Given that data prep makes up 80 percent of the work of building a model (according to some authors), the benefits of automation are obvious.

I can think of only two ways to replicate the manual operations you need to perform on a large data set to make it ready for modelling: Use software specially designed for the task, or code your own data-handling scripts. I am lazy and drawn to software solutions that make hard things easy, and I’m not a programmer. Yet I have veered away from a ready-made software solution to pursue an interest in the scripting language called Python, and in particular the Python code library called pandas, written specifically for working with data.

Maybe it’s because Python is open-source and free, or because it is powerful, or because it is flexible and widely adaptable to multiple uses on the job. I don’t know. But for the past few months I’ve been obsessed with learning to use it, and that’s what I’d like to talk about today.

I’m guessing very few CoolData readers have experience writing scripts for handling data. I know some people who do most of their stats work in the R language or manipulate data in Excel using VBA. But the majority of readers probably consider themselves severely allergic to coding of any kind. I concede that it isn’t for everyone, but look: Just as we don’t need to be professional statisticians to use statistical tools to create value for the business, we don’t need to be computer scientists to write useful scripts that can free up large chunks of time we now spend on routine tasks that bore us.

(If you work with someone in IT or Advancement Services who pulls and reshapes your data for you, they might be especially interested in the idea of learning how to automate your requests. They might also be familiar with Python already.)

I should say here that my aim is not to automate predictive modelling itself. There are Python modules for modelling, too, from the venerable classics such as regression to the latest advanced techniques. But I’m not so much interested in them, not yet at least. Building predictive models is best done hands-on, guided by a human modeler’s expertise and domain knowledge. My main interest is in eliminating a big chunk of the standard rote work so that I can apply the freshest version of myself to the more interesting and creative elements of data exploration and model creation.

So what is Python (and more specifically, pandas) good for?

  • A script or program can execute a series of database queries and join the results in exactly the way you want, allowing you to build very complex structures and incorporate custom aggregations that might be harder to do using your existing querying/reporting tools. For example, let’s say you want to build a file of donors and include columns for date of first and last gift, amount of highest gift, total cash gifts for the past five fiscal years, and percentage of total giving devoted to student financial assistance. Unless IT has built some advanced views for you from the base tables in your database, many of these variables will require applying some calculations to the raw transactional data. I could certainly build a query to get the results for this modest example, but it would involve a few sub-queries and calculated fields. Multiply that by a hundred and you’ve got an idea of how complex a query you’d have to build to deliver a modelling-ready data set. In fact it may be technically impossible, or at least difficult, to build such a single massive query. In Python, however, you can build your data file up in an orderly series of steps. Adding, removing or editing those steps is not a big deal.
  • Python also makes it simple to read data from .csv and Excel files, and merge it painlessly with the data you’ve extracted from your database. This is important to me because not all of my modelling data comes from our database. I’ve got eight years of call centre data results by alumni ID, wealth-related census data by Canadian postal code, capacity data by American ZIP code, and other standalone data sets. Adding these variables to my file used to be a tedious, manual process. In Python, left-joining 20 columns of census data to a file of 100,000 alumni records using Postal Code as the join key takes a single line of code and executes faster than a knight can say “Ni!” (Inside Python joke.)
  • Many other common operations also take only one or two lines of code, including conversion of categorical variables to 0/1 dummy variables, performing transformations and mathematical operations on variables, filling in or imputing missing data with constants or calculated values, pivoting data, and creating new variables from existing ones via concatenation (for strings) or math (for numbers).
  • With a script, you can also iterate over the rows of a data file and perform different operations based on conditional statements.
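To make the list above a little more concrete, here is a minimal sketch of those operations in pandas. Everything in it is made up for illustration: the table and column names (gifts, addresses, census, fund, median_income and so on), the sample values, and the 500-dollar threshold are assumptions, not anything from a real database. It also simplifies the donor example by totalling all giving rather than the past five fiscal years.

```python
import pandas as pd

# Hypothetical gift transactions; in practice these rows would come from a database query.
gifts = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2, 3],
    "gift_date": pd.to_datetime(
        ["2009-05-01", "2012-11-15", "2010-01-20", "2011-03-02", "2013-06-30", "2012-09-09"]
    ),
    "amount": [50.0, 100.0, 25.0, 500.0, 75.0, 1000.0],
    "fund": ["ATHLETICS", "STUDENT_AID", "STUDENT_AID", "LIBRARY", "STUDENT_AID", "LIBRARY"],
})

# Step 1: one row per donor with first/last gift date, highest gift and total giving.
donors = gifts.groupby("ID").agg(
    first_gift=("gift_date", "min"),
    last_gift=("gift_date", "max"),
    max_gift=("amount", "max"),
    total_giving=("amount", "sum"),
).reset_index()

# Step 2: percentage of total giving devoted to student financial assistance.
aid = (
    gifts[gifts["fund"] == "STUDENT_AID"]
    .groupby("ID")["amount"].sum()
    .rename("aid_giving")
    .reset_index()
)
donors = donors.merge(aid, how="left", on="ID")
donors["aid_giving"] = donors["aid_giving"].fillna(0)
donors["pct_aid"] = 100 * donors["aid_giving"] / donors["total_giving"]

# Step 3: left-join standalone data (hypothetical census figures) by postal code.
addresses = pd.DataFrame({
    "ID": [1, 2, 3],
    "Postal_Code": ["A1A 1A1", "C3C 3C3", "B2B 2B2"],
    "Sex": ["F", "M", "F"],
})
census = pd.DataFrame({
    "Postal_Code": ["A1A 1A1", "B2B 2B2"],
    "median_income": [60000, 80000],
})
donors = donors.merge(addresses, how="left", on="ID")
donors = donors.merge(census, how="left", on="Postal_Code")

# Step 4: impute missing values with a calculated value, then create 0/1 dummy variables.
donors["median_income"] = donors["median_income"].fillna(donors["median_income"].mean())
donors = pd.get_dummies(donors, columns=["Sex"], dtype=int)

# Step 5: iterate over the rows and branch on a condition.
segments = {}
for row in donors.itertuples(index=False):
    if row.max_gift >= 500:
        segments[row.ID] = "major"
    else:
        segments[row.ID] = "annual"

print(donors)
print(segments)
```

Each step is a separate statement, so adding, removing or reordering steps does not disturb the others.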

I’m not going to provide a Python tutorial today (although I’m tempted to do so in the future), but here is a sample line of code from a script, with a description of what it does. This doesn’t give you enough information to do anything useful, but you’ll at least see how compact and powerful the language is.

Skipping some necessary preliminaries, let’s say you’ve just used Python to query your Oracle database to read into memory a data set containing the variables ID, Constituent Category, Sex, and Age for all living constituent persons. (An operation that itself takes little more than two or three lines of code.) Obviously it depends on your database and code structure, but let’s say “Constituent Category” includes codes for such categories as Alumnus/na (ALUM), Non-degreed alumni (ALND), Parent (PRNT), Friend (FRND), Faculty (FCTY), Staff (STAF), and so on. And let’s further assume that a constituent can belong to multiple categories. Most people will have only one code, but it’s possible that a person can simultaneously be an alum, a parent, and a faculty member.
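As a rough illustration of the "two or three lines" mentioned above, the sketch below reads a query result straight into a pandas DataFrame. It is not the setup described in this post: an in-memory SQLite database from Python's standard library stands in for Oracle, and the table name (constituents), the column names and the rows are all invented. With a real Oracle database you would need an Oracle driver and your own connection details.

```python
import sqlite3
import pandas as pd

# Stand-in for the real database: an in-memory SQLite table with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE constituents (ID INTEGER, Con_Code TEXT, Sex TEXT, Age INTEGER)")
conn.executemany(
    "INSERT INTO constituents VALUES (?, ?, ?, ?)",
    [(1, "ALUM", "M", 45), (1, "PRNT", "M", 45), (2, "FRND", "F", 60)],
)

# The part that corresponds to the two or three lines: run a query, get a DataFrame back.
query = "SELECT ID, Con_Code, Sex, Age FROM constituents"
df = pd.read_sql_query(query, conn)
conn.close()

print(df)
```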

In our script, the data is read into a structure called a DataFrame (a tool provided by the pandas code library). This should sound familiar to users of R in particular. For the rest of us, a DataFrame is very much like a database table, with named columns and numbered (“indexed”) rows. Had we pasted the data into Excel instead, it might look like this:


Right away we see that William and Janet are represented by multiple rows because they have multiple constituent codes. This won’t work for predictive modelling, which requires that we have just one row per individual – otherwise certain individuals would carry more weight in the model than they should. You could say that multiple records for Janet mean that 60-year-old females are over-represented in the data. We could delete the extra rows, but we don’t want to do that because we’d be throwing away information that is almost certainly predictive of our modelling target, e.g. likelihood to make a donation.

In order to keep this information while avoiding duplicate IDs, we need to pivot the data so that each category of Constituent Code (ALUM, PRNT, etc.) becomes its own column. The result we want would look like this in Excel:


The Con_Code column is gone, and replaced with a series of columns, each a former category of Con_Code. In each column is either a 0 or 1, a “dummy variable” indicating whether an individual belongs to that constituency or not.

Getting the data from the first state to the final state requires just three lines of code in Python/pandas:

df = pd.merge(df, pd.crosstab(df.ID, df.Con_Code), how='left', left_on='ID', right_index=True)

df = df.drop(['Con_Code'], axis=1)

df = df.drop_duplicates()

This snippet of code may look invitingly simple or simply terrifying – it depends on your background. Whatever – it doesn’t matter, because my point is only that these three lines are very short, requiring very little typing, yet they elegantly handle a common data prep task that I have spent many hours performing manually.

Here’s a brief summary of what each line does:

Line 1: There’s a lot going on here … First, “df” is just the name of the DataFrame object. I could have called it anything. On the right-hand side, you see “pd” (which is shorthand for pandas, the module of code that is doing the work), then “crosstab” (a function that performs the actual pivot). In the parentheses after pd.crosstab, we have specified the two columns to use in the pivot: df.ID is the data we want for the rows, and df.Con_Code is the column of categories that we want to expand into as many columns as there are categories. You don’t have to know in advance how many categories exist in your data, or what they are – Python just does it.

The pd.crosstab call creates a new table containing only ID and all the new columns. That entity (or “object”) is just sitting out there, invisible, in your computer’s memory. We need to join it back to our original data set so that it is reunited with Age, Sex and whatever other stuff you’ve got. That’s what “pd.merge” does. Again, “pd” is just referencing the pandas module that is providing the “merge” function. The operation is called “merge,” but it’s much the same thing as an SQL-type join, familiar to anyone who queries a database. The merge takes two inputs, our original DataFrame (“df”), and the result from the crosstab operation that I described above. The argument called “how” specifies that we want to perform the equivalent of a left-join. A couple of other optional arguments explicitly tell Python which column to use as a join key (‘ID’).

The crosstab operation is enclosed within the merge operation. I could have separated these into multiple lines, which would have been less confusing, but my point is not to teach Python but to demonstrate how much you can accomplish with a trivial amount of typing. (Or copying-and-pasting, which works too!)

We’re not quite done, though. Our merged data is still full of duplicate IDs, because the Con_Code column is still present in our original data.

Line 2 deletes (“drops”) the entire column named Con_Code, and reassigns the altered DataFrame to “df” – essentially, replacing the original df with the new df created by the drop operation.

Now that Con_Code is gone, the “extra” rows are not just duplicates by ID, they are perfect duplicates across the entire row – there is nothing left to make two rows with the same ID unique. We are ready for the final step …

Line 3 deletes (or “drops”) every row that is a duplicate of a previous row.
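For readers who would like to try the three steps, here is a self-contained sketch that runs them on a tiny made-up data set, with the crosstab and the merge split into separate statements. The IDs, names, ages and the person called Mary are invented for illustration; only the column idea (ID, Con_Code, Sex, Age) and the fact that William and Janet hold several codes come from the example above.

```python
import pandas as pd

# Hypothetical sample data: one row per person per constituent code.
df = pd.DataFrame({
    "ID": [101, 101, 102, 103, 103, 103],
    "Name": ["William", "William", "Mary", "Janet", "Janet", "Janet"],
    "Con_Code": ["ALUM", "PRNT", "FRND", "ALUM", "PRNT", "FCTY"],
    "Sex": ["M", "M", "F", "F", "F", "F"],
    "Age": [48, 48, 35, 60, 60, 60],
})

# Line 1, first half: pivot. One row per ID, one column per code found in the data.
# crosstab counts occurrences, so the values are 0/1 as long as each
# ID/code pair appears only once in the input.
codes = pd.crosstab(df["ID"], df["Con_Code"])

# Line 1, second half: left-join the pivoted columns back onto the original rows by ID.
merged = pd.merge(df, codes, how="left", left_on="ID", right_index=True)

# Line 2: drop the original code column, which leaves exact duplicate rows.
merged = merged.drop(["Con_Code"], axis=1)

# Line 3: keep only the first copy of each duplicated row.
result = merged.drop_duplicates()

print(df)      # the starting state: six rows, repeated IDs
print(result)  # the final state: three rows, one per ID, with a column per code
```

Holding the intermediate results in their own variables (codes, merged) makes it easy to print and inspect each stage while you are still building the script.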

Having accomplished this, another couple of lines near the end of the script (not shown) will write the data row by row into a new .csv file, which you can then import into your stats package of choice. If you had two dozen different constituent codes in your data, your new file will be wider by two dozen columns … all in the blink of an eye, without any need for Excel or any manual manipulation of the data.
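The output step can be sketched in a couple of lines as well. This is one possible approach using the to_csv method that pandas provides, and not necessarily how my own script does it. The data and the file name mentioned in the comment are made up.

```python
import io
import pandas as pd

# Hypothetical finished data set, one row per ID.
df = pd.DataFrame({"ID": [101, 102], "Age": [48, 35], "ALUM": [1, 0], "FRND": [0, 1]})

# Write to an in-memory buffer so this sketch leaves no file behind.
# Passing a path such as "model_data.csv" (a made-up name) would write to disk instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the row numbers
csv_text = buffer.getvalue()

print(csv_text)
```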

Excel is perfectly capable of pivoting data like we see in the example, but for working with very large data sets and seamlessly merging the pivoted data back into the larger data file, I can’t think of a better tool than Python/pandas. As the data set gets bigger and bigger, the more need there is to stop working with it in tools that go to the extra work of DISPLAYING it. I suppose one of the beauties of Excel is that you can see the data as you are working on it. In fact, as I slowly built up my script, I repeatedly opened the .csv file in Excel just to have that visual inspection of the data to see that I was doing the right thing. But I inevitably reached the point at which the file was just too large for Excel to function smoothly. At 120,000 rows and 185 columns in a 90MB file, it was hardly Big Data – Excel could open the file no problem – but it was large enough that I wouldn’t want to do much filtering or messing with formulas.

On a quick first read, the code in the example above may seem impenetrable to a non-programmer (like me), but you don’t need to memorize a lot of functions and methods to write scripts in Python. Combing the Web for examples of what you want to do, using a lot of cut-and-paste, perhaps referring to a good book now and again – that’s all it takes, really.

That said, it does require time and patience. It took me many hours to cobble together my first script. I re-ran it a hundred times before I tracked down all the errors I made. I think it was worth it, though – every working piece of code is a step in the direction of saving untold hours. A script that works for one task often does not require much modification to work for another. (This cartoon says it all: Geeks and repetitive tasks.)

Beyond data preparation for predictive modelling, there are a number of directions I would like to go with Python, some of which I’ve made progress on already:

  • Merging data from multiple sources into data extract files for use in Tableau … With version 8.0 of the software comes the new Tableau API for building .tde files in Python. This was actually my first experiment with Python scripting. Using the TDE module and a combination of database queries and pandas DataFrames, you can achieve a high degree of automation for refreshing the most complex data sets behind your views and dashboards.
  • Exploring other modelling techniques besides my regular mainstay (regression) … I’ve long been intrigued by stuff such as neural networks, Random Forest, and so on, but I’ve been held back by a lack of time as well as some doubt that these techniques offer a significant improvement over what I’m doing now. Python gives ready access to many of these methods, allowing me to indulge my casual interest without investing a great deal of time. I am not a fan of the idea of automated modelling – the analyst should grasp what is going on in that black box. But I don’t see any harm in some quick-and-dirty experimentation, which could lead to solutions for problems I’m not even thinking of yet.
  • Taking advantage of APIs … I’d like to try tapping into whatever social networking sites offer in the way of interfaces, and also programmatically access web services such as geocoding via Google.
  • Working with data sets that are too large for high-level applications such as Excel … I recently tried playing with two days’ worth of downloaded geocoded Twitter data. That’s MILLIONS of rows. You aren’t going to be using Excel for that.
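On the last point, one common way to handle a file that is too big to open comfortably is to read it in chunks and keep only a running summary. The sketch below shows the idea with the chunksize option of pandas' read_csv. It is an illustration only, not a description of how I handled the Twitter data: the column names (tweet_id, lat, lon) and values are invented, and a small in-memory string stands in for the large file.

```python
import io
import pandas as pd

# Stand-in for a large file: a small CSV held in memory (hypothetical columns).
raw = io.StringIO(
    "tweet_id,lat,lon\n"
    + "".join(f"{i},{44.0 + i % 3},{-63.0}\n" for i in range(10))
)

# Read a few rows at a time and keep only a running summary,
# so the whole file never has to sit in memory at once.
rows_seen = 0
lat_total = 0.0
for chunk in pd.read_csv(raw, chunksize=4):
    rows_seen += len(chunk)
    lat_total += chunk["lat"].sum()

mean_lat = lat_total / rows_seen
print(rows_seen, mean_lat)
```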

I hope I’ve been able to transfer to you some of my enthusiasm for the power and potential of Python. I guess now you’ll be wondering how to get started. That’s not an easy question to answer. I could tell you how to download and install Python and an IDE (an integrated development environment, a user interface in which you may choose to write, run, and debug your scripts), but beyond that, so much depends on what you want to do. Python has been extended in a great many directions – pandas for data analysis being just one of them.

However, it wouldn’t hurt to get a feel for how “core Python” works – that is, the central code base of the language along with its data types, object types, and basic operations such as “for” loops. Even before you bother installing anything, go to and try a couple of the simple tutorials there.
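To give a flavour of what “core Python” looks like, here is a tiny sketch using two built-in data types and a “for” loop. The values are made up, and nothing needs to be installed beyond Python itself.

```python
# A list (an ordered collection) and a dictionary (a key-to-value lookup).
gifts = [50, 100, 25]
labels = {"ALUM": "Alumnus/na", "PRNT": "Parent"}

# A "for" loop visits each item in turn.
total = 0
for amount in gifts:
    total += amount

# Looping over a dictionary's keys and values together.
descriptions = []
for code, label in labels.items():
    descriptions.append(code + " = " + label)

print(total)
print(descriptions)
```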

For specific questions Google is your friend, but if you want a reference that covers all the basics in more or less plain English, I like “Learning Python” (4th Edition, but I see there’s a 5th Edition now) by Mark Lutz, published by O’Reilly. Another O’Reilly book, “Python for Data Analysis,” by Wes McKinney, describes how to crunch data with pandas and other related code libraries. (McKinney is the main author of the pandas library.)

I think readers new to programming (like me) will feel some frustration while learning to write their first scripts using any one book or resource. The Lutz book might seem too fine-grained in its survey of the basics for some readers, and McKinney is somewhat terse when offering examples of how various methods work. The problem is not with the books themselves – they’re wonderful. Consider that Python is used in web interfaces, robotics, database programming, gaming, financial markets, GIS, scientific programming, and probably every academic discipline that uses data – you must understand that core texts are perforce very general and abstract. (Think of grammar books for spoken languages.) It’s up to coders themselves to combine the basic building blocks in creative and powerful ways.

That said, after many, many hours spent hopping back and forth between these books, plus online tutorials and Python discussion forums – and just messing around on my own – I have figured out a few useful ways to accomplish some of the more common data preparation tasks that are specific to predictive modelling. Someday I would be happy to share – and, as always, to learn from the experience of others.

27 June 2013

Time management for data analysts

Filed under: Best practices, Training / Professional Development - kevinmacdonell @ 5:32 am

Does it seem you never have enough time to get your work done? You’ve got a long list of projects, more than a few of which are labeled Top Priority — as if multiple projects could simultaneously be “top priority” — along with your own analysis projects which too often get pushed aside. We aren’t going to create more time for ourselves, and there’s only so much we are empowered to say “no” to. So we need a different strategy.

The world does not need another blog post about how to be more productive, or a new system to fiddle with instead of doing real work. However, I’ve learned a few things about how to manage my own time and tasks (I have done my share of reading and fiddling), and perhaps some of what works for me will be helpful to analysts … and to prospect researchers, alumni magazine feature writers, or anyone else whose job requires extended periods of focused work.

First and foremost, I’ve learned that “managing time” isn’t an effective approach. Time isn’t under your control, therefore you can’t manage it. What IS under your control (somewhat) is your attention. If you can manage your attention on a single task for a few stretches of time every day, you will be far more productive. You need to identify unambiguously what it is you should be working on right now from among an array of competing priorities, and you need to be mentally OK with everything you’re not doing, so that you can focus.

My “system” is hardly revolutionary but it is an uncomplicated way to hit a few nails on the head: prioritization and project management, focus and “flow”, motivation, and accountability and activity tracking. Again, it’s not about managing your time, it’s about managing your projects first so that you can choose wisely, and then managing your attention so you can focus on that choice.

Here is an Excel template you can use to get started: Download Projects & Calendar – As promised, it’s nothing special. There are two main elements: One is a simple list of projects, with various ways to prioritize them, and the other is a drop-dead simple calendar with four periods or chunks of time per day, each focused on a single project.

Regarding the first tab: A “project” is anything that involves more than one step and is likely to take longer than 60 minutes to complete. This could include anything from a small analysis that answers a single question, to a big, hairy project that takes months. The latter is probably better chunked into a series of smaller projects, but the important thing is that simple tasks don’t belong here — put those on a to-do list. Whenever a new project emerges — someone asks a complicated question that needs an answer or has a business problem to solve — add it to the projects list, at least as a placeholder so it isn’t forgotten.

You’ll notice that some columns have colour highlighting. I’ll deal with those later. The uncoloured columns are:

Item: The name of the project. It would be helpful if this matched how the project is named elsewhere, such as your electronic or paper file folders or saved-email folders.

Description: Brief summary of what the project is supposed to accomplish, or other information of note.

Area: The unit the project is intended to benefit. (Alumni Office, Donor Relations, Development, etc.)

Requester: If applicable, the person most interested in the project’s results. For my own research tasks, I use “Self”.

Complete By: Sometimes this is a hard deadline; usually it’s wishful thinking. This field is necessary but not very useful in the short term.

Status/Next Action: The very next thing to be done on the project. Aside from the project name itself, this is THE single most important piece of information on the whole sheet. It’s so important, I’m going to discuss it in a new paragraph.

Every project MUST have a Next Action. Every next action should be as specific as possible, even if it seems trivial. Not “Start work on the Planned Giving study,” but rather, “Find my folder of notes from the Planned Giving meeting.” Having a small and well-defined task that can be done right now is a big aid to execution. Compare that to thinking about the project as a whole (a massive, walled fortress without a gate), which just creates anxiety and paralysis. Like the proverbial journey, executing one well-defined step after another gets the job done eventually.

A certain lack of focus might be welcome at the very beginning of an analysis project, when some aimless doodling around with pencil and paper or a few abortive attempts at pulling sample data might help spark some creative ideas. With highly exploratory projects things might be fuzzy for a long time. But sooner or later if a project is going to get done it’s going to have an execution stage, which might not be as much fun as the exploratory stage. Then it’s all about focus. You will need the encouragement of a doable Next Action to pull you along. A project without a next action is just a vague idea.

When a project is first added to the list as a placeholder until more details become available, the next action may be unclear. In that case, the Next Action is to get clarity on the next action, but be specific about how. That means, “Email Jane about what she wants the central questions in the analysis to be,” not “Get clarity.”

(The column is also labeled “Status.” If a project is on hold, that can be indicated here.)

Every Next Action also needs a Next Action Date. This may be your own intended do-by date, an externally-set deadline, or some reasonable amount of time to wait if the task is one you’ve delegated to someone else or you have requested more information. Whatever the case, the Next Action Date is more important than the overall (and mostly fictitious) project completion date. That’s why the Next Action Date is conditionally formatted for easy reference, and the Completion Date is not. The former is specific and actionable; the latter is just a container for multiple next actions and is not itself something that can be “done”. (I will say more about conditional formatting shortly.)
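For anyone who would rather keep the list in code than in Excel, one row of the projects tab might be sketched roughly like this in Python. The class name, field names, and sample values are assumptions made up for illustration and are not part of the template. The only rule it enforces is the one above: no project without a Next Action.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Project:
    """One row of the projects list (field names are illustrative only)."""
    item: str
    description: str
    area: str
    requester: str
    next_action: str
    next_action_date: date
    complete_by: Optional[date] = None  # often wishful thinking, so optional here

    def __post_init__(self) -> None:
        # Every project MUST have a Next Action, however trivial it seems.
        if not self.next_action.strip():
            raise ValueError(f"Project {self.item!r} has no next action")


# Hypothetical sample row; the description and date are invented.
study = Project(
    item="Planned Giving study",
    description="Placeholder summary of what the study should accomplish",
    area="Development",
    requester="Self",
    next_action="Find my folder of notes from the Planned Giving meeting",
    next_action_date=date(2014, 6, 2),
)
```

Making the completion date optional while the next action is mandatory mirrors the relative weight the two fields carry in the sheet.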

When you are done with a project for the day, your last move before going on to something else is to decide on and record what the very next action will be when you return to that project. This will minimize the time you waste in switching from one task to another, and you’ll be better able to just get to work. Not having a clear reentry point for a project has often sidetracked me into procrastinating with busy-work that feels productive but isn’t.

The workbook holds a tab called Completed Projects. When you’re done with a project, you can either delete the row, or add it to this tab. The extra trouble of copying the row over might be worth it if you need to report on activity or produce a list of the last year’s accomplishments. As well, you can bet that some projects that are supposedly complete (but not under your control) will come up again like a meal of bad shellfish. It’s helpful to be able to look up the date you “completed” something, in order to find the files, emails and documentation you created at the time. (By the way, if you don’t document anything, you deserve everything bad that comes to you. Seriously.) If the project was complex, a lot of valuable time can be saved if you can effectively trace your steps and pick up from where you left off.

I mentioned that several columns are conditionally formatted to display varying colour intensities which will allow you to assess priorities at a glance. We’re all familiar with the distinction between “important” and “urgent”. At any time we will have jobs that must get done today but are not important in the long run. Important work, on the other hand, might someday change the whole game yet is rarely “urgent” today. It has a speculative nature to it and it may not be evident why it makes sense to clear the decks for it. This is one reason for trying to set aside some time for speculative, experimental projects — you just never know.

The Priority Rating column is where I try to balance the two (urgent vs. important), using a scale of 1 to 10, with 1 being the top priority. I don’t bother trying to ensure that only one project is a ‘1’, only one is a ‘2’, etc. I rate each project in isolation based on a sense of how in-my-face I feel it has to be, and of course that changes all the time.

Other columns use similar flagging:

Urgent: The project must be worked on now. The cell turns red if the value is “Y”. Although it may seem that everything is urgent, reserve this for emergencies and hard deadlines that are looming. It’s not unusual for me to have something flagged Urgent, yet it has a very low priority rating … which tells you how important I think a lot of “urgent stuff” is.

Percent Complete: A rough estimate of how far along you think you are in a project. The closer to zero, the darker the cell is. Consult these cells on days when you feel it’s time to move the yardsticks on some neglected projects.

Next Action Date: As already mentioned, this is the intended date or deadline for the very next action to be taken to move the project forward. The earlier in time the Next Action Date is, the darker the cell.

Date Added: I’m still considering whether I need this column, so it doesn’t appear in my sample file. This is the date a project made it onto the list. Conditional formatting would highlight the oldest items, which would reveal the projects that have been languishing the longest. If a project has been on your list for six months and it’s 0% done, then it’s not a project — it’s an idea, and it belongs somewhere else rather than cluttering today’s view, which should be all about action. You could move it to an On Hold tab or an external list. Or just delete it. If it’s worth doing, it’ll come back.
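The shading rules above can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: Excel colour scales are normally relative to the other values in the column, whereas the functions below use fixed ranges, and the 30-day horizon and 182-day cutoff are invented parameters standing in for “soon” and “six months”.

```python
from datetime import date


def percent_shade(percent_complete: float) -> float:
    """Return 0.0 (no shading) to 1.0 (darkest); the closer to zero percent, the darker."""
    clamped = min(max(percent_complete, 0.0), 100.0)
    return 1.0 - clamped / 100.0


def date_shade(next_action_date: date, today: date, horizon_days: int = 30) -> float:
    """Earlier next-action dates shade darker; dates beyond the horizon get 0.0."""
    days_left = (next_action_date - today).days
    if days_left <= 0:
        return 1.0  # due today or overdue: darkest
    return max(0.0, 1.0 - days_left / horizon_days)


def is_stale_idea(date_added: date, percent_complete: float, today: date,
                  max_age_days: int = 182) -> bool:
    """True when a project has sat on the list about six months at 0% done."""
    return percent_complete == 0 and (today - date_added).days >= max_age_days
```

A row for which `is_stale_idea` returns `True` is a candidate for the On Hold tab or for deletion, rather than for darker shading.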

Here’s a far-away look at the first tab of my projects list. At a glance you can see how your eye is drawn to projects needing attention, as variously defined by priority, urgency, completeness, and proximity of the next deadline. There is no need to filter or sort rows, although you could do so if you wanted.


The other main element in this workbook is a simple calendar, actually a series of calendars. Each day contains four blocks of time, with breaks in between. You’ll notice that there are no time indications. The time blocks are intended to be roughly 90 minutes, but they can be shorter or longer, depending on how long a period of time you can actually stay focused on a task.

If you’re like me, that period is normally about five minutes, and for that reason we need a bit of gentle discipline. I tell myself that I am about to begin a “sprint” of work. I commit wholly to a single project, and I clear the deck for just that project, based on the knowledge that there is a time limit to how long I will work to the exclusion of all distractions until I can goof off with Twitter or what have you. I have made a bargain with myself: Okay, FINE, I will dive into that THING I’ve been avoiding, but don’t bother me again for a week!

The funny thing is, that project I’ve been avoiding will often begin to engage me after I’ve invested enough time. The best data analysis work happens when you are in a state of “flow,” characterized by total absorption in a challenging task that is a match for your skills. If you have to learn new techniques or skills in order to meet that challenge, the work might actually feel like it is rewarding you with an opportunity to grow a bit.

Flow requires blocks of uninterrupted time. There may not be much you can do about people popping by your work station to chat or to ask for things, but you can control your self-interruptions, which I’ve found are far more disruptive. I’m going to assume you’ve already shut off all alerts for email and instant messaging apps on your computer. I would go a step farther and shut down your email client altogether while you’re working through one of your 90-minute sprints, and silence your phone.

If shutting off email and phone is not a realistic option for you, ask yourself why. If you’re in a highly reactive mode, responding to numerous small requests, then regardless of what your job title is, you may not be an analyst. If the majority of your time is spent looking up stuff, producing lists, and updating and serving reports, then you need to consider an automation project or a better BI infrastructure that will allow you more time for creative, analytical work. Just saying.

On the other hand, I’ve always been irritated by the productivity gurus who say you should avoid checking email at the start of the work day, or limit email checking to only twice a day. This advice cannot apply to anyone working in the real world. Sure, you can lose the first hour of the day getting sucked into email, but a single urgent message from on high can shuffle priorities for the day, and you’d better be aware of it. A good morning strategy would be to first open your projects file, identify what your first time block contains, review the first action to take, and get your materials ready to work. THEN you can quickly consult your email for any disruptive missives (don’t read everything!) before shutting down your client and setting off to do what you set out to do. You don’t necessarily have to tackle your first time block as soon as you sit down; you just need to ensure that you fit two time blocks into your morning.

Other time block tips:

  • While you’re in the midst of your time block, keep a pad of paper handy (or a Notepad file open) to record any stray thoughts about unrelated things, or any new tasks or ideas that occur to you and threaten to derail you. You may end up getting derailed anyway, if the new thing is important or interesting enough, but if not, jotting a note can save you from firing up your email again or making a phone call, and lets you hold the interruption until you’ve reached a better stopping point.
  • Try to exert some control over when meetings are scheduled. For meetings that are an hour or longer, avoid scheduling them so that they knock a hole right in the centre of either the morning or the afternoon, leaving you with blocks of time before and after that are too short to allow you to really get into your project.
  • Keep it fluid, and ignore the boundaries of time blocks when you’re in “flow” and time passes without your being conscious of it. If you’re totally absorbed in a project that you’ve been dreading or avoiding previously, then by all means press on. Just remember to take a break.
  • When you come to the end of a block, take a moment to formulate the next action to take on that project before closing off.
  • If you happen to be called away on something urgent when you’re in the middle of a time block, try to record the next action as a placeholder. Task-switching is expensive, both in time and in mental energy. Always be thinking of leaving a door open, even if the next action seems obvious at the time. You will forget.

I usually fill projects into time blocks only a few days in advance. The extra two weeks are there in case I want to do more long-term planning. The more important the project, the more time blocks it gets, and the more likely I am to schedule it for the first time block in the morning. Note that this tool isn’t used to schedule your meetings — that’s a separate thing and you probably already have something for that. It would be nice if meetings and project focusing could happen in the same view, but to me they are different things.
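As a rough illustration of filling the calendar, here is a minimal Python sketch. Everything in it is an assumption for the sake of example: the function names, the project names, and the simplifying rule that a project gets at most one block per day. It only shows the two ideas above, that more important projects get more blocks and land in the first block of the morning.

```python
from typing import Dict, List, Optional

BLOCKS_PER_DAY = 4  # two in the morning, two in the afternoon

Week = Dict[str, List[Optional[str]]]


def empty_week(days: List[str]) -> Week:
    """Each day gets four untimed blocks; None means the block is still open."""
    return {day: [None] * BLOCKS_PER_DAY for day in days}


def schedule(week: Week, blocks_wanted: Dict[str, int], priority: Dict[str, int]) -> Week:
    """Fill open blocks day by day, taking the top priority (1 = highest) first."""
    remaining = dict(blocks_wanted)
    for slots in week.values():
        # Highest-priority projects go first so they land in the first morning block.
        for project in sorted(remaining, key=lambda name: priority[name]):
            if remaining[project] > 0 and None in slots:
                slots[slots.index(None)] = project
                remaining[project] -= 1
    return week


# Hypothetical projects: the important one asks for three blocks, the other for one.
week = schedule(
    empty_week(["Mon", "Tue", "Wed"]),
    blocks_wanted={"Planned Giving study": 3, "Phonathon report": 1},
    priority={"Planned Giving study": 1, "Phonathon report": 4},
)
```

Meetings are deliberately left out, in keeping with the point that they belong in a separate tool.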

At the end of a week, I move the tab for the current calendar to the end of the row, rename it to show the date range it represents, and replace it with next week’s calendar, renaming and copying tabs as needed to prepare for the week to come. I am not sure if saving old calendars serves a purpose — it might make more sense to total up the estimated number of hours invested in the project that week, keeping a running total by project on the first tab — but like everything this is a work in progress.
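The running-total idea could be sketched along these lines in Python. It is an untested-in-practice illustration, not a feature of the workbook: the 1.5-hour figure simply takes the “roughly 90 minutes” block length at face value, and the function and project names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

HOURS_PER_BLOCK = 1.5  # assumes every block really is about 90 minutes


def hours_by_project(week: Dict[str, List[Optional[str]]]) -> Dict[str, float]:
    """Estimate hours invested per project from one week's filled blocks."""
    blocks = Counter(p for slots in week.values() for p in slots if p is not None)
    return {project: count * HOURS_PER_BLOCK for project, count in blocks.items()}


def add_to_running_total(totals: Dict[str, float],
                         week_hours: Dict[str, float]) -> Dict[str, float]:
    """Return a new running total; neither input is modified."""
    merged = dict(totals)
    for project, hours in week_hours.items():
        merged[project] = merged.get(project, 0.0) + hours
    return merged


# Hypothetical week that has just ended.
last_week = {
    "Mon": ["Planned Giving study", "Phonathon report", None, None],
    "Tue": ["Planned Giving study", None, None, None],
}
week_hours = hours_by_project(last_week)
totals = add_to_running_total({"Planned Giving study": 6.0}, week_hours)
```

With totals kept this way, the old calendar tabs would only need to be saved if the day-by-day detail matters.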

Your Excel file might be saved on a shared drive and made accessible to anyone who needs to know what you’re working on. In that case, I suggest adding a password, one that allows users to open the file for reading, but prevents them from saving any changes.

And finally … this workbook thing is just a suggestion. Use a system or tool that works for you. What I’ve outlined here is partly inspired by books such as David Allen’s “Getting Things Done: The Art of Stress-Free Productivity” (which is also a whole system that goes by the same name), and Mihaly Csikszentmihalyi’s “Flow: The Psychology of Optimal Experience,” as well as a host of blog posts and media stories about creativity and productivity the details of which I’ve long forgotten but which have influenced the way I go about doing work.

Your employer might mandate the use of a particular tool for time and/or project management; use it if you have to, or if it serves your needs. More likely than not, though, it won’t help you manage the most limited resource of all: your attention. Find your own way to marshal that resource, and your time and projects will take care of themselves.

30 November 2012

Analytics conferences: Two problems, two antidotes

A significant issue for gaining data-related skills is finding the right method of sharing knowledge. No doubt conferences are part of the answer. They attract a lot of people with an interest in analytics, whose full-time job is currently non-analytical. That’s great. But I’m afraid that a lot of these people assume that attending a conference is about passively absorbing knowledge doled out by expert speakers. If that’s what you think, then you’re wasting your money, or somebody’s money.

There are two problems here. One is the passive-absorption thing. The other is a certain attitude towards the “expert”. Today I want to describe both problems, and prescribe a couple of conferences related to data and analytics which offer antidotes.

Problem One: “Just Tell Me What To Do”

You know the answer already: Knowledge can’t be passively absorbed. It is created, built up inside you, through engagement with an other (a teacher, a mentor, a book, whatever). We don’t get good ideas from other people like we catch a cold. We actively recognize an idea as good and re-create it for ourselves. This is work, and work creates friction — this is why good ideas don’t spread as quickly as mere viral entertainment, which passes through our hands quickly and leaves us unchanged. Sure, this can be exciting or pleasant work, but it requires active involvement. That’s pretty much true for anything you’d call education.

Antidote One: DRIVE

Ever wish you could attend a live TED event? Well, the DRIVE conference (Feb. 20-21 in Seattle) captures a bit of that flavour: Ideas are front and centre, not professions. Let me explain … Many or most conferences are of the “birds of a feather” variety: fundraisers talking to fundraisers, analysts talking to analysts, researchers talking to researchers, IT talking to IT. The DRIVE conference (which I have written about recently) is a diverse mix of people from all of those fields, but adds in speakers from whole other professional universes, such as a developmental molecular biologist and a major-league baseball scout.

Cool, right? But if you’re going to attend, then do the work: Listen and take notes, re-read your notes later, talk to people outside your own area of expertise, write and reflect during the plane ride home, spin off tangential ideas. Dream. Better: dream with a pencil and paper at the ready.

Problem Two: “You’re the Expert, So Teach Me Already”

People may assume the person at the podium is an expert. They assume that the presenter has got something that the audience doesn’t, and that if it isn’t magically communicated in those 90 minutes then the session hasn’t lived up to its billing. Naturally, those people are going to leave dissatisfied, because that’s not how communicating about analytics works. If you’re setting up an artificial “me/expert” divide every time you sit down, you’re impeding your ability to be engaged as a conference participant.

Antidote Two: APRA Analytics Symposium

Every year, the Association of Professional Researchers for Advancement runs its Data Analytics Symposium in concert with its international conference. (This year it’s Aug 7-8 in Baltimore.) The Symposium is a great learning opportunity for all sorts of reasons, and yes, you’ll get to hear and meet experts in the field. One thing I really like about the Symposium is the case-study “blitz” that offers the opportunity for colleagues to describe projects they are working on at their institutions. Presenters have just 20 or so minutes to present a project of their choice and take a few questions. Some experienced presenters have done these, but it’s also a super opportunity for people who have some analytics experience but are novice presenters. It’s a way to break through that artificial barrier without having to be up there for 90 minutes. If you have an idea, or would just like more information on the case studies, get in touch with me or with conference chair Audrey Geoffroy. Slots are limited, so you must act quickly.

I present at conferences, but I assure you, I have never referred to myself as an “expert”. When I write a blog post, it’s just me sweating through a problem nearly in real time. If sometimes I sound like I knew my way through the terrain all along, you should know that my knowledge of the lay of the land came long after the first draft. I like to think the outlook of a beginner or an avid amateur might be an advantage when it comes to taking readers through an idea or analysis. It’s a voyage of discovery, not a to-do list. Experts have written for this blog, but they’re good because although they know their way around, every new topic or study or analysis is like starting out anew, even for them. The mind goes blank for a bit while one ponders the best way to explore the data — some of the most interesting explorations begin in confusion and uncertainty. When Peter Wylie calls me about an idea he has for a blog post, he doesn’t say, “Yeah, let’s pull out Regression Trick #47. You know the one. I’ll find some data to fit.” No — it’s always something fresh, and his deep curiosity is always evident.

So whichever way you’re facing when you’re in that conference room, remember that we are all on this road together. We’re at different places on the road, but we’re all traveling in the same direction.
