CoolData blog

12 July 2012

Evaluate models with fresh data using Tableau heat maps

When I build predictive models, I normally don’t build just one for each purpose. Presumably the model is going to be used, so I want it to be the best one possible. Yes, I test the model scores against a holdout data sample, but if I built only one model, I wouldn’t have anything solid on which to base my evaluation of the results. I might reject a lone model if it truly failed against the validation set, but that has never happened to me — even a lackluster performance can be better than nothing, and therefore the model is flawed, but useful. That statement is true of models in general. So testing results with nothing to compare against is pointless.

I usually produce one multiple linear regression model and one binary logistic regression model using the stats software package Data Desk. Many permutations are possible, though: The sample to be scored can be limited in various ways, and the dependent variable can be formulated any number of ways. The choice of technique (for me, one type of regression or another) is usually determined by the nature of the DV (though not always). Given unlimited time, I would produce multiple models, but doing two at a time is manageable and keeps the task of comparison simple. The model that does the best classifying the members of the holdout sample wins the prize, and the loser is discarded.

But there’s a problem. I’ve never had a model bomb when it comes to scoring the validation set, but I HAVE had models fail after deployment. Data that is held out for validation of the model is one thing — the real world outside the model can be a whole OTHER thing. Logically it should not be so: If the model doesn’t “know” anything about the holdout data, then you’d think its performance on it would indicate how it will perform in the future.

Not so. At least, not always.

I am not so quick, then, to discard the loser. I like to evaluate both models on fresh data as it comes in (new gifts, for example). The loser might be the better choice overall, or it might turn out that a combination of the two models performs better than one on its own. Maybe one model works better for a subset of the population (young alumni, say), which suggests that adding interaction terms or even using a multiple-model approach is something to consider in the future. If the models predict slightly different propensities (as a result of how the DVs were formulated), with both of them contributors to a desirable result, then it might be worthwhile keeping both score sets and combining them by multiplying them together.

I don’t have an extended period of time for such testing — the model needs to be put into operation before it gets stale. Unfortunately, evaluation has always been a cumbersome process. I need to query the database for fresh results (conversions, upgrades, new planned giving expectancies — whatever) and then match it up by ID and score for each model (scores for untested models are not going to be in the database, obviously), and then produce some charts in Excel to visualize and compare results. It’s not a ton of work, but it takes just long enough to prevent me from doing it more than once before it’s time to commit. Even if I am evaluating the models after the fact, in order to learn for the next iteration of model-building, it’s not an exercise I will want to carry out repeatedly.

There is a better way. Think reports.

What does a report do? A report pulls real-time (or nightly-refreshed) data and assembles it in an interpretable way in a tabular or visual display. It performs this service on a regular or semi-regular basis, or on-demand. (Yeah, okay, maybe I should have said an ideal report). If part of your job consists in report preparation as well as predictive modeling, then you should be building model scores into your reports.

Here’s a tutorial on how to use Tableau to easily create a report that compares the performance of two sets of model scores in a single visualization called a heat map. This visualization can be refreshed with live data as often as desired. If you want, you can add other fields (age, sex, degree, donor status, etc.) and easily filter the data to see how model performance differs depending on the composition of the population. Note that this is probably not a report you’ll be sharing with your vice president. It does look cool, but it is mainly a diagnostic and exploration tool for your own use. The small initial investment of time is worth it if you build multiple models — it can be reused again and again.

This tutorial assumes you’re already somewhat familiar with the basics of Tableau. If you don’t have the software, and you don’t want to download a free trial, stick around anyway — other software packages offer ways to create heat maps, and the basic idea is the same.

In this example, I am comparing percentile scores from two models I developed to predict which alumni are most likely to give at least $1,000 in the current fiscal year. One is a multiple linear regression model with a dependent variable defined as the sum of giving for the past five years (log-transformed). The other is a logistic regression model with a binary dependent variable defined as ‘has giving of at least $1,000 in any one of the past three years’. The exact definitions of the DVs are reasonable but somewhat arbitrary. They are closely related, but different. The techniques and the predictor variables are also different, so we should expect the models to yield different results. Tested against the validation set (which was the same for both models), the logistic model proved superior. But only a test on new gift data will be truly convincing.

I want to take the entire population of alumni whom I have scored (a sample of about 27,000 individuals), and match them up with what they have given since the model was created. In this made-up example, let’s suppose I created my models last August, and I want to see what those 27,000 alumni have given since the day I completed the work. In reality, I would have chosen a winning model months ago and this would be an after-the-fact analysis, but I am doing this in order to enrich the visualization for the purposes of this example. (Cheating, in other words.)

Tableau allows you to combine data from multiple sources. In this case, you will connect to an Excel file to get your model scores (since they’re not in the database), and then connect to your database for giving results since September 1. If you do not connect directly to your database from Tableau, then you can paste your gifts data into a second sheet in your Excel workbook and extract the data via a single connection to that file — no problem. The first worksheet will have three columns: One for unique ID, and one each for the scores from the two models. In this example, the scores were output from Data Desk as percentiles. If you want, you can add columns for key attributes such as age, sex and so on. The second worksheet (or the custom SQL that retrieves data directly from your data warehouse) will provide ID and sum of giving since September 1.

Normally in report creation, Tableau handles all the aggregation of the data — the input is raw transaction data, with each ID potentially appearing on multiple rows. In this example, however, we have aggregated the data already (summing giving by ID), and there is only one row of data for each ID. It doesn’t matter, but it might have implications for some of the specific steps that follow.
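For readers who prepare their data outside Tableau, here is a minimal Python sketch of the same preparation: sum the gifts by ID, then left-join the totals onto the scored population so that non-donors are kept. The field names and sample rows are made up for illustration, and the sketch fills in 0.0 for non-donors where a true left join would leave a null.

```python
from collections import defaultdict

# Hypothetical score sheet: one row per scored individual.
scores = [
    {"id": "A1", "mlr": 97, "logistic": 100},
    {"id": "A2", "mlr": 42, "logistic": 38},
    {"id": "A3", "mlr": 71, "logistic": 96},
]

# Hypothetical raw gift transactions since September 1; an ID may repeat.
gifts = [
    {"id": "A1", "amount": 750.0},
    {"id": "A1", "amount": 500.0},
    {"id": "A3", "amount": 25.0},
]


def sum_giving_by_id(transactions):
    """Aggregate raw transactions to one total per ID."""
    totals = defaultdict(float)
    for gift in transactions:
        totals[gift["id"]] += gift["amount"]
    return dict(totals)


def left_join_scores_to_giving(score_rows, giving_totals):
    """Keep every scored individual; those with no gifts get 0.0."""
    return [
        {**row, "sum_of_giving": giving_totals.get(row["id"], 0.0)}
        for row in score_rows
    ]


joined = left_join_scores_to_giving(scores, sum_giving_by_id(gifts))
```

Every scored ID survives the join, which is the point: the people who gave nothing are as important to the evaluation as the people who gave.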

You should refer to your Tableau references for connecting to data sources. All I will add is that when you add the table (or worksheet) that contains the giving data, be sure to left-join on ID, because obviously not everyone you have scored has given since Sept. 1. From here on in, I will use Tableau terminology that won’t make any sense if you don’t know the software (specifically, Tableau Desktop version 7.0). Let’s build our first view:

  1. If your data has been extracted correctly, ‘ID’ will be listed under Dimensions, and your two model score sets will be listed under Measures. In this example, I will from now on refer to them as MLR (for Multiple Linear Regression) and Logistic. Obviously I’m referring to my own data — just try to replicate what I’m talking about using your own data file.
  2. For now, pause Auto Updates (or turn off automatic updates).
  3. Right-click on Logistic and select “Create bins …” This will bin the percentile score into whatever size we desire. Change the default bin size to 5 and click OK. Note that a new variable is created in the Dimensions pane, because bins are categorical, not numerical.
  4. Right-click on MLR and do the same thing.
  5. Drag Logistic (bins) to the Columns shelf. Drag MLR (bins) to the Rows shelf.
  6. Drag ID to the Text shelf. Click on the down-arrow of the ID pill you’ve just created, and select Measure –> Count. This will create a count of all IDs that fall into each cell. It turns green to indicate it’s now a measure instead of a dimension. (Because each ID appears in our data only once, it doesn’t matter whether we use Count or Count Distinct.)
  7. Change the Marks type from Automatic to Square (right above the Text shelf). Notice that the Text shelf suddenly turns into a Label shelf — each square of the heat map will be labeled with the number of IDs.
  8. Drag ID from the Dimensions pane again, and this time drop it onto the Color shelf.
  9. Click on the down-arrow of the ID pill you’ve just created, and select Measure –> Count. This will base the color or shading of the cell on the number of IDs that fall into that cell.

The top left corner of your screen will look like this:

Now we’re ready to allow the view to automatically update. The result won’t look much like a heat map: Probably just a bunch of little squares with numbers beside them. We need to enlarge the squares. Under the Size shelf is a slider: Move this to the centre of the size range. Then drag one of the rows in the view to make it taller — hover over the axis for MLR (on the far left) until the pointer turns into an up-and-down arrow, then click and drag. When you let go, the squares will resize and the alleys of white space should start to close up. Keep messing with it until the squares touch on all sides. With a little formatting of labels for readability, the final product will look something like this. (Click on thumbnail image for full size.)

A heat map can convey a lot of information at a glance. You can immediately see where a lot of individuals are concentrated: They’re in the darkest squares. The numbers are hard to read, but up in the top left of the map, we see that the number of people who fall into the 0-4 bin in both the MLR and Logistic models is 572. In the lower right area of the map, we see that 563 people fell into the 95 to 99 bin in both models. Notice that Tableau didn’t bin evenly: Every single bin has 5 score levels in it except for the bin labeled 100, which contains only individuals with a score of 100. In the map, we see that 147 people scored exactly 100 in both models. This can be corrected (using a calculated field instead of automatic binning), but I have decided to leave it the way it is. Due to the nature of this modeling exercise, I am mainly interested in the top few percentile scores anyway, and the 100 group is of particular interest. Having them mapped separately from the rest is not a problem.
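The binning and counting behind the map can be approximated in a few lines of Python. This is only a sketch, assuming whole-number percentile scores from 0 to 100 and bins labeled by their lower edge; the field names and sample rows are hypothetical. It also shows why 100 ends up on its own: with bins five wide starting at 0, the bin that begins at 100 would cover 100 to 104, and only 100 exists.

```python
from collections import Counter

BIN_SIZE = 5


def score_bin(score, size=BIN_SIZE):
    """Return the lower edge of the score's bin, e.g. 93 -> 90, 100 -> 100."""
    return (score // size) * size


def cell_counts(rows):
    """Count individuals in each (MLR bin, Logistic bin) cell."""
    return Counter(
        (score_bin(r["mlr"]), score_bin(r["logistic"])) for r in rows
    )


# Hypothetical scored individuals.
rows = [
    {"id": "A1", "mlr": 3, "logistic": 0},
    {"id": "A2", "mlr": 4, "logistic": 2},
    {"id": "A3", "mlr": 97, "logistic": 99},
    {"id": "A4", "mlr": 100, "logistic": 100},
    {"id": "A5", "mlr": 68, "logistic": 95},
]

counts = cell_counts(rows)
```

Each key in `counts` is one square of the heat map, and each value is the number that would be printed on it.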

The names of the bins don’t reflect what they include. For example, “90” really means “90 to 94”. You can rename them using aliases. Right-click on Logistic (bins) in the Dimensions pane, select Field Properties –> Aliases…, and change the displayed values in the Values column. Do the same for MLR (bins).

We haven’t looked at the recent-gift data yet, but before we move on, what can we learn from this view? It appears the models agree on the individuals with extremely high or extremely low scores. In the middle range, there is still a lot of agreement but also many more cases of divergence, in which an individual scores high in one model but low in the other. This is clear, at-a-glance evidence that our models are similar but different. Depending on the application, choosing one model over the other could have a big effect on the result, for better or worse. In this particular application, where I am interested mainly in very high-scoring alumni, it may not make that much difference at all … but let’s not jump to that conclusion just yet.

If your data set included some key grouping information such as age or sex, it might be interesting to create a filter to examine whether the models differ on those factors. Here’s an example with ‘Age’:

  1. Drag Age from the Measures pane into the Filters shelf.
  2. When Tableau asks you how you want to filter on Age, select “All Values” and click Next.
  3. On the next box, select Range of Values, and click OK.
  4. Hover over the green Age pill on the filters shelf, click the down-arrow on the right end of the pill, and select Show Quick Filter.

Now you can set the upper and lower age bounds of the individuals you want to be counted in the heat map. As you slide the scale, it will display Age with numbers after the decimal, even though your values are all whole numbers. If this bothers you, right-click on Age in the Measures pane, select Field Properties –> Number Format…, and click on Number (Custom). Adjust the number of decimal places to zero. Here’s what the quick filter looks like:

The next two images show the heat map for different age ranges. The first one is ages 20 to 50, the second is 51 to 80. Again, click on the thumbnails for full-size images — although the beauty of a heat map is that you can see the pattern from a distance.

Right off the bat, it’s evident that it’s harder for younger individuals to get a high score, but they fare better in the MLR model than they do in the Logistic model. Imagine a 45-degree line sloping from the top left corner to the bottom right corner — the presence of more dark-shaded squares under that line indicates individuals with higher MLR scores than Logistic scores. The logistic model, on the other hand, slightly favours older alumni. This alone might explain why the Logistic model outperformed the MLR model in terms of the validation set. The difference might be due to how age-related variables were introduced to each model as predictors; they may have been more influential in one than the other. It’s hard to say without going back to the models themselves for a close look.
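The "which side of the diagonal" reading can also be put into numbers. Here is a rough Python sketch, assuming each individual has an age and a raw percentile score from each model; it compares raw scores rather than binned ones, and the field names and sample rows are invented for the example.

```python
def filter_by_age(rows, low, high):
    """Keep individuals whose age falls within [low, high]."""
    return [r for r in rows if low <= r["age"] <= high]


def diagonal_split(rows):
    """Count individuals on each side of the MLR == Logistic diagonal."""
    split = {"mlr_higher": 0, "equal": 0, "logistic_higher": 0}
    for r in rows:
        if r["mlr"] > r["logistic"]:
            split["mlr_higher"] += 1
        elif r["mlr"] < r["logistic"]:
            split["logistic_higher"] += 1
        else:
            split["equal"] += 1
    return split


# Hypothetical scored individuals.
rows = [
    {"id": "A1", "age": 27, "mlr": 60, "logistic": 45},
    {"id": "A2", "age": 34, "mlr": 52, "logistic": 52},
    {"id": "A3", "age": 48, "mlr": 70, "logistic": 58},
    {"id": "A4", "age": 63, "mlr": 81, "logistic": 92},
    {"id": "A5", "age": 75, "mlr": 88, "logistic": 97},
    {"id": "A6", "age": 58, "mlr": 90, "logistic": 85},
]

younger = diagonal_split(filter_by_age(rows, 20, 50))
older = diagonal_split(filter_by_age(rows, 51, 80))
```

If `mlr_higher` dominates in the younger group and `logistic_higher` in the older one, that matches what the two heat maps suggest by eye.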

One can spend a lot of time playing and learning with these filters. Let’s fast-forward and (finally) introduce recent-gift data — the giving that all scored individuals have engaged in since September 1, the day after the models were supposedly created. This data appears in the Measures pane as a variable I’ll call ‘Sum of Giving’. I’m specifically interested in who has given at least $1,000 (cumulatively), so I will need to create a calculated field to flag these people.

  1. Right-click on Sum of Giving and select Create Calculated Field…
  2. Give the field a name. I called it “Leadership donor”.
  3. The field Sum of Giving is already in the expression window. Now you just need to add some text around it to complete the expression:
  4. Click OK. This creates a field (variable) with the value 1 for any donor who has given at the Leadership level, and nothing if otherwise. Note that you can enter any amount in place of 999. If you want to count donors vs. non-donors, enter “>0”.
  5. The field appears in the Measures pane, because Tableau recognizes it as numeric. We’re using it as a categorical variable, so let’s convert it into a Dimension instead. Right-click on the field name and select “Convert to Dimension”, or simply drag the field into the Dimensions pane — both actions accomplish the same thing.
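In Python terms, the flag described in these steps could be sketched as follows. This is an illustration of the logic only (1 when giving exceeds the threshold, nothing otherwise), not Tableau syntax, and the function and constant names are my own.

```python
LEADERSHIP_THRESHOLD = 999  # flag anyone whose giving exceeds this amount


def leadership_flag(sum_of_giving, threshold=LEADERSHIP_THRESHOLD):
    """Return 1 for giving above the threshold, otherwise None."""
    if sum_of_giving is not None and sum_of_giving > threshold:
        return 1
    return None


# Hypothetical totals: a Leadership donor, a smaller donor, and a non-donor.
flags = [leadership_flag(total) for total in (2500.0, 150.0, None)]

# Passing a threshold of 0 counts donors vs. non-donors instead.
donor_flags = [leadership_flag(total, threshold=0) for total in (2500.0, 150.0, None)]
```

One thing to watch: comparing against 999 treats a total of $999.50 as Leadership-level. If your totals include cents, a greater-than-or-equal test against 1,000 may be closer to the intent.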

Now we have a flag we can use to zero in on our higher-end donors. Let’s create a new view for that. At the lower left of your screen, right-click on the tab for the existing view and select “Duplicate Sheet”. This will allow us to continue exploring the heat map without changing our original version. We could, of course, do all our work in a single view and use filters to dynamically alter the view — that’s one of the strengths of Tableau — but for now let’s keep our views separate.

  1. If you still have filters applied for Age or other variables, click on the quick filter menu and select “Clear Filter”. You can reapply it later if you want — we’re just getting it out of the way so we can see the full picture.
  2. Drag ‘Leadership donor’ to the Filters shelf.
  3. In the box that pops up, click “Select from list” on the General tab (it should already be selected), and then check the little box for ‘1’.
  4. Click OK.

The result looks like this. (Click for full size.)

Our big donors are clustered nicely down in the lower right corner, where both the MLR and the Logistic model scores are very high. Some of the lower-score bins contain zero Leadership donors, and Tableau has automatically hidden those rows and columns from view. Take a couple of minutes to study the map. Follow the three darkest squares (labeled 48, 74, and 23) as they form a 45-degree line up the centre of the map. If you compare the values in the squares that are directly opposite each other over this line, you’ll notice that there are slightly more Leadership donors on the upper side of the line. Those are donors who have higher Logistic scores than MLR scores. As well, notice that the scattered cloud of donors above the line is more extensive than that below the line. These observations should lead us to believe that the Logistic model performs slightly better than the MLR model.

That conclusion is a bit hasty, though. There might be more Leadership donors on the high-Logistic/low-MLR side simply because more alumni ended up in those squares in the first place. We need to calculate the PERCENTAGE of the population of each square that went on to become a Leadership donor. That’s right, we’re going to create a third view, and calculate percentages to plug into each square.

  1. Right-click on the tab for Sheet 2 and select Duplicate Sheet. (By the way, you can name these sheets whatever you want, just as in Excel.)
  2. Remove the filter for Leadership donor.
  3. Under Analysis in the top menu bar, select Create Calculated Field…
  4. Name the new field ‘Leadership percentage’.
  5. Enter this expression, which divides the number of Leadership donors by the total number of individuals.
  6. Click OK. The new field appears in the Measures pane, which is fine.
  7. Drag ‘Leadership percentage’ from Measures onto the Label shelf, replacing the count of ID.
  8. Drag ‘Leadership percentage’ from Measures again, this time onto the Color shelf.
  9. Right-click on any square in the map, and select Format…, which opens a formatting pane at the far left.
  10. On the Pane tab, in the Default section, click on the down-arrow to the right of “Numbers”, and select Percentage.
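The same calculation, Leadership donors divided by total individuals in each cell, can be sketched in Python. It assumes bins five wide labeled by their lower edge and a giving total of 0.0 for non-donors; the field names, threshold default and sample rows are hypothetical.

```python
from collections import defaultdict


def score_bin(score, size=5):
    """Return the lower edge of the score's bin."""
    return (score // size) * size


def leadership_rate_by_cell(rows, threshold=999):
    """Share of individuals in each cell whose giving exceeds the threshold."""
    totals = defaultdict(int)
    leaders = defaultdict(int)
    for r in rows:
        cell = (score_bin(r["mlr"]), score_bin(r["logistic"]))
        totals[cell] += 1
        if r["sum_of_giving"] > threshold:
            leaders[cell] += 1
    return {cell: leaders[cell] / totals[cell] for cell in totals}


# Hypothetical scored individuals with giving since the models were built.
rows = [
    {"mlr": 100, "logistic": 100, "sum_of_giving": 2500.0},
    {"mlr": 100, "logistic": 100, "sum_of_giving": 0.0},
    {"mlr": 100, "logistic": 100, "sum_of_giving": 1000.0},
    {"mlr": 100, "logistic": 100, "sum_of_giving": 150.0},
    {"mlr": 67, "logistic": 96, "sum_of_giving": 1200.0},
    {"mlr": 66, "logistic": 98, "sum_of_giving": 0.0},
    {"mlr": 12, "logistic": 8, "sum_of_giving": 0.0},
]

rates = leadership_rate_by_cell(rows)
```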

The result is below. (Click for full size.) You can select any precision for your percentages — I’ve rounded to whole numbers to avoid clutter.

The darkest square is a single donor with a very high MLR score but a very low Logistic score, who just happened to give at the Leadership level. That square is of course labeled 100%, which causes the rest of the display to be toned down to a degree that makes it hard to see the patterns. This single donor might be a person to look at more carefully, but for now, let’s exclude that person from the map. Hover your pointer over the square, and select Exclude from the tooltip box. (This creates a specific filter for this individual, which you can remove anytime.) All the squares are recoloured accordingly:

Now some of the darkest squares are also based on very sparse data. You can exclude any that you wish, but I’m fine with this display for now. For one thing, we can clearly see that having a Logistic score of 95 or higher is darn significant, regardless of what a donor’s MLR score is. For example, there are four Leadership donors who scored only 65-69 in the MLR model but have Logistic scores of 95-99, which is what we want to see. (Those donors are in the square labeled 14%.)

Being able to demonstrate that one model is superior is pretty nifty. But I am especially intrigued at how easy it is to see how the models might work together to improve accuracy.

Have a look at the square containing individuals who scored 100 in both models. There were 147 such individuals, and 48 of them gave $1,000 or greater — a whopping 32.6%. Here are a couple of facts to think about:

  • Of all the individuals who scored 100 in the Logistic model, 26.7% went on to give at the Leadership level.
  • Of all the individuals who scored 100 in the MLR model, 23.1% went on to give at the Leadership level.

Do you see what I’m getting at? When we combine both scores and zero in on people in the top percentile for both models, our yield of Leadership donors increases by nearly six percentage points over the best-performing model, to 32.6%.

The same boost is evident for other high-scoring cells in the heat map: The logistic model identifies some big donors that the MLR model misses, but the MLR model can enhance the accuracy of the logistic model. This is potentially useful for prospect identification in Major Giving, when we really want to be as focused as possible.

So far I’ve shown you only donor numbers. What about revenue? Our data set includes gift amounts, so let’s create a new view to visualize actual aggregate dollar totals.

  1. Duplicate the last sheet you created, and remove any filters that had been applied.
  2. Drag ‘Sum of Giving’ to the Label and Color shelves.
  3. Format the values as currency.
  4. For fun, change the color from green to red by clicking on Edit Colors in the context menu for the Sum of Giving card.

The result is pretty dramatic.

This is for all donors, not just Leadership donors, but if you want to narrow it down to Leadership donors only, re-apply your filter.

Just as with raw donor counts, the view above is a little misleading, simply because more prospects equals more donors, equals more dollars. So let’s create a calculated field to give us AVERAGE dollars per donor for every cell in the heat map.
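As a sketch of that calculation in Python: total dollars in each cell divided by the number of donors in the cell. I am assuming here that "donor" means anyone with giving above zero; dividing by everyone in the cell, donor or not, would be a different (and lower) average. Field names and sample rows are hypothetical.

```python
from collections import defaultdict


def score_bin(score, size=5):
    """Return the lower edge of the score's bin."""
    return (score // size) * size


def average_gift_by_cell(rows):
    """Average giving per donor in each (MLR bin, Logistic bin) cell."""
    dollars = defaultdict(float)
    donors = defaultdict(int)
    for r in rows:
        if r["sum_of_giving"] > 0:
            cell = (score_bin(r["mlr"]), score_bin(r["logistic"]))
            dollars[cell] += r["sum_of_giving"]
            donors[cell] += 1
    # Cells with no donors are left out rather than reported as zero.
    return {cell: dollars[cell] / donors[cell] for cell in donors}


# Hypothetical scored individuals with giving since the models were built.
rows = [
    {"mlr": 100, "logistic": 100, "sum_of_giving": 6000.0},
    {"mlr": 100, "logistic": 100, "sum_of_giving": 4000.0},
    {"mlr": 100, "logistic": 100, "sum_of_giving": 0.0},
    {"mlr": 97, "logistic": 100, "sum_of_giving": 500.0},
    {"mlr": 40, "logistic": 42, "sum_of_giving": 0.0},
]

averages = average_gift_by_cell(rows)
```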

The individuals with scores of 100 in both models gave nearly $5,000 on average, and no other cell comes close. Compare that with what each model manages on its own:

  • The individuals who scored 100 in the Logistic model gave an average of $2,927.
  • The individuals who scored 100 in the MLR model gave an average of $2,971.

The models are strongest where they intersect!

I’ve spent a lot of time and more than 4,000 words explaining how to do this in Tableau. This is very unusual for me — why a specific product such as Tableau, when one can create heat maps even in Excel? *

  • It’s just so easy to do it in Tableau, and the result looks attractive without requiring the user to fuss with formatting.
  • The data can be refreshed whenever necessary. If you’re connecting to an Excel file, simply paste new data into the file and refresh the data extract. It’s that simple. (Remember to refresh the extract rather than replace the data source entirely, if you want to retain your aliases as you’ve defined them.)
  • That goes for refreshing the giving data, AND for loading a whole different set of individuals and scores. You don’t need to rebuild these views from scratch (although it’s pretty easy to do so).
  • Tableau allows you to dynamically filter the data any which way you want. It’s a great way to explore the data. In my example, it would have been really interesting to filter on donors who UPGRADED to the $1,000+ level. Which model did a better job predicting upgrading? I don’t know, but I’m going to find out.
  • You can drill down to the underlying data. If you want to see a list of the people who scored 100 in both models, just hover the pointer over that square and click on the data icon, then the ‘Underlying’ tab. Imagine having wealth/capacity scores on one axis, and propensity scores on the other …
  • I’ve shared my heat maps here as static images, but you can share your analyses as fully-functioning views, even with people who don’t have the software on their computers. Save it as a Packaged Workbook, and they’ll be able to open it in Tableau Reader (which they can download for free). They can use the filters you’ve set up to play with the data themselves.

This may be the longest CoolData post ever, but as usual I feel I am barely scratching the surface.

* P.S.: Heat maps are easily created in a combination of Data Desk and Excel. Without going into too much detail: In Data Desk use contingency tables (a.k.a. cross tabs) to create the basic matrix of numbers, with one score set as x and the other as y, and use derived variable expressions to limit the counts as desired. Copy and paste the table text into Excel, and use conditional formatting to create the desired shading. Unfortunately this requires some fussing and the result is static.

20 January 2010

Another take on Google’s Motion Charts

Filed under: Coolness, Data visualization, Free stuff — kevinmacdonell @ 9:09 am

Late last year I posted a tutorial on creating Google motion charts with your data. These very cool charts work with your time-series data, stored in Google Docs, to create an animation with the power to convey a lot of information in an easily understandable form.

But what about private data? You may not want to rely on Google’s ability to password-protect your data, or the privacy provisions you work with may prohibit posting data to an outside server.

Here’s another way to take advantage of motion charts. I was put onto this by Trevor Skillen, President and CEO of Metasoft, in Vancouver BC, whose company is working on incorporating motion charts into their well-known FoundationSearch product.

This version uses stored code to manipulate your data locally, rather than pulling it from Google Docs.

The advantages are clear:

  • Your data is stored locally and the code is executed locally, in the browser – nothing is sent to Google.
  • You gain precise control over the appearance – you can hide options that the user doesn’t need to see.
  • The example code provided by Google is fairly easy to modify without requiring programming or scripting skills.

Trevor directed me to Google’s ‘playground’ where one can get a quick feel for the technology without much tech effort.

There is a downside … there is a good deal of manual coding you’ll have to do if you want to put a chart together using your own data. This limits you to fairly simple charts – unless you’re capable of writing the additional code that will allow the chart to get data from a file or table.

10 December 2009

Cool motion charts – Part 4

Filed under: Data visualization, Free stuff — kevinmacdonell @ 1:30 pm

In my previous post in this tutorial, I described how to assemble the data to create your bubble chart. Now comes the relatively painless part: Pasting it into Google Docs and inserting a Google Gadget – the motion chart itself.

To review, the required columns in your spreadsheet should be in this order:

  • A column to define the bubbles (in our case, this is Class Decade)
  • A column to define the time series (Year, i.e. fiscal year of giving)
  • At least two columns of numerical data for the x-axis and y-axis. (You can have more than two columns, to give you more options for charting, but you need at least two. I used Median Gift for the y-axis, and a choice of either Number of Donors or % Participation for the x-axis.)
  • You may also have a column for Category, which just labels the circles in the legend (in our example, this is just a duplication of the data in the Class Decade column)

Assuming you already have a Google or Gmail account, navigate to Google Docs and click on ‘Create New’. Choose ‘Spreadsheet’ from the drop-down menu. Copy all the cells of your Excel spreadsheet, and paste them directly into the Google spreadsheet. Give the file a name, and Save.

(I’m going to assume that you have permission to post your institution’s data online. Keep in mind that you can block public access to the data, or limit it to select invitees who have to log in, or make it wide open and available to all. In any case, it would be best to seek approval.)

Select all of the cells in your sheet that contain your chart data, including the column headers. (Don’t select whole columns – click on cell A1, then hold shift down while clicking on the rightmost cell in the very last row of the sheet.)

In the spreadsheet menu, choose Insert. Click on Gadget.

A window of options will open. You might have to scroll down to find Motion Chart. Click the ‘Add to spreadsheet’ button.

The chart settings window will appear on top of your spreadsheet. (If you don’t see it, scroll up!)

The Range field will already be populated, because you had those cells selected before inserting the gadget. You can modify the range if need be.

Enter a title in the Title field. Ignore the other fields for now.

Click Apply and close.

The chart will take a second or two to appear. It won’t look right – we need to tweak it a bit.

It will also be rather small and hard to work with. To move it to its own sheet, click on the little down-arrow at the top left of the chart title bar, and select “Move to own sheet …” from the drop-down menu.

(For additional help at this stage, select Help from the More drop-down menu at top right.)

Now let’s choose the correct values for our x-axis and y-axis.

Click on the x-axis name, and choose the desired value from the options that pop up. (We’re using % Participation.)

(Ignore the Lin and Log menus for now. We’ll leave the scale as Linear, rather than Logarithmic.)

Now click on the y-axis name, and choose Median $.

Notice that the bubbles adjust their orientation accordingly.

Other items that you’ll want to tweak are below. All of these are able to be saved as the default state of your chart:

  • Colour: This should be set to ‘Category’.
  • Size: Set this to ‘Number alumni’. For fun, you can also set this to ‘Number of donors’ – then the bubbles will change size over time!
  • Playback speed: The little triangle to the right of the Play button. I usually set this on the slowest speed.
  • Starting year: Push the slider all the way to the left.
  • Labels and trails: You can also click on individual bubbles to label them, or display their trails as they move.

If you play around a bit, which I know you will, you’ll notice that it’s very easy to lose all your settings. And if you try to share your chart with someone else, it won’t display in their browser the way you want it to.

The method for saving your default chart state will be covered in Part 5.

9 December 2009

Quick and easy visuals of large text files

Filed under: Data visualization, Free stuff, Text — kevinmacdonell @ 7:30 pm

Earlier this year we conducted an extensive survey of alumni, made up mostly of scale statements but including a few free-text comment fields as well. Respondents typed in nearly 80,000 words in comments – that’s slightly longer than the first Harry Potter book!

Somebody has to read all this stuff (not me!). But what can we do with it in the meantime?

Why not play with it in Wordle?

According to the web site (www.wordle.net), Wordle is “a toy for generating ‘word clouds’ from text that you provide. The clouds give greater prominence to words that appear more frequently in the source text. You can tweak your clouds with different fonts, layouts, and color schemes. The images you create with Wordle are yours to use however you like.”
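The idea behind a word cloud, giving more prominence to words that appear more often, comes down to a frequency count. Here is a minimal Python sketch of that count. It is not Wordle's code; the tokenizing rule and the tiny stop-word list are my own assumptions, and the sample comment is invented.

```python
import re
from collections import Counter

# A small, hypothetical stop-word list, just enough for the example.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "were", "was", "is", "in", "i", "my"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each word appears, ignoring case and stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)


# Hypothetical free-text survey comment.
comment = "The professors were great. Great professors, great classes!"
freq = word_frequencies(comment)
top_words = freq.most_common(2)
```

The counts in `freq` are what a word cloud would translate into font size.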

Here is a word cloud for the free-text comments made in response to the question, “Do you have any other comments about your academic experience?”

You can also enter the URL of any blog or feed into Wordle, and it will generate a word cloud from that. Here’s what a Wordle of this blog looks like (so far).

Word cloud for CoolData blog (up to 9 Dec 2009).

Useful or just a toy? I did use some of these word clouds in a presentation of the alumni survey results, and the response told me it was worth it. It’s a cool thing, and people like cool things. I also see Wordle creations in newspapers – I think the first example I ever saw was a comparison of campaign speeches made by Barack Obama and John McCain.

What do you use word clouds for?

Cool motion charts – Part 3

Filed under: Data visualization, Free stuff – kevinmacdonell @ 11:21 am

In Part 1 I introduced the concept of using Google Gadgets to visualize our data. In Part 2 I gave an example of an alumni giving history visualization.

Today I will offer some additional notes on how the Giving chart spreadsheet was assembled. As I said before, assembling the data was a lot more work than actually making the chart! It would be helpful for you to have query access to your institution’s database for this, but if you can get your IT people to extract the data, you should be able to prepare that data in Excel.

I started with an MS Access query of our database (Banner) which gave me a row for every single alumnus/na, living and deceased (about 43,000 records), donors and non-donors, with a column for Class Year, and a column for Total Giving for each fiscal year from 1989 to 2009.

(Each fiscal year total was in fact a sub-query – there’s definitely some work involved. Luckily I had all these sub-queries already made, from previous data projects. Gathering all this data by individual may not be the easiest way to go about this. If you can obtain data that is already aggregated in some way, such as by class year of donor, go for it.)

The length of your time series will be limited by the depth of historical giving data available in your database – ours happens to go back 20 years.

Let’s look at the final spreadsheet we need to create. Here are the first few rows of mine:

We need to collapse our tens of thousands of rows of data on individual alumni into something like what you see above. This is a lot easier to do with a statistics software package, but no doubt it can be accomplished in Excel as well.

I use Data Desk for stats. Paste all your individual giving data into a new data file. The first step is to create a derived variable to recode Class Year into decades (1930s, 1940s, etc.). These will be used to populate the first column of our spreadsheet.
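If you don't have Data Desk, the same recoding is a one-liner in most languages. Here is a rough Python sketch, assuming Class Year is stored as a four-digit integer; the function name `class_decade` is my own invention.

```python
def class_decade(class_year):
    """Recode a four-digit class year into a decade label, e.g. 1987 -> '1980s'."""
    return f"{(class_year // 10) * 10}s"

# A few sample class years, made up for illustration.
print([class_decade(y) for y in (1934, 1949, 1987, 2003)])
# prints ['1930s', '1940s', '1980s', '2000s']
```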

The first column defines the ‘bubbles’. The last column (Category) gives the bubbles names. The two columns contain exactly the same data, but the gadget requires this label information to be contained separately.

The second essential column is the time-series data, in this case Year. This isn’t Class Year – this is the fiscal year in which gifts were made. Year is the variable reflected in the time slide-control at the bottom of the chart.

Creating the spreadsheet was a bit tedious from this point on. It involved a lot of cutting and pasting of the desired values from Data Desk into an Excel spreadsheet. You may be able to vastly improve on my methods.

The middle column, Number Alumni, is the count of individuals in each Class Decade. The relative size of each bubble is based on this number, from 719 in the 1930s to more than 10,000 in the 2000s. (I’ll show you how to define bubble size later.)

The other columns provide the data that will form the x-axis and y-axis of the chart.

In my chart, % Participation or Number donors can be on the x-axis – you can choose one or the other. (% Participation = Number donors divided by Number alumni.)

And finally, the data in the column Median $ is used for the y-axis.
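If your summary numbers end up in a script rather than a stats package, you could write them straight to a CSV file and skip some of the cutting and pasting. The Python sketch below is only one possible layout: the header name `Class Decade` and the exact order of the columns are my assumptions, and apart from the 719 alumni in the 1930s, the figures in the sample row are made up.

```python
import csv
import io

# Assumed column order: bubble id first, Year second, Category last.
# Adjust the names and order to match your own spreadsheet.
COLUMNS = ["Class Decade", "Year", "Number donors", "Number alumni",
           "% Participation", "Median $", "Category"]

def to_csv(rows):
    """Render a list of row dicts as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# One illustrative row; donor count and median are invented.
rows = [
    {"Class Decade": "1930s", "Year": 1995, "Number donors": 72,
     "Number alumni": 719, "% Participation": round(72 / 719, 4),
     "Median $": 100, "Category": "1930s"},
]
csv_text = to_csv(rows)
print(csv_text)
```

The resulting text can be saved to a file and imported into a spreadsheet.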

As I said, I used Data Desk to do the calculations and order the data before pasting it into Excel and arranging it there. Here is a Data Desk Summary by Groups table showing giving totals by Class Decade for the year 1995:

This gave me my donor counts as well as my median giving data. Of course, I had to do the same thing over and over for each fiscal year, 1989 to 2009.
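A loop can take care of that repetition. The Python sketch below is not what I did in Data Desk; it is a minimal illustration of the same summary, assuming each person is a record with a class year and a total for each fiscal year. It treats anyone with giving above zero in a year as a donor and takes the median over donors only, which is an assumption on my part. The record layout, the function name `summarize` and the sample data are all made up.

```python
from collections import defaultdict
from statistics import median

def summarize(alumni, fiscal_years):
    """Collapse per-person giving into one row per class decade per fiscal year.

    alumni: list of dicts with 'class_year' (int) and
            'giving' ({fiscal_year: total given that year}).
    """
    by_decade = defaultdict(list)
    for person in alumni:
        decade = f"{(person['class_year'] // 10) * 10}s"
        by_decade[decade].append(person)

    rows = []
    for decade in sorted(by_decade):
        members = by_decade[decade]
        for year in fiscal_years:
            # Assumption: a donor is anyone with giving above zero that year.
            gifts = [p["giving"].get(year, 0) for p in members]
            gifts = [g for g in gifts if g > 0]
            rows.append({
                "Class Decade": decade,
                "Year": year,
                "Number alumni": len(members),
                "Number donors": len(gifts),
                "% Participation": len(gifts) / len(members),
                "Median $": median(gifts) if gifts else 0,
                "Category": decade,
            })
    return rows

# Made-up records for illustration only.
alumni = [
    {"class_year": 1985, "giving": {1995: 100, 1996: 50}},
    {"class_year": 1988, "giving": {1995: 300}},
    {"class_year": 1989, "giving": {}},
    {"class_year": 1992, "giving": {1996: 25}},
]
rows = summarize(alumni, fiscal_years=range(1995, 1997))
for row in rows:
    print(row)
```

With real data you would pass in the full list of records and `range(1989, 2010)` to cover fiscal years 1989 to 2009.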

Give it a shot. Use the tools you’ve got to massage your data until it’s in the form of the spreadsheet I showed above. This really is the time-consuming phase of your project. Once you’re done, the rest is surprisingly easy. I’ll get to that in Part 4.

8 December 2009

Cool motion charts – Part 2

Filed under: Data visualization, Free stuff – kevinmacdonell @ 7:43 pm

In my previous post, Cool motion charts – Part 1, I talked about a Google Gadget which allows you to make a motion chart based on a spreadsheet of data you create in Google Docs. Now I will show you a live example I created. In Part 3, I will go into more detail about how you can create your own.

WordPress does not allow embedding of scripts, and Google Gadgets are scripts, so you’ll have to click on the link below to view the motion chart. The spreadsheet will open in a separate tab or window. Click on the “Motion Chart” tab at the lower left, then wait a moment for the chart to build.

(See note at bottom of this post for some caveats related to this data.)

Motion chart, "20 years of alumni giving". (Click to access.)

When you click the Play button in the lower left-hand corner, the animation runs from fiscal years 1989 through 2009. At any time you may click and drag on the time scale to control the year being displayed. Animation speed control is to the right of the Play button.

Each circle represents a class decade (1930s to 2000s). The size of the circle corresponds to the number of alumni, living and deceased. Click on any circle to highlight that decade. (Click on ‘Trails’ to record how it moved over time.)

The vertical axis is Median Gift for that fiscal year. I used median instead of mean (average) to avoid huge swings due to the influence of large outlier gifts.
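To see why the choice matters, here is a tiny Python illustration with made-up gift amounts, one of which is a large outlier. It uses only the standard `statistics` module.

```python
from statistics import mean, median

# Made-up gifts for one class decade in one fiscal year,
# including a single very large outlier.
gifts = [25, 50, 50, 100, 100_000]

mean_gift = mean(gifts)      # pulled far upward by the one large gift
median_gift = median(gifts)  # the middle value, unaffected by the outlier

print(mean_gift, median_gift)
```

The mean comes out at 20,045 while the median stays at 50, which is much closer to what a typical donor gave.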

The horizontal axis is Percentage Participation for that fiscal year. I just noticed that the gadget has converted all my percentages to decimals, as in 20% = 0.20. (If you click on % Participation, you have the option of changing the view to Number of Donors. Those are the only two views that make any sense.)

It took a lot longer to prepare the data for the spreadsheet than it did to produce the chart. Click on: Part 3 (data preparation) to continue!

[Note on data: Actual donor participation figures we have published are higher than this chart suggests, as multi-alumni households (which are common for our institution) are counted here as a single donor when gifts are not split. As well, giving figures include both annual and campaign gifts, but exclude student giving (via union fees), estate giving, and all other non-alumni giving.]
