CoolData blog

5 December 2016

Amazing things with matching strings

Filed under: Coolness, Data integrity, SQL — kevinmacdonell @ 7:44 am

 

I had an occasion recently when it would have been really helpful to know that a new address added to the database was a duplicate of an older, inactivated address. The addition wasn’t identified as a duplicate because it wasn’t a perfect match — a difference similar to that between 13 Anywhere Road and 13 Anywhere Drive. 

 

After the fact, I did a Google search and discovered some easy-to-use functionality in Oracle SQL that might have saved us some trouble. Today I want to talk about how to use UTL_MATCH and suggest some cool applications for it in Advancement data work.

 

“Fuzzy matching” is the term used for identifying pairs of character strings that may not be exactly the same, but are so close that they could be. For example, “Washignton” is one small typo away from “Washington,” but the equivalence is very difficult to detect by any means other than an alert pair of human eyes scanning a sorted list. When the variation occurs at the beginning of a string — “Unit 3, 13 Elm St.” instead of “Apmt 3, 13 Elm St.” — then even a sorted list is of no use.

 

According to this page, the UTL_MATCH package was introduced in Oracle 10g Release 2, but first documented and supported in Oracle 11g Release 2. The package includes two functions for testing the level of similarity or difference between strings.

 

The first function, EDIT_DISTANCE, counts the number of single-character “edits” (insertions, deletions, or substitutions) required to get from one string to another. For example, the edit distance from “Kevin” to “Kelvin” is 1, from “New York” to “new york” is 2, and from “Hello” to “Hello” is 0. (A related function, EDIT_DISTANCE_SIMILARITY, expresses the distance as a normalized value between 0 and 100 — 100 being a perfect match.)
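
If you want to try these before pointing them at real tables, a quick sanity check against string literals should run in any Oracle 11g-or-later session (DUAL is Oracle’s built-in one-row dummy table):

SELECT
  UTL_MATCH.edit_distance('Kevin', 'Kelvin') AS dist,           -- 1 edit: insert the "l"
  UTL_MATCH.edit_distance_similarity('Kevin', 'Kelvin') AS sim  -- same comparison, normalized to 0-100
FROM dual;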

 

The second function, the one I’ve been experimenting with, is called JARO_WINKLER, named for an algorithm that measures the degree of similarity between two strings. The result ranges from 0 (no similarity) to 1 (perfect similarity). It was designed specifically for detecting duplicate records, and its formula seems aimed at the kind of character transpositions you’d expect to encounter in data entry errors. (More info here: Jaro-Winkler distance.)

 

Like EDIT_DISTANCE, it has a related function called JARO_WINKLER_SIMILARITY. Again, this ranges from 0 (no match) to 100 (perfect match). This is the function I will refer to for the rest of this post.
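
To see how the two Jaro-Winkler flavours relate, you can test them on the “Washignton” example from earlier; a quick sketch, again against DUAL:

SELECT
  UTL_MATCH.jaro_winkler('Washignton', 'Washington') AS jw,             -- a value between 0 and 1
  UTL_MATCH.jaro_winkler_similarity('Washignton', 'Washington') AS jws  -- the same measure, scaled 0 to 100
FROM dual;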

 

Here is a simple example of UTL_MATCH in action. The following SQL scores constituents in your database according to how similar their first name is to their last name, with the results sorted in descending order by degree of similarity. (Obviously, you’ll need to replace “schema”, “persons,” and field names with the proper references from your own database.)

 

SELECT
  t1.ID,
  t1.first_name,
  t1.last_name,
  UTL_MATCH.jaro_winkler_similarity(t1.first_name, t1.last_name) AS jw
FROM schema.persons t1
ORDER BY jw DESC;

 

Someone named “Donald MacDonald” would get a fairly high value for JW, while “Kevin MacDonell” would score much lower. “Thomas Thomas” would score a perfect 100.

 

Let’s turn to a more useful case: Finding potential duplicate persons in your database. This entails comparing a person’s full name with the full name of everyone else in the database. To do that, you’ll need a self-join.

 

In the example below, I join the “persons” table to itself. I concatenate first_name and last_name to make a single string for the purpose of matching. In the join conditions, I exclude records that have the same ID, and select records that are a close or perfect match (according to Jaro-Winkler). To do this, I set the match threshold at an arbitrarily high level, in this case greater than or equal to 98.

 

SELECT
  t1.ID,
  t1.first_name,
  t1.last_name,
  t2.ID,
  t2.first_name,
  t2.last_name,
  UTL_MATCH.jaro_winkler_similarity(
    t1.first_name || ' ' || t1.last_name,
    t2.first_name || ' ' || t2.last_name
  ) AS jw
FROM schema.persons t1
INNER JOIN schema.persons t2
  ON t1.ID != t2.ID
 AND UTL_MATCH.jaro_winkler_similarity(
       t1.first_name || ' ' || t1.last_name,
       t2.first_name || ' ' || t2.last_name
     ) >= 98
ORDER BY jw DESC;

 

I would suggest reading this entire post before trying to implement the example above! UTL_MATCH presents some practical issues which limit what you can do. But before I share the bad news, here are some exciting possible Advancement-related applications:

 

  • Detecting duplicate records via address matching.
  • Matching external name lists against your database. (Which would require the external data be loaded into a temporary table in your data warehouse, perhaps.)
  • Screening current and incoming students against prospect, donor, and alumni records for likely matches (on address primarily, then perhaps also last name).
  • Data integrity audits. An example: If the postal code or ZIP is the same, but the city name is similar (but not perfectly similar), then there may be an error in the spelling or capitalization of the city name.
  • Searches on a particular name. If the user isn’t sure about spelling, this might be one way to get suggestions back that are similar to the guessed spelling.

 

Now back to reality … When you run the two code examples above, you will probably find that the first executes relatively quickly, while the second takes a very long time or fails to execute at all. That’s because you’re evaluating each record in the database against every other record. This is what’s known as a cross-join or Cartesian product — a very costly join which is rarely used. If you try to search for matches across 100,000 records, that’s 10 billion evaluations! The length of the strings themselves contributes to the complexity, and therefore the runtime, of each evaluation — but the real issue is the 10,000,000,000 operations.

 

As intriguing as UTL_MATCH is, then, its usage will cause performance issues. I am still in the early days of playing with this, but here are a few things I’ve learned about avoiding problems while using UTL_MATCH.

 

Limit matching records. Trying to compare the entire database with itself is going to get you in trouble. Limit the number of records retrieved for comparison. A query searching for duplicates might focus solely on the records that have been added or modified in the past day or two, for example. Even so, those few records have to be checked against all existing records, so it’s still a big job — consider not checking against records that are marked deceased, that are non-person entities, and so on. Anything to cut down on the number of evaluations the database has to perform.
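
As a sketch, a recent-records version of the earlier self-join might look like the following. Note that date_modified and deceased_flag are hypothetical column names; substitute whatever your own system uses to track modification dates and deceased status.

SELECT
  t1.ID,
  t2.ID,
  UTL_MATCH.jaro_winkler_similarity(
    t1.first_name || ' ' || t1.last_name,
    t2.first_name || ' ' || t2.last_name
  ) AS jw
FROM schema.persons t1
INNER JOIN schema.persons t2
  ON t1.ID != t2.ID
WHERE t1.date_modified >= SYSDATE - 2   -- hypothetical: only records touched in the last two days
  AND t2.deceased_flag = 'N'            -- hypothetical: skip records marked deceased
  AND UTL_MATCH.jaro_winkler_similarity(
        t1.first_name || ' ' || t1.last_name,
        t2.first_name || ' ' || t2.last_name
      ) >= 98
ORDER BY jw DESC;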

 

Keep strings short. Matching works best when working with short strings. Give some thought to what you really want to match on. When comparing address records, it might make sense to limit the comparison to Street Line 1 only, not an entire address string which could be quite lengthy.

 

Pre-screen for perfect matches. A Jaro-Winkler similarity of 100 means that two strings are exactly equal. I haven’t tested this, but I’m guessing that checking for A = B is a lot faster than calculating the JW similarity between A and B. It might make sense to have one query to audit for perfect matches (without the use of UTL_MATCH) and exclude those records from a second query that audits for JW similarities that are high but less than a perfect 100.
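
Sketched out, the two-pass idea might look like this. (I’ve used t1.ID < t2.ID rather than != here, which has the side benefit of listing each pair only once.)

-- Pass 1: audit for perfect matches; no UTL_MATCH needed
SELECT t1.ID, t2.ID
FROM schema.persons t1
INNER JOIN schema.persons t2
  ON t1.ID < t2.ID
 AND t1.first_name || ' ' || t1.last_name =
     t2.first_name || ' ' || t2.last_name;

-- Pass 2: audit for high-but-imperfect similarity, excluding the exact matches up front
SELECT t1.ID, t2.ID,
       UTL_MATCH.jaro_winkler_similarity(
         t1.first_name || ' ' || t1.last_name,
         t2.first_name || ' ' || t2.last_name) AS jw
FROM schema.persons t1
INNER JOIN schema.persons t2
  ON t1.ID < t2.ID
 AND t1.first_name || ' ' || t1.last_name !=
     t2.first_name || ' ' || t2.last_name
WHERE UTL_MATCH.jaro_winkler_similarity(
        t1.first_name || ' ' || t1.last_name,
        t2.first_name || ' ' || t2.last_name) >= 98
ORDER BY jw DESC;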

 

Pre-screen for impossible matches. If a given ID_1 has a street address that is 60 characters long and a given ID_2 has a street address that is only 20 characters long, there is no possibility of a high Jaro-Winkler score and therefore no need to calculate it. Find a way to limit the data set to match before invoking UTL_MATCH, possibly through the use of a WITH clause that limits potential matching pairs by excluding any that differ in length by more than, say, five characters. (Another “pre-match” to use would check whether the initial letter in a name is the same; if it isn’t, there’s a good chance it isn’t going to be a match.)
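
One hedged sketch of that pre-screen, wrapped in a WITH clause; the schema.addresses table and street_1 column are hypothetical stand-ins for wherever your address lines actually live:

WITH pairs AS (
  SELECT t1.ID AS id_1, t1.street_1 AS addr_1,
         t2.ID AS id_2, t2.street_1 AS addr_2
  FROM schema.addresses t1
  INNER JOIN schema.addresses t2
    ON t1.ID < t2.ID
   AND SUBSTR(t1.street_1, 1, 1) = SUBSTR(t2.street_1, 1, 1)  -- same initial character
   AND ABS(LENGTH(t1.street_1) - LENGTH(t2.street_1)) <= 5    -- lengths within five characters
)
SELECT id_1, id_2, addr_1, addr_2,
       UTL_MATCH.jaro_winkler_similarity(addr_1, addr_2) AS jw
FROM pairs
WHERE UTL_MATCH.jaro_winkler_similarity(addr_1, addr_2) >= 90
ORDER BY jw DESC;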

 

Keep match queries simple. Don’t ask for fields other than ID and the fields you’re trying to match on. Yes, it does make sense to bring down birthdate and additional address information so that the user can decide if a probable match is a true duplicate or not, but keep that part of the query separate from the match itself. You can do this by putting the match in a WITH clause, and then left-joining additional data to the results of that clause.
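
A sketch of that pattern: the match lives in the WITH clause, and the supporting detail (here, a hypothetical person_detail table with a birth_date column) is left-joined to the results afterward.

WITH matches AS (
  SELECT t1.ID AS id_1, t2.ID AS id_2,
         UTL_MATCH.jaro_winkler_similarity(
           t1.first_name || ' ' || t1.last_name,
           t2.first_name || ' ' || t2.last_name) AS jw
  FROM schema.persons t1
  INNER JOIN schema.persons t2 ON t1.ID < t2.ID
)
SELECT m.id_1, m.id_2, m.jw, p.birth_date
FROM matches m
LEFT JOIN schema.person_detail p ON p.ID = m.id_2
WHERE m.jw >= 98
ORDER BY m.jw DESC;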

 

Truth be told, I have not yet written a query that does something useful while still executing in a reasonable amount of time, simply due to the sheer number of comparisons being made. I haven’t given up on SQL, but it could be that duplicate detection is better accomplished via a purpose-built script running on a standalone computer that is not making demands on an overburdened database or warehouse (aside from the initial pull of raw data for analysis).

 

The best I’ve done so far is a query that selects address records that were recently modified and matches them against other records in the database. Before it applies Jaro-Winkler, the query severely limits the data by pairing up IDs that have name strings and address strings that are nearly the same number of characters long. The query has generated a few records to investigate and, if necessary, de-dupe — but it takes more than an hour and a half to run.

 

Have any additional tips for making use of UTL_MATCH? I’d love to hear and share. Email me at kevin.macdonell@gmail.com.

 


13 November 2016

Where we go from here

Filed under: Off on a tangent — kevinmacdonell @ 6:17 pm

 

Disbelief, anger, helplessness, anxiety. Does that describe your week just past? It certainly describes mine.

 

Given the nature of this blog, you might expect me to be dismayed at how poorly the number-crunchers fared in forecasting the outcome of this presidential election. But no, I don’t care about that.

 

While Tuesday night’s events were still unfolding on television, and long before any protestors took to the streets, voices of reason were already reminding us not to despair. I held onto three examples of these calm voices, because I figured I would need them. I would like to share them with you.

 

The first came around midnight, when it was starting to dawn on me that things were going to end badly:

 

“When voices of intolerance are loudest don’t be despondent — be emboldened, and even more committed to values of diversity and inclusion.”

 

That was a tweet from Richard Florizone (@DalPres), president of Dalhousie University, where I work. His words seemed too oblique when I first read them, somehow falling short of the righteous outrage called for by the occasion. But with the distance of a few days, when my head was cooler, I appreciated that this message was just right.

 

The second helpful piece of advice was a quote by French philosopher and political activist Simone Weil (1909-1943):

 

“Never react to an evil in such a way as to augment it.”

 

Such a succinct antidote to our instinct for knee-jerk retaliation! This quote came to me from the perennially wonderful Maria Popova (@brainpicker), a Bulgarian writer, blogger, and critic living in Brooklyn, New York. Her blog, BrainPickings.org, features her writing on culture, books, and eclectic subjects.

 

And finally, a simply-worded tweet from fundraising professional Lindsay Brown (@DonorScience) in Boston completed this circle of advice with a call to action:

 

“Now more than ever, it’s apparent to me that the work we do in the nonprofit sector is massively important. Let’s keep up the good work.”

 

This is only a sampling of the many calm and wise words spoken in recent days, but they will suffice. What do these three sentiments, taken together, advise us to do?

 

First, we are reminded that the Trump victory has not nullified the values of diversity and inclusion, nor impeded our ability to promote them. We need to understand why he was elected, and by whom (including millions of former Obama supporters who failed to vote), and to address root causes of political extremism. We need to understand, not denigrate, in order to clarify what we need to do.

 

Second, whatever we do we should avoid making problems worse. Don’t move to Canada! As much as I’d love to have you here (in the unlikely event that Canada enables such immigration), please know that your country needs you now more than ever. For those outside the U.S. who feel like disengaging from that country via a boycott (which was my own initial response), please reflect on the consequences of feeding isolationism. And rioting in the streets against the outcome of a free and fair election can have no legitimate result. During the campaign, President Obama repeated the refrain, “Don’t boo — Vote!” Today we can say, “Don’t boo — Act!”

 

Third and finally: Never doubt that our sector is a vital player in creating a better world, despite not being directly “political”. Higher education and a host of nonprofits can build up and defend what Trumpism wants to tear down, and can help create diverse societies to combat the irrational fear of the Other that helps elect leaders like Trump in the first place.

 

The bad news is perfectly clear: that a radicalized faction of white extremism has just elected a dangerous, unpredictable leader animated by ethnic nationalism and xenophobia; that a nation that could have made history by electing its first woman president instead chose a man who abused and denigrated women and boasted about it; that a nostalgia for a bygone decade before civil rights has accompanied an irrational belief that advancement of ethnic minorities threatens the white, working-class status quo; that a country with international commitments to fight climate change has just elected a leader who doesn’t even believe climate change is a real thing.

 

This sudden clarity — this stunning proof that we have not made nearly as much progress as we thought — should be strong motivation not to despair but to get right to work.

 

I don’t have a prescription for what anyone needs to do. It depends on where you are, what tools you have to work with.

 

Do we have work to do at home? I’m willing to bet your daughters are prepared to take on a sexist world, but what are you telling your sons so that they will help create a new world?

 

What can we do in our neighbourhoods? Can diverse communities be brought together to interact? Can we replace mere proximity to the Other, which leads to tension and irrational suspicion, with familiarity and interdependence?

 

What causes and projects can we support with our dollars, our time, and our expertise to increase the ability for marginalized people to participate in the economy, to protect the environment, to support reputable journalism, to extend access to education, to promote people’s rights, to fight cynicism about politics and government?

 

There is so much — no one can do it all. I am still thinking about my own “what now?” list, and I know I have to choose wisely. But like voting itself, it is the accumulation of millions of individual actions that leads to dramatic overall results. Let’s agree that it is no longer enough to hold certain opinions, no longer enough to share the right memes on Facebook, no longer enough even to believe that our duty stops with voting and paying taxes.

 

As Hillary Clinton said the day after the election, “… our Constitutional democracy demands our participation. Not just every four years, but all the time. So let’s do all we can to keep advancing the causes and values we all hold dear. Making our economy work for everyone — not just those at the top. Protecting our country and protecting our planet. And breaking down all the barriers that hold any American back from achieving their dreams.”

 

These words can apply just as well to citizens of the United Kingdom, where far-right xenophobia prevailed in the Brexit vote, and to citizens of Canada, where extremist politicians are already talking about emulating Trump, and to people anywhere else in the world who are free to speak and act.

 

Disbelief, anger, helplessness, anxiety. Yes, there’s a time for all of those things. But let’s not subside into resignation, division, hopelessness, and cynicism. Instead let’s each of us look at our immediate surroundings and figure out what we can do. And then, roll up our sleeves and get to work.

 

3 October 2016

Grad class size: predictive of giving, but a reality check, too

 

The idea came up in a conversation recently: Certain decades, it seems, produced graduates who have reduced levels of alumni engagement and lower participation rates in the Annual Fund. Can we hope they will start giving when they get older, like alumni who have gone before? Or is this depressed engagement a product of their student experience — a more or less permanent condition that will keep them from ever volunteering or giving?

 

The answer is not perfectly clear, but what I have found with a bit of analysis can only add to the concern we all have about the end of “business as usual.”

 

For almost all universities, enrolments have risen dramatically over the decades since the end of the Second World War. As undergraduate class sizes ballooned, metrics such as the student-professor ratio emerged as important indicators of quality of education. It occurred to me to calculate the size of each grad-year cohort and include it as a variable in predictive models. For a student who graduated in 1930, that figure could be 500. For someone who graduated in 1995, it might be 3,000. (If you do this, remember not to exclude now-deceased alumni from your count.) A rough generalization about the conditions under which a person received their degree, to be sure, but it was easy to query the database for this, and easy to test.
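
If your warehouse stores one row per degree, the cohort count is a simple aggregation joined back to each alum. A sketch, with schema.degrees, grad_year, and faculty as hypothetical names:

SELECT d.ID,
       cohort.class_size
FROM schema.degrees d
INNER JOIN (
  SELECT grad_year,
         COUNT(DISTINCT ID) AS class_size   -- count everyone, including deceased alumni
  FROM schema.degrees
  GROUP BY grad_year
) cohort ON cohort.grad_year = d.grad_year;

For the faculty-level variant discussed further down, group by grad_year and faculty instead.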

 

I pulled lifetime giving for 130,000 living alumni and log-transformed it before checking for a correlation with the size of graduating class. (The transformation being log of “lifetime giving plus 1.”) It turned out that lifetime giving has a strong inverse correlation with the size of an alum’s grad class, for that alum’s most recent degree. (r = -0.338)
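
Oracle’s CORR aggregate computes Pearson’s r right in the database, by the way, if you’d rather not export the data first. A sketch, assuming a hypothetical one-row-per-alum view called schema.alumni_summary (I’ve used the natural log here; any log base is just a linear rescaling, so the correlation comes out the same):

SELECT
  CORR(LN(lifetime_giving + 1), grad_class_size) AS r_ltg_class,
  CORR(LN(lifetime_giving + 1), age)             AS r_ltg_age
FROM schema.alumni_summary;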

 

This is not surprising. The larger the graduating class, the younger the alum. Nothing is as strongly correlated with lifetime giving as age, therefore much of the effect I was seeing was probably due to age. (The Pearson correlation of LTG and age was 0.395.)

 

Indeed, in a multiple linear regression of lifetime giving (log-transformed) on age, adding “grad-class size” as a predictor variable does not improve model fit. The two predictors are not independent of each other: For age and grad-class size, r = -0.828!

 

I wasn’t ready to give up on the idea, though. I considered my own graduation from university, and all the convocations I had attended in the past as an Advancement employee or a family member of a graduate. The room (or arena, as the case may be) was full of grads from a whole host of degree programs, most of whom had never met each other or attended any class in common. Enrolment growth has been far from even across faculties (or colleges or schools); the student experience in terms of class size and one-on-one access to professors probably differs greatly from program to program. At most universities, Arts or Science faculties have exploded in size, while Medicine or Law have probably not.

 

With that in mind, I calculated grad-class size differently, counting the size of each alum’s graduating cohort at the faculty (college) level. The correlation of this more granular count of grads with lifetime giving was not as negative (r = -0.283), but at the same time, it was less tied to age.

 

This time, when I regressed lifetime giving on age and then added grad-class size at the faculty level, both predictors were significant. Grad-class size gave a good boost to adjusted R squared.

 

I seemed to be on to something, so I pushed it farther. Knowing that an undergrad’s experience is very different from that of a graduate student, I added “Number of Degrees” as a variable after age, and before grad-class size. All three predictors were significant and all led to improvements in model fit.

 

Still on the trail of how class size might affect student experience, and alumni affinity and giving thereafter, I got more specific in my query, counting the number of graduates in each alum’s year of graduation and degree program. This variable was even less conflated with age, but despite that, it failed to provide any additional explanation for the variation in lifetime giving. There may be other forms of counts that are more predictive, but the best I found was size of grad class at the faculty/college level.

 

If I were asked to speculate about the underlying cause, the narrative I’d come up with is that enrolments grew dramatically not only because there were more young people, but because universities in North America were attracting students who increasingly felt that a university degree was a rite of passage required for success in the job market. The relationship of student to university was changing, from that of a close-knit club of scholars, many of whom felt immensely grateful for the opportunity, to a much larger, less cohesive population with a more transactional view of their relationship with alma mater.

 

That attitude (“I paid x dollars for my piece of paper and so our business here is done”), and not so much the increasing numbers of students they shared the lecture halls with, could account for drops in philanthropic support. What that means for Annual Fund is that we can’t bank on the likelihood that a majority of alumni will become nostalgic when they reach the magic age of 50 or 60 and open their wallets as a consequence. Everything’s different now.

 

I don’t imagine this is news to anyone who’s been paying attention. But it’s interesting to see how this reality is reflected in the data. And it’s in the data that we will be able to find the alumni for whom university was not just a transaction. Our task today is not just to identify that valuable minority, but to understand them, communicate with them intelligently, connect with their interests and passions, and engage them in meaningful interactions with the institution.

 

31 August 2016

Phonathon call attempt limits: A reading roundup

Filed under: Annual Giving, Best practices, Phonathon — kevinmacdonell @ 2:49 pm

 

As September arrives, Annual Fund programs everywhere are gearing up for mailing and calling. Managers of phone programs are seeking advice on how best to proceed, and inevitably that includes asking about the optimal number of call attempts to make for each alum.

 

How many calls is too many? What’s ideal? Should it differ for LYBUNTs and SYBUNTs?

 

In my opinion, these are the wrong questions.

 

If your aim is to get someone on the phone, more calling is better. However, by “call more” I don’t mean call more people. I mean make more calls per prospect. The RIGHT prospects. Call the right people, and eventually many or most of them will pick up the phone. Call the wrong people, and you can ring them up 20, 30, 50 times and you won’t make a dent. That’s why I think there’s no reason to set a maximum number of call attempts. If you’re calling the right people, then just keep calling.

 

For Phonathon programs that are expensive or time-consuming (and potentially under threat of being cut), and shops with some ability to make decisions informed by data, it doesn’t make sense to apply across-the-board limits. Much better to use predictive modeling to determine who’s most likely to pick up the phone, and focus resources on those people.

 

Here are a number of pieces I’ve written or co-written on this topic:

 

Keep the phones ringing – but not all of them

 

Call attempt limits? You need propensity scores

 

How many times to keep calling?

 

Answering questions about “How many times to keep calling”

 

Final thoughts on Phonathon donor acquisition

 

2 August 2016

Data Down Under, and the real reason we measure alumni engagement

Filed under: Alumni, Dalhousie University, engagement, Training / Professional Development — kevinmacdonell @ 4:00 pm

 

I’ve given presentations here and there around Canada and the U.S., but I’ve never travelled THIS far. On Aug. 24, I will present a workshop in Sydney, Australia — a one-day master class for CASE Asia-Pacific on using data to measure alumni engagement. My wife and I will be taking some time to see some of that beautiful country, leaving in just a few days.

 

The workshop attendees will be alumni relations professionals from institutions large and small, and in the interest of keeping the audience’s needs in mind, I hope to convince them that measuring engagement is worth doing by talking about what’s in it for them.

 

This will be the easy part. Figuring out how to quantify engagement will allow them to demonstrate the value of their teams’ activity to the university, using language their senior leadership understands. Scoring can also help alumni teams better target segments based on varying levels of engagement, evaluate current alumni programming, and focus on activities that yield the greatest boost in engagement.

 

There is a related but larger context for this discussion, however. I am not certain that everyone will be keen to hear about it.

 

Here’s the situation. Everything in alumni relations is changing. Alumni populations are growing, the number of donors is decreasing, and traditional engagement methods are less effective. Friend-raising and “one size fits all” approaches to engagement are increasingly seen as unsustainable wastes of resources. (A Washington, DC-based consultancy, the Education Advisory Board, makes this point very well in this excerpt of a report, which you can download here: The Strategic Alumni Relations Enterprise.)

 

I don’t know so much about the Asia-Pacific region, but in North America university leaders are questioning the very purpose and value of typical alumni relations activities. In this scenario, engagement measurement is intended for more than producing a merely informational report or having something to brag about: Engagement measurement is really a tool that enables alumni relations to better align itself with the Advancement mission.

 

In place of “one size fits all,” alumni relations teams are under pressure to understand how to interact with alumni at different levels of engagement. Alumni who are somewhat engaged should be targeted with relevant programs and messages to bring them to the next level, while alumni who are at the lowest levels of engagement should not have significant resources directed at them.

 

Alumni at high levels of engagement, however, require special and customized treatment. They’re looking for deeper and more fulfilling experiences that involve furthering the mission of the institution itself. Think of guest lecturing, student recruitment, advisory board roles, and mentorship, career development and networking for students and new grads. Low-impact activities such as pub nights and other social events are a waste of the potential of this group and will fail to move them to continue contributing their time and money.

 

Think of what providing these quality experiences will entail. For one, alumni relations staff will have to collaborate with their colleagues in development, as well as in other offices across campus — enrolment management, career services, and academic offices. This will be a new thing, and perhaps not an easy thing, for alumni relations teams stuck in traditional friend-raising mode and working in isolation.

 

But it’s exactly through these strategic partnerships that alumni relations can prove its value to the whole institution and attract additional resources even in an environment where leaders are demanding to know the ROI of everything.

 

Along with better integration, a key element of this evolution will be robust engagement scoring. According to research conducted by the Education Advisory Board, alumni relations does the poorest job of any office on campus in providing hard data on its real contribution to the university’s mission. Too many of us are still stuck on tracking our activities instead of the results of those activities.

 

It doesn’t have to be that way, if the alumni team can effectively partner with other units in Advancement. For those of us on the data, reporting, and analysis side of the house, get ready: The alumni team is coming.

 

5 July 2016

A simple score you can probably build in Excel

Filed under: Excel, Peter Wylie, Predictive scores — kevinmacdonell @ 4:22 pm

Guest post by Peter B. Wylie

 

In the evolving world of analysis for higher ed and non-profits, it’s apparent that a gap is widening: Many well-resourced shops are acquiring analytics talent comfortable with statistics and programming, but many others are unable to make investments in specialized talent.

 

Today’s guest post is a paper by Peter Wylie that addresses the latter group, the ones at risk of being left behind. Download his paper here: Simple_Score_in_Excel_Wylie

 

In this piece he uses data from two schools to show you something you can try with your own data, building a very simple predictive score using nothing but Excel.

 

Data analysis ought to be accessible at some level to every organization, regardless of technical proficiency or tools. And in fact, shops that move too quickly to automate predictive scoring with black-box-like methods risk passing over the insights available to the exploratory analyst using more manual, time-consuming methods.

 

We hope you enjoy, and above all, that you try this with your own data. The download link again: Simple_Score_in_Excel_Wylie

 
