CoolData blog

3 October 2017

Our “data-informed decision making” journey

Filed under: Analytics, Business Intelligence, Dalhousie University — kevinmacdonell @ 6:34 pm

 

Building a business intelligence and analytics program can take years, and the move toward data-informed decision making is a cultural evolution that might never be complete. In my previous blog post, I talked about what advancement BI looks like in its ideal state (Analytics as an organizing principle). Today I want to talk about the messy reality.

 

Looking back at our own journey at Dalhousie University, I realize that we didn’t pursue the most direct and well-lit path, but we did learn a lot along the way. Eight years ago or so, we had very limited capability for supporting decisions with data. We still haven’t “arrived” — there is plenty more to do — but our progress is worth looking back on. It’s this progress I’ve been recounting for audiences across the country lately; it seems everyone is attempting to plan their own journey, or at least compare notes.

 

Here I’ll recount a few of the steps that got us to where we are today, starting with some of the obvious ingredients for a successful BI program — quality data, good software tools, and so on — and then talk about some of the perhaps less obvious influences that were essential for driving us forward.

 

First of all, DATA: Years ago, the general perception in our office was that our data was in bad shape. Our coverage rate for contact and employment information was believed to be low, and the accuracy of the data was frequently called into question, based largely on errors spotted in lists. But beyond those anecdotes, we had no objective picture of our data’s actual state.

 

We developed reports to get a handle on coverage rates, along with the ability to automatically archive those data points so we could track progress and gaps over time. With our alumni constituency alone growing by several thousand individuals a year, we stepped away from imagining that we could find every lost alum, and instead used a score to prioritize who to trace first. More importantly, the purely clerical “Alumni Records” team was reinvented as the “Constituent Data Integrity” team, tasked with going beyond data entry: developing and acting on data integrity audits, leading a large, cross-functional Data Integrity group to discuss integrity issues, and working much more closely with Prospect Research and Alumni Engagement to provide better support. We have also worked with frontline staff to encourage them to think of Advancement data as something they “own” and benefit from directly, with a responsibility to feed information and intelligence back to records and research staff.

 

We also made a concerted effort to establish written definitions for fundraising terminology and drafted a standardized set of counting rules, agreed to and approved by our leadership. Embedded in a single, core reporting view from which all reports and analyses are derived, these rules enforce a single version of the truth across all reports and dashboards. This work is not complete, but well advanced, and starting with Development data was a good idea.
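
To give a sense of what embedding counting rules in a single core reporting view can look like, here is a minimal sketch of such a view. The table and column names (gifts, transaction_type, and so on) and the rules themselves are hypothetical stand-ins, not our actual schema or definitions.

-- hypothetical names and rules, for illustration only
CREATE OR REPLACE VIEW advancement.core_giving AS
SELECT
    g.constituent_id,
    g.fiscal_year,
    g.gift_amount,
    -- counting rule: new pledges and outright gifts count as commitments; pledge payments do not
    CASE
        WHEN g.transaction_type IN ('PLEDGE', 'OUTRIGHT GIFT') THEN g.gift_amount
        ELSE 0
    END AS commitment_amount,
    -- counting rule: cash received counts pledge payments and outright gifts, not new pledges
    CASE
        WHEN g.transaction_type IN ('PLEDGE PAYMENT', 'OUTRIGHT GIFT') THEN g.gift_amount
        ELSE 0
    END AS cash_amount
FROM advancement.gifts g
WHERE g.status = 'ACTIVE'

Because every report and dashboard selects from the same view, a change to a counting rule is made once and flows everywhere.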

 

A second enabler was the development of a three-year strategic plan for Advancement Services, something we’d never had. The plan charted a way forward for the team and became the foundation for much-needed investments in personnel. Without a doubt, the most important element in a successful BI program is people — hiring well has been the biggest driver of momentum for us — but resources won’t be made available in the absence of a plan and a roadmap for the future. Our plan did not necessarily lay out everything we intended to do with reporting, BI, and analytics — we didn’t know what the ideal team and technology would look like — but we were able to clearly articulate our gaps and what we needed to help us bridge those gaps.

 

Developing the plan required a commitment to change. This was a big step, because our team had gotten adept at concealing our limitations. For example, whenever leadership and deans needed an update on fundraising progress against campaign priorities, someone would take the raw data home each night and crunch it manually in Excel. People were getting the information they wanted, so why change? But in fact we had zero agility, and reporting was never going to grow beyond the basics. The fact that our AVP of Development felt forced to author his own reports should have been a wake-up call.

 

With qualified people making smart decisions, we invested considerable time in adopting Tableau as a reporting tool to bridge a years-long period of uncertainty about a centrally supported BI tool. Over the years we evolved from senior staff being served pots of raw data and having to fend for themselves in Excel, to having our standard Development reporting automated in Tableau, with progress being made on reporting for other units. At the same time, we hired a BI Analyst to perform more ad hoc analytical and predictive modelling work. We are now hiring two additional BI analysts, each with a more specialized role.

 

Greater demand for more sophisticated reports, dashboards, and analyses meant a greater need for complex transformations of our raw transactional data. We therefore put some emphasis on hiring people who knew SQL or could learn it. My colleague Darrell Rhodenizer puts it this way: Being able to use reporting tools such as Tableau Desktop or Cognos Reporting is one thing, but being able to directly speak the language of our database enables us to use all sorts of tricks to better shape our data for the reporting environment. Other departments that have not invested in the ability to look under the hood seem to be at a disadvantage.

 

As a result, our team has taken over from central IT the primary responsibility for modelling our data — that is, assembling our database tables into complex data structures to serve reporting and analysis. This works well for Advancement, which at most universities is far down the list of departments in terms of central IT support, and often has frequently-changing needs as priorities shift and campaigns roll through.

 

It’s gone beyond just learning SQL. Darrell, as our Associate Director, Advancement Systems & Reporting, has developed a new ETL tool which has accelerated our progress and promises to change the game for years to come. Our unit’s data is extracted nightly from the university’s centrally-managed data warehouse and multiple transformations are applied to it before it is re-stored to the same data warehouse. Under the full control of Advancement, the transformed data is available to all the same users using whatever tool they have. Data model changes are made with agility and with minimal disruption to business.
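
As an illustration only (this is not Darrell’s actual tool, and the table and column names are invented), the nightly transform-and-re-store pattern might reduce to something like this for a single derived table:

-- hypothetical example: rebuild a derived giving summary inside the warehouse each night
TRUNCATE TABLE advancement.dw_giving_summary;

INSERT INTO advancement.dw_giving_summary (constituent_id, fiscal_year, total_giving, gift_count)
SELECT
    g.constituent_id,
    g.fiscal_year,
    SUM(g.gift_amount),
    COUNT(*)
FROM warehouse.gifts g
GROUP BY g.constituent_id, g.fiscal_year;

COMMIT;

Reporting tools and analysts then read the derived table rather than re-deriving the same aggregates from raw transactional data.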

 

One final enabler: Outside Advancement, a new attitude to working cooperatively across departments and a new appreciation of data as an institutional asset has led to development of a data governance model and policies for opening up access to data. Before, if data was shared at all, it was done haphazardly and insecurely via Excel files. Today, we have a process for responsible use of data across the institution.

 

These elements of progress — technology, tools, people, skills — had a combined effect that was more than additive. We achieved an increasing momentum over the years, such that newer staff members struggle to imagine how bad things used to be in “them days.”

 

These and other factors were important enablers of change. Without some of them, we could not have made the improvements we did. However, they were not sufficient in themselves to drive change. I suspect we are too often prone to falsely equate analytics competence with a piece of software, or an employee with a certain title, or a team, when really it’s none of those things. We would not have hired key people, and we would not have sought out and effectively deployed new tools, had there not been forces driving us in that direction.

 

Internally, we faced increased demand from Advancement leadership for information and insight. The closing of a comprehensive campaign was very revealing of our gaps in reporting and analysis — and the eventual ramp-up to another campaign spurs us to ensure that we are ready.

 

As well, for some years now a new culture of strategic planning has taken hold, with the development and adoption of an Advancement Balanced Scorecard. This plan for the whole department has had a focusing and integrative effect — everyone sees how functions fit together, and how their own job supports the mission. As great as that is for Development or Marketing or Alumni Engagement, it’s been essential for Operations. We now have a vision for what priorities we will need to support into the future, and a chunk of that support consists of data, information, reporting, dashboards, analyses, and other analytical products — not to mention the development of KPIs directly tied to measuring Advancement’s progress against the goals and objectives of the Balanced Scorecard itself. To date, high-level strategic planning has been the most significant “focusing” factor for our BI work.

 

You may have noticed that these and other internal drivers of change all come from the top, whereas the “enablers” tended to rely on initiative from lower down in the organization. Again, without both, not much would have happened.

 

But some drivers of a culture of analytics aren’t coming from the organization itself at all. We’re growing increasingly aware of external drivers. There are some new realities out there, and the organizations that position their data teams to address these new realities will have a better chance of succeeding.

 

First, alumni and donors have a different relationship with institutions than they once did, and their expectations are different. Alumni populations are growing, the number of donors is decreasing, and traditional engagement methods are less effective. Friend-raising and “one size fits all” approaches to engagement are increasingly seen as unsustainable wastes of resources. University leaders are questioning the very purpose and value of typical alumni relations activities.

 

According to current wisdom, engaged alumni are seeking meaningful interactions that make a difference, especially interactions with students in the form of advice, mentorship, or career development. If they have anything to do with the institution itself, it’s less about nostalgia for student life than it is about being part of the university’s role in society and community. Barbecues and pub nights hold little appeal for truly engaged alumni who believe in your brand of higher education (or your cause), and believe in the power of your students to change the world for the better. They want to be part of the mission.

 

Donors, too, are looking for meaningful engagement. Through their giving they want to accomplish things in the world. If they’re giving to your institution, it is because they feel your institution is uniquely qualified to carry out the change they’re seeking. Society’s needs, not the institution’s needs, are of greatest importance to this donor. They are not interested in “giving back.” Instead of giving TO institutions, they give THROUGH institutions.

 

This is partly borne out in what many of our organizations are seeing happening in our Annual Fund: for years now, donor numbers have been trending down, while average gift size has been going up. Donors are being more strategic with their giving, pooling resources and being more deliberate with their dollars.

 

These global shifts are not new, but I don’t think their real impact on the sector has yet been fully realized. Certainly for many of us, our strategies are not keeping pace. Analytics is going to be increasingly important for responding to these global shifts. A few examples follow …

 

In order to move from one-size-fits-all messages and programs, and evolve toward more targeted, relevant opportunities to engage, we need to understand how engaged each individual is right now. So we, along with many other institutions, have developed a means to measure alumni engagement. Every alumnus and alumna has a score that reflects where on the engagement spectrum they are, just as we know where on the donor spectrum they are. With those two pieces of information we can invest more time and money developing opportunities aimed at the upper niche of engaged individuals where it will have the most impact. (See: Why we measure engagement.) We need to engage with them on their own level, not ours, via relevant events and volunteerism. What information, programs, and services do they need, and which connect with their interests and talents?
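
Our engagement measure is more involved than this, but as a rough sketch of the idea, a score can start as a weighted sum of recent activity. The tables, columns, and weights below are purely illustrative, not our actual model.

-- hypothetical tables and weights, for illustration only
SELECT
    a.constituent_id,
    -- illustrative weights: volunteering > event attendance > email engagement, over the past two years
    ( 2 * COUNT(DISTINCT e.event_id)
    + 5 * COUNT(DISTINCT v.activity_id)
    + 1 * COUNT(DISTINCT m.message_id) ) AS engagement_score
FROM advancement.alumni a
LEFT JOIN advancement.event_attendance e
    ON e.constituent_id = a.constituent_id AND e.event_date >= ADD_MONTHS(SYSDATE, -24)
LEFT JOIN advancement.volunteer_activity v
    ON v.constituent_id = a.constituent_id AND v.activity_date >= ADD_MONTHS(SYSDATE, -24)
LEFT JOIN advancement.email_clicks m
    ON m.constituent_id = a.constituent_id AND m.click_date >= ADD_MONTHS(SYSDATE, -24)
GROUP BY a.constituent_id
ORDER BY engagement_score DESC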

 

In place of “one size fits all,” engaged alumni need more fulfilling experiences such as guest lecturing, student recruitment, and mentorship, career development, and networking for students and new grads. Engagement measurement, then, is really a tool that enables alumni relations to better align itself with the mission of Advancement and the university.

 

Second, we aspire to understand our constituents not just based on their degree or by how much they’ve given, but through their interests and values — data we are just starting to bring together from a variety of sources in order to inform more intelligent segmentation of alumni and donors.

 

Third, we are doing what we can to measure impact of programming and events. We might report that we had 100 events that attracted 10,000 attendees, but why stop there? We should also be able to say we moved 2,000 people, say, to the next level of engagement, or that this or that event inspired 50 people to give. According to research conducted by the Education Advisory Board, a consulting firm, alumni relations does the poorest job of any office on campus in providing hard data on its real contribution to the university’s mission. Too many offices are stuck on tracking activities instead of results and outcomes.
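
Measuring outcomes rather than activities can start small. Here is a hedged sketch, assuming a hypothetical table of engagement tiers snapshotted before and after an event season; none of these names come from our actual warehouse.

-- hypothetical snapshot and attendance tables, for illustration only
SELECT COUNT(DISTINCT pre.constituent_id) AS moved_up
FROM advancement.engagement_snapshot pre
JOIN advancement.engagement_snapshot post
    ON post.constituent_id = pre.constituent_id
JOIN advancement.event_attendance e
    ON e.constituent_id = pre.constituent_id
WHERE pre.snapshot_date = DATE '2017-01-01'
    AND post.snapshot_date = DATE '2017-07-01'
    AND post.engagement_tier > pre.engagement_tier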

 

Wonderful as these examples sound, and as far as we’ve come, we haven’t done everything right. There are areas where I wish we had made more progress, and things I discovered along the way that I wish I’d thought of earlier.

 

We’ve never had a long-range plan for the BI/analytics team. Yes, BI was a component of our three-year strategic plan and we have yearly operational plans, but there was no overarching vision of what the team would finally look like, along the lines of the three-tiered structure I outlined in my previous post. Our growth has been organic, addressing the gaps as we saw them from year to year. Perhaps that’s the right way to grow, especially as employees themselves grow and discover new strengths, but I think in a perfect world we might have had an idea of what the ideal future state would look like.

 

More fundamentally, a major all-at-once investment in rapid growth absolutely requires a plan. The way we did it, each new person who came on had to be somewhat self-sufficient in provisioning themselves with data to analyze, being responsible for transforming it and so on. That’s not the way it is now – as we evolve, positions are becoming more specialized.

 

Second, in hindsight I would have given more thought to how data-informed decisions are made. I mentioned earlier that the Balanced Scorecard exercise for Advancement has provided a main focus for BI, but I can see that it’s not enough. There has to be a framework for prioritizing and directing data-informed decision making below the level addressed by the Scorecard. (I wrote about this in my previous post.) I could have spent some time earlier on thinking about the structure and processes to make that happen.

 

A third thing I wish we had devoted more brainpower to was tackling self-serve list generation. Automating the generation of contact lists for event invitations, solicitations, and so on is surprisingly challenging for a whole host of reasons, and that has prevented us from putting the ability into the hands of users. Had we cracked that nut early on in the journey, we would have freed resources for more interesting work. More generally, “self-serve” is a cultural shift that takes a lot of time, training, and reinforcement. Even if we had developed a good tool for users to pull their own ready-to-use lists, it would have taken a long time to get people to use it (regardless of what they might say about the idea of it). If you’re considering a big push for self-serve, I would warn you that the payoff will come years, not months, from now.

 

Data-informed decision making in general is a cultural shift; it’s not just a series of technical problems to be solved. Nothing will happen without the technology, to be sure, but the technology enables — it does not drive. You can invest heavily in a BI team and software and still not achieve a state of making decisions informed by data.

 

When I talk about how poorly we did some years ago, that’s not intended as a critique of the people doing the work at that time. Everyone always did the best they could with what they had to work with. In the same way, when I speak with folks from other universities who are struggling with how to make progress in this area, it’s not a lack of will or even skill that I detect: It’s more a lack of clarity about the way forward. It’s rarely obvious how to pull the pieces and people together, but with progress comes momentum. I wish you luck on your own journey!

 


10 July 2017

Analytics as an organizing principle

Filed under: Analytics, Business Intelligence — kevinmacdonell @ 7:51 am

 

I’ve been thinking a lot lately about how an organization gets good at making decisions informed by data. Or, in other words, how to build business intelligence and analytics teams. This preoccupation started with a talk I gave a couple of months ago to a gathering of Advancement leaders from across Canada. I was asked to talk about analytics in general and how our department in particular got to where we are today. Since then, I’ve also spoken to folks from other universities on the same topic.

 

All this talking has been helpful for me in organizing my thoughts, and I’ve come to realize a number of things in retrospect, ways in which we might have evolved more quickly. One of these is a realization about what it means to make data and analytics an “organizing principle.”

 

For my talk in May I was asked to begin with an overview of analytics, so I’ll devote this post to that topic. In a future post, I will share what we learned on our journey.

 

Because analytics is an ever-evolving field, I avoid dictionary-like definitions for analytics. I find it more helpful to talk about what analytics “looks like” in terms of the types of work it consists of, the skill sets of the people doing the work, and the organizational structure of the team (if it’s a team).

 

In my mind, these concepts have resolved into a “triad of threes” … The work itself fits into three tiers, the ideal analytics practitioner is a “triple threat”, and the team is made up of three distinct teams or functions. (If what I’m presenting here is an oversimplification, at least it’s a structurally satisfying one.) What I’m talking about is fairly conventional — I’m not inventing anything — but it’s supported by my own experience.

 

First, the work itself. Analytics practice today works at three distinct levels: Descriptive, predictive, and prescriptive.

 

Descriptive analytics serves the business with information, specifically information about the past, which helps us understand current performance in relation to the past. It attempts to answer the questions, “How have we done?” and “How are we doing now?” This is the realm of reporting and a lot of what is referred to as Business Intelligence. Although this is a starting point for any analytics program, that doesn’t mean it’s easy or that it doesn’t have aspects that are advanced. KPI development, support for performance management, and ad hoc data analyses to answer specific business questions might be included in this tier.

 

Predictive analytics is about predicting the future. Not “the future” in general, but the behaviour of individuals. Predictive modelling is a set of techniques for ranking individuals by their likelihood to engage in some behaviour of interest (making a bequest, becoming a donor, attending an event, etc.). The business goal might be prospect identification, or focusing limited resources to save time or money.

 

And finally, prescriptive analytics provides advice on what action to take to influence a behaviour of interest. While predictive analytics gives us an idea of who’s more likely to, say, sign up for a high-end credit card from a financial institution, prescriptive analytics suggests the types of interventions (targeted advertisements, for example) that would inspire a customer to actually do it.

 

Prescriptive analytics is the newest type of analytics and the most advanced — I don’t think it’s the same as A/B testing found in direct marketing — and still rare in the nonprofit and advancement sector. I’m using an example from the financial services industry for a reason: my team is just beginning to explore this type of work, and I’m not aware of anyone else doing it. (If you’re reading this in a year or two from now, the situation might be different.)

 

If your organization is doing a good job on reporting, business intelligence, predictive modelling, and maybe some forecasting as well — then you’re most likely doing very well in comparison with your peer institutions in terms of function.

 

So much for the work. What about the people?

 

There is a popular notion of what the ideal analytics practitioner looks like in terms of education, work experience, and skills. That person, who might be styled a Data Scientist, is what I have called a “triple threat” — he or she has extensive domain expertise (fundraising, engagement, and/or marketing), a background in computer science (adept at writing scripts in SQL, R, Python or other language to extract and transform data for analysis and advanced modelling), and mathematics (with an array of advanced statistical methods in his or her toolbox).

 

The problem is, such professionals are both rare and in high demand. You won’t find many of these folks working in our sector — at least not for very long. Their natural habitat is more likely to feature Big Data, not the “little data” we’ve got, and machine learning, rather than our old standbys such as multiple linear regression. I have already elaborated on these points in the blog post I link to above, Mind the data science gap. Suffice to say, we do not currently aspire to hire data scientists.

 

That doesn’t mean the ideal isn’t a useful model, however. When we hire, it makes sense to single out candidates with skills in one of the three areas who also seem to have some aptitude for picking up skills in the complementary areas. The strategy here is not to hire a data scientist, but to grow a reasonable facsimile of one. If you’ve got an employee who has some subject-matter knowledge, has a penchant for teaching herself technical skills (on her own time, perhaps), is curious and eager to dive into the data, and is a good communicator — such a person will add a lot of value in a BI role.

 

You can have the right people doing the right work, but they need to work in an organizational structure that promotes data-informed decision making. So, the third and final aspect: The organizational structure. There is no one perfect structure, but keeping with the theme of “three,” I think that a three-tier setup makes sense. In a large organization, each tier might be a team. In a smaller organization, each tier might be one person. (If one person is responsible for everything, this “structure” can be thought of as a way to organize or compartmentalize one’s own work.)

 

The first and foundational tier is the Technical Team, consisting of Advancement staff who might be responsible for building and/or maintaining a data warehouse dedicated to Advancement needs, building and maintaining materialized views and data models for use in BI software, developing complex reports and dashboards, integrating internal and external systems and platforms so that data from disparate systems can be merged or federated, and liaising with central IT.
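
For the materialized-view piece specifically, a minimal Oracle sketch might look like the following; the source tables, columns, and nightly refresh schedule are assumptions for illustration, not a description of our warehouse.

-- hypothetical flattened constituent dimension for BI tools, refreshed nightly
CREATE MATERIALIZED VIEW advancement.mv_constituent_dim
BUILD IMMEDIATE
REFRESH COMPLETE START WITH SYSDATE NEXT SYSDATE + 1
AS
SELECT
    p.constituent_id,
    p.first_name,
    p.last_name,
    d.degree_code,
    d.class_year,
    a.city,
    a.postal_code
FROM advancement.persons p
LEFT JOIN advancement.degrees d ON d.constituent_id = p.constituent_id AND d.is_primary = 'Y'
LEFT JOIN advancement.addresses a ON a.constituent_id = p.constituent_id AND a.is_preferred = 'Y'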

 

This tier sounds very “IT”, but it’s important to recognize that it is distinct from the institution’s centralized IT department, which is responsible for maintaining hardware, servers, and the core database software itself, as well as managing the network and security.

 

So you’re not trying to replicate an IT shop, but you are building a team with specific technical skills. For any higher ed institution in which departments are not supported equally by central IT, having in-house expertise to integrate systems and develop data models tailored to business needs is definitely a key to success. Someone has to supply and support the data infrastructure if central IT is too overtaxed to provide it.

 

The next team is the Analysis Team, the people who build predictive models, define KPIs, do ad hoc analyses, and so on. This team (or person) benefits directly from the work of the Technical Team, freed from having to always extract and transform their own data. While analysis often implies exploration of the raw, unaggregated data, there’s a huge payoff in having a lot of the standard transformations (tedious and repetitive) pushed to the data warehouse level. Analysts add the most value when they’re interacting with clients to define business questions and present results, not struggling yet again with raw, transactional data that could be processed more efficiently and accurately with an ETL tool.

 

In my own workplace, the distinction between these two teams is something of an oversimplification, but it’s roughly analogous.

 

The third team is harder to define, as it may take various forms, depending on the organization. I’ve seen it referred to as the Executive Team, but a better name might be the Analytic Strategy Team or the BI Decision Team. We don’t have a name for it in my workplace, because our department doesn’t have such a group — yet. In fact, this is less a “team” than a solid business process. In any case, I’ve come to think it’s essential for data-informed decision making, and at the heart of analytics as an organizing principle.

 

The Analytic Strategy Team would be a cross-functional team made up of business sponsors (directors and managers of programs and units) and analysts from both the Technical and Analysis teams. In a data-driven organization, this team meets regularly to rank and prioritize analysis projects that have been submitted to the team as requests, called for by department leadership, or generated by the team members themselves. Projects rank higher for being supportive of current strategy, having a high perceived impact, having executive sponsorship, and so on.

 

Prioritizing is not the team’s most important role, however. As the hub of a framework for Advancement decision-making, the Analytic Strategy Team is there to ensure that when a business question is answered through analysis, there will be follow-through. The Team nails down the “why” and “how” of every analysis project: Zeroing in on the real business question that needs to be answered, drafting the general approach to answering the question, and (most critically) determining what actions will be taken if the answer is x, y, or z. Results and recommendations are channeled to a decision maker, who has agreed in advance to the definition of the business question.

 

Ideally, the department’s leadership team approves the ongoing analytics agenda. Having leadership sign off on the list of priorities fosters an integrated approach to making decisions as a whole department.

 

This team is important for focus — analysts do their best work if they can focus — but it’s even more important for driving decisions. Your team can be kept endlessly busy generating analyses, but it’s when it comes to the consequences of analyses that BI programs risk falling flat. Without the accountability implied by an agreed-on process of question, answer, and follow-through, analysts end up floating from one fishing expedition to another, generating “findings” that never get acted on, or fulfilling requests to support program managers’ foregone conclusions with “evidence.”

 

Of course we want to do some purely exploratory analyses without a defined outcome — but that’s not how data-informed decisions get made. As Thomas Davenport has written, “In the traditional analytics world, analysts may have lacked the ability to work closely with decision-makers to frame decisions appropriately, engage stakeholders, and structure decision processes and actions. Decision analysts in a business analytics environment need to move from back-office decision support to front-office decision consultants.”

 

Again I say, these observations about the “third team” are not drawn from my first-hand experience. These are things I’ve come to understand only recently. My naiveté is evident in “Score!” the book I co-authored with Peter Wylie and which was published just two years ago. What we wrote seemed to imply that all it takes is a supportive leader driving change from the top and engaged staff people with an aptitude for data work driving change from the bottom. They would somehow meet in the middle, and magic would happen. Well, we do need both of those forces, but nowadays I don’t see organizational change happening in the absence of a well-functioning business process that guides decision-making.

 

I’ve talked about the people, the types of work they do, and the structure of the team — all from a general perspective. In my next post, I will talk about the journey our own shop has taken towards building a BI/analytics program. Not surprisingly, the real-world program doesn’t arrive as neatly packaged as this general overview would suggest.

 

23 February 2017

Proceeds from sales of “Score!” to be donated to ACLU

Filed under: Book, Peter Wylie, Score! — kevinmacdonell @ 9:09 pm

 

Peter Wylie and I are pleased to tell you that all our current and future royalties from sales of the book “Score!: Data-Driven Success for Your Advancement Team” will be donated to the American Civil Liberties Union.

 

A good seller since it came out a couple of years ago, “Score!” is available for order online, in both print and e-book versions. (Click here to enter the CASE book store.)

 

Each year around late August, I am delighted to see that cheque in my mail from the Council for Advancement and Support of Education. (Peter of course gets his cheque at the same time, only he spells it “check”.) The next cheque (or check) we receive will be our third. We never know how sales have gone for the year until we get paid; since “Score!” continues to be featured prominently in the CASE catalogue, and people continue to click through this blog to the CASE bookstore every day, we have reason to think sales are still healthy.

 

A good opportunity, then, to extend our little book’s modest influence in a positive direction in these strange times. The ACLU works to defend and preserve the individual rights and liberties guaranteed by the Constitution and laws of the United States. As you may know, I live in Canada, but I recognize that holding the current administration to account is in everyone’s interest.

 

If you’ve been meaning to get a copy and just needed that extra reason to act, click here to order online. Or, even better, consider making a contribution directly to the ACLU or whatever organization you feel is best positioned to undo the poison of xenophobia in your community, region, or country.

 

31 January 2017

Are we missing too many alumni with web surveys? (Part 2)

Filed under: Alumni, John Sammis, Peter Wylie, Surveying — kevinmacdonell @ 6:22 am

Guest post by Peter B. Wylie, with John Sammis

 

Download a printable PDF version of this paper: Are We Missing Too Many Alumni P2.

 

It seems everyone we know, no matter how young or old, has an email address or uses Facebook. So we might assume that nowadays online surveys will reliably deliver a representative sampling of a school’s alumni population.

 
 

In this guest post, Peter Wylie and John Sammis demonstrate that alumni available and willing to be polled online differ from non-online constituents in potentially significant ways. Although current practice tends towards online-only surveying, the evidence suggests this probably skews the conclusions we can draw about our constituencies, with key differences that go well beyond just age.

 
 

(This is “part 2” of an earlier piece. To download the first paper, click here: Are We Missing Too Many Alumni With Web Surveys?)

 
 

Again, the link for Part 2:  Are We Missing Too Many Alumni P2.

 
 

5 December 2016

Amazing things with matching strings

Filed under: Coolness, Data integrity, SQL — kevinmacdonell @ 7:44 am

 

I had an occasion recently when it would have been really helpful to know that a new address added to the database was a duplicate of an older, inactivated address. The addition wasn’t identified as a duplicate because it wasn’t a perfect match — a difference similar to that between 13 Anywhere Road and 13 Anywhere Drive. 

 

After the fact, I did a Google search and discovered some easy-to-use functionality in Oracle SQL that might have saved us some trouble. Today I want to talk about how to use UTL_MATCH and suggest some cool applications for it in Advancement data work.

 

“Fuzzy matching” is the term used for identifying pairs of character strings that may not be exactly the same, but are so close that they could be. For example, “Washignton” is one small typo away from “Washington,” but the equivalence is very difficult to detect by any means other than an alert pair of human eyes scanning a sorted list. When the variation occurs at the beginning of a string — “Unit 3, 13 Elm St.” instead of “Apmt 3, 13 Elm St.” — then even a sorted list is of no use.

 

According to this page, the UTL_MATCH package was introduced in Oracle 10g Release 2, but first documented and supported in Oracle 11g Release 2. The package includes two functions for testing the level of similarity or difference between strings.

 

The first function is called EDIT_DISTANCE, which is a count of the number of “edits” to get from one string to a second string. For example, the edit distance from “Kevin” to “Kelvin” is 1, for “New York” to “new york” is 2, and from “Hello” to “Hello” is 0. (A related function, EDIT_DISTANCE_SIMILARITY, expresses the distance as a normalized value between 0 and 100 — 100 being a perfect match.)
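
If you want to try these functions yourself, a quick sanity check against DUAL is enough to see the two flavours side by side:

SELECT
    UTL_MATCH.edit_distance('Kevin', 'Kelvin') AS edit_distance,
    UTL_MATCH.edit_distance_similarity('Kevin', 'Kelvin') AS similarity_0_to_100
FROM dual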

 

The second method, the one I’ve been experimenting with, is called JARO_WINKLER, named for an algorithm that measures the degree of similarity between two strings. The result ranges from 0 (no similarity) to 1 (perfect similarity). It was designed specifically for detecting duplicate records, and its formula seems aimed at the kind of character transpositions you’d expect to encounter in data entry errors. (More info here: Jaro-Winkler distance.)

 

Like EDIT_DISTANCE, it has a related function called JARO_WINKLER_SIMILARITY. Again, this ranges from 0 (no match) to 100 (perfect match). This is the function I will refer to for the rest of this post.

 

Here is a simple example of UTL_MATCH in action. The following SQL scores constituents in your database according to how similar their first name is to their last name, with the results sorted in descending order by degree of similarity. (Obviously, you’ll need to replace “schema”, “persons,” and field names with the proper references from your own database.)

 

SELECT
    t1.ID,
    t1.first_name,
    t1.last_name,
    UTL_MATCH.jaro_winkler_similarity(t1.first_name, t1.last_name) AS jw
FROM schema.persons t1
ORDER BY jw DESC

 

Someone named “Donald MacDonald” would get a fairly high value for JW, while “Kevin MacDonell” would score much lower. “Thomas Thomas” would score a perfect 100.

 

Let’s turn to a more useful case: Finding potential duplicate persons in your database. This entails comparing a person’s full name with the full name of everyone else in the database. To do that, you’ll need a self-join.

 

In the example below, I join the “persons” table to itself. I concatenate first_name and last_name to make a single string for the purpose of matching. In the join conditions, I exclude records that have the same ID, and select records that are a close or perfect match (according to Jaro-Winkler). To do this, I set the match level at some arbitrary high level, in this case greater than or equal to 98.

 

SELECT
    t1.ID,
    t1.first_name,
    t1.last_name,
    t2.ID,
    t2.first_name,
    t2.last_name,
    UTL_MATCH.jaro_winkler_similarity( t1.first_name || ' ' || t1.last_name, t2.first_name || ' ' || t2.last_name ) AS jw
FROM schema.persons t1
INNER JOIN schema.persons t2
    ON t1.ID != t2.ID
    AND UTL_MATCH.jaro_winkler_similarity( t1.first_name || ' ' || t1.last_name, t2.first_name || ' ' || t2.last_name ) >= 98
ORDER BY jw DESC

 

I would suggest reading this entire post before trying to implement the example above! UTL_MATCH presents some practical issues which limit what you can do. But before I share the bad news, here are some exciting possible Advancement-related applications:

 

  • Detecting duplicate records via address matching.
  • Matching external name lists against your database. (Which would require the external data be loaded into a temporary table in your data warehouse, perhaps.)
  • Screening current and incoming students against prospect, donor, and alumni records for likely matches (on address primarily, then perhaps also last name).
  • Data integrity audits. An example: If the postal code or ZIP is the same, but the city name is similar (but not perfectly similar), then there may be an error in the spelling or capitalization of the city name.
  • Searches on a particular name. If the user isn’t sure about spelling, this might be one way to get suggestions back that are similar to the guessed spelling.

 

Now back to reality … When you run the two code examples above, you will probably find that the first executes relatively quickly, while the second takes a very long time or fails to execute at all. That is due to the fact that you’re evaluating each record in the database against every other record. This is what’s known as a cross-join or Cartesian product — a very costly join which is rarely used. If you try to search for matches across 100,000 records, that’s 10 billion evaluations! The length of the strings themselves contributes to the complexity, and therefore the runtime, of each evaluation — but the real issue is the 10,000,000,000 operations.

 

As intriguing as UTL_MATCH is, then, its usage will cause performance issues. I am still in the early days of playing with this, but here are a few things I’ve learned about avoiding problems while using UTL_MATCH.

 

Limit matching records. Trying to compare the entire database with itself is going to get you in trouble. Limit the number of records retrieved for comparison. A query searching for duplicates might focus solely on the records that have been added or modified in the past day or two, for example. Even so, those few records have to be checked against all existing records, so it’s still a big job — consider not checking against records that are marked deceased, that are non-person entities, and so on. Anything to cut down on the number of evaluations the database has to perform.

 

Keep strings short. Matching works best when working with short strings. Give some thought to what you really want to match on. When comparing address records, it might make sense to limit the comparison to Street Line 1 only, not an entire address string which could be quite lengthy.

 

Pre-screen for perfect matches: A Jaro-Winkler similarity of 100 means that two strings are exactly equal. I haven’t tested this, but I’m guessing that checking for A = B is a lot faster than calculating the JW similarity between A and B. It might make sense to have one query to audit for perfect matches (without the use of UTL_MATCH) and exclude those records from a second query that audits for JW similarities that are high but less than a perfect 100.

 

Pre-screen for impossible matches. If a given ID_1 has a street address that is 60 characters long and a given ID_2 has a street address that is only 20 characters long, there is no possibility of a high Jaro-Winkler score and therefore no need to calculate it. Find a way to limit the data set to be matched before invoking UTL_MATCH, possibly through the use of a WITH clause that limits potential matching pairs by excluding any that differ in length by more than, say, five characters. (Another “pre-match” to use would check whether the initial letter in a name is the same; if it isn’t, there’s a good chance it isn’t going to be a match.)

 

Keep match queries simple. Don’t ask for fields other than ID and the fields you’re trying to match on. Yes, it does make sense to bring down birthdate and additional address information so that the user can decide if a probable match is a true duplicate or not, but keep that part of the query separate from the match itself. You can do this by putting the match in a WITH clause, and then left-joining additional data to the results of that clause.
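
Putting several of these tips together, a sketch (not a query I have production-tested) might do the pre-screening in a WITH clause before invoking UTL_MATCH, and only then join in the extra detail. The address and person columns here (street_1, modified_date, birth_date) are hypothetical:

WITH candidates AS (
    SELECT
        t1.ID AS id_1,
        t2.ID AS id_2,
        t1.street_1 AS street_a,
        t2.street_1 AS street_b
    FROM schema.addresses t1
    INNER JOIN schema.addresses t2
        ON t1.ID != t2.ID
        -- only drive from recently modified records
        AND t1.modified_date >= SYSDATE - 2
        -- rule out impossible matches: lengths differ by more than five characters
        AND ABS( LENGTH(t1.street_1) - LENGTH(t2.street_1) ) <= 5
        -- exact matches are audited in a separate, cheaper query
        AND t1.street_1 != t2.street_1
), scored AS (
    SELECT
        c.id_1,
        c.id_2,
        UTL_MATCH.jaro_winkler_similarity(c.street_a, c.street_b) AS jw
    FROM candidates c
)
SELECT
    s.id_1,
    s.id_2,
    s.jw,
    p.birth_date   -- extra detail joined after the match, to keep the match itself lean
FROM scored s
LEFT JOIN schema.persons p ON p.ID = s.id_2
WHERE s.jw >= 90
ORDER BY s.jw DESC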

 

Truth be told, I have not yet written a query that does something useful while still executing in a reasonable amount of time, simply due to the sheer number of comparisons being made. I haven’t given up on SQL, but it could be that duplicate detection is better accomplished via a purpose-built script running on a standalone computer that is not making demands on an overburdened database or warehouse (aside from the initial pull of raw data for analysis).

 

The best I’ve done so far is a query that selects address records that were recently modified and matches them against other records in the database. Before it applies Jaro-Winkler, the query severely limits the data by pairing up IDs that have name strings and address strings that are nearly the same number of characters long. The query has generated a few records to investigate and, if necessary, de-dupe — but it takes more than an hour and a half to run.

 

Have any additional tips for making use of UTL_MATCH? I’d love to hear and share. Email me at kevin.macdonell@gmail.com.

 

13 November 2016

Where we go from here

Filed under: Off on a tangent — kevinmacdonell @ 6:17 pm

 

Disbelief, anger, helplessness, anxiety. Does that describe your week just past? It certainly describes mine.

 

Given the nature of this blog, you might expect me to be dismayed at how poorly the number-crunchers fared in forecasting the outcome of this presidential election. But no, I don’t care about that.

 

While Tuesday night’s events were still unfolding on television, and long before any protestors took to the streets, voices of reason were already reminding us not to despair. I held onto three examples of these calm voices, because I figured I would need them. I would like to share them with you.

 

The first came around midnight, when it was starting to dawn on me that things were going to end badly:

 

“When voices of intolerance are loudest don’t be despondent — be emboldened, and even more committed to values of diversity and inclusion.”

 

That was a tweet from Richard Florizone (@DalPres), president of Dalhousie University, where I work. His words seemed too oblique when I first read them, somehow falling short of the righteous outrage called for by the occasion. But with the distance of a few days, when my head was cooler, I appreciated that this message was just right.

 

The second helpful piece of advice was a quote by French philosopher and political activist Simone Weil (1909-1943):

 

“Never react to an evil in such a way as to augment it.”

 

Such a succinct antidote to our instinct for knee-jerk retaliation! This quote came to me from the perennially wonderful Maria Popova (@brainpicker), a Bulgarian writer, blogger, and critic living in Brooklyn, New York. Her blog, BrainPickings.org, features her writing on culture, books, and eclectic subjects.

 

And finally, a simply-worded tweet from fundraising professional Lindsay Brown (@DonorScience) in Boston completed this circle of advice with a call to action:

 

“Now more than ever, it’s apparent to me that the work we do in the nonprofit sector is massively important. Let’s keep up the good work.”

 

This is only a sampling of the many calm and wise words spoken in recent days, but they will suffice. What do these three sentiments, taken together, advise us to do?

 

First, we are reminded that the Trump victory has not nullified the values of diversity and inclusion, nor impeded our ability to promote them. We need to understand why he was elected, and by whom (including millions of former Obama supporters who failed to vote), and to address the root causes of political extremism. We need to understand, not denigrate, in order to clarify what we need to do.

 

Second, whatever we do we should avoid making problems worse. Don’t move to Canada! As much as I’d love to have you here (in the unlikely event that Canada enables such immigration), please know that your country needs you now more than ever. For those outside the U.S. who feel like disengaging from that country via a boycott (which was my own initial response), please reflect on the consequences of feeding isolationism. And rioting in the streets against the outcome of a free and fair election can have no legitimate result. During the campaign, President Obama repeated the refrain, “Don’t boo — Vote!” Today we can say, “Don’t boo — Act!”

 

Third and finally: Never doubt that our sector is a vital player in creating a better world, despite not being directly “political”. Higher education and a host of nonprofits can build up and defend what Trumpism wants to tear down, and can help create diverse societies to combat the irrational fear of the Other that helps elect leaders like Trump in the first place.

 

The bad news is perfectly clear: that a radicalized faction of white extremism has just elected a dangerous, unpredictable leader animated by ethnic nationalism and xenophobia; that a nation that could have made history by electing its first woman president instead chose a man who abused and denigrated women and boasted about it; that a nostalgia for a bygone decade before civil rights has accompanied an irrational belief that advancement of ethnic minorities threatens the white, working-class status quo; that a country with international commitments to fight climate change has just elected a leader who doesn’t even believe climate change is a real thing.

 

This sudden clarity — this stunning proof that we have not made nearly as much progress as we thought — should be strong motivation not to despair but to get right to work.

 

I don’t have a prescription for what anyone needs to do. It depends on where you are, what tools you have to work with.

 

Do we have work to do at home? I’m willing to bet your daughters are prepared to take on a sexist world, but what are you telling your sons in order that they will help to create a new world?

 

What can we do in our neighbourhoods? Can diverse communities be brought together to interact? Can we replace mere proximity to the Other, which leads to tension and irrational suspicion, with familiarity and interdependence?

 

What causes and projects can we support with our dollars, our time, and our expertise to increase the ability for marginalized people to participate in the economy, to protect the environment, to support reputable journalism, to extend access to education, to promote people’s rights, to fight cynicism about politics and government?

 

There is so much — no one can do it all. I am still thinking about my own “what now?” list, and I know I have to choose wisely. But like voting itself, it is the accumulation of millions of individual actions that lead to dramatic overall results. Let’s agree that it is no longer enough to hold certain opinions, no longer enough to share the right memes on Facebook, no longer enough even to believe that our duty stops with voting and paying taxes.

 

As Hillary Clinton said the day after the election, “… our Constitutional democracy demands our participation. Not just every four years, but all the time. So let’s do all we can to keep advancing the causes and values we all hold dear. Making our economy work for everyone — not just those at the top. Protecting our country and protecting our planet. And breaking down all the barriers that hold any American back from achieving their dreams.”

 

These words can apply just as well to citizens of the United Kingdom, where far-right xenophobia prevailed in the Brexit vote, and to citizens of Canada, where extremist politicians are already talking about emulating Trump, and to people anywhere else in the world who are free to speak and act.

 

Disbelief, anger, helplessness, anxiety. Yes, there’s a time for all of those things. But let’s not subside into resignation, division, hopelessness, and cynicism. Instead let’s each of us look at our immediate surroundings and figure out what we can do. And then, roll up our sleeves and get to work.

 
