CoolData blog

13 December 2009

Adding new variables to existing datafile in Data Desk, Part 2

Filed under: Data Desk, Derived variables — Tags: , , , , , — kevinmacdonell @ 8:04 pm

As explained in Part 1, you can simply paste a new variable into an existing relation IF there are the same number of cases in both the existing relation and the new variable, AND they are in exactly the same order. Otherwise you will have to perform a join, as follows.

As an example, let’s say my existing data set has variables for a unique identifier (ID), First Name, Last Name, and Class Year. And let’s say that for some reason, I need to add Degree as a variable. (For something simple like this, I would of course scrap this data file and re-query the database to get all the data I need; this is just an example.)

I’ll need to query the database for two things, ID and Degree. (Whenever you query your database, you must never fail to include a unique record identifier. In Banner this may be ID or PIDM.)

Copy the columns you want from Excel, Access or whatever your source is.

In Data Desk, go to the Data menu:  New > Relation. You’ll be pasting your new variables into a relation that is separate from the existing relation that contains the names and class years.

Name the folder (let’s call it “Deg“), and paste your variables. For neatness, drag the variables out of the Clipboard folder and into Deg, and drag the empty Clipboard to the trash.

To avoid confusing the new ID variable and the ID variable as already exists, rename the new variable. (eg. ‘ID-degree’)

Create a new derived variable in the Deg relation folder. Name it ‘ID-degree-lookup’, and give it this expression:

lookup('ID-degree','ID')

Remember to drag each of the two variables into the expression window rather than type them. The Lookup command will literally look up the value of the first argument in the list of values in the second argument.

Now create another derived variable, again in the Deg relation folder. Call it ‘Degree-joined’. Give it this expression:

getcase(textof('Degree'),'ID-degree-lookup')

The GetCase command will retrieve the value of ‘Degree’ that matches each looked-up ID value. The nested TextOf command is needed because the value of ‘Degree’ is non-numerical. If the variable we were adding was a numerical value, such as Lifetime Giving, the TextOf command would not be needed.

You can check if the variable works by choosing Show numbers from the variable’s Hyperview menu, or by dragging the variable into a frequency table.

‘Degree-joined’ is now a full member of both relations – you can drag it into the folder that contains your original data without getting an error message. It’s now available to be used like any other variable in the original data set.

If this seems a little confusing at first, don’t worry. Print these directions out and refer to them as needed. If you are building a predictive model and discovering new variables as you go along, you’ll be calling on this technique so often that eventually you’ll know it by heart.

Advertisements

11 December 2009

Adding new variables to existing datafile in Data Desk, Part 1

Filed under: Data Desk — Tags: , , , — kevinmacdonell @ 11:47 am

When I form a query on our database using MS Access, I try to make it as complete as possible before starting work in Data Desk. I ask myself, do I have all the columns (variables) I need for this task? If not, I modify and run the query again.

Invariably, though, I realize I’ve forgotten something only after I am well underway in Data Desk and starting from scratch is not an option. Or sometimes the data set has to contain so many variables (such as in any typical predictive modeling project) that there’s no way I can pull them out of the database with just one query (MS Access can handle only so much at once).

There is a lazy way to do this, which I do not recommend. You can simply paste a new variable into an existing relation, under these conditions:

  • The number of cases in your new variable is exactly the same as in your existing data file.
  • All the cases (‘rows’ in Excel) in the new variable are in exactly the same order as your existing data file.

These conditions can be easily met if your existing data and your new data is coming from the same static spreadsheet stored on your own hard drive. But a university database is a living thing. The number of ‘living, addressable alumni’ changes by the hour, so I can never be sure that any new variable I pull out of it is going to match up with the data I pulled yesterday.

Even if you ARE grabbing data from your own spreadsheet, what happens if you’ve reordered the rows in the meantime? Data Desk will allow you to paste the variable (because the number of cases is the same), but the data will be mismatched.

Fortunately it takes just a couple of steps in Data Desk to perform what I call a ‘merge’ or ‘join’ using the Lookup and GetCase functions. Sometimes I call it stitching data together – that’s what it feels like I’m doing.

I’ll show you how in: “Adding new variables to existing datafile in Data Desk, Part 2.

7 December 2009

Rural vs. Urban postal codes

Filed under: Data Desk, Derived variables — Tags: , , , — kevinmacdonell @ 11:18 am

If your data set contains primarily individuals residing in Canada, you can create a derived indicator variable in Data Desk to identify who does not live in an urban area or small town. Just one more potential predictor variable to test in your models.

First, your data set needs to contain Postal Code. You’ll also need to include Country, if your data set includes non-Canadian addresses.

The expression below will create an indicator variable with a value of ‘1’ (true) for rural Canadian addresses, and ‘0’ (false) for urban addresses.

Drag your own variables for Postal Code and Country into the expression window.

mid('Postal Code',2,1) = 0 AND textof('Country') = "Canada"

Lazy man’s derived variables: Class decade

Filed under: Data Desk, Derived variables — Tags: , , , , — kevinmacdonell @ 10:59 am

While using Data Desk, I find I need to create certain derived variables expressions over and over again. Instead of composing these anew every time, I save them as text files in a place quickly accessible on my computer.

Class Year is a variable I include in nearly every set of alumni data I work with. So one of the derived variables I create most often is Class Decade.

To use, simply copy this text and paste into the expression window of a new derived variable. Replace each ‘v’ with the Class Year variable, exactly as it is named in your data set. (Best to click and drag ‘Class Year’ into the expression window, instead of typing it every time.)

if len (v) = 0 then "n/a"
else if v < 1950 then "1940s + <"
else if v < 1960 then "1950s"
else if v < 1970 then "1960s"
else if v < 1980 then "1970s"
else if v < 1990 then "1980s"
else if v < 2000 then "1990s"
else if v < 2010 then "2000s"
else •

« Newer Posts

Create a free website or blog at WordPress.com.