Derive New Variables for Better Predictive Models

Introduction

Many components go into making data mining work for you. One of the most important is ensuring that you glean all of the information you can from your data. Sometimes simple transformations and replacements can make a big impact on your model, and the SAS Enterprise Miner nodes make these kinds of changes easy. Let’s look at how.

Data

Here we use the Titanic data. This data set contains 891 observations and 12 variables.

Variable Descriptions:

Survival: Survival (0 = No; 1 = Yes)

Pclass: Passenger Class (1 = 1st, 2 = 2nd, 3 = 3rd)

Name: Name

Sex: Sex

Age: Age

SibSp: Number of Siblings/Spouses Aboard

Parch: Number of Parents/Children Aboard

Ticket: Ticket Number

Fare: Passenger Fare

Cabin: Cabin

Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

We will build a model to predict the fate of the passengers aboard the RMS Titanic, which sank in the North Atlantic Ocean in the early morning of April 15, 1912, after colliding with an iceberg.

According to Wikipedia, a disproportionate number of men were left aboard because a “women and children first” protocol was followed when loading lifeboats. Because there were not enough lifeboats to accommodate all of those aboard, only a fraction of the passengers survived.

Our first model without any data modifications

We can build a simple model using a Decision Tree, as shown below, using the variables given in the Titanic data. Since the name, cabin number and ticket number are all unique to each passenger, let’s reject those variables for now. We will use all other variables as predictors to build this model.

Run the flow and check the Tree results. We will use the misclassification rate to judge which model is best. The Misclassification Rate for this model is 0.17284.
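For reference, the misclassification rate is the fraction of observations whose predicted class differs from the actual class. Here is a minimal Python sketch of that calculation. It only illustrates the definition and is not Enterprise Miner output; the function name and the toy labels are made up for this example.

```python
def misclassification_rate(actual, predicted):
    """Fraction of observations whose predicted class differs from the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)


# Toy labels (hypothetical): 1 = survived, 0 = did not survive.
actual = [0, 1, 1, 0, 1]
predicted = [0, 1, 0, 0, 1]
print(misclassification_rate(actual, predicted))  # one of the five predictions is wrong
```

If the rate was computed on all 891 observations (an assumption, since the data partitioning is not stated here), a value of 0.17284 would correspond to 154 misclassified passengers.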

Check the Variable Importance table in the Tree results. Notice that Sex, Pclass, Age, SibSp and Fare are important variables.

Our first model with data modifications

How can we get an improved model? Here we can create new variables from the available variables to get more value from them.

The ticket, cabin and name values are not useful as they stand because they are unique to each passenger, but a substring of those text strings might be useful for building a new predictor. We can start with the name field. If we explore a passenger’s name we see the following:

Moran, Mr. James

A passenger’s title can reflect gender, position on the ship (doctors, officers and wealthy people), and access to lifeboats (where “Master” superseded “Mr”). Perhaps the passenger’s title can give us a little more insight.

If we explore the data set we see many titles, including Mr, Mrs, Miss, Master, Lady and the Countess. The title “Master” was used for unmarried boys. A few titles are very rare: Captain, Don, Major and Sir. All of these are either military titles or titles of wealthy people. From the title we might be able to create a new variable that is an important predictor alongside age, gender, etc.

In order to extract these titles to make new variables, we can use the Transform node. We can use the SAS Code window in the Transform node to create a new variable called “Title”.

Add a Transform Variables node between the IDS node and the Tree node as shown below.

Select the Transform Variables node, open the SAS Code editor, and enter code that uses the SCAN function to extract the title from the character variable Name into the new variable Title.
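The extraction logic can be sketched outside Enterprise Miner as follows. This Python snippet only illustrates the idea of taking the second token of Name when commas and periods are the delimiters. It is not the SAS code entered in the Transform Variables node, and the function name extract_title is made up for this example.

```python
import re


def extract_title(name):
    """Return the second token of a name, splitting on commas and periods."""
    # Drop empty tokens so that consecutive delimiters count as one.
    tokens = [t for t in re.split(r"[,.]", name) if t != ""]
    # The second token is the title; strip the blank that follows the comma.
    return tokens[1].strip() if len(tokens) > 1 else ""


print(extract_title("Moran, Mr. James"))
```

For the name shown earlier, the tokens are "Moran", " Mr" and " James", so the title is "Mr" once the leading blank is removed.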

What else can we do to get more information from existing variables? Two variables, SibSp and Parch, indicate the number of family members each passenger is travelling with. We can assume that a large family might have trouble assembling all of its members as they try to get off the sinking ship, so let’s combine the two variables into a new one, FamilySize. Again we can use the Transform Variables node to create the new variable.

We can use either the SAS Code editor or the Formula Builder to create this variable. Let’s use the Formula Builder.

1. Select the Transform node and open the Formulas window.

2. Select the “Create” button.

3. In the “Edit Transformation” window, enter “FamilySize” in the Name field to create a variable called “FamilySize”.

4. Select the “Build” button. In the Expression Builder, create the formula SibSp + Parch + 1: we add the number of siblings, spouses, parents and children the passenger had with them, plus one for the passenger.

5. Select “OK” and save your changes.

6. Run the flow from the Transform node and check the exported data. Explore the new variables “Title” and “FamilySize”.
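The FamilySize calculation built in the steps above amounts to a single sum. As a rough Python illustration (the passenger records below are made-up values, not rows from the Titanic data, and family_size is a hypothetical helper name):

```python
def family_size(sibsp, parch):
    """Siblings/spouses plus parents/children, plus one for the passenger."""
    return sibsp + parch + 1


# Hypothetical passengers, for illustration only.
passengers = [
    {"SibSp": 1, "Parch": 0},
    {"SibSp": 0, "Parch": 0},
    {"SibSp": 3, "Parch": 2},
]
for p in passengers:
    p["FamilySize"] = family_size(p["SibSp"], p["Parch"])

print([p["FamilySize"] for p in passengers])
```

A passenger travelling alone gets a FamilySize of 1 rather than 0, which is the point of adding one.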

Run the flow and check the Tree results. Notice that this model performed better than the previous model. The Misclassification Rate for this model is 0.160494 as shown below.

Check the Variable Importance table in the Tree results. Notice that Title, FamilySize, Pclass, Age and Sex are important variables and are used in the tree. The new variables Title and FamilySize have higher importance than SibSp and Fare.

Our next model with data modifications

What else can we do to improve this model? If you look at all observations of the Title variable, you will notice a few very rare titles that won’t give our model much information to work with, so let’s combine the most unusual ones. For the ladies, we have “Lady”, “the Countess”, “Dona”, “Mme”, “Mlle” and “Jonkheer”. All of these are rich ladies traveling in first class. We can combine these separate groups into the “Lady” group.

For the men, we have a handful of titles: Captain, Don, Major and Sir. All of these are either military titles or titles of wealthy people. We can combine these titles into the “Sir” group.

We can use the Replacement node to reduce the number of levels. Add a Replacement node between the Transform node and the Tree node as shown below.

Select the Replacement node and open the Replacement Editor for class variables. As explained above, combine the unusual titles into a few categories: replace “Col”, “Major”, “Capt” and “Don” with “Sir”, and replace “Jonkheer”, “Mme”, “Mlle” and “the Countess” with “Lady”. Also replace “Ms” with “Miss”. Save your changes.
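The replacements listed above are simply a lookup from each rare level to its combined level. A small Python sketch of the same mapping follows. It illustrates the logic only, not what the Replacement node runs internally, and the names TITLE_REPLACEMENTS and collapse_title are made up for this example.

```python
# Rare titles mapped to the combined levels described above.
TITLE_REPLACEMENTS = {
    "Col": "Sir",
    "Major": "Sir",
    "Capt": "Sir",
    "Don": "Sir",
    "Jonkheer": "Lady",
    "Mme": "Lady",
    "Mlle": "Lady",
    "the Countess": "Lady",
    "Ms": "Miss",
}


def collapse_title(title):
    """Return the combined level for a rare title; leave other titles unchanged."""
    return TITLE_REPLACEMENTS.get(title, title)


print([collapse_title(t) for t in ["Mr", "Capt", "Mlle", "Ms", "Master"]])
```

Titles that are not in the mapping, such as Mr or Master, pass through untouched, so only the number of rare levels is reduced.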

Run the flow and check the Tree results. This model performed better than the previous one: its Misclassification Rate is 0.150393, as shown below.

Comparing models with and without data manipulation

You can use the Model Comparison node to compare all three models.

Check the Model Comparison results. Notice that Tree_Model3 was selected as the best model using the misclassification rate as the metric for model selection.
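The selection rule is simply "lowest misclassification rate wins". As a quick Python illustration using the three rates reported in this post (the labels Tree_Model1 and Tree_Model2 are assumed names for the first two trees; only Tree_Model3 is named in the results):

```python
# Misclassification rates reported for the three trees in this post.
rates = {
    "Tree_Model1": 0.17284,   # no data modifications (assumed label)
    "Tree_Model2": 0.160494,  # Title and FamilySize added (assumed label)
    "Tree_Model3": 0.150393,  # rare titles combined
}


def best_model(rates_by_model):
    """Return the model name with the lowest misclassification rate."""
    return min(rates_by_model, key=rates_by_model.get)


print(best_model(rates))
```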

Summary

This example shows that a little bit of effort in transforming your data can go a long way toward improving your model. With just a few transformations we’ve begun to see an improvement in classification accuracy. Imagine the improvements you might achieve with a bit more effort.
