https://saspredictivemodelingonlinetraining.blogspot.com/2015/11/derive-new-variables-pro-predictive-models.html
Introduction
There are many important components in making data mining work for you. One of
the most important is ensuring that you glean all of the information you can
from your data. Sometimes simple transformations and replacements can have a
big impact on your model. The SAS Enterprise Miner nodes make these types of
changes easy. Let’s look at an example.
Data
Here we use the Titanic data set, which contains 891 observations and 12
variables.
Variable Descriptions:
Survival    Survival (0 = No; 1 = Yes)
Pclass      Passenger Class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name        Name
Sex         Sex
Age         Age
SibSp       Number of Siblings/Spouses Aboard
Parch       Number of Parents/Children Aboard
Ticket      Ticket Number
Fare        Passenger Fare
Cabin       Cabin
Embarked    Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
We will build a model to predict the fate of the passengers aboard the RMS
Titanic, which sank in the North Atlantic Ocean in the early morning of April
15, 1912, after colliding with an iceberg.
According to Wikipedia, a disproportionate number of men were left aboard
because a “women and children first” protocol was followed when loading
lifeboats. There were not enough lifeboats to accommodate everyone aboard, so
only a fraction of the passengers survived.
Our first model without any data modifications
We can build a simple model using a Decision Tree, as shown below, with the
variables given in the Titanic data. Since the name, cabin number and ticket
number are essentially unique to each passenger, let’s reject those variables
for now. We will use all other variables as predictors to build this model.
Run the flow and check the Tree results.
We will use misclassification rate as the measure of the best model. Notice
that the Misclassification Rate for this model is 0.17284 as shown below.
Check the Variable Importance table in
the Tree results. Notice that Sex, Pclass, Age, SibSp and Fare are important
variables.
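The misclassification rate used as the selection measure above is simply the share of observations whose predicted class differs from the actual class. The following Python sketch illustrates the calculation outside Enterprise Miner; the function name and the toy labels are made up for illustration. The count of 154 errors is an inference, not something the post states: it assumes the reported rate was computed over all 891 observations.

```python
def misclassification_rate(actual, predicted):
    """Fraction of observations whose predicted class differs from the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Toy example with made-up labels (not Titanic data): 2 of 4 predictions are wrong.
toy_rate = misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0])
print(toy_rate)

# Inference (assumption): if the rate was computed on all 891 observations,
# 154 misclassified passengers would reproduce the reported 0.17284.
inferred_rate = round(154 / 891, 5)
print(inferred_rate)
```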
Our first model with data modifications
How can we get an improved model? Here
we can create new variables from the available variables to get more value from
them.
The ticket, cabin and name data aren’t useful as they stand, since they are
unique to each passenger, but a substring of those text strings might be
useful for building a new predictor. We can start with the name field. If we
explore a passenger’s name we see the following:
Moran, Mr. James
A passenger’s title can reflect gender, position on the ship (doctors,
officers and wealthy people), and access to lifeboats (where “Master”
superseded “Mr”). Perhaps the passenger’s title might give us a little more
insight.
If we explore the data set we see many titles, including Mr, Mrs, Miss,
Master, Lady and the Countess. The title “Master” was used for unmarried boys.
Only a few passengers have the titles Captain, Don, Major and Sir, which are
either military titles or titles held by wealthy people. We might be able to
create a new variable that is an important predictor alongside age, gender,
etc.
In order to extract these titles to make
new variables, we can use the Transform node. We can use the SAS Code window in
the Transform node to create a new variable called “Title”.
Add a Transform Variables node between the Input Data Source (IDS) node and
the Tree node as shown below.
Select the Transform Variables node, open the SAS Code editor and enter code
that derives Title from Name. Here, we have used the SCAN function to extract
the title from the character variable Name.
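The original SAS snippet is not preserved in this text, so the following is only a sketch of the extraction logic, written in Python rather than SAS. It assumes every name follows the "Surname, Title. Given names" pattern seen in "Moran, Mr. James"; the function name and the second sample name are hypothetical.

```python
def extract_title(name: str) -> str:
    """Return the title from a name shaped like 'Surname, Title. Given names'."""
    # Keep the text after the first comma (fall back to the whole string).
    after_comma = name.split(",", 1)[1] if "," in name else name
    # The title is everything up to the first period, without padding spaces.
    return after_comma.split(".", 1)[0].strip()


# "Moran, Mr. James" comes from the post; the second name is a hypothetical
# example showing that multi-word titles are kept whole.
sample_names = ["Moran, Mr. James", "Doe, the Countess. Jane"]
titles = [extract_title(n) for n in sample_names]
print(titles)
```

Splitting on the comma first and the period second means a period inside the given names cannot be mistaken for the end of the title.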
What else can we do to get more information from existing variables? Two
variables, SibSp and Parch, indicate the number of family members each
passenger is travelling with. We can assume that a large family might have
trouble assembling all of its members while trying to get off the sinking
ship, so we combine the two variables into a new one, FamilySize. Again we can
use the Transform node to create the new variable.
We can use either the SAS Code editor or
the Formula Builder to create this variable. Let’s use the Formula Builder.
1. Select the Transform node and open the Formulas window.
2. Select the “Create” button.
3. In the “Edit Transformation” window, enter “FamilySize” in the Name field
to create a variable called “FamilySize”.
4. Select the “Build” button and create the formula in the Expression Builder:
add the number of siblings, spouses, parents and children the passenger had
with them (SibSp + Parch), plus one for the passenger themselves.
5. Select “OK” and save your changes.
6. Run the flow from the Transform node and check the exported data. Explore
the new variables “Title” and “FamilySize”.
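The formula built in the steps above amounts to one line of arithmetic. The Python sketch below shows it outside the Expression Builder; the function name and the sample rows are hypothetical and are not taken from the Titanic data.

```python
def family_size(sibsp: int, parch: int) -> int:
    """Siblings/spouses plus parents/children aboard, plus the passenger."""
    return sibsp + parch + 1


# Hypothetical rows: (SibSp, Parch) pairs for three made-up passengers.
sample_rows = [(0, 0), (1, 0), (3, 2)]
sizes = [family_size(sibsp, parch) for sibsp, parch in sample_rows]
print(sizes)
```

A passenger travelling alone therefore gets a FamilySize of 1 rather than 0.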
Run the flow and check the Tree results.
Notice that this model performed better than the previous model. The
Misclassification Rate for this model is 0.160494 as shown below.
Check the Variable Importance table in
the Tree results. Notice that Title, FamilySize, Pclass, Age and Sex are
important variables and are used in a tree to build the model. You can see that
new variables Title and FamilySize have higher importance than SibSp and Fare.
Our next model with data modifications
What else can we do to improve this model? If you look at all observations of
the Title variable, you will notice a few very rare titles that won’t give our
model much information to work with, so let’s combine a few of the most
unusual ones. In the first group we have “Lady”, “the Countess”, “Dona”,
“Mme”, “Mlle” and “Jonkheer”. These mostly belong to wealthy passengers
traveling in first class. We can combine these separate groups into the “Lady”
group.
For the men, we have a handful of titles: Captain, Don, Major and Sir. These
are either military titles or titles held by wealthy people. We can combine
them into the “Sir” group.
We can use the Replacement node to
reduce the number of levels. Add a Replacement node between the Transform node
and the Tree node as shown below.
Select the Replacement node and open the
Replacement Editor for class variables. As explained above, combine a few of
the unusual titles into a single category as shown below. Replace “Col”,
“Major”, “Capt” and “Don” with “Sir” and “Jonkheer”, “Mme”, “Mlle” and “the
Countess” with “Lady”. Replace “Ms” with “Miss”. Save your changes.
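The replacements entered in the Replacement Editor can be viewed as a lookup table from rare titles to their group. This Python sketch mirrors the mapping listed above; the dictionary and function names are hypothetical, and titles not in the table are left unchanged.

```python
# Mapping taken from the replacements described in the text.
TITLE_REPLACEMENTS = {
    "Col": "Sir",
    "Major": "Sir",
    "Capt": "Sir",
    "Don": "Sir",
    "Jonkheer": "Lady",
    "Mme": "Lady",
    "Mlle": "Lady",
    "the Countess": "Lady",
    "Ms": "Miss",
}


def consolidate_title(title: str) -> str:
    """Return the grouped title, or the original title if no rule applies."""
    return TITLE_REPLACEMENTS.get(title, title)


consolidated = [consolidate_title(t) for t in ["Capt", "Mlle", "Ms", "Mr"]]
print(consolidated)
```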
Run the flow and check the Tree results. This model performed better than the
previous one: its Misclassification Rate is 0.150393, as shown below.
Comparing models with and without data manipulation
You can use the Model Comparison node to
compare all three models.
Check the Model Comparison results.
Notice that Tree_Model3 was selected as the best model using the
misclassification rate as the metric for model selection.
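Selecting by misclassification rate means picking the model with the lowest value. The sketch below applies that rule to the three rates reported in this post. Only the name Tree_Model3 appears in the text; Tree_Model1 and Tree_Model2 are assumed names for the first two models.

```python
# Rates reported in the post; the first two model names are assumptions.
rates = {
    "Tree_Model1": 0.17284,   # no data modifications
    "Tree_Model2": 0.160494,  # Title and FamilySize added
    "Tree_Model3": 0.150393,  # rare titles consolidated
}

# The best model is the one with the lowest misclassification rate.
best_model = min(rates, key=rates.get)

# Relative reduction in the error rate from the first model to the best one.
relative_reduction = (rates["Tree_Model1"] - rates[best_model]) / rates["Tree_Model1"]

print(best_model)
print(f"{relative_reduction:.1%}")
```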
Summary
This example shows that a little bit of
effort in transforming your data can go a long way toward improving your model.
With just a few transformations we’ve begun to see an improvement in
classification accuracy. Imagine the improvements you might achieve with a bit
more effort.