In a Hurry? Classification and Regression Tree: Six Easy Steps

What happens in CART — and six easy steps to build a model.

Rutvij Bhutaiya
Analytics Vidhya

--

Classification and Regression Tree (CART) is a member of the decision tree family of algorithms.

XOR using 3 attributes — hackerearth.com

A decision tree is built by an algorithm that repeatedly finds conditions on which to split the data.

Decision trees are built with a top-down, greedy approach.

The ID3 algorithm uses information gain for splitting, whereas C4.5 uses the gain ratio.

CART is an alternative that handles both classification and regression. For splitting, CART uses the Gini index.

Here is how CART calculates the overall Gini for each variable.


For example,

variable temp = {hot, mild, cold, cold, cold, mild, hot, mild} and target = {yes, yes, no, no, yes, yes, no, yes}

Now, to calculate the Gini for temp (overall), we first calculate the Gini for each value of temp, where gini = 1 − Σ pᵢ²:

gini(hot): Yes = 1, No = 1, hence 1 − (1/2)² − (1/2)² = 0.5

gini(mild): Yes = 3, No = 0, hence 1 − (3/3)² − (0/3)² = 0

gini(cold): Yes = 1, No = 2, hence 1 − (1/3)² − (2/3)² ≈ 0.444

Gini(temp, overall) = (2/8) × 0.5 + (3/8) × 0 + (3/8) × 0.444 ≈ 0.292
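The calculation above can be sketched in a few lines of Python (pure standard library; the function names here are mine, not from the article):

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(values, labels):
    """Overall Gini of a categorical variable: the Gini of each
    value's subset, weighted by that subset's share of the data."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())

temp   = ["hot", "mild", "cold", "cold", "cold", "mild", "hot", "mild"]
target = ["yes", "yes", "no", "no", "yes", "yes", "no", "yes"]

print(round(weighted_gini(temp, target), 3))  # -> 0.292
```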

Similarly, we can compute the Gini for the other variables, such as wind speed and weather.

After calculating the overall Gini for all the variables, CART splits the data on the variable with the lowest overall Gini, because that split leaves the resulting nodes purest.

After a split, the parent node gets two or more child nodes (one per class of the splitting variable), and the algorithm then applies the same procedure recursively to each child node.
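The greedy selection step can be sketched as below; note that the wind values are hypothetical, made up for illustration only (they are not from the article):

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(values, labels):
    """Gini of each value's subset, weighted by subset size."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())

def best_split_variable(data, target):
    """Greedy step: pick the variable whose split yields the
    lowest weighted Gini, i.e. the purest child nodes."""
    return min(data, key=lambda var: weighted_gini(data[var], target))

data = {
    "temp": ["hot", "mild", "cold", "cold", "cold", "mild", "hot", "mild"],
    # hypothetical wind values, for illustration only
    "wind": ["low", "low", "high", "high", "low", "low", "high", "low"],
}
target = ["yes", "yes", "no", "no", "yes", "yes", "no", "yes"]

# wind separates the classes perfectly (weighted Gini 0), so it wins
print(best_split_variable(data, target))  # -> wind
```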

Here are six easy and quick steps to perform CART in R.

Step 1: CART is a supervised technique; load the required packages with library(rpart) and library(rpart.plot).

Step 2: Set the control parameters with the rpart.control() function — minbucket (the minimum number of observations in a terminal node) and minsplit (the minimum number of observations that must exist in a node for a split to be attempted).

NOTE: the lower the minsplit and minbucket, the larger the tree. But you can't simply set minbucket = 1: that model may show the highest accuracy on the training data, yet it is overfitted and will not give the same accuracy on an unseen dataset.

Step 3: Build the model with the rpart() function.

Step 4: Prune the tree based on the complexity parameter (CP) table and xerror.

Step 5: Plot the tree with the fancyRpartPlot() function (from the rattle package).

Step 6: Predict the classes of an unseen dataset with the predict() function.
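The steps above use R's rpart. As a rough analogue only — this is a sketch in Python with scikit-learn, not the author's R code — DecisionTreeClassifier exposes the same ideas: min_samples_split and min_samples_leaf play the roles of minsplit and minbucket, ccp_alpha performs cost-complexity pruning (the counterpart of rpart's CP table), and predict() scores unseen data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A stand-in dataset; the article does not name one.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 2 analogue: control parameters (rpart's minsplit / minbucket).
# Step 4 analogue: ccp_alpha prunes via cost-complexity (rpart's cp).
clf = DecisionTreeClassifier(
    criterion="gini",      # CART splits on the Gini index
    min_samples_split=20,  # ~ rpart's minsplit
    min_samples_leaf=7,    # ~ rpart's minbucket
    ccp_alpha=0.01,        # ~ pruning against the CP table
    random_state=0,
)

# Step 3 analogue: build the model.
clf.fit(X_train, y_train)

# Step 6 analogue: predict classes for unseen data.
pred = clf.predict(X_test)
print(clf.score(X_test, y_test))
```

For the Step 5 analogue, sklearn.tree.plot_tree(clf) draws the fitted tree, much as fancyRpartPlot() does in R.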
