In order to illustrate the construction of regression tree (using the CART methodology), consider the following simulated dataset,

> set.seed(1) > n=200 > X1=runif(n) > X2=runif(n) > P=.8*(X1<.3)*(X2<.5)+ + .2*(X1<.3)*(X2>.5)+ + .8*(X1>.3)*(X1<.85)*(X2<.3)+ + .2*(X1>.3)*(X1<.85)*(X2>.3)+ + .8*(X1>.85)*(X2<.7)+ + .2*(X1>.85)*(X2>.7) > Y=rbinom(n,size=1,P) > B=data.frame(Y,X1,X2)

with one dichotomos varible (the variable of interest, ), and two continuous ones (the explanatory ones and ).

> tail(B) Y X1 X2 195 0 0.2832325 0.1548510 196 0 0.5905732 0.3483021 197 0 0.1103606 0.6598210 198 0 0.8405070 0.3117724 199 0 0.3179637 0.3515734 200 1 0.7828513 0.1478457

The theoretical partition is the following

Here, the sample can be plotted below (be careful, the first variate is on the *y*-axis above, and the *x*-axis below) with blue dots when equals one, and red dots when is null,

> plot(X1,X2,col="white") > points(X1[Y=="1"],X2[Y=="1"],col="blue",pch=19) > points(X1[Y=="0"],X2[Y=="0"],col="red",pch=19)

In order to construct the tree, we need a partition critera. The most standard one is probably Gini’s index, which can be writen, when ‘s are splited in two classes, denoted here

or when ‘s are splited in three classes, denoted

etc. Here, are just counts of observations that belong to partition such that takes value . But it is possible to consider other criteria, such as the chi-square distance,

where, classically

Here again, the idea is to maximize that distance: the idea is to discriminate, so we want samples as not independent as possible. To compute Gini’s index consider

> GINI=function(y,i){ + T=table(y,i) + nx=apply(T,2,sum) + pxy=T/matrix(rep(nx,each=2),2,ncol(T)) + vxy=pxy*(1-pxy) + zx=apply(vxy,2,sum) + n=sum(T) + -sum(nx/n*zx) + }

We simply construct the contingency table, and then, compute the quantity given above. Assume, first, that there is only one explanatory variable. We split the sample in two, with all possible spliting values , i.e.

Then, we compute Gini’s index, for all those values. The knot is the value that maximizes Gini’s index. Once we have our first knot, we keep it (call it, from now on ). And we reiterate, by seeking the best second choice: given one knot, consider the value that splits the sample in three, and give the highest Gini’s index, Thus, we consider either the following partition

or this one

I.e. we cut either below, or above the previous knot. And we iterate. The code can be something like that,

> X=X2 > u=(sort(X)[2:n]+sort(X)[1:(n-1)])/2 > knot=NULL > for(s in 1:4){ + vgini=rep(NA,length(u)) + for(i in 1:length(u)){ + kn=c(knot,u[i]) + F=function(x){sum(x<=kn)} + I=Vectorize(F)(X) + vgini[i]=GINI(Y,I) + } + plot(u,vgini) + k=which.max(vgini) + cat("knot",k,u[k],"\n") + knot=c(knot,u[k]) + u=u[-k] + } knot 69 0.3025479 knot 133 0.5846202 knot 72 0.3148172 knot 111 0.4811517

At the first step, the value of Gini’s index was the following,

which was maximal around 0.3. Then, this value is considered as fixed. And we try to construct a partition in three parts (spliting either below or above 0.3). We get the following plot for Gini’s index (as a function of this second knot)

which is maximum when the split the sample around 0.6 (which becomes our second knot). Etc. Now, let us compare our code with the standard R function,

> tree(Y~X2,method="gini") node), split, n, deviance, yval * denotes terminal node 1) root 200 49.8800 0.4750 2) X2 < 0.302548 69 12.8100 0.7536 * 3) X2 > 0.302548 131 28.8900 0.3282 6) X2 < 0.58462 65 16.1500 0.4615 12) X2 < 0.324591 7 0.8571 0.1429 * 13) X2 > 0.324591 58 14.5000 0.5000 * 7) X2 > 0.58462 66 10.4400 0.1970 *

We do obtain similar knots: the first one is 0.302 and the second one 0.584. So, constructing tree is not that difficult…

Now, what if we consider our two explanatory variables? The story remains the same, except that the partition is now a bit more complex to write. To find the first knot, we consider all values on the two components, and again, keep the one that maximizes Gini’s index,

> n=nrow(B) > u1=(sort(X1)[2:n]+sort(X1)[1:(n-1)])/2 > u2=(sort(X2)[2:n]+sort(X2)[1:(n-1)])/2 > gini=matrix(NA,nrow(B)-1,2) > for(i in 1:length(u1)){ + I=(X1<u1[i]) + gini[i,1]=GINI(Y,I) + I=(X2<u2[i]) + gini[i,2]=GINI(Y,I) + } > mg=max(gini) > i=1+sum(mg==max(gini[,2])) > par(mfrow = c(1, 2)) > plot(u1,gini[,1],ylim=range(gini),col="green",type="b",xlab="X1",ylab="Gini index") > abline(h=mg,lty=2,col="red") > if(i==1){points(u1[which.max(gini[,1])],mg,pch=19,col="red") + segments(u1[which.max(gini[,1])],mg,u1[which.max(gini[,1])],-100000)} > plot(u2,gini[,2],ylim=range(gini),col="green",type="b",xlab="X2",ylab="Gini index") > abline(h=mg,lty=2,col="red") > if(i==2){points(u2[which.max(gini[,2])],mg,pch=19,col="red") + segments(u2[which.max(gini[,2])],mg,u2[which.max(gini[,2])],-100000)} > u2[which.max(gini[,2])] [1] 0.3025479

The graphs are the following: either we split on the first component (and we obtain the partition on the right, below),

or we split on the second one (and we get the following partition),

Here, it is optimal to split on the second variate, first. And actually, we get back to the one-dimensional case discussed previously: as expected, it is optimal to split around 0.3. This is confirmed with the code below,

> library(tree) > arbre=tree(Y~X1+X2,data=B,method="gini") > arbre$frame[1:4,] var n dev yval splits.cutleft splits.cutright 1 X2 200 49.875000 0.4750000 <0.302548 >0.302548 2 X1 69 12.811594 0.7536232 <0.800113 >0.800113 4 <leaf> 57 8.877193 0.8070175 5 <leaf> 12 3.000000 0.5000000

For the second knot, four cases should be considered: spliting on the second variable (again), either above, or below the previous knot (*see below on the left*) or spliting on the first one. Then whe have wither a partition below or above the previous knot (*see below on the right*),

Etc. To visualize the tree, the code is the following

> plot(arbre) > text(arbre) > partition.tree(arbre)

Note that we can also visualize the partition. Nice, isn’t it?

To go further, the book *Classification and Regression Trees* by Leo Breiman (and co-authors) is awesome. Note that there are also interesting sections in the bible *Elements of Statistical Learning: Data Mining, **Inference, and Prediction* by Trevor Hastie, Robert Tibshirani and Jerome Friedman (which can be downloaded from http://www.stanford.edu/~hastie/…)

Thanks for the nice explanation. I would love to see the R code that produced the animations. Would you be willing to share/post it ?

Thanks!

Markus

oh, thanks Markus… I will see if I can rewrite them when I will teach it this semester !

Bonjour,

Merci pour ce billet les vidéos sont très cools !

il manque certaines images qui correspondent probablement aux formules de l’index de Gini pour 2 et plus de catégories (un url manquant est : http://perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-04.png)

désolé, j’essaye de fixer ça rapidement

Merci pour ce billet éclairant.

What about random foret trees in regard to regression trees ?