CART (Classification & Regression Trees)

CART partitions the space of all joint predictor variable values into disjoint regions $R_j$. The output of the model in each of these disjoint regions is a constant $c_j$. That is, the predictive rule is:

$$x \in R_j \;\Rightarrow\; \hat{f}(x) = c_j$$
The output of the overall model can be written more generally as:

$$\hat{f}(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m)$$

Here, $I(\cdot)$ is the indicator function, $c_m$ is the constant fitted in region $R_m$, and $M$ is the number of regions.
At each step, the CART algorithm needs:
- a splitting variable $j$
- a splitting point $s$
We have observations $(x_i, y_i)$ for $i = 1, \dots, N$. If we minimize the sum of squared errors $\sum_i (y_i - f(x_i))^2$ (equivalently, the RMSE), the best $\hat{c}_m$ for region $R_m$ turns out to be the average of all the $y_i$ in that region:

$$\hat{c}_m = \operatorname{ave}(y_i \mid x_i \in R_m)$$
For a regression tree, finding the best variable to split on and the best splitting point requires a greedy approach. Say we consider splitting on variable $j$ at point $s$. We get two regions,

$$R_1(j, s) = \{x \mid x_j \le s\} \quad \text{and} \quad R_2(j, s) = \{x \mid x_j > s\}.$$

Then we minimize the following:

$$\min_{j,\, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 \;+\; \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right]$$
For any fixed $(j, s)$ the inner minimizations are solved by the region averages, so we choose the splitting variable $j$ and splitting point $s$ that minimize the above objective. Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of the two regions. A minimal sketch of this search is shown below.
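A minimal sketch of the greedy split search in NumPy (the function `best_split` and the toy data are illustrative, not part of the original notes):

```python
import numpy as np

def best_split(X, y):
    """Greedy CART split: scan every variable j and threshold s, and return
    the (j, s) pair minimizing the two-region sum of squared errors."""
    n, p = X.shape
    best = (None, None, np.inf)  # (j, s, sse)
    for j in range(p):
        for s in np.unique(X[:, j])[:-1]:          # candidate thresholds
            left = X[:, j] <= s
            right = ~left
            c1, c2 = y[left].mean(), y[right].mean()   # optimal region constants
            sse = ((y[left] - c1) ** 2).sum() + ((y[right] - c2) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best

# Toy usage: the split should be found on variable 1 near 0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.where(X[:, 1] > 0.5, 2.0, -1.0) + 0.1 * rng.normal(size=100)
print(best_split(X, y))
```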
Boosting and Stagewise Additive Model
Boosting builds an additive model of weak learners. If $b(x; \gamma)$ is a weak learner, then we combine $M$ such learners in a stagewise manner, i.e. at each stage $m$ we optimize the new learner $b(x; \gamma_m)$ while keeping all previously fitted learners fixed. This significantly reduces the chance of overfitting.
For example, boosting for a regression problem turns out to be the problem of repeatedly fitting the residuals, as shown below. At stage 0, the residual is the output itself, i.e. $r_{i0} = y_i$. At each stage $m$ we fit a tree $T_m$ to the current residuals and update

$$f_m(x) = f_{m-1}(x) + \nu\, T_m(x), \qquad r_{im} = r_{i,m-1} - \nu\, T_m(x_i).$$

Here, $\nu$ is the shrinkage factor used to scale down each tree, which also helps with overfitting.
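A minimal sketch of residual fitting with shrinkage, using scikit-learn's `DecisionTreeRegressor` as the weak learner (function names and hyperparameter defaults are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression(X, y, n_stages=100, nu=0.1, max_depth=2):
    """Least-squares boosting: each stage fits a small tree to the residuals."""
    trees = []
    residual = y.astype(float).copy()   # stage 0: residual is the output itself
    for _ in range(n_stages):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        residual -= nu * tree.predict(X)    # subtract the shrunken tree's fit
        trees.append(tree)
    return trees

def boost_predict(trees, X, nu=0.1):
    """Sum of shrunken tree predictions."""
    return nu * sum(t.predict(X) for t in trees)
```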
For a more general framework, the forward stagewise additive modeling algorithm looks like this:

1. Initialize $f_0(x) = 0$.
2. For $m = 1$ to $M$:
   - Compute $(\beta_m, \gamma_m) = \arg\min_{\beta,\, \gamma} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big)$.
   - Set $f_m(x) = f_{m-1}(x) + \beta_m\, b(x; \gamma_m)$.
In a boosted tree model, the weak learner is a CART tree, and the model is fit in the forward stagewise manner by solving

$$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + T(x_i; \Theta_m)\big)$$

in order to find the regions and constants $\Theta_m = \{R_{jm}, c_{jm}\}_{j=1}^{J_m}$ of the next tree. At each step $m$, the solution tree is the one that maximally reduces the above objective, given the current model $f_{m-1}$ and its fits $f_{m-1}(x_i)$.
To compare the above with gradient descent: in gradient descent, we initialize with some value at the first step, then at each subsequent step we move in the direction of the negative gradient (the negative slope).
The tree predictions $T(x_i; \Theta_m)$ here are analogous to the components of the negative gradient. The main difference is that the tree predictions are constrained to be the predictions of a decision tree, whereas in gradient descent the negative gradient is the unconstrained maximal descent direction. In a gradient boosting machine, at the $m$-th iteration we train a tree $T(x; \Theta_m)$ whose predictions are as close as possible to the negative gradient in the least-squares sense:

$$\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \big(-g_{im} - T(x_i; \Theta)\big)^2,$$

where the gradient is defined as

$$g_{im} = \left[ \frac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)} \right]_{f = f_{m-1}}.$$
The following table shows the negative gradients of commonly used loss functions:

| Setting | Loss function $L(y_i, f(x_i))$ | Negative gradient $-\partial L / \partial f(x_i)$ |
| --- | --- | --- |
| Regression | $\tfrac{1}{2}\,(y_i - f(x_i))^2$ | $y_i - f(x_i)$ |
| Regression | $\lvert y_i - f(x_i) \rvert$ | $\operatorname{sign}\big(y_i - f(x_i)\big)$ |
| Classification | Multinomial deviance | $k$-th component: $I(y_i = G_k) - p_k(x_i)$ |
We can see that for squared-error regression, the negative gradient is just the ordinary residual $y_i - f(x_i)$.
The generic gradient tree boosting algorithm then looks like this:

1. Initialize $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$.
2. For $m = 1$ to $M$:
   - For $i = 1, \dots, N$, compute the pseudo-residuals $r_{im} = -\left[ \dfrac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}$.
   - Fit a regression tree to the targets $r_{im}$, giving terminal regions $R_{jm}$, $j = 1, \dots, J_m$.
   - For each region, compute $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i, f_{m-1}(x_i) + \gamma\big)$.
   - Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm}\, I(x \in R_{jm})$.
3. Output $\hat{f}(x) = f_M(x)$.
In summary, GBM involves three elements:
- A differentiable loss function to be optimized.
- A weak learner to make predictions, in this case CART.
- An additive model that adds weak learners to minimize the loss function. Trees are added one at a time, existing trees in the model are not changed, and a gradient descent procedure is used to minimize the loss when adding trees (a code sketch for a non-squared loss follows this list).
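A sketch of the generalized algorithm for absolute (LAD) loss, where the pseudo-residual is $\operatorname{sign}(y - f)$ and the per-leaf line search gives the median residual; this particular loss choice and the helper names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_lad(X, y, n_stages=100, nu=0.1, max_depth=2):
    """Gradient tree boosting sketch for absolute loss L(y, f) = |y - f|."""
    f = np.full(len(y), np.median(y))          # step 1: best constant fit
    stages = []
    for _ in range(n_stages):
        pseudo = np.sign(y - f)                # negative gradient of |y - f|
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, pseudo)
        leaves = tree.apply(X)
        # per-leaf line search: gamma_j = median residual inside leaf j
        gammas = {leaf: np.median((y - f)[leaves == leaf])
                  for leaf in np.unique(leaves)}
        f += nu * np.array([gammas[l] for l in leaves])   # shrunken update
        stages.append((tree, gammas))
    return np.median(y), stages, nu

def gbm_lad_predict(model, X):
    f0, stages, nu = model
    f = np.full(X.shape[0], f0)
    for tree, gammas in stages:
        leaves = tree.apply(X)
        f += nu * np.array([gammas[l] for l in leaves])
    return f
```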
XGBoost
XGBoost adds an explicit regularization term to the above formulation and, as shown below, uses a second-order approximation of the loss. Objective:

$$\text{Obj} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i\big) + \sum_{k=1}^{K} \Omega(f_k)$$
With $K$ trees, the model output is

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i),$$

where each $f_k$ is a CART tree. Therefore, the objective function becomes:

$$\text{Obj} = \sum_{i=1}^{n} l\Big(y_i, \sum_{k=1}^{K} f_k(x_i)\Big) + \sum_{k=1}^{K} \Omega(f_k)$$
Here, we cannot use methods such as SGD to find the $f_k$ (since they are trees, instead of just numerical vectors). So the solution is additive training: start from a constant prediction and add one new tree at a time,

$$\hat{y}_i^{(0)} = 0, \qquad \hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i).$$
For a regression problem with squared loss, we minimize the residual error at each step, as shown below:

$$\text{Obj}^{(t)} = \sum_{i=1}^{n} \Big(y_i - \big(\hat{y}_i^{(t-1)} + f_t(x_i)\big)\Big)^2 + \Omega(f_t) + \text{const} = \sum_{i=1}^{n} \Big[ 2\big(\hat{y}_i^{(t-1)} - y_i\big)\, f_t(x_i) + f_t(x_i)^2 \Big] + \Omega(f_t) + \text{const}$$
For a more general loss, we use a second-order Taylor series expansion, $f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \tfrac{1}{2} f''(x)\,\Delta x^2$. Then the objective function can be approximated as:

$$\text{Obj}^{(t)} \approx \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^2(x_i) \Big] + \Omega(f_t) + \text{const}$$

Here,

$$g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big).$$

Removing the constant terms, the objective function becomes:

$$\tilde{\text{Obj}}^{(t)} = \sum_{i=1}^{n} \Big[ g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^2(x_i) \Big] + \Omega(f_t)$$
Example: Regression (again)
For squared loss $l(y_i, \hat{y}) = (y_i - \hat{y})^2$, we get

$$g_i = \partial_{\hat{y}^{(t-1)}} \big(\hat{y}^{(t-1)} - y_i\big)^2 = 2\big(\hat{y}_i^{(t-1)} - y_i\big) \quad \text{and} \quad h_i = \partial^2_{\hat{y}^{(t-1)}} \big(\hat{y}^{(t-1)} - y_i\big)^2 = 2,$$

which recovers the residual-fitting objective above.
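As a small code illustration, $g_i$ and $h_i$ for squared loss, plus logistic loss as a second example (logistic loss is not derived in these notes; it is shown only for comparison):

```python
import numpy as np

def grad_hess_squared(y, y_pred):
    """g_i and h_i for squared loss l = (y - y_pred)^2."""
    return 2.0 * (y_pred - y), 2.0 * np.ones_like(y)

def grad_hess_logistic(y, y_pred):
    """g_i and h_i for logistic loss, labels y in {0, 1}, raw scores y_pred."""
    p = 1.0 / (1.0 + np.exp(-y_pred))   # predicted probability
    return p - y, p * (1.0 - p)
```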
We can reformulate the tree-building mechanism based on the above expansion. We define a tree by a vector of scores in its leaves, $w \in \mathbb{R}^T$ (where $T$ is the number of leaves), and a leaf index mapping function $q: \mathbb{R}^d \to \{1, \dots, T\}$ that maps an input instance to a leaf. An instance $x$ is routed down the splits of the tree to leaf $q(x)$, and its prediction is the score stored at that leaf:

$$f_t(x) = w_{q(x)}.$$
We can define (one possible definition) the regularization term based on the above representation as

$$\Omega(f_t) = \gamma\, T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2,$$

which penalizes both the number of leaves $T$ and the magnitude of the leaf scores $w_j$.
We can define the instance set of leaf $j$ as

$$I_j = \{\, i \mid q(x_i) = j \,\},$$

i.e. the indices of the training instances that fall into leaf $j$.
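A quick illustration of $q$ and the instance sets $I_j$ with scikit-learn, where `tree.apply` plays the role of the mapping function (its leaf "index" is scikit-learn's internal node id rather than $1, \dots, T$, and the toy data is made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 2)), rng.normal(size=8)
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

leaf_of = tree.apply(X)                     # q(x_i) for every training instance
instance_sets = {j: np.where(leaf_of == j)[0] for j in np.unique(leaf_of)}
print(instance_sets)                        # I_j: indices falling into each leaf
```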
We can regroup the objective by leaf:

$$\tilde{\text{Obj}}^{(t)} = \sum_{i=1}^{n} \Big[ g_i\, w_{q(x_i)} + \tfrac{1}{2}\, h_i\, w_{q(x_i)}^2 \Big] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 = \sum_{j=1}^{T} \Big[ \Big(\sum_{i \in I_j} g_i\Big) w_j + \tfrac{1}{2} \Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2 \Big] + \gamma T$$

This is a sum of $T$ independent quadratic functions, one per leaf score $w_j$.
Writing $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, and using the fact that a quadratic $G x + \tfrac{1}{2} H x^2$ (with $H > 0$) is minimized at $x^* = -G/H$ with minimum value $-\tfrac{1}{2}\, G^2 / H$, we have

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \tilde{\text{Obj}}^{(t)*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T.$$

The second expression is the structure score of the tree: the smaller it is, the better the tree structure $q$.
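A one-leaf version of these formulas in code (the helper name and the default $\lambda$, $\gamma$ values are illustrative):

```python
import numpy as np

def leaf_weight_and_score(g, h, lam=1.0, gamma=0.0):
    """Optimal weight and objective contribution of a single leaf,
    given the g_i and h_i of the instances in that leaf."""
    G, H = g.sum(), h.sum()
    w_star = -G / (H + lam)
    score = -0.5 * G**2 / (H + lam) + gamma   # + gamma: this leaf's share of gamma*T
    return w_star, score
```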
In principle, an exhaustive search could find the optimum tree at step $t$: enumerate every possible tree structure $q$, compute its structure score $\tilde{\text{Obj}}^{(t)*}$, and keep the structure with the smallest score together with its optimal leaf weights $w_j^*$. This is intractable, since the number of possible tree structures is essentially unbounded.
So, in practice we use a greedy algorithm: grow the tree from depth 0 and, for each leaf, try to add a split. Splitting a leaf into a left part $I_L$ and a right part $I_R$ changes the objective by the gain

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma,$$

where $G_L, H_L$ and $G_R, H_R$ are the sums of $g_i$ and $h_i$ over $I_L$ and $I_R$. We take the split with the largest gain; if the best gain is negative, the $\gamma$ penalty outweighs the improvement and the leaf is not split.
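The gain computation for a single candidate split, as a sketch (helper name and defaults are illustrative):

```python
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Gain of splitting one leaf into the instances in left_mask and the rest."""
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    score = lambda G, H: G**2 / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma
```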
Summary of XGBoost

At each iteration $t$: compute $g_i$ and $h_i$ for every instance under the current predictions $\hat{y}_i^{(t-1)}$; grow a tree greedily using the structure score and split gain above, with optimal leaf weights $w_j^* = -G_j/(H_j + \lambda)$; then add the new tree to the model with shrinkage, $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \epsilon\, f_t(x_i)$, where the step size $\epsilon$ again helps prevent overfitting.
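For reference, the symbols above map onto the xgboost library's parameters roughly as $\lambda \to$ `reg_lambda`, $\gamma \to$ `gamma`, $\epsilon \to$ `learning_rate`, and tree size $\to$ `max_depth`; a minimal usage sketch on made-up data:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

model = xgb.XGBRegressor(
    n_estimators=100,      # number of boosting rounds (trees)
    learning_rate=0.1,     # shrinkage / step size epsilon
    max_depth=3,           # limits tree size and hence interaction order
    reg_lambda=1.0,        # lambda: L2 penalty on leaf weights
    gamma=0.0,             # gamma: penalty per additional leaf
)
model.fit(X, y)
print(model.predict(X[:5]))
```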
Tree Size
A tree with $J$ terminal nodes can involve at most $J - 1$ variables along any prediction path, so $J$ controls the maximum interaction order of the model: with $J = 2$ (stumps) there are no interactions, with $J = 3$ two-variable interactions become possible, and so on.