# Machine Learning: XGBoost / Extreme Gradient Boosting

To understand the impact some disruptions can make, it is important to know how an algorithm such as XGBoost, has pushed the Machine Learning community by a leap. With successful results, and winning in multiple competitions, this algorithm is quite a breakthrough, and uses a trivial, yet essential concept to help find effective solutions.

### What is XGBoost ?

XGBoost is an ensemble learning algorithm, and it stands for Extreme Gradient Boosting. At times, it is important to take the reviews of a few people before we order for a pizza, isnt it ? Thus, this insufficiency of validated information is what is at the heart, and root, of XGBoost.
Now, what is an ensemble learning algorithm ?
Well, to put it simply, we combine the predictive power of multiple results, and the resultant is the aggregated output of several models. For example, if 7 out of 10 suggest that this object is a ball, we will rely on the opinion that ok, this object is a ball.

Did you know that CERN utilises the XGBoost algorithm to classify signals for the Large Hadron Collider ? You may ask why ? Well it’s because of the algorithms construct, scalability, and flexibility that allows it to process close to 3 petabytes (~3000 Terabytes) of data every year. The two common techniques used for statistical models using ensemble learners are Bagging and Boosting.

So, the models that form the ensemble, also called as base learners, could include learning algorithms from the same, or different learning algorithms. There are two predominant techniques used for ensemble learners, which are, Bagging and Boosting. We will discuss Bagging and Boosting in the sub-topic below.

### How does XGBoost work?

Before we begin with understanding how XGBoost works, let us understand what are these two essential concepts, Bagging, and Boosting.

#### Bagging

Intuition: Image we are in a supermarket, as a family of 5.
1. When three suggest that they want to eat pasta tonight, majority wins. We cook pasta. Winner: Pasta
2. Now, if 4 suggest that today we eat pasta, again majority wins. Winner: Pasta
3. If two want to eat pasta, and the other three want to eat alternative meals such as a kebab, pizza, and burger, respectively, again pasta wins. Winner: Pasta

I don’t want pasta to lose, because it is life. 😀

Now, the point here is that, in bagging, the model accepts inputs, and comes up with multiple decisions. The majority of the output received is what wins, and the model will give us an output by a majority vote – just like we did with pasta.

To give you the right example, this is EXACTLY how decision trees work. They create multiple decisions, and the one which has the highest votes wins.

#### Boosting

In boosting, the process is quite different. Here we learn from our mistakes, and thus end up making a strong predictor. The model learns sequentially, and the output of one becomes the input of the second, then the output of the second becomes the input of the third, so on – and so forth. Thus it iterates, and learns from it’s previous iteration.

The models are built sequentially such that each subsequent model aims to reduce the errors of the previous model. Each model learns from its predecessors. Hence, the model that grows next in the sequence will learn from an updated version of the errors. Thus, with every passing iteration of the model, we end up knowing “WHAT WE DON’T WANT“.

Intuition: The illustration below should give you a quick idea of how the algorithm works.

As we can see, the output of the First Model, becomes the input of the Second Model, and so on. With each passing model, the errors of the first become the benchmarks for the consequent one, and keep telling it, “OK, understand me clearly, you are NOT supposed to do this!

#### How does XGBoost learn from the errors?

I like to explain things with intuitions, because if you can imagine something, you can reproduce it. Sometimes we may not be able to explain it right away, but little by little, imagining what is really happening, in our little grey organ, over time, we are able to express ourselves with a lot more ease.

So, let us take a look at the below illustrations to understand what do we mean by that XGBoost learns from it’s errors ? func.in: In the above illustration, we see how the outputs of various models are essential in deciding the final output.

So, the illustration includes the Sun and the Moon. Let’s look at it in detail:
1. Output 1: 3/5 Suns correctly classified – 2/5 incorrectly classified
2. Output 2: 3/5 Suns correctly classified – 2/5 incorrectly classified
3. Output 3: 4/5 Suns correctly classified – 1/5 incorrectly classified – 2/5 Moons incorrectly classified

Now, let’s imagine this intuitively. In the final result, the output of each one of them is overlapped, and we end up with a final result with correct classification of the Sun and the Moon.

This is how, Boosting really works. It finds the intersection of all correct classifications, and gives us a final model with dependable results. Tada!

#### Why is the word Gradient used in XGBoost (Extreme Gradient Boosting) ?

Extreme Gradient Boosting gets it’s name from Gradient Descent. The loss function that is chosen by us, is minimised by using gradient descent to help us find the right parameters to ensure this loss function is minimum. The aim is to find those parameters where the loss function give us the MINIMUM results. This is how Gradient Boosting reduces the error. With every pasing sequence, the error reduces, all the way until we have found one final predictor which gives us the least difference in the actual and predicted values.

A combination of model, and computational, efficiency has ensured that this model turns out to be the most effective one among many available. It is Fast, Efficient, and Easy to implement.

### An intuitive explanation of the Mathematics behind XGBoost

Mathematics – a word that has given many little children, and adults alike, more sleepless nights than Exhorcist.

Well, to make things simple, for you, and me 🙂 I will provide you with a strong intuition of what is really happening when such an algorithm is put to use. The mathematics can be difficult for me to remember and recollect as well, but with an intuition it shall be simple to grasp the essence of the Algorithm.

The general mathematical formula: fm(x)=fm-1(x)+ γmhm(x)

Hence, the first iteration would be: f1(x)=f0(x)+γ1h1(x)

To explain the above, we are going to perform three important steps:

1. An initial model f0 is defined to predict the target variable y. This model will be associated with a residual (yf0) – this is known as the ‘Residual‘ which is the difference between the predicted value and the loss function.
2. A new model h1 is fit to the residuals from the previous step.
3. Now, f0 and h1 are combined to give f1, the boosted version of f0. To this model h1 we add a constant called γ1 which is the derived multiplicative factor.

Therefore, we achieve a general equation which can visualised as the following:

Now that we have seen what the general boosted model looks like, I have made a quick video for you to see what the model actually does. I enjoy learning with intuitions since it let’s me imagine the models, and numbers, in my mind.

#### Intuition of the Mathematics

Let’s consider the first square below with ‘x‘ which is 1/4 part of the square being the solution and the residue being 3/4 parts of the square, which is something we don’t need as part of our solution.

Now, over a few iterations, we obtain the following results: func.in: Assume that after each iteration, we have the following three results.

Now what happens after this is where things begin to make complete sense. The three results, overlap, and the end result is shown in a short 20 second below.

As we can see through Part 1 to Part 4 of the algorithm breakdown, of what is really happening at the heart of the algorithm. We achieve results MINUS the residues that are computed. The multiplicative factor is also derived using a mathematical formula. Here, for convenience, I have ensured that the intuition is strongly understood. Some simple ideas can create effective solutions.

### Sample Code

Being a quick resouce, here is a quick sample code for your perusal. Can be used as you would want. In another blog post, I shall explain the specifics of how can we hyper tune our model for enhanced results.

Example:

#Import Packages
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score

#Initialise the Model, I have used some hyperparameters here
xgb = xgb.XGBRegressor(objective =’reg:squarederror’,n_estimators=101,learning_rate=0.2, gamma=0,subsample=0.8,colsample_bytree=0.6, max_depth=43)

#Fit the Model
xgb.fit(x_train,y_train)

#Predict and Print
predictions = xgb.predict(x_test)
print(‘Accuracy Rate:’,round(explained_variance_score(y_test,predictions)*100),’%’)

### Conclusion

I have typed this tutorial with utmost excitement, and a drive to share what I have understood. While I implement the same algorithm at work, understanding the fundamentals of algorithms is key to moving a step ahead everyday while we float through this vast ocean of Artificial Intelligence and Machine Learning. In case you have any questions, or doubts, feel free to drop a message here. If you do observe any errors in the above post, do comment your observations. I would like to rectify the error immediately.

Stay tuned for more! 