Whether it’s computing Point of Interest similarities for our Resolve service, or predicting business categories from business names, we at Factual use Machine Learning to solve a wide variety of problems in our pursuit of quality data and products. One of the most powerful Machine Learning techniques we turn to is ensembling. Ensemble methods build surprisingly strong models out of a collection of weak models called base learners, and typically require far less tuning when compared to models like Support Vector Machines.

In this post I will attempt to explain why aggregating a collection of base learners is an effective way to build a strong model, and how to do so correctly. Most ensemble methods use decision trees as base learners, and many ensembling techniques, like Random Forests and Adaboost, are specific to tree ensembles. So, let’s start with a brief introduction to decision trees.

Decision Trees

Training a decision tree for classification (predicting a category) or regression (predicting a number) is actually fairly simple. You start with a table containing all of your rows. Then, for each column (aka “feature”), you assess every possible split point to divide the data into two new sets.

If you have a categorical variable with values like “red”, “green”, or “blue”, you could split on whether the value is “red” or not “red”; whereas if you have a real-valued variable like Age, you could split on the various Age values that exist in your training data. If a row’s value lies above the split point, it’s partitioned into one group; otherwise it’s partitioned into the other group. After that you continue to split the groups recursively to end up with a decision tree.

You measure the quality of a split point using metrics like Information Gain or the Gini impurity if you’re doing classification, and the reduction in standard deviation if you’re doing regression. Essentially, all these metrics measure how much you’ve reduced your uncertainty by picking that split point. By picking the best split each time, the greedy decision tree training algorithm tries to form decisions with as few splits as possible.
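To make the split-selection step concrete, here’s a minimal sketch in Python. The find_best_split helper and the income example are my own illustration (not Factual code): it scores every candidate split of a single numeric feature by how much it reduces the variance of the target.

```python
import numpy as np

def find_best_split(feature, target):
    """Scan every candidate split of one numeric feature (a NumPy array) and
    return the split point that most reduces the variance of the target."""
    order = np.argsort(feature)
    feature, target = feature[order], target[order]
    base_var = target.var()

    best_split, best_reduction = None, 0.0
    for i in range(1, len(feature)):
        if feature[i] == feature[i - 1]:
            continue  # identical values can't be separated
        split = (feature[i] + feature[i - 1]) / 2
        left, right = target[:i], target[i:]
        # Size-weighted average of the child variances.
        child_var = (len(left) * left.var() + len(right) * right.var()) / len(target)
        reduction = base_var - child_var
        if reduction > best_reduction:
            best_split, best_reduction = split, reduction
    return best_split, best_reduction

# Tiny made-up example: a noisy relationship between income and star rating.
rng = np.random.default_rng(0)
income = rng.uniform(20_000, 120_000, size=200)
stars = 1 + 4 * (income - 20_000) / 100_000 + rng.normal(0, 0.5, size=200)
print(find_best_split(income, stars))
```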

Finally, the leaf nodes of the tree represent predictions. In the case of classification you’d predict the most frequent category in the leaf set, and in the case of regression you’d predict the mean or median value in the leaf set.

To avoid overfitting with trees, one typically uses various stopping criteria to decide when to stop splitting. You can set a max depth for the tree, or a minimum on the number of rows in a leaf set. It’s common sense to stop splitting if your rows have output values that are all of the same category (in the classification case) or have zero variance (in the regression case). There are also various pruning techniques that reduce the size of a tree once it has been fully grown.
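In practice you rarely grow trees by hand. A short scikit-learn sketch (with toy data standing in for real features) shows how the stopping criteria above map onto common hyperparameters:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in data; in the hotel example below, X would hold neighborhood features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=500)

tree = DecisionTreeRegressor(
    max_depth=4,          # stop splitting below this depth
    min_samples_leaf=20,  # never create a leaf with fewer than 20 rows
    ccp_alpha=0.0,        # values > 0 prune the tree after it has been grown
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```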

Example: Predicting Hotel Star Ratings

Let’s say we wanted to predict how many stars a hotel has. We know that a hotel can have anywhere between 1 and 5 stars. So, if we assume that the distribution of stars across all hotels is uniform, then the Root Mean Square Error (RMSE) of making random guesses is 1.6 stars. We can be smarter by guessing 3 stars (the mean) every time, and this would give us an RMSE of 1.2 stars.
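If you want to check those numbers yourself, a quick simulation reproduces them up to rounding, assuming star ratings are uniformly distributed over the interval from 1 to 5:

```python
import numpy as np

rng = np.random.default_rng(0)
true_stars = rng.uniform(1, 5, size=1_000_000)     # assumed uniform ratings
random_guesses = rng.uniform(1, 5, size=1_000_000)

def rmse(pred, truth):
    return np.sqrt(np.mean((pred - truth) ** 2))

print(rmse(random_guesses, true_stars))  # ~1.63 stars for random guessing
print(rmse(3.0, true_stars))             # ~1.15 stars for always guessing the mean
```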

As an experiment, let’s see if we can do better using Geopulse data from Factual’s API to predict hotel star ratings. To do this I first queried Factual’s API for hotels in the Los Angeles area that have a non-blank star rating. Then I took the latitude and longitude values for each hotel and queried the Geopulse API for information about the surrounding area. In the end, I formed a table of data including “median income”, “commercial density”, and counts of various place categories in the area like “legal & financial”, “health & medicine”, “education”, “arts, entertainment & nightlife”, etc.

After partitioning the data into a training and test set, I trained a decision tree using the training set. The resultant tree predicted the star ratings of the withheld test set with an RMSE of 0.927 stars. This is a decent improvement on our uninformed guess of 3 stars for each hotel. Here’s what the resulting tree looks like.

Decision Tree

Each rectangle represents a split, where the “yes” (or true) branch is on the left and the “no” (or false) branch is on the right. The circles represent leaf nodes, and the number inside each circle is the value that would be predicted. It’s interesting that none of the predictions are greater than 3.8, which means the model would never correctly predict a 5 star hotel.
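For reference, the training and evaluation step for this experiment would look roughly like the sketch below. The file name and column names are placeholders for the Geopulse features described above, not the actual data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical export of the Geopulse features described above.
hotels = pd.read_csv("hotels.csv")
X = hotels.drop(columns=["star_rating"])   # median_income, commercial_density, ...
y = hotels["star_rating"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=0).fit(X_train, y_train)
rmse = np.sqrt(np.mean((tree.predict(X_test) - y_test) ** 2))
print(f"test RMSE: {rmse:.3f} stars")
```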

The Bias-Variance Tradeoff

One way to look at the problem of overfitting (and underfitting) a model like a decision tree is the bias-variance tradeoff. Let’s use regression as an example. You can view the target values (Y) as the sum of some unknown deterministic function (f(X)) mapping inputs (X) to target values, plus some random noise (R) which corrupts them. We can assume that the noise is zero-mean since any bias would be incorporated into f(X).

Y = f(X) + R

You can think of R as being randomness induced from not having perfect features or it could be a truly random source of corruption. Your best bet is to estimate f(x) by a function g(x) using the training data. With that in mind, it turns out you can decompose the Mean Square Error of your g(x) in predicting the real target Y into three terms: the irreducible error due to the variance in R (known as the Bayes Limit), the squared bias, and the variance.

E[(Y − g(X))²] = Var(R) + (E[g(X)] − f(X))² + Var(g(X))

The bias is the average error of g(x) in estimating f(x) over all possible inputs. Some of your errors may be positive, some may be negative, but if your model is unbiased, then they balance out over the long run. The variance captures how much spread your errors have in estimating f(x) with g(x), so a larger variance means the magnitude of your errors is often very large even though the errors might average out to zero. Notice that bias and variance are measured with respect to estimating f(x) and not Y, since R is random and therefore truly unpredictable.

Overfitting typically leads to a low bias but a very high variance, while underfitting gives you the opposite scenario. When you tune models you’re looking for a happy medium between bias and variance, and there will be a point that minimizes the prediction error overall.

Bias-Variance Graph
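If you’d like to see the tradeoff numerically, here’s a small simulation sketch (the function f and all of its parameters are made up for illustration): it repeatedly draws noisy training sets from a known f(X), fits a shallow tree and a fully grown tree, and estimates the squared bias and variance of each at a fixed test point.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)          # a made-up "unknown" deterministic f(X)
x_test = np.array([[0.5]])           # a single fixed test input

def bias_variance(max_depth, n_trials=500, n_rows=100, noise=0.3):
    preds = []
    for _ in range(n_trials):
        X = rng.uniform(0, 1, size=(n_rows, 1))
        y = f(X).ravel() + rng.normal(scale=noise, size=n_rows)  # Y = f(X) + R
        model = DecisionTreeRegressor(max_depth=max_depth).fit(X, y)
        preds.append(model.predict(x_test)[0])
    preds = np.array(preds)
    bias_sq = (preds.mean() - f(x_test)[0, 0]) ** 2
    return bias_sq, preds.var()

print("shallow tree (underfit):", bias_variance(max_depth=1))     # high bias, low variance
print("deep tree (overfit):    ", bias_variance(max_depth=None))  # low bias, high variance
```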

Squashing the Variance with Bagging

The big insight with regard to ensembling is that if you have a large number of models, each with zero or near-zero bias, then you can reduce the variance by averaging their respective answers. Think of it this way: each separate model gives a prediction that is somewhere near the right answer. If the predictions are well spread around the right answer, then their average should be a lot closer to the right answer than most of the individual predictions.

You may have noticed that there’s a glaring requirement in this scheme: each model in the ensemble has to give a different prediction. If each model gave the same prediction, then averaging the predictions would give you no benefit; it would come out to the same prediction as a single model. So if we’re using trees (or any repeated base learner for that matter) for our ensemble, how can we get them to give different predictions? Well, the answer could be: use different training sets. But we only have one training set, right?

Enter bagging (bootstrap aggregation). To generate different training sets from a single training set, we randomly sample the original training set. Some people sample with replacement, others don’t, but the main objective is to have the training sets be relatively large but as different as possible.

Then we train a completely overfit tree on each training set. Why do we want the trees to be overfit? Because then the bias is near zero. Our little averaging trick doesn’t work if the individual models have bias, because averaging only reduces variance; any bias the models share carries straight through and undoes the benefit we gain.
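Here’s a minimal hand-rolled sketch of the idea. scikit-learn packages the same recipe (roughly BaggingRegressor(DecisionTreeRegressor(), n_estimators=100)), but spelling it out makes the two ingredients, bootstrap sampling and averaging, explicit:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, n_trees=100, seed=0):
    """Train fully grown (intentionally overfit) trees on bootstrap samples.
    X and y are NumPy arrays."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        tree = DecisionTreeRegressor()              # no depth limit: near-zero bias
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    # Average the individual high-variance predictions to cut the variance.
    return np.mean([tree.predict(X) for tree in trees], axis=0)
```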

I trained a bagged tree model on the hotel data, and the resultant model produced an RMSE of 0.787 stars, which is definitely an improvement over the 0.927 star RMSE of the single decision tree.

Random Forests

So, the bagging trick is pretty exciting. This technique will almost always produce better results than a single base learner could. But how can we do even better than bagging? The answer lies in our technique for producing different base models. Yes, our individual decision trees are producing different predictions because they use slightly different training sets, but slightly is the operative word here.

We need even more difference between the individual trees to get better results. So, how can we introduce more randomness? Well, instead of considering every feature as a possible column to split on, we can also randomly sample which features to use. And not only do we randomly sample features for each tree, we actually sample a different set of features at each split of each tree. A model that uses this technique for creating more variation in the individual trees is called a Random Forest.

Notice that bagging could be applied to any base learning model, but Random Forests are specific to trees and this is why the name of the ensembling method alludes to trees: many trees make a forest.

In the case of our hotel data, the Random Forest improved on the result of the bagged trees, achieving an RMSE of 0.745 stars. As we train a Random Forest, we can estimate the test error using what’s called the out-of-bag (OOB) error. You do this by predicting the output of each row using only the trees whose bootstrap samples didn’t include that row. The following graph shows how the OOB error decreased as more and more trees were added to the ensemble.

Random Forest

Notice that the OOB error of the first tree is horrible. This is due to the individual trees being completely overfit to minimize bias. But, as more trees are added, the OOB error due to variance decreases very quickly. By the time there are 50 trees most of the improvement has been realized. You typically see a slower dropoff for larger datasets.
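Here’s a sketch of both ideas in scikit-learn, with toy data standing in for the hotel table: max_features controls the per-split feature sampling, warm_start lets us grow the forest incrementally, and oob_prediction_ gives the out-of-bag predictions used to compute the OOB error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for the hotel feature table.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=600)

forest = RandomForestRegressor(
    max_features=1.0 / 3,  # fraction of features sampled at every split (the p/3 heuristic for regression)
    warm_start=True,       # keep the already-trained trees when n_estimators grows
    oob_score=True,        # remember which rows each tree did not see
    random_state=0,
)

for n_trees in (10, 25, 50, 100, 200):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)
    # oob_prediction_ holds, for each row, the average prediction of the
    # trees whose bootstrap sample did not include that row.
    oob_rmse = np.sqrt(np.mean((forest.oob_prediction_ - y) ** 2))
    print(f"{n_trees:4d} trees: OOB RMSE = {oob_rmse:.3f}")
```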

Other Advancements in Tree Ensembling

So, I guess there really is wisdom in crowds. Random Forests are great models. They have very few tuning parameters (the number of trees and the number of features to sample at each split), and yet in a 2006 study of various machine learning algorithms, including SVMs, Neural Networks and other standard models, Random Forests were found to have among the best average performance, second only to boosted tree ensembles.

Boosting is a class of ensemble algorithms where each new base learner tries to correct the mistakes of the current ensemble in some fashion. This contrasts with Random Forests, where the individual base learners are unaware of each other’s performance. Adaboost is a popular boosting technique for classification. It assigns weights to the rows in your training set, and these weights get larger over time if a particular row is being frequently misclassified by the existing group of trees. The new trees use a modified split metric that incorporates the weights to try to take better care of the higher-weight rows.
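scikit-learn ships an implementation of Adaboost; a minimal classification sketch on made-up data looks like this:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Made-up two-class data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

# Boost shallow "stump" trees: each new stump pays extra attention,
# via the row weights described above, to rows the current ensemble misclassifies.
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200, random_state=0)
boosted.fit(X, y)
print(boosted.score(X, y))   # training accuracy of the boosted ensemble
```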

Another recent advancement in tree ensembling is Rotation Forests. These ensembles rotate the feature space with a different rotation matrix for each individual tree in the ensemble. This helps because trees can only form decision boundaries that are perpendicular to the feature axes. If your data happens to have optimal decision boundaries that are more “diagonal”, the trees will probably approximate those boundaries fairly poorly. There are some lofty claims that Rotation Forests can outperform Random Forests and Adaboost, but I guess only time will tell.
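For the curious, here is a much-simplified illustration of the core idea (not the actual Rotation Forest algorithm, which builds its rotations from PCA on random subsets of features): apply a different random rotation to the features seen by each tree, then average as in bagging.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def random_rotation(n_features, rng):
    # QR decomposition of a random Gaussian matrix gives a random orthogonal (rotation) matrix.
    q, _ = np.linalg.qr(rng.normal(size=(n_features, n_features)))
    return q

def fit_rotated_trees(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_trees):
        R = random_rotation(X.shape[1], rng)
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample, as in bagging
        tree = DecisionTreeRegressor().fit(X[idx] @ R, y[idx])
        models.append((R, tree))
    return models

def predict_rotated(models, X):
    # Apply each tree's own rotation before predicting, then average.
    return np.mean([tree.predict(X @ R) for R, tree in models], axis=0)
```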