# A Tutorial of Model Monotonicity Constraint Using Xgboost

Fitting a model with high accuracy is great, but is usually not enough. Quite often, we also want a model to be simple and interpretable. An example of such an interpretable model is linear regression, for which the fitted coefficient of a variable tells us, holding the other variables fixed, how the response changes with respect to that predictor. For a linear regression, this relationship is also monotonic: the fitted coefficient is either positive or negative.

## Model Monotonicity: An Example

Model monotonicity is sometimes applied in the real world. For example, if you apply for a credit card but get declined, the bank usually tells you the reasons (which you mostly don’t agree with) for its decision. You may hear things like your previous credit card balances are too high, etc. In fact, this means that the bank’s approval algorithm has a monotonically increasing relationship between an applicant’s credit card balance and his / her risk. Your risk score is penalized because of a higher-than-average card balance.

If the underlying model is not monotonic, you may well find someone with a credit card balance \$100 higher than you but otherwise identical credit profiles getting approved. To some extent, forcing the model monotonicity reduces overfitting. For the case above, it may also improve fairness.

## Beyond Linear Models

It is possible, at least approximately, to force a model monotonicity constraint in a non-linear model as well. For a tree-based model, if for each split of a particular variable we require the right daughter node’s average value to be higher than the left daughter node’s (otherwise the split will not be made), then this predictor’s relationship with the dependent variable is approximately monotonically increasing; and vice versa.
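The split rule above can be sketched as a small check (a hypothetical helper for illustration, not Xgboost’s actual internal API):

```python
def split_allowed(left_mean, right_mean, constraint):
    """Decide whether a candidate split respects a monotonicity constraint.

    constraint: +1 for increasing, -1 for decreasing, 0 for unconstrained.
    left_mean / right_mean: average response in the two daughter nodes.
    """
    if constraint == 1:    # increasing: right daughter must not be lower
        return right_mean >= left_mean
    if constraint == -1:   # decreasing: right daughter must not be higher
        return right_mean <= left_mean
    return True            # unconstrained: any split is fine
```

For an increasing constraint, a candidate split whose left daughter averages higher than its right daughter is simply rejected.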

This monotonicity constraint has been implemented in the R gbm package. Very recently, the author of Xgboost (one of my favorite machine learning tools!) also implemented this feature in Xgboost (issue #1514). Below I made a very simple tutorial for this in Python. To follow this tutorial, you will need the development version of Xgboost from the author:
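At the time of writing, that meant building Xgboost from source (a sketch; build steps may differ on your system, and recent Xgboost releases ship monotone constraints out of the box, so a plain `pip install xgboost` now suffices):

```shell
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
make -j4
cd python-package
python setup.py install
```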

## Tutorial for Xgboost

I’m going to use the California Housing dataset [1] for this tutorial. This dataset consists of 20,640 observations. Each observation represents a neighborhood in California. The response variable is the median house value of a neighborhood. Predictors include the median income, average house occupancy, and location of that neighborhood.

First we load the data. If you use scikit-learn, you just need to do:
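For example (assuming a recent scikit-learn, where `fetch_california_housing` supports `as_frame`; the data is downloaded on first use):

```python
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
X = housing.data    # 8 predictors: MedInc, HouseAge, AveRooms, AveBedrms,
                    # Population, AveOccup, Latitude, Longitude
y = housing.target  # median house value, in units of $100,000

print(X.shape)      # (20640, 8)
```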

To start, we use a single feature, the median income, to predict the house value. We first split the data into training and testing datasets. Then we use 5-fold cross-validation with early stopping on the training dataset to determine the best number of trees. Last, we use the entire training set to train the model and evaluate its performance on the test set.

Notice the model parameter 'monotone_constraints'. This is where the monotonicity constraints are set in Xgboost. For now I set 'monotone_constraints': (0), which means a single feature without constraint.

Let’s take a look at the model performances:

We can also check the relationship between the feature (median income) and the dependent variable (median house value):

Here I wrote a helper function partial_dependency to calculate the variable dependency, or partial dependency, for an arbitrary model. The partial dependency [2] describes how the average response depends on a predictor when the other variables are held fixed.

Without any monotonicity constraint, the relationship between the median income and median house value looks like this:

One can see that at very low incomes and at incomes around 10 (in the dataset’s income units), the relationship between median income and median house value is not strictly monotonic.

You may be able to find some explanations for this non-monotonic behavior (e.g. feature interactions). In some cases, it may even be a real effect which still holds true after more features are fitted. If you are very convinced about that, I suggest you not enforce any monotonic constraint on the variable, otherwise important relationships may be ignored. But when the non-monotonic behavior is purely because of noise, setting monotonic constraints can reduce overfitting.

For this example, it is hard to justify that neighborhoods with a low median income have a high median house value. Therefore we will try enforcing the monotonicity on the median income:

We then repeat the CV procedure, refit the model, and evaluate it on the test set. Below is the result:

Compared to before, the training error slightly increased while the testing error slightly decreased. We may have reduced overfitting and improved our performance on the test set. However, given that the statistical uncertainties on these numbers are probably as large as the differences, this is only a hypothesis. For this example, the bottom line is that adding the monotonicity constraint does not significantly hurt the performance.

Now we can check the variable dependency again:

Great! Now the response is monotonically increasing with the predictor. This model has also become a bit easier to explain.

We can also enforce monotonicity constraints while fitting multiple features. For example:

We assume that the median house value is positively correlated with the median income and house age, but negatively correlated with the average house occupancy.

Is it a good idea to enforce monotonicity constraints on features? It depends. For the example here, I didn’t see a significant performance decrease, and I think the directions of these variables make intuitive sense. For other cases, especially when the number of variables is large, it may be difficult and even dangerous to do so. It certainly relies on a lot of domain expertise and exploratory analysis to fit a model that is “as simple as possible, but no simpler”.

(The iPython Notebook of this tutorial can be found here)

## Bibliography

1. T. Hastie, R. Tibshirani and J. Friedman, “Elements of Statistical Learning Ed. 2”, Springer, 2009, pp. 371-375

2. J. Friedman, “Greedy Function Approximation: A Gradient Boosting Machine”, Annals of Statistics, 2001, pp. 1189-1232

Written on November 22, 2016