Permutation Feature Importance: Deep Dive


Written by:
Dennis Turp (Data Scientist at Modulos)

When we work with Machine Learning models, we often report the model’s score; e.g. “my model reached an accuracy of 0.9” or “my R2 score is 0.85”. These performance estimators are easy to understand and practical when benchmarking models against each other. However, they reduce the complexity of the model to a single number. When a company then uses these models to build real applications, new questions arise which cannot be answered with a single number. For example: “Which of my input features is the model relying on to make predictions?”, “Are those predictions trustworthy even for unseen data instances?” or “My model is performing exceptionally well/poorly. Did we make mistakes when preparing the input data?”.

These are all valid questions that we should answer before using a model in a real-life setting. This article will show how permutation feature importance can be used to address some of these issues. 

What is permutation feature importance, and how do we calculate it?

For each feature, permutation feature importance measures the effect that shuffling its values has on the model’s prediction error. If shuffling a feature increases the model error, the feature is deemed important by this measure. This makes intuitive sense: if the model relies heavily on a feature, corrupting that feature by permutation should noticeably degrade the predictions. Conversely, permuting a feature the model does not rely on should leave the prediction error largely unchanged. Figure 1 gives a visual explanation of how permutation feature importance is computed:

Figure 1: A visual explanation of how to calculate the feature importance value for one input feature. The upper row shows the table with the original data and the predictions made by the model. The bottom row shows the same data, but with permuted values for Feature_2, and the corresponding predictions made by the model on the permuted data. If the model relies heavily on Feature_2 for its predictions, the feature importance value will be large. If, on the other hand, the model does not rely on Feature_2, permuting it will barely impact the predictions and the feature importance value will be small.

This pseudo-code illustrates the computation:

  • Input: Trained model $M$, Feature Matrix $X$, labels $y$, error function $E(y, M)$
  • Calculate the original model error $E_{orig} = E(y, M(X)) $
  • For each feature $j$ in $(1, …, P)$ do:
    • For each repetition $r$ in $(1,…,R)$ do:
      • Randomly shuffle column $j$ of the feature matrix $X$ to create a permuted data set $X^{jr}_{perm}$.
      • Estimate error $E^{jr}_{perm} = E(y,M(X^{jr}_{perm}))$ based on the predictions of the permuted data.
    • Compute the feature importance value $FI_{j}=\frac{1}{R}\sum_r(|E_{orig} -E_{perm}^{jr}|)$
  • Sort all features by descending $FI_j$
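
The pseudo-code translates almost line by line into Python. The following is a minimal sketch, assuming a fitted model that exposes a scikit-learn-style predict method and an error function such as mean squared error; the function name and its parameters are our own choices, not part of any library:

    import numpy as np

    def permutation_feature_importance(model, X, y, error_fn, n_repeats=10, seed=0):
        """Permutation feature importance, following the pseudo-code above.

        model    -- fitted estimator exposing .predict(X)
        X        -- feature matrix of shape (n_samples, n_features)
        y        -- true labels or targets
        error_fn -- callable error_fn(y_true, y_pred) returning a scalar error
        """
        rng = np.random.default_rng(seed)
        e_orig = error_fn(y, model.predict(X))                 # E_orig
        importances = np.zeros(X.shape[1])
        for j in range(X.shape[1]):                            # features 1..P
            diffs = []
            for _ in range(n_repeats):                         # repetitions 1..R
                X_perm = X.copy()
                X_perm[:, j] = rng.permutation(X_perm[:, j])   # shuffle column j only
                e_perm = error_fn(y, model.predict(X_perm))    # E_perm^{jr}
                diffs.append(abs(e_orig - e_perm))             # |E_orig - E_perm^{jr}|
            importances[j] = np.mean(diffs)                    # FI_j
        return importances

Note that scikit-learn ships a ready-made variant of this computation, sklearn.inspection.permutation_importance [4]; it reports the mean drop in score rather than the mean absolute error difference used above.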


Now that we have illustrated how feature importance is calculated, let’s look at how it can help us understand our Machine Learning models.

Build trust in the computation and analysis

As a first analysis, let us look at how feature importance can be used to build trust in the predictions of our Machine Learning models. For that, we will use the “Diabetes” dataset. Kaggle describes this dataset in the following way: “This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.” [1]

We use the Modulos AutoML platform to search for the best model and hyperparameter combination for the diabetes dataset. We pick the model with the highest score; in this case, the model reaches an accuracy of 0.779. Whether this level of accuracy is sufficient for the task in question is up to medical professionals to decide. However, to build trust in our system, we should be able to explain which features our model relies on to make predictions. After calculating the feature importance for the diabetes dataset, we get the following result.

One can see that the most important feature for predicting whether a patient has diabetes is the glucose level. This result makes intuitive sense and helps to build confidence in the system. If, for example, the model heavily relied on the SkinThickness feature and ignored the Glucose feature altogether, a medical professional would likely deem the model unreliable, even though the accuracy might seem sufficient.
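
The importance values shown in this post were computed with the Modulos AutoML platform. Purely as an illustration, a comparable analysis can be run with scikit-learn’s permutation_importance [4]; the sketch below assumes the Kaggle CSV [1] has been downloaded locally as diabetes.csv with its usual Outcome target column:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    # assumed local copy of the Kaggle dataset [1]
    df = pd.read_csv("diabetes.csv")
    X, y = df.drop(columns="Outcome"), df["Outcome"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # n_repeats plays the role of R in the pseudo-code above
    result = permutation_importance(model, X_test, y_test,
                                    scoring="accuracy", n_repeats=30,
                                    random_state=0)
    ranking = sorted(zip(X.columns, result.importances_mean),
                     key=lambda item: item[1], reverse=True)
    for name, importance in ranking:
        print(f"{name}: {importance:.3f}")

On this dataset, Glucose should again come out on top, mirroring the ranking discussed above.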

Debug and audit input data

For the following example, we use the bike-sharing dataset from the UCI Machine Learning Repository [2]. Using this dataset, one can forecast the demand for rental bikes based on temperature, weekday features, etc. We pick the model that reaches an R2 score of 0.98, which is almost perfect. Looking at the feature importance graphic, we can see that the only essential features for the model’s decision are the number of bikes rented by registered users and the number of casual bike rentals.

Taking a closer look at those features, we realize that the quantity we want to predict, the total number of bike rentals, is simply the sum of the registered and casual rentals. Since both features were present during training, creating a model with an almost perfect score was easy. In a real-world scenario, however, the registered and casual rental numbers are unknown to the rental service in advance. Since those two numbers are not available at inference time, we made a mistake in our data preparation.

Thus, the feature importance graphic revealed that we made a mistake in our data processing.
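
This failure mode is easy to reproduce on synthetic data: if the target is the sum of two input columns, permutation feature importance concentrates on exactly those columns and flags the leakage. A small sketch, with synthetic data standing in for the bike-sharing columns (not the actual UCI dataset):

    import numpy as np
    from sklearn.inspection import permutation_importance
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 1_000
    temperature = rng.normal(20.0, 8.0, n)
    registered = rng.poisson(150, n).astype(float)
    casual = rng.poisson(40, n).astype(float)

    X = np.column_stack([temperature, registered, casual])
    y = registered + casual   # leaked target: total rentals = registered + casual

    model = LinearRegression().fit(X, y)
    result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
    for name, importance in zip(["temperature", "registered", "casual"],
                                result.importances_mean):
        print(f"{name}: {importance:.2f}")
    # temperature contributes ~0; the two leaked columns dominate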

Caveats

So far, we have seen that feature importance can be a helpful tool to analyze and understand how Machine Learning models generate predictions. But there are certain pitfalls to be aware of and conclusions to avoid when looking at feature importance plots:

1. Permutation feature importance values are always model-specific. For different models, different features can be important, so an importance ranking should not be transferred from one model to another.

2. Performing too few permutations when computing the feature importance can lead to false or inaccurate results. With only a few repetitions, the feature that appears most important can change from run to run; the ranking only stabilizes as the number of repetitions $R$ grows.

3. Strong correlations between features can reduce the measured importance of the correlated features. The Machine Learning model learns to rely on the information present in all of them instead of depending on only one, so permuting a single feature leaves much of the shared information intact. Adding features that are strongly correlated with feature_0, for example, decreases the importance attributed to feature_0, as demonstrated in the sketch below.
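
This dilution effect can be reproduced on synthetic data: duplicating an informative feature, with a little noise so that the copies are strongly but not perfectly correlated, visibly shrinks its measured importance. A minimal sketch, with all names and parameters chosen purely for illustration:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance

    rng = np.random.default_rng(0)
    n = 2_000
    feature_0 = rng.normal(size=n)
    noise_feature = rng.normal(size=n)
    y = 3.0 * feature_0 + rng.normal(scale=0.1, size=n)

    for n_copies in (0, 1, 3):
        # near-duplicates of feature_0 (correlation close to 1)
        copies = [feature_0 + rng.normal(scale=0.05, size=n)
                  for _ in range(n_copies)]
        X = np.column_stack([feature_0, noise_feature, *copies])
        model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
        result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
        print(f"{n_copies} correlated copies -> "
              f"importance of feature_0: {result.importances_mean[0]:.2f}")

Because the copies carry the same information as feature_0, permuting feature_0 alone costs the model less and less as more copies are added.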

Permutation feature importance with Modulos

In the Modulos AutoML release 0.4.1, we introduced permutation feature importance for a limited set of datasets and ML workflows. For these workflows, the Modulos AutoML platform computes the permutation feature importance for all solutions. The static plots and feature importance data shown in this blog post were automatically created using the Modulos AutoML software. If you are interested in knowing more or trying out the platform, don’t hesitate to contact us.

If you found this explanation insightful, feel free to share it!

References

[1] https://www.kaggle.com/uciml/pima-indians-diabetes-database
[2] https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
[3] https://christophm.github.io/interpretable-ml-book/feature-importance.html
[4] https://scikit-learn.org/stable/modules/permutation_importance.html