AutoML v.0.4.5: Docker Solutions & Advanced Classification

Written by:
Anna Weigel (CTO at Modulos)

Modulos AutoML version 0.4.5 is available! With this version, we are making it even easier to deploy and utilize our Machine Learning (ML) Solutions. For the first time, this release also allows you to predict probabilities for classification tasks. Among many other enhancements and refinements, we have also added new ML models and objectives.


Solutions: New Dockerfiles and REST APIs

The Modulos AutoML platform finds the best feature extractor, model, and hyperparameter combinations for your use cases. These stand-alone Solutions allow you to generate predictions and are available for download. As part of this release, we have added new functionalities that make it easier to deploy and to use Solutions.

To generate predictions, you can now choose between even more options!

You can rely on the purely python-based Solution clients that we have been shipping for a while: the online client for single-sample predictions, the batch client for multi-sample predictions, or the forecast client for time series workflows. The jupyter notebook shows you how to use these clients in practice, and all requirements are easy to install with a single script.

Example python Solution client. The new Solution server makes it very easy to generate predictions, as this example shows: simply define your feature values and send a request via the REST API. And you do not have to use python! Requests can be sent from a language of your choice.

With the new release, most Solutions now also include everything needed for a client-server setup. The interface to generate predictions, i.e. the REST API, is standardized. This means that you do not have to change the format of your requests if you swap out Solutions. The Solution server can be run in pure python mode or it can be launched within a Docker container. The Docker container neatly bundles all Solution dependencies into a single package, making it easier to deploy your prediction service. Additionally, you are no longer restricted to python and can generate predictions via a programming language of your choice!
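To make this more concrete, here is a minimal client-side sketch in python. The host, port, route, and feature payload below are placeholder assumptions for illustration only; the actual endpoint and request schema of your Solution server are described in the Solution readme.

    # Minimal sketch of a prediction request to a running Solution server.
    # The URL, port, route, and feature names are hypothetical placeholders.
    import requests

    sample = {"age": 42, "contract_type": "monthly", "monthly_charges": 70.5}
    response = requests.post("http://localhost:5001/predict", json=sample, timeout=10)
    response.raise_for_status()
    print(response.json())  # e.g. the predicted category or per-category probabilities

Because the interface is a plain REST API, the same request could just as well be sent with curl or from any other language.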

You can find more information in the on-platform documentation and in the Solution readme, both of which have been completely reworked to introduce these new deployment options.


Classification with a probabilistic outcome

Among many other use cases, classification workflows allow you to predict if a customer is likely to churn in the near future, what kind of risk group a patient falls into, or if there is a manufacturing defect in a specific product. But how certain are these statements?

In this release, we enable “classification with a probabilistic outcome”. This feature allows you to gauge how definitive the predictions of a classification task are.

As for a traditional classification workflow, you upload a dataset with categorical labels (e.g. customer will churn or will not churn) to the Modulos AutoML platform. But now you can produce Solutions that compute the probability per available category (e.g. probability of the customer churning: 80%, not churning: 20%), rather than the category directly! Of course, you can still infer the original label from such Solutions, e.g. by simply picking the category with the highest probability.
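As a sketch of that last step (assuming the probabilities arrive as a simple category-to-probability mapping, which may differ from the actual Solution output format):

    # Recover a hard label from per-category probabilities by picking the maximum.
    probabilities = {"will churn": 0.80, "will not churn": 0.20}  # example prediction
    predicted_label = max(probabilities, key=probabilities.get)
    print(predicted_label)  # -> "will churn"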

Examples from the MNIST dataset. You are now able to predict the probability of a specific image containing a certain digit. This allows you to gauge how certain these predictions are. For clearly written digits the probability for a specific category is high (e.g. “1”). For digits that are difficult to identify, the probability is spread across different categories (e.g. “4”, “8”). A traditional classification workflow returns the category with the highest probability (highlighted in green/red).
Example case: MNIST

The MNIST database includes a large number of images containing handwritten digits and the corresponding labels. We can use the Modulos AutoML platform to create an ML Solution that directly predicts which digit each image contains. Or we can build a Solution that returns the probability of a specific image showing a “0”, “1”, “2”, etc.

ROC curve for binary classification. The ROC-AUC objective aims to maximize the area under the curve. For a perfect classifier, this area would be equal to 1. For binary classification with a probabilistic outcome, this figure is available in the Solution readme.

To use this feature, make sure to choose the new ROC-AUC objective during workflow creation. The goal of the ROC-AUC objective is to maximize the area under the curve (AUC) of the receiver operating characteristic (ROC) curve.

Returning to the churn example from above: let’s assume our model predicts that a customer will churn with a 55% probability. The big advantage of probabilities is that you decide where to draw the line: you can place this customer in the “will churn” category because the probability is higher than 50%, or, if you want to be more conservative, assign them to the “will not churn” group.

The ROC curve shows how the true positive and false positive rates change as a function of this probability threshold. A perfect classifier model is very good at distinguishing positive and negative outcomes. For such a model, we expect the true positive rate to be high even for the highest cut-offs, and it should not change significantly as we vary the threshold. For a perfect classifier, the area under the ROC curve is hence equal to 1.
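The following sketch (using scikit-learn on made-up churn probabilities, not the Modulos pipeline itself) shows how each threshold yields one point on the ROC curve and how the AUC summarizes them:

    # Illustrative only: toy labels and predicted churn probabilities.
    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])          # 1 = "will churn"
    y_prob = np.array([0.92, 0.40, 0.55, 0.78, 0.15, 0.61, 0.83, 0.30])

    # Each threshold gives one (false positive rate, true positive rate) point.
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    print(roc_auc_score(y_true, y_prob))                  # area under the ROC curve

    # Turning probabilities back into hard labels with a conservative 60% cut-off.
    y_pred = (y_prob >= 0.6).astype(int)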

At the moment, classification with a probabilistic outcome is available for labels that are either boolean or contain more than three categories. For more information, please see the on-platform documentation.


Module updates & new python version 

ML modules available in version 0.4.5 of the Modulos AutoML platform.

In addition to the binary and macro versions of the ROC-AUC objective, we have also added two new models. Version 0.4.5 of the Modulos AutoML platform contains k-Nearest Neighbor (kNN) models for classification and regression. For classification, a sample’s category is determined by the class that is most common amongst its nearest neighbors. For regression, predictions are determined by averaging the values of a sample’s nearest neighbors. This makes kNN a simple yet intuitive and easy-to-interpret ML algorithm. kNN models are also fast, since they do not have to be trained, and are well suited for small datasets.
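As an illustration of the underlying idea (a generic scikit-learn sketch, not the Modulos kNN modules themselves, whose hyperparameters are chosen by the platform):

    from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

    X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
    y_class = ["A", "A", "B", "B"]   # categorical labels
    y_value = [1.2, 1.4, 7.9, 8.4]   # continuous targets

    # Classification: majority vote among the k closest training samples.
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_class)
    print(clf.predict([[1.2, 2.1]]))  # -> ['A']

    # Regression: average of the k closest training samples' target values.
    reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_value)
    print(reg.predict([[1.2, 2.1]]))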

For the random forest and ridge regression models, we have refined the range of possible hyperparameter configurations. For example, this makes the random forest more robust for unbalanced datasets. It also increases the performance of ridge regression on a wider variety of datasets.

Since python 3.6 is reaching its end of life soon, we have dropped support for it. Instead, we have added support for python 3.8 and 3.9, extending the range of supported versions to python 3.7, 3.8, and 3.9. Consequently, you can now also use these more recent python versions to generate predictions.

Finally, we have updated many of the third-party libraries used within our code base. This increases the overall robustness and security of the platform. It also ensures that the generated ML Solutions are up to date.


Other improvements and bug fixes

  • Improved the installation and setup procedure. The update procedure now allows users to easily skip Modulos AutoML platform versions. We also greatly improved the user experience of the automl tool by refining its messages, adding clearer user prompts, and making the storage locations of the various backups more flexible.
  • Increased the speed of the identity feature extractor by changing its batch size.
  • Included all categories in the confusion matrix figure, even rare ones that are not part of the validation dataset.
  • Streamlined the files shipped in the Solution, thereby decreasing its size.
  • Added the workflow type and label categories to the Solution readme.
  • Fixed inconsistency on the license page.