In this blog post, Modulos Sales Manager Florian Marty describes his experience using AutoML to solve a current scientific challenge using his biology domain knowledge.
- Domain knowledge is critical to successfully implementing AutoML.
- Modulos AutoML is a low-code, easy-to-use, automated machine learning platform that allows you to generate predictive machine-learning models in minutes.
- AutoML, combined with expert domain knowledge, can significantly accelerate your data science pipelines.
Can We Use AutoML To Predict COVID-19 Immunity?
It all started back in early November 2020, when I was speaking with a potential customer about Modulos AutoML.
After some back-and-forth about a demo, the potential customer sent me an email with a link to a data set, additional information about the relevant columns, and the requested task:
We want to know if a T-cell of a person is specific against Sars-CoV 2 and whether this person, therefore, is immune against CoViD-19. Based on the sequence of the TCRs, predict if a TCR would bind to an antigen of Sars-CoV 2. The target is named “SARS-CoV-2” in the column “Epitope species.” The columns MHC A, MHC B, MHC class, Epitope, Epitope gene may not be used as features for the algorithm (they contain information about the virus and would therefore generate information leakage).
The task is to predict if a person is immune to COVID-19 or not based on the information generated from a wide variety of assays and sequencing experiments and, ultimately, put together in one curated database.
If this is possible, such a model could be used, for example, to determine who should get vaccinated first in an initial phase of vaccination when there is not yet enough vaccine available.
Data Preparation Using Domain Expertise
To build a binary classification model, the data must first be prepared, as in any machine learning project. The key variable (CDR3) is a single, concatenated string and, as such, cannot be used as input in the current version of Modulos AutoML.
At this stage, my background in biology and a Ph.D. involving mass spectrometry helped me quickly realize that there are many ways to transform a peptide (a string of amino acids) into numerical values, and that specific biochemical properties can be added to enrich the dataset.
After some time spent reviewing the available libraries and tools, I decided to use the following two: modlAMP (Copyright (c) 2016–2019 ETH Zurich, Switzerland; Alex Müller, Gisela Gabernet, Gisbert Schneider) and pyOpenMS.
First, I used pyOpenMS to generate a simple series of b- and y-ions for every CDR3 peptide string, similar to how peptide identification is done in the lab using mass spectrometry.
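The exact pyOpenMS calls are not shown in this post; as an illustration of what a b-/y-ion series is, here is a minimal stdlib-only sketch that computes singly charged fragment m/z values from standard monoisotopic residue masses (the function name and the example sequence are hypothetical):

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728  # monoisotopic proton mass (Da)
WATER = 18.01056  # monoisotopic H2O mass (Da)

def fragment_ions(peptide):
    """Return singly charged b- and y-ion m/z series for a peptide string.

    b_i = sum of the first i residue masses + one proton
    y_i = sum of the last i residue masses + water + one proton
    The series runs up to length-1, as is usual for internal fragments.
    """
    b = [sum(MONO[aa] for aa in peptide[:i]) + PROTON
         for i in range(1, len(peptide))]
    y = [sum(MONO[aa] for aa in peptide[-i:]) + WATER + PROTON
         for i in range(1, len(peptide))]
    return b, y

# A short, made-up CDR3-like sequence for illustration.
b_ions, y_ions = fragment_ions("CASSLG")
```

Each peptide string thus becomes two fixed series of numbers, which is exactly the kind of numeric representation AutoML can consume.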
Next, I used modlAMP to generate a set of peptide descriptors, including important biochemical properties.
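The specific modlAMP descriptors used are not listed here; to give a flavor of what "peptide descriptors" means, here is a small hand-rolled sketch computing a few classic biochemical properties (length, Kyte-Doolittle mean hydropathy, and a crude net charge) — the function is illustrative, not the modlAMP API:

```python
# Kyte-Doolittle hydropathy values per amino acid residue.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def simple_descriptors(peptide):
    """A few numeric descriptors for a peptide string: length,
    mean hydropathy (GRAVY score), and a crude net charge at
    neutral pH (basic residues minus acidic residues)."""
    gravy = sum(KD[aa] for aa in peptide) / len(peptide)
    charge = (peptide.count("K") + peptide.count("R")
              - peptide.count("D") - peptide.count("E"))
    return {"length": len(peptide), "gravy": gravy, "net_charge": charge}
```

Descriptors like these turn every CDR3 string into a row of numbers that can sit alongside the fragment-ion features.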
Additional data manipulation resulted in a dataset with two classes for the target variable (Epitope species), where class 1 represented SARS-CoV-2 and class 0 represented all other species.
As the dataset was highly imbalanced (1000:1), I decided to upsample the minority class. But before upsampling, I split the dataset into train and test data using commonly available libraries with an 80:20 train:test split, so that duplicated minority samples could not leak into the test set.
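The split-before-upsampling order can be sketched with the standard library alone (the actual post used common data science libraries; the function and data here are hypothetical):

```python
import random

def split_then_upsample(samples, labels, test_frac=0.2, seed=42):
    """Split into train/test FIRST, then upsample the minority class
    in the training set only, so no duplicated row leaks into test."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]

    # Upsample: draw minority-class rows with replacement until balanced.
    minority = [row for row in train if row[1] == 1]
    majority = [row for row in train if row[1] == 0]
    if minority and len(minority) < len(majority):
        train = train + rng.choices(minority, k=len(majority) - len(minority))
    return train, test
```

Doing it the other way around (upsample, then split) would place copies of the same minority samples in both train and test, inflating the measured performance.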
All of this took a couple of hours of data preparation work, resulting in three files (among them test_org) ready to use in Modulos AutoML.
Generating Machine Learning Models In Minutes With AutoML
I logged into the Modulos AutoML demo platform, which is powered by 8 CPUs without any GPU, and uploaded the prepared dataset.
Through the platform, I selected my target variable (Label) and the variables to exclude from training (CDR3). On the page with the models and feature engineering options, I went with the default settings provided by the platform.
The platform provides an unbiased set of models and feature engineering algorithms that are applicable to the dataset and the machine learning task; in this case, a classification problem.
Next, I needed to select the optimizer strategy. Generally speaking, the aim of hyperparameter optimization in machine learning is to find the hyperparameters of a given machine-learning algorithm that return the best performance as measured on a validation set. There are many optimizers used for hyperparameter optimization including manual search, grid search, random search, and Bayesian search. Inside the platform, Modulos allows you to choose between random search and Bayesian search.
At a high level, Bayesian optimization methods are efficient because they choose the next hyperparameters in an informed manner.
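To make the simpler of the two strategies concrete, here is a minimal random-search sketch over a toy validation objective; the parameter names and the objective function are hypothetical, not taken from the Modulos platform:

```python
import random

def random_search(score_fn, space, n_trials=50, seed=0):
    """Random hyperparameter search: sample each parameter uniformly
    from its range and keep the best-scoring configuration."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy "validation score" peaking at learning_rate=0.1, max_depth=6
# (depth is treated as continuous here for simplicity).
toy_score = lambda p: -((p["learning_rate"] - 0.1) ** 2
                        + 0.01 * (p["max_depth"] - 6) ** 2)
best, _ = random_search(toy_score, {"learning_rate": (0.001, 0.3),
                                    "max_depth": (2, 10)})
```

A Bayesian optimizer replaces the blind `rng.uniform` sampling with proposals informed by all previously evaluated configurations, which is why it typically needs fewer trials.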
On the objective page, I used my acquired knowledge of the data and domain expertise and chose the F1 Score (binary) as the objective for the workflow.
Inside the platform, on the objective selection page, we can see an explanation of the objective.
Modulos AutoML internally splits the data into a train/validation set and uses the validation set to improve the chosen objective. To prevent overfitting to the validation data, the platform has some built-in mechanisms that will soon be extended to allow expert users a higher level of control.
So, within two minutes of loading the data onto the platform, the training can start.
After training overnight, I downloaded the best machine learning solution and looked at the confusion matrix to evaluate the model.
To put the confusion matrix into perspective and judge the model performance, it is important to think about the consequences of being incorrectly classified.
Let’s go back to our assumption case that the model is used to determine if a person should get vaccinated or not.
A true label of 1 predicted as 0 means a person who is already immune gets vaccinated anyway. A true label of 0 predicted as 1 means a person who is not immune is predicted to be immune and, as such, will not be vaccinated. This is the error that affects people directly, so it is good to see that our model produces fewer (relative and absolute) misclassifications of this kind.
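The scores used for these comparisons follow directly from the confusion-matrix counts. A small sketch, with hypothetical counts rather than the actual numbers from these models:

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, and F1 (binary) from confusion-matrix counts,
    treating class 1 (SARS-CoV-2 specific) as the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 2 false negatives.
precision, recall, f1 = binary_metrics(tp=8, fp=2, fn=2, tn=88)
```

Note that F1 weighs the false positives (not immune, predicted immune) and false negatives (immune, vaccinated anyway) together, which is why it needs to be read alongside the raw confusion matrix.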
Next, I trained on the upsampled dataset with the very same settings, e.g., the F1 Score (binary) objective.
Again I downloaded the best solution and checked the confusion matrix provided by the platform on the validation data split.
So, here none of the class 1 samples were misclassified, but twice as many class 0 samples were predicted to be class 1. Of course, due to the higher absolute number of class 1 samples from the upsampling, the overall F1 Score (binary) is much better.
In my opinion, this model is far worse than the best model generated from the original dataset.
To further check this and test both models' performance on unseen data, I used the two solutions to generate predictions on the test split.
| | XGBoost from original | XGBoost from upsampled |
| --- | --- | --- |
| F1 Score (binary) | 0.6666 | 0.6909 |
Again the model from the original data performs better in terms of precision, but has a lower F1 Score (binary).
Now, as stated above, with some more time and careful selection of descriptors and other variables, there might be some room for improvement on the final model.
With Modulos AutoML, I was able to produce a series of machine learning models in under half a day, without any prior experience building machine learning models myself. All that is needed is a good understanding of the data (domain expertise) and some publicly available libraries to prepare the data. Eventually, such a model could be used to screen people for immunity to SARS-CoV-2 and thus help the authorities judge who should get vaccinated first.