Predicting COVID-19 Immunity using AutoML

In this blog post, Modulos Sales Manager Florian Marty describes his experience using AutoML to solve a current scientific challenge using his biology domain knowledge.

Take-home messages

  • Domain knowledge is critical to successfully implementing AutoML.
  • Modulos AutoML is a low-code, easy-to-use, automated machine learning platform that allows you to generate predictive machine-learning models in minutes.
  • AutoML, combined with expert knowledge of data science, can significantly speed up your data science pipelines.

Can We Use AutoML To Predict COVID-19 Immunity?

It all started back in early November 2020, when I was speaking with a potential customer about Modulos AutoML.

After some back-and-forth discussion about a demo, the potential customer sent me an email with a link to a data set, additional information about the relevant columns, and the requested task:

We want to know if a T-cell of a person is specific against Sars-CoV 2 and whether this person, therefore, is immune against CoViD-19. Based on the sequence of the TCRs, predict if a TCR would bind to an antigen of Sars-CoV 2. The target is named “SARS-CoV-2” in the column “Epitope species.” The columns MHC A, MHC B, MHC class, Epitope, Epitope gene may not be used as features for the algorithm (they contain information about the virus and would therefore generate information leakage).

The data used in this experiment. The CDR3 column is a string which had to be decomposed using biology expertise.

The task is to predict if a person is immune to COVID-19 or not based on the information generated from a wide variety of assays and sequencing experiments and, ultimately, put together in one curated database.

If this is possible, such a model could be used, for example, to determine who should get vaccinated first in an initial phase of vaccination when there is not yet enough vaccine available. 

Data Preparation Using Domain Expertise

To make a binary classification model, the data must first be prepared, as in any machine learning project. The key variable (CDR3) is a single, concatenated string and, as such, cannot be used as input in the current version of Modulos AutoML.

At this stage, thanks to years of experience in biology and a Ph.D. in mass spectrometry, I quickly realized that there are many ways to transform a peptide (a string of amino acids) into numerical values. Further, specific biochemical properties can be added to enrich the dataset.

After some time spent reading about the many available libraries and tools, I decided to use the following two: modlAMP (Copyright (c) 2016 – 2019 ETH Zurich, Switzerland; Alex Müller, Gisela Gabernet, Gisbert Schneider) and pyOpenMS.

First, I used pyOpenMS to generate a simple series of b- and y-ions for every CDR3 peptide string, similar to how this would be done in the lab when using mass spectrometry for peptide identification.
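To make the idea of a b- and y-ion series concrete, here is a minimal sketch in plain Python. It does not use pyOpenMS and is not the code used for this experiment; it only illustrates how a peptide string can be turned into a list of numbers from standard monoisotopic residue masses. The function name `by_ions` and the example sequence are assumptions made for illustration.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O, part of every y-ion (C-terminal fragment)
PROTON = 1.007276   # charge carrier for singly protonated ions


def by_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        # b-ion i: the first i residues (N-terminal prefix) plus a proton.
        b_ions.append(sum(masses[:i]) + PROTON)
        # y-ion i: the last i residues (C-terminal suffix) plus water and a proton.
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


# Hypothetical CDR3-like sequence, used only as an example input.
b_series, y_series = by_ions("CASSLGQ")
```

Each peptide of length n yields n - 1 values per series, so peptides of different lengths would still need padding or aggregation before they fit into a fixed-width table.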

Next, I used modlAMP to generate a set of peptide descriptors, including important biochemical properties.
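As an illustration of what a peptide descriptor is, the sketch below computes a few simple ones in plain Python. It is not modlAMP and does not reproduce its descriptor set; the function name `simple_descriptors` and the choice of descriptors (length, mean Kyte-Doolittle hydropathy, a crude net charge, aromatic fraction) are assumptions for this example.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def simple_descriptors(peptide):
    """Turn a peptide string into a small dictionary of numeric features."""
    length = len(peptide)
    # GRAVY: mean hydropathy over all residues.
    gravy = sum(KYTE_DOOLITTLE[aa] for aa in peptide) / length
    # Crude net charge: +1 per K/R, -1 per D/E (ignores His and the termini).
    positive = sum(peptide.count(aa) for aa in "KR")
    negative = sum(peptide.count(aa) for aa in "DE")
    aromatic = sum(peptide.count(aa) for aa in "FWY")
    return {
        "length": length,
        "gravy": gravy,
        "net_charge": positive - negative,
        "aromatic_fraction": aromatic / length,
    }


# Hypothetical CDR3-like sequence, used only as an example input.
features = simple_descriptors("CASSLGQ")
```

Unlike the ion series, these descriptors have the same number of columns for every peptide, which makes them convenient inputs for a tabular model.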

Additional data manipulation resulted in a dataset with two classes for the target variable (Epitope species), where class 1 represented SARS-CoV-2 and class 0 represented all other species.
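The labeling step, together with the customer's instruction not to use the leaky columns, could look roughly like the sketch below. The column names come from the customer's brief; the helper name `prepare_row`, the `Label` output key, and the example row values are assumptions for illustration, not the actual preparation script.

```python
# Columns the customer excluded because they describe the virus itself.
LEAKY_COLUMNS = {"MHC A", "MHC B", "MHC class", "Epitope", "Epitope gene"}
TARGET_COLUMN = "Epitope species"


def prepare_row(row):
    """Drop leaky columns and replace the target column with a binary Label."""
    prepared = {
        key: value
        for key, value in row.items()
        if key not in LEAKY_COLUMNS and key != TARGET_COLUMN
    }
    # Class 1 is SARS-CoV-2; every other species becomes class 0.
    prepared["Label"] = 1 if row[TARGET_COLUMN] == "SARS-CoV-2" else 0
    return prepared


# Made-up example row; the values are placeholders, not real database entries.
example = {
    "CDR3": "CASSLGQ",
    "MHC A": "placeholder",
    "Epitope": "placeholder",
    "Epitope species": "SARS-CoV-2",
}
prepared = prepare_row(example)
```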

As the dataset was highly imbalanced (1000:1), I decided to upsample the minority class. Before upsampling, however, I split the dataset into train and test data with an 80:20 train:test split using commonly available libraries, so that no upsampled duplicates could end up in the test data.
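The order of operations matters here: split first, then upsample only the training part. A minimal sketch using only the Python standard library is shown below. The function names, the `Label` key, and the use of random duplication as the upsampling method are assumptions; the original work used commonly available libraries that are not named in this post.

```python
import random


def stratified_split(rows, label_key="Label", test_fraction=0.2, seed=0):
    """Split rows into train/test, keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({row[label_key] for row in rows}):
        group = [row for row in rows if row[label_key] == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def upsample_minority(train, label_key="Label", minority=1, seed=0):
    """Duplicate random minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority_rows = [row for row in train if row[label_key] == minority]
    majority_rows = [row for row in train if row[label_key] != minority]
    n_extra = max(0, len(majority_rows) - len(minority_rows))
    extra = rng.choices(minority_rows, k=n_extra)
    return majority_rows + minority_rows + extra


# Toy data: 1000 negatives and 10 positives, identified by a running id.
rows = [{"id": i, "Label": 0} for i in range(1000)]
rows += [{"id": 1000 + i, "Label": 1} for i in range(10)]

train_org, test_org = stratified_split(rows)
train_upsampled = upsample_minority(train_org)
```

Because the duplication happens after the split, every row in `test_org` stays unseen by both training sets.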

All of this data preparation took a couple of hours and resulted in three files (train_org, train_upsampled, and test_org) ready to use in Modulos AutoML.

Generating Machine Learning Models In Minutes With AutoML

I logged into the Modulos AutoML demo platform that is powered by 8 CPUs without any GPU and uploaded the train_org and train_upsampled datasets. 

Through the platform, I selected my target variable (Label) and the variable to exclude from training the machine-learning models (CDR3). On the page with the models and feature engineering options, I went with the default settings provided by the platform.

Summary of the set-up of a workflow in AutoML.

The platform provides an unbiased set of models and feature engineering algorithms that are applicable to the dataset and the machine learning task; in this case, a classification problem.

Next, I needed to select the optimizer strategy. Generally speaking, the aim of hyperparameter optimization in machine learning is to find the hyperparameters of a given machine-learning algorithm that return the best performance as measured on a validation set. There are many optimizers used for hyperparameter optimization including manual search, grid search, random search, and Bayesian search. Inside the platform, Modulos allows you to choose between random search and Bayesian search.

At a high level, Bayesian optimization methods are efficient because they choose the next hyperparameters in an informed manner. 
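Random search, the simpler of the two strategies, can be sketched in a few lines. The example below is a generic illustration and not how Modulos AutoML implements its optimizer; the toy objective, the hyperparameter names, and the search ranges are all assumptions. In a real workflow, the objective would train a model and return its score on the validation set.

```python
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Sample hyperparameters uniformly at random and keep the best trial."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            name: rng.uniform(low, high) for name, (low, high) in space.items()
        }
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Stand-in for 'train a model and score it on the validation set'."""
    # Peaks at learning_rate = 0.1 and max_depth = 6 (an arbitrary choice).
    return (
        -((params["learning_rate"] - 0.1) ** 2)
        - 0.01 * (params["max_depth"] - 6.0) ** 2
    )


space = {"learning_rate": (0.01, 0.3), "max_depth": (2.0, 10.0)}
best_params, best_score = random_search(toy_validation_score, space)
```

A Bayesian optimizer keeps the same outer loop but replaces the uniform sampling with a proposal informed by the scores of earlier trials.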

On the objective page, I used my acquired knowledge of the data and domain expertise and chose the F1 Score (binary) as the objective for the workflow.

Inside the platform, on the objective selection page, we can see an explanation of the objective.

Modulos AutoML internally splits the data into a train/validation set and uses the validation set to improve the chosen objective. To prevent overfitting to the validation data, the platform has some built-in mechanisms that will soon be extended to allow expert users a higher level of control. 

So, within two minutes of loading the data onto the platform, the training can start.

After training overnight, I downloaded the best machine learning solution and looked at the confusion matrix to evaluate the model.

To put the confusion matrix into perspective and judge the model performance, it is important to think about the consequences of being incorrectly classified.

Let’s go back to our assumed use case, in which the model is used to determine if a person should get vaccinated or not.

True label 1 and predicted as 0 means we are already immune but will get vaccinated anyway. True label 0 and predicted 1 means we are not immune but predicted to be immune and as such will not be vaccinated. This is the error that will affect people directly. Good to see that our model is producing fewer (relative and absolute) misclassifications here.
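The two kinds of error map directly onto the cells of the confusion matrix, and precision and F1 follow from those counts. The sketch below uses made-up labels purely for illustration; none of the numbers are taken from the models in this post.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not immune, predicted immune
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # immune, predicted not immune
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def binary_metrics(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = binary_metrics(tp, fp, fn)
```

In the vaccination scenario, the false positives (`fp`) are the costly errors, and precision is the metric that penalizes them.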

Next, I trained on the upsampled dataset with the very same settings, i.e., with F1 Score (binary) as the objective.

Again I downloaded the best solution and checked the confusion matrix provided by the platform on the validation data split.

So, here we did not misclassify any of the class 1 samples but have twice as many class 0 samples predicted to be class 1. Of course, due to the higher absolute number of samples resulting from the upsampling, the overall F1 Score (binary) is much better.

In my opinion, this is far worse than the best model generated by the original sample.

To further check this and test both models' performance on unseen data, I used the two solutions to generate predictions on the test split.

Metric               XGBoost from original   XGBoost from upsampled
Confusion matrix     (figure)                (figure)
F1 Score (binary)    0.6666                  0.6909
Precision            0.8292                  0.7755

Comparison of the models on unseen test data.

Again the model from the original data performs better in terms of precision, but has a lower F1 Score (binary).
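Since the F1 score is the harmonic mean of precision and recall, the recall implied by the reported numbers can be worked out from the two metrics above. The sketch below does this; note that the recall values are derived here and were not reported as part of the original comparison.

```python
def recall_from_f1_and_precision(f1, precision):
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    return f1 * precision / (2 * precision - f1)


# (F1, precision) pairs as reported for the unseen test data.
reported = {
    "XGBoost from original": (0.6666, 0.8292),
    "XGBoost from upsampled": (0.6909, 0.7755),
}

implied_recall = {
    name: recall_from_f1_and_precision(f1, precision)
    for name, (f1, precision) in reported.items()
}

for name, recall in implied_recall.items():
    print(f"{name}: implied recall {recall:.3f}")
```

This works out to a recall of roughly 0.56 for the model trained on the original data and roughly 0.62 for the model trained on the upsampled data. In other words, the upsampled model finds more of the true class 1 samples at the cost of more false positives, which is exactly the trade-off discussed above.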

Now, as stated above, with some more time and careful selection of descriptors and other variables, there might be some room for improvement on the final model, which could eventually be used to screen people for immunity to SARS-CoV-2 and thus for not needing the vaccine up front.

Summary

With Modulos AutoML, I was able to produce a series of machine learning models in under half a day, without any prior experience in building machine learning models myself. All that is needed is a good understanding of the data (domain expertise) and some publicly available libraries to prepare the data. Eventually, such a model could be used to screen people for immunity to SARS-CoV-2 and thus help the authorities judge who gets vaccinated first.
