Modulos v1.1.1: Data Quality for Image Use Cases

Written by:
Claudio Bruderer (Head of Product at Modulos)

We are happy to announce the release of the Modulos Platform version 1.1.1! This latest version of our platform is available to our customers now. It contains several new features that support you in training Machine Learning (ML) Solutions and in improving the quality of your datasets with a Data-Centric Artificial Intelligence (DCAI) approach. First, we are extending our data cleaning capabilities to datasets containing images. Next, we are enhancing the data quality improvement recommendations with insight plots and further information. Lastly, besides many other improvements, we are extending some of our ML modules so that you can now also apply them to datasets containing text features.

Data Quality Management for Image Use Cases

In our previous two releases of the Modulos Platform (v1.0.0 and v1.1.0), we launched our Data-Centric AI platform. The core of the DCAI philosophy is to shift the focus away from fine-tuning ML models. Instead, it emphasizes the quality of the input datasets, which are the key ingredient for ML after all. Our platform achieves that with the Data Quality Management (DQM) feature. It gives users the necessary tools to identify and diagnose data samples that have a negative impact on the performance of their ML Solutions (e.g., accuracy, but also various fairness metrics). It also enables the domain experts, who know their data best, to decide which data quality improvement recommendations are appropriate for their use case in order to reach an improvement goal.

With this release, we are also making DQM available for image datasets. This unlocks a range of use cases. For instance, you may want to prioritize which image labels to acquire for the largest positive impact on performance. Or you may want to identify wrongly labeled images and/or samples with faulty feature values, for example wrongly labeled images of faulty machine parts.

Top 24 samples to be reviewed in the first data cleaning iteration, as they have affected the ML Solution’s performance the most negatively.

Let us take a closer look at this feature using an example use case based on the MNIST dataset. We have a set of images of handwritten digits and would like to train an ML Solution that classifies each image by the digit it shows. For our example, we have removed the label values of 25% of the samples and mixed up the labels of a further 25% of the samples. In other words, half of the training dataset contains either no label or a wrong label.
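
To make this setup concrete, here is a minimal sketch of how such a corrupted label set could be produced. It is an illustration only, not the procedure used for the example above: the function name corrupt_labels, the fixed seed, and the choice to replace a mixed-up label with a random different class are all assumptions, and a short list of toy labels stands in for the real MNIST labels.

```python
import random

def corrupt_labels(labels, classes, frac_missing=0.25, frac_wrong=0.25, seed=0):
    """Return a copy of `labels` with some removed (None) and some made wrong.

    Hypothetical helper for illustration; not part of the Modulos Platform.
    """
    rng = random.Random(seed)
    n = len(labels)
    indices = list(range(n))
    rng.shuffle(indices)

    n_missing = int(n * frac_missing)
    n_wrong = int(n * frac_wrong)
    missing_idx = indices[:n_missing]
    wrong_idx = indices[n_missing:n_missing + n_wrong]

    corrupted = list(labels)
    for i in missing_idx:
        corrupted[i] = None  # label removed
    for i in wrong_idx:
        # Pick any class other than the true one, so the label is guaranteed wrong.
        corrupted[i] = rng.choice([c for c in classes if c != labels[i]])
    return corrupted

# Toy stand-in for the MNIST labels: 1000 digits from 0 to 9.
true_labels = [i % 10 for i in range(1000)]
noisy_labels = corrupt_labels(true_labels, classes=range(10))
```

With these defaults, a quarter of the entries in noisy_labels are None, another quarter differ from the true label, and the remaining half are untouched.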

We then proceeded as follows. We started by training a first ML Solution. Unsurprisingly, the initial performance (accuracy in this case) was not great. We then applied our DQM feature with the goal of improving the accuracy of our Solution. It identifies the samples that work against this goal the most and assesses their impact. The top 24 samples identified in the first cleaning iteration are shown above. Out of these 24 samples, the platform recommended fourteen to be labeled, as they would add significant information. Eight of the images were correctly identified as wrongly labeled and flagged as samples with a significant negative impact. Lastly, only two of the samples recommended for cleaning were actually correctly labeled. That is significantly fewer than one would expect when picking labeled samples for cleaning at random (~7 of the ten labeled samples, since two thirds of the labeled samples are correct).
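
The "~7" baseline follows directly from the corruption rates stated earlier. The short calculation below spells it out; the variable names are ours and chosen only for readability.

```python
from fractions import Fraction

frac_unlabeled = Fraction(1, 4)   # labels removed
frac_mislabeled = Fraction(1, 4)  # labels mixed up
frac_correct = 1 - frac_unlabeled - frac_mislabeled  # 1/2

# Among samples that still carry a label, the share that is labeled correctly.
share_correct_among_labeled = frac_correct / (frac_correct + frac_mislabeled)  # 2/3

top_n = 24
recommended_for_labeling = 14
labeled_in_top = top_n - recommended_for_labeling  # 10 samples carry a label

# Expected number of correctly labeled samples if the 10 were drawn at random.
expected_correct_if_random = float(labeled_in_top * share_correct_among_labeled)
print(round(expected_correct_if_random, 1))  # 6.7, i.e. about 7
```

Against this random baseline of about seven, only two correctly labeled samples ended up among the recommendations.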

Iteratively reviewing the top identified samples and retraining the Solution then increases the performance quickly. In our case, after only two cleaning iterations (40% of the training dataset), the accuracy of the Solution had increased by several percentage points and matched the performance on fully cleaned data.

Insights Into DQM Tasks

Example fairness plots showing confusion matrices for different classes of the protected attribute. These and other plots help you understand the sources of data flaws (e.g., issues with fairness) and effectively improve the data quality of your training data, which yields better ML Solutions.

With every release since Modulos v1.0.0 (released in May), we have been adding more capabilities to our Data Quality Management feature. Not only is it applicable to ever more use cases (e.g., image use cases), it is also increasingly effective at identifying data flaws and making recommendations on how to clean them. To act on these recommendations, it is important to understand where your data quality may be lacking and to visualize the outputs. We are therefore including plots of the distributions of the DQM modules’ output values. These can help you diagnose the sources of your data quality issues more effectively and understand their impact.

To illustrate this, let us consider the use case of fairness in consumer lending. In the example presented in one of our recent webinars, the models trained before applying DQM yield well-performing ML Solutions with a high accuracy. However, when evaluating the equalized odds fairness metric for the protected attribute “gender”, we find a gender bias. By looking at some of the newly added plots (e.g., confusion matrices for the different populations as shown above), we see how the ML Solution behaves depending on the gender.
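
For readers less familiar with the metric: equalized odds asks that the true positive rate and the false positive rate be the same for every group of the protected attribute, and both rates can be read off the per-group confusion matrices. The sketch below shows one common way to summarize a violation as a single number. It is a generic illustration rather than the platform's implementation, and the tiny dataset is invented purely to show the calculation.

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels.

    Assumes the inputs contain at least one positive and one negative sample.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

def equalized_odds_gap(y_true, y_pred, group):
    """Largest between-group gap in TPR or FPR; 0 means equalized odds holds."""
    per_group = {}
    for g in set(group):
        idx = [i for i, x in enumerate(group) if x == g]
        per_group[g] = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
    tprs = [r[0] for r in per_group.values()]
    fprs = [r[1] for r in per_group.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

# Invented toy data: 8 samples per group, 1 = loan approved.
group  = ["f"] * 8 + ["m"] * 8
y_true = [1, 1, 1, 1, 0, 0, 0, 0,  1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0,  1, 1, 1, 1, 1, 0, 0, 0]

gap = equalized_odds_gap(y_true, y_pred, group)
print(gap)  # 0.5: TPR is 0.5 for "f" and 1.0 for "m"
```

In this toy case overall accuracy is still 13 out of 16, which mirrors the situation described above: a Solution can look accurate while treating the groups differently.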

By then computing and applying our data quality improvement recommendations, we can address these issues! The platform identifies the samples with a negative impact on the equalized odds metric. Cleaning and/or removing these samples gradually makes the dataset, and thus also the Solution, fairer.

Support for Text Features

Let us imagine the following situation: Say that you want to train an ML model to triage support tickets more efficiently (e.g., by predicting the time to resolve the issue in advance). To do that, you would create an extract from a ticketing system and use it as your input dataset. While this extract contains various numerical and categorical data (e.g., priority, affected services), it also contains some text features like the ticket title and the ticket description.

The information encoded in these text features is valuable, as it contains what is needed to troubleshoot the issues. Thus, it definitely should be included when training ML Solutions and when applying them to new data.

We have now added feature extractors that handle text features. This extends the range of use cases covered by the Modulos Platform. It ensures that the Solutions can also make use of this information.
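
As a rough idea of what a text feature extractor does, the sketch below turns ticket titles into numeric word-count vectors (a simple bag-of-words encoding). This is a generic technique chosen for illustration; the extractors actually shipped with the platform are not described here and may work quite differently. The function names and the example tickets are made up.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def fit_vocabulary(texts, max_features=1000):
    """Map the most frequent tokens to column indices."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    most_common = [tok for tok, _ in counts.most_common(max_features)]
    return {tok: i for i, tok in enumerate(sorted(most_common))}

def transform(texts, vocabulary):
    """Encode each text as a vector of token counts over the vocabulary."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for tok in tokenize(text):
            if tok in vocabulary:  # tokens unseen during fitting are ignored
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

# Made-up ticket titles.
tickets = [
    "Login page returns error 500",
    "Cannot reset password, error on login",
]
vocab = fit_vocabulary(tickets)
features = transform(tickets, vocab)
```

Each row of features can then sit alongside the numerical and categorical columns of the same ticket, so that a model sees all of them together.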

Other Improvements and Fixes

  • Added support for Python 3.10 for deploying the Modulos Platform or its Solutions and dropped support for the soon-to-be-outdated Python 3.7.
  • Added more fairness objectives (Statistical Parity and Predictive Parity) as available DQM improvement objectives.
  • Added a new Keras neural network, which uses pretrained weights (EfficientNetV2 and MobileNetV1).
  • Refined the look and feel of the platform’s error and success messages.
  • Fixed an issue where previews of the selected datasets were sometimes unavailable when configuring ML training workflows.
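
For reference, the two fairness objectives named in the list above are commonly defined as follows: statistical parity compares the share of positive predictions across groups, and predictive parity compares the precision (the share of positive predictions that are correct) across groups. The sketch below computes both gaps under these textbook definitions. It is not the platform's implementation, and the toy data is invented.

```python
def selection_rate(y_pred):
    """Share of samples that receive a positive prediction."""
    return sum(y_pred) / len(y_pred)

def precision(y_true, y_pred):
    """Share of positive predictions that are correct.

    Assumes at least one positive prediction.
    """
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == 1]
    return sum(predicted_pos) / len(predicted_pos)

def parity_gaps(y_true, y_pred, group):
    """Return (statistical parity gap, predictive parity gap) across groups."""
    by_group = {}
    for g in sorted(set(group)):
        idx = [i for i, x in enumerate(group) if x == g]
        yt = [y_true[i] for i in idx]
        yp = [y_pred[i] for i in idx]
        by_group[g] = (selection_rate(yp), precision(yt, yp))
    sel = [v[0] for v in by_group.values()]
    ppv = [v[1] for v in by_group.values()]
    return max(sel) - min(sel), max(ppv) - min(ppv)

# Invented toy data: 8 samples per group, 1 = positive outcome.
group  = ["f"] * 8 + ["m"] * 8
y_true = [1, 1, 1, 1, 0, 0, 0, 0,  1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0,  1, 1, 1, 1, 1, 0, 0, 0]

sp_gap, pp_gap = parity_gaps(y_true, y_pred, group)
# Selection rates are 0.25 ("f") and 0.625 ("m"); precisions are 1.0 and 0.8.
```

A gap of zero means the corresponding parity criterion is met exactly; larger values indicate a stronger disparity between the groups.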