Modulos v1.1.1: Data Quality for Image Use Cases

Claudio Bruderer

Written by:
Claudio Bruderer (Head of Product at Modulos)

We are happy to announce the release of the Modulos Platform version 1.1.1! This latest version of our platform is available to our customers now. It contains several exciting new features that support you in training Machine Learning (ML) Solutions and improving the quality of your datasets with a Data-Centric Artificial Intelligence (DCAI) approach. First, we are extending our data cleaning capabilities to datasets containing images. Next, we are enhancing the data quality improvement recommendations with insight plots and further information. Lastly, besides many other improvements, we are extending some of our ML modules so that you can now also apply them to datasets containing text features.


Data Quality Management for Image Use Cases

In our previous two releases of the Modulos Platform – v1.0.0 and v1.1.0 – we launched our Data-Centric AI platform. At the core of the DCAI philosophy is shifting the focus away from fine-tuning ML models. Instead, it emphasizes the quality of the input datasets, which are the key ingredient for ML after all. Our platform achieves this with the Data Quality Management (DQM) feature. It gives users the tools necessary to identify and diagnose data samples with a negative impact on their ML Solutions’ performance (e.g., accuracy, but also various fairness metrics). It also enables domain experts, who know their data best, to decide which data quality improvement recommendations are appropriate for their use case in order to reach an improvement goal.

With this release, we are making DQM available for imaging datasets as well. This unlocks a range of new use cases. For instance, you may want to prioritize which image labels to acquire for the largest positive impact on performance. Or you may want to identify wrongly labeled images and/or samples with faulty feature values, for example wrongly labeled images of faulty machine parts.

Top 24 samples in the first data cleaning iteration to be reviewed, as they affected the ML Solution’s performance most negatively.

Let us take a closer look at this feature using an example based on the MNIST dataset. We have a set of handwritten digits and would like to train an ML Solution to correctly classify the images according to the digit they depict. For our example, we removed the label values of 25% of the samples and mixed up the labels of a further 25%. In other words, half of the training dataset contains either no label or a wrong one.
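As an aside, label noise of this kind is easy to simulate. The helper below is our own illustrative sketch (not part of the Modulos Platform): it drops the labels of one fraction of samples and flips another fraction to a wrong class.

```python
import numpy as np

def corrupt_labels(labels, drop_frac=0.25, flip_frac=0.25, n_classes=10, seed=0):
    """Drop labels for one fraction of samples (set to NaN) and flip another
    fraction to a wrong class. Illustrative only, not a platform API."""
    rng = np.random.default_rng(seed)
    noisy = labels.astype(float)  # float copy so NaN can mark missing labels
    n = len(noisy)
    idx = rng.permutation(n)
    n_drop = int(drop_frac * n)
    n_flip = int(flip_frac * n)
    drop_idx = idx[:n_drop]
    flip_idx = idx[n_drop:n_drop + n_flip]
    noisy[drop_idx] = np.nan  # simulate missing labels
    # shift by 1..n_classes-1 modulo n_classes, guaranteeing a different class
    offsets = rng.integers(1, n_classes, size=n_flip)
    noisy[flip_idx] = (noisy[flip_idx] + offsets) % n_classes
    return noisy
```

With the default fractions, exactly half of the dataset ends up with a missing or wrong label, matching the setup of our experiment.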

We then proceeded in the following way: We started by training a first ML Solution. Unsurprisingly, the initial performance (accuracy in this case) was not great. We then applied our DQM feature with the aim of improving the accuracy of our Solution. It identified the samples detracting most from this goal and assessed their impact. The top 24 samples identified in the first cleaning iteration are shown above. Out of these 24 samples, it recommended fourteen for labeling, as they would add significant information. Eight of the images were correctly identified as wrongly labeled and were flagged as samples with a significant negative impact. Lastly, only two of the samples recommended for cleaning were actually correctly labeled; significantly fewer than one would expect when randomly cleaning faulty samples (~7, since two thirds of the labeled samples are correct).

By iteratively reviewing the top identified samples and retraining the Solution, the performance increases quickly. In our case, after only two cleaning iterations (covering 40% of the training dataset), the accuracy of the Solution increased by several percentage points and matched the performance on fully cleaned data.
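The review-and-retrain loop can be sketched in miniature. The toy below is our own illustration, not the platform’s algorithm: it uses per-sample loss as a simple proxy for the platform’s impact scores, flags the 24 most suspicious samples each iteration, and lets an oracle (standing in for the domain expert) fix their labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for MNIST: two well-separated 2D blobs.
n = 400
X = np.vstack([rng.normal(-1, 1, (n // 2, 2)), rng.normal(1, 1, (n // 2, 2))])
y_true = np.array([0] * (n // 2) + [1] * (n // 2))

# Corrupt 25% of the labels, mimicking the experiment in the text.
y = y_true.copy()
bad = rng.choice(n, size=n // 4, replace=False)
y[bad] = 1 - y[bad]

# Iterative cleaning: flag the highest-loss samples for review, fix, retrain.
for it in range(3):
    model = LogisticRegression().fit(X, y)
    proba = model.predict_proba(X)[np.arange(n), y]
    losses = -np.log(np.clip(proba, 1e-12, None))
    review = np.argsort(losses)[-24:]  # top 24 most suspicious samples
    y[review] = y_true[review]        # the "domain expert" fixes the labels
    print(f"iteration {it}: accuracy on clean labels = {model.score(X, y_true):.3f}")
```

Ranking by loss is a crude substitute for the platform’s impact assessment, but it shows the shape of the loop: most of the reviewed samples turn out to be the corrupted ones, so the label noise shrinks quickly with each iteration.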


Insights Into DQM Tasks

Example fairness plots showing confusion matrices for different classes of the protected attribute. These and other plots help you understand the sources of data flaws (e.g., issues with fairness) and effectively improve the data quality of your training data, which yields better ML Solutions.

With every release since Modulos v1.0.0 (released in May), we have been adding more capabilities to our Data Quality Management feature. Not only is it applicable to ever more use cases (e.g., image use cases), it is also increasingly effective at identifying data flaws and recommending how to clean them. To help you distill insights into where your data quality may be lacking, we now include plots of the distributions of the DQM modules’ output values. These can help you diagnose the sources of your data quality issues more effectively and understand their impact.

To illustrate this, let us consider the use case of fairness in consumer lending. In the example presented in one of our recent webinars, the models trained before applying DQM yield well-performing ML Solutions with high accuracy. However, when evaluating the equalized odds fairness metric for the protected attribute “gender”, we find a gender bias. By looking at some of the newly added plots (e.g., confusion matrices for the different populations, as shown above), we can see how the ML Solution behaves depending on the gender.
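For reference, equalized odds asks that the true positive rate and the false positive rate match across the groups of the protected attribute. A minimal sketch of computing the per-group rates and the resulting gap (our own helper names, not the platform’s API) could look like this:

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group true/false positive rates; equalized odds asks these to
    match across groups. Illustrative helper, not a platform API."""
    rates = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        tp = np.sum((yt == 1) & (yp == 1))
        fn = np.sum((yt == 1) & (yp == 0))
        fp = np.sum((yt == 0) & (yp == 1))
        tn = np.sum((yt == 0) & (yp == 0))
        rates[g] = {"TPR": tp / max(tp + fn, 1), "FPR": fp / max(fp + tn, 1)}
    return rates

def equalized_odds_gap(rates):
    """Largest cross-group difference in TPR or FPR (0 means perfectly fair)."""
    tprs = [r["TPR"] for r in rates.values()]
    fprs = [r["FPR"] for r in rates.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))
```

The per-group confusion matrices shown in the plots above are exactly the four counts (tp, fn, fp, tn) computed per population here; a large gap indicates that the Solution treats the groups differently.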

By then computing and applying our data quality improvement recommendations, we can address these issues! The platform identifies the samples with a negative impact on the equalized odds metric. By cleaning and/or removing these, the dataset, and thus also the Solution, is gradually made fairer.


Support for Text Features

Let us imagine the following situation: Say you want to train an ML model to triage support tickets more efficiently (e.g., by predicting the time to resolve an issue in advance). To do that, you would create an extract from a ticketing system and use it as your input dataset. While this extract contains various numerical and categorical data (e.g., priority, affected services, etc.), it also contains text features such as ticket titles and descriptions.

The information encoded in these text features is valuable, as it contains the information needed to troubleshoot the issues. Thus, it should definitely be included when training ML Solutions and when applying them to new data.

We have now added feature extractors to handle text features. This extends the range of use cases covered by the Modulos Platform and ensures that Solutions can also make use of this information.
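Conceptually, a text feature extractor of this kind can be sketched with scikit-learn. The snippet below is our own illustration (the platform’s actual extractors may differ): it vectorizes a text column with TF-IDF alongside a one-hot encoded categorical column, so both feed into the same model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-in for an extract from a ticketing system.
tickets = pd.DataFrame({
    "priority": ["high", "low", "high"],
    "title": ["login page crashes", "update billing address", "crash on login"],
})

# Vectorize the text column alongside the categorical one.
features = ColumnTransformer([
    ("priority", OneHotEncoder(), ["priority"]),
    ("title", TfidfVectorizer(), "title"),
])
X = features.fit_transform(tickets)  # one row per ticket, combined feature matrix
```

The resulting matrix has one column per category plus one per vocabulary term, so the text in the titles contributes to the prediction just like the structured fields do.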


Other Improvements and Fixes

  • Added support for Python 3.10 for deploying the Modulos Platform or its Solutions, and dropped support for Python 3.7, which will soon reach end of life.
  • Added more fairness objectives (Statistical Parity and Predictive Parity) as available DQM improvement objectives.
  • Added new Keras neural networks that use pretrained weights (EfficientNetV2 and MobileNetV1).
  • Refined the look and feel of the platform’s error and success messages.
  • Fixed an issue where previews of the datasets selected when configuring ML training workflows were sometimes unavailable.