Modulos v1.1.2: Data Insights & Cloud Integrations

Written by:
Claudio Bruderer (Head of Product at Modulos)

The latest version of the Modulos Platform, Modulos v1.1.2, is out! This software release adds several exciting new features. They focus on giving you more insights into your datasets and trained Machine Learning Solutions. This version also includes new tools to review and edit data records recommended for cleaning to further strengthen our Data-Centric Artificial Intelligence approach. Lastly, amongst other improvements, we are adding integrations to different cloud platforms for a more seamless dataset import.

Dataset Analyses and Insights

Screenshot of the distributions of individual feature values in a dataset. They allow you to easily explore your data and quickly draw preliminary insights.

Data is the key ingredient for Machine Learning (ML). This idea lies at the core of our Data-Centric AI (DCAI) approach, first introduced with Modulos v1.0.0. DCAI aims to yield good and fair ML Solutions by putting the focus on the quality of your data. It gives you the tools to assess your datasets and to identify the flaws that prevent your Solutions from reaching a desired outcome (e.g., reducing the discrimination of an ML Solution).

To improve data quality effectively, you first need to understand your datasets. This is why the platform now computes various statistics and plots for your datasets, so that you can assess the distributions of feature values both visually and quantitatively. It also alerts you to potential data issues (e.g., a significant number of empty values or large skewness). Furthermore, the Modulos Platform computes the correlation matrix of all the numerical features and highlights the pairs showing strong (anti-)correlations. These pairs can then be investigated more closely by inspecting their scatter plots.
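To make these checks concrete, here is a minimal Python sketch of the kind of analysis described above. It is not the Modulos implementation: the function names, the input layout (a dict of column name to values, with None for empty values), and all thresholds are assumptions chosen for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def skewness(xs):
    """Population skewness (third standardized moment)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def profile(columns, max_missing=0.2, max_skew=1.0, min_corr=0.8):
    """Return (alerts, strong_pairs) for numerical columns.

    The thresholds are illustrative assumptions, not platform defaults.
    """
    alerts = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if 1 - len(present) / len(values) > max_missing:
            alerts.append((name, "many empty values"))
        if len(set(present)) > 1 and abs(skewness(present)) > max_skew:
            alerts.append((name, "large skewness"))

    strong_pairs = []
    names = list(columns)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Only use rows where both features have a value.
            rows = [(x, y) for x, y in zip(columns[a], columns[b])
                    if x is not None and y is not None]
            if len(rows) < 3:
                continue  # too few points for a meaningful correlation
            xs, ys = zip(*rows)
            r = pearson(xs, ys)
            if abs(r) >= min_corr:
                strong_pairs.append((a, b, round(r, 3)))
    return alerts, strong_pairs
```

Calling profile on a small table returns a list of (feature, issue) alerts and a list of (feature, feature, correlation) triples. The triples are the pairs one would then inspect in a scatter plot.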

Analyze Data Quality Flaws

Interactively investigate data records with a negative and a positive impact on the fairness of the trained ML Solution.

Another important ingredient in the DCAI journey is the data-model feedback loop that allows for iterative data quality improvements. Once you have defined an improvement goal (e.g., improving the accuracy metric of your Solution), the Modulos Platform provides tools to identify which data records have a negative and which have a positive impact on reaching that objective. By then addressing these flaws and/or acquiring more good data, the data quality is improved and the ML models can be retrained. These steps are repeated until either the objective is satisfied or the performance plateaus.
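The stopping logic of such a feedback loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's code: train_and_score and improve_data are hypothetical callbacks standing in for retraining a Solution and for fixing or extending the data, and the score is assumed to be "higher is better".

```python
def improvement_loop(dataset, train_and_score, improve_data,
                     target, min_gain=1e-3, max_rounds=10):
    """Repeat train -> improve data -> retrain until done.

    train_and_score(dataset) -> float   (hypothetical callback)
    improve_data(dataset)    -> dataset (hypothetical callback)
    Returns the final dataset and the score after each training run.
    """
    history = [train_and_score(dataset)]
    for _ in range(max_rounds):
        if history[-1] >= target:
            break  # the objective is satisfied
        dataset = improve_data(dataset)
        history.append(train_and_score(dataset))
        if history[-1] - history[-2] < min_gain:
            break  # the performance has plateaued
    return dataset, history
```

The min_gain and max_rounds parameters are also assumptions. They guard against retraining forever when the data changes no longer move the metric.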

In addition to just assessing the impact of data records, the latest version of the Modulos Platform now also allows you to investigate these samples and understand what sets them apart. As shown in the animation above, simply select a subset of the data – in a relevant portion of the curve showing the samples ranked by their impact – and study shifts in the distributions of feature values. This is not only useful for insights on systematic data quality issues. It also allows you to characterize data records with a positive impact, which you could use as an input for synthetic data generation pipelines.
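As a rough illustration of what "studying shifts in the distributions" can mean, the following Python sketch ranks records by an impact score, selects a slice of that ranking, and compares each feature's mean in the slice with its mean over the full dataset. The impact scores, the function name, and the use of a standardized mean difference are assumptions made for this example. The platform's own comparison may differ.

```python
import statistics


def feature_shifts(records, impacts, lo, hi):
    """Compare a slice of impact-ranked records with the full dataset.

    records: list of dicts mapping feature name -> numeric value
    impacts: one score per record (hypothetical; negative = harmful)
    lo, hi:  slice of the ranking, most harmful records first
    Returns {feature: (subset mean - full mean) / full std}.
    """
    order = sorted(range(len(records)), key=lambda i: impacts[i])
    subset = [records[i] for i in order[lo:hi]]
    shifts = {}
    for feature in records[0]:
        full = [r[feature] for r in records]
        std = statistics.pstdev(full)
        if std == 0:
            shifts[feature] = 0.0  # constant feature, nothing can shift
            continue
        sub_mean = statistics.fmean(r[feature] for r in subset)
        shifts[feature] = (sub_mean - statistics.fmean(full)) / std
    return shifts
```

A feature whose shift is far from zero in the most harmful slice hints at a systematic issue tied to that feature. The same comparison on the most helpful slice characterizes the records worth having more of.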

Lastly, for quick experimentation, this release also provides editing functionality to correct wrong label values. Simply review the prioritized list of data records with a negative impact and address potential sources of noise, error, and bias by amending the label values. Then, save the new dataset and automatically trigger the retraining of your ML Solution.

Dataset Import: Cloud Integrations

For this release of the Modulos Platform, we have also significantly extended the dataset import options. We have added integrations with several data storage sources such as Azure Blob Storage, AWS S3, and Git LFS. You can now directly import a dataset stored on one of those systems using a presigned or SAS URL, which streamlines the dataset import.
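For readers unfamiliar with presigned and SAS URLs: the signature and expiry time are part of the URL itself, so whoever holds the URL can fetch the object with a plain HTTP GET and no further credentials. The Python sketch below illustrates that idea with the standard library only. It is not the Modulos import API, and the function name is hypothetical.

```python
import shutil
import urllib.request


def download_dataset(url, dest_path, chunk_size=1024 * 1024):
    """Stream the object behind a presigned or SAS URL to a local file."""
    # The URL carries its own time-limited signature in the query string,
    # so no extra headers or credentials are attached to the request.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # Copy in chunks so large datasets are never held fully in memory.
        shutil.copyfileobj(response, out, chunk_size)
    return dest_path
```

Such URLs are generated on the storage side with an expiry time, so a URL that worked yesterday may be rejected today.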

Other Improvements and Fixes

  • In addition to the data insights, we have also added various plots on the performance of trained ML Solutions to the platform.
  • We have changed the handling and encoding of text features with a large fraction of unique values in tabular datasets for more robustness.
  • For the latest version of the Modulos Platform, we have included additional validity checks of the software license when performing various actions.
  • We have fixed a display issue on the Solution Dashboards where empty scores were interpolated instead of being properly denoted as empty values.

Are you excited by all these new features? Are you ready to extract the full value out of your data and ML use case by using Data-Centric AI? Contact us and request a demo today!