By Claudio Bruderer (Product Manager at Modulos).
Every activity in the day-to-day operations of a company produces a wealth of data. This can range from data on customer preferences and individual customer interactions, to the contents and usage of the company's services, and to the IT operations needed to provide them. Managing this amount of data is difficult in itself; analyzing it and deriving data-driven insights is even more challenging. This need can, however, be met by using Analytics and Artificial Intelligence (AI). In this blog post, I focus on the latter and on how a company can use AI, and in particular Machine Learning (ML), to address its needs.
Our AutoML process
The AutoML process is how we tackle a use case with AutoML and how this adds value to a company’s service with ML. This process consists broadly of five steps, which are illustrated above, starting from shaping an initial idea for a use case, to implementing it with AutoML, and finally deploying it to production. Each of these individual steps, as well as the entire process itself, is iterative: the use case idea, the data, the model, and the live performance are constantly refined. The steps are:
- Ideate: An idea for a use case is developed (e.g. by using a Design Thinking approach) and the task to be tackled with ML is defined.
- Select the Data: The corresponding datasets containing the insights are selected, enriched with external data if needed, combined, cleaned, and prepared to be analyzed with ML.
- Create the ML Model: AutoML is used to easily select and systematically train applicable ML models on the data, yielding the best model as a prediction script for new data.
- Test the Model: The trained ML model is used to predict properties on test data in order to benchmark and compare it with previous models and approaches.
- Deploy to Production: The tested prediction script is integrated into existing services and continuously applied to live data, thus completing the use case.
To illustrate the AutoML process, let’s consider a fictional bike-sharing company and how they can make full use of their data using the Modulos AutoML platform.
Ideate
In this first step, the bike-sharing company needs to develop their use case idea. There are various creative methods to generate ideas that either address a pain point or add value to a service.
Our bike-sharing company could, for instance, try to optimize the supply of bikes at different stations. This would improve the customer experience and increase revenue. In order to do so, the company needs to be able to predict the number of bikes that are likely to be rented during the next few days. For a known scenario, the company may already have methods to do that (e.g. heuristics or extrapolation based on past records). In this case, ML can lead to better performance, since the models are fitted systematically to the data and do not rely on hand-tuned human decisions. For an unknown scenario, on the other hand, heuristics may not even exist yet, but ML is still applicable.
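To make the idea of a heuristic based on past records concrete, here is a minimal sketch of one such baseline: predicting tomorrow's rentals as the average of the same weekday over the last few weeks. The function name, the four-week window, and the input format (a list of daily totals ending today) are assumptions made for illustration; the post does not describe the company's actual heuristics.

```python
from statistics import mean


def same_weekday_baseline(daily_counts, weeks=4):
    """Predict tomorrow's rentals from the same weekday in previous weeks.

    daily_counts: daily totals in chronological order, last entry = today.
    weeks: how many past weeks to average over (illustrative default).
    """
    # Tomorrow minus 7 days is daily_counts[-7], minus 14 days is [-14], ...
    history = [
        daily_counts[-7 * (w + 1)]
        for w in range(weeks)
        if 7 * (w + 1) <= len(daily_counts)
    ]
    if not history:
        raise ValueError("need at least 7 days of past records")
    return mean(history)


# Toy data: 28 days with totals 1, 2, ..., 28 (made-up numbers).
toy_counts = list(range(1, 29))
baseline_prediction = same_weekday_baseline(toy_counts)
```

A baseline like this is also a useful yardstick later on: an ML model is only worth deploying if it beats the simple rule on the same test data.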
Select the Data
The next step is to select the data that may contain information leading to more accurate predictions. This is done by domain experts, who have a deeper understanding of which factors may play a role. The data then needs to be combined, cleaned, and prepared to be uploaded to the AutoML platform.
For our use case, the domain experts may, for instance, combine past records of bike usage by registered and casual users. They may also conclude that the weather and the knowledge of whether it is a workday, the weekend, or a local holiday are important, and enrich their dataset with this information.
In this example, a publicly available dataset is used, which contains the usage of bikes for one rental station over a two-year period. This data is split into input data for creating the ML models (see “Create the ML Model”) and test data for evaluating the performance of the trained model (see “Test the Model”); the last 6 months of data serve as test data.
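The chronological split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the column names `date` and `rented`, the dates, and the counts are all made up, and the cutoff is simply chosen so that the final six months of the toy period land in the test set.

```python
from datetime import date


def split_by_date(rows, cutoff):
    """Split records chronologically.

    Rows dated before `cutoff` become input data for model creation;
    rows dated on or after `cutoff` are held back as test data.
    """
    input_rows = [r for r in rows if r["date"] < cutoff]
    test_rows = [r for r in rows if r["date"] >= cutoff]
    return input_rows, test_rows


# Toy records with hypothetical column names and made-up values.
rows = [
    {"date": date(2012, 5, 31), "rented": 5000},
    {"date": date(2012, 6, 30), "rented": 5500},
    {"date": date(2012, 7, 1), "rented": 6000},
    {"date": date(2012, 12, 31), "rented": 2500},
]
input_rows, test_rows = split_by_date(rows, cutoff=date(2012, 7, 1))
```

Splitting by date rather than at random keeps the test data strictly in the future relative to the input data, which mirrors how the model will be used in production.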
Create the ML Model (with AutoML)
Next, ML models well suited to this use case need to be selected and trained on the cleaned input data. Since selecting and training the ML models by hand can be unsystematic and time-consuming, it is better left to a machine. Modulos AutoML was built to address this need and can train ML models for tabular and image data. For this use case, we make use of the table regression functionality, as we’d like to predict a number (the total number of rented bikes on a given day).
After importing the dataset onto the platform, you are guided through the creation of the ML models. You are asked to decide what should be predicted, which columns (also called features) should be used for training, and what objective needs to be optimized. Furthermore, you have the opportunity to select the ML models and feature engineering methods to be tested and how the search for the best models is performed. You can also simply use the default settings, as one of the key philosophies behind AutoML is that a priori knowledge of ML is not required.
For our bike-sharing company, the following few choices are made:
- What do we want to predict? Number of bikes rented on a given day.
- Based on which features are predictions made? All the other columns in the tabular dataset (weather data and calendar information).
- What objective is to be optimized? Minimization of the median absolute difference between the predicted and true number of rented bikes.
- What ML models and feature engineering methods are explored? Default choices by the platform (all applicable ML models).
- How is the space of allowed ML models sampled? Default choices by the platform (random search).
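The objective chosen above, the median absolute difference between predicted and true values, is easy to state in code. The following is a minimal sketch using only the standard library; the function name and the toy numbers are illustrative and not taken from the platform.

```python
from statistics import median


def median_absolute_error(y_true, y_pred):
    """Median of |true - predicted| over all days."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


# Toy values (made up): absolute differences are 10, 30 and 0.
typical_error = median_absolute_error([100, 200, 300], [110, 170, 300])

# One badly mispredicted day barely moves the median:
# absolute differences are 10, 30 and 1000.
error_with_outlier = median_absolute_error([100, 200, 300], [110, 170, 1300])
```

Because it is a median rather than a mean, this objective describes the error on a typical day and is not dominated by a few unusual days.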
After these few configuration choices are set, the AutoML platform takes over. It splits the uploaded dataset into training and validation data, systematically trains and tests the selected ML models, and displays their performance scores. As soon as you are happy with the performance of the trained models (also called solutions), you can stop the process, download the best model, and apply this script to new data.
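The loop the platform runs can be pictured with a toy version of random search. The sketch below is not the Modulos implementation: it assumes a single numeric feature, a hand-written k-nearest-neighbour regressor as the only model family, a random train/validation split, and synthetic data. It only shows the pattern of sampling a configuration, training on one part of the data, scoring on the other, and keeping the best.

```python
import random
from statistics import mean, median


def knn_predict(train, x, k):
    """Average the targets of the k training points whose feature is closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return mean(y for _, y in nearest)


def random_search(data, n_trials=10, val_fraction=0.25, seed=0):
    """Randomly sample k and keep the value with the lowest validation error."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    val, train = shuffled[:n_val], shuffled[n_val:]

    best_k, best_score = None, float("inf")
    for _ in range(n_trials):
        k = rng.randint(1, len(train))  # sample one candidate configuration
        # Score: median absolute difference on the validation data.
        score = median(abs(y - knn_predict(train, x, k)) for x, y in val)
        if score < best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Synthetic (feature, target) pairs, e.g. temperature vs. rented bikes.
toy_data = [(t, 100 + 10 * t) for t in range(20)]
best_k, best_score = random_search(toy_data)
```

A real search covers many model families and feature engineering methods at once, but the structure of the loop stays the same.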
Test the Model
For the example dataset here, the last 6 months of past records were retained as test data and not uploaded to the platform (see “Select the Data”). For this timeframe, the exact number of rented bikes is known and is used for comparison to the predicted number of bikes.
The model trained with AutoML predicts the general trend of the total number of rented bikes well. In absolute numbers, the median absolute difference (see “Create the ML Model”) between the predicted and the known values is 754 bikes for the test data. This corresponds to a median deviation of 13% (median of rented bikes in the second year: 5,927).
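The 13% figure follows directly from the two numbers quoted above, as this short check shows. The variable names are illustrative; the values are the ones reported in the text.

```python
median_abs_diff = 754   # bikes, median absolute difference on the test data
median_rentals = 5927   # median number of rented bikes in the second year

relative_deviation = median_abs_diff / median_rentals
print(f"{relative_deviation:.1%}")  # prints 12.7%, i.e. roughly 13%
```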
Deploy to Production
Lastly, the ML model can be deployed to production and used for prediction on live data. The platform provides solutions in the form of Python scripts. All solutions have the same API, which makes it easy to integrate them into existing environments and straightforward to replace them with even better models trained by AutoML.
Our bike-sharing company is now able to predict the number of rented bikes a few days ahead, given the weather forecast and the holiday schedule. By predicting the demand more accurately with ML, the supply of bikes can be optimized, thus achieving this use case.
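Why a shared API makes models easy to swap can be shown with a small sketch. Everything here is hypothetical: the `Solution` protocol, the `predict` method, the two stand-in models, and the `temperature` column are invented for illustration and are not the actual interface of the downloaded scripts.

```python
from typing import List, Mapping, Protocol, Sequence


class Solution(Protocol):
    """Hypothetical common interface: every solution exposes predict()."""

    def predict(self, rows: Sequence[Mapping[str, float]]) -> List[float]:
        ...


class ConstantModel:
    """Stand-in for an older solution: always predicts the same value."""

    def __init__(self, value: float) -> None:
        self.value = value

    def predict(self, rows: Sequence[Mapping[str, float]]) -> List[float]:
        return [self.value for _ in rows]


class TemperatureRule:
    """Stand-in for a newer solution with made-up coefficients."""

    def predict(self, rows: Sequence[Mapping[str, float]]) -> List[float]:
        return [1000 + 150 * row["temperature"] for row in rows]


def forecast_demand(solution: Solution, rows: Sequence[Mapping[str, float]]) -> List[float]:
    """Service code depends only on the interface, not on the concrete model."""
    return solution.predict(rows)


live_rows = [{"temperature": 20.0}, {"temperature": 10.0}]
old_forecast = forecast_demand(ConstantModel(4500.0), live_rows)
new_forecast = forecast_demand(TemperatureRule(), live_rows)
```

Because the calling code never names a concrete model, replacing one solution with a better one is a one-line change at the point where the model is loaded.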
This predictive capability also unlocks other use cases, as the ML solution could, for instance, be used to improve the customer experience by including the predicted demand in the company’s bike-sharing app. It could also be used to first predict the use of bikes by casual users and then target them specifically to make them loyal customers (e.g. with special, tailored offers).