Data scientists spend a significant amount of time running experiments before putting a machine learning model into production. These experiments produce a variety of artifacts and decisions, such as which class of models to employ and what types of features to include.
Without a consistent way of managing the resulting artifacts, data scientists will have a difficult time reproducing their analysis and comparing the outcomes of their experiments.
To achieve reproducibility and comparability in machine learning experiments, data scientists must retain experiment metadata.
Why Should Machine Learning Experiment Metadata Be Saved?
Creating machine learning models is an iterative process. We begin by forming an initial hypothesis about which input data is valuable, then develop a set of features and train several models.
However, even after hyperparameter tuning, we will most likely discover that the best model doesn't perform as well as we had hoped.
After completing an error analysis and talking with domain experts in the business, we gather ideas for new features to implement.
A few weeks later, we have a new model with higher performance. But how can we be certain that we're comparing the new model to the old one fairly? How can we know for sure that we're comparing apples to apples?
Comparability, or the ability to compare results across experiments, is one of the most important reasons for retaining metadata from machine learning experiments.
To compare the outcomes of our previous example, we must ensure that the training and test set splits, as well as the validation scheme, were the same. This may be simple in a team of one data scientist, but when numerous data scientists are working on the same project, it becomes much more difficult.
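One way to check that two experiments used the same split is to store a fingerprint of the row indices alongside the seed. The sketch below is only an illustration under stated assumptions: the helper names `make_split` and `split_metadata` are hypothetical, and it assumes rows can be identified by a stable integer index.

```python
import hashlib
import json
import random


def make_split(n_rows, test_fraction, seed):
    """Shuffle row indices with a fixed seed and cut them into train/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def split_metadata(train_idx, test_idx, test_fraction, seed):
    """Describe a split so that two experiments can be checked for equality."""
    payload = json.dumps({"train": train_idx, "test": test_idx}).encode("utf-8")
    return {
        "seed": seed,
        "test_fraction": test_fraction,
        "n_train": len(train_idx),
        "n_test": len(test_idx),
        # Hash of the exact row assignment, cheap to store and compare.
        "split_sha256": hashlib.sha256(payload).hexdigest(),
    }


train_a, test_a = make_split(n_rows=100, test_fraction=0.2, seed=42)
train_b, test_b = make_split(n_rows=100, test_fraction=0.2, seed=42)
train_c, test_c = make_split(n_rows=100, test_fraction=0.2, seed=7)

meta_a = split_metadata(train_a, test_a, 0.2, 42)
meta_b = split_metadata(train_b, test_b, 0.2, 42)
meta_c = split_metadata(train_c, test_c, 0.2, 7)

# Same seed and fraction give the same fingerprint; a different seed does not.
same_split = meta_a["split_sha256"] == meta_b["split_sha256"]
different_split = meta_a["split_sha256"] != meta_c["split_sha256"]
```

If two runs report the same `split_sha256`, their metrics were computed on the same test rows, regardless of which library or colleague produced the model.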
If data scientists develop models separately, sometimes using different libraries and languages, comparing results becomes considerably harder without a common way of collecting and storing experiment metadata. Even having serialized model objects doesn't guarantee experiment comparability in this scenario.
Metadata must also be captured to ensure reproducibility. Assume that after multiple rounds of iterative experimentation, we have created a model that is ready for production.
We go back through our Jupyter notebooks, only to discover that the notebook's hidden state is gone and that we never saved the actual trained model object. With the right metadata, we can retrain the same model and be on our way to deploying it in production. Without that metadata, we may be stuck with the memory of a superb model but no way of replicating our results.
What Kind of Metadata Should You Collect During ML Training?
Now that we've covered why it's vital to store metadata, let's look at the different sorts of metadata we should keep.
While you may not want to maintain copies of the underlying dataset, a pointer to the data's location is useful. Other metadata includes the dataset's name and version, column names and types, and dataset statistics such as input and target column distributions.
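As a rough illustration, the dataset metadata described above could be gathered into a single record. In this sketch the dataset name, the `s3://` location, the column names, and the helper `describe_dataset` are all hypothetical, and every column is assumed to be numeric to keep the example short.

```python
import csv
import io
import statistics

# Hypothetical stand-in for a real file; in practice only its location is stored.
RAW_CSV = """age,income,churned
34,52000,0
45,61000,1
29,48000,0
52,75000,1
"""


def describe_dataset(name, version, location, csv_text, target):
    """Collect dataset metadata without copying the data itself."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys())
    stats = {}
    for col in columns:
        # Assumption: all columns are numeric in this toy dataset.
        values = [float(row[col]) for row in rows]
        stats[col] = {
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
        }
    return {
        "name": name,
        "version": version,
        "location": location,
        "n_rows": len(rows),
        "columns": {col: "numeric" for col in columns},
        "target": target,
        "statistics": stats,
    }


dataset_meta = describe_dataset(
    name="customer_churn",
    version="v1",
    location="s3://example-bucket/churn/v1.csv",  # hypothetical path
    csv_text=RAW_CSV,
    target="churned",
)
```

The record stays small even for large datasets, because it holds summary statistics and a pointer rather than the rows themselves.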
During the training process, you need to keep track of several model attributes.
The type of algorithm employed is one kind of model metadata. For regression problems, this could be an elastic net or a support vector machine regression. For classification problems, it might be a random forest or a gradient boosted tree classifier.
The name of the framework and the class associated with the model would be a straightforward way to store this information. For example, you might record the scikit-learn library's sklearn.linear_model.ElasticNet or the xgboost package's xgboost.Booster. By storing these values, you can quickly create new objects of the same class in the future.
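A minimal sketch of this idea uses Python's importlib to turn a stored class path back into an object. The helpers `class_path` and `instantiate` are hypothetical names, and `fractions.Fraction` is only a standard-library stand-in so the example runs without any ML package installed. With scikit-learn, the stored path would be something like sklearn.linear_model.ElasticNet.

```python
import importlib
from fractions import Fraction


def class_path(obj):
    """Return the fully qualified class name, e.g. 'fractions.Fraction'."""
    cls = type(obj)
    return f"{cls.__module__}.{cls.__qualname__}"


def instantiate(path, *args, **kwargs):
    """Import the module named in `path` and build a new object of that class."""
    module_name, _, class_name = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)(*args, **kwargs)


# Stand-in "model": any object whose class can be imported by name.
original = Fraction(3, 4)

# The metadata record we would store next to the experiment.
record = {
    "framework": "python-stdlib",  # e.g. "scikit-learn" in a real project
    "class": class_path(original),
    "params": {"numerator": 3, "denominator": 4},
}

# Later, possibly in another process, rebuild an object of the same class.
rebuilt = instantiate(record["class"], **record["params"])
```

One caveat: for some libraries the module recorded this way may be an internal submodule rather than the public import path, so it can be worth storing the library version as well.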
Steps in Feature Preprocessing:
Data is rarely available in a form that can be used directly for training. Most of the time, raw data must be turned into a format that the machine learning algorithm can understand through a series of feature preprocessing steps.
Encoding categorical variables, dealing with missing data (imputation or otherwise), centering, scaling, and so on are all examples of this.
Before this, there may be "higher-level" stages such as integrating data from disparate databases and producing aggregate statistics on denormalized entities. These transformations should be saved with the model.
We include these stages in the model for a simple reason: if the model was trained on transformed data, it will expect data in the same format in the future. To improve reproducibility, the feature preprocessing steps should be stored in a single object alongside the fitted model.
This makes re-instantiating the fitted model at inference time much easier. The Pipeline abstraction in scikit-learn, for example, allows data scientists to chain together a series of preprocessing steps and a final estimator.
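To make this concrete, here is a small sketch that chains imputation, scaling, and an elastic net into one scikit-learn Pipeline and serializes the whole thing with pickle. The toy data and the hyperparameter values are invented for illustration, and the choice of steps is an assumption rather than a recommendation.

```python
import pickle

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with one missing value; invented for illustration only.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

pipeline = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
        ("model", ElasticNet(alpha=0.1, l1_ratio=0.5)),
    ]
)
pipeline.fit(X, y)

# Serialize preprocessing and estimator together as one artifact.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# Raw, untransformed rows can be passed straight to the restored object.
new_rows = np.array([[2.5, np.nan]])
original_pred = pipeline.predict(new_rows)
restored_pred = restored.predict(new_rows)
```

Because the imputer and scaler travel with the estimator, the restored object applies the same fitted transformations at inference time. Pickled estimators are tied to the library version that produced them, so the scikit-learn version is worth recording as metadata too.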
For reproducibility, the hyperparameters used during model training must be saved. These are frequently included in the fitted model object, but you may want to save them separately so that you can build visualizations on top of the metadata.
For example, if you keep the model hyperparameters and evaluation metrics from training, you can plot how the metrics vary across the hyperparameters. This is useful for model selection.
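The following sketch shows what keeping hyperparameters and metrics as separate records might look like. The run records and their numbers are invented for illustration, and `metric_by_param` is a hypothetical helper. The resulting pairs could be handed to any plotting library.

```python
import json

# Hypothetical run records; the numbers are invented for illustration.
runs = [
    {"run_id": "r3", "params": {"alpha": 1.0, "l1_ratio": 0.5}, "metrics": {"rmse": 3.9}},
    {"run_id": "r1", "params": {"alpha": 0.01, "l1_ratio": 0.5}, "metrics": {"rmse": 4.2}},
    {"run_id": "r4", "params": {"alpha": 10.0, "l1_ratio": 0.5}, "metrics": {"rmse": 5.1}},
    {"run_id": "r2", "params": {"alpha": 0.1, "l1_ratio": 0.5}, "metrics": {"rmse": 3.6}},
]

# Store one JSON document per line, independent of any model object.
lines = "\n".join(json.dumps(run) for run in runs)
loaded = [json.loads(line) for line in lines.splitlines()]


def metric_by_param(records, param, metric):
    """Return (hyperparameter value, metric value) pairs sorted by the hyperparameter."""
    pairs = [(r["params"][param], r["metrics"][metric]) for r in records]
    return sorted(pairs)


# Points for a "rmse versus alpha" plot, ready for a plotting library.
curve = metric_by_param(loaded, "alpha", "rmse")

# Model selection: the run with the lowest error.
best_run = min(loaded, key=lambda r: r["metrics"]["rmse"])
```

Since the records are plain JSON, they can be compared across runs that used different libraries or languages.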
The metadata layer was designed with DataOps professionals in mind, providing audit and privacy-related provenance trails across the machine learning platform.
Over time, however, it became clear that metadata also benefits other roles, for tasks such as:
Identifying relevant entities for a specific model.
Examining the differences in performance between various models (or features).
Finding existing artifacts when creating a new model.
Recording all relevant settings for model training.
Keeping track of audit traces.
Tracing data lineage (for GDPR or CCPA regulations, for example).
Enabling repeatable model training.
Detecting data shifts as the distribution of production data changes over time.
DevOps (Infrastructure Owner):
Accounting for resources (distributed training in particular is very resource-intensive and costly).
Tracking and enforcing permissions.
As a result, analyzing the collected metadata provides insights into the entire platform, which is critical for deriving commercial value from the various machine learning models.
What Is a Machine Learning Metadata Store?
An ML metadata store is a "store" for ML model-related metadata. It's a one-stop shop for everything you need to know about building and deploying machine learning models.
In particular, an ML metadata store lets you log, store, display, monitor, compare, organize, filter, and query all model-related metadata.
In a nutshell, it allows you to manage all of the ML metadata associated with the experiments, artifacts, models, and pipelines specified in the previous section in one place.
It's a database and user interface designed exclusively for managing ML model metadata. To make logging and querying ML metadata easier, it usually comes with an API or an SDK (client library).
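To give a feel for the "database plus client library" idea, here is a deliberately tiny sketch built on SQLite. The `MetadataStore` class, its `log_run` and `query` methods, and the table layout are all hypothetical. It does not reflect the API of any real metadata store product.

```python
import json
import sqlite3


class MetadataStore:
    """Tiny illustrative metadata store; not any real product's API."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS runs ("
            "run_id TEXT PRIMARY KEY, model_class TEXT, "
            "params TEXT, metrics TEXT)"
        )

    def log_run(self, run_id, model_class, params, metrics):
        """Log one experiment run; params and metrics are stored as JSON text."""
        self.conn.execute(
            "INSERT INTO runs VALUES (?, ?, ?, ?)",
            (run_id, model_class, json.dumps(params), json.dumps(metrics)),
        )
        self.conn.commit()

    def query(self, model_class=None):
        """Return runs ordered by run_id, optionally filtered by model class."""
        sql = "SELECT run_id, model_class, params, metrics FROM runs"
        args = ()
        if model_class is not None:
            sql += " WHERE model_class = ?"
            args = (model_class,)
        sql += " ORDER BY run_id"
        rows = self.conn.execute(sql, args).fetchall()
        return [
            {
                "run_id": row[0],
                "model_class": row[1],
                "params": json.loads(row[2]),
                "metrics": json.loads(row[3]),
            }
            for row in rows
        ]


store = MetadataStore()
store.log_run("r1", "sklearn.linear_model.ElasticNet", {"alpha": 0.1}, {"rmse": 3.6})
store.log_run("r2", "xgboost.Booster", {"max_depth": 4}, {"rmse": 3.4})

all_runs = store.query()
elastic_runs = store.query(model_class="sklearn.linear_model.ElasticNet")
```

A real metadata store adds a user interface, access control, and artifact storage on top, but the core interaction is the same: log records during training, then filter and compare them later.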
To conclude, data scientists should strive to create experiments that are reproducible and comparable. Achieving this requires saving both the artifacts an experiment creates and the metadata connected to it. In this blog, we've looked at why metadata needs to be stored and at the different sorts of metadata worth storing.