{"id":82535,"date":"2023-03-31T15:55:21","date_gmt":"2023-03-31T15:55:21","guid":{"rendered":"https:\/\/www.globallogic.com\/uk\/?post_type=insightsection&p=82535"},"modified":"2023-03-31T15:55:21","modified_gmt":"2023-03-31T15:55:21","slug":"mlops-principles-part-one-model-monitoring","status":"publish","type":"insightsection","link":"https:\/\/www.globallogic.com\/uki\/insights\/blogs\/mlops-principles-part-one-model-monitoring\/","title":{"rendered":"MLOps Principles Part One: Model Monitoring"},"content":{"rendered":"
In this two-part blog series, we’ll explore some of the common problems organisations face when trying to productionise ML models. Each blog will define the relevant concepts and discuss popular open-source tools to address them.

This blog will explore the various aspects of model monitoring – why you should implement it in your pipeline and the tools available.
Model monitoring in production is a critical aspect of MLOps which enables organisations to ensure their deployed models are performing as expected and delivering accurate, reliable results. The ability to monitor models in production is crucial for identifying issues (which we’ll cover below), debugging errors, and enabling fast iteration and improvement.
If an ML model is not properly monitored, it may go unchecked in production and produce incorrect results, become outdated and no longer provide value to the business, or develop subtle bugs over time that go undetected. Unlike traditional software applications, ML systems tend to fail silently as the accuracy of the model degrades over time. For example, an ML model designed to predict house prices before the 2008 financial crisis would have produced poor quality predictions during the crisis.
In industries where ML plays a central role, failing to catch these types of issues can have serious consequences – for example, in workflows where important decisions depend on the model’s outputs. These decisions can have a high impact on customers, especially in regulated industries such as banking.
When discussing model monitoring, the first thing that comes to mind is monitoring the performance of a deployed ML model in production by comparing the predictions made by the model against the ground truth. However, this is only the tip of the iceberg.
Broadly speaking, you can monitor your ML models at two levels:

• Functional level – monitoring input and output data and model evaluation performance.
• Operational level – monitoring the resources used by the deployed model and the pipelines involved in creating the model.
In this blog, we’ll be focusing on functional level monitoring and the potential problems that can be detected and remedied by utilising it.

Typically, a Data Scientist or ML Engineer who is familiar with the deployed model and the underlying datasets used for training is responsible for monitoring at the functional level.
<em>Figure 1 – Types of functional level monitoring</em>

The deterioration of an ML model’s performance over time can be attributed to two main factors: <strong>data drift</strong> and <strong>concept drift</strong>. <strong>Data drift</strong> occurs when the distribution of the input data deviates from the data the model was trained on, resulting in poor quality predictions. It can be detected by monitoring the input data being fed into the model and using statistical tests such as the Kolmogorov–Smirnov test, or by using metrics to measure the difference between two distributions.

<strong>Concept drift</strong>, on the other hand, is a change in the relationship between the target variable and the input data – for example, the sudden surge in online shopping sales during the pandemic lockdowns. Concept drift is detected by continuously monitoring the model’s performance over time and the distribution of the model’s prediction confidence scores (only applicable for classification models).

<em>Figure 2 – Illustration of concept and data drift. Original image source: Iguazio</em>
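To make drift detection more concrete, the sketch below compares a reference (training-time) feature sample against a production sample with the Kolmogorov–Smirnov test, and applies the same idea to prediction confidence scores. It is a minimal illustration using synthetic data and an assumed 0.05 significance threshold, not the approach of any specific tool.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Feature values seen at training time (reference) vs. in production (shifted on purpose).
reference_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
production_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)

# Kolmogorov-Smirnov test: a small p-value suggests the two samples come
# from different distributions, i.e. possible data drift on this feature.
statistic, p_value = stats.ks_2samp(reference_feature, production_feature)
if p_value < 0.05:  # illustrative significance threshold
    print(f"Possible data drift (KS statistic={statistic:.3f}, p={p_value:.4f})")

# For classifiers, a shift in the distribution of prediction confidence
# scores can be an early hint of concept drift, even before labels arrive.
reference_confidence = rng.beta(8, 2, size=5_000)  # stand-in for past confidence scores
current_confidence = rng.beta(4, 3, size=5_000)    # stand-in for recent confidence scores
statistic, p_value = stats.ks_2samp(reference_confidence, current_confidence)
if p_value < 0.05:
    print("Prediction confidence distribution has shifted - investigate for concept drift")
```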
<strong>Data quality</strong> is another important factor to consider when discussing model performance. Unvalidated data can potentially result in misleading predictions or cause the model to break as unexpected inputs are given to it. To prevent this, data validation tools can be employed to ensure that incoming data adheres to a data schema and passes quality checks before reaching the model.
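As a minimal sketch of this kind of validation, assuming a hypothetical house-price model with made-up column names, dtypes, and value ranges, a schema check run before each batch is sent to the model could look something like this:

```python
import pandas as pd

# Hypothetical schema: expected dtype and plausible value range per column.
EXPECTED_SCHEMA = {
    "num_rooms": ("int64", 1, 20),
    "floor_area_sqm": ("float64", 10.0, 1000.0),
    "year_built": ("int64", 1800, 2023),
}

def validate_batch(df: pd.DataFrame) -> list:
    """Return a list of data quality issues found in an incoming batch."""
    issues = []
    for column, (dtype, lower, upper) in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            issues.append(f"missing column: {column}")
            continue
        if str(df[column].dtype) != dtype:
            issues.append(f"{column}: expected dtype {dtype}, got {df[column].dtype}")
        if df[column].isna().any():
            issues.append(f"{column}: contains missing values")
        elif ((df[column] < lower) | (df[column] > upper)).any():
            issues.append(f"{column}: values outside expected range [{lower}, {upper}]")
    return issues

# Example: the second row has an implausible number of rooms and gets flagged.
batch = pd.DataFrame({"num_rooms": [3, 42], "floor_area_sqm": [75.0, 120.5], "year_built": [1990, 2005]})
problems = validate_batch(batch)
if problems:
    print("Batch rejected:", problems)
```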
<strong>Inference speed</strong> may also be monitored; this tells us the time it takes for an ML model to make a prediction. Some use cases may require fast inference times due to time-sensitive applications or high volumes of requests.
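A simple way to capture this is to time each prediction call and summarise the latency distribution. The sketch below uses a dummy stand-in for the deployed model and prints percentiles; in practice these measurements would usually be exported to a monitoring system rather than kept in a list.

```python
import time
import numpy as np

class DummyModel:
    """Stand-in for a deployed model; replace with the real predictor."""
    def predict(self, features):
        return sum(features)

def timed_predict(model, features, latencies_ms):
    """Call the model and record how long the prediction took, in milliseconds."""
    start = time.perf_counter()
    prediction = model.predict(features)
    latencies_ms.append((time.perf_counter() - start) * 1000)
    return prediction

model = DummyModel()
latencies_ms = []
for request in ([1.0, 2.0, 3.0] for _ in range(1000)):  # simulated incoming requests
    timed_predict(model, request, latencies_ms)

# Percentiles are usually more informative than the mean for latency monitoring.
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.3f}ms  p95={p95:.3f}ms  p99={p99:.3f}ms")
```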
<h4>Tooling</h4>

There are many tools available for model monitoring, including those that are exclusive to AWS (SageMaker Model Monitor), Azure (Azure Monitor), and GCP (Vertex AI Model Monitoring). We’ve selected two open-source Python packages which have stood out to us as feature-rich and actively developed – Evidently AI and NannyML.

<h6>Evidently AI</h6>

Evidently AI evaluates, tests, and monitors the performance of ML models and data quality throughout the ML pipeline. At a high level, there are three core aspects of the package:

1 – Tests. These are performed on structured data and model quality checks, and typically involve comparing a reference and a current dataset. Evidently AI has created several pre-built test suites which contain a set of tests relevant to a particular task. These include data quality, data drift, regression and classification model performance, and other presets.

2 – Interactive reports. These help with visual exploration, debugging, and documentation of the data and model performance. In the same fashion as test suites, Evidently AI has created pre-built reports for specific aspects. If none of the pre-built test suites or reports are suitable for your use case, you can build a custom test suite or report (see the sketch at the end of this section). All pre-built suites and reports can be found on their presets documentation page.

<em>Figure 3 – Example of a data drift report. Image source: Evidently AI</em>

3 – Near-real-time ML monitoring functionality that collects data and model metrics from a deployed ML service. In this aspect, Evidently AI is deployed as a monitoring service that calculates metrics over streaming data and outputs them in the Prometheus format, which can then be visualised using a live dashboarding tool such as Grafana – this functionality is in early development and may be subject to major changes.

On top of all this, Evidently AI provides examples of integrating with other tools in the ML pipeline such as Airflow, MLflow, Metaflow, and Grafana.
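As an illustration of the reports mentioned above, the sketch below generates a data drift report by comparing a reference dataset against a recent production batch. It assumes the `Report`/`DataDriftPreset` API from Evidently releases around the time of writing and hypothetical file paths; module paths and class names may differ in other versions, so check against the version you install.

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# reference: data the model was trained/validated on.
# current: a recent production batch to compare against it.
reference = pd.read_csv("reference_data.csv")    # hypothetical file paths
current = pd.read_csv("production_batch.csv")

# Build a report from the pre-built data drift preset and render it as HTML.
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")
```

Swapping the metric preset for one of the test presets (for example, building a `TestSuite` with `DataDriftTestPreset`) gives pass/fail checks suitable for automated pipelines instead of a visual report.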
<h6>NannyML</h6>

NannyML is a tool that enables you to estimate post-deployment model performance in the absence of ground truth values, detect univariate and multivariate data drift, and link data drift alerts back to changes in model performance. In use cases where there is a delayed feedback loop (e.g., when estimating delivery ETAs, you’ll need to wait until the delivery has finished to know how accurate the predicted ETA was), this tool can provide immediate feedback on the deployed model’s performance.
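As a rough sketch of how this estimation might be wired up for a binary classification model, loosely following NannyML’s documented CBPE (confidence-based performance estimation) workflow: the column names and file paths below are hypothetical, and exact parameter names can vary between NannyML versions.

```python
import pandas as pd
import nannyml as nml

# reference: a period where ground truth is already known.
# analysis: recent production data whose labels have not arrived yet.
reference = pd.read_csv("reference_period.csv")   # hypothetical file paths
analysis = pd.read_csv("analysis_period.csv")

# Estimate classification performance without ground truth using CBPE.
estimator = nml.CBPE(
    y_pred_proba="predicted_probability",  # hypothetical column names
    y_pred="prediction",
    y_true="actual",
    timestamp_column_name="timestamp",
    metrics=["roc_auc"],
    chunk_size=5000,
    problem_type="classification_binary",
)
estimator.fit(reference)
estimated_performance = estimator.estimate(analysis)

# Inspect or visualise the estimates; plotting helpers vary by version.
# estimated_performance.plot().show()
```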