An In-depth Introduction to MLOps

It’s not a simple task to build and operate a continuous ML system. It requires many different types of operations that need to be synchronized and work in tandem. This practice is known as machine learning operations, aka MLOps.

What is MLOps?

Machine learning is a highly promising field with the potential to improve processes, support better business decisions, increase profitability, and reduce wasted resources. For this very reason, we see numerous products and services supported by machine learning in a wide range of businesses.

For instance, retailers use machine learning to forecast future sales more accurately, which leads to better inventory management and pricing plans. As a result, profitability increases and fewer products and resources are wasted.

In order to make such processes work in production, we need a much more complex system than running a machine learning algorithm on one set of data and producing results. What we need is a machine learning (ML) system deployed into production and operated as a continuous service.

However, it’s not a simple task to build and operate such a continuous ML system. It requires many different types of operations that need to be synchronized and work in tandem. This practice is known as machine learning operations, aka MLOps.

It may sound like a simple task to “deploy and manage continuously” considering the rich selection of software tools and packages available to us. These tools are helpful, but software capabilities are just one part of ML systems. For instance, a significant ingredient of these systems is data, which is live and constantly changing. Hence, we always need to keep an eye on the data being fed to the system. This is just one of the challenges associated with MLOps.

Why do we need MLOps?

The motivation behind MLOps is actually quite simple: to create value. The value can be in many different forms depending on the business. For instance, we mentioned that machine learning can be used for better inventory management and pricing decisions for retailers.

In healthcare, machine learning can be used for analyzing medical images, supporting diagnostics, predicting diseases, and developing new drugs. At production facilities, data from many sensors is processed and analyzed by machine learning models for anomaly detection and predictive maintenance. There are many other machine learning applications across a variety of fields.

We live in a world that is constantly changing. People change, and so does their behavior. What was a hot trend a few months ago might have been forgotten by now. This change is observed in data as well. The main resource of machine learning models is data, so if the data changes, what a model produces also changes. For this very reason, we cannot rely on a model trained with data collected a while ago. Hence, we need to keep training machine learning models with newly collected data.

In addition to refreshing the data, we might also need to refresh the model. What performs well with the old data might fail badly as the data changes. The model might fail to detect new correlations or structure within the fresh data. Therefore, we also need to evaluate the model performance continuously. The evaluation metrics can also give us some hints about when to retrain the model. If there is no decay in the performance, we may want to postpone training because retraining is a costly operation.
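To make the idea of "retrain only when performance decays" concrete, here is a minimal sketch in Python. The function name, the tolerance of 0.05, and the weekly accuracy values are all hypothetical choices for illustration, not a recommended setting. The sketch compares the average of recent evaluation scores with the score measured at deployment time.

```python
from statistics import mean


def should_retrain(baseline_score, recent_scores, tolerance=0.05, min_samples=3):
    """Return True when the recent average score has decayed past the tolerance.

    baseline_score: metric measured when the model was deployed (assumed known).
    recent_scores:  the same metric measured on recent production data.
    tolerance:      hypothetical amount of decay we are willing to accept.
    """
    if len(recent_scores) < min_samples:
        return False  # not enough evidence yet, so postpone the costly retraining
    decay = baseline_score - mean(recent_scores)
    return decay > tolerance


# Hypothetical weekly accuracy values
print(should_retrain(0.90, [0.89, 0.90, 0.88]))  # False: decay is about 0.01
print(should_retrain(0.90, [0.84, 0.82, 0.80]))  # True: decay is about 0.08
```

In practice the metric, the window, and the tolerance depend on the business and on how expensive retraining is.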

The point I’m trying to make is that machine learning systems need to be operated continuously and systematically to create long-term business value. Handling the aforementioned operations manually or taking long breaks between steps might result in a big failure. This is the reason why we need MLOps to create, deploy, and manage ML systems.

Key stakeholders in MLOps

MLOps is a discipline that requires different skill sets. For instance, we cannot and should not rely on a data scientist alone to create and manage an ML system. Data scientists might tackle the problem of training and evaluating machine learning models, but that is only one part of the entire system.

Domain experts are among the key stakeholders. They provide valuable insights for creating a beneficial ML system. Domain experts may not be data-oriented, but they are the people who know the business.

We cannot expect a data scientist or data engineer to have a comprehensive understanding of a business, which typically takes years of experience in a specific domain. Domain experts serve as a bridge between the data-oriented people and the business. Information flows both ways across this bridge in the form of insight, feedback, or action.

Considering the huge amount of data collected, transformed, and stored in a typical ML system, data engineers are of great importance as well. They make sure the data is served efficiently to the other key components of the system. Data engineers also process the raw data so that it becomes more usable for other stakeholders.

Depending on the size and complexity of the ML system, there might be a need for software engineers and machine learning engineers as well.

Main challenges in MLOps

There is no free lunch. While MLOps is an essential factor in generating long-term business value with machine learning, maintaining ML models in production comes with some challenges, which need to be addressed and properly handled to make the entire system reliable and robust.

The first one we will talk about is the lack of human intervention. ML models are much better than we are at surfacing what’s in the data. They are able to extract insights and detect relationships that humans could not find on their own.

However, they do not possess common sense. If ML models encounter previously unseen data, they might produce absurd results. There is always a risk that something might go very wrong. Consider a forecasting engine that automatically sends its results to the supply chain system. In the case of such absurd results, we might end up sending hundreds of units of a particular product to a store whose capacity is around 10.

This risk is not an inherent shortfall and can be overcome by retraining the models frequently, checking the results against predefined limits, and monitoring the results in production.
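As an illustration of checking results against predefined limits, the sketch below holds back a forecast that exceeds a store's capacity instead of passing it on automatically. The function, the store identifier, and the capacity table are hypothetical. A real system would take its limits from the business and would likely have more rules than this one.

```python
def check_forecast(store_id, forecast_units, capacity_by_store, max_ratio=1.0):
    """Return (approved_units, flagged) for a single forecast.

    A forecast is flagged when it is negative or exceeds the store's capacity
    multiplied by max_ratio (a hypothetical, business-defined limit).
    Flagged forecasts return None so they are held for human review
    rather than sent to the supply chain system.
    """
    limit = capacity_by_store[store_id] * max_ratio
    if forecast_units < 0 or forecast_units > limit:
        return None, True
    return forecast_units, False


# Hypothetical store with a capacity of 10 units for this product
capacity = {"store_42": 10}

print(check_forecast("store_42", 300, capacity))  # (None, True): held for review
print(check_forecast("store_42", 8, capacity))    # (8, False): safe to send
```

A simple rule like this does not make the model smarter, but it puts a human back in the loop exactly when the output looks implausible.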

Another challenge is change in the data. The world changes, and so does everything in it. Hence, it is inevitable to have new kinds of observations in the data fed to an ML system. We cannot expect a model to produce the same results when the input data changes.

Consider a spam email detection task. We cannot rely on the same model to catch spam emails for a very long time. Scammers will keep trying new strategies, and so the data will change. A model that was not trained on the new data is highly likely to fail to catch new types of spam emails. This concept is called data drift and can cause serious issues if not handled properly.

We might also experience changes in the relationship between the features and the target, which is called concept drift. For instance, in the case of a fraud detection model, the relationship between the independent variables and the dependent variable (i.e. the target variable) might change. As a result, a type of transaction that was not classified as fraud in the past can be fraud now.

Both data and concept drift can be handled by implementing robust and reliable monitoring systems and feedback loops. We can then decide when to retrain our models.
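One common way to check a single numeric feature for data drift is the Population Stability Index (PSI), which compares how values are distributed in the training data and in recent production data. It is only one option among several, and the sketch below is a simplified version that uses the standard library alone. The bin count, the toy feature values, and the 0.25 threshold (a frequently quoted rule of thumb) are assumptions for illustration.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    Bins are equal-width over the range of the reference sample. Values in
    `current` that fall outside that range are counted in the edge bins.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant reference sample: everything lands in one bin

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp into the edge bins
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Toy data: a hypothetical feature at training time and after a shift
training_values = [i / 100 for i in range(100)]
production_values = [v + 0.5 for v in training_values]

score = psi(training_values, production_values)
print("drift suspected" if score > 0.25 else "no strong evidence of drift")
```

A score of zero means the two samples fall into the bins in identical proportions, and larger scores mean a larger shift. A check like this would typically run on a schedule for each important feature.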
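Checks on the input data alone cannot reveal concept drift, because the inputs may look the same while their meaning changes. A feedback loop helps here: once the true outcome of a prediction is known, we compare it with what the model predicted. Below is a minimal sketch of such a monitor. The class name, window size, and error threshold are hypothetical.

```python
from collections import deque


class FeedbackMonitor:
    """Track the error rate over the most recent labeled predictions."""

    def __init__(self, window=100, max_error_rate=0.2):
        self.outcomes = deque(maxlen=window)  # True means the prediction was wrong
        self.max_error_rate = max_error_rate

    def record(self, predicted, actual):
        """Store one piece of feedback once the true label is confirmed."""
        self.outcomes.append(predicted != actual)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Only decide once the window is full, to avoid reacting to noise
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.error_rate() > self.max_error_rate


# Hypothetical fraud labels arriving after investigation
monitor = FeedbackMonitor(window=4, max_error_rate=0.25)
for predicted, actual in [("ok", "ok"), ("ok", "ok"), ("ok", "fraud"), ("ok", "fraud")]:
    monitor.record(predicted, actual)

print(monitor.error_rate())        # 0.5
print(monitor.needs_retraining())  # True
```

The main design choice is the window: a short one reacts quickly but is noisy, while a long one is stable but slow to notice a change.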

The last challenge we will talk about is based on a soft skill: effective communication. The process of designing, deploying, and operating an ML system involves people from different professions.

Data scientists, data engineers, software engineers, DevOps engineers, and subject matter experts participate as stakeholders in the machine learning life cycle. The success of the system heavily depends on clear and concise communication between these stakeholders. Without it, we may end up with unnecessary time gaps and delayed feedback loops, which might result in a failed system. It is also very important to clearly define the business requirements and inform the participants about them.

Final thoughts

ML models need to be deployed into production in order to generate long-term business value. MLOps is a discipline that focuses on the process of deploying and maintaining ML systems.

In this article, we talked about the motivation behind MLOps and why it is needed to prevent ML systems from failing. We also mentioned the key stakeholders involved in MLOps and why it is important to have different sets of skills.

Last but not least, we mentioned some of the challenges we may encounter while operating an ML system and the possible solutions for them.

Thank you for reading.

16 January 2023

This is a contribution from Soner Yıldırım.