The need for MLOps in large-scale AI solutions using Azure

Sahan Dissanayaka
12 min read · Dec 15, 2021

About the Topic

The core objective of this article is to unpack the buzzword and introduce readers to MLOps, the newest trend in the machine learning domain, which makes enterprise-level ML applications easy to manage, reuse, retrain, and deploy at large scale.

Machine Learning is one of the trending fields in computer science and in the software industry right now. In almost every field of study, we can put machine learning concepts into action. For a newcomer, it can feel like being the junior Avenger, Spider-Man.

Figure 1: Highlighting the popularity of Machine Learning using a popular Spider-Man meme

Name any industry, and machine learning solutions are already available for it. Industries like IT are rapidly changing by nature, many software solutions generate big data alongside their applications, and lots of companies are moving towards AI, trying to enable their teams with efficient AI + machine learning solutions for their products. For newcomers, including myself, it is always a safe bet to know about these novel trends and be industry-ready before moving to an internship or a job at a shiny tech company.

Before All, what is Machine Learning?

Wikipedia defines Machine Learning as the study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. If we decompose the definition into smaller pieces, it tells us that:

Machine learning models are computer algorithms that use data to make estimations (educated guesses) or decisions

The learning happens in an artificial context (usually inside a computer)

Data represents the experience

Finally, the built model is used to represent the knowledge learned from the data

As an example, suppose you need to implement a classifier to decide which emails in your inbox are spam and which you actually need to read. If we apply machine learning to this problem, our main focus should be to train a model in such a way that it improves its knowledge using the currently available labeled data, and then to test it on unseen data to check whether the model accurately predicts whether a given mail is spam or not.

Figure 2 : Classical spam email classification ML problem
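
To make this concrete, here is a minimal sketch of such a spam classifier using scikit-learn. The tiny dataset and the library choice are purely for illustration, not a prescribed approach.

```python
# Minimal spam-vs-ham classifier sketch using scikit-learn (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny, made-up labeled dataset: 1 = spam, 0 = not spam.
emails = [
    "Win a free prize now, click here",
    "Meeting rescheduled to 3pm tomorrow",
    "Cheap loans, limited time offer",
    "Please review the attached project report",
]
labels = [1, 0, 1, 0]

# Learn word counts as features and fit a Naive Bayes classifier on the labeled data.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# Test on unseen mail to see whether the model predicts spam or not.
print(model.predict(["Claim your free prize today"]))         # expected: [1] (spam)
print(model.predict(["Can we move our meeting to Friday?"]))  # expected: [0] (not spam)
```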

ML lifecycle: How it happens

Figure 3 : ML lifecycle

But when it comes to industry, our solutions may not be that simple. The scale of the application gets larger and the data itself can be 1 TB, 2 TB, or even petabytes. Implementing a solution should follow a proper mechanism, and that is where the ML lifecycle comes into play. The ML lifecycle describes the overall procedure of creating an ML model. As shown in Figure 3, the main stages can be identified as train, package, validate, deploy, monitor, and retrain or modify. There is no rule that says we can't use the ML lifecycle in our mini or self-learning projects; in fact, it's a good practice. Let's see each phase one by one.

Train and test model

First, data scientists need to prepare training data. This is often the biggest time commitment in the lifecycle. Preparation includes standardizing the data so it’s in a usable format and identifying discrete “features” or variables. For example, to predict credit risk, features might include customer age, account size, and account age.

Another example for more clarity,

Suppose the use case is to predict property sales for the years 2015–2020, and assume the features include size in square feet, year built, sale price, and neighborhood. Next, apply algorithms to the data to "train" a machine learning model. Then test it with new data to see how accurate its predictions are.

(Usually, the first stage is handled by the data scientists)
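
As a rough sketch of this train-and-test step, the snippet below uses pandas and scikit-learn on the hypothetical property-sales data described above; the file name, column names, and algorithm are assumptions made purely for illustration.

```python
# Sketch of the train-then-test step for the property-sales example (hypothetical data).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Assume a prepared dataset with the features mentioned above; the file name is illustrative.
df = pd.read_csv("property_sales_2015_2020.csv")
X = pd.get_dummies(df[["size_sqft", "year_built", "neighborhood"]])  # encode the categorical feature
y = df["sale_price"]

# Hold out data the model has never seen, so we can measure how accurate its predictions are.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)                                  # "train" the model
print(mean_absolute_error(y_test, model.predict(X_test)))    # evaluate on new data
```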

Package model

ML engineers containerize the model with its environment, which means creating a docker container for the model to run in with all its dependencies. The model environment includes metadata like code libraries that the model needs to execute seamlessly.
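
For a sense of what this looks like in code, here is a minimal sketch using the Azure ML v1 SDK (azureml-core) to describe and register a reusable environment; the workspace config, environment name, and package list are assumptions.

```python
# Sketch: capturing the model's environment with the Azure ML v1 SDK (azureml-core).
# Workspace config, environment name, and package list are illustrative assumptions.
from azureml.core import Workspace, Environment
from azureml.core.conda_dependencies import CondaDependencies

ws = Workspace.from_config()  # assumes a config.json downloaded from the Azure ML workspace

env = Environment(name="credit-risk-env")
env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["scikit-learn", "pandas", "joblib"]  # libraries the model needs to run
)

# Registering the environment makes it shareable and reusable; Azure ML builds a Docker
# image from it at deployment time, so the model always runs with the same dependencies.
env.register(workspace=ws)
```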

Validate model

At this point, the team evaluates how model performance compares to their business goals. For example, a company might want to optimize for accuracy over speed in some cases.

Repeat steps 1–3

It can take hundreds of training hours to find a satisfactory model. The development team may train many versions of the model by adjusting training data, tuning algorithm hyperparameters, or trying totally different algorithms. Ideally, the model improves with each round of adjustment. Ultimately, it’s the development team’s role to determine which version of the model best fits the business use case.
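
One common way to run this loop is an automated hyperparameter search. The sketch below uses scikit-learn's GridSearchCV and reuses the training data from the earlier property-sales sketch; the grid values are illustrative.

```python
# Sketch of one way to run the "repeat" loop: try several hyperparameter settings
# and keep the best model. Reuses X_train / y_train from the earlier property sketch.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,                               # cross-validation on the training data
    scoring="neg_mean_absolute_error",  # the metric the team cares about
)
search.fit(X_train, y_train)

print(search.best_params_)            # the settings that worked best
best_model = search.best_estimator_   # candidate for validation against business goals
```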

Deploy model

Finally, we deploy the model, most often in the cloud behind an API, but possibly also on an on-premises server or at the edge on devices like cameras, IoT gateways, or machinery.
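
When the target is a web service, Azure ML expects a scoring script with an init() and a run() entry point. The sketch below shows that shape; the model file name and the input format are assumptions.

```python
# score.py: a minimal scoring-script sketch in the style Azure ML expects for web
# service deployments (init() loads the model once, run() serves each request).
# The model file name and the input format are illustrative assumptions.
import json

import joblib
import numpy as np


def init():
    global model
    # Loaded once when the container starts; the path is illustrative (Azure ML
    # normally resolves it from the registered model's directory).
    model = joblib.load("model.pkl")


def run(raw_data):
    # Called per request; expects JSON like {"data": [[feature values], ...]}.
    data = np.array(json.loads(raw_data)["data"])
    predictions = model.predict(data)
    return predictions.tolist()
```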

Monitor and retrain the model

Even if a model works well at first, it needs to be continually monitored and retrained to stay relevant and accurate. Monitor its behavior and business value, and know when to replace or deprecate a stale model.

This lifecycle is also known as the E2E, or end-to-end, ML lifecycle.

Typical ML challenges

As I mentioned earlier, we are not interested here in small-scale ML projects. If we step up to the organizational level, when organizations experiment and build solutions with AI, they find that creating a machine learning (ML) model is just the first of many steps in the ML lifecycle. Managing the entire lifecycle at scale is complicated. Here is why.

Organizations have to be able to document and manage data on their servers, and that data continuously evolves alongside new code, model environments, and the machine learning models themselves. They need to establish processes for developing, packaging, and deploying models, as well as monitoring their performance and occasionally retraining them.

And most organizations are managing multiple models in production at the same time, adding to the complexity. All of this is challenging due to lack of:

1. Cross-team alignment: Siloed teams impede workflow alignment and collaboration.

2. Standard, repeatable processes: Without automated and repeatable processes, employees have to reinvent the wheel each time they create and deploy a new model.

3. Resources: Large amounts of time and personnel are required to manage the lifecycle.

4. Auditability: It can be difficult to ensure that models meet regulatory standards and performance thresholds over time.

5. Explainability: Black box models make it difficult to understand how the model works.

These challenges are similar to what application development teams face when creating and managing apps and software. To help, they use DevOps, the industry standard for managing operations for an application development cycle. To address these challenges with machine learning, organizations need an approach that brings the agility of DevOps to the ML lifecycle.

We call this approach MLOps. Some people call it DevOps for machine learning, because MLOps brings the philosophy of DevOps to machine learning by automating end-to-end workflows. Let's look at the MLOps workflow.

What is MLOps?

So, it is a combination of Machine Learning and DevOps, which you can think of as the child of ML and DevOps parents. MLOps is crucial for companies scaling AI: it empowers data scientists and app developers to bring ML models to production, lets you track, version, audit, certify, and reuse every asset in your ML lifecycle, and provides orchestration services to streamline lifecycle management. In a glimpse, MLOps is how you bring your data science to production.

Figure 4: MLOps workflow diagram

As the diagram in Figure 4 explains,

Phase 1: DATA

Production starts with data acquisition, a prominent step because data itself plays a major role in AI solutions: the relevant data sources should be identified, collected, and then stored in a systematic manner. Some definitions include this step as part of the ML phase, but separating it this way makes more sense for practitioners because it highlights the importance of data collection and cleaning.

Phase 2: ML

This step covers the initial business understanding and modeling, and it outputs the optimal model that can be created with the available data. This phase contains all of the stages in a classical ML lifecycle except packaging, testing, and monitoring, which are now the job of the development team. Usually, data scientists and data engineers take care of this phase.

Phase 3: DEV

The development team does the staging with the available model by packaging it, and keeps it solid by removing loopholes through proper testing before deploying it to production. The development phase also covers the other parts of the application unrelated to the AI solution, such as implementing user interfaces, middleware, and backend operations.

Phase 4: OPS

Usually, operations include continuous delivery, data feedback, and system and model monitoring, as in a typical DevOps cycle. It may also be necessary to calibrate and configure the system depending on its non-functional requirements. In the end, production carries on.

In this fashion, MLOps enables data science and IT teams to collaborate and streamline the machine learning lifecycle at scale. Throughout MLOps, data scientists, ML engineers, app developers, and other IT teams come together to manage the ML lifecycle collaboratively.

MLOps using Azure: Cloud Platform for MLOps

Azure ML provides a number of asset-management and orchestration services to help you manage the lifecycle of your model training and deployment workflows.

With the support of Azure DevOps, you can effectively and cohesively manage your datasets, experiments, models, and pipelines.

MLOps is part of Azure Machine Learning, and models can also be deployed to IoT Edge devices. The high-level end-to-end ML cycle is as follows:

Figure 5: End-to-End ML lifecycle using Microsoft Azure Machine Learning Services

Azure Machine Learning

Azure Machine Learning services provide the key artifacts for MLOps shown here. All of the services required throughout this article can be implemented through this single service.

Figure 6: Azure Machine Learning services
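
As a quick taste of how these services are reached from code, here is a small sketch that connects to a workspace and logs a training run using the v1 azureml-core SDK; the experiment name and metric value are illustrative.

```python
# Sketch: connecting to an Azure ML workspace and logging a training run with the
# v1 azureml-core SDK. The experiment name and metric value are illustrative.
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()                       # reads config.json for the workspace
experiment = Experiment(workspace=ws, name="spam-classifier")

run = experiment.start_logging()                   # creates a tracked run in the workspace
run.log("accuracy", 0.93)                          # metrics show up in Azure ML Studio
run.complete()
```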

MLOps Practices: What MLOps provides in return

As of now, we know that MLOps processes and tools help those teams collaborate and provide visibility through shared, auditable documentation. MLOps technologies provide the ability to save and track changes to data sources, code, libraries, SDKs, and models. These technologies can also create efficiencies and accelerate the lifecycle with automation, repeatable workflows, and reusable assets. We can break them into four main sections.

Figure 7: MLOps Practices

1. Model reproducibility

During initial iterative training and later model retraining, there are a few things that can make the complex process more manageable. First, it's helpful to centrally manage assets like environments, code, datasets, and models so teams can share and reuse them. Azure Machine Learning provides all of this for your application; you just need to create an instance in your Azure tenant. We can also create reusable ML pipelines using the Azure Machine Learning extension for Azure DevOps. Let's analyze which factors help preserve a model's reproducibility.

Model registry

It's a practical fact that an ML model will not come out perfect on the first shot. That's why we need a model registry, which provides a central place to manage and version your models. With a registry, teams can easily revert to a previous version if something isn't working, even after the solution has gone into production. The model registry also serves as an audit trail for each model's history and makes it possible to automatically trigger workflows after certain actions or events.
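
A minimal sketch of registering a model with the Azure ML v1 SDK might look like this; the file path, model name, and tags are assumptions.

```python
# Sketch: registering a trained model in the Azure ML model registry (v1 SDK).
# The file path, model name, and tags are illustrative assumptions.
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()

model = Model.register(
    workspace=ws,
    model_path="outputs/model.pkl",    # local file produced by the training run
    model_name="spam-classifier",      # the registry assigns versions 1, 2, 3, ... automatically
    tags={"stage": "candidate"},
)
print(model.name, model.version)       # teams can revert to any earlier version later
```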

Code management

This generally includes code repositories like GitHub where code can be saved, versioned, shared, and reused. It also includes tools for using and versioning code libraries, notebooks, and software development kits (SDKs). You can integrate GitHub code directly into your MLOps pipelines.

Dataset management

We also recommend saving training datasets centrally. This way, teams can reuse them, share them with colleagues, or monitor how they change over time in order to manage drift.
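
With Azure ML, centrally registering and versioning a dataset could look roughly like the sketch below (v1 SDK); the datastore path and dataset name are assumptions.

```python
# Sketch: registering a training dataset centrally so it can be shared, reused,
# and monitored for drift (Azure ML v1 SDK). Path and names are illustrative.
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, "training-data/property_sales_2015_2020.csv")
)
dataset = dataset.register(
    workspace=ws,
    name="property-sales",
    create_new_version=True,   # keeps a versioned history of how the data changes over time
)
```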

Shared environments

Create model environments that can be shared among individuals. This simplifies the handoff between steps in the model creation process and makes it possible for teams to collaborate on certain steps.

Second, we recommend creating machine learning “pipelines.”

Figure 8: Example for ML pipeline

The image above shows an example of an ML pipeline. Pipelines are independently executable workflows of complete machine learning tasks (such as data preparation, training configuration, training processes, and model validation). Having independent steps saved to a pipeline allows multiple data scientists to work on the same pipeline concurrently. Additionally, when data scientists need to go back and make changes to their work, they can start from where the change needs to occur instead of going back to the beginning. This helps them avoid re-running costly and time-intensive steps like data ingestion if the underlying data hasn't changed.
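
A stripped-down two-step pipeline in the Azure ML v1 SDK might look like the sketch below; the script names and compute target are assumptions, and allow_reuse is what lets unchanged steps be skipped on later runs.

```python
# Sketch: a two-step Azure ML pipeline (v1 SDK) where data preparation and training
# are independent, reusable steps. Script names and compute target are illustrative.
from azureml.core import Workspace, Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

prep_step = PythonScriptStep(
    name="prepare-data",
    script_name="prep.py",
    compute_target="cpu-cluster",
    allow_reuse=True,          # skip this step on re-runs if its inputs haven't changed
)
train_step = PythonScriptStep(
    name="train-model",
    script_name="train.py",
    compute_target="cpu-cluster",
)
train_step.run_after(prep_step)

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
Experiment(ws, "property-sales-pipeline").submit(pipeline)
```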

2. Model validation

Before a model is deployed, it's critical to validate its performance metrics against the business use case. In machine learning terms, evaluation metrics like precision, recall, F1 score, and the AUC-ROC curve give a better understanding of the built model. It is critical to reduce false positives as much as possible in ML models, and in such cases the confusion matrix provides a better explanation of how well the model predicts new data, which indicates the "best model" for our use case. It's important to work with data scientists to understand what metrics are important and evaluate them before deployment. If the model is a newer version of an existing model, you'll need to see if it performs better than the previous one on key metrics.
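
As a small illustration, the sketch below computes these metrics with scikit-learn for a binary classifier; the synthetic dataset and simple model stand in for whatever the training phase actually produced.

```python
# Sketch: the validation metrics mentioned above, computed with scikit-learn.
# The synthetic data and model are stand-ins for the real training output.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, confusion_matrix)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)   # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predictions = model.predict(X_test)
scores = model.predict_proba(X_test)[:, 1]                    # probabilities for ROC

print("precision:", precision_score(y_test, predictions))     # how many flagged items were right
print("recall:   ", recall_score(y_test, predictions))        # how many true cases were caught
print("f1:       ", f1_score(y_test, predictions))
print("auc-roc:  ", roc_auc_score(y_test, scores))
print(confusion_matrix(y_test, predictions))                   # top-right entry counts false positives
```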

3. Model deployment

Once you have registered the ML model, you can again use Azure ML + Azure DevOps to deploy it. You can define a release definition in Azure Pipelines to help coordinate a release. Using the DevOps extension for Machine Learning, you can include artifacts from Azure ML, Azure Repos, and GitHub as part of your release pipeline. This is only one option for deploying models using the cloud (often behind an API). Scalable web infrastructures like Kubernetes or Azure Container Instances are often used to automate and simplify this process.

Models can also be deployed directly on on-premises servers or on edge devices like cameras, IoT gateways, and machinery. No matter where you deploy the model, the workflow is similar. First, you register the model in the model registry. Then you prepare to deploy it by specifying assets, usage, and the compute target. Finally, you deploy it to your desired location, test it, and continue to monitor model-specific metrics throughout the lifecycle.
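
Putting those three steps together with the Azure ML v1 SDK and Azure Container Instances could look roughly like this sketch; the model, environment, scoring script, and service names reuse the earlier illustrative examples.

```python
# Sketch: deploying a registered model as a web service on Azure Container Instances
# (Azure ML v1 SDK). Names reuse the earlier illustrative sketches.
from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="spam-classifier")                     # pulled from the model registry
env = Environment.get(workspace=ws, name="credit-risk-env")   # the packaged environment

inference_config = InferenceConfig(entry_script="score.py", environment=env)
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "spam-classifier-service", [model],
                       inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)                                    # the REST endpoint clients call
```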

4. Model retraining

Most of the time, one cycle of production does not end the application's development; it is just the beginning of another cycle with new features and requirements. Models need to be monitored and periodically retrained to correct performance issues and take advantage of newer training data, because over time the data becomes outdated and drags down application performance, and to set yourself up for success you need new data. You need a retraining loop: a systematic, iterative process to continually refine the model and ensure its accuracy.
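
One way to set up such a loop in Azure ML is to schedule a published training pipeline to re-run periodically. The sketch below uses the v1 SDK; the pipeline id placeholder, experiment name, and weekly frequency are assumptions.

```python
# Sketch: scheduling periodic retraining by re-running a published training pipeline
# (Azure ML v1 SDK). Pipeline id, experiment name, and frequency are illustrative.
from azureml.core import Workspace
from azureml.pipeline.core import PublishedPipeline
from azureml.pipeline.core.schedule import Schedule, ScheduleRecurrence

ws = Workspace.from_config()
published = PublishedPipeline.get(ws, id="<published-pipeline-id>")   # placeholder id

recurrence = ScheduleRecurrence(frequency="Week", interval=1)         # retrain weekly
Schedule.create(
    ws,
    name="weekly-retraining",
    pipeline_id=published.id,
    experiment_name="property-sales-pipeline",
    recurrence=recurrence,
)
```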

Conclusion

So, as a summary, this article highlights the need for MLOps in large-scale ML projects, applying the philosophies of DevOps practices, for newbies in the machine learning domain. There are many more concepts to cover to gain a deeper understanding of MLOps, and more practical examples should be demonstrated as well. But as a head start, I hope this article will be helpful for many people.

If you need more clarity on MLOps in a more descriptive manner, follow this link, where I explain "Start the Machine Learning lifecycle using MLOps" in a webinar with case studies.



Sahan Dissanayaka

Data Science Engineer at Air Arabia | CS @ UCSC | Gold MLSA