Building End-To-End MLOps on AWS

Learn how MLOps works and how you can use it to build an end-to-end Machine Learning Operations solution with Amazon SageMaker AI.

MLOps Components

Amazon SageMaker AI is a fully managed service that makes it easy for enterprises to build end-to-end, production-ready machine learning pipelines without sacrificing speed, security, or accessibility. This article proposes a reference architecture based on the Amazon SageMaker AI ecosystem so you can get started right away with your own ML projects on AWS.

What is MLOps?

MLOps, or Machine Learning Operations, is a set of practices that combines Machine Learning, DevOps, and Data Engineering to streamline the end-to-end machine learning lifecycle. It aims to design, build, and manage reproducible, testable, and evolvable ML-powered software. MLOps encompasses the entire machine learning development lifecycle, including data collection, model development, deployment, monitoring, and maintenance.

Key aspects of MLOps include:

  1. Automation of ML workflows
  2. Continuous integration and deployment (CI/CD) for ML models
  3. Model versioning and experiment tracking
  4. Monitoring and management of ML models in production
  5. Collaboration between data scientists, ML engineers, and operations teams

MLOps vs DevOps

While MLOps and DevOps share similar principles of automation, collaboration, and continuous improvement, they differ in their focus and implementation:

  1. Scope: DevOps primarily deals with software development and IT operations, while MLOps extends these practices to machine learning workflows.
  2. Artifacts: DevOps focuses on code and application deployments, whereas MLOps deals with data, models, and experiments in addition to code.
  3. Testing: DevOps emphasizes functional and performance testing, while MLOps includes additional testing for model accuracy, fairness, and drift.
  4. Monitoring: DevOps monitors application performance and user experience, while MLOps also tracks model performance, data quality, and concept drift.
  5. Iteration cycles: ML models often require more frequent updates than traditional software, leading to shorter iteration cycles in MLOps.

Both disciplines bring a continuous, iterative approach to their respective domains. DevOps focuses on bridging the gap between development and operations teams, emphasizing collaboration, automation, and rapid delivery of high-quality software through practices like continuous integration, continuous delivery, and infrastructure as code. MLOps tailors that same approach to the unique challenges of building, deploying, and maintaining machine learning models.

Why do you need MLOps?

MLOps practices are essential in today's rapidly evolving AI landscape due to the inherent complexity of machine learning projects. Building, training, and deploying machine learning models is a multifaceted process that requires the cooperation and expertise of various team members, including data scientists, ML engineers, DevOps specialists, and business stakeholders. Without a structured approach, organizations often struggle with inconsistent processes, lack of reproducibility, and difficulties in scaling their ML initiatives. MLOps addresses these challenges by providing a framework that standardizes practices across the entire ML lifecycle. It ensures that teams work together seamlessly, from data preparation and model development to deployment and monitoring.

By implementing MLOps, organizations can achieve faster time-to-market for ML products, improve model quality through rigorous testing and validation, and enable continuous improvement of models in production. Moreover, MLOps practices enhance collaboration between different teams, fostering a culture of shared responsibility and continuous learning. This approach not only improves the efficiency of ML workflows but also helps organizations maintain compliance with regulatory requirements and industry standards, ultimately leading to more reliable and impactful AI-driven solutions.

Benefits of MLOps

Implementing MLOps practices offers numerous benefits to organizations, including:

  1. Faster time to market: By automating and streamlining ML workflows, MLOps reduces the time required to develop, test, and deploy models.
  2. Improved ML model quality: Continuous testing, validation, and monitoring lead to more robust and reliable ML models.
  3. Standardization of ML processes: MLOps establishes consistent practices across teams, reducing errors and improving efficiency.
  4. Enhanced collaboration: MLOps fosters better communication and cooperation between data scientists, ML engineers, and operations teams.
  5. Increased model reproducibility: Version control and experiment tracking ensure that ML experiments can be easily reproduced and validated.

MLOps Reference Architecture

We will start with the simplest form of Machine Learning Operations (MLOps) and gradually add other building blocks to have a complete picture in the end. Let’s dive in!

Exploration Block

Given a business problem or a process improvement opportunity identified and documented by a business analyst, the machine learning operation starts with exploratory data analysis (EDA), where data scientists familiarize themselves with a sample of the data and apply several machine learning techniques and algorithms to find the best ML solution. They leverage Amazon SageMaker Studio Classic, a web-based integrated development environment (IDE) for machine learning, to ingest the data, perform data analysis, process the data, and train and deploy models for making inferences using a non-production endpoint.

Inside Amazon SageMaker Studio Classic, they have access to Amazon SageMaker Data Wrangler, which contains over 300 built-in data transformations to quickly prepare the data without having to write any code. You can use other tools, such as Amazon Athena and AWS Glue, to explore and prepare data. All the data scientists' experiments are tracked using the SageMaker Experiments capability for reproducibility.
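
For illustration, here is a minimal sketch of tracking an exploratory training run with the SageMaker Experiments SDK. The experiment name, run name, parameters, and metric values are hypothetical placeholders, not part of the reference architecture.

```python
# A minimal sketch of experiment tracking with the SageMaker Experiments SDK
# (sagemaker >= 2.123). All names and values below are illustrative.
from sagemaker.experiments.run import Run
from sagemaker.session import Session

session = Session()

with Run(
    experiment_name="churn-exploration",  # hypothetical experiment name
    run_name="xgboost-baseline",          # hypothetical run name
    sagemaker_session=session,
) as run:
    # Record the hyperparameters tried in this exploratory run
    run.log_parameters({"max_depth": 5, "eta": 0.2})

    # ... ingest data, engineer features, and train a candidate model here ...

    # Record the resulting evaluation metric so runs can be compared later
    run.log_metric(name="validation:auc", value=0.91)
```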

MLOps Exploration Block

ML Workflow Block

Next, machine learning engineers convert the solution proposed by the data scientists into production-ready ML code and create an end-to-end machine learning workflow, including data processing, feature engineering, training, model evaluation, and model creation for deployment using a variety of available hosting options. On AWS, there are four options for orchestrating an end-to-end ML workflow with Amazon SageMaker AI integration:

Amazon SageMaker Pipelines: Using the Pipelines SDK, a series of interconnected steps defines the entire ML pipeline as a directed acyclic graph (DAG); a minimal sketch follows this list of options.

Amazon Managed Workflows for Apache Airflow (MWAA): Using the Airflow SageMaker operators or the Airflow PythonOperator, an end-to-end ML pipeline can be configured, scheduled, and monitored.

AWS Step Functions: Integration between Amazon SageMaker AI and AWS Step Functions through the AWS Step Functions Data Science SDK lets you easily create multi-step machine learning workflows.

Kubeflow Orchestration: Amazon SageMaker AI Components for Kubeflow Pipelines allow you to submit SageMaker AI processing, training, and HPO jobs, and to deploy models, directly from a Kubeflow pipeline workflow.
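
To make the first option concrete, here is a minimal sketch of a two-step SageMaker Pipeline. The role ARN, S3 bucket, and preprocessing script are placeholders; the DAG is inferred from the data dependency between the two steps.

```python
# A minimal two-step SageMaker Pipeline sketch: processing then training.
# The role ARN, S3 bucket, and preprocess.py script are placeholders.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder
session = sagemaker.session.Session()

# Step 1: data processing / feature engineering
processor = SKLearnProcessor(framework_version="1.2-1", role=role,
                             instance_type="ml.m5.xlarge", instance_count=1)
step_process = ProcessingStep(
    name="PrepareData",
    processor=processor,
    code="preprocess.py",  # placeholder processing script
    outputs=[ProcessingOutput(output_name="train",
                              source="/opt/ml/processing/train")],
)

# Step 2: training, consuming the output of the processing step
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost",
                                            session.boto_region_name, "1.7-1"),
    role=role, instance_type="ml.m5.xlarge", instance_count=1,
    output_path="s3://my-bucket/models",  # placeholder
)
step_train = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput(s3_data=step_process.properties
            .ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri)},
)

# The DAG is derived from the data dependency between the steps
pipeline = Pipeline(name="mlops-demo-pipeline",
                    steps=[step_process, step_train])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
pipeline.start()                # launch an execution
```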

As part of the model evaluation/test step, AWS Step Functions can run a comprehensive suite of ML-related tests. Additionally, Amazon SageMaker Feature Store is used to store, share, and manage features for machine learning (ML) models during training (offline storage) and inference (online storage). Finally, SageMaker AI ML Lineage Tracking is enabled to track data and model lineage metadata, which is crucial for ML workflow reproducibility, model governance, and audit standards.
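
As a rough illustration of the Feature Store piece, the sketch below creates a small feature group with both the offline store (for training) and the online store (for inference) enabled. The feature group name, schema, S3 location, and role ARN are assumptions made for the example.

```python
# A minimal SageMaker Feature Store sketch; all names and data are illustrative.
import time

import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder
session = sagemaker.session.Session()

# Tiny example feature table; event_time is a Unix timestamp (Fractional)
df = pd.DataFrame({
    "customer_id": [1, 2],
    "tenure_months": [12.0, 34.0],
    "event_time": [1700000000.0, 1700000000.0],
})

fg = FeatureGroup(name="customer-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)  # infer feature names and types

fg.create(
    s3_uri="s3://my-bucket/feature-store",  # offline store for training
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,               # online store for inference
)

# Wait for the feature group to become active before writing records
while fg.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)

fg.ingest(data_frame=df, max_workers=1, wait=True)
```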

MLOps ML Workflow Block

Continuous Integration Block

After implementing the first two blocks, we already have a fully functioning ML workflow and an endpoint that users or applications can call to consume the ML predictions. To take this to the next level and eliminate the manual work that results from any update to the code or infrastructure, an MLOps engineer builds the Continuous Integration (CI) block.

This enables data scientists and ML engineers to regularly merge their changes into AWS CodeCommit, after which automated builds and tests are run using AWS CodeBuild. This includes building any custom ML image based on the latest container hosted on Amazon ECR, which the ML workflow subsequently references.

MLOps Continuous Integration Block

Continuous Deployment Block

Not only do we want to build and test our ML application each time we push a code change to AWS CodeCommit, but we also want to deploy the ML application to production continuously. In the Continuous Deployment (CD) block, the proposed solution is extended to decouple the model training workflow from the model deployment section. This provides an opportunity to update the model configuration and infrastructure without affecting the training workflow.

To manage model versions and model metadata and to automate model deployment with CI/CD, the Amazon SageMaker Model Registry is used. Here, a manual approver (e.g., a senior data scientist) approves the model, which triggers a new automated deployment through AWS CodePipeline, AWS CodeBuild, and AWS CloudFormation. The automatic rollback feature of AWS CloudFormation is key here in case anything goes wrong during the deployment. Amazon SageMaker AI endpoint autoscaling is also configured to handle changes in the inference workload.
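
As a sketch of how the approval and autoscaling pieces might look with boto3, the snippet below approves the latest package in a hypothetical model package group and registers the endpoint variant for target-tracking autoscaling. The wiring from the approval event to CodePipeline (typically an Amazon EventBridge rule) is assumed to already exist.

```python
# A minimal sketch of model approval and endpoint autoscaling with boto3.
# The model package group, endpoint name, and capacities are placeholders.
import boto3

sm = boto3.client("sagemaker")

# Find the latest model package registered by the training workflow
packages = sm.list_model_packages(
    ModelPackageGroupName="churn-model-group",  # placeholder group
    SortBy="CreationTime", SortOrder="Descending", MaxResults=1,
)
latest_arn = packages["ModelPackageSummaryList"][0]["ModelPackageArn"]

# The manual approver flips the status; an EventBridge rule (assumed to be
# configured) reacts to this change and starts the CodePipeline deployment
sm.update_model_package(
    ModelPackageArn=latest_arn,
    ModelApprovalStatus="Approved",
    ApprovalDescription="Validated by the senior data scientist",
)

# Register the deployed endpoint variant for target-tracking autoscaling
aas = boto3.client("application-autoscaling")
resource_id = "endpoint/churn-endpoint/variant/AllTraffic"  # placeholder
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1, MaxCapacity=4,
)
aas.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```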

MLOps Continuous Deployment Block

Monitoring & Continuous Training Block

When building machine learning models, we assume that the data used to train the model is similar to the data the model will see when making predictions. Therefore, any drift in the data distribution or in model performance on new inference data means data scientists need to revisit the features or retrain the model to reflect the most recent changes.

The Continuous Training (CT) block continuously monitors data and model quality to detect any bias or drift and inform timely remediation. This is achieved by enabling Amazon SageMaker Model Monitor in the inference endpoint configuration and creating baseline jobs during data preparation and training, along with Amazon SageMaker Clarify and Amazon CloudWatch events, to automatically monitor for any drift, trigger retraining, and notify the relevant parties.
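
A minimal sketch of that Model Monitor setup follows, assuming data capture is already enabled on the endpoint configuration and using placeholder S3 paths and resource names.

```python
# A minimal data-quality monitoring sketch with SageMaker Model Monitor.
# Assumes data capture is enabled on the endpoint; names are placeholders.
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # placeholder

monitor = DefaultModelMonitor(role=role, instance_count=1,
                              instance_type="ml.m5.xlarge")

# Compute baseline statistics and constraints from the training data
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",  # placeholder
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitor/baseline",
)

# Hourly schedule comparing captured traffic against the baseline; violations
# surface as CloudWatch metrics that can trigger retraining and notifications
monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-data-quality",  # placeholder
    endpoint_input="churn-endpoint",             # placeholder endpoint
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    output_s3_uri="s3://my-bucket/monitor/reports",
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```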

MLOps Continuous Training Block

Security and Governance Consideration

To make our MLOps architecture secure, we recommend some important security and governance measures, including:

  • Deploying the solution in a VPC and securely connecting resources to Amazon SageMaker Studio through VPC endpoints.
  • Separation of concerns through multi-account architecture. AWS Control Tower offers a mechanism to easily set up and secure multi-account AWS environments.
  • Defining proper user and service roles to perform required operations on the AWS account and to run tasks such as Amazon SageMaker training jobs and ML workflows. AWS IAM is a powerful service for this purpose.
  • Implementing least-privilege policies. AWS Service Catalog allows customers to deploy products with the required resources centrally provisioned, without requiring each user to launch them separately.
  • Enforcing guardrails, such as requiring proper resource tagging or limiting the types of resources used, for different users and roles. You can use AWS Organizations and its Service Control Policies (SCP) feature for enterprise-scale guardrail management.
  • Encrypting data at rest and in transit. Beyond Amazon SageMaker AI's default encryption of model artifacts and of the storage attached to training instances, an AWS KMS key can encrypt sensitive data (see the sketch after this list).
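
To ground a few of these recommendations, here is a sketch that applies the VPC, KMS encryption, and tagging controls to a single SageMaker training job. Every ARN, subnet ID, and security group ID is a placeholder.

```python
# A sketch of security controls applied to one training job; all identifiers
# below (account, ARNs, subnet, security group, bucket) are placeholders.
from sagemaker.estimator import Estimator

kms_key = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"  # placeholder

estimator = Estimator(
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-ml-image:latest",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path="s3://my-secure-bucket/models",
    volume_kms_key=kms_key,                # encrypt the training volume
    output_kms_key=kms_key,                # encrypt the model artifacts
    encrypt_inter_container_traffic=True,  # encrypt inter-node traffic
    subnets=["subnet-0123456789abcdef0"],  # run inside the VPC
    security_group_ids=["sg-0123456789abcdef0"],
    tags=[{"Key": "project", "Value": "mlops-demo"}],  # tagging guardrail
)
```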

The Caylent Approach to MLOps

In this blog post, we have proposed a reference architecture for building an end-to-end secure, scalable, repeatable, and reproducible MLOps solution to process data, train, create, and update models, as well as deploy and monitor them, using Amazon SageMaker AI features with CI/CD/CT concepts incorporated.

If your team needs expert assistance to deploy Machine Learning models on AWS at scale, consider engaging with Caylent to craft a custom roadmap with our collaborative MLOps Strategy Assessment and/or realize your vision through our MLOps pods.

Ali Arabi

Ali Arabi is a Senior Machine Learning Architect at Caylent with extensive experience in solving business problems by building and operationalizing end-to-end cloud-based Machine Learning and Deep Learning solutions and pipelines using Amazon SageMaker AI. He holds an MBA and an MSc in Data Science & Analytics and is an AWS Certified Machine Learning professional.
