Learn how you can build an end-to-end secure, scalable, repeatable, and reproducible Machine Learning Ops solution using Amazon SageMaker.
Amazon SageMaker is a fully managed service that makes it easy for enterprises to build end-to-end production-ready machine learning pipelines without sacrificing speed, security, and accessibility. This article will propose a reference architecture based on the Amazon SageMaker ecosystem so you can get started right away with your own ML projects operating on the AWS platform.
We will start with the simplest form of Machine Learning Operations (MLOps) and gradually add other building blocks to have a complete picture in the end. Let’s dive in!
Given a business problem or process improvement opportunity identified and documented by the business analyst, the machine learning operation starts with exploratory data analysis (EDA), where data scientists familiarize themselves with a sample of the data and try several machine learning techniques and algorithms to find the best ML solution. They use Amazon SageMaker Studio, a web-based integrated development environment (IDE) for machine learning, to ingest the data, perform data analysis, process the data, and train and deploy models for inference behind a non-production endpoint. Inside Amazon SageMaker Studio they have access to Amazon SageMaker Data Wrangler, which contains over 300 built-in data transformations for preparing data quickly without writing any code. Amazon Athena and AWS Glue can also be used to explore and prepare data. All experiments run by the data scientists are tracked with Amazon SageMaker Experiments for reproducibility.
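As a rough illustration of exploring a data sample with Amazon Athena, the sketch below builds the request for Athena's StartQueryExecution API and submits it through a client you pass in (expected to be `boto3.client("athena")`). The table, database, and bucket names are hypothetical and not part of the reference architecture.

```python
from typing import Any, Dict


def build_athena_query_request(sql: str, database: str, output_s3_uri: str) -> Dict[str, Any]:
    """Build keyword arguments for Athena's StartQueryExecution API."""
    if not output_s3_uri.startswith("s3://"):
        raise ValueError("output_s3_uri must be an s3:// URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }


def start_query(athena_client: Any, request: Dict[str, Any]) -> str:
    """Submit the query and return its execution ID.

    `athena_client` is assumed to be boto3.client("athena").
    """
    response = athena_client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
request = build_athena_query_request(
    sql="SELECT label, COUNT(*) AS n FROM training_sample GROUP BY label",
    database="ml_exploration",
    output_s3_uri="s3://example-athena-results/eda/",
)
```

Keeping the request builder separate from the API call makes the query definition easy to review and test without AWS credentials.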
ML Workflow Block
Next, machine learning engineers convert the solution proposed by the data scientists into production-ready ML code and create an end-to-end machine learning workflow covering data processing, feature engineering, training, model evaluation, and model creation for deployment using a variety of available hosting options. On AWS, the options for orchestrating an end-to-end ML workflow with Amazon SageMaker integration include:
Amazon SageMaker Pipelines
Amazon Managed Workflows for Apache Airflow (MWAA)
AWS Step Functions workflows
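To make the orchestration idea concrete, here is a minimal sketch that takes the AWS Step Functions option and chains three SageMaker tasks into a linear Amazon States Language definition. The state names and input fields are hypothetical, and the `Parameters` are deliberately incomplete: a real definition would need the full request body for each SageMaker API.

```python
import json
from typing import Any, Dict, List, Tuple

# (state name, service-integration resource ARN, task parameters)
Step = Tuple[str, str, Dict[str, Any]]


def build_linear_workflow(steps: List[Step]) -> Dict[str, Any]:
    """Chain Task states one after another into an ASL definition."""
    if not steps:
        raise ValueError("at least one step is required")
    states: Dict[str, Any] = {}
    for index, (name, resource, parameters) in enumerate(steps):
        state: Dict[str, Any] = {
            "Type": "Task",
            "Resource": resource,
            "Parameters": parameters,
            # null ResultPath discards the task result and passes the
            # original input on, so later states can still read from it.
            "ResultPath": None,
        }
        if index + 1 < len(steps):
            state["Next"] = steps[index + 1][0]
        else:
            state["End"] = True
        states[name] = state
    return {"StartAt": steps[0][0], "States": states}


# ".sync" makes Step Functions wait for the SageMaker job to finish.
workflow = build_linear_workflow([
    ("ProcessData", "arn:aws:states:::sagemaker:createProcessingJob.sync",
     {"ProcessingJobName.$": "$.ProcessingJobName"}),
    ("TrainModel", "arn:aws:states:::sagemaker:createTrainingJob.sync",
     {"TrainingJobName.$": "$.TrainingJobName"}),
    ("CreateModel", "arn:aws:states:::sagemaker:createModel",
     {"ModelName.$": "$.ModelName"}),
])
definition_json = json.dumps(workflow, indent=2)
```

The resulting JSON string is what would be supplied as the state machine definition. Evaluation and conditional branching (for example a `Choice` state on a quality metric) are left out of this sketch.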
As part of the model evaluation/test step, an AWS Step Functions workflow is launched to run a comprehensive suite of ML-related tests. Additionally, Amazon SageMaker Feature Store is used to store, share, and manage features for ML models during training (offline store) and inference (online store). Finally, Amazon SageMaker ML Lineage Tracking is enabled to track data and model lineage metadata, which is crucial for ML workflow reproducibility, model governance, and audit standards.
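As an illustration of writing features to the online store, the sketch below converts a flat Python mapping into the `Record` shape expected by the Feature Store runtime PutRecord API and sends it through a client you pass in (expected to be `boto3.client("sagemaker-featurestore-runtime")`). The feature names and the feature group are hypothetical.

```python
from typing import Any, Dict, List, Mapping


def to_feature_record(features: Mapping[str, Any]) -> List[Dict[str, str]]:
    """Convert a flat mapping into the Record list used by PutRecord.

    Values are sent as strings; features whose value is None are skipped.
    """
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in features.items()
        if value is not None
    ]


def put_features(runtime_client: Any, feature_group_name: str,
                 features: Mapping[str, Any]) -> None:
    """Write one record to a feature group.

    `runtime_client` is assumed to be
    boto3.client("sagemaker-featurestore-runtime").
    """
    runtime_client.put_record(
        FeatureGroupName=feature_group_name,
        Record=to_feature_record(features),
    )


# Hypothetical features; the record must include the feature group's
# record identifier and event time features.
record = to_feature_record({
    "customer_id": 42,
    "avg_basket": 17.5,
    "segment": None,
    "event_time": "2023-01-01T00:00:00Z",
})
```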
After implementing the first two blocks, we already have a fully functioning ML workflow and an endpoint that users or applications can call to consume ML predictions. To take this to the next level and eliminate the manual work that follows any update to the code or infrastructure, an MLOps engineer builds the Continuous Integration (CI) block. It lets data scientists and ML engineers regularly merge their changes into AWS CodeCommit, after which AWS CodeBuild runs automated builds and tests. This includes building any custom ML image based on the latest container hosted on Amazon ECR, which the ML workflow then references.
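One small piece of this CI step can be sketched in code: deriving the image URI that the ML workflow will reference. The example assumes the build tags each image with a prefix of the commit ID that CodeBuild exposes in `CODEBUILD_RESOLVED_SOURCE_VERSION`, and that the account is in the standard AWS partition. The tagging convention and repository name are assumptions, not something the architecture prescribes.

```python
import re
from typing import Mapping


def image_tag_from_env(env: Mapping[str, str], length: int = 12) -> str:
    """Use a prefix of the resolved commit ID as the tag, else 'latest'."""
    commit = env.get("CODEBUILD_RESOLVED_SOURCE_VERSION", "")
    return commit[:length] if commit else "latest"


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Compose an Amazon ECR image URI (standard AWS partition assumed)."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("account_id must be a 12-digit AWS account ID")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Hypothetical account, repository, and commit ID.
tag = image_tag_from_env({"CODEBUILD_RESOLVED_SOURCE_VERSION": "0123456789abcdef0123"})
uri = ecr_image_uri("123456789012", "us-east-1", "ml/training", tag)
```

Tagging by commit rather than only by `latest` keeps each workflow run traceable to the exact code that produced its container.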
Not only do we want to build and test our ML application each time a code change is pushed to AWS CodeCommit, we also want to deploy it to production continuously. In the Continuous Deployment (CD) block, the proposed solution is extended to decouple the model training workflow from model deployment. This makes it possible to update the model configuration and infrastructure without affecting the training workflow. Amazon SageMaker Model Registry is used to manage model versions and model metadata and to automate model deployment with CI/CD. Here, a manual approver (e.g. a senior data scientist) approves the model, which triggers a new automated deployment through AWS CodePipeline, AWS CodeBuild, and AWS CloudFormation. The automatic rollback feature of AWS CloudFormation is key here in case anything goes wrong during the deployment. Amazon SageMaker endpoint auto scaling is also configured to handle changes in the inference workload.
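For the endpoint auto scaling piece, the following sketch builds the two Application Auto Scaling requests involved: registering the endpoint variant as a scalable target and attaching a target-tracking policy on invocations per instance. The endpoint name, capacity limits, target value, and cooldowns are illustrative assumptions. In the architecture above this configuration would more likely live in the AWS CloudFormation template, so treat the Python form as a way to see the shape of the settings.

```python
from typing import Any, Dict, Tuple


def build_autoscaling_requests(
    endpoint_name: str,
    variant_name: str,
    min_instances: int,
    max_instances: int,
    invocations_per_instance: float,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Build kwargs for RegisterScalableTarget and PutScalingPolicy."""
    if not 1 <= min_instances <= max_instances:
        raise ValueError("require 1 <= min_instances <= max_instances")
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-{variant_name}-invocations",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # seconds; illustrative values
            "ScaleOutCooldown": 60,
        },
    }
    return target, policy


def apply_autoscaling(autoscaling_client: Any, target: Dict[str, Any],
                      policy: Dict[str, Any]) -> None:
    """`autoscaling_client` is assumed to be boto3.client("application-autoscaling")."""
    autoscaling_client.register_scalable_target(**target)
    autoscaling_client.put_scaling_policy(**policy)


# Hypothetical endpoint and limits.
target, policy = build_autoscaling_requests("churn-prod", "AllTraffic", 1, 4, 200)
```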
Monitoring & Continuous Training Block
When building machine learning models, we assume that the data the model was trained on will be similar to the data it sees when making predictions. Any drift in the data distribution, or in model performance on new inference data, therefore means data scientists need to revisit the features or retrain the model to reflect the most recent changes. The Continuous Training (CT) block continuously monitors data and model quality to detect any bias or drift and prompt timely remediation. This is achieved by enabling Amazon SageMaker Model Monitor in the inference endpoint configuration and creating baseline jobs during data preparation and training, together with Amazon SageMaker Clarify and Amazon CloudWatch Events, to automatically monitor for drift, trigger retraining, and notify the relevant parties.
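To show what "drift against a baseline" means in practice, here is a small, self-contained sketch of the Population Stability Index (PSI) for a single numeric feature. This is not how Amazon SageMaker Model Monitor computes its statistics; it is a stand-in technique chosen only to illustrate comparing live data with a training baseline. The binning scheme and smoothing constant are assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    baseline: Sequence[float], current: Sequence[float], bins: int = 10
) -> float:
    """PSI between two samples, using equal-width bins over the baseline range."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has no spread; PSI is undefined")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            # Values outside the baseline range fall into the edge bins.
            counts[min(max(index, 0), bins - 1)] += 1
        # Floor each proportion so empty bins do not produce log(0).
        return [max(count / len(values), 1e-6) for count in counts]

    expected = proportions(baseline)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Synthetic data: a uniform baseline and a copy shifted upward.
baseline_sample = [i / 100 for i in range(1000)]
shifted_sample = [x + 5 for x in baseline_sample]

psi_same = population_stability_index(baseline_sample, baseline_sample)
psi_shifted = population_stability_index(baseline_sample, shifted_sample)
```

A PSI near zero means the two distributions match. A common rule of thumb treats values above roughly 0.2 as a significant shift, which is the kind of signal that could feed a retraining trigger.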
To make our MLOps architecture secure, several important security and governance requirements are recommended.
In this blog post, we have proposed a reference architecture for building an end-to-end secure, scalable, repeatable, and reproducible MLOps solution that processes data, trains, creates, and updates models, and deploys and monitors deployed models using Amazon SageMaker features, with CI/CD/CT concepts incorporated.
If your team needs expert assistance to deploy Machine Learning models on AWS at scale, consider engaging with Caylent to craft a custom roadmap with our collaborative MLOps Strategy Assessment and/or realize your vision through our MLOps pods.
Ali Arabi is a Senior Machine Learning Architect at Caylent with extensive experience in solving business problems by building and operationalizing end-to-end cloud-based Machine Learning and Deep Learning solutions and pipelines using Amazon SageMaker. He holds an MBA and an MSc in Data Science & Analytics and is an AWS Certified Machine Learning professional.