Amazon SageMaker Inference Endpoints are a powerful way to deploy your machine learning models in the cloud and make predictions on new data at scale. However, it can be challenging to understand which deployment option to pick. This article provides an overview of the available options and helps you decide which endpoint deployment type is best for your use case. Depending on the business use case, inference workload, required latency, and cost factors, to name a few, you can choose from one of the options below:
Fig. 1: ML Model Hosting options on AWS
SageMaker Inference Endpoint Options:
Serverless Inference:
Serverless inference is a fully managed inference endpoint, suitable for workloads with intermittent or infrequent traffic patterns, with built-in high availability and fault tolerance. There is no need to select instance types, provision capacity, or set scaling policies: the service automatically provisions and scales the compute capacity up or down based on the volume of inference requests. Two things to remember: the maximum request payload for serverless inference endpoints is 4 MB, and the invocation timeout is 60 seconds. The main configuration for serverless inference is the memory size selection, up to 6 GB. Suitable workloads for these endpoints include form processing for a bank’s mortgage department or chatbots.
Fig. 2: Serverless Inference (Image created by the author)
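As a rough illustration, the sketch below shows how a serverless endpoint might be configured with boto3. The model, endpoint config, and endpoint names are hypothetical placeholders, and the memory and concurrency values are examples you would tune for your workload.

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical names; the model "my-model" is assumed to already exist in SageMaker.
sm.create_endpoint_config(
    EndpointConfigName="my-serverless-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-model",
            # No instance type: capacity is provisioned on demand per request volume.
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,   # up to 6144 MB
                "MaxConcurrency": 10,
            },
        }
    ],
)
sm.create_endpoint(EndpointName="my-serverless-endpoint",
                   EndpointConfigName="my-serverless-config")
```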
Real-Time Inference:
Real-time inference endpoints are also fully managed and are suitable for workloads with high throughput and low latency requirements. The main configuration for real-time inference includes setting the auto-scaling policy, the compute resources, and the deployment mode; the deployment modes include single model, multi-model, and inference pipeline. The maximum request payload is 6 MB with a 60-second timeout. In contrast to serverless endpoints, real-time inference endpoints can leverage many different instance types, with up to eight NVIDIA A100 Tensor Core GPUs, 100 Gbps networking, 96 vCPUs, 1.1 TB of instance memory, and 8 TB of NVMe storage. When configuring the endpoint, customers can use production variants to deploy different model versions with A/B testing or shadow deployment strategies. An example of a typical workload is personalized recommendations for users on an e-commerce website.
Fig. 3: Real-time Inference (Image created by the author)
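For example, a real-time endpoint with two production variants for A/B testing might look like the hedged boto3 sketch below; the model names, instance type, and traffic weights are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical models "model-a" and "model-b" are assumed to already exist.
sm.create_endpoint_config(
    EndpointConfigName="my-realtime-config",
    ProductionVariants=[
        {
            "VariantName": "VariantA",
            "ModelName": "model-a",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.9,   # ~90% of traffic
        },
        {
            "VariantName": "VariantB",
            "ModelName": "model-b",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,   # ~10% of traffic for the A/B test
        },
    ],
)
sm.create_endpoint(EndpointName="my-realtime-endpoint",
                   EndpointConfigName="my-realtime-config")
```

Auto scaling is attached separately to each variant through Application Auto Scaling once the endpoint is in service.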
Asynchronous Inference:
If your request payload is large (up to 1 GB), involves long-running processing (up to 15 minutes), and latency is not a concern, then asynchronous inference is the best option for you. In this scenario, inference requests are handled asynchronously through an internal queue. Unlike serverless and real-time inference endpoints, the request and the response for asynchronous inference are placed in, and referenced from, an S3 bucket. Typical workloads suitable for asynchronous endpoints are computer vision or NLP problems where the payloads, such as videos or documents, can be large.
Fig. 4: Asynchronous Inference (Image created by the author)
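A minimal sketch of how this might look with boto3, assuming a hypothetical existing model named "my-model" and an S3 bucket you own for inputs and outputs:

```python
import boto3

sm = boto3.client("sagemaker")
smr = boto3.client("sagemaker-runtime")

sm.create_endpoint_config(
    EndpointConfigName="my-async-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",              # hypothetical existing model
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
    AsyncInferenceConfig={
        "OutputConfig": {"S3OutputPath": "s3://my-bucket/async-results/"},
    },
)
sm.create_endpoint(EndpointName="my-async-endpoint",
                   EndpointConfigName="my-async-config")

# The request payload is referenced from S3 rather than sent inline;
# the response lands under the configured S3OutputPath.
response = smr.invoke_endpoint_async(
    EndpointName="my-async-endpoint",
    InputLocation="s3://my-bucket/async-inputs/document-001.json",
)
print(response["OutputLocation"])
```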
Batch Transform:
In some cases, a persistent compute resource is not necessary, and the application needs to make inferences against a large dataset that can be scheduled as an ad hoc job. For this scenario, long-running Batch Transform jobs can handle large payloads using a batch strategy (mini-batches of up to 100 MB each). Similar to asynchronous inference endpoints, the request and the response are placed in an S3 bucket. Batch Transform is also a viable option for testing different models by running a separate transform job for each model. An example of a typical workload is propensity modeling for user conversion to inform the correct treatment and offer.
Fig. 5: Batch Transform inference (Image created by the author)
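A hedged sketch of a transform job with boto3 is shown below; the job name, model name, and S3 locations are placeholders, and the content type assumes a CSV dataset split line by line into mini-batches.

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical model and S3 locations; records are grouped into mini-batches
# whose combined size stays under MaxPayloadInMB.
sm.create_transform_job(
    TransformJobName="propensity-scoring-job-001",
    ModelName="my-model",
    MaxPayloadInMB=100,
    BatchStrategy="MultiRecord",
    TransformInput={
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/batch-input/",
        }},
        "ContentType": "text/csv",
        "SplitType": "Line",
    },
    TransformOutput={"S3OutputPath": "s3://my-bucket/batch-output/"},
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)
```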
The table below summarizes the four options and can help you choose the best model hosting option on Amazon SageMaker.
| Endpoint | Serverless | Real-Time | Asynchronous | Batch Transform |
|---|---|---|---|---|
| Max Payload | 4 MB | 6 MB | 1 GB | 100 MB (dataset) |
| Time-out | 60 sec | 60 sec | 15 min | Long-running jobs |
| Required Configuration | Memory selection | Compute, auto-scaling policy | Compute, auto-scaling policy, concurrency, notification | Compute, concurrency, batch strategy |
| Instance Type | CPU | CPU, GPU, Inferentia | CPU, GPU | CPU, GPU |
| Latency | Varies depending on cold/warm start | Within milliseconds | Varies depending on internal queue size and worker status | Long-running jobs |
| Cost | 💰💰💰 | 💰💰💰💰 | 💰💰 | 💰 |
| Input | HTTP request | HTTP request | HTTP request | Datasets in S3 |
| Workload characteristics | Intermittent or infrequent traffic patterns | Persistent traffic patterns | Near-real-time and persistent large payloads | Batch prediction, ad hoc |
Latency vs. Cost considerations:
Beyond understanding the available Amazon SageMaker hosting options, there are also some strategies and capabilities we can leverage to strike the right balance between latency and cost. We’ll review a few of them below.
- AWS PrivateLink deployment:
Overall, ML application latency consists of overhead latency and model inference latency. AWS PrivateLink deployments make it possible to reduce overhead latency and improve security by keeping all inference traffic within your VPC and by using the endpoint deployed in the AZ closest to the origin of the inference traffic to process the invocations. By default, Amazon SageMaker deploys endpoints across two different AZs.
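As a hedged sketch, a VPC interface endpoint for the SageMaker runtime can be created with boto3 roughly as follows; the region, VPC, subnet, and security group IDs are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Interface endpoint for the SageMaker runtime, so invoke_endpoint calls
# stay on the AWS network inside your VPC instead of traversing the internet.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                        # placeholder
    ServiceName="com.amazonaws.us-east-1.sagemaker.runtime",
    SubnetIds=["subnet-0123456789abcdef0"],               # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],            # placeholder
    PrivateDnsEnabled=True,
)
```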
- Amazon SageMaker Elastic Inference:
If you need to host a deep learning model for inference and are on a budget, you can leverage Amazon SageMaker Elastic Inference (EI), which costs less than a full GPU instance while still offering high throughput and low latency. You must use an EI-enabled version of TensorFlow, PyTorch, or MXNet, or another framework via ONNX. One important consideration: Elastic Inference doesn’t support all operations across all ML frameworks, so make sure you test your model with EI.
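An EI accelerator is attached through the endpoint configuration alongside a regular CPU instance. The sketch below assumes a hypothetical EI-enabled model and uses an example accelerator size.

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical EI-enabled model; the accelerator is attached to a CPU instance
# instead of provisioning a full GPU instance.
sm.create_endpoint_config(
    EndpointConfigName="my-ei-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-ei-enabled-model",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "AcceleratorType": "ml.eia2.medium",   # example accelerator size
    }],
)
```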
- NVIDIA Triton Inference Server:
If the inference workflow is composed of several complicated steps, including preprocessing, transformation, model selection, and post-processing, then Triton can maximize throughput while also providing ultra-low inference latency. The Triton Inference Server integration with Amazon SageMaker supports ML frameworks such as TensorFlow, PyTorch, XGBoost, and NVIDIA TensorRT, as well as other frameworks via ONNX, and users can choose among NVIDIA GPUs, CPUs, and AWS Inferentia for the compute resource.
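As a rough sketch only: deploying on SageMaker with Triton comes down to registering a model whose container is the region-specific SageMaker Triton image and whose artifact is a packaged Triton model repository. The image URI, role ARN, and environment variable below are assumptions and placeholders, not confirmed values.

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder: substitute the region-specific SageMaker Triton container image URI.
triton_image = "<account-id>.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:<tag>"

sm.create_model(
    ModelName="my-triton-model",
    ExecutionRoleArn="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    PrimaryContainer={
        "Image": triton_image,
        # Triton model repository packaged as model.tar.gz (model files + config.pbtxt).
        "ModelDataUrl": "s3://my-bucket/triton/model.tar.gz",
        "Environment": {
            # Assumed setting for selecting the default model in the repository.
            "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "my_model",
        },
    },
)
```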
- Multi-model inference endpoint:
Another cost-effective hosting option is a multi-model endpoint. Multi-model inference endpoints allow customers to improve endpoint utilization by hosting multiple models on the same serving container and sharing memory across them. Models can be added to or removed from a real-time multi-model inference endpoint dynamically. This is a good option for hosting models of similar size and latency behind a single endpoint, and it works with Auto Scaling and AWS PrivateLink.
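A minimal sketch of the idea, assuming a multi-model-capable serving image and hypothetical names: the model points at an S3 prefix rather than a single artifact, and the caller picks the artifact at invocation time.

```python
import boto3

sm = boto3.client("sagemaker")
smr = boto3.client("sagemaker-runtime")

sm.create_model(
    ModelName="my-multi-model",
    ExecutionRoleArn="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    PrimaryContainer={
        "Image": "<multi-model-capable serving image URI>",  # placeholder
        "Mode": "MultiModel",
        # S3 prefix; individual model.tar.gz files under it are loaded on demand.
        "ModelDataUrl": "s3://my-bucket/multi-model-artifacts/",
    },
)

# At invocation time, TargetModel selects which artifact under the prefix to use.
response = smr.invoke_endpoint(
    EndpointName="my-multi-model-endpoint",   # assumes an endpoint built from this model
    TargetModel="model-42.tar.gz",
    ContentType="text/csv",
    Body=b"0.5,1.2,3.4",
)
```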
- Amazon SageMaker Inference Recommender:
It is always a good idea to test different configurations for the inference endpoint to get the best performance at the lowest cost for your particular use case. Users can leverage the Amazon SageMaker Inference Recommender capability to load test a real-time endpoint with different configurations, based on the specified ML framework, ML domain, and ML task, right from a SageMaker notebook, eliminating weeks of manual testing and tuning. Users can then choose the best configuration based on the reported metrics for latency, price-performance, and throughput.
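A hedged sketch of kicking off a default recommendation job with boto3 is shown below; the job name, role ARN, and model package ARN are placeholders, and a versioned model package registered in the SageMaker Model Registry is assumed.

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_inference_recommendations_job(
    JobName="my-recommender-job",
    JobType="Default",   # "Advanced" runs a customized load test
    RoleArn="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    InputConfig={
        "ModelPackageVersionArn": (
            "arn:aws:sagemaker:us-east-1:111122223333:"
            "model-package/my-model-group/1"              # placeholder
        ),
    },
)

# Per-configuration metrics can be retrieved once the job completes.
print(sm.describe_inference_recommendations_job(JobName="my-recommender-job"))
```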
In this blog post, we’ve reviewed the different ML inference endpoint options on Amazon SageMaker and their typical use cases. If your team needs this expertise to help you deploy machine learning models on AWS at scale, consider engaging with Caylent through our MLOps pods.