Amazon SageMaker AI enables developers to deploy powerful machine learning models. Learn the various options and which endpoint deployment type is best suited for your business.
Ali Arabi
Ali Arabi is a Senior Machine Learning Architect at Caylent with extensive experience in solving business problems by building and operationalizing end-to-end cloud-based Machine Learning and Deep Learning solutions and pipelines using Amazon SageMaker AI. He holds an MBA and an MSc in Data Science & Analytics and is an AWS Certified Machine Learning professional.
Amazon SageMaker AI Inference Endpoints are a powerful tool to deploy your machine learning models in the cloud and make predictions on new data at scale. However, it can be challenging to understand which deployment options to pick. This article aims to provide an overview of the various options and help you decide which endpoint deployment type is best for your use case.
SageMaker AI Inference Endpoint Options
Depending on factors such as the business use case, inference workload, required latency, and cost, you can choose from one of the options below:
Serverless Inference
Serverless inference is a fully managed inference endpoint, suitable for workloads with intermittent or infrequent traffic patterns, with built-in high-availability and fault tolerance capabilities. There is no need to select instance types, provision capacity, or set scaling policies, as the service automatically provisions and scales compute capacity up or down based on the volume of inference requests.
Two things to remember: the maximum request payload for serverless inference endpoints is 4 MB, with a 60-second timeout. The main configuration for serverless inference is memory size selection, up to 6 GB. Suitable workloads for these endpoints include form processing for a bank’s mortgage department and chatbots.
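For illustration, a minimal sketch of deploying a model to a serverless endpoint with the SageMaker Python SDK might look like the following; the container image URI, model artifact location, and execution role are placeholders.

```python
# Hypothetical sketch: serverless endpoint deployment with the SageMaker Python SDK.
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

model = Model(
    image_uri="<inference-container-image-uri>",  # placeholder
    model_data="s3://my-bucket/model.tar.gz",     # placeholder
    role="<execution-role-arn>",                  # placeholder
)

# Memory size (1024-6144 MB) and max concurrency are the two main serverless knobs.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=10,
)

predictor = model.deploy(serverless_inference_config=serverless_config)
```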
Real-time Inference
Amazon SageMaker AI fully manages these endpoint types, making them suitable for workloads with high throughput and low latency requirements. The main configuration for real-time inference includes setting the autoscaling policy, compute resource selection, and the deployment mode. The deployment modes include single model, multi-model, and inference pipeline.
The maximum request payload is 6 MB with a 60-second timeout. However, real-time inference endpoints can leverage many different instance types with up to eight NVIDIA A100 Tensor Core GPUs, 100 Gbps networking, 96 vCPUs, 1.1 TB of instance memory, and 8 TB of NVMe storage. When configuring the endpoint, customers can use production variants to deploy different model versions with A/B testing or shadow deployment strategies. An example of a typical workload is personalized recommendations for users on an e-commerce website.
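As a hedged example, creating two production variants for A/B testing with boto3 could look like the sketch below; the model names, instance types, and traffic weights are illustrative and assume both models have already been registered.

```python
# Illustrative sketch: two production variants behind one real-time endpoint for A/B testing.
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="recsys-ab-config",
    ProductionVariants=[
        {
            "VariantName": "model-a",
            "ModelName": "recsys-model-a",   # assumed to exist already
            "InstanceType": "ml.g5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.8,     # ~80% of traffic
        },
        {
            "VariantName": "model-b",
            "ModelName": "recsys-model-b",   # assumed to exist already
            "InstanceType": "ml.g5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.2,     # ~20% of traffic
        },
    ],
)

sm.create_endpoint(EndpointName="recsys-endpoint", EndpointConfigName="recsys-ab-config")
```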
Asynchronous Inference
If your request payload is large (up to 1 GB), the job involves long-running processing (up to 15 minutes), and latency is not a concern, then asynchronous inference is the best option for you. In this scenario, an internal queue handles inference requests asynchronously.
Unlike serverless and real-time inference endpoints, the system places the request and the response for asynchronous inference in an S3 bucket. Typical workloads suitable for asynchronous endpoints are computer vision or NLP problems where the payloads can be large, such as videos or documents.
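A rough sketch of invoking an existing asynchronous endpoint with boto3 is shown below; the endpoint name and S3 locations are placeholders, and the request payload is assumed to have been uploaded to S3 already.

```python
# Illustrative sketch: invoking an asynchronous inference endpoint.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint_async(
    EndpointName="video-nlp-async-endpoint",                 # placeholder
    InputLocation="s3://my-bucket/requests/doc-001.json",    # payload already in S3
    ContentType="application/json",
)

# The call returns immediately; poll S3 or subscribe (e.g. via SNS) for the result.
print(response["OutputLocation"])
```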
Batch Transform
In some cases, a persistent compute resource is not necessary, and the application needs to make inferences against a large dataset that can be scheduled as an ad-hoc job. For this scenario, you can use long-running Batch Transform jobs to handle large payloads using a batch strategy (mini-batches of up to 100 MB each). As with asynchronous inference endpoints, the system places the request and the response in an S3 bucket.
Batch Transform can also be a viable way to compare different models by running a separate transform job per model. An example of a typical workload is propensity modeling for user conversion to inform the right treatment and offer.
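A minimal sketch of launching a Batch Transform job with the SageMaker Python SDK, assuming a Model object has already been created, might look like this; the S3 paths and instance settings are illustrative.

```python
# Illustrative sketch: scoring a large dataset with a Batch Transform job.
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    strategy="MultiRecord",          # pack multiple records into each mini-batch
    max_payload=100,                 # mini-batch size limit in MB
    output_path="s3://my-bucket/propensity-scores/",  # placeholder
)

transformer.transform(
    data="s3://my-bucket/users-to-score.csv",         # placeholder
    content_type="text/csv",
    split_type="Line",               # one record per line
)
transformer.wait()
```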
Summary: SageMaker Inference Endpoint Options
The table below summarizes the four options and can be used to inform the best model hosting option on Amazon SageMaker AI.

| Option | Max request payload | Processing time | Typical workloads |
| --- | --- | --- | --- |
| Serverless Inference | 4 MB | 60-second timeout | Intermittent or infrequent traffic, e.g. form processing, chatbots |
| Real-time Inference | 6 MB | 60-second timeout | High-throughput, low-latency workloads, e.g. personalized recommendations |
| Asynchronous Inference | 1 GB | Up to 15 minutes | Large payloads such as videos or documents, latency-tolerant |
| Batch Transform | 100 MB mini-batches | Long-running jobs | Ad-hoc or scheduled inference over large datasets, e.g. propensity modeling |
Additional Options for Latency vs. Cost Considerations
Beyond understanding the available Amazon SageMaker AI hosting options, there are also some strategies and capabilities that we can leverage to pick the right balance between latency and cost. We’ll review a few of them below.
AWS PrivateLink Deployment
Overall, ML application latency consists of overhead latency and model inference latency. AWS PrivateLink deployments make it possible to reduce overhead latency and improve security by keeping all inference traffic within your VPC and by using the endpoint deployed in the AZ closest to the origin of the inference traffic to process invocations. By default, Amazon SageMaker AI endpoints are deployed across two different AZs.
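For example, an interface VPC endpoint for the SageMaker runtime can be created with boto3 roughly as follows; the region, VPC, subnet, and security group IDs are placeholders.

```python
# Hedged sketch: interface VPC endpoint so endpoint invocations stay inside your VPC.
import boto3

ec2 = boto3.client("ec2")

ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                       # placeholder
    ServiceName="com.amazonaws.us-east-1.sagemaker.runtime",  # adjust region as needed
    SubnetIds=["subnet-aaaa1111", "subnet-bbbb2222"],    # one per AZ used by the endpoint
    SecurityGroupIds=["sg-0123456789abcdef0"],           # placeholder
    PrivateDnsEnabled=True,
)
```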
Elastic Inference
If you need to host a deep learning model for inference on a budget, you can leverage Amazon SageMaker AI Elastic Inference (EI), which costs less than a full GPU instance while still delivering high throughput and low latency. You must use an EI-enabled version of TensorFlow, PyTorch, MXNet, or another framework via ONNX. One important consideration: Elastic Inference doesn’t support all operations across all ML frameworks, so make sure you test your model with EI.
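As a rough illustration, attaching an EI accelerator at deploy time with the SageMaker Python SDK could look like this, assuming the model uses an EI-enabled framework container; the instance and accelerator types are illustrative.

```python
# Illustrative sketch: CPU host instance plus an Elastic Inference accelerator.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",        # CPU host instance
    accelerator_type="ml.eia2.medium",  # fractional GPU acceleration attached to the host
)
```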
NVIDIA Triton Inference Server
If the inference workflow is composed of several complicated steps, including preprocessing, transformation, model selection, and post-processing, then Triton can maximize throughput while also providing ultra-low inference latency. Triton Inference Server is integrated with Amazon SageMaker AI, supports ML frameworks such as TensorFlow, PyTorch, XGBoost, NVIDIA TensorRT, and other frameworks via ONNX, and lets users choose among NVIDIA GPUs, CPUs, and AWS Inferentia for the compute resource.
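A hedged sketch of registering a Triton-served model with boto3 is shown below; the Triton container image URI is region- and account-specific and shown only as a placeholder, and the S3 model repository is assumed to follow Triton’s layout conventions.

```python
# Hypothetical sketch: registering a model served by the Triton container.
import boto3

sm = boto3.client("sagemaker")

sm.create_model(
    ModelName="triton-ensemble-model",
    ExecutionRoleArn="<execution-role-arn>",  # placeholder
    PrimaryContainer={
        # Placeholder; the actual sagemaker-tritonserver image URI depends on region and version.
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:<tag>",
        "ModelDataUrl": "s3://my-bucket/triton-model-repository.tar.gz",  # placeholder
        "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "ensemble"},
    },
)
```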
Multi-model Inference Endpoint
Another cost-effective hosting option is a multi-model endpoint. Multi-model inference endpoints allow customers to improve endpoint utilization by hosting multiple models on the same serving container and sharing memory across models. Models can be added to or removed from a real-time multi-model inference endpoint at any time. This is a good option for hosting models of similar size and latency profile behind a single endpoint. This option also works with Auto Scaling and AWS PrivateLink.
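As an illustrative sketch, invoking one specific model behind a multi-model endpoint with boto3 could look like this; the endpoint name, target model, and payload are placeholders, and the endpoint is assumed to have been created in multi-model mode with a shared S3 model prefix.

```python
# Illustrative sketch: routing a request to one model hosted on a multi-model endpoint.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="mme-endpoint",                 # placeholder
    TargetModel="customer-churn-v3.tar.gz",      # relative path under the shared S3 prefix
    ContentType="text/csv",
    Body=b"34,0,1,129.5",                        # placeholder feature vector
)
print(response["Body"].read())
```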
Amazon SageMaker AI Inference Recommender
It is always a good idea to test different endpoint configurations to find the best performance at the lowest cost for your particular use case. Users can leverage the Amazon SageMaker AI Inference Recommender capability to load test a real-time endpoint with different configurations based on the specified ML framework, ML domain, and ML task, right from a SageMaker AI notebook, eliminating weeks of manual testing and tuning. Users can then choose the best configuration based on the reported metrics for latency, price, performance, and throughput.
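A minimal sketch of starting a default Inference Recommender job with boto3 might look like the following; the job name, role ARN, and model package ARN are placeholders.

```python
# Illustrative sketch: kicking off an Inference Recommender load-testing job.
import boto3

sm = boto3.client("sagemaker")

sm.create_inference_recommendations_job(
    JobName="recsys-inference-recommendation",
    JobType="Default",                       # "Advanced" allows custom traffic patterns
    RoleArn="<execution-role-arn>",          # placeholder
    InputConfig={
        "ModelPackageVersionArn": "<model-package-version-arn>",  # placeholder
    },
)
```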
The Caylent Approach to Machine Learning
As an AWS Generative AI competency partner, we have significant experience in helping customers lead the AI revolution, leveraging AI to enhance operational efficiency, improve internal and external communications, and offer highly personalized experiences for their customers.