re:Invent 2024

How To Use ParallelCluster for HPC on AWS: A Case Study

Migrations
AWS Foundations

Explore how we helped a customer in the financial sector migrate High-Performance Computing (HPC) workloads from an on-premises Slurm cluster to AWS ParallelCluster, including the process, the challenges, and the benefits.

High-Performance Computing (HPC) has become an essential tool in many industries, including the financial sector. The capital markets industry, in particular, relies heavily on computational power to process vast amounts of data, run complex simulations, and execute sophisticated trading algorithms. Our recent project involved migrating an on-premises Slurm cluster to AWS ParallelCluster for a customer in this industry. This blog post shares our perspective on the process, the benefits of AWS for HPC workloads, and why we recommend this approach.

Understanding the Customer's Needs

Our customer operates in the capital markets industry, which demands high computational power, low latency, and robust data security. They had been running their HPC workloads on an on-premises Slurm cluster, which was becoming increasingly difficult to manage and scale. The customer's primary challenges were:

  1. Scalability Issues: The on-premises infrastructure could not scale resources up or down quickly. During peak periods, computational demand exceeded the available resources, leading to performance bottlenecks and delays.
  2. High Maintenance Costs: Maintaining physical hardware and infrastructure was costly and time-consuming. The customer had to invest continually in hardware upgrades, cooling systems, and physical space, which strained their budget.
  3. Performance Bottlenecks: The aging hardware often led to performance issues. The on-premises cluster struggled to keep up with the customer's needs for real-time data processing and complex simulations, which are critical for trading algorithms.
  4. Management Complexity: Managing and configuring the on-premises Slurm cluster was complex and required specialized knowledge. This complexity made it difficult to adapt quickly to changing requirements and to optimize the environment for different workloads.

Why AWS ParallelCluster?

AWS ParallelCluster is an open-source cluster management tool that simplifies the deployment and management of HPC clusters on AWS.

It is well suited to traditional HPC workloads such as simulations, large-scale data analysis, and scientific computing, whereas a solution like Ray is better suited to machine learning, AI, data processing, and general-purpose distributed applications. Here's why we recommended ParallelCluster for our customer:

  1. Scalability and Flexibility: AWS allows for seamless scaling of compute resources. ParallelCluster can automatically adjust the size of the compute fleet based on the workload, so our customer only pays for what they use. ParallelCluster's flexibility also made it easy to export logs and metrics from the cluster to Amazon CloudWatch.
  2. Cost Efficiency: By moving to AWS, our customer eliminated the capital expenses of maintaining on-premises hardware. AWS’s pay-as-you-go pricing model and Spot Instances offered significant cost savings.
  3. High Performance: AWS provides a wide range of instance types optimized for HPC workloads. The integration with Slurm, a powerful workload manager, ensured that our customer’s jobs were efficiently scheduled and executed. ParallelCluster also supports AWS Batch as an alternative scheduler.
  4. Enhanced Security: AWS offers a robust security framework that includes network isolation, encryption, and compliance with industry standards. This was crucial for our customer in the financial sector. With integrated security services such as IAM and AWS Secrets Manager, our configuration scripts could assume the correct role and retrieve the secrets needed to configure the ParallelCluster nodes, with minimal permissions setup.
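To make the Secrets Manager point concrete, here is a minimal sketch of how a node configuration script might read a JSON secret. It is illustrative only and is not taken from the project: the function name `load_node_secret` and the secret ID shown in the usage note are hypothetical, and the sketch assumes the secret is stored as a JSON string. The `client` argument is expected to behave like `boto3.client("secretsmanager")`.

```python
import json
from typing import Any, Dict


def load_node_secret(client: Any, secret_id: str) -> Dict[str, str]:
    """Fetch a JSON-formatted secret and return it as a dict.

    `client` is assumed to expose get_secret_value(SecretId=...), as the
    boto3 Secrets Manager client does. Passing the client in keeps this
    function easy to test without AWS credentials.
    """
    response = client.get_secret_value(SecretId=secret_id)
    # Text secrets are returned under "SecretString"; this sketch assumes
    # the value is a JSON object such as {"username": "...", "password": "..."}.
    return json.loads(response["SecretString"])
```

On a cluster node this could be called as `load_node_secret(boto3.client("secretsmanager"), "hpc/ad-join")`, where the secret ID is a placeholder. The node's instance role would need `secretsmanager:GetSecretValue` on that secret and nothing broader.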

Migrating to ParallelCluster

The customer initially migrated their workload from an on-premises Slurm setup to AWS ParallelCluster v2 to take advantage of cloud scalability and manageability. When they needed the newer features available in AWS ParallelCluster v3, they called on Caylent's expertise to upgrade their environment. This upgrade involved a series of strategic steps and optimizations to enhance performance. Key aspects of the upgrade included:

1. Assessment and Planning: We began with a thorough assessment of the existing HPC environment. This included understanding the current workload, resource utilization, and performance bottlenecks. We also defined the migration strategy, including timelines and milestones.

2. Implementing the ParallelCluster API: AWS provides a CloudFormation template to create the ParallelCluster API. After setting up this API, we created a Terraform provider to interface with it. This allowed us to manage ParallelCluster resources using Terraform, which the customer was already using.
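As a rough illustration of what a client of the ParallelCluster API sends, the sketch below builds (but does not send) a create-cluster request. It is not the Terraform provider described above. The path `/v3/clusters`, the body fields `clusterName` and `clusterConfiguration`, and the `region` and `dryrun` query parameters reflect our reading of the ParallelCluster API reference and should be verified against it; the function name and the example URL are hypothetical. A real call must also be signed with IAM SigV4 credentials, which this sketch leaves out.

```python
import json
from typing import Tuple
from urllib.parse import urlencode


def build_create_cluster_request(
    api_base_url: str,
    cluster_name: str,
    cluster_config_yaml: str,
    region: str,
    dry_run: bool = True,
) -> Tuple[str, str, str]:
    """Return (method, url, json_body) for a create-cluster call.

    Nothing is sent over the network; signing and transport are left to
    the caller. Field and parameter names are assumptions to verify
    against the ParallelCluster API reference.
    """
    query = urlencode({"region": region, "dryrun": str(dry_run).lower()})
    url = f"{api_base_url.rstrip('/')}/v3/clusters?{query}"
    body = json.dumps(
        {
            "clusterName": cluster_name,
            # The cluster configuration is passed as one YAML string.
            "clusterConfiguration": cluster_config_yaml,
        }
    )
    return "POST", url, body


if __name__ == "__main__":
    method, url, body = build_create_cluster_request(
        "https://example.execute-api.us-east-1.amazonaws.com/prod",  # placeholder
        "hpc-demo",
        "Region: us-east-1\n",
        "us-east-1",
    )
    print(method, url)
```

Defaulting `dry_run` to `True` is a deliberate choice in this sketch so that an accidental call validates the configuration rather than launching instances.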

3. Developing Terraform Modules: To simplify the deployment process, we developed two Terraform modules:

  • ParallelCluster API Module: A wrapper around the CloudFormation template provided by AWS.
  • ParallelCluster Module: Reads a YAML file containing all the necessary configuration values, which feed into the locals.tf file used to deploy the cluster itself.
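For readers unfamiliar with what such a configuration contains, the sketch below assembles a minimal ParallelCluster v3 cluster configuration as a Python dict and prints it as JSON (which is also valid YAML). It is an illustration of the v3 schema as we understand it, not the customer's configuration: the instance types are borrowed from the cost example later in this post, and the subnet IDs, key name, queue names, and operating system are placeholders.

```python
import json
from typing import Any, Dict


def build_cluster_config(
    region: str, head_subnet_id: str, compute_subnet_id: str, key_name: str
) -> Dict[str, Any]:
    """Build a minimal ParallelCluster v3 style configuration (sketch)."""

    def queue(name: str, instance_type: str, max_count: int) -> Dict[str, Any]:
        return {
            "Name": name,
            "CapacityType": "ONDEMAND",  # "SPOT" is the other common choice
            "Networking": {"SubnetIds": [compute_subnet_id]},
            "ComputeResources": [
                {
                    "Name": f"{name}-cr",
                    "InstanceType": instance_type,
                    "MinCount": 0,  # scale to zero when the queue is idle
                    "MaxCount": max_count,
                }
            ],
        }

    return {
        "Region": region,
        "Image": {"Os": "alinux2"},  # placeholder OS choice
        "HeadNode": {
            "InstanceType": "m5.large",
            "Networking": {"SubnetId": head_subnet_id},
            "Ssh": {"KeyName": key_name},
        },
        "Scheduling": {
            "Scheduler": "slurm",
            "SlurmQueues": [
                queue("queue1", "c5.2xlarge", 4),
                queue("queue2", "r5.4xlarge", 2),
            ],
        },
    }


if __name__ == "__main__":
    config = build_cluster_config(
        "us-east-1", "subnet-head-placeholder", "subnet-compute-placeholder", "my-key"
    )
    print(json.dumps(config, indent=2))
```

Setting `MinCount` to 0 is what lets the compute fleet shrink to nothing between jobs, which is the behavior behind the pay-for-what-you-use point made earlier.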

4. Configuring Custom Scripts: We set up custom scripts for both startup and post-node configuration:

  • Startup Scripts: Configure the hostname and join the node to Active Directory.
  • Post Node Config Scripts: Configure FSx mounts, Puppet, and the Prometheus Slurm Exporter.
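In a ParallelCluster v3 configuration, scripts like these are typically attached through a `CustomActions` section on the head node or on a queue. The sketch below shows one way to add that section to an existing node or queue dict. It assumes the startup scripts map to `OnNodeStart` and the post-node-config scripts map to `OnNodeConfigured`; the helper name and the S3 paths are hypothetical.

```python
from typing import Any, Dict, List, Optional


def with_custom_actions(
    section: Dict[str, Any],
    on_start_script: str,
    on_configured_script: str,
    args: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Return a copy of a HeadNode or queue section with CustomActions set."""

    def action(script: str) -> Dict[str, Any]:
        entry: Dict[str, Any] = {"Script": script}
        if args:
            entry["Args"] = list(args)  # only emitted when arguments exist
        return entry

    updated = dict(section)  # shallow copy; the input dict is not modified
    updated["CustomActions"] = {
        "OnNodeStart": action(on_start_script),
        "OnNodeConfigured": action(on_configured_script),
    }
    return updated


if __name__ == "__main__":
    queue = {"Name": "queue1"}
    print(
        with_custom_actions(
            queue,
            "s3://example-bucket/scripts/on-start.sh",  # placeholder path
            "s3://example-bucket/scripts/post-config.sh",  # placeholder path
        )
    )
```

If the scripts live in S3, the node's IAM configuration also needs read access to that bucket.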

5. Testing and Optimization: Before going live, we conducted extensive testing to ensure that the new HPC environment met the customer’s performance and reliability requirements. We optimized the Slurm configuration for the AWS environment, adjusting parameters to maximize performance and efficiency.

6. Go-Live and Monitoring: After successful testing, we transitioned the HPC workloads to AWS ParallelCluster. We set up monitoring and alerting to ensure ongoing performance and quickly address any issues that arose.

How Much Does ParallelCluster Cost?

To give a concrete example of the costs associated with running a cluster on AWS, let's consider the following scenario:

  • Head Node: Using an m5.large EC2 instance.
  • Queue 1: Using c5.2xlarge EC2 instances.
  • Queue 2: Using r5.4xlarge EC2 instances.

Let's calculate the monthly costs assuming the cluster runs 8 hours per day:

1. Head Node (m5.large):

  • On-Demand price: ~$0.096 per hour.
  • Monthly cost: 0.096 * 8 hours/day * 30 days ≈ $23.04

2. Queue 1 (c5.2xlarge):

  • On-Demand price: ~$0.34 per hour.
  • Assuming 4 instances running: 0.34 * 4 * 8 hours/day * 30 days ≈ $326.40

3. Queue 2 (r5.4xlarge):

  • On-Demand price: ~$1.008 per hour.
  • Assuming 2 instances running: 1.008 * 2 * 8 hours/day * 30 days ≈ $483.84

Total Monthly Cost:

  • Head Node: $23.04
  • Queue 1: $326.40
  • Queue 2: $483.84

Total: $833.28
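The arithmetic above can be reproduced with a short script, which also makes it easy to try other usage patterns. This is a sketch that uses the approximate On-Demand prices quoted in this post, not live pricing; the function and variable names are our own, and storage, data transfer, and other charges are not included.

```python
from decimal import Decimal
from typing import Dict, List, Tuple

# (label, approximate On-Demand price per hour in USD, instance count)
FLEET: List[Tuple[str, Decimal, int]] = [
    ("Head Node (m5.large)", Decimal("0.096"), 1),
    ("Queue 1 (c5.2xlarge)", Decimal("0.34"), 4),
    ("Queue 2 (r5.4xlarge)", Decimal("1.008"), 2),
]


def monthly_cost(
    hourly_price: Decimal, instance_count: int, hours_per_day: int = 8, days: int = 30
) -> Decimal:
    """Monthly cost for one instance group, rounded to cents."""
    cost = hourly_price * instance_count * hours_per_day * days
    return cost.quantize(Decimal("0.01"))


def fleet_costs(hours_per_day: int = 8, days: int = 30) -> Dict[str, Decimal]:
    """Monthly cost per instance group for the example fleet."""
    return {
        label: monthly_cost(price, count, hours_per_day, days)
        for label, price, count in FLEET
    }


if __name__ == "__main__":
    costs = fleet_costs()
    for label, cost in costs.items():
        print(f"{label}: ${cost}")
    print(f"Total: ${sum(costs.values(), Decimal('0'))}")
```

Calling `fleet_costs(hours_per_day=24)` with the same prices gives a total of $2,499.84, three times the 8-hour figure, which shows how much of the saving comes from stopping compute when it is not needed.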

This scenario assumes the cluster runs only 8 hours a day, a typical workday pattern, rather than 24/7. Actual costs vary with usage patterns and the instance types selected. By taking advantage of AWS's flexibility, our customer can optimize costs further with Reserved Instances or Spot Instances.

The Benefits Realized

The migration to AWS ParallelCluster brought several immediate benefits to our customer:

  1. Improved Scalability: The customer could now easily scale their computational resources based on demand, supporting peak workload periods without over-provisioning.
  2. Cost Savings: The pay-as-you-go model significantly reduced the customer's HPC costs.
  3. Enhanced Performance: The use of high-performance AWS instances optimized for HPC resulted in faster job completion times and improved overall efficiency.
  4. Robust Security: AWS's comprehensive security measures ensured that the customer’s data remained secure and compliant with industry regulations.

Conclusion

Migrating HPC workloads to AWS ParallelCluster offers significant advantages, especially for industries like Capital Markets that require high computational power, scalability, and robust security. Our experience with this migration project demonstrated the value of AWS in providing a flexible, cost-effective, and high-performance HPC environment.

If you’re considering a similar move, AWS ParallelCluster, combined with Slurm's scheduling capabilities and customizable features, can be a game-changer for your HPC needs. By building on AWS's infrastructure, you can focus on what matters most: driving innovation and achieving your business goals. Get in touch with us to get started.

Mohmmad El-zaghah

Mohmmad El-Zaghah is a Principal Architect at Caylent with over 12 years of experience in the tech industry. As a 6x AWS Certified expert, he specializes in migration strategies to AWS and leads DevOps transformations and implementations. Mohmmad is also a subject matter expert in AWS CDK, leveraging his deep knowledge to design scalable, efficient cloud solutions. With six years of experience in the Microsoft Windows ecosystem, he brings a unique blend of expertise across multiple platforms. His strategic approach and technical prowess have been instrumental in driving successful migrations and optimizing cloud infrastructures for high performance.

Leticia Albuquerque

As a Cloud Architect at Caylent with 9 years of experience in technology, Leticia has been immersed in the world of AWS since 2018 and holds 7 certifications on the platform. Passionate about cloud architecture, she brings deep experience to designing and implementing impactful solutions for clients across many industries. Beyond technology, she is a gaming enthusiast and finds joy in outdoor adventures with her husband and children.
